Stack logo
Bright ideas and techniques for building with Convex.
Profile image
Nipunn Koorapati
3 months ago

SELECT DISTINCT without SQL

A magnifying glass looking for a specific property next to a database, representing SELECT DISTINCT in SQL

What is SELECT DISTINCT?

The SELECT DISTINCT statement is a SQL command commonly utilized to retrieve unique records from the database. The DISTINCT clause eliminates duplicate values in the result set, ensuring that each returned row is distinct based on specified criteria. It’s a powerful and useful feature of most databases.

Suppose you have a table named Customers with the following columns: CustomerID, CustomerName, City, and Country. You want to retrieve a list of unique cities where your customers are located. You would use the following SQL to achieve this:

SELECT DISTINCT City
FROM Customers;

To achieve this functionality, the database does a lot of work behind the scenes. Specifically, it will use its query planner to create a best-effort plan for efficient data retrieval, but crucially, it’s imperfect.

From Wikipedia:

When a query is submitted to the database, the query optimizer evaluates some of the different, correct possible plans for executing the query and returns what it considers the best option. Because query optimizers are imperfect, database users and administrators sometimes need to manually examine and tune the plans produced by the optimizer to get better performance.

For large tables (on the order of hundreds of thousands of rows or more) or complex queries, it’s common that the developer will have to provide hints to the database on how to optimally run a DISTINCT statement, which requires that developer to have specialized knowledge of the data design and that database’s optimization features.

Because of its fundamental design, Convex obviates these challenges for even the largest tables. You can get consistent, unsurprising OLTP performance without having to massage the query planner with hints.

SELECT DISTINCT in Convex

Convex doesn’t have a built-in DISTINCT statement because it can be accomplished with existing primitive operators. And what do you gain? You get consistent and predictable performance!

Let me explain how.

With conventional databases, you typically want to minimize the number of SQL statements your code executes against the database. That’s in part because executing statements from your server to the database incurs the overhead of sending the data back and forth for each round-trip. To solve for this potential performance pitfall, traditional databases like MySQL and Postgres provide special syntax for common operations, like DISTINCT. Using this special SQL syntax, the database can offer functionality that requires multiple queries with a single round-trip.

Because Convex functions run next to the database in the reactor, we don’t require special syntax to get good performance. Here’s an example of how to achieve SELECT DISTINCT functionality in Convex using indexes directly.

Say you have a simple version history table with columns for service and version. The data includes 1000s of versions across a small handful of services. Here’s a Convex table schema:

export default defineSchema({
  version_history: defineTable({
    service: v.string(),
    version: v.string(),
  }).index("by_service", ["service", "version"]),
});

How would you query for the unique set of K services in this table of N rows? In SQL, you might write SELECT DISTINCT(service) FROM version_history , especially when you have many versions and few services. What is this doing under the hood? It aims to return one entry for each service, but what index can it use to do this efficiently? As discussed in this post, you can think of a database organized like a spreadsheet. How would you sort these columns to get a tidy list? As we have written, many many many times, we believe in consistent query performance for OLTP workloads that is not at the whims of an opaque query planner.

How to do it

To efficiently solve this query for this workload, a database must make K+1 single-row queries to the database on the by_service index. Each query skips forward to the next service , allowing the workload to be O(K) rather than the naive O(N).

In Convex, you can write a query function like this

export const latestVersionForServices = query(async (ctx) => {
  const latestVersions = {};
  let doc = await ctx.db
    .query("version_history")
    .withIndex("by_service")
    .order("desc")
    .first();
  while (doc !== null) {
    latestVersions[doc.service] = doc.version;
    const service = doc.service;
    doc = await ctx.db
      .query("version_history")
      .withIndex("by_service", (q) => q.lt("service", service))
      .order("desc")
      .first();
  }
  return latestVersions;
});

This function efficiently solves this query for cases where the K << N, minimizing the read set required. A smaller read set leads to fewer conflicts from mutations, fewer query invalidations, and fewer function re-executions. See How Convex Works for more details, but in short, this means that the query function only reads from these intervals, and will only conflict with writes to these intervals. Let’s take this dataset with 8 rows and 3 services.

serviceversion
apple agitator1
apple agitator3
apple agitator7
apple agitator9
banana blender1
banana blender5
cherry crusher6
cherry crusher7

This query function starts by fetching the first row on the by_service index descending, leading to the read set of the interval ((cherry crusher), 7), inf) . Then it queries on the by_service index for the next alphabetically earlier service - skipping backwards. This adds a second interval to the read set: [(banana blender, 5), (cherry crusher, -inf)]. As we repeat this process, we end up with this final read set:

(-inf, (apple agitator, -inf)]
((apple agitator, 9), (banana blender, -inf)]
((banana blender, 5), (cherry crusher, -inf)]
((cherry crusher), 7), inf)

This means that the query function only invalidates when a new service is added, removed, or the latest version of a given service changes. The query itself can efficiently use the index to skip around and calculate the set of services in O(K) rather than a full O(N) table scan.

Performance

Keen observers might be curious about the performance of running K select statements in a while loop. Conventionally, the idea of looping through select statements seems expensive in terms of performance. However, Convex is not conventional. With Convex, functions run inside the database, meaning round-trip latency is nominal and makes this approach extremely performant.

Summary

This example is similar to how a query planner might choose to optimize such a query, but with Convex you can get consistent, unsurprising OLTP performance without having to massage the query planner with hints.

Read more about the shortcomings of SQL in OLTP workloads here.