5 months ago

Why doesn't Convex have SELECT or COUNT?

The above is comment from Theos latest epic video on databases and Convex.

The video was great by the way you should totally check it out, but while I was scrolling through the comments this comment really caught my eye as it has a good point.

Why DOESNT Convex provide a “SELECT” or a “COUNT”?

To show you what I mean lets take a look at this really simple schema.

1export default defineSchema({
2  smallTable: defineTable({
3    smallValue: v.string(),
4  }),
5  bigTable: defineTable({
6    smallValue: v.string(),
7    bigValue: v.string(),
8  }),
9});
10

We have two tables. smallTable just has a single field smallValue that is only going to have a small string in it.

Then we have bigTable that also have a smallValue but also has a bigValue which is going to contain a really large string, like 100kb.

Now lets say we want to query that does the SQL equivalent of SELECT * from smallTable this is how you might do it in Convex:

1export const selectStarFromSmallTable = mutation({
2  args: {},
3  handler: async (ctx, args) => {
4    const smallTable = await ctx.db.query("smallTable").collect();
5    return smallTable;
6  },
7});
8

If we go to the functions tab in the Convex Dashboard we can run the function which turns out to be about 900ms.

If I do the same for the bigTable it takes 2-3 seconds. Makes sense right? as the bigTable is.. well bigger so its going to take longer to fetch that data.

You might be wondering why im doing these queries in a mutation rather than a Convex query.

Well you see, Convex automatically cache’s queries so as long as none of the inputs change then it doesnt need to re-run it, so if I create this query instead

1export const selectStarFromSmallTableQuery = query({
2  args: {},
3  handler: async (ctx, args) => {
4    const smallTable = await ctx.db.query("smallTable").collect();
5    return smallTable;
6  },
7});
8

Then I hop open the dashboard, go to the Logs tab and run this query a couple of times by selecting it from the dropdown then we will see this:

You can see the first time its run then we get a success, then subsequent times, we get a cached status and a near-instant response.

Right, now the question is how do we go about doing a SELECT in Convex? If I'm looking at the bigTable and I'm not interested in bigValue can I make it so I just select the smallValue?

Well if we interrogate our query in the IDE we can see that there is no select function.

But what we can do is just map over the results and just grab the one value we want here and return it like so:

1export const selectSmallValueFromBigTable = mutation({
2  args: {},
3  handler: async (ctx, args) => {
4    const bigTable = await ctx.db.query("bigTable").collect();
5    return bigTable.map((row) => row.smallValue);
6  },
7});
8

If we now run that in the dashboard this is what we get:

Okay well that certainly seems faster! 698 milliseconds compared to the 2-3 seconds before.

But looks can be deceiving as there is a lot of things going on here when we press this button, lets take a look at a diagram:

When I press the button, my browser that is running the Convex dashboard is going to make a request to the convex backend.

Convex is going to spin up a little v8 isolate, which is then going to make our query.

This is going to go out to the underlying storage / database and return the result back to the convex server, which is then going to eventually return it back to the user.

So now the Convex dashboard is going to record that whole time as 698ms which makes sense as this is usually what we want, the time the user of your app is likely going to experience.

But for our purposes what we are really interested in is how long it takes to go from the underlying storage to the convex server.

Is this actually faster because of our mapping in the query?

1export const selectSmallValueFromBigTable = mutation({
2  args: {},
3  handler: async (ctx, args) => {
4    const bigTable = await ctx.db.query("bigTable").collect();
5    return bigTable.map((row) => row.smallValue); // Is this making things faster?
6  },
7});
8

Is convex doing something funky under the covers that optimizes this query because it knows we wont need the bigValue?

Well unfortunately the Convex dashboard doesn't expose this information to us, but all is not lost as we can go into this integrations tab an enable one of these Log Stream providers to get more detailed info about each call.

I'm going to hook up Axiom.

If you aren’t familiar with Axiom its an incredible log streaming and analysis service that I adore and have used extensively in the past for my game BattleTabs.

Once I have Axiom hooked up to our Convex backend I can try our selectStartFromSmallTable function again and hop back to Axiom and checkout the log line.

Right so we have a lot more info in here now. We can see our deployment name, our function we called and a bunch of other stuff.

What we are interested in right now is this data.execution_time_ms and data.usage.database_read_bytes. Im just going to grab both of those and pop them into this table and do that for the other functions:

So interestingly we can see that when we “select” just the smallValue from the bigTable and return that it is indeed faster. The question is, how much of that is time spent serializing the data when we return from the function tho?

Unfortunately this is impossible to tell from this data but what we can say tho is that the number of data that is read from the underling database is the same in both scenarios.

So if the data read is the same in both scenarios then we know that convex must NOT be intelligently selecting only the data required.

So that then validates what SourceCodeDeleted is saying:

And convex does indeed price its database bandwidth based on this.

So then back to the question question then.

Why doesn't Convex provide a SELECT?

I don't think the issue is one of not wanting to and more of a limitation around that Convex is both a relational AND a document database.

Other row-based databases like Postgres benefit from some specific optimizations because they have fixed columns.

So in Postgres if a field value gets too large (over 2kb) then it actually replaces the value in the row with a pointer to a special place called “The Oversized Attribute Storage Technique” or TOAST table.

So that means that when you want to read a number of rows that includes the bigValue it has to do a lookup into this other area first to grab the entire row. So the optimization that it performs with SELECT “smallValue” is to allow it to skip this expensive lookup and retrieval. This is something Convex doesn't do because its actually storing documents to its underlying storage and not columns.

Oh and things start to get super messy when we start talking about indices on both databases so lets not go into that right now.

The key things to know here is that although it might be possible for Convex to do some sort of TOAST like process internally like Postgres, it doesn't want to for another more philosophical reason that finally brings us to our second question.

Why doesn't Convex support COUNT?

If we return to SourceCodeDeleted’s comment and take a look at the next thing they said

They are quite right, Convex queries do not have a count() function

So the only way to count our rows from a query is as the author suggests, to return back all the rows you want then count them. So again, why cant Convex support COUNT? Postgres, MongoFB and Firebase have count primitives.

So here we get down to it.

What does the lack of SELECT and the lack of COUNT have in common?

Both operations sound cheap but usually require reading a lot of data.

COUNT has to touch (or at least index-scan) every matching row. In Convex’s world that means lots of documents read, which blows the cache on every write and costs you in bandwidth.

SELECT would only be cheap if Convex could skip reading the big fields in storage. But Convex stores whole JSON docs, not column chunks so it still has to read the full blob, then throw away what you didn’t ask for.

So Convex’s philosophy is: “no pretend-cheap primitives”. Instead, you model what you really need and you get consistent performance and pricing as you expect.

So could Convex add these?

Sure probably, but the ‘easy’ API would mask real costs, thrash the cache and create unpredictable performance. Convex prefers explicit modeling over magic that SQL and its underlying query planner present.

Solutions

Okay so finally then what are the solutions to these two issues?

So firstly for the “select" issue what I would probably do is split my data up such that I have my bigValues stored in another table then I would only pull in that data when I need to.

So something like this

1  smallTable: defineTable({
2    smallValue: v.string(),
3    bigValueId: v.id("bigTable"),
4  }),
5  
6  bigTable: defineTable({
7    bigValue: v.string(),
8  }),
9

So really if you think about it, this is exactly what Postgres is doing for us with its TOAST. This would dramatically reduce the amount of data that we have to read each query and thus reduce our overall database costs.

As for COUNT well as with SELECT Convex kind of leaves that up to you but what I would probably do is either manually keep a count somewhere in my database so when I change something in a mutation I increment or decrement a count.

Like here for example we have a mutation for “liking” a post:

1export const likePost = mutation({
2  args: {
3    postId: v.id("posts"),
4    userId: v.id("users"),
5  },
6  handler: async (ctx, args) => {
7    await ctx.db.insert("likes", {
8      postId: args.postId,
9      userId: args.userId,
10    });
11
12    const post = await ctx.db.get("posts", args.postId);
13    if (!post) throw new Error("Post not found");
14
15    await ctx.db.patch("posts", args.postId, { likes: post.likes++ });    
16  },
17});
18

We record the like as a row in the database but at the same time we also record the number of likes this post has as a count on the post. That way when we read posts we don't have to query the likes table to find out how many likes a given post has.

For more complex scenarios you can checkout the Convex Aggregate Component that that is designed for this exact purpose. Its got not only counts but sums, maxes and bounds based rankings perfect for things like leaderboards.

Oh and one last thing, I learnt this from a colleague (but don't tell anyone) there is actually a cheeky global count for tables in Convex, that's how the Convex dashboard is able to show this number.

Its not documented and not officially visible through the types but its there. Buuuut I wouldn't use it in production if I were you, unless you wanted your app to break at one point without warning!

Conclusion

Alright well that's about it for me for today. If you have any questions or comments then please do come find me on Discord im always open to discuss some more!

Until next time thanks for reading!

Cheerio!

Build in minutes, scale forever.

Convex is the backend platform with everything you need to build your full-stack AI project. Cloud functions, a database, file storage, scheduling, workflow, vector search, and realtime updates fit together seamlessly.

Get started