
Convex Evals: Behind the scenes of AI coding with Convex

AI coding is here: The most productive developers are leveraging AI to speed up their workflows. This ranges from asking models questions about system design to letting AI take the driver's seat with tools like Cursor Composer.
With this new coding paradigm emerging, it has become important for developers, and for the creators of the tools they use, to understand how LLMs perform on coding tasks. This means being able to answer questions like “How well can an LLM write TypeScript?” or “Can an LLM determine the best place to use an array versus a hash map?”
At Convex, we are specifically curious about how well LLMs can write code using our product, a reactive database. AI coding agents interact directly with our product through our TypeScript APIs, so we must understand what this workflow looks like for our customers.
In testing models’ performance, we realized that many of them suffered from the “knowledge cutoff problem” with Convex. The “knowledge cutoff problem” describes LLMs’ tendency to reach for older, more established tools because those tools have more representation in their pre-training data. Here is an example of a hallucination that OpenAI’s GPT-4o made before we applied our guidelines, alongside the correct syntax:
Without prompting:

```typescript
import { action, cronJobs } from "convex/server";
import { v } from "convex/values";

const emptyAction = action({
  args: { scheduleDescription: v.optional(v.string()) },
  handler: async ({}, { scheduleDescription }) => {
    console.log("Cron Job Triggered:", scheduleDescription ?? "No description provided");
    return null;
  }
});

export default cronJobs({
  "run every second": {
    interval: { seconds: 1 },
    action: emptyAction,
    args: {}
  },
});
```
With prompting (correct syntax):

```typescript
import { cronJobs } from "convex/server";
import { internal } from "./_generated/api";
import { internalAction } from "./_generated/server";
import { v } from "convex/values";

export const emptyAction = internalAction({
  args: {
    scheduleDescription: v.optional(v.string()),
  },
  returns: v.null(),
  handler: async (ctx, args) => {
    console.log(args.scheduleDescription);
  },
});

const crons = cronJobs();

crons.interval(
  "run every second",
  { seconds: 1 },
  internal.crons.emptyAction,
  {},
);

export default crons;
```
The output without prompting doesn’t compile because it imports `cronJobs` from the wrong place and initializes the `cronJobs` object incorrectly.
To mitigate problems like this, we realized that we had to create a systematic way to evaluate LLMs’ ability to write Convex code. Then, we could use these results to improve their performance.
Evals
This is where evals come in. Evals are a way to quantitatively evaluate how LLMs perform on tasks. There are three main pieces to an eval:
- Task: the prompt we give to the LLM to perform a specific operation
- Data: the input and expected output from the LLM for a given task
- Scoring functions: evaluate how well the LLM performed the task

With these three pieces, we can evaluate how LLMs perform against a wide range of Convex-specific tasks.
Below is an example of an eval from our convex-evals repo:
Task
```text
// Task.txt

Write a backend that has six empty functions in `index.ts`:

1. A public query called `emptyPublicQuery`.
2. A public mutation called `emptyPublicMutation`.
3. A public action called `emptyPublicAction`.
4. A private query called `emptyPrivateQuery`.
5. A private mutation called `emptyPrivateMutation`.
6. A private action called `emptyPrivateAction`.

All six of these functions take in no arguments and return null.
```
Data (the expected output)
```typescript
// index.ts

import {
  query,
  mutation,
  action,
  internalQuery,
  internalMutation,
  internalAction,
} from "./_generated/server";
import { v } from "convex/values";

export const emptyPublicQuery = query({
  args: {},
  returns: v.null(),
  handler: async (ctx) => {
    return null;
  },
});

export const emptyPublicMutation = mutation({
  args: {},
  returns: v.null(),
  handler: async (ctx) => {
    return null;
  },
});

...
```
Scoring function
```typescript
// grader.test.ts

import { expect, test } from "vitest";
import {
  responseAdminClient,
  responseClient,
  compareSchema,
  compareFunctionSpec,
} from "../../../grader";
import { anyApi } from "convex/server";

test("compare schema", async ({ skip }) => {
  await compareSchema(skip);
});

test("compare function spec", async ({ skip }) => {
  await compareFunctionSpec(skip);
});

...
```
These three pieces combine to give each LLM a score based on how many test cases it passes. Our test suite has seven categories: fundamentals, data modeling, queries, mutations, actions, idioms, and clients. Creating this suite has enabled us to evaluate how prompting affects LLMs’ performance with Convex and to improve AI’s ability to code with Convex.
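To make the scoring step concrete, here is a rough sketch of how per-category pass rates could be aggregated. The `EvalResult` shape and the aggregation code are illustrative assumptions, not the actual convex-evals harness.

```typescript
// Sketch only: aggregate eval outcomes into per-category pass rates.
// `EvalResult` is a hypothetical shape, not a real convex-evals type.
type EvalResult = {
  category: string; // e.g. "actions" or "queries"
  name: string;     // e.g. "node_action"
  passed: boolean;  // did all grader tests pass?
};

function scoreByCategory(results: EvalResult[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const result of results) {
    const entry = totals.get(result.category) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (result.passed) entry.passed += 1;
    totals.set(result.category, entry);
  }
  // Convert raw counts into a 0-1 pass rate per category.
  const scores = new Map<string, number>();
  for (const [category, { passed, total }] of totals) {
    scores.set(category, passed / total);
  }
  return scores;
}
```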
LLM Guidelines
Convex guidelines are how we provide Convex developers with the best AI development experience. This curated set of rules increases the success rate of AI writing Convex code by about 20%.1 The cron job snippet earlier in this post shows a model writing code before and after using the Convex guidelines. If you don’t believe me, try it for yourself!
Tuning the guidelines we provide to models is effectively prompt engineering, the process of crafting prompts to get the right output from a model. It is both an art and a science because it covers everything from understanding which tones LLMs comprehend better to providing the optimal amount of tokens so as not to overwhelm the context window.
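As a simplified illustration of what “providing guidelines to a model” can look like in practice, here is a sketch that prepends a guidelines file to a coding task as a system message using the OpenAI Node SDK. The `convex_guidelines.md` path, the helper function, and the model choice are assumptions for the example, not our actual evaluation setup.

```typescript
// Sketch only: prepend Convex guidelines to a coding task as a system prompt.
// The guidelines path and task text are placeholders.
import { readFile } from "node:fs/promises";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateConvexCode(task: string): Promise<string> {
  const guidelines = await readFile("convex_guidelines.md", "utf8");
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: guidelines }, // the curated Convex rules
      { role: "user", content: task },         // the eval task prompt
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```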
Through iterating on the guidelines, we have been able to make significant progress on the “knowledge cutoff problem” we mentioned earlier. This was a big win because it shows you can significantly improve a model’s performance without fine-tuning it, which can be very costly.
Working on tuning the guidelines for each model has been exciting, so I’ll give you a bit of insight into the process. My strategy was to pick the categories that each model was worst at and adjust the guidelines to pass the missing test cases. I continued this process until we started seeing improvements from prompting. Below is my process for tuning the actions category of our evals.
Tuning actions
The actions category of our evals has eight different tests. I noticed that four of them were consistently failing, so I decided to look into what the LLM output looked like compared to the expected output.
The two main gaps I found were that the LLMs did not know how to use `"use node"` and did not use our storage APIs correctly. Below are snippets of the LLM output versus what we would expect:
Node action eval:

Actual:

```typescript
import { action } from "./_generated/server";
import { v } from "convex/values";
import crypto from "crypto";
import path from "path";

// Define the action using Node runtime
export const processWithNode = action({
  args: { data: v.string() },
  returns: v.object({
    hash: v.string(),
    normalizedPath: v.string(),
  }),
  handler: async (ctx, args) => {
    const hash = crypto
      .createHash("sha256")
      .update(args.data)
      .digest("hex");

    const normalizedPath = path.normalize("/some/test/path");

    return {
      hash,
      normalizedPath,
    };
  },
});
```
Expected:

```typescript
"use node";
import { action } from "./_generated/server";
import { v } from "convex/values";
import crypto from "crypto";
import path from "path";

// Define the action using Node runtime
export const processWithNode = action({
  args: { data: v.string() },
  returns: v.object({
    hash: v.string(),
    normalizedPath: v.string(),
  }),
  handler: async (ctx, args) => {
    const hash = crypto
      .createHash("sha256")
      .update(args.data)
      .digest("hex");

    const normalizedPath = path.normalize("/some/test/path");

    return {
      hash,
      normalizedPath,
    };
  },
});
```
Storage API eval:

Actual:

```typescript
import { action } from "convex/server";
import { v } from "convex/values";

export const writeTextToStorage = action({
  args: { text: v.string() },
  handler: async ({ storage }, { text }) => {
    const encoder = new TextEncoder();
    const textBuffer = encoder.encode(text);

    const storageId = await storage.store(textBuffer, "text/plain");
    const url = await storage.getUrl(storageId);

    return { storageId, url };
  }
});

export const readTextFromStorage = action({
  args: { storageId: v.string() },
  handler: async ({ storage }, { storageId }) => {
    const textBuffer = await storage.get(storageId);
    if (!textBuffer) throw new Error("File not found");

    const decoder = new TextDecoder();
    return decoder.decode(textBuffer);
  }
});
```
Expected:

```typescript
import { action } from "./_generated/server";
import { v } from "convex/values";

/**
 * Writes text content to Convex storage and returns the storage ID and URL.
 */
export const writeTextToStorage = action({
  args: {
    text: v.string(),
  },
  returns: v.object({
    storageId: v.id("_storage"),
    url: v.string(),
  }),
  handler: async (ctx, args) => {
    // Store the text as a blob
    const storageId = await ctx.storage.store(new Blob([args.text], {
      type: "text/plain",
    }));

    // Get the URL for the stored file
    const url = await ctx.storage.getUrl(storageId);
    if (!url) {
      throw new Error("Failed to generate URL for stored file");
    }

    return {
      storageId,
      url,
    };
  },
});

/**
 * Reads text content from Convex storage by storage ID.
 */
export const readTextFromStorage = action({
  args: {
    storageId: v.id("_storage"),
  },
  returns: v.string(),
  handler: async (ctx, args) => {
    // Get the blob from storage
    const blob = await ctx.storage.get(args.storageId);
    if (!blob) {
      throw new Error("File not found in storage");
    }

    // Convert binary data back to text
    const text = await blob.text();

    return text;
  },
});
```
After seeing these outputs, I added the following prompts to our Convex guidelines:
```text
Always add `"use node";` to the top of files containing actions that use Node.js built-in modules.

Convex storage stores items as `Blob` objects. You must convert all items to/from a `Blob` when using Convex storage.
```
Once these prompts were added, the LLMs’ performance on these tasks improved. After confirming that this didn’t regress other evals, I moved on to improving evals for other categories.
Different categories had mistakes that ranged from hallucinating Convex syntax to not using the TypeScript `Record` type correctly. The most interesting finding, however, was that the models responded very differently to prompt tuning.
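For reference, here is a small, Convex-agnostic refresher on TypeScript’s built-in `Record` utility type; the example data is made up.

```typescript
// Record<K, V> describes an object whose keys have type K and values type V.
// The pass-rate numbers here are hypothetical.
const passRates: Record<string, number> = {
  fundamentals: 0.9,
  actions: 0.5,
};

// Equivalent to writing the index signature by hand:
const passRatesExplicit: { [category: string]: number } = passRates;
```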
Model Behavior
Claude was extremely responsive to code examples and directions. Generally, if I added a specific prompt, the responses would immediately become more correct. It also seems to have strong general coding knowledge, especially in TypeScript.
GPT-4o, surprisingly, performed very differently from Claude. It was not as responsive to prompting and often forgot some of its context. Prompts would often improve the model’s performance along one axis but decrease it along others.
For more granularity on the results from our testing, check out our LLM leaderboard here!
Conclusion
This testing has given us great insight into how well AI can write code with Convex and opened our eyes to the importance of evals. Although you may not have the ability to fine-tune models, prompt engineering can have a huge impact on AI task correctness. With our extensive testing, we believe Convex is the best platform for AI coding. Try it out below!
Thanks to Sujay Jayakar for feedback on drafts of this post.
Footnotes
1. These results are for Claude 3.7-Sonnet and GPT-4o. ↩
Convex is the sync platform with everything you need to build your full-stack project. Cloud functions, a database, file storage, scheduling, search, and realtime updates fit together seamlessly.