Jordan Hunt

Which LLM writes the best code? Convex Chef model comparison

At Convex, we’ve been building Chef, an AI app builder that understands backend code. A key part of Chef is that it typechecks backend code and feeds the results back into the LLM, helping it correct its own errors. This approach lets us catch whole classes of bugs in AI-written code before deployment.
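The generate-typecheck-retry loop can be sketched roughly like this. This is a minimal illustration, not Chef's actual implementation: `generateCode` and `typecheck` are hypothetical stand-ins for a model call and a TypeScript compiler invocation.

```typescript
type TypecheckResult = { ok: boolean; errors: string[] };

// Stand-in for an LLM call: returns candidate backend code for a prompt,
// optionally informed by compiler errors from the previous attempt.
function generateCode(prompt: string, feedback: string[]): string {
  // A real implementation would call the model provider here, including
  // the feedback in the conversation. This stub "fixes" the code once
  // it has seen an error message.
  return feedback.length === 0
    ? "const x: number = 'oops';"
    : "const x: number = 1;";
}

// Stand-in for running the TypeScript compiler on the generated code.
function typecheck(code: string): TypecheckResult {
  return code.includes("'oops'")
    ? { ok: false, errors: ["Type 'string' is not assignable to type 'number'."] }
    : { ok: true, errors: [] };
}

// Generate code, typecheck it, and feed errors back to the model
// until the code passes or we run out of attempts.
function generateWithTypecheck(prompt: string, maxAttempts = 3): string {
  let feedback: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = generateCode(prompt, feedback);
    const result = typecheck(code);
    if (result.ok) return code;
    feedback = result.errors; // surface compiler output to the model
  }
  throw new Error("Could not produce type-correct code");
}
```

The key design point is that the compiler's error messages, not a human, drive the retry: the model gets precise, machine-generated feedback about exactly what is wrong.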

To build this workflow, we explored the performance of several different models. Models tend to have different “personalities” and strengths. For example, Anthropic’s Claude 3.5 Sonnet is strong at following directions and writing concise answers, while OpenAI’s GPT-4.1 tends to create better UIs but struggles with tool use.

Here’s a quick overview of what we found from testing different models at Convex.

Claude 3.5 Sonnet

Pros

  • Great at following directions
  • Concise
  • Very good at coding
  • Great at function calling
  • Writes the best Convex code on the first attempt

Cons

  • Expensive
  • UI output is average
  • Streaming is somewhat slow

Gemini 2.5 Pro

Pros

  • Much longer context window
  • Supports significantly more completion tokens
  • Cheaper
  • Builds much better UIs (big improvement over Claude)
  • Very fast streaming

Cons

  • Sometimes struggles with function calling
  • Often too verbose and hits token limits, even with the completion limit raised to 20,000 tokens

GPT-4.1

Pros

  • Cheaper
  • Solid UI generation
  • Fast streaming

Cons

  • Poor at function calling
  • Doesn’t follow instructions very well

How we chose our default model

For our use case, the most important qualities were:

  • Instruction following
  • Coding ability
  • Reliable tool use
  • Handling long contexts

Claude 3.5 Sonnet stood out for its ability to follow directions and write clean, usable code, so we made it the default model for Chef.

The LLM space is evolving fast. We’ll likely change our default model over time, or even switch models dynamically based on the user’s prompt. What matters most is knowing which strengths you need for your use case, and being able to measure models against those requirements with real data.
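Dynamic switching could be as simple as a heuristic router that maps prompt characteristics to the strengths noted above. This is a hypothetical sketch, not something Chef ships today; the model names and thresholds are illustrative.

```typescript
type Model = "claude-3-5-sonnet" | "gemini-2.5-pro" | "gpt-4.1";

// Hypothetical prompt-based router: pick a model whose strengths
// match what the prompt seems to need.
function pickModel(prompt: string): Model {
  // UI-heavy requests go to a model that generates better UIs.
  if (/\b(ui|design)\b/i.test(prompt)) return "gemini-2.5-pro";
  // Very long inputs need the larger context window.
  if (prompt.length > 50_000) return "gemini-2.5-pro";
  // Default: strongest instruction following, coding, and tool use.
  return "claude-3-5-sonnet";
}
```

A production router would likely be learned from eval data rather than hand-written rules, but the principle is the same: route each request to the model whose measured strengths fit it best.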

You can see how different models perform in our LLM Leaderboard. We've also built full-stack benchmarks for AI coding, where Convex outperforms other platforms.

We’ll keep publishing updates as we continue testing and improving.

Build in minutes, scale forever.

Convex is the backend platform with everything you need to build your full-stack AI project. Cloud functions, a database, file storage, scheduling, workflow, vector search, and realtime updates fit together seamlessly.

Get started