
Which LLM writes the best code? Convex Chef model comparison

At Convex, we’ve been building Chef, an AI app builder that understands backend code. A key part of Chef is that it typechecks backend code and feeds the results back into the LLM, helping it correct its own errors. This approach lets us catch type errors in AI-written code before deployment.
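To make the loop concrete, here's a minimal sketch of that generate–typecheck–retry workflow. The function names (`generateCode`, `typecheck`) and the stubbed behavior are illustrative assumptions, not Chef's actual internals:

```typescript
// Sketch of a typecheck feedback loop. generateCode and typecheck are
// stand-ins: in practice these would call the LLM and the TypeScript compiler.

type Diagnostic = { message: string };

// Stand-in for an LLM call: the first attempt contains a type error,
// and the retry (with compiler feedback in the prompt) fixes it.
function generateCode(prompt: string, attempt: number): string {
  return attempt === 0
    ? `const n: number = "42";` // deliberate type error
    : `const n: number = 42;`;
}

// Stand-in for typechecking the generated backend code.
function typecheck(code: string): Diagnostic[] {
  return code.includes(`"42"`)
    ? [{ message: `Type 'string' is not assignable to type 'number'.` }]
    : [];
}

// Generate, typecheck, and feed diagnostics back until the code passes.
function generateWithFeedback(prompt: string, maxAttempts = 3): string {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = generateCode(currentPrompt, attempt);
    const errors = typecheck(code);
    if (errors.length === 0) return code;
    // Append compiler diagnostics so the model can correct its own mistake.
    currentPrompt =
      `${prompt}\nFix these errors:\n` +
      errors.map((e) => e.message).join("\n");
  }
  throw new Error("Could not produce type-correct code");
}

console.log(generateWithFeedback("Write a Convex mutation"));
```

The key design point is that the compiler output goes back into the prompt, so the model gets precise, machine-generated feedback rather than a vague "try again."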
To build this workflow, we explored the performance of several different models. Models tend to have different “personalities” and strengths. For example, Anthropic’s Claude 3.5 Sonnet is strong at following directions and writing concise answers, while OpenAI’s GPT-4.1 tends to create better UIs but struggles with tool use.
Here’s a quick overview of what we found from testing different models at Convex.
Claude 3.5 Sonnet
Pros
- Great at following directions
- Concise
- Very good at coding
- Great at function calling
- Writes the best Convex code on the first attempt
Cons
- Expensive
- UI output is average
- Streaming is somewhat slow
Gemini 2.5 Pro
Pros
- Much longer context window
- Supports significantly more completion tokens
- Cheaper
- Builds much better UIs (big improvement over Claude)
- Very fast streaming
Cons
- Sometimes struggles with function calling
- Often too verbose and hits token limits, even when raised to 20,000
GPT-4.1
Pros
- Cheaper
- Solid UI generation
- Fast streaming
Cons
- Poor at function calling
- Doesn’t follow instructions very well
How we chose our default model
For our use case, the most important qualities were:
- Instruction following
- Coding ability
- Reliable tool use
- Handling long contexts
Claude 3.5 Sonnet stood out for its ability to follow directions and write clean, usable code, so we made it the default model for Chef.
The LLM space is evolving fast. We’ll likely change our default model over time, or even switch models dynamically based on the user’s prompt. What matters most is knowing which strengths you need for your use case—and being able to measure models against those requirements with real data.
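Dynamic switching could be as simple as routing on the prompt's content. The heuristic and model names below are purely hypothetical, not Chef's actual routing logic:

```typescript
// Hypothetical prompt-based model router: UI-heavy prompts go to the model
// that builds better UIs; everything else goes to the default model.
type Model = "claude-3-5-sonnet" | "gemini-2.5-pro" | "gpt-4.1";

function pickModel(prompt: string): Model {
  const p = prompt.toLowerCase();
  // Assumed heuristic: keywords suggesting UI work route to Gemini 2.5 Pro.
  if (/\b(ui|layout|css|design)\b/.test(p)) return "gemini-2.5-pro";
  // Backend and tool-use-heavy work stays on the default model.
  return "claude-3-5-sonnet";
}
```

A real router would likely use a cheap classifier model rather than keywords, but the shape is the same: measure each model's strengths, then route requests to match.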
You can see how different models perform in our LLM Leaderboard. We’ve also built full-stack benchmarks for AI coding, where Convex outperforms other platforms.
We’ll keep publishing updates as we continue testing and improving.
Convex is the backend platform with everything you need to build your full-stack AI project. Cloud functions, a database, file storage, scheduling, workflow, vector search, and realtime updates fit together seamlessly.