Jordan Hunt

Which LLM writes the best code? Convex Chef model comparison

At Convex, we’ve been building Chef, an AI app builder that understands backend code. A key part of Chef is that it typechecks backend code and feeds the results back into the LLM, helping it correct its own errors. This approach lets us catch whole classes of bugs in AI-written code before deployment.
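The generate-typecheck-retry loop can be sketched roughly like this. This is a minimal illustration, not Chef's actual implementation: `generateCode` and `typecheck` are hypothetical stand-ins for a model call and a TypeScript compiler invocation.

```typescript
type TypecheckResult = { ok: boolean; errors: string[] };

// Stand-in for an LLM call: returns candidate backend code for a prompt,
// optionally informed by compiler errors from the previous attempt.
function generateCode(prompt: string, feedback: string[]): string {
  // A real implementation would call the model provider here, including
  // the feedback in the conversation. This stub "fixes" the code once
  // it has seen an error message.
  return feedback.length === 0
    ? "const x: number = 'oops';"
    : "const x: number = 1;";
}

// Stand-in for running the TypeScript compiler on the generated code.
function typecheck(code: string): TypecheckResult {
  return code.includes("'oops'")
    ? { ok: false, errors: ["Type 'string' is not assignable to type 'number'."] }
    : { ok: true, errors: [] };
}

// Generate code, typecheck it, and feed errors back to the model
// until the code passes or we run out of attempts.
function generateWithTypecheck(prompt: string, maxAttempts = 3): string {
  let feedback: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = generateCode(prompt, feedback);
    const result = typecheck(code);
    if (result.ok) return code;
    feedback = result.errors; // surface compiler output to the model
  }
  throw new Error("Could not produce type-correct code");
}
```

The key design point is that the compiler's error messages, not a human, drive the retry: the model gets precise, machine-generated feedback about exactly what is wrong.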

To build this workflow, we explored the performance of several different models. Models tend to have different “personalities” and strengths. For example, Anthropic’s Claude 3.5 Sonnet is strong at following directions and writing concise answers, while OpenAI’s GPT-4.1 tends to create better UIs but struggles with tool use.

Here’s a quick overview of what we found from testing different models at Convex.

Claude 3.5 Sonnet

Pros

  • Great at following directions
  • Concise
  • Very good at coding
  • Great at function calling
  • Writes the best Convex code on the first attempt

Cons

  • Expensive
  • UI output is average
  • Streaming is somewhat slow

Gemini 2.5 Pro

Pros

  • Much longer context window
  • Supports significantly more completion tokens
  • Cheaper
  • Builds much better UIs (big improvement over Claude)
  • Very fast streaming

Cons

  • Sometimes struggles with function calling
  • Often too verbose and hits token limits, even with the completion limit raised to 20,000 tokens

GPT-4.1

Pros

  • Cheaper
  • Solid UI generation
  • Fast streaming

Cons

  • Poor at function calling
  • Doesn’t follow instructions very well

How we chose our default model

For our use case, the most important qualities were:

  • Instruction following
  • Coding ability
  • Reliable tool use
  • Handling long contexts

Claude 3.5 Sonnet stood out for its ability to follow directions and write clean, usable code, so we made it the default model for Chef.

The LLM space is evolving fast. We’ll likely change our default model over time, or even switch models dynamically based on the user’s prompt. What matters most is knowing which strengths you need for your use case, and being able to measure models against those requirements with real data.
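Dynamic switching could be as simple as a heuristic router that maps prompt characteristics to the strengths noted above. This is a hypothetical sketch, not something Chef ships today; the model names and thresholds are illustrative.

```typescript
type Model = "claude-3-5-sonnet" | "gemini-2.5-pro" | "gpt-4.1";

// Hypothetical prompt-based router: pick a model whose strengths
// match what the prompt seems to need.
function pickModel(prompt: string): Model {
  // UI-heavy requests go to a model that generates better UIs.
  if (/\b(ui|design)\b/i.test(prompt)) return "gemini-2.5-pro";
  // Very long inputs need the larger context window.
  if (prompt.length > 50_000) return "gemini-2.5-pro";
  // Default: strongest instruction following, coding, and tool use.
  return "claude-3-5-sonnet";
}
```

A production router would likely be learned from eval data rather than hand-written rules, but the principle is the same: route each request to the model whose measured strengths fit it best.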

You can see how different models perform in our LLM Leaderboard. We've also built full-stack benchmarks for AI coding, where Convex outperforms other platforms.

We’ll keep publishing updates as we continue testing and improving.

Build in minutes, scale forever.

Convex is the backend platform with everything you need to build your full-stack AI project. Cloud functions, a database, file storage, scheduling, workflow, vector search, and realtime updates fit together seamlessly.

Get started