Mike Cann
10 days ago

How to Build Async AI Apps with Convex and TypeScript

LLM workloads break the request-response model that most TypeScript backends were built around. A user sends a prompt, the model takes thirty seconds to respond, the user closes their tab halfway through, and the agent still has three tool calls and a summarization pass ahead of it. If your backend was holding that work inside a single HTTP request, the work is gone. If it was holding it in memory on one server, the work is gone the moment that server restarts.

This piece walks through why async programming for AI apps with TypeScript requires a different backend shape than most teams reach for, and how Convex's reactive database, durable functions, and end-to-end type safety remove the glue code that usually fills the gap.

Why AI Apps Are Async by Nature

AI apps are async because the work outlives the request. An LLM generation can take seconds or minutes, often involves multiple tool calls and retries, and frequently needs to keep running after the client that started it has disconnected. Holding that work inside a streaming HTTP request couples the lifetime of the computation to the lifetime of a single TCP connection, which is the wrong coupling.

The framing I keep coming back to is that not everything you do with ML should be driven from a streaming HTTP request to an API. The request is a fine trigger but a poor container. Triggers exist to start work and containers exist to hold it, and an LLM call needs the holding done by something that doesn't vanish when the network blips.

Promise-Based Async vs Long-Running Async

TypeScript developers already live inside one kind of async. A Promise resolves inside a single process, await yields the event loop, and the runtime stitches the continuation back together. That model assumes the process stays alive long enough to see the promise settle, which is a fine assumption for a database lookup or a quick fetch, since the work fits comfortably inside a single function invocation.

Long-running async programming is a different animal, because the work needs to survive a client disconnect, a server restart, a deployment, and, in some cases, a multi-week pause waiting on a human or an external event. You can't model that with a single await since there's no single process whose memory you can trust for the duration. The state of the computation has to live somewhere durable, and some scheduler has to pick it up again when it's time to make progress.

The Streaming HTTP Request Anti-Pattern

Figure: The Client-LLM Lifecycle showing failure modes on the client, cloud, and LLM sides

The default pattern in a lot of AI tutorials is to open a streaming response from the server to the browser, pipe model tokens through it, and call it done. That pattern fails in predictable ways:

  • The user navigates away or closes the browser
  • The network blips on a train or in an elevator
  • The load balancer enforces a timeout
  • The server process gets recycled
  • The model hits a rate limit and the SDK throws halfway through

Every one of these breaks the user-visible work, because the only place the work existed was inside that one request handler. If the generation was worth starting, it's usually worth finishing, which means it needs a home that outlives the connection. The whole point of moving the work out of the request is to make the connection optional rather than load-bearing.

What Breaks When You Treat the Cloud as One Box

Production cloud infrastructure is multi-node by default, so any architecture that assumes "the server" is a single machine breaks the moment you scale past one. Reconnect-and-resume only works if the resumed state is shared across nodes rather than pinned to a single process's memory, since the node that started the work may not be the node that handles the reconnection. Sticky sessions paper over this for a while and then fail loudly when an instance is replaced.

This is why persistence has to come first and the network call second. If the prompt, the partial output, and the intermediate tool results live in a database the moment they are produced, any node can pick the work back up and any client can resubscribe to its progress.

Disconnects, Timeouts, and Multi-Node Reality

The failure modes compound. A serverless function host might cap execution at a few minutes. A long-poll might be killed by an intermediary proxy. A websocket might survive the disconnect of the original browser tab but have no way to deliver its messages anywhere useful, because the receiving client process is gone. None of these are exotic failures; they're the normal operating conditions of a deployed app.

The architectural move is to stop treating the server as a place where work happens and start treating it as a place where work is scheduled, journaled, and resumed. The actual progress lives in storage that every node can read and write, which means any node can advance the computation and any client can observe it.

The Dropbox Transcoding Lesson

There is a useful analogy here from video transcoding. There are more videos on Dropbox than on YouTube, and Dropbox transcodes them into HLS so they can be played in a browser. Transcoding a full video inside a single request would be hopeless because users disconnect, whereas HLS is chunked by design, so each chunk can be transcoded independently, persisted, and resumed. The same shape applies to AI work:

  • Break the long task into steps
  • Persist each step's output
  • Make resumability a property of the storage rather than the connection
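
The three steps above can be sketched in a few lines. This is a minimal illustration rather than anything from Dropbox or Convex: the `StepStore` map is a hypothetical in-memory stand-in for a database table, and `processChunks` and the `flaky` worker are names invented for the example.

```typescript
// Hypothetical stand-in for durable storage; a real app would use a database table.
type StepStore = Map<number, string>;

// Process every chunk that has no persisted output yet.
// Returns how many chunks this call actually processed.
function processChunks(
  chunks: string[],
  store: StepStore,
  work: (chunk: string) => string,
): number {
  let processed = 0;
  for (let i = 0; i < chunks.length; i++) {
    if (store.has(i)) continue; // already persisted by an earlier run
    const output = work(chunks[i]); // may throw; earlier outputs stay in the store
    store.set(i, output); // persist before moving to the next chunk
    processed++;
  }
  return processed;
}

// Simulate a crash partway through, then a resume against the same store.
const store: StepStore = new Map();
const chunks = ["a", "b", "c", "d"];
let calls = 0;
const flaky = (chunk: string): string => {
  calls++;
  if (calls === 3) throw new Error("simulated crash");
  return chunk.toUpperCase();
};

let crashed = false;
try {
  processChunks(chunks, store, flaky);
} catch {
  crashed = true; // chunks 0 and 1 are already persisted at this point
}

// The second run skips the persisted chunks and only redoes the missing ones.
const resumedCount = processChunks(chunks, store, flaky);
```

The resume logic never looks at the connection or the process that started the work. It only asks the store which outputs already exist, which is what makes the task resumable from any node.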

The Reactive Backend Model

Figure: Convex mental model: Client connects to Mutations and Queries which read and write a reactive DB

A reactive backend turns database queries into live subscriptions. When a Convex query reads a set of rows and a later mutation writes to any row inside that read range, every client subscribed to that query receives the updated result over a websocket automatically. There's no polling loop, no manual cache invalidation, and no webhook to wire up.

This is the mechanism that replaces most of the glue code in an async AI app. The LLM worker writes partial output to the database, and every client looking at that conversation sees the update. The worker doesn't need to know who is subscribed, and the clients don't need to ask whether anything changed, because the subscription itself is the change-notification system.
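
To make the subscription idea concrete, here is a toy in-memory model of it. It is a sketch of the concept only, not how Convex is implemented: the `ReactiveMessages` class and its method names are invented for illustration, and it approximates "writes inside the query's read range" by matching on a thread id.

```typescript
type ChatMessage = { id: number; threadId: string; content: string };
type QueryListener = (result: ChatMessage[]) => void;

class ReactiveMessages {
  private rows: ChatMessage[] = [];
  private subscriptions: { threadId: string; listener: QueryListener }[] = [];

  // The "query": a pure read over the current rows.
  list(threadId: string): ChatMessage[] {
    return this.rows.filter((row) => row.threadId === threadId);
  }

  // Register a live query; the listener receives the current result immediately.
  subscribe(threadId: string, listener: QueryListener): void {
    this.subscriptions.push({ threadId, listener });
    listener(this.list(threadId));
  }

  // The "mutation": write a row, then re-run only the queries that read that thread.
  append(threadId: string, content: string): void {
    this.rows.push({ id: this.rows.length + 1, threadId, content });
    for (const sub of this.subscriptions) {
      if (sub.threadId === threadId) sub.listener(this.list(threadId));
    }
  }
}

// Two "browser windows" watch thread t1; a third subscriber watches t2.
const db = new ReactiveMessages();
const windowA: string[][] = [];
const windowB: string[][] = [];
const otherThread: string[][] = [];
db.subscribe("t1", (msgs) => windowA.push(msgs.map((m) => m.content)));
db.subscribe("t1", (msgs) => windowB.push(msgs.map((m) => m.content)));
db.subscribe("t2", (msgs) => otherThread.push(msgs.map((m) => m.content)));

// The writer knows nothing about the subscribers.
db.append("t1", "Hello");
db.append("t1", "Hello, world.");
```

Both `t1` subscribers receive every new result, while the `t2` subscriber is never re-notified. The writer's code path is identical whether zero or a hundred clients are watching.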

Mutations, Queries, and Subscriptions

Convex splits backend functions into mutations and queries with different guarantees. Mutations are serializable transactions that can read and write, so two mutations touching the same row see a consistent, ordered view of the world. Queries are pure reads that the system can safely re-run and cache, which is what makes them eligible to become reactive query subscriptions that push updates to clients.

Long-running work, including LLM calls and other side effects, lives in actions. An action can call out to a model provider, then schedule a mutation to commit the result. The transactional boundary stays clean because the network call isn't inside the transaction, which matters since you don't want a flaky model API holding open a database lock while it retries.

Type Safety From Schema to Client

Schemas are defined in TypeScript and validated at runtime, so a deploy that doesn't match the data at rest fails before it ships. Argument validators sit on every function, so the inputs are checked at the boundary rather than five layers deep. Types flow from the schema and validator definitions through the function signatures and into the React hooks on the frontend. The files under convex/_generated carry those types, and the Convex dev tooling regenerates them for you, so there is no separate code-generation step to remember to run.

For an async AI app this matters because the messages, tool calls, and run state are exactly the kind of nested, evolving shapes that drift between server and client when types aren't enforced end-to-end. A run row that gained a cancelledAt field last week needs to surface that field in the React component reading it this week, and the type system should refuse to compile if it doesn't.
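
The "refuse to compile" behaviour can be illustrated in plain TypeScript. The sketch below is not Convex's generated types: `MessageStatus`, `AssistantMessage`, and `statusLabel` are hypothetical names, and the "streaming" and "done" values are assumed for the example (the post itself only names "pending" and "cancelled").

```typescript
// Hypothetical status values for an assistant message row.
type MessageStatus = "pending" | "streaming" | "done" | "cancelled";

type AssistantMessage = {
  content: string;
  status: MessageStatus;
  cancelledAt?: number; // epoch milliseconds, present only once cancelled
};

// Exhaustive switch: adding a new member to MessageStatus without handling it
// here makes the `never` assignment in the default branch fail to compile.
function statusLabel(message: AssistantMessage): string {
  switch (message.status) {
    case "pending":
      return "Waiting for the model";
    case "streaming":
      return "Generating";
    case "done":
      return "Complete";
    case "cancelled":
      return message.cancelledAt === undefined
        ? "Cancelled"
        : `Cancelled at ${new Date(message.cancelledAt).toISOString()}`;
    default: {
      const unhandled: never = message.status;
      return unhandled;
    }
  }
}

const label = statusLabel({ content: "", status: "cancelled", cancelledAt: 0 });
```

When the row type comes from the backend schema instead of being written by hand, the same compile-time check covers drift between server and client.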

Building an Async AI Workflow in Convex

The pattern that solves the streaming-HTTP problem is to persist the prompt first and run the model second. The client calls a mutation that writes the user's message into a messages table and enqueues an action to handle the generation. The action calls the model, streams partial output back into the database, and a query on the same table keeps every subscribed client in sync.

Save the Prompt, Then Call the LLM

Figure: AI workflow diagram: Write Prompt mutation, Call LLM action, Write Response mutation, List Messages query, all touching the DB

The mutation is short and never waits on the model; it writes the user message, creates a placeholder assistant message, and schedules the action that will fill it in.

// convex/messages.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";
import { internal } from "./_generated/api";

export const send = mutation({
  args: { threadId: v.id("threads"), prompt: v.string() },
  handler: async (ctx, { threadId, prompt }) => {
    // Persist the user's prompt before any model call happens.
    await ctx.db.insert("messages", {
      threadId,
      role: "user",
      content: prompt,
    });
    // Create an empty assistant message for the action to fill in.
    const assistantId = await ctx.db.insert("messages", {
      threadId,
      role: "assistant",
      content: "",
      status: "pending",
    });
    // Schedule the generation; it is queued as part of this transaction.
    await ctx.scheduler.runAfter(0, internal.generate.run, {
      threadId,
      assistantId,
    });
    return assistantId;
  },
});

Because the mutation is a serializable transaction, the user message and the placeholder land atomically. If the client disconnects the instant after the mutation returns, the scheduled action still runs and the assistant message still fills in, so there's no orphaned state and no need for a reconciliation pass on the next page load.

Streaming Updates Without HTTP Streaming

The action drives the model and writes chunks back into the placeholder row; the query the UI subscribes to doesn't change shape during the stream, since it just keeps returning the latest content.

// convex/generate.ts
import { internalAction } from "./_generated/server";
import { v } from "convex/values";
import { internal } from "./_generated/api";

// callModel (an async iterable of text chunks) and shouldFlush (a flush-boundary
// check) are app-specific helpers that aren't shown here. The internal.messages.*
// functions are assumed to live alongside send in convex/messages.ts.

export const run = internalAction({
  args: { threadId: v.id("threads"), assistantId: v.id("messages") },
  handler: async (ctx, { threadId, assistantId }) => {
    // Load the conversation so far through a query.
    const history = await ctx.runQuery(internal.messages.history, { threadId });
    let buffer = "";
    for await (const chunk of callModel(history)) {
      buffer += chunk;
      // Commit the buffered text only at a flush boundary, not on every token.
      if (shouldFlush(buffer)) {
        await ctx.runMutation(internal.messages.appendChunk, {
          assistantId,
          text: buffer,
        });
        buffer = "";
      }
    }
    // Hand any remaining text to the finish mutation.
    await ctx.runMutation(internal.messages.finish, { assistantId, tail: buffer });
  },
});

Per-token writes feel like the obvious move and are almost always the wrong one. Flushing at sentence or clause boundaries produces a UI that looks character-by-character to the user while keeping write volume reasonable. What looked like character-by-character streaming in our demos was actually sentence-by-sentence under the hood, and users couldn't tell the difference.
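
One possible shape for the flush check is sketched below. The boundary rules and the `MAX_BUFFER_CHARS` cap are assumptions chosen for illustration, not the values used in the demo, and `batchTokens` is a hypothetical helper that only exists to show the effect on write count.

```typescript
// Assumed cap so long unpunctuated runs still reach the UI; tune for your app.
const MAX_BUFFER_CHARS = 200;

// Flush when the buffer ends at a sentence or clause boundary, at a newline,
// or when it has grown past the size cap.
function shouldFlush(buffer: string): boolean {
  if (buffer.length === 0) return false;
  if (buffer.length >= MAX_BUFFER_CHARS) return true;
  if (buffer.endsWith("\n")) return true;
  // Sentence or clause punctuation, optionally followed by a closing quote or
  // bracket and trailing whitespace.
  return /[.!?;:,]["')\]]?\s*$/.test(buffer);
}

// Group a token stream into the writes that would actually hit the database.
function batchTokens(tokens: string[]): string[] {
  const writes: string[] = [];
  let buffer = "";
  for (const token of tokens) {
    buffer += token;
    if (shouldFlush(buffer)) {
      writes.push(buffer);
      buffer = "";
    }
  }
  if (buffer.length > 0) writes.push(buffer); // the tail after the last boundary
  return writes;
}

// Nine tokens collapse into two writes.
const writes = batchTokens([
  "Per", "-token", " writes", " add", " up", ".", " Batch", " them", "!",
]);
```

The newline rule matters most for code output, where a sentence boundary may not arrive for many lines.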

Coordinating Agents Across Tables

Figure: Agent architecture diagram showing the action/workflow loop with LLM calls, tool calls, branching, and handoffs coordinated through the DB

Multi-agent setups fall out of the same model. Two actions can write into the same runs table, or a planner agent can write tasks into a queue table that worker agents pull from. There is no need for a separate message bus because the database is already the coordination point, and the reactive layer means every participant, including the UI, sees the same shared state.

The Convex Agent component formalizes this for the common cases, exposing thread, message, and run abstractions on top of the same primitives. Reach for the Convex Agent component when you want the conventions handled for you rather than rolling them by hand. The agent workflow documentation covers the underlying patterns in detail. Hand-rolling the same shape is fine when your data model is unusual, but the component captures the patterns most teams converge on after a few iterations.

Durable Functions for Long-Running AI Work

Figure: State Persistence slide: unsaved work is wasted work, checkpointing, alternating Persist and Retry-able steps

Durable functions are serverless functions that journal each step they take, retry automatically on failure, and can pause for arbitrary lengths of time, including months. They exist because the regular function model assumes a short, in-memory execution, whereas a multi-step agent might wait on a tool call, then on a human approval, then on an external webhook before completing.

When the model call fails or the tool returns a transient error, the workflow doesn't start over; it resumes from the last journaled step with the previous results intact. That property is what makes durable functions different in kind from "an action with a try/catch," because the resume point is recorded in storage rather than reconstructed from logs.

Automatic Retries and Journaling

Each step in a workflow is recorded before the next one runs. If the process dies between steps, the workflow runtime picks up at the last recorded checkpoint when it's rescheduled. Retries with backoff are configurable per step, so a flaky external API doesn't require you to write retry logic by hand for every call site.
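
The journaling and retry behaviour can be sketched without any workflow runtime. Everything here is illustrative and is not the API of a real durable-function library: `Journal` is an in-memory stand-in for durable storage, `runStep` and `backoffMs` are hypothetical names, and the steps are synchronous to keep the example short (real steps would be async, and a real runtime would sleep between attempts instead of just recording the delay).

```typescript
// Stand-in for a durable journal keyed by step name.
type Journal = Map<string, unknown>;

// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
function backoffMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Run a named step at most `maxAttempts` times and journal its result.
// A step that is already in the journal is never executed again.
function runStep<T>(
  journal: Journal,
  stepName: string,
  step: () => T,
  maxAttempts = 3,
  plannedDelays: number[] = [],
): T {
  if (journal.has(stepName)) return journal.get(stepName) as T;
  for (let attempt = 1; ; attempt++) {
    try {
      const result = step();
      journal.set(stepName, result); // checkpoint before the next step runs
      return result;
    } catch (error) {
      if (attempt >= maxAttempts) throw error;
      plannedDelays.push(backoffMs(attempt)); // a real runtime would sleep here
    }
  }
}

const journal: Journal = new Map();
const delays: number[] = [];
let searchCalls = 0;
let summarizeCalls = 0;

function workflow(): string {
  const docs = runStep(journal, "search", () => {
    searchCalls++;
    return ["doc-1", "doc-2"];
  });
  return runStep(journal, "summarize", () => {
    summarizeCalls++;
    if (summarizeCalls < 3) throw new Error("transient model error");
    return `summary of ${docs.length} docs`;
  }, 3, delays);
}

const summary = workflow(); // search runs once; summarize succeeds on attempt 3
const replayed = workflow(); // replays from the journal; no step runs again
```

Calling `workflow()` a second time stands in for a process that died and was rescheduled. Because both steps are in the journal, the expensive work is not repeated.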

This is the difference between async that handles a five-second model call and async that handles a forty-minute agent run with seven tool calls and two retries. The first works fine inside an action, whereas the second wants a workflow, because losing partial progress on the second one is expensive in both latency and tokens spent.

Pausing Workflows for Months

A workflow can sleep until a specific time or until an external event fires. This is what makes patterns like "schedule a follow-up email in three weeks if the user hasn't responded" or "wait for the human reviewer to approve the draft" expressible as a single function rather than as a sprawl of cron jobs and state machines.

The Convex Workflow component provides this durable execution model, with the journal stored in the same database your queries are reading from, so workflow state is reactive in the same way ordinary data is. A pending-approval UI is just another query on the workflow table, and an approval mutation is the same shape as any other mutation.
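
One way to picture a paused workflow is as a small record in storage that a scheduler re-evaluates, rather than a timer held in memory. The sketch below is a simplified model and not the Workflow component's API: the `Wait` type, `checkWait`, and the "user-replied" event name are hypothetical.

```typescript
// What a paused workflow is waiting for, stored as plain data.
type Wait =
  | { kind: "until"; wakeAt: number } // epoch milliseconds
  | { kind: "event"; eventName: string; timeoutAt?: number };

type WaitResult = "keep-waiting" | "resume" | "timed-out";

// Decide what to do with a paused workflow given the current time and the
// events recorded so far. No in-memory timer is involved.
function checkWait(wait: Wait, now: number, receivedEvents: string[]): WaitResult {
  if (wait.kind === "until") {
    return now >= wait.wakeAt ? "resume" : "keep-waiting";
  }
  if (receivedEvents.includes(wait.eventName)) return "resume";
  if (wait.timeoutAt !== undefined && now >= wait.timeoutAt) return "timed-out";
  return "keep-waiting";
}

// "Send a follow-up in three weeks if the user hasn't responded."
const DAY_MS = 24 * 60 * 60 * 1000;
const startedAt = Date.UTC(2025, 0, 1);
const followUp: Wait = {
  kind: "event",
  eventName: "user-replied",
  timeoutAt: startedAt + 21 * DAY_MS,
};

const afterFiveDays = checkWait(followUp, startedAt + 5 * DAY_MS, []);
const afterReply = checkWait(followUp, startedAt + 5 * DAY_MS, ["user-replied"]);
const afterThreeWeeks = checkWait(followUp, startedAt + 21 * DAY_MS, []);
```

Here "timed-out" is the branch where the follow-up email would be sent. Because the wait is only data, it survives restarts and deployments for as long as the row exists.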

When to Escalate From One-Off Functions to Workflows

A one-off action is the right tool when the work is a single bounded task, completes in seconds to a couple of minutes, and tolerates being retried from scratch. Escalate to a workflow when the work has multiple steps that shouldn't repeat on retry, when any step might pause for human input or an external event, or when partial progress is expensive enough that losing it is unacceptable. If a model call costs a few cents and finishes in ten seconds, an action is fine. If the task orchestrates five model calls, a vector search, and a tool invocation, the workflow runtime earns its keep.

The decision is rarely close once you frame it that way (see background job management for more on structuring this tradeoff). The cost of running a short task inside a workflow is some overhead and some indirection, whereas the cost of running a long multi-step task inside an action is losing the whole thing the first time anything goes wrong.
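
The rule of thumb can be written down as a small function, if only to make the criteria explicit. This is a rough encoding of the guidance above rather than an official heuristic: `TaskProfile` and `chooseRuntime` are hypothetical, and the 120-second threshold is an assumed reading of "a couple of minutes".

```typescript
type TaskProfile = {
  stepCount: number; // model calls, tool calls, searches, and so on
  expectedSeconds: number;
  mayPauseForExternalInput: boolean; // human approval, webhook, timer
  safeToRetryFromScratch: boolean; // false when lost progress is too costly
};

// Assumed threshold for "seconds to a couple of minutes".
const MAX_ACTION_SECONDS = 120;

function chooseRuntime(task: TaskProfile): "action" | "workflow" {
  const singleBoundedTask =
    task.stepCount <= 1 && task.expectedSeconds <= MAX_ACTION_SECONDS;
  if (
    singleBoundedTask &&
    task.safeToRetryFromScratch &&
    !task.mayPauseForExternalInput
  ) {
    return "action";
  }
  return "workflow";
}

// A ten-second completion versus a forty-minute, seven-step agent run.
const quickCompletion = chooseRuntime({
  stepCount: 1,
  expectedSeconds: 10,
  mayPauseForExternalInput: false,
  safeToRetryFromScratch: true,
});
const agentRun = chooseRuntime({
  stepCount: 7,
  expectedSeconds: 2400,
  mayPauseForExternalInput: false,
  safeToRetryFromScratch: false,
});
```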

A Live Demo of Cross-Client Sync

The reactive model produces a UX win that's hard to appreciate without seeing it. Open the same conversation in two browser windows, send a message from the first, and the assistant response streams into both at the same time without any per-client wiring. Hit the abort button in the second window, and the first window sees the generation stop. Two different websockets and two different React contexts, with one reactive backend.

Two Browsers, One Reactive Backend

Neither client knows about the other; both subscribe to the same query on the messages table. When the action writes a chunk, the database notifies the query, and the query pushes the new result to every subscriber. The clients are interchangeable, which is the point, since the client identity isn't load-bearing in the architecture.

Aborting Generation Across Clients

Figure: Aborting Generation Across Clients

Cancellation works the same way. The abort button fires a mutation that flips the status field on the assistant message to "cancelled". The action polls that status between chunks and exits cleanly when it sees the flag, while every subscribed client sees the cancellation reflected the moment the mutation commits. There's no extra channel to manage and no out-of-band signal to coordinate; it's just another write on the same table the UI is already watching.

The same shape extends to pause-and-resume, throttling, and any other control-plane signal you want to send to a running generation. Each one is a column on the row and a check inside the action loop, which keeps the surface area of the cancellation protocol roughly zero.
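
The check-between-chunks loop looks roughly like this. It is a simplified, synchronous model: `RowStore` is a hypothetical stand-in for the assistant message row in the database, `generate` plays the role of the action, and the `onChunkWritten` callback simulates another client acting while the generation runs.

```typescript
type GenerationStatus = "pending" | "streaming" | "cancelled" | "done";
type AssistantRow = { content: string; status: GenerationStatus };

// Stand-in for the assistant message row in the database.
class RowStore {
  private row: AssistantRow = { content: "", status: "pending" };
  read(): AssistantRow {
    return { ...this.row };
  }
  patch(fields: Partial<AssistantRow>): void {
    this.row = { ...this.row, ...fields };
  }
}

// Append chunks to the row, re-reading the status before each write.
function generate(
  store: RowStore,
  chunks: string[],
  onChunkWritten: (store: RowStore) => void = () => {},
): void {
  store.patch({ status: "streaming" });
  for (const chunk of chunks) {
    const current = store.read();
    if (current.status === "cancelled") return; // leave the cancelled state intact
    store.patch({ content: current.content + chunk });
    onChunkWritten(store);
  }
  if (store.read().status !== "cancelled") store.patch({ status: "done" });
}

// A second client hits abort right after the second chunk lands.
const rowStore = new RowStore();
let written = 0;
generate(rowStore, ["One. ", "Two. ", "Three. "], (s) => {
  written++;
  if (written === 2) s.patch({ status: "cancelled" });
});
const finalRow = rowStore.read();
```

The third chunk is never written, and the row keeps its "cancelled" status for every subscriber to read. A pause or throttle signal would be another field on the same row and another check in the same loop.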

Tips for Shipping Async AI UIs

A few practitioner-level notes that tend to come up once you start shipping these patterns to real users.

Batched Updates Over Per-Token Streaming

Figure: Smooth Streaming tips slide with useSmoothText code example

Flush model output at sentence or clause boundaries rather than per token. Users perceive the result as smooth streaming, database write volume drops by an order of magnitude, and the optimistic concurrency control (OCC) contention surface on the messages row shrinks accordingly. If the model emits a long code block, flush on newlines, because waiting for a sentence boundary inside a fifty-line code sample looks like the stream has hung.

Hybrid HTTP and Async Patterns

Figure: Hybrid Async and Sync

You don't have to go all in on async. A short, latency-sensitive completion can still ride a regular HTTP action and stream over the response if that's what the UX needs. The architectural rule is that any work whose value outlives the request should be persisted first, whereas work whose value is bounded to the response can stay in the request. Mix the two as the workload requires, since not every model call is a multi-minute agent run.

Optimistic Updates on the Client

Because the mutation writes the user message before the action runs, the UI can render the user's message immediately from the subscription rather than from local state, and the assistant placeholder appears the same way. Optimistic updates are still available for the rare cases where you want the UI to move before the round trip completes, but the reactive subscription usually makes them unnecessary, since the round trip is fast enough that the placeholder arrives before the user's eyes have moved.

If you're shipping a chat agent, a streaming generation, or a multi-step background workflow, persist first and let the reactive layer handle sync. That single decision removes most of the polling, webhooking, and cache-invalidation code that async AI apps tend to accumulate.

Frequently Asked Questions

Q: What does "async" mean in the context of AI apps? A: Async programming in AI apps refers to work whose lifetime exceeds a single request or process. An LLM generation might take thirty seconds, involve multiple tool calls, and need to continue running after the client that initiated it has disconnected. This is different from the Promise-based async TypeScript developers use day to day, which assumes the work completes inside a single live process.

Q: How do you keep an AI agent's progress synced to the client without polling? A: Use a reactive database. The agent writes its progress into a table, and the client subscribes to a query that reads from that table. When the table changes, the query result is pushed to every subscribed client over a websocket automatically, so there's no polling loop and no manual cache invalidation to maintain.

Q: What happens when a client disconnects mid-LLM-generation? A: If the generation is running inside a streaming HTTP request, the work is lost when the connection closes. If the generation is running in a backend action that writes to the database as it goes, the work continues independently of the client. When the client reconnects, it resubscribes to the same query and sees the current state of the generation, including any output produced while it was offline.

Q: How do durable functions handle retries and long-running AI workflows? A: Durable functions journal each step they execute, so a failure causes the workflow to resume from the last recorded checkpoint rather than restarting from the beginning. Retries with backoff are configurable per step, and workflows can pause for arbitrary lengths of time, including months, while waiting for external events or human input.

Q: How does a reactive database differ from a traditional SQL backend for AI workloads? A: A traditional SQL backend requires the client to ask whether anything has changed, usually through polling or webhooks. A reactive database turns queries into live subscriptions, so any write that affects a subscribed query result is pushed to the client automatically. For AI workloads, where the server is producing streaming output that multiple clients may want to see, this removes the need to build a separate notification system on top of the database.

Q: How do you cancel an in-flight LLM generation across multiple clients? A: Store the cancellation state in the database. A mutation from any client flips a status field on the generation row, the backend action polls that field between chunks and exits cleanly when it sees a cancellation, and every subscribed client sees the cancelled state through the same query they were already watching. One write, every client in sync.

Putting Async AI Patterns Into Practice

Async programming for AI apps with TypeScript gets simpler when persistence comes first and the network call comes second. Once the prompt, the partial output, and the run state live in a reactive database, client disconnects stop mattering, multi-node deployments stop requiring sticky sessions, and cross-client sync stops requiring a separate notification layer. Durable functions extend the same model to multi-step workflows that need to survive failures and long pauses. The combined effect is that the parts of an AI backend that usually require gluing together several services collapse into a single reactive backend with end-to-end TypeScript types.

Spin up a Convex project and ship the demo from this post to see the reactive model in action on your own workload.
