How to Build Async AI Apps with Convex and TypeScript
Okay, so I work at Convex, and we're going to be talking about async: building things asynchronously, AI, all of that. Quick poll: who here are developers who don't already know about Convex? Okay, cool. I'll cover briefly what that is. And do people know what async, asynchronous, means? Yeah, it happens in the background. Great. I took async as a given for a while, but then I was talking to people who are building real apps with real customers. When we'd talk about their architecture, I thought, wait, I think we all need to catch up on the fact that not everything you do with ML should be driven from a streaming HTTP request to an API. But I'm getting ahead of myself. My name is Ian. I work at Convex as a developer experience person, doing product-y stuff and building libraries and components on top of Convex. Before that, I did backend work at Dropbox, where I was responsible for things like video transcoding, thumbnailing, and scaling the previews infrastructure. I did some full-stack consulting, and way, way back I did some iOS work. So let's talk a little bit about Convex. Convex is a product for developers who write code in TypeScript. You write your backend code in a convex folder, and everything in that folder gets deployed to our hosted backend and serves your requests. In that folder you also define your database schema. So it's not just that we run a few functions for you; your whole backend is there. We have auth, we have a database, and it's reactive and typed. That's cool because you write TypeScript, and all the types you define for your database show up in the functions you write that interact with the database. It also does runtime validation against that schema. Fun fact:
If you try to push to production, or even to your dev deployment, with a schema that doesn't match the data at rest, the push fails. It will only deploy code that matches the data you actually have. The types flow into your functions, your frontend calls those functions, and the types flow all the way through with argument validation, so you get type safety at runtime too. What sets Convex apart as a database is that it's a reactive database. Usually you have code doing SQL inserts and code doing SQL reads. Maybe you're doing it in transactions, maybe you're polling, or maybe you're using something like Postgres LISTEN/NOTIFY until it fails to scale. The cool thing about Convex is that when you define a query, you're saying, "I'm going to read these ranges of the database, these documents," and that query becomes a subscription. Any time someone else edits any of that data or inserts data into that range, your query reruns and the new results are pushed to the client. So what does that mean? Say I have a frontend that lists a bunch of chat messages, and somebody else answers one of them. My app doesn't have to do anything to update. The backend notices that the query I ran for those messages has been invalidated, grabs the new results, and sends them over the WebSocket. My React code just says, "Hey, now I have these messages, I'll show them." So that's cool. Convex is built around this syncing motion. The writes you do to the database happen in mutations, which are all serializable transactions, for those who care about ACID. Queries are pure reads, and they all stay in sync automatically. So it's reactive.
It's TypeScript, and it's async in the sense that writers are decoupled from readers, and the readers stay up to date by default because the platform does that for you. So what does that mean in the AI world? Usually you would have a request that carries a prompt, does some things, and then returns the result. The canonical way to do it here is to save the prompt to the database. The client already sees that the prompt has been persisted. Then you call an LLM, maybe in a background task, maybe just inline. It doesn't really matter, because when the response is written, it also flows through to the frontend. There are affordances for optimistic updates and things like that, so you can have a responsive frontend, but that pattern is really nice. When you have something like an agent, the client sends a prompt, which gets written. Then that prompt message can get passed in, or you can pass the prompt directly, or pass an ID to it. Either way, you get this flow where the agent calls an LLM, calls some code, and does some tool calls. As it goes, it writes those messages to the database, so you get reactive updates, especially for the message history. Maybe you actually have two processes writing to the same table, or maybe they're writing separate threads and coordinating.
We do this asynchronously, and it can be a one-off function. But it's a serverless environment, so you don't have infinite resources. Maybe you've been running for five minutes and need to say, "I'm going to pause and start again," or, "I failed, I'm going to run again," and all of that can happen asynchronously. So I've built some nice things that retry your functions automatically. We have a workflow system where you write code that automatically journals each step, with retries, and it can pause for months at a time: durable functions, all that stuff. It's pretty cool. I know I'm throwing a bunch of stuff out there, so let me just show a demo, with probably the worst UI you've seen all day. Here there are two browser windows: two different users, two different WebSockets, two different React contexts. The question is, when one of them takes an action, how does that propagate to the other? If you've been paying attention, you probably know what's going to happen, but here's the cool thing. I'm going to send a message in one window and then abort that generation from the other. So one window started it, both windows start generating and both get the updates, and the other one aborted it. Hopefully that makes what I'm talking about a little more concrete. Is this tracking so far? Good. Mostly good. Okay, so let's talk a little bit about the foundations for thinking about async.
So this is a pattern I see a lot. Clients make a call to some very stateful cloud thing, which then spins in its own loop for a very long time. It calls out to LLMs, which are, of course, so reliable; the thing I find with most LLMs is that they're just so great at being reliable. And this thing is running on cloud infrastructure, so consider everything that can happen. Your client can navigate away, close their browser, or lose their internet connection. The network can have a blip. The cloud infrastructure can hit a timeout or a server failure, or run into some resource limit. And interestingly, there's another little arrow here that I added. A lot of people think of the cloud as one box that you talk to: "Cool, if my network drops, I'll just reconnect and pick up where I left off." But it's not one box, and it's not one node. Most environments are multi-node. What that implies is this: unless you have really sticky sessions, where every request from a client goes to the same node, you already need to think about managing state in a way that's persisted and can be shared across many clients. Here's an example from my work at Dropbox doing video transcoding. It turns out there are more videos on Dropbox than on YouTube, and we had to transcode video to show live previews. That's kind of a tall task. Obviously there's a cache, and we try to pre-generate some things. But for live transcoding, the request comes in, great, I start transcoding this file, and then, oh, their network disconnected.
Restarting the transcode every time would be really expensive, because you have to transfer the file in, and you can't just jump ahead. So what do you do? You have something that starts running and saves out little chunks, and the clients use HLS, so they fetch one chunk at a time. You end up figuring out what persistence story gives the client the behavior of streaming with the server-side ergonomics of writing things from an asynchronous background process that can pick up where it left off. So you have these chunks, and then the question is how the client gets those updates. Does the client poll? Does it do it all in one long-lived request? Is there a webhook? For us in Convex, we just save the message. The client has a hook that's waiting for the messages, and they come back. This is from the Agent component, which makes these things easy to use. I know I'm short on time, so I'll just go through some bullet points of tips and tricks for async development. First, you can make a UI really smooth and slick without doing really fine-grained updates. You can batch those updates, send them down, and smooth them on the client. The demo you saw was not HTTP streaming. It looked like it was scrolling character by character, but it was actually coming in sentence by sentence. You just have to get the first one out fast. Second, you can do async aborting or updating, as long as the thing running in the background checks on status, so it's not too bad. Third, you don't have to go all in on async. You can have an HTTP request that optimistically sends things down over an HTTP stream.
If that disconnects, the work can keep running, or you can have something that retries it. So that's a lot of stuff, but I guess the takeaways would be these. One, AI really likes async. Two, async implies some sort of synchronization for getting data through, and Convex happens to be really good at synchronizing stuff. Happy to talk more about it. Use the Agent component, or just use Convex in whatever way you want; it's a pretty cool platform, and I like building on it. Any questions? Yeah, so Convex is a hosted platform. The team mostly comes from Dropbox, where we cared a lot about scaling things like that. It runs in really fast V8 isolates really close to the database, so you get millisecond lookups. The isolates are pre-warmed, so you don't have cold starts. It scales really well, with automatic horizontal scaling. If you have more scaling questions, let me know, but people are pretty happy with it. I don't know if I made it on time. If anyone else has a question, great. All right, you made it. Good job.
LLM workloads break the request-response model that most TypeScript backends were built around. A user sends a prompt, the model takes thirty seconds to respond, the user closes their tab halfway through, and the agent still has three tool calls and a summarization pass ahead of it. If your backend was holding that work inside a single HTTP request, the work is gone. If it was holding it in memory on one server, the work is gone the moment that server restarts.
This piece walks through why async programming for AI apps with TypeScript requires a different backend shape than most teams reach for, and how Convex's reactive database, durable functions, and end-to-end type safety remove the glue code that usually fills the gap.
Why AI Apps Are Async by Nature
AI apps are async because the work outlives the request. An LLM generation can take seconds or minutes, often involves multiple tool calls and retries, and frequently needs to keep running after the client that started it has disconnected. Holding that work inside a streaming HTTP request couples the lifetime of the computation to the lifetime of a single TCP connection, which is the wrong coupling.
The framing I keep coming back to is that not everything you do with ML should be driven from a streaming HTTP request to an API. A request is a fine trigger for work but a poor container for it. The trigger only has to start the job, while the container has to hold the job until it finishes. An LLM call needs that second role filled by something that doesn't vanish when the network blips.
Promise-Based Async vs Long-Running Async
TypeScript developers already live inside one kind of async. A Promise resolves inside a single process, await yields the event loop, and the runtime stitches the continuation back together. That model assumes the process stays alive long enough to see the promise settle, which is a fine assumption for a database lookup or a quick fetch, since the work fits comfortably inside a single function invocation.
Long-running async programming is a different animal, because the work needs to survive a client disconnect, a server restart, a deployment, and, in some cases, a multi-week pause waiting on a human or an external event. You can't model that with a single await since there's no single process whose memory you can trust for the duration. The state of the computation has to live somewhere durable, and some scheduler has to pick it up again when it's time to make progress.
The Streaming HTTP Request Anti-Pattern
[Figure: The client-LLM lifecycle, showing failure modes on the client, cloud, and LLM sides]
The default pattern in a lot of AI tutorials is to open a streaming response from the server to the browser, pipe model tokens through it, and call it done. That pattern fails in predictable ways:
The user navigates away or closes the browser
The network blips on a train or in an elevator
The load balancer enforces a timeout
The server process gets recycled
The model hits a rate limit and the SDK throws halfway through
Every one of these breaks the user-visible work, because the only place the work existed was inside that one request handler. If the generation was worth starting, it's usually worth finishing, which means it needs a home that outlives the connection. The whole point of moving the work out of the request is to make the connection optional rather than load-bearing.
What Breaks When You Treat the Cloud as One Box
Production cloud infrastructure is multi-node by default, so any architecture that assumes "the server" is a single machine breaks the moment you scale past one. Reconnect-and-resume only works if the resumed state is shared across nodes rather than pinned to a single process's memory, since the node that started the work may not be the node that handles the reconnection. Sticky sessions paper over this for a while and then fail loudly when an instance is replaced.
This is why persistence has to come first and the network call second. If the prompt, the partial output, and the intermediate tool results live in a database the moment they are produced, any node can pick the work back up and any client can resubscribe to its progress.
Disconnects, Timeouts, and Multi-Node Reality
The failure modes compound. A serverless function host might cap execution at a few minutes. An intermediary proxy might kill a long-poll. The server end of a websocket might outlive the browser tab that opened it, with no way to deliver its messages anywhere useful because the receiving client is gone. None of these are exotic failures; they're the normal operating conditions of a deployed app.
The architectural move is to stop treating the server as a place where work happens and start treating it as a place where work is scheduled, journaled, and resumed. The actual progress lives in storage that every node can read and write, which means any node can advance the computation and any client can observe it.
The Dropbox Transcoding Lesson
There is a useful analogy here from video transcoding. There are more videos on Dropbox than on YouTube, and Dropbox transcodes them into HLS so they can be played in a browser. Transcoding a full video inside a single request would be hopeless because users disconnect, whereas HLS is chunked by design, so each chunk can be transcoded independently, persisted, and resumed. The same shape applies to AI work:
Break the long task into steps
Persist each step's output
Make resumability a property of the storage rather than the connection
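Here is a minimal, framework-agnostic sketch of those three rules. It assumes a hypothetical runChunkedJob helper, with an in-memory Map standing in for a durable table. Any chunk whose output is already persisted is skipped, so a rerun after a crash resumes at the first missing chunk instead of starting over.

```typescript
// Hypothetical stand-in for durable storage (a database table in a real app):
// chunk index -> persisted output for that chunk.
type ChunkStore = Map<number, string>;

// Runs a long task as independent chunks. Chunks already in the store are
// skipped, so rerunning after a crash resumes at the first missing chunk.
function runChunkedJob(
  store: ChunkStore,
  totalChunks: number,
  processChunk: (index: number) => string,
): string[] {
  for (let i = 0; i < totalChunks; i++) {
    if (store.has(i)) continue; // finished by an earlier attempt
    store.set(i, processChunk(i)); // persist before starting the next chunk
  }
  const results: string[] = [];
  for (let i = 0; i < totalChunks; i++) results.push(store.get(i) as string);
  return results;
}

// Usage: the first attempt dies on chunk 2; a second attempt resumes there.
const chunkStore: ChunkStore = new Map();
const calls: number[] = [];
let firstAttempt = true;
const transcode = (i: number): string => {
  calls.push(i);
  if (i === 2 && firstAttempt) throw new Error("worker died");
  return `chunk-${i}`;
};
try {
  runChunkedJob(chunkStore, 4, transcode);
} catch {
  firstAttempt = false; // a fresh worker, possibly on another node, picks the job up
}
const output = runChunkedJob(chunkStore, 4, transcode);
```

Note that the job keeps no progress in local variables. Everything it needs to resume lives in the store, so a different process could run the second attempt.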
The Reactive Backend Model
[Figure: Convex mental model. The client connects to mutations and queries, which read and write a reactive database]
A reactive backend turns database queries into live subscriptions. When a Convex query reads a set of rows and a later mutation writes to any row inside that read range, every client subscribed to that query receives the updated result over a websocket automatically. There's no polling loop, no manual cache invalidation, and no webhook to wire up.
This is the mechanism that replaces most of the glue code in an async AI app. The LLM worker writes partial output to the database, and every client looking at that conversation sees the update. The worker doesn't need to know who is subscribed, and the clients don't need to ask whether anything changed, because the subscription itself is the change-notification system.
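A toy, in-memory model of that mechanism may help. This is not how Convex is implemented: Convex tracks the specific ranges a query reads, not whole tables. The hypothetical ReactiveStore below simply records which tables each query touched and re-runs those queries whenever one of those tables is written. That is enough to show why neither the writer nor the readers need to poll.

```typescript
type Row = Record<string, unknown>;

interface Subscription {
  query: (db: ReactiveStore) => unknown;
  push: (result: unknown) => void;
  reads: Set<string>; // tables read by the most recent run of the query
}

// Hypothetical toy reactive store: queries become subscriptions that re-run
// whenever a table they read is written.
class ReactiveStore {
  private tables = new Map<string, Row[]>();
  private subs: Subscription[] = [];
  private tracking: Set<string> | null = null;

  // All reads go through here so the store can record each query's read set.
  read(table: string): Row[] {
    this.tracking?.add(table);
    return [...(this.tables.get(table) ?? [])];
  }

  // A write re-runs every subscription that read this table and pushes the result.
  insert(table: string, row: Row): void {
    const rows = this.tables.get(table) ?? [];
    rows.push(row);
    this.tables.set(table, rows);
    for (const sub of this.subs) {
      if (sub.reads.has(table)) this.run(sub);
    }
  }

  subscribe<T>(query: (db: ReactiveStore) => T, push: (result: T) => void): void {
    const sub: Subscription = { query, push: (r) => push(r as T), reads: new Set() };
    this.subs.push(sub);
    this.run(sub); // deliver the initial result immediately
  }

  private run(sub: Subscription): void {
    const reads = new Set<string>();
    this.tracking = reads;
    const result = sub.query(this);
    this.tracking = null;
    sub.reads = reads;
    sub.push(result);
  }
}

// Usage: two "browser windows" subscribe to the same query; one write updates both.
const db = new ReactiveStore();
const listMessages = (d: ReactiveStore) => d.read("messages").map((m) => String(m.body));
const windowA: string[][] = [];
const windowB: string[][] = [];
db.subscribe(listMessages, (msgs) => windowA.push(msgs));
db.subscribe(listMessages, (msgs) => windowB.push(msgs));
db.insert("messages", { body: "hello" });
db.insert("users", { name: "ian" }); // listMessages never read "users", so nothing is pushed
```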
Mutations, Queries, and Subscriptions
Convex splits backend functions into mutations and queries with different guarantees. Mutations are serializable transactions that can read and write, so two mutations touching the same row see a consistent, ordered view of the world. Queries are pure reads that the system can safely re-run and cache, which is what makes them eligible to become reactive query subscriptions that push updates to clients.
Long-running work, including LLM calls and other side effects, lives in actions. An action can call out to a model provider and then run a mutation to commit the result. The transactional boundary stays clean because the network call isn't inside the transaction. That matters, since you don't want a flaky model API holding a transaction open while it retries.
Type Safety From Schema to Client
Schemas are defined in TypeScript and validated at runtime, so a deploy whose schema doesn't match the data at rest fails before it ships. Argument validators sit on every function, so inputs are checked at the boundary rather than five layers deep. Types flow from the schema and validator definitions through the function signatures and into the React hooks on the frontend. The npx convex dev watcher keeps the generated API types current, so there's no separate code-generation step to remember to run.
For an async AI app this matters because the messages, tool calls, and run state are exactly the kind of nested, evolving shapes that drift between server and client when types aren't enforced end-to-end. A run row that gained a cancelledAt field last week needs to surface that field in the React component reading it this week, and the type system should refuse to compile if it doesn't.
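Here is a small TypeScript illustration of that guarantee, independent of Convex. It models the run's status as a discriminated union and makes the UI's render function exhaustive, so adding a variant breaks the build until every consumer handles it. The Run type, its fields, and describeRun are hypothetical; in a Convex app, the row type would be derived from the schema rather than written by hand.

```typescript
// Hypothetical run row shape. Adding a variant to this union makes the
// exhaustive switch below fail to compile until the new case is handled.
type Run =
  | { status: "pending"; prompt: string }
  | { status: "streaming"; prompt: string; partial: string }
  | { status: "done"; prompt: string; output: string }
  | { status: "cancelled"; prompt: string; partial: string; cancelledAt: number };

// Only reachable (and only type-checks) if some variant is unhandled.
function assertNever(x: never): never {
  throw new Error(`Unhandled run: ${JSON.stringify(x)}`);
}

function describeRun(run: Run): string {
  switch (run.status) {
    case "pending":
      return "Thinking...";
    case "streaming":
      return run.partial;
    case "done":
      return run.output;
    case "cancelled":
      // The compiler knows cancelledAt exists only on this variant.
      return `${run.partial} [stopped at ${new Date(run.cancelledAt).toISOString()}]`;
    default:
      return assertNever(run);
  }
}
```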
Building an Async AI Workflow in Convex
The pattern that solves the streaming-HTTP problem is to persist the prompt first and run the model second. The client calls a mutation that writes the user's message into a messages table and enqueues an action to handle the generation. The action calls the model, streams partial output back into the database, and a query on the same table keeps every subscribed client in sync.
Save the Prompt, Then Call the LLM
[Figure: AI workflow. A Write Prompt mutation, a Call LLM action, a Write Response mutation, and a List Messages query, all touching the database]
The mutation is short and transactional: it writes the user message, creates a placeholder assistant message, and schedules the action that will fill it in.
Because the mutation is a serializable transaction, the user message and the placeholder land atomically. If the client disconnects the instant after the mutation returns, the scheduled action still runs and the assistant message still fills in, so there's no orphaned state and no need for a reconciliation pass on the next page load.
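The ordering can be sketched without any framework. In this hypothetical, in-memory version, sendMessage writes the user message and the assistant placeholder and enqueues the generation job in one step. The queued job needs only a row ID and the prompt, so it runs whether or not the caller is still connected. The names messages, jobQueue, sendMessage, and generate are illustrative stand-ins. In Convex, the writes would happen in a mutation, and the enqueue would be a call to ctx.scheduler.runAfter(0, ...) inside that same mutation.

```typescript
interface Message {
  id: number;
  role: "user" | "assistant";
  body: string;
  status: "pending" | "done";
}

// Hypothetical in-memory stand-ins for a messages table and a job queue.
const messages: Message[] = [];
const jobQueue: Array<() => void> = [];
let nextId = 1;

// Stands in for the mutation. All writes are prepared first and then applied
// together, so there is never a user message without its placeholder and job.
function sendMessage(body: string): number {
  const user: Message = { id: nextId++, role: "user", body, status: "done" };
  const placeholder: Message = { id: nextId++, role: "assistant", body: "", status: "pending" };
  const job = () => generate(placeholder.id, body);
  messages.push(user, placeholder);
  jobQueue.push(job);
  return placeholder.id;
}

// Stands in for the scheduled action: it needs only an ID and the prompt, not the client.
function generate(assistantId: number, prompt: string): void {
  const reply = `echo: ${prompt}`; // hypothetical stand-in for a real model call
  const row = messages.find((m) => m.id === assistantId);
  if (row) {
    row.body = reply;
    row.status = "done";
  }
}

// Usage: the client sends and immediately "disconnects"; a worker drains the
// queue anyway, and any client that subscribes later sees the finished reply.
const assistantId = sendMessage("hi");
while (jobQueue.length > 0) {
  const next = jobQueue.shift();
  if (next) next();
}
```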
Streaming Updates Without HTTP Streaming
The action drives the model and writes chunks back into the placeholder row; the query the UI subscribes to doesn't change shape during the stream, since it just keeps returning the latest content.
Per-token writes feel like the obvious move but are almost always the wrong one. Flushing at sentence or clause boundaries, combined with client-side smoothing that reveals the text gradually, produces a UI that looks character-by-character to the user while keeping write volume reasonable. What looked like character-by-character streaming in the demo was actually arriving sentence by sentence under the hood.
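Here is a minimal sketch of the write loop. For simplicity, it assumes the model's output is already available as an array of token strings; a real SDK typically yields an async iterable consumed with for await. The hypothetical patch callback stands in for the mutation that updates the placeholder row. Tokens accumulate in a buffer that is written only when it ends a sentence, plus one final write that marks the row done.

```typescript
// Hypothetical stand-in for the mutation that patches the placeholder row.
type PatchFn = (update: { body: string; done: boolean }) => void;

// Writes model output to the placeholder row at sentence boundaries instead
// of on every token, then once more at the end. Returns the number of writes.
function streamToRow(tokens: readonly string[], patch: PatchFn): number {
  let written = ""; // text already persisted to the row
  let buffer = ""; // text received but not yet persisted
  let writes = 0;
  for (const token of tokens) {
    buffer += token;
    if (/[.!?]\s*$/.test(buffer)) {
      written += buffer;
      buffer = "";
      patch({ body: written, done: false });
      writes++;
    }
  }
  // Final flush: persist any trailing partial sentence and mark the row done.
  patch({ body: written + buffer, done: true });
  return writes + 1;
}

// Usage: ten tokens become four writes (three sentence flushes plus the final one).
const tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Bye", " now", "."];
const snapshots: string[] = [];
const writeCount = streamToRow(tokens, (u) => {
  snapshots.push(u.body);
});
```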
Coordinating Agents Across Tables
[Figure: Agent architecture. The action/workflow loop with LLM calls, tool calls, branching, and handoffs, coordinated through the database]
Multi-agent setups fall out of the same model. Two actions can write into the same runs table, or a planner agent can write tasks into a queue table that worker agents pull from. There is no need for a separate message bus because the database is already the coordination point, and the reactive layer means every participant, including the UI, sees the same shared state.
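Here is a toy version of the queue-table pattern, with an in-memory tasks array and hypothetical plan, claimNext, and complete helpers. Each claim is a single read-then-write. In a single-threaded sketch that is trivially atomic. In Convex, the equivalent would run inside one serializable mutation, which is what prevents two workers from claiming the same task.

```typescript
interface Task {
  id: number;
  description: string;
  status: "queued" | "claimed" | "done";
  worker?: string;
  result?: string;
}

// Hypothetical shared queue table.
const tasks: Task[] = [];

// Planner agent: writes work into the shared table.
function plan(descriptions: string[]): void {
  for (const description of descriptions) {
    tasks.push({ id: tasks.length + 1, description, status: "queued" });
  }
}

// Worker agent: claims the oldest queued task in one read-then-write, or returns null.
function claimNext(worker: string): Task | null {
  const task = tasks.find((t) => t.status === "queued");
  if (!task) return null;
  task.status = "claimed";
  task.worker = worker;
  return task;
}

function complete(task: Task, result: string): void {
  task.status = "done";
  task.result = result;
}

// Usage: two workers drain the queue; each task is processed exactly once.
plan(["search docs", "summarize", "draft reply"]);
const processedBy: string[] = [];
for (let round = 0; round < 3; round++) {
  for (const worker of ["w1", "w2"]) {
    const task = claimNext(worker);
    if (task) {
      complete(task, `${worker} did ${task.description}`);
      processedBy.push(`${task.id}:${worker}`);
    }
  }
}
```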
The Convex Agent component formalizes this for the common cases, exposing thread and message abstractions on top of the same primitives, and the agent workflow documentation covers the underlying patterns in detail. Reach for the component when you want those conventions handled for you. Hand-rolling the same shape is fine when your data model is unusual, but the component captures the patterns most teams converge on after a few iterations.
Durable Functions for Long-Running AI Work
[Figure: State persistence. Unsaved work is wasted work; checkpointing alternates persist steps with retryable steps]
Durable functions are serverless functions that journal each step they take, retry automatically on failure, and can pause for arbitrary lengths of time, including months. They exist because the regular function model assumes a short, in-memory execution, whereas a multi-step agent might wait on a tool call, then on a human approval, then on an external webhook before completing.
When the model call fails or the tool returns a transient error, the workflow doesn't start over, since it resumes from the last journaled step with the previous results intact. That property is what makes durable functions different in kind from "an action with a try/catch," because the resume point is recorded in storage rather than reconstructed from logs.
Automatic Retries and Journaling
Each step in a workflow is recorded before the next one runs. If the process dies between steps, the workflow runtime picks up at the last recorded checkpoint when it's rescheduled. Retries with backoff are configurable per step, so a flaky external API doesn't require you to write retry logic by hand for every call site.
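The journaling idea fits in a few dozen lines. This is a hypothetical sketch, not the Convex Workflow component's API. Journal is an in-memory Map standing in for durable storage. runStep returns the recorded result when one exists and otherwise runs the step with bounded retries. The exponential backoff delays are recorded rather than actually slept, to keep the example synchronous.

```typescript
// Hypothetical durable record of completed steps (a table in a real system).
type Journal = Map<string, unknown>;

interface RetryPolicy {
  maxAttempts: number;
  initialBackoffMs: number;
  base: number; // multiplier applied to the delay after each failed attempt
}

// Returns the journaled result if this step already ran; otherwise runs it,
// retrying failures up to maxAttempts and recording each backoff delay.
function runStep<T>(
  journal: Journal,
  name: string,
  fn: () => T,
  policy: RetryPolicy,
  backoffs: number[],
): T {
  if (journal.has(name)) return journal.get(name) as T; // replay, don't re-execute
  for (let attempt = 1; ; attempt++) {
    try {
      const result = fn();
      journal.set(name, result); // recorded before the next step can run
      return result;
    } catch (err) {
      if (attempt >= policy.maxAttempts) throw err;
      backoffs.push(policy.initialBackoffMs * policy.base ** (attempt - 1));
    }
  }
}

// Usage: a two-step agent run that dies between steps, then is rescheduled.
const journal: Journal = new Map();
const executions: string[] = [];
const backoffs: number[] = [];
const policy: RetryPolicy = { maxAttempts: 3, initialBackoffMs: 250, base: 2 };
let transientFailures = 1; // the model call will be rate limited once

function agentWorkflow(crashAfterSearch: boolean): string {
  const docs = runStep(journal, "search", () => {
    executions.push("search");
    return ["doc1", "doc2"];
  }, policy, backoffs);
  if (crashAfterSearch) throw new Error("process died"); // simulated crash between steps
  return runStep(journal, "llm", () => {
    executions.push("llm");
    if (transientFailures-- > 0) throw new Error("rate limited");
    return `draft from ${docs.length} docs`;
  }, policy, backoffs);
}

try {
  agentWorkflow(true);
} catch {
  // A workflow runtime would reschedule the workflow here.
}
const draft = agentWorkflow(false);
```

The second invocation replays the search step from the journal instead of re-executing it, and only the flaky model step is retried.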
This is the difference between async that handles a five-second model call and async that handles a forty-minute agent run with seven tool calls and two retries. The first works fine inside an action, whereas the second wants a workflow, because losing partial progress on the second one is expensive in both latency and tokens spent.
Pausing Workflows for Months
A workflow can sleep until a specific time or until an external event fires. This is what makes patterns like "schedule a follow-up email in three weeks if the user hasn't responded" or "wait for the human reviewer to approve the draft" expressible as a single function rather than as a sprawl of cron jobs and state machines.
The Convex Workflow component provides this durable execution model, with the journal stored in the same database your queries are reading from, so workflow state is reactive in the same way ordinary data is. A pending-approval UI is just another query on the workflow table, and an approval mutation is the same shape as any other mutation.
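Waiting on a human can be modeled the same way. Rather than blocking, each run of the workflow checks durable state for the event. If the event hasn't arrived, the run records that it is waiting and returns, holding nothing in memory. WorkflowState, reviewWorkflow, and approveDraft are hypothetical names, and this is not the Workflow component's own event API; it only shows why a pause of any length costs nothing while it lasts.

```typescript
interface WorkflowState {
  status: "running" | "awaitingApproval" | "completed";
  draft?: string;
  approvedBy?: string;
  sentMessage?: string;
}

// Hypothetical durable workflow row (a table in a real app). A pending-approval
// UI would just query rows whose status is "awaitingApproval".
const wf: WorkflowState = { status: "running" };

// One run of the workflow. It never blocks: if the approval has not arrived,
// it records that it is waiting and returns.
function reviewWorkflow(state: WorkflowState): void {
  if (state.draft === undefined) state.draft = "Proposed reply"; // step 1, already journaled on re-runs
  if (state.approvedBy === undefined) {
    state.status = "awaitingApproval"; // pause; could last minutes or months
    return;
  }
  state.sentMessage = `${state.draft} (approved by ${state.approvedBy})`;
  state.status = "completed";
}

// The approval is an ordinary write, after which the workflow is run again.
function approveDraft(state: WorkflowState, reviewer: string): void {
  state.approvedBy = reviewer;
  reviewWorkflow(state);
}

reviewWorkflow(wf);
const statusWhilePaused: string = wf.status;
approveDraft(wf, "ian");
```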
When to Escalate From One-Off Functions to Workflows
A one-off action is the right tool when the work is a single bounded task, completes in seconds to a couple of minutes, and tolerates being retried from scratch. Escalate to a workflow when the work has multiple steps that shouldn't repeat on retry, when any step might pause for human input or an external event, or when partial progress is expensive enough that losing it is unacceptable. If a model call costs a few cents and finishes in ten seconds, an action is fine. If a workflow orchestrates five model calls, a vector search, and a tool invocation, the workflow runtime earns its keep.
The decision is rarely close once you frame it that way (see background job management for more on structuring this tradeoff). The cost of running a short task inside a workflow is some overhead and some indirection, whereas the cost of running a long multi-step task inside an action is losing the whole thing the first time anything goes wrong.
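The rule of thumb above can be written down as a small checklist function. The TaskProfile fields and the two-minute threshold are illustrative assumptions taken from the guidance in this section, not a Convex API or a platform limit.

```typescript
// Hypothetical description of a unit of AI work.
interface TaskProfile {
  steps: number; // distinct side-effecting steps
  expectedSeconds: number; // typical wall-clock duration
  waitsOnHumanOrEvent: boolean; // pauses for approval, a webhook, etc.
  safeToRetryFromScratch: boolean; // repeating everything on failure is acceptable
}

type ExecutionModel = "action" | "workflow";

// Encodes the escalation rule: any reason to preserve progress means a workflow.
function chooseExecutionModel(t: TaskProfile): ExecutionModel {
  if (t.waitsOnHumanOrEvent) return "workflow";
  if (t.steps > 1 && !t.safeToRetryFromScratch) return "workflow";
  if (t.expectedSeconds > 120) return "workflow"; // illustrative threshold
  return "action";
}

// A single ten-second model call -> "action"
const singleCall = chooseExecutionModel({ steps: 1, expectedSeconds: 10, waitsOnHumanOrEvent: false, safeToRetryFromScratch: true });
// A forty-minute run with seven tool calls -> "workflow"
const agentRun = chooseExecutionModel({ steps: 7, expectedSeconds: 2400, waitsOnHumanOrEvent: false, safeToRetryFromScratch: false });
```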
A Live Demo of Cross-Client Sync
The reactive model produces a UX win that's hard to appreciate without seeing it. Open the same conversation in two browser windows, send a message from the first, and the assistant response streams into both at the same time without any per-client wiring. Hit the abort button in the second window, and the first window sees the generation stop. Two different websockets and two different React contexts, with one reactive backend.
Two Browsers, One Reactive Backend
Neither client knows about the other; both subscribe to the same query on the messages table. When the action writes a chunk, the database notifies the query, and the query pushes the new result to every subscriber. The clients are interchangeable, which is the point, since the client identity isn't load-bearing in the architecture.
Aborting Generation Across Clients
Cancellation works the same way. The abort button fires a mutation that flips the status field on the assistant message to "cancelled". The action checks that status between chunks and exits cleanly when it sees the flag, while every subscribed client sees the cancellation the moment the mutation commits. There's no extra channel to manage and no out-of-band signal to coordinate: just another write on the same table the UI is already watching.
The same shape extends to pause-and-resume, throttling, and any other control-plane signal you want to send to a running generation. Each one is a column on the row and a check inside the action loop, which keeps the surface area of the cancellation protocol roughly zero.
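Here is a synchronous sketch of the check-between-chunks loop. The GenerationRow shape and runGeneration are hypothetical. onChunkWritten stands in for the moment another client's abort mutation might land. A real action would re-read the row from the database before each chunk rather than share an object in memory.

```typescript
interface GenerationRow {
  body: string;
  status: "streaming" | "cancelled" | "done";
}

// Hypothetical generation loop. The status is checked before every chunk,
// so a cancellation written by any client stops the loop at the next boundary.
function runGeneration(
  row: GenerationRow,
  chunks: readonly string[],
  onChunkWritten: (count: number) => void,
): number {
  let written = 0;
  for (const chunk of chunks) {
    if (row.status === "cancelled") break; // control-plane check between chunks
    row.body += chunk;
    written++;
    onChunkWritten(written);
  }
  if (row.status !== "cancelled") row.status = "done";
  return written;
}

// Usage: a second window presses Abort right after the second chunk lands.
const generation: GenerationRow = { body: "", status: "streaming" };
const chunksWritten = runGeneration(generation, ["One. ", "Two. ", "Three. ", "Four. "], (count) => {
  if (count === 2) generation.status = "cancelled";
});
```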
Tips for Shipping Async AI UIs
A few practitioner-level notes that tend to come up once you start shipping these patterns to real users.
Batched Updates Over Per-Token Streaming
[Figure: Smooth streaming tips, with a useSmoothText code example]
Flush model output at sentence or clause boundaries rather than per token. Users perceive the result as smooth streaming, database write volume drops by an order of magnitude, and the OCC contention surface on the messages row shrinks accordingly. If the model emits a long code block, flush on newlines, because waiting for a sentence boundary inside a fifty-line code sample looks like the stream has hung.
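One way to encode that flush rule is a hypothetical shouldFlush helper. It flushes when the buffer ends a sentence, when it ends a line (which covers code blocks), or when it grows past a size cap, so a long run-on never looks stalled. The 200-character cap is an arbitrary illustrative value.

```typescript
const MAX_BUFFER_CHARS = 200; // illustrative cap; tune for your UI

// Hypothetical helper: decides whether buffered model output should be written now.
function shouldFlush(buffer: string): boolean {
  if (buffer.length >= MAX_BUFFER_CHARS) return true; // never stall on run-ons
  if (buffer.endsWith("\n")) return true; // line boundary, e.g. inside a code block
  return /[.!?]["')\]]?\s*$/.test(buffer); // sentence end, optionally before a closing quote or bracket
}
```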
Hybrid HTTP and Async Patterns
[Figure: Hybrid async and sync]
You don't have to go all in on async. A short, latency-sensitive completion can still ride a regular HTTP action and stream over the response if that's what the UX needs. The architectural rule is that any work whose value outlives the request should be persisted first, whereas work whose value is bounded to the response can stay in the request. Mix the two as the workload requires, since not every model call is a multi-minute agent run.
Optimistic Updates on the Client
Because the mutation writes the user message before the action runs, the UI can render the user's message immediately from the subscription rather than from local state, and the assistant placeholder appears the same way. Optimistic updates are still available for the rare cases where you want the UI to move before the round trip completes, but the reactive subscription usually makes them unnecessary, since the round trip is fast enough that the placeholder arrives before the user's eyes have moved.
If you're shipping a chat agent, a streaming generation, or a multi-step background workflow, persist first and let the reactive layer handle sync. That single decision removes most of the polling, webhooking, and cache-invalidation code that async AI apps tend to accumulate.
Frequently Asked Questions
Q: What does "async" mean in the context of AI apps? A: Async programming in AI apps refers to work whose lifetime exceeds a single request or process. An LLM generation might take thirty seconds, involve multiple tool calls, and need to continue running after the client that initiated it has disconnected. This is different from the Promise-based async TypeScript developers use day to day, which assumes the work completes inside a single live process.
Q: How do you keep an AI agent's progress synced to the client without polling? A: Use a reactive database. The agent writes its progress into a table, and the client subscribes to a query that reads from that table. When the table changes, the query result is pushed to every subscribed client over a websocket automatically, so there's no polling loop and no manual cache invalidation to maintain.
Q: What happens when a client disconnects mid-LLM-generation? A: If the generation is running inside a streaming HTTP request, the work is lost when the connection closes. If the generation is running in a backend action that writes to the database as it goes, the work continues independently of the client. When the client reconnects, it resubscribes to the same query and sees the current state of the generation, including any output produced while it was offline.
Q: How do durable functions handle retries and long-running AI workflows? A: Durable functions journal each step they execute, so a failure causes the workflow to resume from the last recorded checkpoint rather than restarting from the beginning. Retries with backoff are configurable per step, and workflows can pause for arbitrary lengths of time, including months, while waiting for external events or human input.
Q: How does a reactive database differ from a traditional SQL backend for AI workloads? A: A traditional SQL backend requires the client to ask whether anything has changed, usually through polling or webhooks. A reactive database turns queries into live subscriptions, so any write that affects a subscribed query result is pushed to the client automatically. For AI workloads, where the server is producing streaming output that multiple clients may want to see, this removes the need to build a separate notification system on top of the database.
Q: How do you cancel an in-flight LLM generation across multiple clients? A: Store the cancellation state in the database. A mutation from any client flips a status field on the generation row, the backend action polls that field between chunks and exits cleanly when it sees a cancellation, and every subscribed client sees the cancelled state through the same query they were already watching. One write, every client in sync.
Putting Async AI Patterns Into Practice
Async programming for AI apps with TypeScript gets simpler when persistence comes first and the network call comes second. Once the prompt, the partial output, and the run state live in a reactive database, client disconnects stop mattering, multi-node deployments stop requiring sticky sessions, and cross-client sync stops requiring a separate notification layer. Durable functions extend the same model to multi-step workflows that need to survive failures and long pauses. The combined effect is that the parts of an AI backend that usually require gluing together several services collapse into a single reactive backend with end-to-end TypeScript types.
Spin up a Convex project and ship the demo from this post to see the reactive model in action on your own workload.