Implementing work stealing with a reactive database
In “Push vs. pull: patterns for compute-heavy workloads” I discuss a technique for distributed processing called “work stealing” and compare it to traditional “push-based” approach of load balancing HTTP requests. As a reminder, this is the general flow:
In this post I’ll dig into the details of how I implemented it for llama farm.
Tracking work
I use a table “jobs” representing what needs to be done:
jobs: defineTable({
work: v.object({
// details about what needs to be done
}),
status: literals(
"pending",
"inProgress",
"success",
"failed",
"timedOut"
),
lastUpdate: v.number(),
workerId: v.optional(v.id("workers")),
janitorId: v.optional(v.id("_scheduled_functions")),
}).index("status", ["status", "lastUpdate"]),
I’ll explain lastUpdate
, workerId
, and janitorId
below.
When there is a request needing an LLM, a job is inserted into the table.
await ctx.db.insert("jobs", {
work: {
// details about what needs to be done
},
status: "pending",
lastUpdate: Date.now(),
});
In my case, the work included the message to update with the result, and the chat context. All clients in the group chat subscribe to the latest messages, so as the worker updates the message with streamed results, they all see updates automatically.
Workers
Each worker is subscribed to a query isThereWork
that just returns true or false:
export const isThereWork = query({
args: {},
handler: async (ctx) => {
const work = await ctx.db
.query("jobs")
.withIndex("status", (q) => q.eq("status", "pending"))
.first();
return !!work;
},
});
A benefit of Convex queries is they automatically have caching and reactivity. All workers subscribed to the query will get the same cached response, and when a pending job is inserted, the cache will automatically be invalidated, the query re-run, and all workers will receive true
. So long as there is one or more pending jobs, the query will remain true
.
A more sophisticated algorithm could entail:
- Returning the work to be done, to help workers decide what they want to claim.
- Subscribe each worker to a subset of job queues. This can be used to:
- Scope workers to certain users or organizations. And extension to llama farm I plan to implement is having user-supplied workers that only process requests for the groups that the user is in.
- Shard work - where each worker pull the oldest request from the subset of queues it’s responsible for, stealing work from other queues only if it’s totally idle. This can help avoid database contention as traffic scales.
The work loop
Waiting for work
In a loop, the worker waits for work to exist (for the isThereWork
subscription to return true), tries to claim work, and then processes it, if it claimed any:
while (true) {
await waitForWork(client);
const work = await client.mutation(api.workers.giveMeWork, { apiKey });
// null if another worker claimed it first.
if (work) {
await doWork(work, client, apiKey);
}
}
-
The
waitForWork
function uses theonUpdate
API of theConvexClient
to make an await-able promise fromisThereWork
.function waitForWork(client: ConvexClient) { return new Promise<void>((resolve, reject) => { const unsubscribe = client.onUpdate( api.workers.isThereWork, {}, (thereIsWork) => { if (thereIsWork) { resolve(); unsubscribe(); } }, reject, ); }); }
Claiming work
To claim work, it:
-
Finds the first pending job, if there is one. It also fetches the latest version of any related data that the request needs, such as previous messages in the group chat.
-
Marks it as
"inProgress"
and note theworkerId
of the worker claiming the task. The worker passes an API key to authenticate its request. I use a custom function to make aworkerMutation
which validates the API key and looks up the associated worker. -
Schedules a “janitor” function to mark the job as
"timedOut"
if we haven’t heard from the worker in a while. The worker will periodically callimStillWorking
to update the job’slastUpdated
field to let the server know it’s still alive. When it does so, it will also cancel the janitor function by itsjanitorId
and schedule a new one.if (job.janitorId) { await ctx.scheduler.cancel(job.janitorId); } job.janitorId = await ctx.scheduler.runAfter( WorkerDeadTimeout, internal.workers.markAsDead, { jobId: job._id } );
All of these operations leverage Convex’s strong transactional guarantees (in particular, serializable isolation) to not have to worry about race conditions. The scheduled functions will not be scheduled if a transaction rolls back, and conflicts are automatically retried.
Submitting
When a worker is done, it calls submitWork
which:
- Updates the message associated with the job.
- Marks it as success or failed.
- Re-submits it to be retried if it hasn’t been retried too many times already. See below for more details.
If the request was a streaming request, it publishes updates periodically, so the user can see the response stream in. It batches updates by only sending them when it sees punctuation like , . ! ? \n
or if it’s longer than 100 characters.
Handling retries
A worker can either fail explicitly or disappear:
If the worker’s LLM fails in a way it can detect:
- It will retry a few times with backoff, and report failure if it fails the last attempt.
- If the worker reports failure, the job will be attempted by another worker by inserting a copy of the job as “pending” and checking for previous attempts by the same worker when claiming a job.
If the worker dies unexpectedly:
- The worker stops calling
imStillWorking
, solastUpdated
stops being incremented. - The janitor function
markAsDead
eventually executes, marking the request astimedOut
. - The job is not retried, with the reasoning that in a chat app, by the time a request times out, the value of a response is significantly diminished as they may have already sent other messages since then, and they can explicitly request another response if they want.
Summary
In this post we looked at how to implement a work stealing pattern in Convex, using subscription to connect jobs completed by workers with clients who submitted the jobs.
Check out the repo, and be sure to read this post on the tradeoffs of work stealing if you haven’t already.