
Why Convex Limits Transactions and How Concurrency Control Shapes Your Database

Why does Convex limit my transaction length? Why do I get write conflicts or OCC errors? Can't Convex just solve this for me? I mean, Postgres doesn't have these problems. Well, I'm here today to tell you that, unfortunately, Convex can't solve these problems, and Postgres does have these problems. So if you've ever been curious to know more about things like record contention, database consistency, isolation levels, and how to keep databases fast, I'm going to try to give you everything you need to know in this one video. If you watch this, you should be good to go, whether you're using Convex or, to be honest, any other system.

Let's start with the most basic case, the happiest database, the easiest thing to explain. Meet our castle. This is going to be our overwrought metaphor for talking about databases today. There's a vault in this castle, which we can see on the left, and it has pedestals, and each pedestal holds a ledger. There are five nobles here, Ashcroft, Waverly, and so on, and then there's the king with his special gold pedestal. We also have a number of accountants, each of whom works for a particular noble and conducts transactions on their behalf.

Let's start with the simplest possible thing to do, a low-contention set of transactions. On this day, the nobles are all harvesting their farmland and selling those crops, so they have some deposits to make into their accounts. Their accountants each go into the vault, head to the pedestal that belongs to their employer, and add $10 to the account to represent the wheat or whatever was sold that day. Very straightforward. When they're done updating the ledger, they head back to their desk and wait until there's something else to do. You might notice that they take a few seconds to do the work at the pedestal. That's because it's like the 1400s or whatever, right?
So let's just say they're using abacuses or something. They couldn't add as quickly back then as we can now, so it takes a little time to conduct the transaction. But they're all happy. Everybody's transactions are getting finished. There's no contention on any record. That's going to come soon.

We'd call this load embarrassingly parallel. It just means everyone is able to work independently. No one depends on anything else, and no ledger depends on any other, so we can run this for a lot of nobles all at the same time. In some respects the opposite of embarrassingly parallel is a SPOF, a single point of failure, or a single point of contention. That means there's one resource that everyone depends on, which can cause problems both for the reliability of the system and for contention: if everyone wants to get to it at the same time, that might become a problem.

So this pattern we saw here, very straightforward parallel editing of independent ledgers: what database systems behave that way? Every single database system. Every database system looks like a champ, including Postgres and Convex and Redis and everything else, when you have no contention.

So let's move on to where things get interesting. As we all know, there's no headache worse than tax day. On tax day, everyone has £100 in their account and it's time to pay the king. All the accountants rush into the vault and grab their ledger. They carry it over to the king's pedestal, where the king's ledger is, and they sit there with their abacus and do the calculation. Why do they do this? Why are they right in front of the pedestal? Why are they blocking the other accountants from accessing it? Because it takes a little time to add things up on the abacus, and they need to make sure the king's ledger doesn't change while they're working.
And they don't want their noble's ledger to change while they're working either. So they hang on to their noble's ledger with one hand and block everybody else out from the king's ledger. And, I don't know, they manipulate the abacus with their other hand. I don't know how many hands we have here. Once they come up with the right value, they write it down in the king's ledger, write it down in their noble's ledger, and go put it back. They're like, "That's it. We did it. I did the transaction. I took money out of my noble's account and put it into the king's account." And you can see that's what's happening here.

But you'll also notice that a bit of a queue forms, because people have to wait for that abacus work to be done one at a time. It might even start to get crowded in the vault. It might be hard to push past people and conduct unrelated transactions. What if there's a war somewhere and the navy needs to pull some money out of the king's account, and it has to wait in line behind a whole bunch of things? It's like, "We need warships. We're losing." "You've got to wait." So contention is a problem in databases, just like it's a problem in this vault. If everyone is trying to access the same record, and each of them takes a little time to conduct their transaction, work can easily pile up. What's happening here is effectively waiting on a lock.

Even trickier is this scenario. Some of you may have heard of deadlocks. This is a real problem that can happen in systems where you're trying to make sure a transaction involving two different records happens correctly. In this case Ashcroft and Hartwell, two nobles, are probably trying to transfer money to each other, but each accountant grabs their own employer's ledger first and then walks over to the other table.
In this case, Hartwell's accountant is sitting there looking at the empty table, going, "Oh, someone must be using Ashcroft's ledger right now. I'll just wait a minute until they're done." But the problem is that's the exact same reasoning Ashcroft's accountant is using about Hartwell's ledger. So they're never going to make any progress until they eventually give up and start wandering around, trying to see whether something is going on here, thinking, "I will never get this back." So this is a deadlock.

This approach is pessimistic concurrency control. Why is it pessimistic? Because you assume contention will happen, so you lock everything ahead of time to prevent it. That's what the accountants were doing by blocking everyone else from using the ledgers: they make sure only they can touch the ledgers right now, so that when they do the math they know they're writing the correct next value into both ledgers at the same time. And that blocking on access can tie up resources. Obviously no one else can get to the king's ledger that they all need, so they have to wait. And you have to be careful about deadlocks. If you need two ledgers to conduct an operation and you grab them in the wrong order, you can end up tying up all the transactions related to either of those two accounts.

Fundamentally, the problem we have here is that for any operation that takes n seconds, you can only do 1/n of those per second, right?
So even if I only take half a second to add stuff up, because I'm the fastest abacus user in the world, okay, now we can do two transactions a second. There's no way around it, because we need to make sure no one else changes the ledgers while we work. We cannot parallelize the work on these ledgers. What systems behave like what we saw in this simulation? This is pretty close to using something like Postgres if you're very careful to use SELECT FOR UPDATE, which locks the records you depend on while you conduct your transaction.

Now, you don't have to manage this problem this way to have correct transactions. Just as there's pessimistic concurrency control, there's also optimistic concurrency control. So now the king says to everybody, "I don't like how crowded it is in here. My navy is trying to get money. Nobody can get through. There are too many bodies in the room. I need far fewer people in the vault so that I can conduct the king's business." So let's say the accountants get together and come up with a new scheme. This new scheme involves a window they make into the vault. The accountants can line up along this window and just look in, and they have really good eyesight, so they can see: right now, how much money does the king have, and how much does Ashcroft have? They take those two numbers, go back to their desks, and do the calculation there.

Let's see what that looks like. They go and read the values, and then they go back to their desks and do the work there. So they're no longer tying up a ledger while they do the long, expensive part of their job, which is all the abacus stuff. They do the abacus stuff at their desk. But you'll probably notice that when some of them go over, right where it restarts here, the first one succeeds, but then there are red check marks next to the other ones, right?
Those ones fail because they notice that the ledger has changed since they looked through the window. And because the ledger has changed, they decide, "Oh, I'm going to need to go back and try again." You can see up here in the corner it says "king write retries: 7". Eventually we get all the transactions done. If we watch up here, we will eventually get all five tax payments in, but a few people will have to try again because they lost the race. By the time they got to the ledger to write out their calculation, they realized the ledger had changed. So they needed to repeat the process: take another look through the window at the current values, then go back to their desks and take the time to calculate.

This means they're going to waste some work. That's the trade-off being made here. Sometimes you're going to have to calculate multiple times because you lost the race and someone else changed the ledger before you finished. But the vault is no longer packed full of people waiting. If someone needed to come in here and get to some other noble's ledger, you haven't packed the vault full of people, and they can get the job done.

You might say this metaphor doesn't really apply to databases, but it really does. The vault on the left represents the database server, and on the right are all the application servers that are calculating things. If the database server has a ton of pending work all piled up waiting for a finite amount of resources, it can eventually get to the point where there are no threads available to conduct a database transaction.
So there is a benefit to optimistic concurrency control, even though it is, both hypothetically and actually, less efficient because you do extra CPU work. The benefit is that the database server, which is the scarcest resource and the stateful one, doesn't have to deal with the back pressure of all the jobs waiting to run. The application servers deal with that back pressure, and that is often a very good trade-off in real systems.

So, in summary, optimistic systems usually work better when contention is rare, when most writes to the database aren't all trying to update the same record. In practice, in most systems, contention is fairly rare. Most transactions relate to a user's personal data, or maybe a small chat room they're in. They're not normally on the order of everyone using the app, for example. Optimistic concurrency control is willing to waste work on contention, so you may have to go try the transaction again if you lose the race. But it doesn't bog down the scarce resource, the database primary. It keeps all that back pressure from work waiting to happen in the application servers, which can be scaled up pretty much without limit and don't have this single-point-of-failure element to them.

What systems work this way? Convex works this way. So do FoundationDB and TiKV. A lot of newer systems are powered by optimistic concurrency control under the covers, because it often works out better in real systems even though it's less ideal in a perfect system. Postgres can be made to work this way at the serializable isolation level, which is not the default, and we're going to talk more about that soon, but it's pretty awkward to use that way, so it's very rarely configured for serializable isolation.

Let's dive further into why optimistic concurrency control is quite good in real systems. Imagine there's a new noble in our kingdom, Duke Cashington. This is our new-money dude.
And correspondingly, as you would expect, the infrastructure at the castle doesn't have a lot of respect yet for Duke Cashington. So they go hire a new accountant, a junior accountant who isn't very proven yet, and they say, "Duke, we've got a great person for you. They're going to take care of your account."

Now, in a situation where we're using pessimistic concurrency control, here's what could happen. All the accountants rush out. They're all trying to pay taxes, but Duke Cashington's accountant maybe isn't that fast yet at abacus work. They're new on the job. Or maybe they're pretty unreliable and they just fall asleep at the pedestal while they're working on the ledger, and they're blocking everyone else from making progress. As you can imagine, pretty much the whole kingdom will grind to a halt in this case. Having one actor, in this case Duke Cashington's young accountant, who is able to effectively stop the entire system from making progress is an extremely serious problem, one that most real backend systems sometimes have if they're not very careful.

You might say once again that this doesn't have much to do with real code, but it does. In a pessimistic system, the bigger your system gets, the more careful everyone needs to be not to write a query that takes too long to run. The more servers you have, the more likely you are to have a process that hangs for some reason or a thread that runs into a problem. And if any of those things happen, the lock can be held potentially indefinitely, and your entire site can be down. In fact, I would go as far as to say that if you read SEV reports from public companies, especially from when they were growing, in their first three to five years, some of their first SEVs are related to this problem.
A query that was too expensive held a lock open too long, and either failed or suddenly became extremely slow, maybe because of a query planner change, and it effectively tied up all of the throughput in the system waiting on that lock.

Now let's see how that same scenario plays out in a system that uses optimistic concurrency control. In this case the new accountant, the purple one there, goes and reads a snapshot, starts working at his desk, and falls asleep at his desk. That's not great, because it means Duke Cashington is maybe not paying his taxes today. But you will notice that despite that accountant sleeping, all the rest of the business of the kingdom gets conducted, and the vault is not tied up at all. In fact, Ashcroft right here actually has an additional transfer to do, transferring some money from Duke Cashington to Ashcroft. Probably Ashcroft sensed an opportunity and offered Duke Cashington some pretty bad farmland for too much money, and Duke Cashington, being new money, was happy to pay it. So even though Duke Cashington's ledger was involved in this case, the transaction could still be completed by Ashcroft's accountant, because not only was no lock held on the king's ledger, there wasn't even a lock held on Duke Cashington's. There are no locks in this system.

In real systems that get more and more code and more and more developers working on them, people may start to lose track of what things are doing or what the best practices are. You want to isolate problems and keep them from becoming system-wide problems, and optimistic concurrency control is excellent at that. Pessimistic concurrency control, pessimistic locking, is definitely more efficient in ideal conditions. If there is contention and everyone implements things optimally, you will spend less total time in the system than you would in an optimistic system, which allows itself to repeat work.
But in real conditions, some things take much longer than other things. Sometimes new code you just wrote is less hardened than existing code because it doesn't have any production time on it yet, and it's supporting features that don't matter as much yet because they're not the core of your project, and it's written by people newer to the team. There are all these correlations with introducing new things as codebases grow, and you want to be able to isolate work from other work and have it fail gracefully. You also want to make it so that you can't deadlock the system. With optimistic concurrency control you don't have to be careful about lock ordering, because there's never any locking. For this reason, the compositional power of optimistic concurrency control tends to be much, much more favorable for keeping a system online as it grows. That's why most new database systems and backend systems prefer optimistic concurrency control if it's at all possible, and that's why Convex does as well.

All right. I've made a lot of caveats about Postgres throughout this, like "wait till later, hold on, I'll explain." You might ask how Postgres really works out of the box, and this is pretty close to how MySQL works out of the box too. How do mainstream traditional RDBMSs fit into this picture, and is that why they don't have the problems or the constraints that Convex has? The short answer is yes, they make trade-offs differently than Convex does. Here's how they actually work. You might be surprised.

By default, the world's darling legacy databases actually use a mode called read committed. This is an isolation level. The isolation level where all the records are definitely correct is usually called serializable, which means the world feels single-threaded. That serialization is why only one person can be working with the king's ledger at a time: access to the king's ledger is serialized.
But because that has performance problems, and because it doesn't interact very well with traditional programming language environments, most databases use something called read committed out of the gate, which has pretty surprising data correctness characteristics. Let me show you exactly how it works. Just like in Convex, the accountants do not take locks by default. They just read the values they need and then write later. But what they actually read is whatever the latest committed value was at the time of the SELECT query. That's why it's called read committed. What that can mean, though, is that if a bunch of them read, then do some work, and then a bunch of them write, they can end up with lost writes.

Let's step through that real quick. In this case, we're going to see three accountants line up, and they all read the current values of all the ledgers. Just like in the optimistic concurrency control case, they go back to their desks and begin working. In this case, they each observe that the king had $1,000 and that their employers each had $100. They use the abacus. They determine that the king needs $1,010, because they're making a $10 tax payment. The issue is that when they go to the king, they each write $1,010. Once all three are done and they consider the transaction finished, the king's balance is only $1,010, even though $30 has been subtracted from the nobles' ledgers. They don't check whether the king's ledger has changed when they issue their write. They don't know that the accountant right before them already wrote $1,010 and that they need to go back and recalculate because the basis of their calculation is different. Now, you can see up here in our little display that we expected 30 pounds of tax paid at this point. Only 10 has been posted to the ledger, and $20 has literally dropped out of the system.
Despite the fact that this simulation should be zero-sum, the kingdom just lost $20 from the books. As we run this, you can see the next group of two: the first one succeeds, the second one silently loses data, and we just lost another $10 from the system. So we're up to $30.

This is actually how Postgres works out of the box. By default, systems like Postgres and MySQL just grab the last committed value of a record when the SELECT runs. This is a bit of a detail, but they're not even at snapshot isolation out of the box, meaning that if two SELECTs run after two different commits, they could read committed values from two completely different transactions for the two records they depend on. This is a very loose isolation level, to be honest, and most developers are surprised when they hear this. If you've ever run into weird values in your database that you thought were your fault, there's a chance they're actually attributable to this, if you were relying on ACID to actually work. Double-check your app.

The way around this is to switch to SELECT FOR UPDATE, which Postgres will use to start locking rows even at the read committed isolation level. But see all the previous notes about being very careful with pessimistic locking, because those are pessimistic locks, and you can deadlock your database or start tying up your database's threads or processes. You have to be very careful.

So now you might be saying, is this all just hopeless? Is there no way to speed up contended transactions? Let's say there are thousands of nobles in the king's kingdom and there's a lot of tax collecting going on all the time. How can I accurately add all of those up if I can't process, you know, a thousand transactions in whatever time period, a minute, an hour, however often they're paying tax?
How do you break this speed limit when, hopefully by now, it feels so fundamental that it's just a physics thing? Well, there are a few strategies, but they're mostly variations on a couple of patterns, because, to be honest, there's not that much you can do.

One is to just make the operation faster. If you're using a calculator instead of an abacus, you'll be able to do a lot more transactions per second, because the critical section is shorter. Regardless of whether you're doing optimistic or pessimistic concurrency control, system throughput will go up if you speed up the operation. That's always one thing you can do.

But if you can't speed up the operation, the most common strategy is to leverage staleness, which means relaxing consistency. Let's look at an example. In this scenario we have a lot more nobles, but the nobles don't pay the king directly. The nobles pay a tax collector, and we have four tax collectors, with each noble assigned to one of them. So each tax collector is managing four or five relationships. Only those tax collectors are allowed to update the king's ledger. That means there are only four people, four accountants in this case, trying to update the king's ledger, which keeps the contention down. The key thing is that those tax collectors put in the entire sum of all the transactions they've conducted as one transaction. So if they take $10, $10, $10, and $10, then when they go to the king they just put in $40 at once. That lowers the total number of transactions the king has to take.

Let's look at this simulation for a second. All of those nobles up above are racing in and conducting transactions with the tax collector. The tax collector occasionally moves over, taking their whole balance.
They wait in the queue for the king and add the ledger entry for all of the proceeds they've collected. So this is a kind of tree of computation. The thing that makes it stale is this: from the perspective of Ashcroft, the noble in the upper left, the second they've paid their taxes, they're pretty darn sure they've paid, because the money has come out of their account. But the king wouldn't say yet that he's gotten that $10. The king won't see the $10 until after the second transaction happens, where the tax collector goes and moves money from their intermediate fund into the king's ultimate ledger.

You'll notice we also have a new entity on the right side, the navy, building warships, doing war stuff, doing what navies do. Because the king's ledger isn't too contended, since it only has to deal with the tax collectors, the navy has no problem getting in there and taking sizable chunks of money out of the king's ledger. We can see the vault's balance going down over time, but it's still zero-sum in terms of the correctness of all the tax payments in this big network we have here. The only money going out is the net money to the navy, which was able to get its transactions done as well.

As you can imagine, this tree of transactions could be many levels deep if it needed to be, and we would just roll up aggregates over time, but we would never have too much contention on any one record. So yes, you can work around most contention problems by introducing staleness, selective staleness, in a way that's eventually consistent and that your application can tolerate. There are levels of aggregation and/or periodic batch work, done by a smaller number of actors than all the actors that want to conduct a transaction. You can always speed up the operation if you can. That's the easiest thing to do.
But we put so much work into optimizing systems these days that normally that is harder to do than coming up with some data-structure strategy that involves staleness. And this is what most Convex components do for you. You just call the function. You don't have to worry about how to implement all this. So if you're worried about these performance issues, make sure to go check out Convex components. There's often a data structure in there that will make the OCC problem go away.

So: contention, consistency, isolation levels. Why won't my database go faster? Why is Convex only letting me run for one second? Why can I only write one megabyte of data? What is an OCC error? Hopefully you now have a better sense of why we're surfacing that information to you, some things you can do about it, and how we are faced with fundamentally the same trade-offs as any database system. There's nothing slower or faster about Convex. This is just physics. There is a fundamental speed limit on consistent operations on a contended record. You can introduce staleness. You can willingly lose data, like Postgres does out of the box. Or you can speed up the operation itself, if possible. Pessimistic is theoretically more efficient, because you don't waste time computing stuff and recomputing it, but it's definitely more fragile as projects and teams get bigger. And it's very hard to reason about isolation levels other than serializable. So I am implicitly making an argument here that with Postgres out of the box it's actually very hard to build correct applications, and that if you're on Convex, you should use Convex components, and most of the time you won't have to think about these things.

Hopefully that was useful for you all. This is probably the point where I say like or subscribe. It's down there somewhere. Thanks for watching.

The Physics of Contended Writes

Every database has a speed limit on consistent operations against a single contended record. Convex caps each transaction at about 1 second of run time and 1 MB of written data, and it surfaces optimistic concurrency control (OCC) errors when transactions race on the same record. Postgres and MySQL face the same physics, but their default isolation level hides the problem by silently losing writes instead of erroring.

This guide walks through the trade-off between optimistic and pessimistic concurrency control, why Convex chose OCC, what the read committed isolation level actually does in Postgres, and the concrete patterns (staleness, aggregation, components) that let you scale past contention without lock-style workarounds.

What Record Contention Is and Why It Slows Databases Down

Record contention is what happens when multiple transactions want to read or write the same row at the same time. Without contention, every database looks fast. With contention, a queue forms, and how that queue is managed determines whether the system stays online under load.

The Embarrassingly Parallel Case Where Every Database Looks Like a Champ

Imagine a kingdom of nobles, each working their own plot of land and recording the harvest in their own ledger. Hundreds of nobles can write to hundreds of ledgers in parallel, because no two writers touch the same record. This is the embarrassingly parallel case, and any database (relational, document, key-value) handles it well. Throughput scales with hardware because there's nothing to coordinate.
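As a minimal sketch of the no-contention case, the Python snippet below has each worker deposit into its own ledger. It is an in-memory illustration, not database code; the `ledgers` dictionary, the `deposit` helper, and the amounts are assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: one ledger per noble, each starting at 0.
ledgers = {"Ashcroft": 0, "Waverly": 0, "Hartwell": 0}

def deposit(noble: str, amount: int) -> None:
    # Each worker touches only its own noble's entry, so the workers
    # never need to coordinate with each other.
    ledgers[noble] += amount

nobles = list(ledgers)
with ThreadPoolExecutor(max_workers=len(nobles)) as pool:
    # list() forces every deposit to finish before the pool shuts down.
    list(pool.map(lambda noble: deposit(noble, 10), nobles))

print(ledgers)  # {'Ashcroft': 10, 'Waverly': 10, 'Hartwell': 10}
```

Because no two workers share a record, adding more nobles only means adding more workers.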

Most benchmarks live in this world, so they tell you almost nothing about how a database behaves when the workload concentrates on a few records.

Tax Day When Everyone Wants the Same Record

Now imagine tax day. Every noble in the kingdom sends an accountant to the castle, and every accountant needs to update the king's single ledger. Suddenly there's one record and many writers, so a queue forms at the door. The king's ledger has become a single point of contention: no matter how many accountants there are, only one can touch the ledger at a time, so if one update takes n seconds, the ledger can absorb at most 1/n updates per second.

This is the part of the workload where database design choices stop being academic. The question is no longer "how parallel can we get?" but "how do we order and validate concurrent writes to the same record without losing data?" Every answer involves a trade-off, and that trade-off is the difference between pessimistic and optimistic concurrency control.
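To make "losing data" concrete, here is a small deterministic sketch of what happens when three writers read the same record, compute at their desks, and write back without any validation. It is a plain in-memory simulation rather than database code, and the starting balances and the 10-unit tax are illustrative assumptions.

```python
# Hypothetical starting balances for the castle example.
king = 1000
nobles = {"Ashcroft": 100, "Waverly": 100, "Hartwell": 100}
TAX = 10

# Step 1: all three accountants read the current values before anyone writes.
snapshots = {name: (king, balance) for name, balance in nobles.items()}

# Step 2: each accountant writes totals computed from its stale snapshot,
# without checking whether the king's ledger changed in the meantime.
for name, (king_seen, noble_seen) in snapshots.items():
    nobles[name] = noble_seen - TAX
    king = king_seen + TAX  # overwrites the previous accountant's write

collected = sum(100 - balance for balance in nobles.values())
print(king, collected)  # 1010 30
```

The nobles paid 30 in total, but the king's ledger only grew by 10. Both concurrency control strategies discussed here exist to prevent that outcome.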

Pessimistic Concurrency Control Locks First and Asks Questions Later

Pessimistic concurrency control (PCC) assumes conflicts will happen, so it prevents them by locking records before reading or writing. A transaction holds the lock for as long as it runs, which means other transactions wanting the same record must wait. In SQL this surfaces as SELECT FOR UPDATE, which takes row-level locks on the selected rows; in Postgres it does so even at the default read committed isolation level.

How Pessimistic Locking Works

Back at the castle, the rule under PCC is that an accountant who wants to update a ledger first takes physical possession of it. They walk to the king's chamber, pick up the ledger, carry it back to their desk, compute the new total, write the update, and only then return the ledger to its shelf. Anyone else who wanted that ledger waits outside the door.

This works. As long as transactions are short and well-behaved, locks are released quickly and the queue moves. The cost is paid in coordination, because every reader and writer is forced through a serial choke point at the contended record.
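The lock-then-update pattern can be sketched in a few lines of Python. Here a `threading.Lock` stands in for a row lock on the king's ledger; the `balances` dictionary and the `pay_tax` function are hypothetical names for this illustration, not any database's API.

```python
import threading

king_lock = threading.Lock()  # stands in for a row lock on the king's ledger
balances = {"king": 1000}

def pay_tax(amount: int) -> None:
    # Take the lock before reading and hold it until the write is done,
    # so no other accountant can change the ledger in between.
    with king_lock:
        current = balances["king"]            # read
        balances["king"] = current + amount   # compute and write

threads = [threading.Thread(target=pay_tax, args=(10,)) for _ in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(balances["king"])  # 1050
```

Every payment lands, but only because the five accountants took turns: the read, the arithmetic, and the write all happen inside the locked section.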

Figure: Pessimistic concurrency control

The Deadlock Problem

Pessimistic locking introduces a failure mode that doesn't exist under OCC: deadlocks. Accountant Ashcroft holds Hartwell's ledger and wants Ashcroft's; accountant Hartwell holds Ashcroft's ledger and wants Hartwell's. Neither can proceed, because each is waiting on a lock the other holds. The database has to detect the cycle and abort one of the transactions, which means the application has to handle the retry.

Deadlocks aren't exotic. Any transaction that touches two records in a different order than another concurrent transaction is a candidate. As schemas grow and code paths multiply, the combinations multiply with them.
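To make "detect the cycle" concrete, here is a minimal sketch of deadlock detection over a wait-for graph. It's an illustration, not how any particular database implements it: the WaitForGraph type, the findDeadlock function, and the transaction names are all made up for this example. Each transaction maps to the transactions whose locks it is waiting on, and a depth-first search looks for a path that loops back on itself.

```typescript
// Hypothetical model: each transaction id maps to the transactions it is waiting on.
type WaitForGraph = Record<string, string[]>;

// Returns the transactions forming a cycle if one exists, otherwise null.
function findDeadlock(graph: WaitForGraph): string[] | null {
  const onPath = new Set<string>();  // transactions on the current search path
  const cleared = new Set<string>(); // transactions already proven cycle-free
  const path: string[] = [];

  function visit(txn: string): string[] | null {
    if (cleared.has(txn)) return null;
    // Seeing a transaction that is already on the path means we looped back.
    if (onPath.has(txn)) return path.slice(path.indexOf(txn));
    onPath.add(txn);
    path.push(txn);
    for (const blocker of graph[txn] ?? []) {
      const cycle = visit(blocker);
      if (cycle) return cycle;
    }
    path.pop();
    onPath.delete(txn);
    cleared.add(txn);
    return null;
  }

  for (const txn of Object.keys(graph)) {
    const cycle = visit(txn);
    if (cycle) return cycle;
  }
  return null;
}

// Each accountant waits on a lock the other holds: a deadlock.
const deadlockCycle = findDeadlock({
  ashcroftTxn: ["hartwellTxn"],
  hartwellTxn: ["ashcroftTxn"],
});

// Only one of them is waiting: just a queue, no deadlock.
const healthyCycle = findDeadlock({
  ashcroftTxn: ["hartwellTxn"],
  hartwellTxn: [],
});
```

Once a cycle is found, the database picks one transaction in it to abort, and that abort is what your application sees as a deadlock error to retry.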

Why Pessimistic Systems Get More Fragile as They Scale

The deeper problem with PCC is blast radius. Picture Duke Cashington's junior accountant, who walks into the castle, picks up a busy ledger, and then sits down to do something slow. Maybe they're computing a complicated tax. Maybe they got distracted. Maybe their process hung. While they hold that lock, every other accountant waiting on that ledger is also stuck, and any transaction that needs a second lock those waiters already hold is stuck behind them.

A single slow or stuck transaction can freeze a meaningful slice of the system. In practice, a significant share of early-growth-stage database incidents trace back to one query holding a lock too long. The system was fine until one transaction misbehaved, and then everything that touched the same hot record went down with it.
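A toy model shows how far one slow lock holder reaches. This sketch assumes a single hot record whose lock is handed out strictly in arrival order; the waitTimes function and the millisecond figures are invented for illustration, not measurements.

```typescript
// Each entry is how long (in ms) one transaction holds the lock on the hot record.
// Returns how long each transaction waited in the queue before getting the lock.
function waitTimes(holdMs: number[]): number[] {
  const waits: number[] = [];
  let elapsed = 0;
  for (const hold of holdMs) {
    waits.push(elapsed); // everything ahead of us in the queue must finish first
    elapsed += hold;
  }
  return waits;
}

const normalDay = waitTimes([10, 10, 10, 10]);       // [0, 10, 20, 30]
const oneSlowHolder = waitTimes([10, 5000, 10, 10]); // [0, 10, 5010, 5020]
```

The slow transaction itself waited only 10ms. It's everyone behind it who pays the five seconds, and that is the blast radius.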

PCC has been studied since the 1970s, and the trade-offs were spelled out clearly in the foundational OCC paper by Kung and Robinson in 1981, "On Optimistic Methods for Concurrency Control" (ACM TODS). That paper proposed a different model entirely.

Optimistic Concurrency Control Reads Computes Validates and Commits

Optimistic concurrency control (OCC) assumes conflicts will be rare, so it lets transactions run without locks and validates at commit time. If two transactions touched the same record and one already committed, the second one fails and the application retries. The classical model has three phases:

read
validate
write

How OCC Works in Practice

Back at the castle under OCC, no accountant carries a ledger anywhere. Instead, they walk up to a window, read the current value of the ledger, and go back to their desk to compute. When they're ready to commit, they return to the window and say "I read version 47 and want to write version 48." If the ledger is still at version 47, the write succeeds. If someone else got there first and it's now at version 48, the accountant's transaction is rejected and they start over.

This is what an OCC error is at the database level: a transaction lost the race to commit against another transaction that touched the same record. No data is lost. And because there are no locks, no transaction ever waits on another, so deadlock is impossible.
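The window exchange can be sketched as a version check at commit time. This is a simplified in-memory model, not Convex's implementation: the VersionedLedger class and its method names are hypothetical, and a real system validates a whole read set rather than one record.

```typescript
interface Snapshot {
  version: number;
  balance: number;
}

// Hypothetical single-record ledger that starts at version 47.
class VersionedLedger {
  private version = 47;
  private balance = 0;

  // Read phase: take a snapshot without locking anything.
  read(): Snapshot {
    return { version: this.version, balance: this.balance };
  }

  // Validate and write phases: commit only if nobody committed since our read.
  commit(readVersion: number, newBalance: number): boolean {
    if (readVersion !== this.version) return false; // lost the race
    this.balance = newBalance;
    this.version += 1;
    return true;
  }
}

const ledger = new VersionedLedger();
const first = ledger.read();
const second = ledger.read(); // both accountants read version 47

const firstCommitted = ledger.commit(first.version, first.balance + 10);    // true
const secondCommitted = ledger.commit(second.version, second.balance + 10); // false

// The loser starts over: re-read, recompute, commit against the new version.
const retry = ledger.read();
const retryCommitted = ledger.commit(retry.version, retry.balance + 10); // true
const finalState = ledger.read(); // version 49, balance 20
```

Both deposits end up in the ledger. The second accountant did some wasted work, but nothing was overwritten and nobody waited on a lock.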

Why OCC Wins in Real Systems Even Though It Looks Less Efficient on Paper

On paper, PCC is more efficient under high contention because OCC has to do work it might throw away. In practice, OCC tends to keep systems online as they scale, and the reason is back-pressure. The database is the scarce stateful resource, whereas application servers are stateless and scale horizontally. Under OCC, when contention spikes, the failed transactions bounce back to the application tier, which can retry, queue, or shed load. Under PCC, the contention manifests as locks held inside the database itself, which is the one place you can't easily add capacity.

A stuck OCC worker doesn't block anyone, because it doesn't hold a lock. A stuck PCC worker can freeze every transaction that touches the same record, and every transaction queued behind those. The compositional power of OCC, which keeps failure local to the transaction that lost the race, is why Convex chose it.

Figure: Optimistic concurrency control

Which Systems Use OCC

OCC is standard in modern distributed databases. Convex uses it, as do FoundationDB and TiKV. Postgres can be configured for serializable isolation, which gives you OCC-style behavior, but it's rarely used in production because it's awkward to opt into and the default isolation level papers over the problem in a different way (more on that below).

Optimistic and Pessimistic Concurrency Control Side by Side

Dimension	Optimistic (OCC)	Pessimistic (PCC)
Conflict assumption	Conflicts are rare	Conflicts are likely
When conflict is detected	At commit time	At read or write time
Locking behavior	No locks held during transaction	Locks held for duration of transaction
Performance under low contention	Excellent, no lock overhead	Good, but lock acquisition has cost
Performance under high contention	Higher abort/retry rates	Higher queue/wait times
Deadlock risk	None	Real, requires detection and resolution
Retry behavior	Application retries on conflict	Database queues; some retries on deadlock abort
Failure-mode blast radius	Local to the failed transaction	Can cascade across all transactions touching the locked record

The table makes the trade-off visible, but the practical answer is shaped less by raw throughput and more by what happens when something goes wrong. Under PCC, a single slow transaction can take the system down. Under OCC, a single slow transaction loses its own race and the rest of the system keeps moving.

Figure: Pessimistic and optimistic locking

What Postgres Actually Does by Default and Why It Is Worse Than You Think

Postgres defaults to the read committed isolation level, not serializable. Under read committed, two SELECT statements inside the same transaction can read from two different committed snapshots of the database, which means a transaction can silently lose updates without any error. Most developers assume Postgres protects them from this. It doesn't, unless you explicitly opt in to stricter isolation or use SELECT FOR UPDATE.

Read Committed Is Not Snapshot Isolation

Read committed guarantees only that you won't see uncommitted writes from other transactions. It doesn't guarantee that the data you read at the start of your transaction will still be there at the end, and it doesn't guarantee that two reads of the same row inside one transaction will return the same value. This is looser than snapshot isolation, looser than repeatable read, and dramatically looser than serializable.

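MySQL's InnoDB engine defaults to repeatable read rather than read committed, but a plain SELECT there is a non-locking snapshot read, so the same read-then-write pattern can still lose updates. Both databases ship with a default looser than serializable because it's faster and because most workloads don't hit the failure mode often enough to notice.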

The Lost Update Problem in Action

Here's the scenario that should be more widely known:

Three accountants each want to credit $1,010 to the king's account, which currently holds $0.
Under read committed, all three transactions can read the balance as $0 at roughly the same time.
Each computes the new balance as $0 + $1,010 = $1,010.
Each writes $1,010.

The transactions all commit successfully with no error. The kingdom received $3,030 in tax revenue but the ledger shows $1,010. Two thirds of the day's collections just disappeared. There's no exception, no log line, no retry. The data is gone.
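The arithmetic is easy to reproduce. The sketch below is an in-memory model of the interleaving, not a Postgres client: blindWrites and serialWrites are made-up names, and the first function forces the worst case where every transaction reads before any of them writes.

```typescript
const DEPOSIT = 1010;

// Unsafe read-modify-write: all transactions read first, then each writes
// "what I read + deposit", overwriting whatever the others wrote.
function blindWrites(start: number, writers: number): number {
  let balance = start;
  const reads: number[] = [];
  for (let i = 0; i < writers; i++) reads.push(balance); // everyone sees the same balance
  for (const seen of reads) balance = seen + DEPOSIT;    // last write wins
  return balance;
}

// The same deposits applied one at a time, each seeing the previous write.
function serialWrites(start: number, writers: number): number {
  let balance = start;
  for (let i = 0; i < writers; i++) balance = balance + DEPOSIT;
  return balance;
}

const recorded = blindWrites(0, 3);   // 1010
const collected = serialWrites(0, 3); // 3030
const lost = collected - recorded;    // 2020
```

Nothing in blindWrites throws or returns an error, which is exactly the problem: the only evidence is a balance that's $2,020 short.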

This is the lost update problem, and it's the default behavior of the two most popular open-source relational databases. Most application code doesn't hit it because real conflict rates are low, but when it does hit, the failure is silent.

Fixing It With SELECT FOR UPDATE and the Trade-Offs

The standard fix in Postgres is SELECT FOR UPDATE, which takes a row-level lock for the duration of the transaction. That fixes the lost update problem by opting you back into pessimistic locking, with everything that implies:

Queue waits
Deadlock risk
The lock-hold blast radius described earlier

You can also set the transaction isolation level to serializable, which uses OCC-style validation, but the ergonomics are awkward and most application frameworks don't default to it.
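If you do opt into serializable, the application has to retry transactions that Postgres rejects with SQLSTATE 40001 (serialization_failure). Here's a minimal sketch of that retry loop. The retryOnSerializationFailure helper is hypothetical, it assumes the driver exposes the SQLSTATE as a code property on the thrown error, and it's synchronous to keep the example short; a real client would be async and would re-run the whole transaction body.

```typescript
// SQLSTATE that Postgres reports for a serialization failure.
const SERIALIZATION_FAILURE = "40001";

// Runs `attempt` until it succeeds, retrying only on serialization failures
// and giving up after `maxAttempts` tries. Any other error is rethrown as-is.
function retryOnSerializationFailure<T>(attempt: () => T, maxAttempts = 5): T {
  for (let tries = 1; ; tries++) {
    try {
      return attempt();
    } catch (err) {
      const code = (err as { code?: string } | null)?.code;
      if (code !== SERIALIZATION_FAILURE || tries >= maxAttempts) throw err;
    }
  }
}

// Stand-in for a transaction that loses the race twice before committing.
let calls = 0;
const outcome = retryOnSerializationFailure(() => {
  calls++;
  if (calls < 3) {
    throw Object.assign(new Error("could not serialize access"), {
      code: SERIALIZATION_FAILURE,
    });
  }
  return "committed";
});
```

The cap on attempts matters. Under heavy contention an unbounded retry loop just moves the pile-up from the database into your application.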

Most teams running Postgres in production have never audited their code for read-committed lost updates, and most of them are fine. The failure rate is low. But "low failure rate with silent data loss" is a different risk profile than "visible error you can retry," so the choice between them should be deliberate.

Figure: Postgres read-committed behavior

Why Convex Surfaces OCC Errors and Transaction Limits

Convex caps transactions at 1 second of execution time and 1MB of read/write data, and it surfaces OCC errors when your transaction races another write to the same record. These limits are the database honestly telling you that a transaction has hit a physical constraint that every database has, so you can address it at the application layer instead of discovering it as a production incident.

The 1-Second and 1MB Caps Are Honest Not Arbitrary

Any transaction that runs for a long time, in any database, is paying a cost. Under PCC it holds a lock for that whole duration, which means every other transaction wanting the same record is queued behind it. Under OCC it accumulates more chances to lose the commit race, so longer transactions on hot records have higher abort rates by construction. Convex bounds the problem at the front end by capping how long and how large a transaction can be, which prevents a single misbehaving query from cascading into a system-wide slowdown.

You can read the precise wording in the Convex transaction limits documentation, but the design intent is the same: enforce the constraint early so it can't hide. A 30-second transaction in a database that allows them isn't a feature; it's a future incident.

What an OCC Error Actually Means

An OCC error in Convex means your transaction read a record, did some work, and then tried to commit, but another transaction committed a write to the same record first. Your transaction is aborted and your client (or the Convex runtime) can retry. The data is consistent. Nothing was lost. The error is the database telling you "this record is contended; the work you did is stale."

Diagnosis is straightforward. Identify the record that multiple transactions are writing to. The treatment is everything we discuss in the next section. For the full error model and remediation patterns, the Convex documentation on OCC write conflicts covers the diagnostic flow, and the design rationale here is similar in spirit to the one explained in why Convex omits .select() and .count(), which surfaces opinionated constraints honestly rather than hiding them.

How to Speed Up Contended Transactions

The fix for a contended transaction is almost never "make the database lock harder." It's one of three patterns:

Shorten the critical section
Introduce staleness through aggregation
Use a pre-built component that handles the pattern for you

Figure: Three contention remedies

Option One Make the Operation Faster

The cheapest fix is to do less work inside the transaction. Read fewer rows, write fewer rows, move computation out of the transaction where possible. A transaction that takes 50ms loses far fewer OCC races than one that takes 500ms, and it holds far fewer locks for far less time if you're on PCC. Most contention problems are partially solved by shrinking the critical section before you reach for anything more sophisticated.
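To put rough numbers on the 50ms versus 500ms comparison, here's a back-of-the-envelope model. It assumes competing writes hit the record as a Poisson process, which is a simplification; the conflictProbability function and the rate of 2 writes per second are picked purely for illustration.

```typescript
// Chance that at least one competing write lands on the record while our
// transaction is in flight. Assumption: competing writes arrive as a Poisson
// process at `writesPerSecond`.
function conflictProbability(writesPerSecond: number, txnSeconds: number): number {
  return 1 - Math.exp(-writesPerSecond * txnSeconds);
}

const fastTxn = conflictProbability(2, 0.05); // 50ms transaction: about 0.095
const slowTxn = conflictProbability(2, 0.5);  // 500ms transaction: about 0.632
```

Under those assumptions the short transaction loses roughly one race in ten, while the long one loses closer to two in three, on exactly the same record with exactly the same write traffic.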

Look at what's inside the transaction. Network calls and slow third-party APIs should never be there. Heavy computation that doesn't need the transactional view of the data should be moved out, run separately, and only the final write should re-enter the transaction. The smaller the critical section, the less surface area you have for races.

Option Two Introduce Staleness Through Aggregation

The deeper fix is aggregation. Back at the castle, instead of every accountant writing directly to the king's ledger, you appoint a tax collector for each region. The accountants write to their regional collector. The collectors batch the totals and write to the king on some interval. The king's ledger now sees one write per region per minute instead of one write per accountant per second.

The leaves of the tree are eventually consistent, but the root is correct. You've traded some staleness for a dramatic reduction in contention at the hot record, and most read paths can tolerate the staleness because they read the rolled-up total rather than the live stream of writes. This pattern works in any database. It's the underlying mechanism behind sharded counters, leaderboard buckets, and most "high write throughput" systems you have seen described as scaling miracles.

The mental shift is admitting that you don't actually need the absolute latest value at the hot record for most reads. You need a value that's correct as of some recent moment. Once you accept that, the design space opens up considerably.
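Here's a minimal in-memory sketch of the regional-collector idea as a sharded counter. It's not the Convex component's API: the ShardedCounter class and its methods are hypothetical, and in a real database each shard would be its own record so that writers to different shards never conflict.

```typescript
// Hypothetical sharded counter: shards play the regional collectors,
// `rolledUp` plays the king's ledger.
class ShardedCounter {
  private shards: number[];
  private rolledUp = 0;

  constructor(shardCount: number) {
    this.shards = new Array<number>(shardCount).fill(0);
  }

  // Writers only touch their own shard, so they rarely contend with each other.
  add(writerId: number, amount: number): void {
    this.shards[writerId % this.shards.length] += amount;
  }

  // Periodic job: fold every shard into the root total, one write to the hot record.
  rollUp(): void {
    for (let i = 0; i < this.shards.length; i++) {
      this.rolledUp += this.shards[i];
      this.shards[i] = 0;
    }
  }

  // Cheap read of the root. Correct as of the last roll-up, possibly stale.
  staleTotal(): number {
    return this.rolledUp;
  }

  // Exact read: the root plus whatever is still sitting in the shards.
  exactTotal(): number {
    return this.shards.reduce((sum, shard) => sum + shard, this.rolledUp);
  }
}

const counter = new ShardedCounter(3);
for (let accountant = 0; accountant < 6; accountant++) {
  counter.add(accountant, 10);
}
const staleBefore = counter.staleTotal(); // 0: nothing rolled up yet
const exactBefore = counter.exactTotal(); // 60
counter.rollUp();
const staleAfter = counter.staleTotal();  // 60
```

Six writes landed without any of them touching the root, and the root caught up in a single write. How stale you let staleTotal get is just the roll-up interval.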

Option Three Use Convex Components

Many of these patterns are already implemented as Convex Components so you don't have to build them yourself. Sharded counters, aggregations, rate limiters, and similar building blocks ship as composable modules. If your application is hitting OCC errors on a hot record, the right move is usually to swap the direct write for a component that already handles the staleness/aggregation pattern correctly, rather than inventing your own.

Building these primitives from scratch is harder than it looks. A naive sharded counter is easy, but one that handles compaction, rebalancing, and read-time aggregation efficiently isn't. Reaching for a component that already encodes those decisions is usually the right call.

When to Choose Optimistic and When to Choose Pessimistic Concurrency Control

The choice comes down to your contention profile and what failure mode you can tolerate.

If your workload is read-heavy with rare write contention, use OCC. The overhead of locks is wasted and the retry rate under OCC will be negligible. This describes the majority of application workloads.

If you have predictable high-contention hotspots and very short critical sections, PCC can be more efficient on paper, but watch for deadlocks and the blast radius of any single slow transaction. In practice, OCC plus aggregation almost always beats raw PCC for the same workload, because aggregation removes the contention rather than serializing through it.

If you're on Convex and hitting OCC errors, don't look for lock-style workarounds. Shorten the transaction, introduce aggregation, or adopt a pre-built Convex component that already solves the pattern. Lock-style fixes aren't available in Convex by design, because the same fragility patterns that pushed the industry toward OCC in distributed systems apply here.

As a rough heuristic from the literature, OCC outperforms PCC when conflict rates are below roughly 10 to 20 percent, and PCC wins above that. Real-world operational fragility shifts the practical answer further toward OCC, because the failure mode of OCC (visible retry) is easier to recover from than the failure mode of PCC (cascading lock waits).

Frequently Asked Questions

Q: What is the difference between optimistic and pessimistic concurrency control? A: Pessimistic concurrency control (PCC) locks records before reading or writing them, so conflicting transactions wait in a queue. Optimistic concurrency control (OCC) lets transactions run without locks and validates at commit time, aborting the loser of any race. PCC trades latency for guaranteed serialization, whereas OCC trades occasional retries for a database that holds no locks and a failure mode that stays local to each transaction.

Q: Why does Convex limit transactions to 1 second and 1MB? A: Long transactions are expensive in every database. They hold locks longer under PCC or lose more commit races under OCC. Convex bounds transaction length and size so a single misbehaving query can't cascade into a system-wide slowdown. The limits surface a constraint that every database has; most just hide it until production.

Q: Does Postgres really lose data by default? A: Under the default read committed isolation level, Postgres can silently lose updates when two transactions read and modify the same row concurrently. The classic case is two transactions both reading a balance, both computing a new value, and both writing it, so one write overwrites the other with no error. The fix is SELECT FOR UPDATE or serializable isolation, but neither is the default.

Q: What causes an OCC error and how do I fix it? A: An OCC error means your transaction read a record, did some work, and then tried to commit, but another transaction wrote to the same record first. Your transaction is aborted and retried. The fix is to reduce contention on the hot record: shorten the transaction, introduce aggregation so the hot record sees fewer writes, or use a Convex component that handles the pattern.

Q: When should I use serializable isolation in Postgres? A: Use it when you need strict correctness guarantees on transactions that read and then modify the same data, and when the application can handle serialization-failure retries. It's the right choice for financial logic, inventory, and anywhere a lost update would be a bug. The cost is occasional retry overhead, which is the same trade-off OCC makes everywhere.

Q: How do I handle high-contention records in Convex? A: Identify the hot record, then apply one of three patterns: make the transaction faster (fewer reads and writes), introduce staleness through an aggregation tree (sharded counters or rolled-up totals), or adopt a Convex component that already implements the pattern. Lock-style workarounds aren't the right shape in an OCC system.

Putting Concurrency Control to Work in Your Convex App

Summary

Concurrency control is one of the few places where a database's design choices are visible in your application code. Convex's 1-second and 1MB caps plus its OCC error model are the database’s refusal to hide a constraint that exists in every system. If you understand the contention profile of your workload, the right pattern is usually obvious:

Shorten the critical section
Aggregate through staleness
Compose a pre-built component that already does it

Teams that read an OCC error as a signal (this record is contended and the application needs an aggregation layer) get further, faster, than teams that treat it as a bug to engineer around. The error is information. It's telling you that the shape of the workload has changed and the data model needs to change with it. Acting on that information early is cheaper than acting on it during an outage, which is the alternative path most databases offer by silently absorbing the contention until it breaks.

If you're hitting OCC errors today, the most effective move is to explore the Convex Components library and find the one that matches your contention pattern, rather than trying to engineer a lock-style workaround that doesn't fit the system. The component approach also pays compounding dividends, because the same aggregation primitives that solve your current hot record will solve the next one, and the one after that, without each team having to rebuild the pattern from scratch.

The summary is short. Every database has a speed limit on contended writes, because physics. PCC hides that limit behind a queue and risks cascading failure when one transaction misbehaves. The Postgres default hides it behind silent lost updates that most teams never audit for. Convex surfaces it as an OCC error you can see, retry, and design around. Given the choice between a visible constraint and a hidden one, the visible constraint is the one you can engineer against, and the patterns to engineer against it (shorter transactions, aggregation, components) are well-understood, portable, and available today.
