What You Missed in 2024 in the World of Databases, with Andy Pavlo
[Music] Good day, and welcome back to the Databases podcast. I'm James Cowling, CTO and co-founder of Convex, and today we have a very special guest joining us. He is an esteemed CMU databases professor, the author of the Ten Database Crack Commandments, the beneficiary of the Steven Moy Foundation for Keeping It Real, and a staunch proponent of living a database-centric lifestyle: he is, of course, the great Andy Pavlo. James, you don't introduce yourself as Dr. James anymore? That's disappointing. You know what, in Silicon Valley you keep that on the down low. Technically, technically, Dr. James, yes. Hey, thanks for having me, it's good to see you again. It's good to see you. Now Andy, I was going to direct the audience to read more about you on Wikipedia, because you do have a lot of awards. You got the Dijkstra Fellowship recently, congratulations. No, no, hold on, that wasn't me — that was Marcin from Snowflake. Oh, so you didn't get that? They flew me out to talk about them. We're not going to edit this, we're going to push it. That's fine. But you are an erudite gentleman with a lot of accolades, and sadly the Wikipedia overlords have conspired to remove you from their pages. Now, if someone wants to learn more about you, or more importantly wants to learn more about databases, and is looking for the best and most fun resource on the internet to learn about databases in general, where should we point them? As for the Wikipedia article, that's been a long, multi-year battle. Someone originally wrote the article, and I think they copied and pasted my bio from somewhere that said I was born on the streets of Baltimore, which is sort of true, but not in the way you may think. So that got flagged, and then it got taken down this year for citations. But whatever — I find it amusing more than anything else. Now, in terms of database material: as your listeners or viewers
might know, we try to put everything we do at Carnegie Mellon University online for our database group. Everything we do in courses is public on YouTube, including all the materials to do the homeworks and projects as well. We've been doing that for several years now, and it's pretty exciting to see how many people have actually gone through it and then emailed me like, hey, I got a brand new database job because of your course. And the best part is they didn't pay CMU any money, even though that pays my salary. I think it's fantastic. So hey, companies out there: give the CMU database group some money. But I mean, I have personally watched some episodes of the history-of-databases series, and ostensibly I know this stuff — I'm meant to know this stuff already — but I learned things, and it's just entertaining. There are people in the YouTube comments who are effusive; everyone's like, wow, this is great content, thanks for putting it together. Yes. The whole YouTube thing started because when I started at Carnegie Mellon I was the only database systems professor there. The one prior to me was Natassa Ailamaki, and she left for Switzerland in like 2007, so I showed up in 2013 and it was just me. There was another professor there, but he's more graph mining, data mining. So from my perspective I'm like, okay, I've got to get tenure, and I'm sort of competing against MIT, Stanford, Berkeley, and all those great people there. What do I do to put myself out there? All right, I'll just take everything I do and put it on the internet. It served a couple of benefits. One is it makes me be more professional — I say fewer crazy things in the lectures, so what you're seeing on YouTube is the cleaned-up version of Andy. I also get people sending me emails when I make mistakes, saying, hey, you misspoke here, this is wrong. Or sometimes we talk about papers, and the guy who wrote the paper in the '90s is
like, oh no, I was in the room with Phil Bernstein or Jim Gray, here's what really happened. So that's been awesome. And like I said, not everyone can go to CMU — CMU is expensive, and it's very, very hard to get into — so, all right, if you can't go to CMU, here's a CMU experience for free, go at it. And so we'll put a link in the show notes to some of Andy's database series. Again, I can't recommend it more: if you want to learn about databases and the theory in a very substantive way, but in an entertaining and fun way, check this stuff out. But Andy, we brought you on to talk about your retrospective on 2024 in the world of databases. You have an annual year-end review blog post — we'll link to that as well — but obviously we're big database people on the podcast, big database people at Convex. A large part of our core demographic may have been in a coma for 2024 and needs to ramp up before they enter 2025; they need to know what happened in the world of databases in 2024. And you've got this great blog post, so we're basically going to go through the sections and get your take on the big happenings in the world of databases. Part of the reason I do this is because there's so much going on in databases — not just in academia, but obviously in industry too — and there's so much happening that you get overwhelmed. It's good to take a step back, like, okay, what really happened in the last year, what actually matters. And it's not things like, oh, so-and-so put out this new release — unless it's something amazing, something groundbreaking. A new version of Postgres is cool and all, but I don't consider that a major event. It's trying to keep track of the major things that happened, and to get an overall view of the trajectory of where the field and the industry are going. Absolutely. So what things happened in 2024 that
might affect how things happen in 2025? And on that topic we can start with licensing. So, open-source licensing isn't necessarily the sexiest topic, but it's something a lot of companies debate — we certainly debated at Convex how much to be afraid of open source. It's very easy to be an idealist about it — I'm one of those people — and say open source is great, everything's wonderful, but there is a dark side to open source for companies, because sometimes they get their ideas taken, and sometimes they get kind of eaten up by big companies like AWS. So what are your reflections on the life of an open-source database company in 2024, especially companies like Redis and Elasticsearch, who came in for a lot of heat last year for changing their licenses? Yeah. So, just to summarize what you said: there's been a proliferation of open-source databases over the last 15 years or so — maybe a little earlier, but really since the NoSQL stuff became big in the late 2000s. And that's fantastic; there are all these great database choices out there. The problem, of course, is how do you make money off that? How do you make it sustainable, so that a business can actually continue building the thing people want to use and pay for? The cloud has made this super challenging, because now you have these behemoth cloud companies — the Amazons, the Microsofts, the Googles, to some extent Oracle — that can take popular open-source software under the right license, slap it up as a web service, start selling it to all their customers, and end up making more money than the company that is actually paying for the development of the thing. And so around 2016, '17, '18, you started seeing open-source database systems that were backed by for-profit companies switch their licensing model to a more restrictive approach that specifically calls out and says you can't take this open-source software and
resell it as a cloud service. One of the first ones that did this, I think, was MongoDB, when they switched from, I believe, the AGPL to the Server Side Public License, which again basically says it's open source but you can't resell it as a cloud service. I think MariaDB did a similar thing with the Business Source License around the same time, and Timescale did the same thing — we've seen this pattern quite often. And so in the last year, the big license change that drew a lot of interest was Redis. I think it was March 2024: Redis announced that they were switching from what was originally BSD-3 — an extremely permissive license, you can basically do anything — to a combination of a sort of proprietary license and the MongoDB-style server-side license. Yeah, and within days of them doing this, everyone was like, oh, this is terrible, how can they do this? There was a lot of public backlash, and then within a couple of days there were major efforts to fork the Postgres — sorry, fork the Redis code at the commit right before they switched the license, and start brand-new projects. Valkey is probably the biggest one out of this, which I think started with somebody at Amazon, and then the consortium grew quite large in a short amount of time, including Oracle, Google, and I'm sure a bunch of others I'm forgetting. And the Valkey project is compatible with Redis as of the time of the fork, but it's continuing on in a different branch — it's a hard fork under the BSD-3 license. And so there were a couple of other license changes, but the Redis one is really interesting, because when MongoDB, for example, changed their license from one that was more permissive to the Server Side Public License, nobody was
like, oh my God — I mean, there was backlash in some ways, but no one really took the effort to fork it, to do a hard fork the way people have done with Redis, and certainly a bunch of companies didn't get behind a major fork of MongoDB and start developing it completely separately. So that one's been really fascinating, because like I said, there have been other changes, but this is the one people have been most offended by, so it's interesting to understand what happened. Yeah, I mean, you can take two takes on this. You could say, well, hey, they did the work, so they need money — not to be rich people on boats; maybe they need the money to pay the software engineers who make their database — and they're worried about the big players eating them up. Fair enough, right? But they didn't do all the work, right? Yeah, there we go, that's where you're going, yes. A couple of posts came out that looked at the commit history on the Redis main branch. There's this company — Redis Inc., Redis Ltd. — that's the major company backing Redis. But when you look at the commit history, it's a bunch of other companies that have contributed, technically more than Redis the company. For some of these commits you can't determine who actually did the work, but from one analysis I think Alibaba was number one. So it kind of looks like Redis the company is trying to monetize and take credit for a bunch of contributions that they didn't actually produce themselves. And they're legally able to do this, presumably given the terms of their licensing; it's just that the community is not happy with it. Correct, yes — I think you turn over the copyright of the work when you get it committed and pushed into the mainline branch. Yeah. The lesson for me, as the author of a database that is
open source, or at least largely open source — we're going to open-source more of Convex soon — is that I think we didn't do it quite right. So Convex is FSL-licensed; the Functional Source License is very similar to the Business Source License, just a little simpler. That license says you can mostly do anything you want with the source code, except you can't directly compete with us for two years. So, as of a certain commit, for two years you can't go host that on AWS and resell it. Why two years? I think the BSL was four years. Yes, why two years, not four years? I mean, our reasoning was that if we can't innovate enough in two years, then perhaps we don't deserve to be an innovative company — we wanted to have as limited restrictions as possible on the codebase, and we think we can do enough work in two years that Convex is still an interesting platform. Now, there's a thing we didn't do right — I haven't talked about this publicly yet. We've only open-sourced some of our codebase, and I don't know if I'd actually recommend running open-source Convex in production right now. Soon you absolutely should; we're going to just put it all out there. But the challenge was we were doing this at a time when I didn't know the right license, and you have to get these answers right very early on. There are companies that got the license right, and everyone knows where they stand early on, but it's quite hard to come back and put the genie back in the bottle if everyone thinks you're a community asset. We saw this with WordPress — not the databases world, but it's very hard to say, hey, we're a community member — oh wait a second, we want to pull back — and that's where you see this big backlash. The WordPress one is a bit more complicated too, because there he was going after one of their specific competitors, right? Redis didn't go
that far, although I will say one thing they did that was unique — I've never seen it before — is that they seem to be trying to consolidate control over all the extensions for Redis. We have a separate line of research where we examined the extensibility of a bunch of database systems, and the Redis one is very unique, because it is seemingly very permissive in what you can build as plugins you can add to Redis, but it's always done as a layer above the core storage. They have basically sunsetted a bunch of these different modules or plugins you can add to Redis, and for other ones, if the name of the module was redis-hyphen-something, then supposedly — the rumors are — they've been paying off the authors of those extensions to get control and put them back under the Redis GitHub organization. Yeah. Now, one question for you, Andy: if you were starting a database company right now, what would you do? What are your observations — how would you license it? Would it be open source? I would say it depends on what the target market for your database system is. If you're trying to replace Postgres as a sort of general-purpose operational database system, then yeah, you kind of need to be open source, and I think going with something like what you guys at Convex did — two years is very generous, but a four-year automatic conversion to Apache, like in the BSL — would probably be the way to go. Again, of course, it depends: if you're saying, oh, I'm going to raise some money and then build it, then yeah, you should start with that immediately. The challenge, of course, is if you're trying to do it organically, and it wasn't a business but it becomes a business — how do you make that switch? It's probably better to do it sooner rather than later. Yeah. The lesson I have, as a
startup founder, is that in tech we can sometimes think earning money is a bit dirty, right? But almost all open-source software is primarily written by people with real jobs, who have real salaries, who go home and provide for their families, so there has to be some way of making money off these services. Convex is in a good place, because our target market is people who want a hosted platform, so open-sourcing everything doesn't really erode our market position — most people don't want to self-host; mostly they'd rather we host it. But I think it's a tricky position for folks selling the thing itself: they need to be inclusive and build trust by being open, but they want to make sure they don't get eaten by a big player. Yes. So, speaking of big players, let's move on to the battle of the OLAP giants. So, you know, I'm an OLTP guy — online transaction processing, low-latency, live, site-facing databases, that's my kind of thing — but I have to say, most of the excitement and innovation and growth recently has been in the OLAP space, the analytics space. It's just out of control, and these two giant companies have emerged. I mean, they're not both public — well, Databricks is not public, I'm not sure about Snowflake — but they count as giant companies, right? Databricks and Snowflake. I've heard them say, oh, we're still a startup. No way. On paper, maybe, because they're not public, but no — they're huge companies, $60 billion or whatever private companies. I believe so. And so these are the big names, and we've all kind of accepted to some extent over the past few years that these are the big dogs — I know there are obviously a lot of players in the space, but
there's a bit of a turf war going on, shall we say. Yes. The background history is that Databricks and Snowflake have been sniping at each other for several years now. I think two years ago Databricks came out with TPC-DS numbers — TPC-DS is a standard benchmark you use to measure the performance of a database system doing analytics. Databricks had a big announcement, part of rolling out their new Photon query engine: oh, look how much faster we are than Snowflake. And they wrote a blog article about how they got certified. So, again, for background: the 1980s were basically the Wild West in databases, and every single vendor would put out their own numbers — look how much faster we are than everyone else — but they would make up their own benchmarks and kind of game the system a bit. So a consortium was set up by a bunch of companies, spearheaded by Jim Gray, who was one of the godfathers of databases — behind a bunch of stuff we still use today, one of the Turing Award winners in databases in the '90s. He set up this thing called the Transaction Processing Performance Council, which is meant to be an independent third-party arbiter of these benchmark results. Basically, how it works is, if you want to report official numbers, you give the TPC people money, they have an auditor come out and vet your setup, so you can prove you're running things correctly and fairly. That was a big thing people did in the '90s and 2000s, a little less so since; more recently it's kind of fallen by the wayside, where people don't really go through the TPC process anymore. You still use the benchmarks, but you don't go get them certified as official results. Yeah, when I was in grad school, TPC-C was still huge. I might have even used your code, I don't know — might have
used your TPC-C implementation. Well, I've written it four times, so yes. So — Andy and I go back quite a long way — we always felt a little bit dirty doing it, because TPC-C was the standard benchmark for live-site, transactional databases — like running an online store, almost. But you could cheat so much, and everyone would kind of tweak the benchmark, make it parallel, because it was such an old benchmark that wasn't necessarily that relevant to modern times. So everyone would be like, well, let's make it a bit more modern. Yeah, I don't want to name names, but basically for a lot of the major database vendors, when they see queries show up that look exactly like a TPC-C query, there are optimizations they do that a query normally would not get. For example, instead of using a B+tree for the warehouse table, you use a fixed-size array, because you know it's always going to be the same size. There are tricks like that, and that's what the auditor is supposed to do: prove that you're not gaming it that way. So, going back to Databricks: a few years ago, for marketing purposes, they got official, audited, vetted TPC-DS numbers and showed that they were much faster than Snowflake. And to give your listeners a sense of how outdated the model kind of is: they had to pull somebody out of retirement to come do the audit, because nobody around could even do it anymore — they found some old guy to come help out. So they put out this blog article. Snowflake didn't take kindly to the comparison, so they ran their own — not official — version of TPC-DS to show that, oh no, in fact they are faster than Databricks. And then Databricks responded with their own blog article saying, no, no, you did it wrong, here's what the real numbers are, we're much faster.
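The fixed-size-array trick Andy describes can be sketched in a few lines. This is a toy illustration, not any vendor's actual code: in TPC-C the warehouse table has a fixed, known cardinality for a given scale factor, so an engine that detects the workload can swap a general-purpose index for a dense array and get O(1) lookups with no comparisons or tree traversal.

```python
# Toy sketch of the TPC-C "warehouse table" optimization described above.
# A general-purpose index must handle arbitrary keys; but TPC-C's warehouse
# table has a fixed number of rows with dense integer IDs, so a vendor that
# recognizes the workload can use a plain array and index into it directly.

NUM_WAREHOUSES = 10  # fixed for a given TPC-C scale factor


class GeneralIndex:
    """Stand-in for a general index (e.g. a B+tree): keyed lookup."""

    def __init__(self, rows):
        self.index = {row["w_id"]: row for row in rows}

    def lookup(self, w_id):
        return self.index.get(w_id)


class FixedArrayIndex:
    """The 'cheat': dense integer IDs map straight to array slots."""

    def __init__(self, rows):
        self.slots = [None] * (NUM_WAREHOUSES + 1)
        for row in rows:
            self.slots[row["w_id"]] = row

    def lookup(self, w_id):
        return self.slots[w_id]  # no hashing, no comparisons, no traversal


rows = [{"w_id": i, "w_name": f"wh-{i}"} for i in range(1, NUM_WAREHOUSES + 1)]
# Both structures return the same answer; only the cost profile differs.
assert GeneralIndex(rows).lookup(3) == FixedArrayIndex(rows).lookup(3)
```

This is exactly the kind of shortcut the TPC auditor is supposed to catch: it only works because the benchmark's schema is known in advance.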
You know, I was asked by a couple of journalists to give quotes about this fight, and I didn't want to get involved, but I would always say — without, again, naming names — that high-level people at Databricks and high-level people at Snowflake were sending me text messages and emails saying, can you believe this? They're lying! So I'm just trying to be the Switzerland of databases and say, hey guys, calm down. So that died down — I think that was 2022 — and then this past year the fight got even bloodier. The first event: Databricks put out their own LLM, DBRX, touting how great it was, how much money they spent tuning it, how big the parameters were. And then within a month, Snowflake put out their LLM, called Arctic, and in their announcement write-up they explicitly call out: we're better than DBRX, we cost less money, look how much better we are on what they call enterprise tasks, like natural language to SQL. I thought that was kind of amusing, because now, instead of just competing on benchmark numbers for queries, it's, oh, here's our very expensive LLM, we're much better than you are. That was a new dimension of the battle between the two of them that we haven't seen before in other database vendors. But then the big, big story of the past year was the acquisition of Tabular by Databricks. Tabular was the VC-backed company that was spearheading development of Apache Iceberg — Iceberg, I think, spun out of Netflix, and Tabular was the company that was going to try to commercialize it. What I'm saying has been publicly reported, so I'm not telling secrets here, but supposedly Snowflake was trying to acquire Tabular for $600 million. Snowflake has been trying to
ramp up and expand their Iceberg support over the last couple of years, in response to Databricks' Unity Catalog and their Delta Lake offering. Again, I don't know how word got around, but once Databricks found out that Tabular was going to get bought for $600 million, they came in and threw two billion dollars in their face and acquired Tabular — they pulled the rug out. What I've heard is they were like a day away from inking the deal, and then Databricks came in like, no, forget you, here's all the money, and basically pulled the rug out from under Snowflake. Fine, that's business, I get that. But then Databricks did a bunch of things afterwards where, man, they're really rubbing salt in the wound. They announced the acquisition, I think, the day of the big Snowflake conference, where the CEO was going to announce, hey, we have this new Iceberg catalog called Polaris — that day is when they announced it. And then the next week they open-sourced their Unity Catalog, at least some parts of it. So Databricks is clearly trying to nip at the heels of Snowflake, and it's all ramping up in anticipation of the Databricks IPO, whenever that happens — it's just a matter of when. Yeah. And I mean, we're seeing this kind of resurgence of the database wars. One thing — and I'm basically stealing your postulate here — one thing that's interesting is that the database wars of the past were really about benchmarks: performance, throughput, latency, things that I think a lot of companies don't actually care about that much. And now it's a bit more about functionality, expressivity, and breadth of the platform, and that seems like almost a better thing for the industry: to be competing on feature set and expressivity rather than just speed. Yeah — it's one thing to say you're faster on these
certain benchmarks, and of course that can give you a sense of whether one system is potentially better than another, but at the end of the day it comes down to: if I run my application, my workload, my queries, my data on your database, is it going to be better or not? Who cares if your TPC-DS numbers are amazing if, for your workload, things are bad? That's traditionally how the database companies have competed against each other — going back to the '80s, it was Ingres versus Oracle; back in the day in the '90s, it was Oracle versus Informix, and maybe Sybase you could throw into the jumble as well. But to your point, nowadays the core architecture for running OLAP queries is kind of set in stone, or at least solidified, and it's based on the work done at Snowflake — a lot of the core ideas of Snowflake from like 2013, 2014. Pretty much every single OLAP system since then has been replicating that same vectorized, parallel execution model that Snowflake is predicated on. So, all right: if everyone basically has the same architecture, everyone can run queries at roughly the same almost-bare-metal speed, and you're running on Amazon S3 or whatever object store you have — so you're limited by the bandwidth coming out of the object store — how do you differentiate yourself? It's these additional features: interoperating with things in the database ecosystem, or the data ecosystem, and things like LLMs and other AI/ML integrations. Those are the things that will set them apart, and that's basically what you see happening between Snowflake and Databricks. Yeah. And so, going into 2025, there's no sign that this is going to slow down — there's a lot of innovation in this space, IPOs on the horizon, and maybe an exciting time for analytics. The question is — I've thought about whether Databricks would dial things down after they go IPO.
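The vectorized execution model Andy mentions — the one pretty much every modern OLAP engine copies — can be caricatured in a few lines of Python. This is a minimal sketch, not any real engine's code: the point is that instead of pulling one tuple at a time through the operator tree (the classic "Volcano" iterator model), a vectorized engine passes batches of column values between operators, amortizing per-tuple interpretation overhead and opening the door to SIMD.

```python
# Sketch (not a real engine) contrasting tuple-at-a-time execution with
# the vectorized model: the same filter + aggregate, structured two ways.

BATCH = 4  # real engines use vectors of roughly 1024-4096 values


def volcano_sum(rows, predicate):
    # Tuple-at-a-time: one iteration, one predicate check, one add per row.
    total = 0
    for value in rows:
        if predicate(value):
            total += value
    return total


def vectorized_sum(column, predicate):
    # Vector-at-a-time: each operator consumes and produces whole batches.
    total = 0
    for i in range(0, len(column), BATCH):
        batch = column[i : i + BATCH]
        selected = [v for v in batch if predicate(v)]  # "filter" operator
        total += sum(selected)                         # "aggregate" operator
    return total


data = list(range(10))
even = lambda v: v % 2 == 0
# Same answer either way; the difference is how the work is batched.
assert volcano_sum(data, even) == vectorized_sum(data, even) == 20
```

In Python the batching buys nothing, of course; in a compiled engine operating on contiguous column arrays, it's the difference Andy is describing between a 1990s row-at-a-time executor and a Snowflake-style one.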
And you kind of see some overtures where Databricks and Snowflake actually say the same thing: they don't really see themselves as competitors with each other, even though clearly they are. They're both looking at how to attack Microsoft now. There were a couple of analyst articles that came out in the last couple of weeks saying, oh, Databricks really cares about Microsoft, not Snowflake; or Snowflake really cares about Microsoft instead of Databricks. When you say Microsoft — the company, or a certain product? It'd be what's called Fabric now, right? Their data warehouse, all that Azure data-science infrastructure. Yeah. What's quite interesting to me is that a lot of tech folks, a lot of software engineers, think everyone uses a Mac, because they do and all their friends do, and a lot of software engineers, certainly in Silicon Valley, think everyone uses AWS and Google Cloud, because that's what they use. But Azure is doing great — Microsoft is a very successful company. There are the Microsoft products, and there's also Microsoft's Postgres offering; I know they're crushing it with that. It's not close to Amazon's RDS or Aurora levels, but they're clearly a number two in the Postgres game, and without really putting a lot of effort into evangelizing or modernizing it — they have people on staff working on Postgres, but not to the level of rewriting things the way Google did with AlloyDB or Amazon did with Aurora. All right, so let's move from, say, big data to maybe slightly smaller data — and I'm referring to DuckDB. It's kind of hard to be online these days in this space and not hear DuckDB mentioned every day, and every company that wants to do some analytical query processing in their software, as a library, is reaching for DuckDB — including us at Convex: we
actually have some prototypes doing analytics on Convex data with DuckDB. So can you tell us more about what DuckDB is, why we should care, and also — as an extension of that — why do we see so many DuckDB Postgres extensions showing up? That's quite interesting to me. Yes. So DuckDB is an embedded database system designed for analytics. I think their catchphrase is absolutely spot-on: it's like SQLite, but for analytics. SQLite — you can run transactions on it, it's widely used in almost every desktop application and anything else you can think of; every cell phone is running SQLite. But SQLite is a row store, and it's a single-threaded query engine, so it's not really meant to do analytics. DuckDB basically said, okay, we want the same form factor as SQLite, where you can plop it into any application and it runs basically anywhere with zero dependencies, but you get the state-of-the-art architecture for analytics — sort of what I was saying before, the Snowflake architecture from 2013; that's all inside DuckDB, except that it doesn't scale across multiple machines, since everything runs in-process. So DuckDB is fantastic. We use it in our class — we teach it, and we have a whole homework assignment where the students run queries on SQLite and run queries on DuckDB. DuckDB is obviously faster, and they then spend the rest of the semester learning why that is the case. The core architecture of DuckDB — a lot of that is inspired by the HyPer system out of Germany, from TU Munich, which I also consider a state-of-the-art system. DuckDB is a phenomenal piece of software. And so now that you basically have the query engine of something like a Snowflake or a ClickHouse that you can run anywhere, the obvious thing to do is, okay, let's stick it into a system that has a bunch of data but isn't so great at analytics, like Postgres.
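The embedded, zero-dependency form factor Andy is describing is easy to see with Python's standard-library `sqlite3` module — no server to install or run, just an in-process library. DuckDB deliberately copies this form factor (its Python API has a very similar connect-and-execute shape) but swaps in a columnar, vectorized engine, which is why the same analytical query runs much faster there. The example below uses only SQLite so it is self-contained; the table and data are made up for illustration.

```python
# SQLite's embedded form factor: an in-process database with zero external
# dependencies, using only Python's standard library.
import sqlite3

con = sqlite3.connect(":memory:")  # in-process; no server, no setup
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# An analytical (aggregate) query. SQLite executes this row-at-a-time on a
# single thread; a DuckDB-style engine would scan the column vectorized,
# which is the whole point of the homework comparison Andy describes.
rows = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
assert rows == [("east", 150.0), ("west", 75.0)]
```

Swapping `sqlite3.connect` for DuckDB's equivalent connection call is roughly the entire migration for a script like this — same embedded model, different engine underneath.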
In the last year, in 2024, there were four announcements; there was actually a fifth one out of China, whose name escapes me, and someone emailed me after my article came out and complained. But there's been a bunch of these extensions for embedding DuckDB inside of Postgres. The first one was from Crunchy Data, the second was from ParadeDB, then there's an official one from Hydra and MotherDuck that actually lives in the DuckDB GitHub organization, and the last one, which came out around November or December, was this thing called Mooncake. The basic idea is all the same: you run your queries in Postgres as a standard SQL query, and the extension you've added to Postgres can look at the tables you're trying to access, intercept the query, and reroute it to DuckDB, because those tables are being managed by, or are accessible through, DuckDB. So you get the benefit of something like a high-performance analytical system like ClickHouse, but through the single pane of glass of the Postgres interface. What's interesting to me is that I, like many database people, have kind of been skeptical of something called HTAP, hybrid transactional analytical processing. This is basically saying you have one database, and it lets you do short, low-latency, high-throughput transactional queries and also higher-latency, high-bandwidth, large analytics queries on the same database. The skepticism comes from the fact that different architectures favor these two query patterns, and the two patterns don't actually play nicely together, especially when they're stepping on each other's toes with regard to locks or optimistic concurrency control. But now you do see things like DuckDB Postgres extensions, and pgvector, where people are doing vector database stuff on Postgres.
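The intercept-and-reroute pattern these extensions share can be caricatured in a few lines. This is a toy sketch, not any real extension's API: the actual extensions hook into Postgres's planner and executor, and the table names and routing rule below are invented.

```python
# Toy model of query rerouting: one SQL front end, two engines behind it.
# Tables registered with the analytical engine get their queries rerouted;
# everything else stays on the row-store executor.

ANALYTICAL_TABLES = {"events", "clickstream"}  # hypothetical OLAP-managed tables

def route(query: str) -> str:
    """Decide which engine should execute a query (toy string matching)."""
    tokens = {w.strip(",;()") for w in query.lower().split()}
    if tokens & ANALYTICAL_TABLES:
        return "duckdb"      # reroute: table is managed by the analytical engine
    return "postgres"        # default: transactional row store

print(route("SELECT count(*) FROM events"))            # duckdb
print(route("UPDATE users SET name = 'x' WHERE id = 1"))  # postgres
```

The real extensions make this decision from the parsed plan tree rather than raw strings, but the single-front-end, two-engines shape is the same.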
Has it changed your mind? Maybe you were always pro-HTAP, but has it changed how you think about the future of queries and hybrid transactional analytical processing? A single data system that can do both has been the holy grail. Obviously, from an organizational perspective it seems to make sense, because instead of paying for your Postgres or Aurora or whatever, and paying for Snowflake as a separate thing, and worrying about how you move data from one side to the other and keep everything synchronized, if I can get rid of all that and just have a single database system that everyone goes to, that always has the latest information, and where there's no interference between the analytical side and the operational side, fantastic, that would be great. In practice, though, especially at large enterprises, this is very difficult to do, and not for technical reasons but for business and organizational reasons. The people that run the operational databases, your SQL Servers, your Postgres and Oracle instances, are typically on a different team than the people running the data warehouse. So if you say, hey, I want this HTAP system that's kind of okay at operations and okay at analytics, both sides would be screaming: well, I don't care about analytics, I want the best operational database you can give me, and the other side would say the exact same thing. That's always been a big challenge, and there have been a couple of attempts to sell something like this, but like I said, it's been hard. So DuckDB now, I don't think it replaces or completely supplants something like a Snowflake, a ClickHouse, a Redshift, whatever. Got a lake? Yeah, yeah, that doesn't go away. What I think you can do now with this DuckDB stuff, which is interesting: for small applications or small organizations, fantastic, this is one less thing you have to
spin up, a Snowflake, so that makes sense. For the middle guys, and even the larger guys, I think what it would allow you to do is push some of the analytics that you would typically have done on your backend data warehouse closer to where the data actually resides, as it arrives. It doesn't replace everything, because obviously you don't want to blow out all the cores on your box doing analytics, but for some smaller things I think this makes sense. Yeah, here's the experience I have. Firstly, at Dropbox, I was the tech lead for the metadata team for a long time, and we completely failed to meet the needs of the business side of the house. They just wanted to use Snowflake or whatever, because they have very different requirements: they're trying to close the books at the end of the month to make the SEC happy, right, while we're trying to run four million transactional queries per second to keep Dropbox running. Completely different worlds, and it just makes complete sense for them to be different systems. At Convex we see a slightly different pattern. Obviously there's the back-of-the-house analytics stuff, like what's my revenue over the past month. But we see users who have a website, and they're serving low-latency, high-throughput queries, and every now and then they want to run some big aggregate, some big table scan, that would perform very poorly on Convex or on Postgres, anything that has to read and do a large transformation or a very large query. You wouldn't consider it traditional OLAP; it's just, every now and then, figure out the table of high scores for this game or something. Yeah. So it seems like there are a lot of use cases there that are like baby analytics, baby OLAP, where you can have a little sidecar of DuckDB that holds a perhaps slightly stale version of your data, within
one second, and be running those queries for users. Well, when you say the stale part, that's the question: what is DuckDB going to read? Yes. A lot of the examples people gave when these DuckDB extensions came out were like: oh great, I can go read my Iceberg tables through a Postgres interface via DuckDB. Okay, great, but that's not going to cover the use case you just mentioned, the one-second-behind data, because you have to go through an ETL process to extract that data out of Postgres and put it in Iceberg before DuckDB can query it. What I think pg_duckdb can do is read directly against the Postgres heap tables. Yes. But that data is still going to be row-oriented, so it's not going to get the full benefit of a column store, the compression and all the tricks that Snowflake and others do to make these things run fast. Still, it's certainly better than what Postgres can currently do for you. So there's some small amount of inefficiency for anything that's been written since the query started, for example? Yes, and it has to understand the version control, the version chains. When a record gets updated, there are a bunch of physical versions you've got to go rectify and collapse down. Just because it's DuckDB and it's a fast engine, if it's reading unoptimized data, you're going to have issues. So again, it's better than what Postgres can do, but it's not going to be a magic panacea. Yeah, and so if I'm summarizing you correctly, your take, and certainly my take, is that the Snowflakes and the Databricks of the world aren't going anywhere, and they serve a very important part of the ecosystem, one that's getting more important every year as business intelligence gets more important and with the rise of AI. But there will be a need, a demand, for this kind of small analytics.
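The version-chain issue mentioned here, where a scan over Postgres heap pages sees multiple physical versions of a row, comes down to snapshot visibility under MVCC. A heavily simplified sketch (real Postgres visibility also involves deleting transactions, aborts, and hint bits; the data and structures below are invented):

```python
# Toy MVCC visibility check: a reader must pick, out of a chain of
# physical row versions, the newest one created before its snapshot.

def visible_version(versions, snapshot_txid):
    """versions: list of (created_txid, value), oldest first.
    Returns the newest value visible to a snapshot taken at snapshot_txid."""
    live = [value for txid, value in versions if txid <= snapshot_txid]
    return live[-1] if live else None

# Three physical versions of one logical row, created by txids 5, 9, 12.
chain = [(5, "v1"), (9, "v2"), (12, "v3")]

# A query whose snapshot is at txid 10 must skip "v3" and return "v2".
print(visible_version(chain, 10))  # v2
```

This filtering and collapsing work is what an analytical engine has to do on every scan of transactional heap data, which is part of the inefficiency being described.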
These are queries happening on transactional databases as part of business logic, and there's a difference between application business logic and back-of-house analytics. Yes, again, if you have a small amount of data, the DuckDB approach is the right way to go. Yeah. All right, so let's move on to just random happenings over the past year. In your blog you cover a lot of them, and we probably don't have time to go through all of them; I picked out a couple, but feel free to steer the conversation wherever you want. The first one is that Amazon made this announcement of Aurora DSQL. Yeah, Aurora DSQL. Interestingly, I think it has very little to do with Aurora; basically, you could maybe think about it as Amazon's Spanner. My take was that this is quite interesting for someone like Convex, quite interesting for someone building a database who wants strong linearizability and a great timestamp protocol; I'm not sure it's intended for web programmers to use. You could describe it as their version of Spanner. So, a couple of interesting things about it. First, the name Aurora. And again, they haven't really publicly talked about how it's actually built yet. There's a blog article from one of their top engineers where he talks about some things, but it doesn't really go into too much detail. My understanding is that it doesn't share any code with what you would call Postgres Aurora now, which is Amazon's proprietary version of Postgres. But the Aurora name, and I saw this in my own work at a startup, carries a lot of weight. People know what it is and think very highly of it. They may not understand what it does, but they think: if I switch from regular Postgres to Aurora, it's somehow magically going to get faster. I certainly heard that from people we know. Not always the case, right? So a
couple of things. One is that they reused the name Aurora even though it's a completely separate product line. Two, they came out announcing this DSQL thing saying that it supports Postgres, but it doesn't support Postgres entirely; some Postgres features they're going to add over time. That tells you, again, it's not a hard fork; it's not entirely based on Postgres or some hacked-up version like you've seen from Neon and other vendors. A lot of it is written from scratch. But it's interesting that they came out of the gate saying we support Postgres, versus when they first announced Aurora, if I remember correctly around 2016, it was originally for MySQL. Aurora was MySQL. Absolutely. And Aurora and Aurora Serverless are slightly different products; it's quite a confusing space, and no one knows what RDS is or whether that's got something to do with Aurora. Yes, yes. So one thing I'd say, and I guess this is good for Convex, is that the conventional wisdom has shifted: Postgres is now the default choice for people. The fact that Amazon's other big announcement with Aurora started off saying, hey, it supports MySQL, and then added Postgres, while this one came out of the gate saying Postgres, shows that Postgres has become the standard default choice for operational databases. It's a good thing, and I'm glad to see it; I think that part is interesting. So it'll be interesting to watch over time how they continue to add more features from Postgres and improve support, and it'll be interesting to see, when they talk more about what the architecture actually looks like, how they implemented it. I'm sure they're going to write a blog article or a paper about it; it just hasn't happened yet. One thing you said there which resonated with me: just switching to Aurora doesn't necessarily make your database better, and just using Postgres or MySQL does
not make your database better either, necessarily. Actually, internally at Convex, people might not know this, we use both MySQL and Postgres as part of a durable write-ahead log, and they have quite different performance characteristics at the edges. They have different behavior around pipelining and query planner hints, and anyone who's running 5,000 databases knows there are a lot of nuanced differences between the two. MySQL in some respects is better than Postgres on some of these workloads, and has probably been pressure-tested a bit more at this large scale. But, you know, MySQL 9 came out. Do people care, or is Postgres just the winner? The other one I would throw out there is MariaDB. The background there is that MariaDB is a hard fork of MySQL 5.5 or 5.6. When Oracle bought Sun, they took over the MySQL team, and then the creator of MySQL, Monty Widenius, did a hard fork, and that became MariaDB. So MariaDB is out there. Going back to hard forks, the fact that you've got MySQL and you have MariaDB as completely separate codebases means they're all reimplementing the same thing. It's not exactly wasted effort, but I just haven't seen the acceleration in the development of features and capabilities in those database systems the way you've seen in Postgres in the last five or six years. So yeah, I absolutely 100% agree that Postgres is the default choice now and has become the standard. I mean, you look at Amazon's docs and they seem to kind of want you to use Postgres, frankly, when you're reading the RDS and Aurora docs. Even, like you said, they started off as MySQL with block-level replication; I think that's maybe how you'd describe what Aurora was originally. Yeah, that's quite interesting. And also interesting that, alongside the forked efforts
around MySQL and MariaDB, there's another effort going on, not a fork but a rewrite: there's Limbo. Limbo is a rewrite of SQLite, and I'm a huge SQLite fan, a rewrite of SQLite in Rust by the Turso folks, who I believe were also the ScyllaDB folks. One guy, I think, used to be there. Okay, yeah, my bad. But Glauber is over there, and the Turso folks are a great team, and they're in the process of rewriting SQLite in Rust. It is very hot right now to rewrite things in Rust, let me tell you; you can pretty much raise VC money now if you say, hey, we're X but written in Rust, we're Elasticsearch written in Rust, and people throw money at you. Yes, yeah. The Limbo project is interesting, and it's still very early in that they just announced it. One of the big challenges they're going to face is, well, actually I'd say it's manageable, because the SQLite source code is not that massive, and a lot of it has been abstracted away through an internal VM. Basically, your SQL query shows up and gets compiled into SQLite's own opcodes; similar to how the JVM has its own bytecode, SQLite has a VM. So if you reimplement that VM piece, you get a lot for free, and you have a standard target you can build against. So I think it's doable; let's see how far they get. The thing that's also really important is testing. One of the great things about SQLite, going back to licenses, is that it's open source, actually public domain, so there's no source code license at all; it's public, and you can do whatever you want with it. Though I think they have certain code community requirements, and it's hard to get code into SQLite; they're very selective. Yeah, you're not going to get anything into SQLite. Yes, yeah. Now, the creator of SQLite, Richard Hipp, is a friend of mine.
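The internal VM mentioned above is SQLite's bytecode engine (the VDBE): every statement compiles down to opcodes, and you can inspect them yourself with EXPLAIN, here via Python's standard-library sqlite3 binding.

```python
import sqlite3

# SQLite compiles each SQL statement into bytecode for its internal VM,
# the layer a reimplementation like Limbo can target. EXPLAIN (without
# QUERY PLAN) dumps the compiled program.
conn = sqlite3.connect(":memory:")
program = conn.execute("EXPLAIN SELECT 1 + 1").fetchall()

# Each row is (addr, opcode, p1, p2, p3, p4, p5, comment).
opcodes = [row[1] for row in program]
print(opcodes)
```

The exact opcode list varies by SQLite version, but a row-producing SELECT always ends up emitting opcodes like ResultRow and Halt; reimplementing the interpreter for these opcodes is the "standard target" being described.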
That dude is brilliant, right? And I'm jealous, because he sits at his house and writes database code that's used everywhere, right? He doesn't teach classes, doesn't advise students; I'm insanely jealous. And SQLite is three guys: there's Richard, he's got another guy who writes documentation, and then the third developer, last time I talked to Richard, and it's been a while, didn't have a fixed address; he just backpacks up and down somewhere in East Asia and sends code to Richard randomly. So for a massively successful and widely used system, it's a pretty small team, and Richard is, rightfully so, very protective of it. One of the things SQLite is very good at is testing. There's a very extensive test suite; SQLite is certified to run on avionics, so it runs in airplanes, and that's a very high threshold of software engineering you have to meet to make sure it works. What I saw from the announcements from the Turso guys, and what I appreciated, was that they're trying to start from the very beginning with deterministic testing, a very strict and regimented testing protocol for the system, to ensure they don't have problems later on. SQLite was famous, I think, for having more lines of test code than database code. Yes, and obviously line count is a very approximate measure. But on that topic, Convex is part of this thing called the Deterministic Simulation Testing Alliance, so I'm not going to act like I'm clueless about the space. I think there's a bit of a resurgence in the popularity of deterministic simulation testing. This is basically running tests deterministically to ensure, with either absolute certainty or very high confidence, that your code is correct. There have been a lot of efforts in the space; before that it was a huge research area. Jamie and Sujay and some of the folks I work with
here implemented this at Dropbox for the desktop client, which ran a deterministic testing framework called, I believe, Trinity. The FoundationDB folks were really influential in thinking about deterministic simulation testing, and they went on to start a company called Antithesis. And now you have TigerBeetle; I love Joran and the TigerBeetle folks, just a great group of people doing deterministic simulation testing. We do some of it inside Convex; we just don't really talk about it publicly. And now Turso as well. They're going down this path of saying: hey, if we're building a database, then you should probably trust us, and we should give you some evidence for doing so, so we're going to test this software very rigorously. Yes, I think a lot of that work also came out of embedded devices, because obviously you can't pop open a debugger on an embedded device; it's got to work in the wild. I know there's another database, I think out of Washington, called eXtremeDB that runs on embedded devices on missiles, so in that case they have a deterministic framework too. Yeah. So it's not widely used, and it's difficult to do, but I like that the Turso guys are starting from the beginning. TigerBeetle is the same thing; they started from the beginning: let's make sure we have deterministic testing in place rather than try to graft it on later, because after a codebase has been worked on for four years, it's really hard to go back and re-instrument it. Absolutely, absolutely. And look, I like that you're focused on the testing side, because there's a lot of buzz around the idea that just rewriting something in Rust magically makes it better. I've led a number of projects where this has happened, and yes, we ended up with better code. But Rust, and I say this as Convex is a Rust shop, Rust is hard to use; it's not an intuitive language for someone to figure out.
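The deterministic-simulation idea discussed a moment ago can be shown in miniature: route every nondeterministic decision (here, just scheduling order) through a single seeded PRNG, so any failing run can be replayed bit-for-bit from its seed. This is a toy illustration, not how Antithesis or TigerBeetle actually structure their simulators.

```python
import random

def run_simulation(seed):
    """Interleave two tasks' steps using only a seeded PRNG for scheduling."""
    rng = random.Random(seed)
    tasks = {"a": ["a1", "a2"], "b": ["b1", "b2"]}
    trace = []
    while tasks:
        name = rng.choice(sorted(tasks))  # the simulated scheduler's decision
        trace.append(tasks[name].pop(0))
        if not tasks[name]:
            del tasks[name]
    return trace

# Same seed, same interleaving: every run is exactly reproducible, so a
# bug found at seed 42 can be replayed and debugged at seed 42.
print(run_simulation(42) == run_simulation(42))  # True
```

Real frameworks extend this one trick to all sources of nondeterminism: time, network delivery, disk faults, crashes.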
You're not going to pattern-match your way to being a successful Rust programmer; you've got to go read the Rust book. I would conjecture that with a rewrite in Rust, sure, you get a lot of safety, memory safety, etc., but sometimes the rewrite itself is where much of the benefit comes from: you're rethinking your abstractions, and you're building a new codebase that's clean. I don't know how old SQLite is now; it must be a couple of decades old. 2001? Okay, 2000, 1999, around that time. Yeah, so it feels like maybe the rewrite is the part that's significant here. Rust is the hot thing with students; that's the language they want to learn now. Prior to that it was probably Go; there's always some trend like that. I've had students interrupt my class or ask questions in class, like when we talk about deadlocks in B+ trees, in data structures and indexes: does Rust make this all go away? No, you still have to do this stuff correctly. But it does fix a lot. You can certainly still write bugs in Rust, but yes, if Rust code compiles, there's a good chance it's pretty close to being correct, at least compared to Python, let's say. All right, so reflecting on the past year, a lot of acquisitions happened as well. That's a tale as old as time, but the people doing the acquisitions change over time. Yes. We've already mentioned Tabular; that's probably the biggest acquisition I can think of for databases since, say, Oracle buying BEA or Sun, or Sybase getting bought by SAP. That's a massive acquisition; you don't normally see things like this. Yeah, I want to throw out one that I found interesting, which is Rockset going to OpenAI. Yes, this is not a database company buying a database company; this is an AI company
buying a very, very talented team of engineers, with a database too. Yeah, so Rockset was an analytical database service. They positioned themselves as doing what we'd call real-time analytics. Think of Snowflake as: you're streaming data in, but it's going to take a while to propagate, and then you run queries on it. Rockset was trying to say: as data is coming in, we can ingest it very quickly and you can immediately run analytics on it. The way they made it work was they basically took your data and made three copies of it: a column store, a row store, and an inverted index. They built a bunch of indexes on your data as it showed up, and that's how they got the good performance. My understanding is that Rockset was doing okay, and then OpenAI came along with the big wallet and bought them out. It kind of makes sense when you think about it. My understanding is that a lot of what OpenAI is doing is based on Cosmos DB, because they're very much embedded in the Azure infrastructure and ecosystem. I do know they run a single massive Postgres database for metadata stuff, but I think all the chats and interactions go to Cosmos DB. But if you start thinking about it, they're obviously going to be building web crawlers and things that go out and start collecting data, so you need a database team, and I think Rockset was a smart acquisition for them. Absolutely, yeah. I think it's quite interesting just to track who the big players are, who's going to be doing acquisitions, and where the industry is heading. I wouldn't be surprised if you see big AI companies buying up more infrastructure, because in many respects AI is turning into a scale game, or has been a scale game for quite some time, and you've got to put the data somewhere, right? That's a database. Yeah, absolutely.
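The three-copies ingestion idea described for Rockset can be sketched with plain dictionaries: each incoming document is written to a row store, a column store, and an inverted index, so different query shapes each hit a suitable layout. These structures are invented for illustration and are nothing like the real storage engine.

```python
from collections import defaultdict

row_store = {}                      # doc_id -> document (point lookups)
column_store = defaultdict(list)    # field  -> values   (scans, aggregates)
inverted = defaultdict(set)         # token  -> doc_ids  (search predicates)

def ingest(doc_id, doc):
    """Write one document into all three layouts at ingest time."""
    row_store[doc_id] = doc
    for field, value in doc.items():
        column_store[field].append(value)
        for token in str(value).lower().split():
            inverted[token].add(doc_id)

ingest(1, {"title": "real time analytics", "views": 10})
ingest(2, {"title": "batch analytics", "views": 3})
print(sorted(inverted["analytics"]))  # [1, 2]
```

The trade-off is visible even in the toy: writes cost three times as much so that lookups, aggregates, and search can each be fast immediately after ingest.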
And huge amounts of money getting spent on infrastructure. Yes. All right, anything else from the past year you want to cover? I didn't mention OtterTune. OtterTune was a startup I did with two of my students, Dana Van Aken and Bohan Zhang; Dana was my PhD student. The basic idea was that we were using machine learning to automatically tune configuration knobs for database systems like Postgres and MySQL: buffer pool sizes, cache policies, things like that. There are a lot of things I learned that we did right and did wrong. I'd say, at the end of the day, what was satisfying is that it actually worked. I don't play too much into the stereotype of AI putting people out of work, but we definitely had a couple of customers say: we hired a DBA as a consultant to try to tune our Postgres database, but within a day or two of OtterTune running, it had basically done all the same things the expensive DBA was doing, so they just switched over to OtterTune. There are obviously not a lot of companies in this space; there are a few others out there that try to do ML for database tuning. I think the major lesson I learned is not so much about the technology, the machine learning, but more about almost the human side of things: what you show the person, and how you convey that there's benefit in what the tool is doing, because honestly, sometimes it takes a while before the machine learning kicks in and says, okay, here's what you should actually be doing. So there are a lot of these almost psychological things to make sure people feel happy or satisfied with what the product is actually doing. Again, is that research? No. There were other mundane things we had to do too, like when a customer complained that the recommendations we were generating had too many decimal places in them, so in the web interface we just rounded the number.
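The knob-tuning idea behind OtterTune boils down to searching a configuration space against a measured workload. A minimal sketch with made-up numbers: random search over one knob, with a synthetic cost function standing in for an actual benchmark run (OtterTune itself used ML models over observed runtime metrics, not random search).

```python
import random

def workload_latency(buffer_pool_mb):
    """Pretend benchmark: bigger cache helps, until memory pressure bites.
    Entirely synthetic; a real tuner would run the workload and measure."""
    return 100 / (1 + buffer_pool_mb / 256) + buffer_pool_mb * 0.01

def tune(trials=50, seed=0):
    """Sample candidate buffer-pool sizes and keep the lowest-latency one."""
    rng = random.Random(seed)
    candidates = (rng.choice(range(64, 4097, 64)) for _ in range(trials))
    return min(candidates, key=workload_latency)

print(tune())  # a buffer-pool size near this cost function's sweet spot
```

Even this crude search beats a default setting on the synthetic curve; the hard parts OtterTune tackled were the many interacting knobs and the cost of each real measurement.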
Now it would look like 0.1 instead of 0-point-blah-blah-blah, right? Is that research? No, but that's the kind of thing we ended up doing to help sell the product. Yeah, the lesson here: databases, it's for the people, right? All the stuff that you do and I do only exists because there are applications and people using it on top. Yes. So when we folded the company, we gave a bunch of the investors' money back, and we had to make sure everyone was taken care of afterwards. That was the stressful part, because some of my former students were on student visas and had to find positions within 60 days. The good news is they were taken care of. So yeah, it sucked, and I learned a lot from it; it wasn't so much the money as the time that was stressful. But I'm glad to be back full-time at CMU. We're probably going to do something again, maybe announced later this year, with my current students, but we'll definitely take a different approach than we did with OtterTune. Yeah, and I just want to express huge respect for how you ran that company and handled everything afterwards. I've worked with and hired some of Andy's students, and they've all been hugely effusive about Andy as a person, as a mentor, and as a leader. See, the CMU students are brilliant, and the best part of my job as a professor is that they don't know they're smarter than me when they show up; by the time they figure it out, they graduate and leave, so then I look like a genius. But these kids are already showing up at my doorstep being brilliant. Right place at the right time, maybe, but I think you're being humble too, Andy. It's been really great having you on; as always, great talking to you. I want to put in one more plug for folks: if you want to educate yourself on how backends work, how
databases work, check out Andy's content, the CMU databases classes online, and I hope we can speak again in the future. The new semester starts on Monday, so we're teaching query optimization; everything will be public, and then we're going to do a seminar series next semester, also on query optimization. Actually, we should invite you guys on to talk about replacements for SQL. Oh, I would love that. Yeah, if we can get together again sometime and argue about SQL; we're probably going to agree about a lot of things deep down, but that would be great. I'd love to come give a talk. All right, thanks. Thank you. [Music]
James is joined by Andy Pavlo, Associate Professor of Computer Science at Carnegie Mellon University, to discuss the dramatic shifts reshaping the database industry in 2024. The conversation explores how Redis's controversial licensing changes sparked unprecedented community backlash and the fierce battle between Snowflake and Databricks that culminated in a $2 billion acquisition.
Andy also digs into modern testing approaches, particularly deterministic simulation, and why database wars now extend beyond performance metrics. The conversation also touches on the evolution of database education, the rising dominance of Postgres, and how rigorous testing practices are reshaping reliability expectations in database development and deployment.
The CMU Database Group has a ton of excellent educational content available online if you want to learn more about databases and how they work.
Build in minutes, scale forever.
Convex is the backend platform with everything you need to build your full-stack AI project. Cloud functions, a database, file storage, scheduling, workflow, vector search, and realtime updates fit together seamlessly.