SQL’s slow. SQL’s stupid. We hear these claims every time a new shiny tool enters the market, only to realize five years later when the hype dies down that SQL is actually a good idea. In this super techie episode of the Data Engineering Show, Andy Pavlo, Associate Professor at Carnegie Mellon University, joins the bros to delve into database internals and optimization. Andy discusses leveraging ML for autonomous database optimization, using Postgres for practical applications, tuning production databases safely, and why SQL is here to stay.
Listen on Spotify or Apple Podcasts
Transcript:
Benjamin (03:22.99): Hi, everyone, and welcome back to another episode of the Data Engineering Show. We're super fortunate to have Andy Pavlo on today. Andy's a professor at Carnegie Mellon University (CMU) in Pittsburgh in the US. Many of you might know him from either OtterTune or his super well-known lecture series on database internals on YouTube.
Welcome to the show, Andy. Great to have you.
Andy Pavlo (03:54.235) Hey Ben, thanks for having me.
Benjamin (03:56.782) Yeah, it's a pleasure. Last time we chatted, I was giving a guest lecture in your advanced database systems course, or no, the Vaccination Database Tech Talks seminar series. So it's super nice to have you on today on the podcast, chatting about all things databases. Yeah, what have you been up to?
Andy Pavlo (04:05.979) Seminar series. Yeah.
Andy Pavlo (04:17.178) Well, it's the middle of the semester, so obviously I'm teaching. I mean, there's research Andy and then there's OtterTune Andy, and we can talk about both of those. That's the thing, you always get both. And I will say, I try very hard to keep research Andy, like university Andy, and OtterTune Andy separated for legal reasons, which we can talk about if you want as well.
Benjamin (04:29.614) Who's in the room with us right now?
Yeah.
Andy Pavlo (04:47.29) And so research Andy, we've still been focused on applying machine learning to optimize database systems. But in prior years we were actually building a system from scratch called NoisePage, where we were designing the internals to be entirely run by machine learning components for autonomous operation. For a variety of reasons, we've scuttled that project and now focus primarily on Postgres. It's good and bad.
There are pros and cons to this approach in terms of research. But we've mostly been focusing on constructing an environment we call a gym. Basically, instead of building a simulator for the database, as often occurs in robotics and other autonomous-operation research areas, we just use the database as the simulator itself. The idea is that you instrument the system and collect all the telemetry that you then train machine learning models on, to determine the best way to add indexes, what knobs to tune, how to partition your tables; everything a human could possibly tune, we try to automate inside of Postgres. And one of the cool things we do in the gym is we instrument Postgres to run queries faster than they really would, by cutting things off early and sampling, because you don't actually need the real results if you're just training machine learning models.
And then I have another student who's looking into more advanced models for automatically tuning databases. So instead of having these bespoke models, like here's my index tuner model and here's my query tuning model, we think we can build a single model that can accomplish everything. And what's cool about it is that it can exploit the similarities between certain actions. Like, an index on (A, B) is very similar to an index on (A, B, C).
So rather than treating each one as a discrete action, you can learn the latent relationship between all of them. So there's still a huge effort on applying machine learning to automatically tune databases. On the systems side, we have two ongoing projects. One is looking at applying query compilation and optimization methods to user-defined functions. There was a whole line of research at Microsoft, which actually ships in SQL Server today, where they can take
Andy Pavlo (07:11.096) a UDF written in T-SQL or PL/SQL, convert it inline into either SQL or relational algebra, and inject that into the calling query. So now the optimizer sees a bunch of queries, or a subset of a query, or a subquery, rather than an opaque-box UDF. And so we've been looking at going further than Microsoft and trying to handle all possible UDFs, doing a combination of identifying what part needs to be inlined and what part needs to be compiled.
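To make the inlining transformation concrete, here is a rough sketch in Python with the SQL as plain strings. The table, column, and function names are hypothetical, and the rewrite is shown by hand; the systems Andy describes do this automatically inside the optimizer, so treat it only as an illustration of turning an opaque scalar UDF into an expression the optimizer can see.

```python
# Hedged illustration of UDF inlining (hypothetical schema and names).
# Before: the optimizer sees an opaque-box scalar UDF invoked on every row.
udf_definition = """
CREATE FUNCTION customer_tier(total_spend NUMERIC) RETURNS TEXT AS $$
BEGIN
    IF total_spend > 10000 THEN RETURN 'gold';
    ELSIF total_spend > 1000 THEN RETURN 'silver';
    ELSE RETURN 'bronze';
    END IF;
END;
$$ LANGUAGE plpgsql;
"""

query_with_udf = """
SELECT c.name, customer_tier(c.total_spend) AS tier
FROM customers c
WHERE customer_tier(c.total_spend) = 'gold';
"""

# After: the UDF body is inlined as a relational expression (here a CASE),
# so the optimizer can push predicates, fold constants, and cost it normally.
query_inlined = """
SELECT c.name,
       CASE WHEN c.total_spend > 10000 THEN 'gold'
            WHEN c.total_spend > 1000  THEN 'silver'
            ELSE 'bronze' END AS tier
FROM customers c
WHERE c.total_spend > 10000;   -- the 'gold' filter is now a plain pushable predicate
"""

if __name__ == "__main__":
    print(query_with_udf)
    print(query_inlined)
```

The payoff is in the WHERE clause: instead of a black-box function call per row, the optimizer now sees an ordinary comparison it can push into the scan.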
And that one we're actually doing in DuckDB, because, as I said before, it has a German-style query optimizer that's based on HyPer out of Munich, which Ben, I know you're very familiar with. So it's probably the best open-source optimizer we could use for this project right now. And then I have one last student who's looking at BPF optimizations for databases. This would be the extended Berkeley Packet Filter (eBPF) in Linux: you can write these small verifiable programs that you can then load into the kernel and run in a safe manner, to avoid the overhead of copying data back and forth between kernel and user space. So we did some initial work on using it to make a Postgres proxy run faster, like making PgBouncer run faster. But now the bigger question is: okay, if you put the entire database in the kernel with BPF, what can you do? All right, so that's a lot there. Any questions about any of that? Again, that's all just the research side of things.
Benjamin (08:38.35) Nice, that's awesome. Let's close the full circle first: tell us about industry Andy as well, and then we'll start digging into some questions.
Andy Pavlo (08:47.64) Absolutely, yeah. So, industry Andy is still with OtterTune. This is a project that we spun out of the university research lab, where we were using machine learning to optimize databases. There we were taking an opaque-box approach: you assume you can't touch the internals of the database system, you can't modify anything in the source code, so how much can you optimize
Eldad (08:48.112) opening up everything.
Andy Pavlo (09:16.856) just through external methods, external APIs?
Benjamin (09:19.662) What would be examples of things you would change then in the system configuration?
Andy Pavlo (09:25.24) So this would be in terms of the internals of the source code, or what we can tune?
Benjamin (09:30.734) What you can tune. So you can tune things like indexes and all of that stuff, right? It's just that you're not touching the Postgres source code, basically. Gotcha.
Andy Pavlo (09:37.912) Correct, yes, and we also support MySQL. But OtterTune originally focused just on knobs. These are the configuration parameters that pretty much every database system exposes, like caching policies, buffer pool sizes, work memory, stuff like that. And so we can train machine learning models that can predict how the system's behavior will change as you start changing these knobs. And then you can do a search to look for a set of configurations for the given workload running on a given database that'll minimize latency, maximize performance, and so forth. And basically what happened was, while this was a research project, a bunch of people emailed us and said, we have this exact problem, we'll give you money to fly a student out and set it up for us. And this happened so many times we figured, okay, let's spin it off as a startup.
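To give a feel for what a knob search does at its most basic, here is a minimal sketch assuming a local Postgres and the psycopg2 driver. The knob values, the benchmark query, and the connection string are made up for illustration, and a real tool like OtterTune replaces the brute-force loop below with models trained on telemetry; the objective, a configuration that minimizes latency for a given workload on a given database, is the same.

```python
# Minimal sketch of naive knob search: try a few session-settable Postgres knobs
# and keep whichever configuration runs a representative query fastest.
# Assumes a local Postgres and psycopg2; knob values and the query are hypothetical.
import itertools
import time
import psycopg2

CANDIDATE_KNOBS = {
    "work_mem": ["4MB", "64MB", "256MB"],
    "random_page_cost": ["4.0", "1.1"],        # lower values are typical for SSDs
    "effective_cache_size": ["4GB", "16GB"],
}
BENCH_QUERY = "SELECT count(*) FROM orders o JOIN customers c ON o.customer_id = c.id;"

def run_config(cur, config):
    # Apply the knobs for this session only, then time the benchmark query.
    for knob, value in config.items():
        cur.execute(f"SET {knob} = %s;", (value,))
    start = time.perf_counter()
    cur.execute(BENCH_QUERY)
    cur.fetchall()
    return time.perf_counter() - start

conn = psycopg2.connect("dbname=mydb")          # hypothetical connection string
cur = conn.cursor()

best = None
for values in itertools.product(*CANDIDATE_KNOBS.values()):
    config = dict(zip(CANDIDATE_KNOBS.keys(), values))
    latency = run_config(cur, config)
    if best is None or latency < best[1]:
        best = (config, latency)

print("best config:", best[0], "latency:", round(best[1], 3), "s")
```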
Benjamin (10:27.95) That's awesome. And especially on Postgres, the differences can be so huge. I remember when we wrote that scheduling paper for SIGMOD and we benchmarked Postgres: you take the out-of-the-box configuration and it's just horrible. And, I mean, as a researcher, you should do a good job of measuring the other systems in a fair environment, so there was actually a lot of manual tuning there as well, just to get Postgres to perform. It's an amazing system, of course.
Andy Pavlo (10:40.44) Yeah.
Andy Pavlo (10:52.92) I mean, no system is immune to this. They all have this problem. Now, when you run in a hosted environment, for example Amazon has RDS, they have a hosted version of Postgres, and Aurora is sort of a similar thing, they've done some tuning for you. They know that you're running on this instance size with this many cores and this amount of memory, so they've set the parameters to some general-purpose, lowest-common-denominator setting for that environment. But if you tune it for exactly how the application is using the database, like what queries are running and what the data looks like, then you get much better performance. And I would say Postgres is particularly tricky, not just for scheduling; autovacuum is always the thing that trips people up, and it's just an artifact of Postgres's architecture. So tuning that makes a big difference.
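As one concrete, hedged example of that kind of autovacuum tuning: Postgres lets you override the autovacuum storage parameters per table, which is often where update-heavy tables need attention. The table name and the values below are hypothetical; good settings depend entirely on the workload.

```python
# Hedged sketch: override autovacuum storage parameters for one hot, update-heavy
# table instead of relying on the global defaults. Table name and values are hypothetical.
import psycopg2

TUNE_AUTOVACUUM = """
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor  = 0.02,   -- vacuum after ~2% dead rows instead of the 20% default
    autovacuum_vacuum_cost_delay    = 2,      -- let the vacuum worker run more aggressively
    autovacuum_analyze_scale_factor = 0.01    -- keep planner statistics fresher
);
"""

with psycopg2.connect("dbname=mydb") as conn:   # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute(TUNE_AUTOVACUUM)
```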
Benjamin (11:42.414) Cool. So then, to close the loop and go back to research: one thing I'm curious about is this transition from building your own system, right, like Peloton and then NoisePage, to using some of the existing open-source databases and mainly doing research on those. And I feel like that's also becoming more common; you see so many papers nowadays implementing things in DuckDB, in Postgres, and so on. Tell us about that transition, right?
Andy Pavlo (12:13.207) Why did we go down this path? Why did we abandon writing a system? Yeah, I would say the reason why we abandoned writing a system from scratch was threefold. One, it was right when we forked off the startup and I was just spreading myself too thin. Two, the pandemic. And the part of that that caused the problem was not so much that we were remote and working from home. It was that I took on way more students to work on the project than I should have. Basically, that first year in 2020, any student at CMU that contacted me and said, I lost my internship, can I do something, can I work on something so that I don't have a gap in my resume? So we took a bunch of students who had maybe not taken our database class before, who maybe weren't the strongest C++ programmers, but it was my way of just trying to help.
And so we ballooned to like 35 people and it was not sustainable. The code really suffered. Correct, yes, of varying quality, yes. So it was that, and the quality of the code sort of tanked. And as part of this, we also didn't actually have a good query optimizer. So even though we could do all this great machine learning stuff, at the end of the day, if you're picking crappy query plans,
Eldad (13:19.428) an army of interns.
Andy Pavlo (13:42.263) it's all for naught, right? And then the last one is my wife and I had a baby who was colicky and was just screaming 24/7, and I was like, I can't do this. So we looked around and decided Postgres was a reasonable target. And like I said, there's pros and cons to all of these. Autovacuum is probably the... I mean, for the sort of stuff we're doing, it doesn't cause too many problems, just because we know it's something we need to tune and automate.
You just need to be careful when you run experiments to make sure it doesn't start making crazy changes and slow things down. In terms of DuckDB, we chose that one, as I said before, for the user-defined function research because it's one of the few systems that implements the Thomas Neumann paper on arbitrary unnesting of subqueries. Now, they didn't support lateral joins, and my student last year actually submitted a pull request to
DuckDB to get that in there. So now we can unnest lateral joins, and that's the last piece you need for the UDF work. Because HyPer is not open source and Umbra is not open source, we didn't have anything else we could use for this, so we chose DuckDB. And I'll say, I miss it. I miss building a new system. It's not that I was competing with the database researchers at Munich, but I was definitely heavily inspired by a lot of the stuff they were doing, and knowing that they were working on their systems all the time sort of kept me excited and wanting to keep working on our system. But I guess at this point in my career, it was just too much. Now, where we want to go next: of course, I want to go back and build a new system. But when you look around the landscape of what the database market looks like right now, I don't see anything immediately obvious where I'm like, oh, this is a niche we could go down that would set us apart from the DuckDBs, the ClickHouses, the Firebolts, the Snowflakes, and so forth. So where I'm at right now is, rather than building a new database system from scratch, we're actually gonna build a query optimizer first. We have an ongoing project on that, sort of something to replace Calcite. And that way, yes, yes, yes. It's called Opti, it's very, very early, so I don't really talk about it publicly yet, because, you know, we're
Benjamin (15:56.782) Nice. And that's built in Rust, right? From what I know. Nice, that's awesome.
Andy Pavlo (16:08.215) we're doing basic things right now, like predicate pushdown and so forth. But the goal is to eventually have a clean code base for this, and then we can just keep rolling with it and hopefully get far enough along that we can expose it to other people. And then, the same way that people have adopted Calcite, we hope that this will help people out as well. So that's where my current... I was gonna say also, query optimization is the thing I know the least about in databases, so I naturally gravitate towards it because it's hard. And so we'll see how that goes.
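For readers who haven't seen an optimizer rewrite rule up close, here is a toy predicate-pushdown sketch: a filter that touches only one side of a join gets moved below the join so less data flows into it. The plan representation and names are invented for illustration; this is not the code base Andy is describing.

```python
# Toy predicate pushdown over a tiny logical plan. Plan nodes are hypothetical.
from dataclasses import dataclass
from typing import Union

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str          # e.g. "orders.amount > 100"
    table: str              # the single base table this predicate references
    child: "Plan"

@dataclass
class Join:
    condition: str
    left: "Plan"
    right: "Plan"

Plan = Union[Scan, Filter, Join]

def tables(node: Plan) -> set:
    """All base tables reachable under a plan node."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables(node.child)
    return tables(node.left) | tables(node.right)

def push_down(node: Plan) -> Plan:
    """Push single-table filters below joins when only one side references the table."""
    if isinstance(node, Scan):
        return node
    if isinstance(node, Join):
        return Join(node.condition, push_down(node.left), push_down(node.right))
    child = push_down(node.child)                     # Filter node
    if isinstance(child, Join):
        if node.table in tables(child.left) and node.table not in tables(child.right):
            return Join(child.condition,
                        push_down(Filter(node.predicate, node.table, child.left)),
                        child.right)
        if node.table in tables(child.right) and node.table not in tables(child.left):
            return Join(child.condition,
                        child.left,
                        push_down(Filter(node.predicate, node.table, child.right)))
    return Filter(node.predicate, node.table, child)

# Filter(orders.amount > 100) sitting on top of orders JOIN customers
plan = Filter("orders.amount > 100", "orders",
              Join("orders.customer_id = customers.id",
                   Scan("orders"), Scan("customers")))
print(push_down(plan))
# The filter now sits directly above Scan("orders"), below the join.
```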
Benjamin (16:39.374) But it's like a top-down, Cascades-style optimizer, right? From what I saw. Okay, it's funny, because we here at Firebolt, we do bottom-up DP and so on. So Cascades never clicked with me. Up to this day, I've tried multiple times in my life to really understand it, and I've always wondered, how the heck do you build this without just running in circles or whatever? Like it's always...
Andy Pavlo (16:44.567) Yes.
Andy Pavlo (17:00.727) I'm the exact opposite. To me, when I read the hypergraph paper, again from Thomas Neumann (this is gonna sound like we're just talking about Thomas nonstop, but he's prolific, writes a lot of papers), when I read those papers, I'm like, I guess, right? But to me the Cascades one, because it's all this unified model and he threw everything in, makes more sense. Now, I will say the inventor of Cascades made a passing comment at SIGMOD in 2017 or 2018, when he won sort of the Test of Time Award. He mentioned at the end, yeah, if I had to do it all over again, I would use Cascades to do all the initial planning, but then run bottom-up DP to do the join ordering. So we might eventually get to that step, like have the extra thing at the end to do the DP, but right now it's purely Cascades.
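For readers who, like Benjamin, think in bottom-up terms, here is a tiny dynamic-programming join-ordering sketch in the spirit of the classic Selinger-style enumeration Andy alludes to. The cardinalities, selectivities, and cost formula are invented for illustration; real optimizers track join predicates, physical properties, and far richer cost models.

```python
# Toy bottom-up DP join ordering: for every subset of relations, remember the cheapest
# plan that produces it, built from two smaller disjoint subsets. All numbers are hypothetical.
from itertools import combinations

CARD = {"A": 1_000, "B": 10_000, "C": 100, "D": 50_000}
SEL = {
    frozenset(("A", "B")): 0.001,
    frozenset(("B", "C")): 0.01,
    frozenset(("C", "D")): 0.0005,
    frozenset(("A", "D")): 0.1,
}

def join_selectivity(left, right):
    # Product of selectivities of all join predicates between the two sides (1.0 = cross product).
    sel = 1.0
    for l in left:
        for r in right:
            sel *= SEL.get(frozenset((l, r)), 1.0)
    return sel

rels = list(CARD)
# best[S] = (cost, output cardinality, plan) for the subset S of relations
best = {frozenset([r]): (0.0, CARD[r], r) for r in rels}

for size in range(2, len(rels) + 1):
    for subset in map(frozenset, combinations(rels, size)):
        for left_size in range(1, size):
            for left in map(frozenset, combinations(sorted(subset), left_size)):
                right = subset - left
                lcost, lcard, lplan = best[left]
                rcost, rcard, rplan = best[right]
                out_card = lcard * rcard * join_selectivity(left, right)
                cost = lcost + rcost + out_card      # toy cost: size of every intermediate result
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, out_card, (lplan, rplan))

cost, card, plan = best[frozenset(rels)]
print("best join order:", plan, "estimated cost:", round(cost))
```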
Benjamin (17:55.534) Nice, that's awesome. Anyway, I'm super excited to see how that project will go. For us as well, when we rebuilt our query optimizer from scratch, it was like, right, looking around: hey, what's there? What can we draw inspiration from? And on the C++ side, for example, there's nothing like Calcite, right? And we wanted to stay all C++, so going with Calcite wasn't an option. So when we rebuilt everything, we actually took many of the design choices that Calcite made in terms of modeling the algebra and so on, but we had to build it all from scratch in C++. So yeah, that's going to be an awesome project and very nice for the community as well.
Andy Pavlo (18:26.967) Mm-hmm.
Andy Pavlo (18:32.279) There's two other query optimizers floating out there now. The DataFusion guys have something, but when we looked at it, I think it's very much rule-based and heuristic-based; I don't think it's cost-based. My understanding is a student in China sent them a pull request to add Cascades and I think they rejected it. And then the Velox team at Meta has an experimental branch of Velox for a query optimizer called Verax.
I don't know, again, how robust it is. We haven't looked at it yet, but those are two other projects floating around that might go somewhere.
Benjamin (19:10.254) Thanks for the hints. That's awesome. Cool. So transitioning a bit to kind of in...
Eldad (19:15.748) Since we're all throwing optimizers and projects around, let me take you back a few years to the Hasso Plattner Institute in Germany. Firebolt actually started by taking that weird, amazing little project from there. Obviously the team rewrote everything, but yeah, your discussion kind of reminded me of picking the right component at the right time. But yeah, lots has evolved.
Andy Pavlo (19:18.583) Yes.
Andy Pavlo (19:43.063) I mean, you're referring to Hyrise, right? Yeah.
Benjamin (19:45.486) Yes, exactly. So the optimizer originally in Firebolt was part of that, and we connected it to the ClickHouse runtime. But yeah, not anymore.
Eldad (19:46.66) Yes.
Andy Pavlo (19:55.287) We looked at Hyrise originally, I think a few years ago, and I think what spooked us was the license, the source code license. We're not lawyers, but because it wasn't Apache, it was this very... it was a Hasso Plattner thing, something very bizarre. We were like, we're not touching this.
Benjamin (20:12.686) But they also have Hyrise 2; they actually have two versions, right? At some point they rewrote it completely, and I think the newer one is under a standard open-source license. But anyway, I think we got very deep now into, uh, query...
Eldad (20:16.164) Yes. Yes.
Eldad (20:26.692) They did it after most of the team went to work for Snowflake. They kind of had a generation shift, and one of the decisions was, yes, let's open up the license.
Benjamin (20:33.87) Yes. And now all of the Hyrise 2 people are also at Snowflake, and they're all great. So, nice. In terms of industry Andy, right? It seems like many of the things you're doing in academia now, I guess, are inspired by the real industry problems you've then seen at OtterTune and so on. How's the connection between those two? In the beginning, you said you're trying to keep them separate.
Eldad (20:40.196) Hehehehehe
Benjamin (21:02.286) But from your story now, it seems hard to keep them separate, because they're working on similar things.
Andy Pavlo (21:08.311) Yeah.
Eldad (21:09.732) That's Andy's lawyer; he was not invited to the podcast.
Andy Pavlo (21:14.391) I had to sign this contract, this license contract with the university, that basically says the OtterTune research project at CMU is dead and everything is now in the startup. So I would say that the problems we're solving at the startup are certainly related, but there are things you have to deal with in real-world production databases that, honestly, the university and most of academia don't touch.
I'll also give an example of something we saw at the startup where we said, hey, this is an interesting problem, we don't have the resources to solve it, and then I brought it back to the university. So with OtterTune, the original idea was that you, the customer or the user, would clone your database, make a snapshot of it, collect a workload trace, and then, on a spare machine or backup, run experiments to tune it, figure out the best configuration, and then apply it to the production database. When we did initial pilots with the university project, one was at a big bank, and we talked to other people at the US Patent Office. These had full-time DBAs that had the know-how and the time to actually do this setup. And when you read all the papers on doing ML for databases, they're all pretty much making the same assumption as well: you're going to run on a spare machine because you don't want to interfere with the production database. When we then spun it out as a startup, sure, some of the initial people we were talking to could do this, but the majority of people cannot. Amazon does make it easy to take a snapshot and clone the database, but the workload capture and replay tools for the open-source databases like Postgres and MySQL pale in comparison to the commercial ones, like Oracle's tools, for example. So even if you could capture the workload trace, you're not really going to re-simulate the production environment on the clone. And furthermore, some people are running on really expensive machines in the cloud, like a $40,000-a-month instance with backups and replicas and all of that provisioned. No one's going to spin up another $40,000-a-month instance just to run experiments on. So a lot of what we've done at OtterTune since spinning it out has been to make the machine learning models safer and a bit more conservative, because we know people are pointing us at production databases. Like in the very beginning, when we first came out of stealth,
Andy Pavlo (23:41.623) everybody we talked to, we'd get on a call and say, okay, just so you know, you don't want to run this in production, right? Because the machine learning has to learn; you want to do this on a replica. And everybody that would listen to us was like, oh yeah, okay, I do want this, but I can't do that, I'll just jack up my instance size and pay Amazon more, because that'll solve my current performance problems. For some reason, everyone heeded our warnings except for the Brazilians. In the very beginning, the Brazilians for some reason were like, we don't care, we'll point it at a production database, and we were a brand new startup. I mean, in the very beginning I would not have pointed OtterTune at a production database myself. Now I think it's safe, but in the very beginning, for some reason, the Brazilians were like driving without a seatbelt; they were okay with it. So since then, we've set the service up to, like I said, be more conservative, be mindful that we're running on production databases, avoid wild swings in performance, and do things incrementally. Our observation window is 24 hours, so we see the day-night pattern. We automatically skip weekends and holidays, right? All those are the things we need to do to make people comfortable with using a machine-learning-based tool for their databases. There's other stupid things, yes. Yes.
Eldad (24:50.404) So Andy, quick question. Let's fast forward to data warehouses having full workload isolation, metadata decoupled, every compute, every cluster spin-up looking the same, and being able to redo the exact same workload thanks to a query history that remembers everything you've run. So kind of retrying the same idea behind the project on that architecture, which has almost none of the limitations that you've mentioned from the past, where you run one system in production and then you say, okay, so how do I reproduce it? It's kind of like doing auto-scale, or doing a version change while running a workload on Snowflake, right? That project might actually be much more acceptable and safe, because you cannot hurt the system. The warehouse, right? The cluster, the warehouse that runs, it just runs. And then the warehouse that learns just emulates, but it's the exact same thing. So you spin up another warehouse on the side, it's isolated. It kind of solves a lot of that.
Andy Pavlo (25:51.383) Yes.
Andy Pavlo (26:03.991) But who's paying for that? Who's paying for that compute? That's the challenge, right? If you have, if you...
Eldad (26:08.068) So you as a user, you do, yes.
Andy Pavlo (26:12.279) Right, so some people potentially don't wanna do that, right? So to your point, yes, it does make things easier. Microsoft basically does something very similar for their auto-indexer in Azure SQL. They call them B-instances: they'll spin up a replica, do a snapshot, and split the traffic so it gets a mirror of the traffic as well. But again, Microsoft is doing that underneath the covers, the user's not aware, and Microsoft is just eating that cost. Going back to university Andy research, one of the things we may want to try to do is transfer learning: take a T3 micro, the smallest instance you can get, train some models to learn how the workload is going to respond on a scaled-down version, and then apply those same changes to the larger production instance. Because T3 micros cost nothing. But right now, none of the work, at least our own work, is getting into this space. A lot of the research out there assumes it has to be the exact same hardware every single time. But to your point, yes, there are aspects of that kind of architecture that would make this a lot easier. In the Postgres and MySQL world, that doesn't exist.
Eldad (27:24.9) Exactly. That's a whole different problem. In fact, this is the problem. So building a solution that can tune, that can change knobs on the fly, is a Postgres problem. It's kind of a, you know.
Andy Pavlo (27:27.223) Yes.
Andy Pavlo (27:38.615) Yes. And that's what we were trying to do with the NoisePage project, to build the system so that if you change the buffer pool size, you don't have to restart. That was the whole goal. We just, at the end, like I said, we...
Eldad (27:49.82) But then there's a whole range of problems that you've solved that go beyond the architecture challenge. So once, yes, we can afford running for 30 minutes, assuming it takes 30 minutes to get to an optimized recommendation, I spin up a Snowflake warehouse and it runs for 30 minutes. It's the exact same warehouse definition. I have no clue about the hardware; I just assume that a Snowflake warehouse of size S will be the same if I spin it up two, three, ten times. I run it. Yes, I completely ignore all the infrastructure and architecture challenges of the Postgres world, but then it goes straight to learning and really improving pure database knobs on that new architecture, which could be amazing. Even as a project for students who can experiment with their research on modern products. As you say, Microsoft has had that since last year, complete workload isolation; Snowflake has that, BigQuery obviously, yeah.
Andy Pavlo (28:51.055) Yes.
Benjamin (28:51.182) I mean, you even have products like Keebo now that do these kinds of cost optimization things for these decoupled storage-and-compute systems.
Andy Pavlo (29:02.575) Yeah, so Keebo is a startup founded by a friend, Barzan Mozafari, a professor at the University of Michigan, and his former student, who's now a professor at the University of Illinois. So they're doing two things. One, Snowflake doesn't expose that many knobs; I think it's like three knobs they can actually tune. So the search space is much smaller than what we have to deal with for Postgres and MySQL in OtterTune.
The other thing they do is sampling. They'll rewrite your query to hit a sample table rather than the full table, and they can provide some statistical guarantees about the error rate. So they're doing a bit more than OtterTune in terms of query rewriting, because they're sitting in front of Snowflake, and we currently can't do that in OtterTune because it requires additional infrastructure. But I was going to say, there are other stupid things we had to do that, again, aren't research, aren't deep science, to make this work. One example was, a bunch of customers started complaining that the values of the recommendations looked like things a human wouldn't generate, like there were too many decimals. So we rounded them up, right?
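To illustrate the kind of sampling rewrite described above (a hedged outside sketch of the general idea, not Keebo's actual implementation), here is what running an aggregate over a sample instead of the full table can look like, using Postgres-style TABLESAMPLE in plain SQL strings. The table name and sample rate are hypothetical, and the naive scale-up only makes sense for additive aggregates like COUNT and SUM.

```python
# Hedged sketch of an approximate-query rewrite: run the aggregate over a 1% sample
# and scale the result, trading a bounded error for much less data scanned.
original_query = "SELECT count(*) AS n FROM page_views WHERE country = 'BR';"

SAMPLE_PERCENT = 1.0
rewritten_query = f"""
SELECT count(*) * {100.0 / SAMPLE_PERCENT} AS approx_n
FROM page_views TABLESAMPLE SYSTEM ({SAMPLE_PERCENT})
WHERE country = 'BR';
"""

print(original_query)
print(rewritten_query)
# A real system would also report a confidence interval for approx_n
# (e.g. from the sample variance) so users know the error rate they're accepting.
```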
Benjamin (30:09.39) Nice.
Eldad (30:11.351) That's it, it exists, we have it. Benjamin, you see this abuse of telemetry, really, the decimal numbers. Thank you, Andy. For everyone out there: please don't abuse decimal numbers.
Andy Pavlo (30:26.157) But so we added that, right? Like I said, those are the things, because we know we're running on production databases, we just gotta be more cautious. Yeah, so I forgot what the original question was, but it was about the challenges of putting stuff in production, what we've learned from OtterTune versus the research, and how we sort of cross-pollinate them. So let me give an example of something that we saw a lot at OtterTune that we then brought back to the research side. We found a lot of people were using proxies in front of Postgres, mostly PgBouncer. If you're running on Amazon RDS, they have their own proprietary version called RDS Proxy. And from a research side of things, I couldn't find any papers about the modern incarnations of proxies in front of database systems. So when we were looking at this BPF research, we said, oh, the proxy is clearly something we could accelerate with BPF, because all it's really doing is: a packet shows up, read the header, figure out where it needs to go, and then shove it right back out the NIC. So instead of copying up from the NIC through the kernel into user space, you just intercept the packet as it comes into the kernel, figure out where it needs to go, and shove it right back out. And you can scale this thing much higher than PgBouncer can, because the memcpy is what slows you down. So that's a good example of, hey, it'd be nice if we could solve this problem, but we just can't do it at the startup. So we brought it back to the university and it worked.
Benjamin (32:05.838) Nice, that's awesome. Super cool. Sweet. So one other thing, transitioning a bit away from this, right? I follow you on Twitter, of course, and a lot of people follow you on Twitter. One thing that keeps coming up is your take on SQL as a whole, right? So zooming out a bit from these tuning aspects: SQL databases. And I think you also have a paper coming out soon with Mike Stonebraker, What Goes Around Comes Around... And Around..., where you kind of say, hey, this has been around for a really long time, nothing's replacing SQL anytime soon. But AI is going to eat it. Like, there's no more SQL in five years, really. I think you're totally wrong. No.
Andy Pavlo (32:50.539) Sure, yes.
Eldad (32:56.02) It will end up as a Netflix documentary.
Andy Pavlo (32:56.139) Don't you work at a SQL company?
Benjamin (32:58.094) Yeah.
Andy Pavlo (33:00.779) You work at a SQL database company, what are you talking about?
Benjamin (33:03.438) No, of course, it wasn't serious. But there are changes, right? Like, okay, you have vector databases becoming super popular now, all of those things. My personal take is that all of this will, in the longer term, just be soaked up by the SQL systems. There's no good reason why this wouldn't happen: a vector type in your relational SQL database, you have some aggregate functions, you have some scalar functions, whatever.
Andy Pavlo (33:05.355) Yes.
Benjamin (33:33.326) So. Yeah, like give us a, your take on kind of SQL as a language kind of surviving for so long and still being strong. And then maybe especially now in the context of all of the generative AI stuff going on.
Andy Pavlo (33:46.827) Yeah, so let me give some background about this particular paper you were referencing. The first edition of this paper was called What Goes Around Comes Around, and that was written by Mike Stonebraker with Joe Hellerstein out of Berkeley in 2005. It's basically Mike's recounting of the history of data models from the very beginning in the 1960s up until 2005. He goes through: in the beginning it was the network model, the hierarchical model, CODASYL, and then the relational model comes around, and then there are all the descendants and variations of them. I think he gets up to XML databases, which became in vogue in the late 1990s, early 2000s. And the main takeaway from that paper is basically how the relational model is the superior model; to be very specific, the object-relational model, which Stonebraker coined as part of the Postgres project, is what he considers to be the superior model. But by object-relational he really just means an extensible relational model, whether it's XML or JSON or arrays and so forth. And so again, this paper lays out: here are all the things that people tried, and none of them has overcome SQL, and they're all basically making the same arguments every time something new comes out.
Oh, SQL's slow, SQL's stupid, the relational model is slow and stupid, here's this new shiny thing that's so much better. Oh, turns out, you know, five years later after all the hype dies down, oh, SQL and the relational model are actually a good idea. So we've, we've... It's, it...
Eldad (35:25.869) Nobody read the paper. That's why we ended up with Mongos and NoSQLs and NewSQLs and all sorts of shit for 15 years, by the way, 15 years. Like, nobody wanted to listen to the people who know SQL. That was the thing, right? Like, big data: go away, SQL guys. And it took 15 years to take it back. Really.
Andy Pavlo (35:32.587) Yeah. Yes.
Andy Pavlo (35:40.907) Yes.
Andy Pavlo (35:47.787) So it's what goes around comes around. That was the impetus for writing this paper, the follow-up, because there was actually a Hacker News comment that I saw where someone writes, I don't know why people want to use a relational database, everything should just be a graph database. And what they wrote, yeah, it literally was almost a verbatim quote of what was in the original 2005 paper. So I emailed Mike and I was like, we gotta write the follow-up, because it's, you know,
Benjamin (36:05.294) Is it web scale? Is it web scale?
Andy Pavlo (36:17.643) 15, 20 years have passed and it's what goes around comes around all over again. As you said, the NoSQL guys are basically making the same arguments that the XML database people made, that the object database people made. And so the paper is coming out, I think, later this year; depending on when this podcast is put out, the paper should be available. But basically we go through the last 20 years of history of alternative data models: the document model for the JSON stuff; key-value stores, although they've been around since the late 80s and 1990s, but now there are key-value systems; graph databases, obviously; array databases; vector databases, of course. And MapReduce really isn't a data model, but as I said, the big data movement was also around that architecture. And we basically just say, look, here's the same thing people tried before. They make the same arguments, that SQL's slow, SQL's stupid, you don't wanna use that. And then it turns out, oh, there's actually value in using the relational model and a declarative language like SQL. And then the proponents of these other alternative data models, their systems end up morphing into supporting SQL. And actually, I would say that part is unique, because the object database guys and the semantic database people in the eighties never adopted SQL; they just died and withered on the vine. The NoSQL systems saw the writing on the wall and said, oh yeah, SQL and the relational model are actually a good idea, and they've all pretty much morphed into being something that looks like a SQL database. And so now, again, as I said, the hot thing is vector databases. You see some wild claims basically saying the same thing, like how vector databases are gonna destroy relational databases, SQL stupid, SQL slow, yada, yada, yada, right? And what will happen, of course, is that these vector databases will morph over time
and add something that looks like the relational model and something that looks like SQL, with obviously these vector indexes to accelerate things. And then likewise, SQL will evolve over time and add support for vector primitives or vector built-ins and vector functions. And you're already seeing that now: pretty much every database that's out there, I don't know if Firebolt has anything yet, but like every database has some kind of vector index.
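As one concrete example of an incumbent SQL database soaking up vectors, here is roughly what that looks like with the pgvector extension for Postgres. This is a hedged sketch: the table, the tiny dimension, the data, and the connection string are made up, and the index parameters are only illustrative.

```python
# Hedged sketch of vector search inside a plain SQL database using the pgvector
# extension for Postgres. Schema, dimensions, and parameters are hypothetical.
import psycopg2

SETUP = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS docs (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(3)            -- tiny dimension just for the example
);
CREATE INDEX IF NOT EXISTS docs_embedding_idx
    ON docs USING hnsw (embedding vector_l2_ops);  -- HNSW requires pgvector 0.5+
"""

SEARCH = """
SELECT id, body
FROM docs
ORDER BY embedding <-> %s::vector   -- L2 distance, served by the HNSW index
LIMIT 10;
"""

with psycopg2.connect("dbname=mydb") as conn:    # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute(SETUP)
        cur.execute(SEARCH, ("[0.1, 0.9, 0.2]",))
        print(cur.fetchall())
```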
Benjamin (38:34.862) So in 15 years, when we do season 27 of the Data Engineering Show, will we then have the same conversation about the past 15 years, just about vector databases? Like, do you think it's another 15-year cycle, or do you think it's becoming shorter?
Andy Pavlo (38:51.21) That's a good question. I think the cycles are probably five to ten years: something new comes out, there's a lot of excitement, and then, again, SQL expands and evolves. I would say what's different this time is, with NoSQL and the JSON databases, in particular CouchDB, Mongo, and so forth, it took a while before the relational databases, the incumbent systems, added support for JSON. Which is surprising, because they had added support for XML, and JSON isn't that much farther from it; they could have added it pretty easily. And then the SQL standard added support for JSON, I think in 2016, and I think the most recent version added it as a native data type in 2023. But it took a while for that to happen. What surprised me with the vector index stuff is, you look at ChatGPT: ChatGPT blew up November 2022, right? Like December 2022. I mean, it had obviously been around before then, but that's when everyone was talking about it. It was on the news. It was in the zeitgeist of humanity, right? And not just a tech-bro, tech-only thing. But within a year, or less than a year, of ChatGPT becoming super hot, all these incumbent databases added support for vector indexes. And I think it's a combination of two things. One, it's either because the hype was so much, like how could you ignore this, and they were just sort of trying to ride the wave.
And also, there were enough open-source tools, or vector indexes, out there that could do approximate nearest-neighbor search and were good enough that people could just adopt them, download them, and integrate them. Like DiskANN from Microsoft, or Faiss from Meta/Facebook. Most of these people didn't roll their own vector index. Now, there are some changes up above, like in the query operators, but from my perspective, it's just another index. It's not, it's not...
Andy Pavlo (40:51.562) It does not necessitate writing a completely brand-new database system architecture, right? Basically, all of these vector databases are going to do vector search faster than the relational guys, but is it fast enough, and better enough, to give them a moat to protect them from other people encroaching?
Eldad (41:12.958) There's so much value in just talking to users. The way they picture it is: look, we are using ChatGPT as part of our new data pipeline. So a user sends a query, or the UI sends a query, we get results, we need to send them to ChatGPT, and we need to get the result back. Can you do a UDF? Can you do a lateral join? And for every value, can you cache the result from ChatGPT? Those are the kinds of questions that people ask, and for good reasons; it's just like another data pipeline project for them. And it should end up as a type, an index, and an extension to a lot of good stuff that already exists within database systems. Maybe not necessarily just SQL, but I think it makes sense for users to view that as yet another type or index. Absolutely, to what you're saying. So.
Andy Pavlo (42:10.058) But I would say what the vector databases do better than most of what I've seen out there in the relational databases is that they have better integration with the AI/ML tooling, like LangChain or ChatGPT, LlamaIndex, all of that you can call directly within the database system itself rather than having to bring it back to the application or Python code. I mean, you could run a Python UDF, but as far as I know, I haven't seen any vendor providing, like, hey, here's our language integration, our UDF package, things like that. But then you also go back to the problem I was describing before: now, if you're calling a UDF, depending on where it is in your query plan, that's gonna cause problems with the query optimizer, because it's gonna see it as an opaque box. So there's no free lunch. It poses new challenges, but at the end of the day, in my opinion, I don't think it's significantly different.
Eldad (43:03.773) interesting to see how the planner evolves.
Benjamin (43:04.814) head nodding all around.
Andy Pavlo (43:08.554) Yeah, sorry. Yeah, the planner is an interesting one too, because you have the pre-filter, post-filter problem: in your WHERE clause, assuming again it's SQL, you're doing some lookup to find the ten highest-ranking matches for some embedding, but then you also want to filter on additional metadata attributes, like, find me all the people similar to this person who are above this age, or something like that, right? So when is it actually better to get the rankings and then filter? Should you filter first? And then how do you incorporate that into the search index? Pinecone has some claims about how they can integrate all that at once. Same with Weaviate, they have something as well. I haven't looked deeply enough to say what they're actually doing, whether, if it's using HNSW, that metadata is embedded in that graph as well; that part I don't know entirely. But as you said, now you have this index where you maybe can't get back a selectivity, because it's not an exact lookup; it's, hey, give me 10 things that are close enough. And so how you cost that versus another predicate is a very interesting problem. I think it's still open.
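To make the pre-filter versus post-filter tension concrete, here are the two naive shapes of that query in pgvector-style SQL strings, continuing the hypothetical docs table from the earlier sketch with an equally hypothetical metadata column. Real systems try to push the metadata predicate into the ANN index itself, which is exactly the part described as still open.

```python
# Hedged illustration of the pre-filter vs. post-filter problem for vector search.
# The docs table and its extra metadata column `age` are hypothetical.

# Post-filter: rank by embedding distance first, then apply the metadata predicate.
# Risk: the ANN index returns the 10 nearest rows, and the filter may discard most of them.
post_filter = """
SELECT * FROM (
    SELECT id, body, age
    FROM docs
    ORDER BY embedding <-> %s::vector
    LIMIT 10
) nearest
WHERE age > 40;
"""

# Pre-filter: apply the metadata predicate first, then rank the survivors.
# Risk: if the predicate isn't selective, this can degrade into ranking a huge
# fraction of the table instead of leaning on the ANN index.
pre_filter = """
SELECT id, body, age
FROM docs
WHERE age > 40
ORDER BY embedding <-> %s::vector
LIMIT 10;
"""

print(post_filter)
print(pre_filter)
# Costing which shape is cheaper is the open problem described above: the ANN index
# gives "10 close-enough rows", not a selectivity estimate the optimizer can reuse.
```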
Eldad (44:23.511) Exactly. And those kinds of problems are why, at least, I believe it cannot just stay within a vector database: either the vector database needs to grow itself, or we'll see some interesting things happening in databases in the future.
Andy Pavlo (44:38.41) But there's two paths they could go down, right? I did the Weaviate podcast and I basically told the CTO, you guys are gonna have to support SQL in like five years. And that's if they go down the path of evolving, sort of similar to how MongoDB has evolved: their hosted version, Atlas, supports SQL. They went from, oh, JSON only, JSON API, no SQL, and now they've added SQL and expanded, because people want to start hooking MongoDB up to more stuff. So they could either go down the path of becoming more relational, like the other NoSQL systems, so they can fit into the rest of the data ecosystem that's out there. Or they go down the path of still maintaining their proprietary query language and their API, sort of like Elastic, where you don't get SQL, right? Because it's not meant to be the primary database of record; it's the separate thing that you do your analytics or your searches on. So I think those are the two paths they have to choose between. And if they go down the first one, they'll have to support SQL.
Eldad (45:49.142) And the market will define it for them, right? Like, think of Elastic: they had great timing. They had those few years where nobody competed with them, nobody was interested in what they were doing, and obviously anyone that was building a database was somewhere else. So that opened up new users. And then you're saying SQL is back; at least from what we're seeing, it's like, yes, because the people who own the budget now want to write SQL. And a few years back, that was a cloud ops engineer using Node.js, and the great thing to do was to become a human optimizer and just write the plan yourself. That was it, that was the job, because their SQL database couldn't handle it. So that opened up, right, this dynamic in the market, like the stars aligned, having that opportunity open for Elastic, that crazy run, which is now obviously...
Andy Pavlo (46:25.409) Yes.
Eldad (46:46.196) much more complicated, as everyone is trying to turn every vertical into a data type, whether it's search, whether it's just another type. So yeah, it's going to be very interesting and competitive in the next few years to redefine that, even the SQL, right? Like, the barriers of entry to building sophisticated solutions on SQL are dropping with AI.
Andy Pavlo (46:56.385) Yeah.
Andy Pavlo (47:13.249) Yes.
Eldad (47:14.386) There's no need to reinvent the universe. Just plain dead simple: generate the boilerplate to parse JSON, right? It's an ugly task that someone would spend a few days on, and there were forums and people spending time on it, and now it's done, over. So yeah, nice.
Andy Pavlo (47:35.169) So the Weaviate guys make an argument which, if I take their perspective, makes sense: they don't see themselves necessarily having to support SQL, because you can just put a transformer in front of it that rewrites SQL into whatever API they have. So they don't really need a parser or anything like that. We'll see how that plays out. But that was their perspective: SQL to them is just another interface; it could be natural language, could be SQL, could be whatever you want. But we'll see.
Benjamin (48:07.502) LLMs all the way down. I think those are some great closing thoughts, Andy. Thank you so much for being on the show. It was a pleasure having you. Good luck. All the best both to industry Andy and academia Andy. So we're rooting for you. Yeah. Thanks for being on.
Andy Pavlo (48:09.345) Yes.
Andy Pavlo (48:22.656) Yes.
Andy Pavlo (48:26.239) Hey, thanks for having us, this was fun.
Benjamin (48:28.078) Awesome.