The Resurgence of SQL: Insights from Ryanne Dolan from LinkedIn
October 2, 2024

In this episode of The Data Engineering Show, Ryanne Dolan from LinkedIn joins the Bros to discuss LinkedIn's Hoptimator project. Ryanne explains how they’re simplifying complex data workflows by automating them through SQL queries, integrating Kubernetes, Kafka, and Flink. The conversation highlights the shift towards a consumer-driven data model and the future of data engineering.

Listen on Spotify or Apple Podcasts

Transcript:

Intro/Outro (00:00:04) The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for low latency analytics. Get $200 credits and start your free trial at firebolt.io.

Benjamin (00:00:15) Hi, everyone, and welcome back to the Data Engineering Show. It's been a while, but we're back today and with an amazing guest, Ryanne Dolan. Welcome to the show. So good to have you.

Ryanne (00:00:25) Thanks. Happy to be here.

Benjamin (00:00:26) Ryanne is a Senior Staff Software Engineer at LinkedIn, where he'll have been closing in on three years by the end of this year, at least, and he's also outspoken in the community, giving talks and thinking a lot about Kubernetes, databases, and a bunch of things we want to chat about today. Ryanne, do you just want to quickly introduce yourself and talk about your story?

Ryanne (00:00:47) First off, I started at LinkedIn almost exactly 10 years ago, but I left and came back. So I'm one of those big tech boomerangs that everyone wants to be. I think that lends me sort of an interesting perspective, right? Because I've been around for long enough that I've seen the company sort of grow and change. And that's kind of what I want to talk about today, you know, the industry trends that I've seen evolve and change over time. The Kubernetes-as-a-database talk is one such thing, something I couldn't really imagine talking about 10 years ago. Things have changed so quickly. And I joined LinkedIn, like I said, 10 years ago from an acquisition. And I stayed for four years, left to join Hortonworks, Cloudera, and Twitter, and came back to LinkedIn. So that's sort of been my growth story.

Benjamin (00:01:33) It's going to be a fun show because I'm like a database internals guy, right? So when I prepped for the show, it was like, oh, databases and Kubernetes, let's have some fun. I want to learn more about this. In the database community, it's the other way around. People are like, oh, databases should start replacing file systems, operating systems, your network stack, whatever. So looking forward to this.

Eldad (00:01:54) Kubernetes is just the compute of a database, right?

Ryanne (00:01:57) I would say everything could be a database. I mean, if you think about it, a database is just built on, what, spinning rust. So at some point, it shouldn't be surprising that you could build a database on practically anything, right? And sort of the trendy thing these days is, yeah, pick your most reliable, easiest, cheapest way to store data and then build a database on top of it, right? If that's S3, go for it. If it's your favorite blob store, go for it. And I think that makes a lot of sense. And with Kubernetes, it's so ubiquitous, right? It's obvious that Kubernetes is there. You just take for granted that it's there. It's the computer. You might as well think of it as hardware, as RAM or CPU, part of the stack, at least at some companies. So if you just take for granted that Kubernetes is there, it starts to make sense as sort of the operating system of the cloud, the big distributed computer sort of thing. So, yeah, it shouldn't be that surprising that it's easy, or at least interesting, to build a whole database right on top of that as a sort of obvious substrate.

Benjamin (00:02:59) So you already hinted at it in your introduction, right? One thing that's on your mind is these trends over the past 10 years. And having been at LinkedIn back then and being there now, that's something we'd love to learn more about. Give us your take on that. What data challenges do you have today that you already had years ago? Which ones are much better? Would love to hear more about it.

Ryanne (00:03:20) I think a big one, like I noticed this right away, almost day one when I came back and it absolutely shocked me. And I'm not sure it's industry-wide, but my guess is this is industry-wide. So I sort of like stepped out of the industry and sort of focused on startups and enterprise-y stuff for a while and then kind of got back into big tech. And when I came back, SQL was popular. Right? Like when I was in, you know, undergrad, grad school. You sort of learn about SQL and it's like, okay, it's this way that you talk to databases, but no one uses those things.

Benjamin (00:03:47) Old guy on the corner, like, what, who is this? Is it web scale?

Ryanne (00:03:51) Exactly. Relational was like, you know, something from mainframes, old and slow. But yeah, when I came back to LinkedIn, I was literally shocked to find that there were these projects which were 100% SQL. Like there would be repos that were just hundreds of files, thousands of lines of SQL, which, first of all, I didn't even know was a thing. My experience of SQL at the time was: here's a query, here's a response, here's a query, here's a response. I never thought of it as a programming language, as a way to do this data engineering concept of moving data around and manipulating it and making models and all that stuff. Completely foreign to me. So I think that's been the biggest change that I've seen, right? Suddenly SQL is popular. And in that same vein, it used to be that relational databases were considered slow, unscalable and great for enterprise use. You know, I used to think, right, that if you use a relational database, you're talking about maybe a hospital, right? You're not talking about Twitter, right?

Eldad (00:04:51) The worst part of the hospital, of course. Exactly.

Ryanne (00:04:54) Exactly.

Benjamin (00:04:55) CrowdStrike and the database two things.

Eldad (00:04:58) SQL Server 6.5 running on something. That's what we all imagine when we think databases, or at least that was the image at some point.

Ryanne (00:05:07) Absolutely. And that was my prejudice. That was my sort of, you know, schooling, my training. I mean, as you progress through undergrad, grad school, early career, you're sort of trained to focus on scalability, right? Like my first resume, at the very top, said I'm interested in scalability. Like that's what you had to say. So that's always front and center, especially breaking into big tech: everything has to be scalable. So it was my assumption that SQL was intrinsically not scalable, relational databases were intrinsically not scalable. And fast forward years, I feel like that's really been challenged. There was a sort of window of years when all these big companies were building their own NoSQL databases, basically on the assumption that SQL was fundamentally broken and fundamentally not scalable. At Twitter, we had Manhattan. At LinkedIn, we have Espresso. Most big tech companies have multiple such databases that they built for different use cases. But we're getting to the point where there's so much benefit from the relational model, so much benefit from SQL, that any sort of performance penalty is probably worth it. Right? So I think a lot of that's being driven by the AI revolution, where fundamentally the value of writing code is diminished. You want to do as much work with as little code as possible, because if you have a block of code that you want to generate, just have ChatGPT do it, right? People don't have to get paid to do that anymore. So people are trying to climb on top of this high hill, this pile of stuff. And if you climb high enough, you get to SQL. Right? SQL is this abstraction layer that microservices, data pipelines, caches, all these things that we used to build by hand, kind of fall out of. Like, let's just abstract this as SQL. So, yeah, that's been a big change. Suddenly SQL is popular and suddenly NoSQL is not popular. Right? We don't need a specialized database just for a key-value store. You can build, you know, a bunch of stuff. You can build distributed SQL, relational databases with real constraints, real consistency. And you get value out of that, which, of course, we knew in the 1950s. But it's something that we forgot about, I think, for 15-20 years.

Eldad (00:07:23) But it's all Silicon Valley's fault. See? So as you describe that history and evolution, it all happened for a good reason. It all happened because databases reached an endpoint. They sucked. They just couldn't scale to cloud scale. And all the talent moved to big tech. And big tech could not wait for, or depend on, a third party. So they just started to build and build and build for the purpose of releasing features. And then that whole open source thing came to be. Then it took 15 years for everything to converge back to normal, realizing that SQL is just marketing. It's just a language. It's not even a real language. It's just an intermediate language, as you've mentioned with AI today. And research and startups and companies in databases caught up. So as you said, you get consistency at scale. There's so much innovation happening way below the SQL. But you're right. The biggest advantage of SQL is the optimizer that runs it. And a smart optimizer running on a great stack, a scalable stack, can do a lot. And that's kind of you coming back to LinkedIn. It's crazy. Because LinkedIn was one of those companies. Right? Hadoop, and all of that evolution. NoSQL. Right? Hadoop, NoSQL. Strata takes me back to those days. So now everything is mixed, and just like in fashion, retro comes back. There was always good retro. And SQL is retro, I think. So I'm very happy about it.

Ryanne (00:08:48) It's also sort of this old culture of just giving raw materials to developers. You know, it used to be engineers just thought, okay, give me a bunch of spinning disks, give me a bunch of SSDs, and I'll build my database, I'll build my product. Right? Give me enough RAM and I can do anything. And I think AI, to some extent, has lessened the value of just writing code for code's sake, just building things for building things' sake. And it's more like we want to build something as easily as possible. We want to lower the friction. We want to make something quickly. We want to build this new pipeline in seconds, not quarters. That shift has made SQL sort of an obvious choice where it wasn't before. I think that's even a culture change for individual engineers. I want to write one line of SQL and get something running. I don't want to spend a quarter building my own database from scratch with my own REST APIs from scratch and all this stuff, which is, again, sort of mind blowing looking back. Because when I was in college and looking for my first jobs, that's what the job was. The job was: here's some raw materials, go build a scalable thing. And I thought my whole purpose of being was to build the most scalable thing, purpose-built for whatever project I was working on. And now, in the big tech industry, we're talking about doing more with less. Every company wants to do more with less. And I think that's an easy way to phrase what's happening. But it's really about wanting to accelerate the pace of accomplishing our next goal, because AI especially is moving so fast. Right? We can't even fathom the next product that we want to work on in a couple of quarters. None of the stuff that we're working on today could have been fathomed five years ago. Or even two years.

Eldad (00:10:34) The only thing that will survive in five years, which is like infinity for AI, is SQL. It's like everything looks different, but for some reason, AI sticks to SQL.

Ryanne (00:10:45) I'm telling you, AI loves SQL. There's a huge corpus of SQL code floating around. And it's easily parsed and easily understood. And Benjamin, to your point, you don't even have to be that good at writing SQL because there's the optimizer. There are decades of optimizer improvements. So if you have AI plus SQL plus a great optimizer, you can actually get performance out of very little human input. Right? So I think that's fundamentally why things like SQL are becoming popular, things like databases on top of everything, the relational model becoming interesting again, and not just building stuff out of raw materials like we used to do when I started years ago.

Eldad (00:11:27) I will tell you something that maybe will blow your mind a bit. At Firebolt, even compute is part of SQL, meaning where you're thinking Kubernetes, your units of compute, whether it's for a query or a workload, whatever, are part of the language. So we are so behind SQL, and not just us, everyone from the database community is so behind SQL, that it's not just about expanding the type system.

Benjamin (00:11:51) Eldad is getting so worked up about SQL. He's like throwing his microphone around.

Eldad (00:11:56) Exactly. I'm so excited about SQL. Finally, for years I've been here. Thank you for sharing the love for SQL. And I think Benjamin agrees as well.

Benjamin (00:12:06) Totally agree. And I think it's funny, right? Because we went through this NoSQL trend with the industry scrambling around, figuring out, hey, I need to take things to crazy scale, what technology can actually cater to that? And in a sense, you're seeing a similar scrambling now around AI, right? Everyone wants to build AI-enabled applications, and then you have a million different ways popping up on how to do that, right? Dedicated vector databases, other types of systems, whatever. And as a relational database kind of guy, I of course think, okay, it's all the same thing all over again, and it will all be relational databases in the end. So love that you're seeing it the same way, Ryanne. If we look at the data stack that you're using today at LinkedIn, take us through it. What are the actual technologies you're leveraging every day?

Ryanne (00:12:55) Oh, I mean, I can't get into too many details, but it's a typical stack. So we are moving to Kubernetes, we've talked a little bit about that. We are in the migration phase right now, where we kind of have an old stack and a new stack. And the new stack is very much Kubernetes based. We sort of built things on top of it, which is pretty typical. We of course use Kafka all over the place. We of course use distributed SQL all over the place. But yeah, it's hard to talk about a stack when you're at a company like LinkedIn, because it's been around for long enough that there is no stack in the sense of, like-

Eldad (00:13:30) Engineering is the stack.

Ryanne (00:13:31) Yes, exactly. We're not talking about LAMP, you know, we're talking about a bunch of different teams making their own choices, right? There are some commonalities, but I think generally the commonalities are more out of convenience than constraint, in the sense that Kafka is there, so people will pick it up and start using it. No one's saying we have to use Kafka. So yeah, the stack is sort of nebulous and hard to define, which is not a very gratifying answer, but it's the reality.

Eldad (00:13:59) But all the complexity is behind SQL.

Ryanne (00:14:02) So that's sort of what I'm working on, actually: trying to roll up all this complexity that's been human made, you know, taking quarters of engineering effort, and rolling it up into abstractions that we can easily just express in SQL. So I spent many years working on data pipelines, building, you know, streaming data pipelines, batch data pipelines. And it turns out, if you zoom out, all these pipelines look pretty much the same. And I think Flink especially has opened a lot of people's eyes to the reality that very complex work, or what used to be very complex work, can be done in a few lines of SQL. I remember, like, 10 years ago, I built this sort of multi-stage pipeline where each stage was an individual Samza job. One of them, all it did was consume records and repartition them. And then this other one was, you know, adding a little bit of data, maybe adding a column to a row. Each of these stages was hand-written in Java using structs, essentially, right? Not even any sort of relational concept, just: here's your class, serialize it, deserialize it. And then each of these things had their own metrics, their own on-call responsibility, all this stuff.

Eldad (00:15:16) ORM. That's kind of when you know you've went.

Ryanne (00:15:18) Exactly. This was all very hand-built. That's sort of the theme I'm going after here, right? It's hand-built stuff that scaled really well, or at least felt really low level even at the time. But yeah, if you zoom out and you think about these multi-stage pipelines that took quarters of engineering effort, they're essentially just insert into ... select from. So much of what we built years ago was one-liners in SQL, especially with something like Flink. And then another thing: LinkedIn traditionally has had, and I think this is pretty common, two sides to the company. There's been the offline grid side, where we did things like Spark, Pig, all these batch things on Hadoop. And then there's the streaming side of the company, which is Kafka, Samza. And there was almost no commonality at all between these two sides of the company, not from an organizational perspective or a tooling perspective. And I was a tech lead many years ago on a team that owned a bunch of pipelines. And we literally split the team down the middle and had engineers focused on building the batch version of the pipeline and the streaming version of the pipeline. And it was different languages, different tooling, different experiences, different skill sets entirely. And yeah, again, zooming out, all of that work could have just been select star. That's essentially what we were doing, right? So that's what I've been working on recently: trying to roll up those complex tasks that we used to build piecemeal and get them to the level where we can actually just write insert into select star and actually get that machinery running. Again, I'll point to AI as sort of a motivating factor here. If you can express "I want a data pipeline" to AI, and it generates some SQL, and that SQL generates all the machinery, all the databases, all the web services, all that stuff. I don't think we're far away from that level of complexity under the hood, where the actual developer experience is plain English. It takes seconds, not quarters.
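
As a rough sketch of what that collapse looks like, here is a hypothetical Flink-style SQL statement. The table and column names are invented for illustration, not LinkedIn's actual schemas, and the tables are assumed to already be registered in a catalog (for example, backed by Kafka topics). What used to be separate hand-written Samza stages (consume, enrich, repartition) becomes one statement:

    -- Hypothetical sketch; raw_events, member_profiles, and enriched_events
    -- are assumed to be catalog tables, e.g. backed by Kafka topics.
    INSERT INTO enriched_events
    SELECT e.event_id,
           e.member_id,
           e.event_time,
           p.country          -- the "extra column" a separate stage used to add
    FROM raw_events AS e
    JOIN member_profiles AS p
      ON e.member_id = p.member_id;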

Benjamin (00:17:17) That's crazy. So one of the open source LinkedIn frameworks you're contributing a lot to is this Hoptimator framework. Does this kind of tie into the bigger story here?

Ryanne (00:17:30) Yeah, absolutely. I mean, Hoptimator is sort of an experiment, and it's right in the vein I just described, where we take literally just an ANSI select query. If you think about SQL, the simplest, narrowest definition of SQL is like an ANSI select query. That's it, right?

Eldad (00:17:49) Select one.

Ryanne (00:17:50) Yeah, exactly. We're not talking about DML. We're not talking about all this fancy stuff. We're just talking about selecting data, and then having a pluggable catalog of: how do I get data from Kafka? How do I get data from Espresso? How do I get whatever. And then just presenting that select as the user experience. The user gives us just the data that they want. And from that literal query, we can stand up Flink jobs. We can set up materializers. One of the interesting things we do internally is wire up all the ACLs. If there's a new Flink job running, from day zero that Flink job doesn't have access to anything. The traditional way of building a new service is to send a bunch of emails around, like, can I have access to your data? And we've gotten to a point where it's all automated. So when a new stage of a pipeline is deployed for the first time, it'll actually go and request ACLs and get permissions and all that stuff, totally automated. So that's what we've built with the Hoptimator project. It's all about: what can we get into production from just a select query? And it ends up being quite a bit. There's sort of this easy mental leap from a SQL query to a Flink job, but it's not just a Flink job. It's all the surrounding things that that Flink job has to touch. We'll go and create Kafka topics. We'll go and, you know, request ACLs and set up schemas and all that stuff. The reason it's called Hoptimator, for anyone who hasn't heard of the project, which I imagine is almost everyone, right, is that the idea behind it is automating these multi-hop data pipelines. So we're not just talking about one-off jobs. We're talking about sequences of jobs, complex workflows, streaming workflows, and even batch and streaming at the same time, right? Going back to my earlier example, we had a team that was just split down the middle. This is the batch side, this is the streaming side, completely different tooling. Your job is to do bootstrapping every day, backfilling every day, and that sort of thing. And your job is to stream the data as quickly as possible. Again, both of those roles could be expressed as: what is the data that the end user wants? The end user just wants this data. So can they write a select query? And if they can, then we should be able to automate the process of standing up all the stuff that we built by hand, the streaming ingest, the backfill, you know, all this stuff.
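
As a hypothetical illustration of that consumer-first interface, a pipeline request could be as small as the query below. The catalog and table names are invented for this example and are not Hoptimator's actual syntax; the idea is just that the consumer states the data they want, and the control plane plans the hops, creating topics, requesting ACLs, registering schemas, and deploying the Flink jobs in between:

    -- Hypothetical: the consumer expresses only the data they need.
    -- "kafka.member_status_updates" is an invented catalog/table name.
    SELECT s.member_id,
           s.status,
           s.updated_at
    FROM kafka.member_status_updates AS s
    WHERE s.status IS NOT NULL;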

Benjamin (00:20:01) This is super cool, but it's also kind of meta, right? Because really what you're building is part of a database now, because you're taking an input query and you're actually figuring out, hey, what steps do I have to perform to serve that SQL query at the end of the day, in terms of splitting it into multiple pipelines? At the heart of this is understanding SQL and optimizing SQL; it's just that in the end, in your case, you have this multi-step pipeline instead of, okay, your traditional operator graph that you then plug into your runtime.

Eldad (00:20:32) But there is a problem, Benjamin. Unlike what you're imagining in your head, which is the optimizer and the runtime playing well together, here it's kind of reversed. First you build a complete system that has zero awareness of SQL. Zero. It never knew SQL. And then afterwards you're coming and saying, oh, we're going to wrap that in SQL. And it's going to be correct, and it's going to be consistent to a degree, hopefully. But then when you run your select, it spins up a Kubernetes job. It connects to Kafka. It goes out and catalogs the schema. It finds the bucket. Security is a nightmare. Just think about security. But if that actually works, if that works, Ryanne, then there is a future to engineering. We can reuse and recycle all of that without rebuilding. That's important, and that's actually super fascinating.

Ryanne (00:21:29) That's exactly the idea. I mean, basically the essence of the project is: okay, we built all this low-level stuff, and it's not going anywhere overnight, right? You can even consider them legacy systems to some extent, but they've been around for so long that they're not going anywhere overnight. What we want to get to is this relational, SQL-driven world. So can we just layer SQL on top of it? Right? And again, it's sort of surprising that that's possible, at least to some extent. Because again, at the end of the day, a relational database is just spinning rust, right? You're layering something on top of unreliable, inconsistent metal.

Eldad (00:22:08) But Ryanne, just one warning. Don't connect Looker to that system. Every dashboard refresh, every BI dashboard refresh, will spin up those things. You'll need an admission controller and a resource manager. But that's the beauty. It could actually work.

Ryanne (00:22:26) Yeah, and what we've been doing so far, I call this out in my original design doc. It's like, you know, we could do this. The risk is, okay, people are just running random queries and we're like standing up, you know, like big tech level jobs that are like thousands of machines in service of their requests just because the optimizer is like, this is how I get the result in this amount of time, right? So that is a real risk.

Eldad (00:22:49) Just knowing what people need, having that interface in one place, you have a single version of the truth, not of the data, but of the queries. So you get the query history of all the outside systems, from people you don't know about. And you get this one place that tells you: this is how people interact with our legacy, with our huge, complex system. It's beautiful.

Ryanne (00:23:12) Yeah, I'd actually love to drill more into that concept, because what you just alluded to is something I call producer-driven versus consumer-driven. I've read about this a couple of times, and it's, I don't know if it's trendy, but it's sort of obvious that you want to enable data producers to produce data. Everything we've built in the history of big tech, and I think the industry at large, has been to enable producers to produce more data. You have data products and things like that, which are all around: okay, you're the owner of the data. You're producing the data. You have the responsibility, you own the schemas, you store the data, you build the APIs. That whole model works to a point. But what you're alluding to is that what you really need is to know what all the consumers need. It doesn't actually help if everyone just throws data all over the place.

Eldad (00:24:04) It won't work. We're not producers by nature. We're consumers by nature.

Ryanne (00:24:08) Exactly. But in development mode, when it's your job to own data, your whole mental model is to be a producer. I want to build this new API. I want to store this data somewhere. Your whole mindset is to produce data, and so we've built things around that structure.

Eldad (00:24:22) But look what happened to you. You turned it into a production system. So someone says, oh, I'm a producer creating a data pipeline. But this is a select query. Yes, the select query might generate a billion records as a report, but it's still a select. So producing data is actually running a select, versus generating a data pipeline with a lot of complexity that ends up, as you said, as an isolated mart.

Ryanne (00:24:44) Yes. Exactly. What I want to do is turn that model on its head. What you want to do is enable your consumers to express what data they need. And then the whole producer aspect largely goes away. And, you know, if you zoom out, in the perfect world you have these source of truth databases that someone has to produce at some level, because otherwise you don't get any data. You have to have producers somewhere. You want to have a smallish number of high quality source of truth databases. And then you don't want to use that same technology, the same level of effort, to build all of your data products, for lack of a better term, where all of your data engineers are building all these components that all have their own schemas and their own databases and their own SLAs and all this stuff. What you really want is your source of truth databases with this high quality, and then you have your consumers, which can be front ends, can be mobile devices, can be other pipelines. And if you can wire up the consumers' requirements to what the producers have and spin up pipelines automatically, then you eliminate a whole lot of engineering. You don't get into this thing where, especially in a streaming sense, you end up with Kafka topics on top of Kafka topics on top of Kafka topics. They just keep going forever and ever and ever. No one ever says, let's delete this Kafka topic and refactor this into one. You know, let's take these and collapse them. No one ever does that. It just keeps building and building and building. But if you zoom out and say, who are the end consumers? Well, you know, I have one application that needs status messages. Right. And then everything in between can theoretically be auto-generated from these sources of truth to the consumers. And questions like, well, should we use Kafka? Should we use a database? Should we use key-value stores? Should we use a relational database? Those are all optimization problems that a relational database could solve.

Eldad (00:26:33) They could be abstracted from the user and handled at the back end, because it's about optimizing data access. So there are many ways to solve it and many products. But the abstraction with SQL, moving from a producer mindset to a consumer mindset, is super, super interesting, and it feels like that's exactly where the market is heading. And it's actually really nice to see it also heading there at those huge, impossible-to-change tech companies like LinkedIn, which is amazing.

Ryanne (00:27:01) I mean, it'll take time, but we're getting there. Yeah, we have some use cases in production at this point. I'll say that Hoptimator is largely experimental. It sort of has always been experimental. But we've taken some interesting learnings from it, and it's actually starting to deliver some real value. So we'll see.

Benjamin (00:27:17) Super exciting. And then to close out this conversation, I guess this is also where it all connects to Kubernetes-style control plane things, right? Because in the end, you just write SQL and then a tool like Hoptimator figures out the rest. And it becomes a bit like a Kubernetes operator in a sense, which knows how to spin up compute and fulfill your request. So it's a bit of a reconciliation loop and so on, I think.

Ryanne (00:27:43) It's funny, if you come at a control plane problem with experience that sort of skipped over Kubernetes, you know, you're talking about people who may be a decade out of phase with me, right? You hit the industry when there were relational databases and things, and maybe there was Kafka, before Kubernetes. If that generation of engineers thinks about this problem, you're just talking about data pipelines again. You're just talking about: here's a bunch of metadata, and you're trying to get to the next state, right? What you're describing as a Kubernetes operator is just a transformation: this is the input state, this is the output state. And if you think about Kubernetes as just a metadata store, and you think of operators and controllers as just these things that take an input state and produce an output state, which is largely what they do, right? Not entirely, but largely what they do. Then you can solve that problem in many ways. It's just a complex state machine. It's a workflow. It's a function from inputs to outputs. Incidentally, with the Hoptimator project, I kind of backed into using the operator and controller model that's native to Kubernetes to solve this problem. It wasn't the only way to do it. But again, we're using Kubernetes. Flink is running on Kubernetes at LinkedIn. So it's a natural fit. So the sort of weird thing we arrived at was: we've got SQL at the very top, and when you write the SQL query, a bunch of Kubernetes YAML gets generated. So we have SQL at the top, YAML as sort of an intermediate layer, and then a bunch of controllers go and, you know, spin things up. And it ends up being really easy to throw database tech at it, something like Apache Calcite is what we use, right, to turn SQL into YAML. It's actually not hard at all. And so that's sort of the core of what Hoptimator is. And the Kubernetes-as-a-database talk sort of, you know, spun out of that. Like, if we can take SQL and turn it into YAML, why not just treat all Kubernetes resources as SQL tables? Why can't I select star from pods and get all my pods? And it turns out you totally can. And is that valuable? I'm not sure. But it's definitely cool and interesting.
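
As a hedged sketch of that "select star from pods" idea, the queries below show what exposing Kubernetes resources as relational tables could look like. The table and column names are illustrative only and do not reflect an actual Hoptimator or Kubernetes schema:

    -- Hypothetical: Kubernetes resources exposed as SQL tables.
    SELECT name, namespace, phase
    FROM pods
    WHERE namespace = 'flink-jobs';

    -- Hypothetical aggregation, e.g. pods grouped by owning deployment.
    SELECT p.owner AS deployment, COUNT(*) AS pod_count
    FROM pods AS p
    GROUP BY p.owner;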

Eldad (00:29:51) It's valuable if you force everyone to use only that to access Kubernetes.

Ryanne (00:29:56) If the only language you've got is SQL, then it's super valuable. And, you know, it's probably beneficial to stay away from YAML as much as possible. So if you give developers SQL as an authoring experience or a querying experience, instead of having them write YAML and, you know, checking in YAML and deploying YAML scripts and all that stuff, I think it's really cool. I think it actually might be a better long-term strategy. But yeah, it's sort of a happy accident, right? We were playing with Kubernetes and YAML and SQL, and it's like, we can translate between these different media, so why not treat Kubernetes as a database and a metastore? And that's basically what we do with Hoptimator. Again, we take SQL, we generate YAML, and then the Hoptimator project itself doesn't know anything about Flink. It doesn't know about Flink syntax; it just knows how to generate YAML that the Flink operator can then pick up. You can extrapolate from there. If any system has an interface exposed as a YAML spec, then it's easy to wire up: okay, this sort of table, this sort of pipeline, is implemented by this sort of YAML, and then just leave a controller to actually make that happen, to manifest those actual pipelines or those actual tables. And so that's the direction we've headed with the Hoptimator project. And the Kubernetes-as-a-database thing is just funny. Like, let's take this to the absurd level: what if we just store metadata there? What if we just query pods and query deployments and everything as a table? It's kind of a fun thing you can do, which I'm not convinced is useful, but it's definitely interesting.

Eldad (00:31:26) Everything can get into information schema. You'll be surprised. Any metadata can be squeezed in there. Information schema is endless. That's how it can expand forever. If you look at the typical database, yeah, it's that place where everyone throws in all those weird tables that usually you don't want to scan, but you need to scan to understand the system from time to time. So having Kubernetes as a first-class information schema citizen would be interesting to see how that rolls out and what people will do with it. Super interesting, really. Super exciting.

Benjamin (00:31:58) Agree. Thank you so much, Ryanne, really, for taking us through this journey from the evolution of data pipelines to data at LinkedIn, Hoptimator, and then Kubernetes. Really awesome work. Love that you're such a SQL guy. Any closing words from your side?

Ryanne (00:32:14) I don't have any top-of-mind thoughts to close with, other than I'll correct you: I haven't traditionally been a SQL guy. I've sort of come around to it. So I cannot say that I've always been right on this point. Like, I cannot say I told you so.

Benjamin (00:32:29) Real growth is realizing your past mistakes and then learning from them. Awesome. Thank you so much for being on the show. It was an absolute pleasure having you, and I'm excited to see where Hoptimator goes.

Ryanne (00:32:43) Pleasure. Thanks so much, guys.

Eldad (00:32:45) Thank you.

Intro/Outro (00:32:48) The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for low latency analytics. Get $200 credits and start your free trial at firebolt.io.
