In this special roundup episode of The Data Engineering Show, the Bros revisit some of the best bits from episodes with data thought leaders Zach Wilson, Matthew Housley, Joe Reis, and Krishnan Viswanathan, spotlighting essential trends and lessons learned across the evolving data engineering landscape.
The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for low-latency analytics. Get $200 credits and start your free trial at firebolt.io.
Benjamin - 00:00:10: So on my end, I'm just kind of like a C++ database nerd. I care about how to build concurrent index structures. I don't know, like how to build a fast multi-threaded join and so on. And whenever I look at data engineering, right, and I interface with a lot of data engineers, it kind of always seems really daunting, just because the breadth of the field, in terms of just the amount of different technologies you have, seems so crazy, right? So how do you decide what to focus on in terms of actually, yeah, teaching people skills that are valuable for their career?
Zach - 00:00:52: Like for my bootcamp, for example, right, there's six weeks. Only two of those weeks are actually tech specific. The other four weeks are not tech specific. They're tech agnostic. Because I think there's a couple things and a couple philosophies that are really important in data engineering that apply regardless of if you're using Spark or Snowflake or Databricks or Presto or Flink or whatever tech you want to use. And there's a couple of them. One is around data modeling, how to do proper data modeling for dimensions and facts and how to really get those things compacted down. And there's a lot of trade-offs in that space. It's not as much science, it's a lot more art, and you have to understand your consumers. You have to have that empathy. And that part is powerful. Like for example, for me, when I was working at Airbnb, I would say 80 to 90% of the impact I had was going to be in two buckets, right? It was in the leadership bucket of inspiring other people and helping them grow. And the other one is data modeling, making robust data models that can then be used by a large number of people downstream. What wasn't as important was how good I was at Spark. I mean, I look at Spark as more of a means of accomplishing something. It's just one kind of path forward. It solves the problem and maybe it can be a little bit faster. But really, this is the fundamental thing that I think a lot of data engineers need to remember: the product that you sell is data. It's not a pipeline. A pipeline can help, right? In terms of maintenance and pain and suffering. Like if your pipeline sucks, then the data is going to be annoying. But generally speaking, the value you're providing is in the data sets that you provide.
And if those data sets are not modeled properly, then that's where you can have a lot of unnecessary cost, right? In big tech companies, these mistakes actually cost them millions and millions and millions of dollars a year. Because if you don't model things the right way, then downstream, the compression doesn't work the same way. And then it can blow the data up again, right? And there's a lot of interesting, tricky things that I've noticed with how data modeling works. So that's one. I'd say another kind of tech agnostic thing is around data quality, right? And again, there's technologies here like Amazon Deequ and Great Expectations. And there's going to be 10 trillion more coming. But it's more, again, around how to test data. Is this quality? Is it not quality? How to validate it, right? That's very agnostic of the tech that you're using. And you should definitely be able to do that. And then the last bucket of things that are tech agnostic is storytelling, right? Can you tell a compelling story? Can you make some cool charts? Can you persuade people to give you the time to make this data? And other things like that, because there's the before-data story, and then there's also the after-you-have-data story. And both of those stories matter. And being able to construct those narratives in a compelling way, persuasion, is very important. And then the tech is the last one. In some ways, I think it's the last one but also not as important. But it's tricky, because, and this is the thing I hate about industry in some regards, if you go into an interview, right, 80% of the questions are going to be on Spark or Flink, like, oh, do you know this very specific minor detail about Spark? And it's like, dude, this doesn't matter that much actually in the end.
But that's how things are tested, right? And I hope that industry changes in that way.
Benjamin - 00:04:49: Gotcha. So, I mean, looking at this bootcamp, because you're framing it in that context, kind of you said in the beginning, it was like the goal is kind of from good to great. So usually like those will be people who already know Spark, right? Kind of know how to maybe like write Scala code for their Spark stuff, like those types of things. On the other end, and it's hard to say from good to great, is like, okay, you have someone, right? Maybe that, okay, not the influencer with 25,000 followers talking about data engineering, but just someone wrapping up high school who wants to get into data engineering. And if you have to pick up some technology, right? And like there, it just seems kind of like daunting to in a sense, like make that choice, right? Kind of what horses do you bet on? Kind of with what do you get started to actually start with that career?
Zach - 00:05:33: I mean, I totally agree. I think it's similar to, like, so I'm a pretty like athletic, sporty guy, right? And like one of the things that I remember as a kid growing up, my parents were always like, you got to do sports, right? And then I was like, okay. And like, then I tried a bunch of, I tried like soccer, I tried basketball, I tried baseball, I tried like all these different sports that like were all interesting and different. And like, one of the things that I learned about it was, especially going through that process was like, yeah, you just got to pick one and be like, I'm going to get good at this one. And for me, that was basketball. And I mean, I got lucky, I'm tall or I'm 60. So basketball was the easy, obvious choice. And that's one of the things that's tricky about tech sometimes is that the choices aren't so obvious, right? They're not like, oh yeah, this one is 6'7'' and this one's 4'2". So we should go with a taller one or whatever, right? It's not that like obvious a lot of the time. And so I think there's kind of a couple pieces there, like on like how to pick like technologies is one is going to be like, okay, what do you see on social media? Like, I know that that's like, I'm not going to say all social media because I still don't trust TikTok here. But on LinkedIn, like if you have enough of a network on LinkedIn, do a poll. Polls are broken on LinkedIn. Like if you do a poll on LinkedIn, like even if you have no followers, it's going to be seen by like 10,000 people because polls are broken and they're very good. And the reach they get is too good. And so you can learn, you can ask, right? And I found that there's like two or three really high fidelity sources of like, where to get like good information. You have LinkedIn. LinkedIn doesn't give you... The thing about LinkedIn though, is it doesn't give you very... It's not as good about negative feedback. Like... 
So if you're asking someone, like, hey, say why my video sucks, most people aren't going to do that in the comments section on LinkedIn because they're like, I don't want to look like an asshole. And so that's one. Reddit. Reddit's better for that. Reddit's almost too good for that. If you want people to tear you down, go to Reddit. Reddit, or Blind's even better if you really want to get... But in those places, you can also get more guidance on, okay, should I learn these techs or these techs, right? And I think really the big things are going to be just learning the languages first. SQL, Python, right? If you really want to break into DE, learn the languages first and then you can learn the tech after that, actually. Because if you just do Postgres to learn SQL, just a very basic database, then you can build into the more complicated ones like Snowflake or Firebolt or Spark or whatever you want to use, right? And then you can kind of go into the cloud. And I found that building more locally first and learning languages that way works. I've had more success with some students teaching them that way, as opposed to being like, okay, here's how you set up an EC2 instance on AWS, and now you have a computer in the cloud and it's going to crunch the data for you. A lot of that feels like a lot more complexity that they don't really need yet, until they have more confidence in their own skills.
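As an illustration of the "build locally first" advice above, here is a minimal sketch of practicing SQL from Python with no cloud setup. It assumes Python's built-in sqlite3 module as a stand-in for the Postgres install Zach mentions; the bookings table, its columns, and the sample rows are hypothetical and not from the episode.

```python
import sqlite3

# In-memory database: nothing to install, nothing to clean up afterwards.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (city TEXT, nights INTEGER)")

# Hypothetical sample rows, used only for practice.
conn.executemany(
    "INSERT INTO bookings (city, nights) VALUES (?, ?)",
    [("Lisbon", 3), ("Lisbon", 2), ("Oslo", 5)],
)

# A basic aggregation: number of bookings and total nights per city.
rows = conn.execute(
    """
    SELECT city, COUNT(*) AS bookings, SUM(nights) AS total_nights
    FROM bookings
    GROUP BY city
    ORDER BY city
    """
).fetchall()

for city, bookings, total_nights in rows:
    print(city, bookings, total_nights)

conn.close()
```

The same SELECT statement should carry over to Postgres largely unchanged, which is the point of learning the language before the platform.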
Benjamin - 00:08:45: Looking at SQL databases, I guess one of the good things, looking at the space right now, is that so many systems are actually kind of converging around the Postgres dialect. It's not like you need to learn seven different flavors of SQL or window function syntaxes, whatever. Actually, a lot of systems tend to be at least more similar today than maybe they used to be some time ago. So that's super cool. So thanks for the insights there. I mean, I'm sure a lot of listeners will appreciate that. Cool. Zooming out a bit, right, from the specific technology, do you see any big trends at the moment? Like, when I look at LinkedIn, for example, one thing that keeps coming up is data observability, data monitoring, data quality. Those seem to be some of the things generating a lot of buzz. Yeah, what else is out there?
Zach - 00:09:39: Yeah, like, data monitoring, MLOps, data versioning. There's all sorts of interesting things. And then there's a couple of them that come back around sometimes, like data mesh. I hear data mesh once every three months or something like that. And I'm like, hey, it's there. It's a thing, right? But I think a couple of things that I really am seeing, yeah, is definitely data observability of, like, yo, how is this data changing over time? And it's very closely linked with data quality. And honestly, they should be closely linked, because if you aren't aware of the shape of your data, what it looks like over time, then you really don't have good data quality checks, because you haven't done your due diligence on looking at what is normal and what is abnormal. I mean, there's data quality checks out there where it's very easy to know what is normal and not normal. Is there any data? No data is abnormal, right? Or this column's null when it should never be null. That's abnormal. Very easy check. But then what you define as normal versus abnormal gets more and more complicated as you look at more and more different data points together. And that's where, if you look at week-over-week row counts, what is abnormal versus normal? It depends. Because a lot of times those week-over-week row counts, on Christmas Day, they fail because there's not as much data or there's too much data. And actually you're looking at the wrong period. Instead of week-over-week, you really should be looking year-over-year and zooming out to find the actual pattern that matters the most. And people do week-over-week instead of day-over-day because of the Sunday-Monday phenomenon. Sunday-Monday is super annoying as well. That one very commonly trips people up. That's why week-over-week is better.
But it still misses the holiday patterns. So I think those kinds of observability things are super important, because observability is linked to quality and quality is linked to trust. Without quality, you don't have trust, right? And I would say that is the big thing that I've definitely been seeing. I've also been seeing a little bit more of a push towards streaming and trying to get more people involved. I've been hearing about ClickHouse so much recently. Everyone's like, you got to try ClickHouse. You got to try ClickHouse. I have not tried ClickHouse yet, but I need to, just because I've seen it in every single comment section of all my posts. So yeah.
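The checks Zach lists (is there any data, is a required column null, does the row count look normal week-over-week, and does a year-over-year view explain a holiday dip) could be sketched roughly as follows. This is a hedged illustration, not the approach of any specific tool: the function names, the 30% tolerance, the rule that a year-over-year match suppresses a week-over-week alert, and all sample counts are assumptions made for the example.

```python
from datetime import date, timedelta


def relative_change(current, baseline):
    """Fractional change of current vs. baseline (-0.6 means a 60% drop)."""
    return (current - baseline) / baseline


def same_date_last_year(day):
    """Same calendar date one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)


def check_day(row_counts, null_counts, day, tolerance=0.3):
    """Return a list of problems found for one day's load (empty if none)."""
    rows = row_counts.get(day, 0)
    if rows == 0:
        return ["no data loaded"]  # the easiest check: no data is abnormal

    problems = []
    if null_counts.get(day, 0) > 0:
        problems.append("required column contains nulls")

    last_week = row_counts.get(day - timedelta(days=7))
    last_year = row_counts.get(same_date_last_year(day))
    # Week-over-week compares like weekdays, avoiding the Sunday-Monday effect.
    wow_off = bool(last_week) and abs(relative_change(rows, last_week)) > tolerance
    # Assumption: a count close to the same date last year explains a holiday.
    yoy_ok = bool(last_year) and abs(relative_change(rows, last_year)) <= tolerance
    if wow_off and not yoy_ok:
        problems.append("row count far from last week and not explained by last year")
    return problems


# Hypothetical daily row counts and null counts for a required column.
row_counts = {
    date(2022, 12, 25): 410,
    date(2023, 12, 11): 1000,
    date(2023, 12, 18): 1020,
    date(2023, 12, 25): 400,
    date(2024, 1, 8): 1000,
    date(2024, 1, 15): 350,
}
null_counts = {date(2023, 12, 18): 3}

for day in (date(2023, 12, 18), date(2023, 12, 25), date(2023, 12, 26), date(2024, 1, 15)):
    print(day, check_day(row_counts, null_counts, day))
```

With this sample data, Christmas Day 2023 passes even though it is about 60% below the prior week, because it matches Christmas Day 2022, while the unexplained drop on January 15 is flagged. A real setup would need more history than a single prior year before trusting the year-over-year baseline.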
Benjamin - 00:12:19: One thing I'm curious about in general, right, coming out of this, is we're saying, hey, our tools got much better, right? But we still have many of the same problems we used to have. We still keep cycling around the same kind of topics. And both you, Joe and Matt, right, you're teaching a lot. You're doing thought leadership. You're writing blogs. You wrote that super well-known book. You're affiliated with the University of Utah. You're consulting. Arguably you could say, okay, if we have all of those amazing tools now and we're still cycling around the same types of problems, right, maybe we're just not teaching it well enough. So what does that mean for your approach to delivering these things to, say, professionals, students, those types of things?
Rob - 00:12:58: We failed.
Joe - 00:13:00: The kid comes out swinging.
Rob - 00:13:07: That's a good question.
Joe - 00:13:08: I mean, I think part of the problem, and this is not to trash vendors too much, I think vendors build fantastic products.
Rob - 00:13:14: Horrible products.
Joe - 00:13:14: Yeah, yeah, yeah. But I mean, if I'm in sales for a vendor, I'm not necessarily focused on how I use the tool. I just want to get the tool out there and get people using it, right? And that's where there is more need for people on kind of the meta level to come in and say, all right, you've decided on X, Y, and Z tools. How can we actually use these to help the company? And do that training all along. I mean, I think Joe and I have complained a lot about the lack of training for undergraduates in data specifically. And part of that training as we build it out needs to be, obviously, they need to learn data fundamentals like data modeling, FinOps, cost management, but also what it's like to work inside a business and the kinds of things that businesses care about and how they can communicate better. I mean, communications are notoriously difficult to teach, right? Because how do you teach someone out of a textbook how to communicate with someone? But we need to keep thinking about these problems and figure out how to give students practical, concrete experience with communicating with businesses and stakeholders.
Benjamin - 00:14:16: So how do you do that? Because that was also a very abstract answer.
Joe - 00:14:20: Fair, fair. I mean, I think from our perspective, a lot of this comes down to building better collaboration between undergrad and master's programs and businesses. Shockingly often, we see that you've sort of got this MBA world that operates almost in a vacuum, separate from the business world. And that's not ideal, right?
Rob - 00:14:40: Well, the academic world operates separate from the business world too.
Joe - 00:14:42: Yeah, absolutely.
Rob - 00:14:43: In some cases, that's good. In a lot of cases, I think it's pretty bad. It does a disservice to students. So that's one thing I'd like to see change, right? So you talk about concrete stuff. I would also like to see more apprenticeship-type programs. I think the notion of a university being a necessity is absolutely the wrong way to go. So I think more people could be trained on this through practical things like apprenticeships... I'm creating a new MOOC class right now, a course for a really big MOOC. One of the things I'm doing is... It's a simulator. It's your first day on the job as a data engineer. You get to go do business requirement gathering. You get to go find out what stakeholders want. And part of it is identifying, okay, so you're given this list of requirements. What are people actually asking for? So I think that's the other part. We spend too much time teaching tools and not enough time teaching the techniques. So I think those are concrete ways that we could address it. Because it's easy to do PySpark tutorials and stuff. But that's the wrong way to teach data. I think the way we teach it is absolutely backwards. Know the techniques and then learn the tools. That's why we wrote the book the way we did. It's technology agnostic, for example. And pretty much every company in the universe is using it for their data teams right now. At almost every university that we know, it's increasingly being used as a default textbook for data engineering. So to me, that's part of the process. But it's not going to be an overnight thing. But I think the way we approached our book is similar to how Martin Kleppmann approached his book. It's agnostic. It stands the test of time. And that's kind of where we need to get to. We are making an effort. It is slow, especially because universities are slow. They're so slow. And that's part of the problem with them.
Benjamin - 00:16:20: In the intro, you were kind of mentioning this like data deja vu, right? Kind of seeing things over and over again. And we talked about that data quality aspect of it. Like in what other areas of like data engineering as a whole are you kind of having this like data deja vu nowadays that you had in the past?
Krishnan - 00:16:39: So we talked of quite a few things, right? What I faced in 2000, 2001 for a company's financial metrics, I'm seeing that again today. So I moved organizations, but I see the same thing happening here. And there are a lot of things that I'm noticing that are consistent between those days and here. Some of it is because we operate as a startup. And so there are some challenges there. But the other thing is, it's also a financial industry. And there is a very strict review. So they are very conservative in how they work on this platform. Also remember, from 2006 approximately to, like, recently, I've gone towards the vendor side and I've also worked in other companies which were more open to buying vendor products. We are back here to a point where most of it is built in-house. So that's another shift. So I know from a vendor perspective, there is a lot of availability in terms of tools and technologies that we could easily incorporate and put in, but that's not how we build at BlackRock. We invent, we do it here. And a lot of the reason is that we eventually end up sharing it with our clients. So if I were to go and bring a vendor product in, I don't have to only look at how much money is spent in buying that. I also have to license it for other users. So to make it profitable for us, we have to be able to build it and scale it to our clients. So the engineering world that I'm in recently is interesting, because what I'm trying to do is, given all the experience that I have had, can I take my team? And my team is fairly young. Average age is probably 30. I'm bringing the age up pretty high, right? But 30, 32. How do I take them through this? For one, I think most of them are software engineers, not data engineers. I have to transition them. So I have to go back and dig into my old days of how I reacted to this. Get them data savvy. So that's that, right? The whole people transformation, because these are great people.
So I need to translate. And then how do I also translate that into what we can objectively achieve year over year and turn better? And I've been only here for a very short time. So it's a long process. I'm still learning a lot of things and a lot of challenges, but I believe that is, like I said, the next three, four years are going to be massive in terms of how the whole industry changes to solve for this, but also how our companies are going to make a big difference. And I see a lot of potential there.
Benjamin - 00:19:26: Another thing, right? We talked about it earlier in terms of the dot com, right? And the predictions you were seeing and so on. Like now as well, right? At the macro level at the moment, it's not a great time, especially for tech. So one thing which is coming up more and more in conversations I'm having with people in this data space is this question of proving that all this data you're collecting is actually worth it, right? That at the end of the day, it's contributing to the bottom line of the business. What are your thoughts on that?
Krishnan - 00:19:56: I have dealt with two sides of that coin. So remember my pre-Greenplum days and my post-Greenplum days. So in the pre-Greenplum days, I was like, why are we collecting data that we cannot even process? What's the value of that? Why are we creating a data swamp, right? So I have that. That's part of my branch, right? But when I was part of Greenplum and I was the product manager for Greenplum, I had to put that on a real back burner and not bring that up. I had to talk about all the cool things you can do. But the reality is, and this is why I think as a technology industry we didn't do a good job of educating our clients, you can't just continue to collect data and not process it if you don't even know what exists. The Hadoop days and the Databricks days, those days were great for technology and great for infrastructure spend. I'm glad that we had easy money at that time. But now I don't think that's going to fly. Because even at a company like Clorox... I didn't talk enough about Clorox, but I joined Clorox to first upgrade and modernize that data platform. First on-prem to Oracle Web Data Warehouses and Exadata and stuff like that. And eventually I got to migrate it to Google Cloud and Azure. So even there, because we are CPG and margins are really small, there was always this underlying question. Why do we need to collect so much data? How can we optimize our data process? So what I ended up doing was not only doing the data transformation, doing the data inference and bringing the data in, I actually ended up building a couple of applications on top of it to showcase what data means to a company like Clorox. So this ties in very well with your question, because executives are not sending any blank check at that company. We used data to try and predict forecasting accuracy for products. Can I predict how much I can sell?
So we couldn't do it really well. So that's what we did. We tried using NLP and voice interaction to see if we could predict and call out any product concerns. So that kind of worked out okay. I built a new version of an executive information system for Clorox. We called it the daily briefing, which was a mobile application, but pretty much following the same standard. And it was on a mobile phone. That became a huge success. That was like my last run. And those are all built on the cloud, right? All of these things are built on the cloud. So I always worked in industries where we had a very narrow path between how much data we collect and what's the value of that data. Nobody gave me a blank check, except in the Greenplum days. Even at Greenplum, we were telling customers. Nobody gave me a blank check to go in and load as much data as you can. We did one project which eventually became called the CDP, the Consumer Data Platform. And we brought in cookie information and cookie data. And I did it for about one year, and we were collecting about a billion records per day, right, when you explore that marketing data. But in a year, we could not find any valuable metric out of that. At least the marketing team couldn't find any valuable metric. And we are brick and mortar. We're not web. So it may have been better for online than for us, but nevertheless, that was shut down fairly quickly. And it was all on-prem. We had a Hadoop cluster. I set it up. That goes to show. There were some people who were not in the technology world who didn't buy into this whole hype of collect as much data as you want and then run your algorithms on top of it and just get the results. I think that did work.
Benjamin - 00:23:49: So, yeah. Do you have any kind of closing things you want to say? So maybe one thing for people getting into this space, because you said, okay, you're taking software engineers now kind of at BlackRock and getting closer to this data engineering world. If I was starting today, out of high school, going to college to get into this space, any advice from your end?
Krishnan - 00:24:14: I think when I grew up, there was no Google. There was no cloud. There was no YouTube. There was no, I think that's what it's called, online learning, online training, Coursera. So all those are available today. So all I have to do, which makes my job a little bit easier, is point my team in the right direction and have them go and learn and get better. The biggest shift from software engineering to data engineering, in my mind, is the data domain knowledge. I possibly intuitively do that because of my past experience, but understanding how data is processed on top of the code is such a big thing, which software engineers don't care about. Software engineers think of data as an afterthought, whereas data engineers think of data as the first step, right? So how do I scale? How do I make this consistent, secure? That would be my first thing. As for the learning, I think there are already so many tools and technologies in place. It's easier. It's a whole lot better today than it was in my days. You might think of me as going back to the old days, but that's true. We didn't have the right tools and technologies. I can get the job done much faster today. In some cases, you're stuck with legacy code. That's too bad. We've got to get that out. I see a great potential for people who are coming into the data world. And I wouldn't have said this earlier this year, which I'm saying now, right? The AI winter is possibly near and here, but it's going to pass. There are so many opportunities with AI, with data, and with the whole data industry that we don't even know what the future applications are going to look like. Future applications are going to be driven by data. Making sure that you have the right data, in the right place, processed properly, will make your company more effective, more worthwhile. I've seen this again and again and again. And like I said, I did an application in 2000 without understanding what the implications are.
I did the same application at Clorox, with intention. Nobody gave me the requirement. I came up with that model. I pitched that idea to my CIO and made that a success. And the whole reason was I had that background in data. I had the background in business knowledge to say, from an executive point of view, this is what they would look at. And of course, I had help. I'm not going to say that I didn't get help trying to fine-tune that. I could do that on a different technology, on a different platform, because I understood the data and the value of data and how people process that data. So data itself is abstract. Data is difficult to comprehend. Making it easier, that's the job of a data engineer, data scientist. However you want to label it, data analyst, data scientist, data engineer. Those are all different flavors of the same thing. Make sure when you are in front of an audience and trying to pitch, you understand the data. Have a good data representation, a valid, not biased data representation that can help others see the picture you want them to see. That itself is a big hurdle to pass. And then once you have done that, the third aspect of it is all the cool technologies that are in place. You can build AI and you can build ML, but if you don't do these two things and give them what can be the truth, the future of the possibility, I think data will just remain unprocessed.
Outro - 00:27:52: The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for low latency analytics. Get $200 credits and start your free trial at firebolt.io.