Data engineering should be less about the stack and more about best practices. While tools may change, foundational principles will remain constant. Joseph Machado, Senior Data Engineer at LinkedIn, is on The Data Engineering Show to talk about principles that are key to success, leveraging AI for automation, and adopting software engineering methods.
Listen on Spotify or Apple Podcasts
Benjamin (00:07.422)
Hi, everyone, and welcome back to the Data Engineering Show for another awesome episode. We have Joseph Machado joining in today, who did his master's at Columbia, then spent 10 years as a data engineer and data scientist in the industry. He's a senior data engineer at LinkedIn right now. And in parallel, he also has an awesome blog called Start Data Engineering, and is teaching data engineering to aspiring data engineers. So good to have you on the show, Joe. Awesome. Cool. So where should we start? There's so much stuff to talk about here. Maybe do you also want to say a few sentences about yourself, your background?
Joseph (00:38.008)
Thank you for having me.
Joseph (00:54.376)
Yeah, sure. As you said, I went to Columbia here in New York City for electrical engineering, although most of what I did was network analysis, so K-means clustering, that sort of thing. And then I started as a software engineer but quickly got interested in the database side of things: optimizing indexes, making sure people are writing good queries, things of that nature. So I was in software engineering, moved into data engineering, but at that time my title was data scientist. Basically I was just doing data engineering, writing SQL queries. And this was back when it was Java MapReduce and HDFS. So I started there and then slowly moved along with the industry to Spark, Snowflake, Airflow. Yeah, I've seen a lot of tools.
Benjamin (01:39.402)
You've been through it all. You've seen it all.
Eldad (01:41.25)
You know, it's amazing if you follow each step. It started with implementing an algorithm, so expertise with one algorithm, right? K-means, you said K-means, that was the domain. And then it expanded into a micro-process, and then it became bigger and bigger, and then it moved into a data warehouse, and then you ended up with Snowflake.
Joseph (01:54.043)
Yeah.
Joseph (02:08.533)
Yep.
Eldad (02:10.702)
But I think it kind of tells a story. And of course I'm glad that the database always prevails, but that's a side story. The real story is that databases have grown way beyond anything that anyone predicted. And I think today we'd love to get your share of the pie, your experience around that: how you started there and ended up utilizing modern data stacks. That's kind of what I'm personally looking for today.
Joseph (02:50.152)
Yeah, I think that's a good point. I started off with IBM DB2. I don't know if you guys worked with it; it was the IBM data warehouse stuff back in the day. Honestly, it wasn't too bad. The main thing was that the data was modeled properly, so it was very easy to use. I've worked with data warehouses where the warehouse is great, the technology is great, but the data isn't modeled, so it's hard to work with.
Eldad (03:01.598)
or Swarp, that was my first IBM product.
Joseph (03:18.144)
But when I started off, luckily, the data was modeled very well, and that made it super easy to work with. I started off with just writing Python scripts directly accessing DB2 and Hive. That's pretty much it. And orchestration and scheduling was just Python and the Windows Task Scheduler. And it continues to work. I think it's been running for like eight years now. Yeah, it works.
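For readers who want a feel for that era, here is a minimal sketch of such a script. The DSN, table, and columns are all hypothetical (the episode doesn't show the original code); the point is that the whole pipeline was one Python file plus a scheduler entry.

```python
# extract_orders.py - sketch of a script-era pipeline, kicked off by
# Windows Task Scheduler (or cron). DSN and table names are invented.
import csv

import pyodbc  # needs a DB2 ODBC driver installed and a configured DSN


def extract_orders(dsn: str = "DB2_DWH", out_path: str = "orders.csv") -> int:
    """Pull yesterday's orders from DB2 and dump them to a CSV file."""
    conn = pyodbc.connect(f"DSN={dsn}")
    try:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT order_id, customer_id, amount "
            "FROM orders WHERE order_date = CURRENT DATE - 1 DAY"
        )
        rows = cursor.fetchall()
    finally:
        conn.close()

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount"])
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    print(f"wrote {extract_orders()} rows")
```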
Eldad (03:41.062)
It worked, right?
Just a second, Benjamin. For Benjamin and the rest of the young audience, there were a lot of keywords there that most of you don't understand or know. These were kind of at the beginning, right? I love it. I've been there. But for Benjamin and the rest, this is how it all started. Sorry, go ahead.
Joseph (04:09.94)
Oh no, that's good context.
Benjamin (04:10.007)
I love it, this has become a recurring segment on the podcast, Eldad.
Eldad (04:11.882)
I'm trying, I'm trying to tell Benjamin that databases and warehouses were not born in the cloud.
Benjamin (04:17.802)
Eldad explaining things to me that happened before 2010, because I wasn't alive back then. That's awesome. Well, so having gone through that journey, from Hadoop and MapReduce, those types of things, to now modern cloud data warehouses: what changed, right? And what's the same? When you look at the space today, what are the main challenges you're seeing, also in your job at LinkedIn?
Joseph (04:25.186)
Sorry.
Joseph (04:33.234)
Mm-hmm.
Joseph (04:45.82)
Yeah, I think from a conceptual standpoint, the technology has gotten much better. However, the fundamentals still remain the same. Good software-driven practices, proper testing, that's still the same, and in a lot of places, software engineering concepts like testing and making sure you have proper CI/CD set up are not really followed on data teams. So I feel like data engineering teams are kind of lagging, although that's changing these days.
That's one thing I've seen. The technology has gotten so much better that it also makes it easy to build things quickly without following good practices, which leads to long-term pain and having to migrate or redo things. So technology is growing super fast, with a lot of features, but the fundamentals haven't changed much, in my opinion.
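As one hypothetical illustration of those fundamentals, here is what unit-testing pipeline logic can look like when the transformation lives in a plain function that CI runs on every commit. The function and its record schema are invented for the sketch, not taken from the episode.

```python
# test_transforms.py - run with `pytest`. The transformation and the
# record schema are hypothetical; the point is that pipeline logic is
# a plain, testable function rather than code buried in a job script.
from datetime import date


def dedupe_latest(rows: list[dict]) -> list[dict]:
    """Keep only the newest record per customer_id, by updated_at."""
    latest: dict[int, dict] = {}
    for row in rows:
        current = latest.get(row["customer_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["customer_id"]] = row
    return sorted(latest.values(), key=lambda r: r["customer_id"])


def test_dedupe_latest_keeps_newest_record():
    rows = [
        {"customer_id": 1, "updated_at": date(2024, 1, 1), "plan": "free"},
        {"customer_id": 1, "updated_at": date(2024, 2, 1), "plan": "pro"},
        {"customer_id": 2, "updated_at": date(2024, 1, 5), "plan": "free"},
    ]
    result = dedupe_latest(rows)
    assert len(result) == 2
    assert result[0]["plan"] == "pro"  # customer 1's newest record wins
```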
Benjamin (05:37.474)
So this is actually something I'm curious about. When you're talking about testing: when you started out, there weren't big data engineering teams at those companies back then, right? These were probably purely software engineering teams working with these big data technologies in many cases. So would you say already back then people didn't do enough testing on these big data systems, or would you actually say, well, 10 years ago we were in a better state
Joseph (05:48.152)
Mm-hmm.
Benjamin (06:06.39)
kind of in terms of testing our data pipelines, kind of our infrastructure, maybe because...
Eldad (06:11.766)
Everything was consistent, it worked, DB2 transactions committed, you know. Yeah.
Joseph (06:16.344)
It worked, yeah. I wouldn't say better or worse; it just depends on the team. But it's a pattern I have seen: if your team has solid fundamentals and a good data platform, it helps a lot. And with the team size, right? Like the data engineering teams that you mentioned: yes, there are a lot more data engineers now, but I also feel like there's so much more complexity, in most cases unnecessary,
that adds to the toil, the developer-experience toil, if you will, of getting something into production. I used to be able to, yeah.
Benjamin (06:53.134)
It's funny you're saying that, because the tools should have gotten simpler, right? It's been 10 years, and okay, they are simpler in the sense that you can write something in 10 lines of Snowflake SQL that would have been hundreds of lines of complex MapReduce code back in the day. But this explosion of complexity to get things into production, it's a bit crazy to me. Where do you think that's coming from?
Joseph (07:13.916)
It is. I have a hypothesis. I'm not sure how accurate it is, but my hypothesis is that when the SaaS companies started building tools, they made it really hard to test locally. If you have Postgres or something like that, you can easily test locally. On Snowflake, it's hard to test. Databricks, it's not simple to test. You can do it. And one of the reasons why dbt is so popular is that it makes testing super easy. So with
a lot of the new features, we were also giving up a lot of these core software principles: having a virtual environment locally, or Docker, or wherever you want to run it, and being able to quickly run tests. With data teams, that's a hard ask, but it is what it is. And then there's the focus fully on SQL; while SQL is great, sometimes it's specifically hard to test. So I wouldn't say it has gotten
worse or better. I think it has been messy and it will always be messy, and clearing up the mess is up to the individual team. But the technology is far superior now. As you said, it used to be many lines of Java code, compile it, push it, and now I can just write a SQL query. So yeah, that's what I've observed.
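To give a flavor of the quick local test Joseph is describing, here is a minimal sketch. It uses DuckDB as a stand-in local engine purely so the example is self-contained (the speakers mention Postgres and Docker; the engine, table, and query here are assumptions):

```python
# test_revenue_sql.py - run with `pytest`. A SQL transformation tested
# entirely on a laptop against an in-memory DuckDB; no warehouse needed.
import duckdb

DAILY_REVENUE_SQL = """
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
"""


def test_daily_revenue_aggregates_per_day():
    con = duckdb.connect()  # in-memory database, no server to manage
    con.execute("CREATE TABLE orders (order_date DATE, amount DECIMAL(10, 2))")
    con.execute(
        "INSERT INTO orders VALUES "
        "('2024-01-01', 10.00), ('2024-01-01', 5.00), ('2024-01-02', 7.50)"
    )
    rows = con.execute(DAILY_REVENUE_SQL).fetchall()
    assert [(str(d), float(r)) for d, r in rows] == [
        ("2024-01-01", 15.0),
        ("2024-01-02", 7.5),
    ]
```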
Joseph (08:39.872)
Yeah.
Benjamin (08:42.466)
Awesome. So can you maybe take us through some of the challenges you're seeing today, in your job at LinkedIn or from other industry exposure you've had? What are the types of things you're thinking about nowadays?
Joseph (09:00.308)
Yeah, I won't speak specifically about LinkedIn, but I can talk about it as a kind of generalized idea. What I've seen is that the developer experience is really lacking, especially in the data space. I've worked on software engineering teams where we could deploy in an hour: you put up a PR, someone reviews it, that's it. But that's not always the case with data teams. Sometimes you spend a month, sometimes a week,
Benjamin (09:09.546)
Sounds great.
Joseph (09:28.46)
a few days to actually validate your data. So I think in that aspect, the data teams could do better.
Eldad (09:34.042)
Don't be shy, a month is good, a month is great. Really, 30 days or 31 days.
Joseph (09:42.69)
31 days, yeah. But yeah, the developer experience, I wish it were better. And it also partly comes from the domain itself, right? From a backend engineering or application development perspective, you have clear definitions, you have clear scope, you have, let's say, a clear UI or clear behavior. But from a data perspective, it's hard to quantify what right data is. So that's the kind of
difficulty that I'm seeing. Because if you quantify what right data is, what right data means to you, it should be pretty straightforward. But when the data grows in complexity, and there are so many product teams that you have to coordinate with, defining what right data is, is in and of itself a huge task. And it's never-ending. There are always new edge cases, and then you modify the code, yeah.
Benjamin (10:30.189)
Yeah. So what's your take on data observability tools, like Monte Carlo or something like that? Where do you see their place?
Joseph (10:39.444)
I think they definitely do help. LinkedIn has its own system, and we were using something like that at my previous place with dbt. It definitely has its place, but at the end of the day, it's just a tool. It cannot define what good data is. It can give you guidelines, check for freshness, check for these qualities, sure, but not the business rules.
For example, the variation of a threshold over time: how should it vary? What is seasonal? It's hard to automate that with a tool. You need to dig into the data to manually figure that out. But those tools do make it easy to set up, if you will, super simple. Yeah.
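To make the business-rules point concrete, here is a hypothetical sketch of a check that a generic tool can't infer for you: today's row count compared against a trailing same-weekday baseline. The baseline window and the 30% tolerance are invented; a human has to pick them from the data.

```python
# volume_check.py - a hand-rolled seasonal volume check. The baseline
# (same weekday) and the tolerance are business decisions, not defaults
# an observability tool could guess.
import statistics


def volume_ok(today_count: int, same_weekday_counts: list[int],
              tolerance: float = 0.30) -> bool:
    """Compare today's row count to the trailing same-weekday average."""
    baseline = statistics.mean(same_weekday_counts)
    deviation = abs(today_count - baseline) / baseline
    return deviation <= tolerance


# Example: the last four Mondays vs. this Monday.
history = [10_200, 9_800, 10_500, 10_100]
print(volume_ok(10_050, history))  # True: within 30% of the baseline
print(volume_ok(2_000, history))   # False: likely a broken upstream load
```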
Benjamin (11:12.408)
Right.
Benjamin (11:25.826)
Makes perfect sense. One thing you just mentioned is internal tools, right? And I feel like, okay, if you're working at a big company, also as a software engineer: if you go to Google, you're going to use a lot of their internal tech around how you deploy things on Google data centers, their internal version control, all of those things. If you go to Facebook, same thing, internal tools. If you go to Microsoft, same thing, right? As a data engineer, I assume that's not so different. If you go to big technology companies, which are managing
exabytes of data, there will be in-house tools for specific problems. How do you think about that in terms of staying relevant and up to date with technology? Because it seems like that actually makes it harder. And in data engineering especially, I feel like it's even more about the tools you know how to use than software engineering, which is already a lot about that. So what's your take on that? Data engineering in big tech, then.
Joseph (12:24.52)
I have an opposing opinion on that. I don't think tools matter. I think the principles matter: test data before you publish it to your stakeholders; how do you quantify a test? Those sorts of things matter. The design principles, if you will. I don't think tools matter as much. For example, you can have Spark, you can have Snowflake; at the end of the day, they're both distributed systems. If you know how to
look at distributed query plans and optimize them, you're good. That's my opinion. But I do know that when you apply for jobs, you need to tick certain boxes. So the way I think about it is, if I have experience in, let's say, Snowflake, I would just try out Spark on my own and see how it works. So yeah, sometimes companies just want that, yeah.
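As a small illustration of that transferable skill, here is a hedged PySpark sketch; the data and join are invented, and the same habit applies in Snowflake with an EXPLAIN on the equivalent query.

```python
# explain_plan.py - inspecting a distributed query plan with PySpark.
# Synthetic data; the habit (read the plan, spot broadcasts and
# shuffles) carries across engines.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, 100, 25.0), (2, 101, 40.0)], ["order_id", "customer_id", "amount"]
)
customers = spark.createDataFrame(
    [(100, "US"), (101, "DE")], ["customer_id", "country"]
)

revenue = (
    orders.join(customers, "customer_id")  # small side may be broadcast
    .groupBy("country")
    .agg({"amount": "sum"})
)

# Prints the parsed, analyzed, optimized, and physical plans. Look for
# BroadcastHashJoin vs. SortMergeJoin, and Exchange (shuffle) nodes.
revenue.explain(mode="formatted")
```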
Eldad (13:13.082)
This is super interesting, you know, because in many ways, what you're saying is that many steps within the data pipeline are commoditized. For each step you have a choice of 10 tools, and each one of them is unique on its own, with its advantages and disadvantages.
But at the end of the day you look at the whole data pipe, the whole funnel, right? That's your product, kind of the input and the eventual output. You're talking unstructured data coming in, defining raw metadata on top of it. This is very delicate stuff, completely owned, I think dominated, by the human factor, right?
Of course, once it gets semi-structured and, of course, structured, it's easy, right? The universe becomes much easier. But I think in many ways, some of those big steps are mostly about efficiency, getting the job done. How fast, right? How robust? Can you plug and play each part and have the human part own it?
It would be interesting to get your opinion, especially where you're at, right, where in-house dev happens: how do you apply AI on those delicate parts of the process? Is that applied? Is there an ELT.AI?
Benjamin (14:59.646)
Let's start with one question, Eldad. That was a million questions, I can't keep track. Ha ha ha.
Eldad (15:03.858)
Each one of those is great, I mean. But yeah.
Joseph (15:09.504)
I think as for AI, there is an internal tool to convert text to SQL queries. It's the low-hanging-fruit type of product; I think everyone is doing it. Convert text to SQL queries based on the metadata information we have. But for designs, we do have templates one can use, because at the end of the day, most pipelines are kind of similar. So there are templates we can use to
quickly spin something up. However, within those templates, we still have to write a Spark job. We still have to see how joins are done. We still have to figure out the best way to design that code, how to structure it, how to organize it. That has not been automated yet. So I guess that's why we have a job. But yeah, I don't know if it will ever be automated, because there are so many constraints, especially in a big company with so many teams, so many formats, so many
data nuances. Which kind of brings me to another point, which is the separation of product teams and data teams. I think it's pretty not great. It's bad, I think, because the product team operates on its own and the data team operates on its own; there is a huge disconnect. So if there were more connection, like embedded data engineers within the product teams, that might
enable more AI-driven development, but so far I haven't seen it. No, no, adding more people, I don't think, is always the solution. It just adds a lot of confusion.
Eldad (16:41.794)
More people, more people.
Eldad (16:50.318)
You know, no matter how you place things: you can centralize, you can decentralize, you can do all sorts of options. And I think all of them work for certain use cases at certain times, for certain sizes. If you have a good team, they can utilize any stack and deliver value.
Eldad (17:22.039)
And yes, so as you're saying, data is becoming very boring and products don't matter anymore. It's just squeeze out more, you know, more efficiency, more value, no appreciation for the little things.
Benjamin (17:28.418)
That's it.
Joseph (17:28.724)
I did not say that.
Benjamin (17:40.302)
That's it.
Eldad (17:40.533)
Great outlook.
Joseph (17:42.748)
I do think that eventually, backend engineers will become data engineers as well. That would be my ideal scenario, where you have one team that builds the backend system and the front-end, if you will, and kind of owns the data as well. That way you don't have these two separate teams with two different roadmaps, where everything is separate. If it were a single team... I understand, it's...
Benjamin (18:07.542)
Only engineers. Amazing.
Joseph (18:09.817)
Not only engineers, but I understand like bigger teams, you know, with bigger companies.
Eldad (18:12.954)
Give everyone an engineer title, everyone, and everything will sort itself out, no matter how you structure things. Put them far away, right? Remote, Benjamin? Ha ha ha.
Joseph (18:18.744)
There you go.
Joseph (18:22.303)
Yeah.
Benjamin (18:28.315)
Good stuff. So Joseph, one thing you're also actively doing is teaching, and thinking about how to teach, data engineering. You have a really successful blog and newsletter called Start Data Engineering. Take us through that journey: how did you start it? What are you writing about there now? What's top of your mind?
Joseph (18:50.848)
Mm-hmm. Yeah, so I started during COVID. I had the extra commute time and thought, okay, what do I do with this commute time? Just...
Benjamin (18:56.758)
This is like such a consistent thing, I love it. For everyone we have on the show, when it comes to thought leadership and education, it's always: how did you start? Oh, during COVID, we had so much spare time. Maybe COVID should...
Eldad (18:58.234)
Ha ha
Joseph (19:07.989)
Yeah.
Eldad (19:09.706)
Everyone remembers COVID as a very positive experience.
Benjamin (19:14.41)
Exactly. It got me into data engineering thought leadership. Nice. So sorry for interrupting.
Joseph (19:14.673)
No.
I guess so. No, no, please. That's funny. No, yeah, I just started during COVID. I try to write about what people are looking for, not what I like to write about. I would like to write about more low-level operating system stuff. Is there a way we could use systemd for orchestration? More low-level type stuff. But that's not what people are looking for. People want some
projects, or a dbt explanation, things of that nature. So I try to write about what people are looking for.
Benjamin (19:50.862)
I love this. This is different, because so far the answer has always been: oh, just write about whatever you're passionate about, start with something that maybe no one cares about, and you'll build an audience over time. This is the opposite: pick what's popular, pick what gets the most likes, and then commit to that. I love it.
Joseph (19:57.904)
Yeah, that doesn't.
Joseph (20:08.792)
Yeah, and I try to make it actionable. So always code, not just text, because that's what I prefer. As you said, I don't like just text; there has to be something actionable. So yeah, that's pretty much it.
Eldad (20:26.874)
So if you had to build the perfect dbt benchmark, right? Run the craziest stuff, the hardest stuff. What would be the hardest stuff to run on dbt? Have a global open benchmark, right? Just switch, plug and play, and run. How would you do that? What would the queries be? Is there something going on there? There are so many things you can solve with dbt; it is such a powerful abstraction. Tell us more: what's out there?
Joseph (21:04.984)
Well, what's out there: basically, I think everyone is mostly doing the same thing. 95% of the companies are doing the same, using the dbt project structure to build their own thing. I do think there is a need, though, where there will be a business-vertical type of product. Let's say advertising, right? I can see someone building an advertising stack, right from Segment to the data warehouse,
that you build with dbt and can apply to different advertising companies. Same with finance stuff: pulling data from Experian or whatever it might be, getting some dashboards out to analysts. I've worked in multiple verticals, marketing and advertising, a little bit of finance, and they all have the same sort of input. If you look at marketing: clickstream, clicks, orders, and so on.
Eldad (21:32.998)
Thanks.
Joseph (22:00.068)
Sales is the same thing: opportunities, things of that nature. So if you model it right, and if you make that pipeline specific to a vertical, you can use dbt and just deploy it to different people or different companies in the same vertical. At the end of the day, it's the same data they're collecting; it's just that everyone does their own implementation. So I think we might see more of that.
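One hedged sketch of what deploying a single vertical package to many companies could look like, assuming dbt-core's programmatic runner (dbtRunner, available in recent dbt-core releases); the project layout, targets, and the source_schema variable are all invented for the example:

```python
# deploy_vertical.py - sketch: one dbt project that models a vertical
# (say, advertising), parameterized per company. Profiles/targets and
# the `source_schema` var are hypothetical.
from dbt.cli.main import dbtRunner, dbtRunnerResult

COMPANIES = {
    "acme_ads": {"source_schema": "acme_raw"},
    "globex_ads": {"source_schema": "globex_raw"},
}

runner = dbtRunner()
for company, dbt_vars in COMPANIES.items():
    result: dbtRunnerResult = runner.invoke([
        "build",                  # run models and tests together
        "--target", company,      # per-company connection profile
        "--vars", str(dbt_vars),  # e.g. {'source_schema': 'acme_raw'}
    ])
    print(company, "ok" if result.success else "failed")
```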
Eldad (22:25.39)
You know, they said the same about SQL. They said, oh, you just write it once and you can run it everywhere, on any database. And look where we are today. What a mess. But SQL is still very consistent, if you play by the book.
Joseph (22:32.376)
Haha, wow. So, yeah. I guess that is the... yeah. Mm-hmm.
Eldad (22:46.182)
It plays well, really well. And dbt is the same, it's very similar. Hearing you out, you're really treating dbt as a standard. Super interesting to see where this grows as an ecosystem for verticals.
Joseph (22:48.772)
Yeah, there's the, uh, mm-hmm.
Joseph (23:03.784)
Yeah, and with dbt, you know, there's a lot of community support as well. So if they don't officially support a database, there are community adapters to enable dbt to run on different databases. So I do think a lot of companies are moving there, especially new startups; LinkedIn is also starting with dbt. So yeah, it's
quite popular. Hopefully they get to profitability soon and don't change the license, but we'll see.
Benjamin (23:40.062)
One other thing you talked about recently on your blog was open table formats, right? And I think this also ties into the modern data landscape. So, things like Apache Iceberg. What's your take on those? Where do they fit? Do you think they're basically eating the world? What are your thoughts here?
Joseph (23:43.948)
Mm-hmm.
Joseph (23:51.98)
Mm-hmm.
Joseph (24:01.552)
It depends on the company size. I do not think they're going to eat the world anytime soon, just because Snowflake has its own internal format, if you will; I forget the name, but it has its own thing, which is kind of very similar to Apache Iceberg. Databricks has its Delta Lake format. Iceberg, I think, will be helpful for bigger companies, specifically ones working across clouds and across systems.
At LinkedIn, we use Apache Iceberg. I mean, we have a wrapper on top of it, but it's Apache Iceberg. We use it to move data between our on-prem and cloud resources, and we can use Spark, Trino, whatever it may be. So at a bigger company, it makes a lot of sense, because there are so many different stacks. But I do not see smaller companies using Iceberg, just because the impact would not be as high; you could just get the same with Snowflake.
And smaller companies are usually not going to have two or more data processing systems. So that's my opinion on that. But I do think it's growing fast; there's a lot of interest, mostly from bigger companies.
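For a flavor of the cross-engine setup Joseph describes, here is a hedged PySpark sketch wiring up an Iceberg catalog. The catalog name, warehouse path, and table are invented, and the iceberg-spark-runtime package is assumed to be on the classpath; the same table could then be read from Trino through its Iceberg connector.

```python
# iceberg_demo.py - writing an Apache Iceberg table from Spark.
# Assumes the iceberg-spark-runtime package is available; all names
# here are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT, event_type STRING, ts TIMESTAMP
    ) USING iceberg
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'click', current_timestamp())")

# Any engine with an Iceberg connector (Trino, Flink, Snowflake's
# Iceberg tables, ...) can read the same data files and metadata.
spark.sql("SELECT * FROM demo.db.events").show()
```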
Benjamin (25:16.546)
Yeah, makes sense. To me, one of the big questions there is performance; we'll have to see what the verdict is. One thing that vendors with first-class managed storage, like Snowflake, would claim is that their own file format and managed storage are going to be faster than Apache Iceberg over Parquet. Of course, vendors who are into the open formats, like Databricks,
would disagree with that. And Snowflake is now also moving heavily in the Iceberg direction. So I think it will be very interesting to see how this plays out over the next couple of years.
Joseph (25:54.12)
Yeah, they just opened the public preview, I think two months ago or last month, something like that, for Iceberg integration. And with Spark, it's super easy. I do think that, because it's open source, there will be a lot of adoption, and specifically feature-parity work from the Snowflake and Databricks side.
Benjamin (26:16.79)
The spec is actually huge. Iceberg as a specification, with all the features it has, is massive. It's actually very hard to implement in a data warehouse.
But okay, as a user, you of course kind of don't care about whether it's hard or not. It's just nice if it works.
Benjamin (27:06.926)
Awesome. Cool, Joseph. So, is there anything else on your mind in terms of data at the moment that you wanted to chat about today, or bring up?
Benjamin (27:23.382)
You don't have to say anything.
Eldad (27:27.163)
Ha ha.
Benjamin (27:43.126)
I see. I think those are great closing words. It also really resonated with me, this idea around learning concepts rather than focusing on tools. I totally buy into that. I come from a software engineering background, and there it's 100% the case, so it's probably the same with data engineering as well. Thank you so much for being on the show. It was really great having you, Joseph. Yeah, and see you around.
Eldad (28:10.822)
Thank you, Joseph.