Matthew Weingarten from Disney Streaming about Data Quality
March 26, 2024
Matthew Weingarten from Disney Streaming about Data Quality Best Practices

Matthew Weingarten, Lead Data Engineer at Disney Streaming, talks about principles essential for data quality, cost optimization, debugging, and data modeling, as adopted by the world's leading companies.

Listen on Spotify or Apple Podcasts

Transcript:

Benjamin (00:01.282): Hi, and welcome back everyone to the Data Engineering Show. It's my total pleasure to have Matt Weingarten on the show today. He's a lead data engineer at Disney Streaming right now, based out of Seattle. Yeah, spent quite a bit of time in industry already. Was at Facebook before, did his master's at the University of Florida. Really great to have you on the show today, Matt. Um, so yeah, do you quickly want to introduce yourself?

Matt Weingarten (00:32.616): Yeah, well, thank you for having me, of course. So yeah, Matt Weingarten here. I've been in the data engineering space for roughly seven years. I technically started as a software engineer, like probably a lot of people in the data world, and then kind of just learned the tools of the trade of data engineering from there, because a lot of the things I was working on were data-centric. And so I felt that's what best applied to me, and I've stayed there ever since, and I've really enjoyed it. So I think the data space as a whole is really exciting, with what's been going on in the last few years, you know, seeing how it's evolved from when I started, which was itself a more evolved form from, of course, a few years before that, when data engineering kind of really started to take off. So it's a great time to be in this space and I'm glad to be on this program.

Benjamin (01:18.326): Nice. Yeah. Uh, that's definitely super exciting. Uh, and I couldn't, I couldn't agree more on your comments on the data space. Um, so what's kind of on your mind these days, right? So kind of, yeah, like what, what are the big topics basically that you're thinking about at the moment in terms of data?

Matt Weingarten (01:38.592): Yeah, so to give some more context, I work on a team that's working with clickstream analytics within Disney. So on an average day, we're processing around a billion records, representing different user actions within the Disney streaming ecosystem. So when you're logging into an application like Disney Plus or working with any of the different ESPN applications, we're tracking all of that data.

Benjamin (01:53.206): Wow, that's a lot.

Matt Weingarten (02:06.596): And so it is a true big data application that we have in place due to the scale of what we're handling. So there's various challenges involved with that, although I think we've gotten it into a pretty good form at this point. So really, what we're focusing on is just business projects, anything that our data can leverage in order to help contribute to the business. That's a big aspect of that. But then really, where I pay some attention is just trying to see how we can do various things better. You know, for example, one that I think everybody's kind of trying to approach right now is data quality. Lots of data, of course, can be a cause of a lot of issues. So we wanna be able to see how we can kind of really, you know, look more into what we're actually producing and make sure that what we have in place is of the highest quality. So that, along with some of the other standard challenges, I feel is where we spend most of our time these days.

Benjamin (03:06.614): So tell me a bit more about that, right? In terms of kind of data quality, cause this is something that comes up very consistently in this show and there's so many aspects to this, right? Proper testing, having a staging environment, writing like unit tests for data, having data observability tools. What's your take on this space, right? What have you seen really provides a lot of value? Have you seen something that maybe doesn't work so well? Give us your thoughts on data quality.

Matt Weingarten (03:34.976): Yeah, so it's kind of interesting, but kind of like a lot of the principles that I like within data quality, I actually got from Facebook, well, what's now Meta, during the time I was there. So one thing that they did really well, and with a lot of the pipelines they had, what you would first do is you would write data to kind of like an intermediate table, and then you would run various DQ checks on top of that data. Once all those DQ checks passed, data was pushed to its final location. So essentially, you weren't pushing data to its final location. Yeah. You had an intermediate layer to basically be like, all right, are we sure everything's good here? And I think that's a great practice. I wish that would, and I feel like that's becoming more standard, but I wish that was like the standard because what I've seen in a lot of places, for the places that even have data quality checks implemented.

Benjamin (04:09.93): Like a bit like write, audit, publish. Okay.

Matt Weingarten (04:32.088): But they'll be doing it on top of like the final layer of the data itself, which is good that you have it at least, but you want to make sure that you have a single source of truth. And if you're running those checks on top of that and the data turns out to be wrong, well, now you have to go back and fix that. You know, teams who are running reports off of that data already have to go back and reprocess their things. I like to keep the final layer as clean as possible, or at least that's my philosophy on it. So I think that's one good practice that I've seen. And to go back to some of the other points, yes, of course, having lower environments where you can do testing, building in unit tests, which I think is something we're gonna actually start to really try to focus on a little more, at least within my team. I think some of the projects that we've worked on in the last few months have kind of revealed the need to have some more of that in place. So we've realized that we need to have a good practice there, because if we don't have that proper testing in place beforehand, we go and deploy something, we start running, other people point out issues to us, and then we have to start all over again. We have to make the fixes, we have to backfill data, which is an expensive exercise. We wanna make sure we're doing things as cleanly as possible. So I think we really wanna take some lessons there and make it a stronger practice for us.
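
Editor's note: the intermediate-table pattern Matt describes (write to a staging table, run DQ checks, and only then push to the final location) can be sketched roughly as follows. This is a minimal illustration using an in-memory SQLite database, not the pipeline used at Meta or Disney; the table names `events` and `events_staging`, the columns `event_id` and `user_id`, and the two checks are all hypothetical.

```python
import sqlite3


def audit(conn, table):
    """Run DQ checks on the staging table and return the names of failed checks."""
    failures = []
    # Hypothetical check 1: event_id acts as a primary key, so no duplicates allowed.
    dupes = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT event_id FROM {table} "
        "GROUP BY event_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dupes:
        failures.append("duplicate_event_id")
    # Hypothetical check 2: user_id must never be NULL.
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE user_id IS NULL"
    ).fetchone()[0]
    if nulls:
        failures.append("null_user_id")
    return failures


def write_audit_publish(conn, rows):
    """Write rows to staging, audit them, and publish only if every check passes."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (event_id TEXT, user_id TEXT)")
    # Write: the batch lands in an intermediate table, never directly in `events`.
    conn.execute("DROP TABLE IF EXISTS events_staging")
    conn.execute("CREATE TABLE events_staging (event_id TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO events_staging VALUES (?, ?)", rows)
    conn.commit()
    # Audit: a failing batch stays in staging for debugging.
    failures = audit(conn, "events_staging")
    if failures:
        return failures
    # Publish: copy the audited batch into the final table.
    conn.execute("INSERT INTO events SELECT event_id, user_id FROM events_staging")
    conn.commit()
    return []
```

Called with a clean batch, `write_audit_publish` returns an empty list and the rows appear in `events`; called with a batch containing a repeated `event_id`, it returns the failed check names and leaves `events` untouched, which is the point of keeping the final layer clean.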

Benjamin (05:50.094): Gotcha. In terms of this auditing step, right? Of like having an intermediate stage where you put your data before you actually then publish it to your downstream consumers. What are the big things there, right? It's, okay, there's some easy stuff you can do. Looking at the distribution of certain values, making sure, hey, are all of the timestamps I have in there actually from the last couple of days? Do we have certain correlations, certain distinct counts, all of that stuff? Anything fancier that you're doing, or that you advocate doing, um, when you then look at whether the data you have in that like staging area basically meets your, your quality bars?

Matt Weingarten (06:30.824): Yeah, so I mean, this is something we have kind of just started approaching. I have a lot of thoughts in this space, but really it's going to be kind of like a continual approach, or an incremental approach rather, to get to that, you know, just taking it one step at a time. So I think first, of course, is defining, you know, the core checks that you would expect with data. Like, hey, this is a primary key, are there any duplicates? Because if there's duplicates, there's an issue. Like, we know that right away. So having that... You know, checking for nulls where they shouldn't apply. Some of those just like, you know, basic checks we should have in place. Then we want to have checks against like our, you know, core metrics, our KPIs, making sure those all look good. And then we kind of want to dive into the area of trend checks. So like, you know, a lot of our data, what we see is a lot of seasonality involved with it, especially because we're working with sports data. We know exactly when, you know, we're going to see spikes in our data. Like for example, the Super Bowl is this Sunday. It's one of the biggest sporting events in the world. So we know we're gonna see a spike this Sunday of traffic compared to other weeks. You know, during the week, things are usually pretty quiet. During the weekend when a lot of the sporting events happen, that's when we see those jumps. So we wanna make sure we're accounting for that. Like over a month or over a week or whatever time period makes sense, we wanna make sure that, hey, we've seen a drop of like 20% of logged in sessions.

Benjamin (07:26.875): Just coming up with a timely

Matt Weingarten (07:51.38): You know, that might be normal behavior, if we were expecting that, but of course that can totally reveal something different. So we really wanna make sure we have that in place. And then I think another point about data quality that sometimes isn't necessarily discussed enough, but one thing with this is we wanna make sure that, you know, our approach to data quality is a conversation between us and our business stakeholders as well. You know, we have an idea of what our data quality checks should look like, but our stakeholders might have some other ideas based on what they think is critical. So we should take that into account as well, because it shouldn't just be something that applies to us. It's their data at the end of the day as well. So we want to make sure that we kind of go in both directions in terms of how we're incorporating that.
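
Editor's note: the checks Matt lists (duplicates on a primary key, nulls where they shouldn't be, and a trend check that accounts for seasonality) could look something like the sketch below. The function names and the row format (a list of dicts) are hypothetical. The 20% threshold comes from his example; comparing against the same weekday in prior weeks is one assumed way to handle the weekday/weekend pattern he describes, not necessarily his team's approach.

```python
from collections import Counter
from statistics import mean


def duplicate_keys(rows, key):
    """Return the key values that appear more than once, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def null_count(rows, column):
    """Count rows where the column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def trend_alert(today_value, same_weekday_history, max_drop=0.20):
    """Flag a metric that fell more than max_drop below its seasonal baseline.

    The baseline is the mean of the same weekday in previous weeks, so a quiet
    Tuesday is compared with other Tuesdays rather than with a busy Sunday.
    """
    baseline = mean(same_weekday_history)
    if baseline == 0:
        return False  # no meaningful baseline to compare against
    drop = (baseline - today_value) / baseline
    return drop > max_drop
```

A flagged drop is only a prompt to investigate: as Matt notes, it may be expected behavior, and which metrics and thresholds matter is something to agree on with stakeholders.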

Benjamin (08:32.398): That makes perfect sense. So one thing kind of coming myself from a software engineering background, right? That I'm seeing is okay. Kind of writing tests, checking for the data integrity. Like that makes perfect sense. The next step then is you have something failing, right? Okay. Like you have a test in your CI, uh, that doesn't succeed as a software engineer, or you have some data quality checks kind of that, that fail in your, before you publish that data. Um. How do you approach debugging that? Right? Cause like the actual like root cause for why your data isn't in a good shape might be very far kind of upstream in your data pipeline. And what, what are your thoughts on that?

Matt Weingarten (09:11.884): Yeah, we've definitely had to do a fair amount of that. Sometimes we have to go all the way up to the source of our data and then actually check there to see what's going on. It could just be some things like bad data there or some issue with timestamps, which can come into play all the time. It seems like it's always timestamps, it feels like, more often than not, in some way, shape, or form. So we just have to, you know.

Benjamin (09:33.175): Yeah.

Matt Weingarten (09:37.216): We definitely make sure we do the proper analysis. And of course, I think part of that is making sure that we're enabled to do that analysis quickly, because sometimes it can become a process to really try to actually, you know, uh, dive down into that data, but we want to make sure that we can speed that up so that it's easier to do that because yes, there's a lot of comparisons that we need to do when it comes to that. So we want to make sure everybody's enabled to do that, uh, properly.

Benjamin (10:01.422): Are there any tools at the moment that you're particularly excited by in this space? So one company we had on the show, for example, in the past was Monte Carlo, who were doing a lot of like kind of data observability and lineage and these types of things. This feels like such a quickly changing space, right? And kind of like there's new tools coming to the market all the time. Anything that maybe you've worked with in the past, doesn't have to be at your current employer, that you find really exciting?

Matt Weingarten (10:29.28): Yeah, so I think Monte Carlo is great. I've been really impressed by Barr as a thought leader in the data space, especially. And I know that we have kind of looked at their product in the past. But the thing with, and this is one thing I always say with companies like the size of Disney, is that what we do in one area can be completely different than what's going on in another place. It's huge.

Benjamin (10:38.679): Definitely.

Matt Weingarten (10:59.324): It's kind of hard to get that full landscape. So, Monte Carlo has been one. I know that we've also just tried smaller things like Deequ, the library that Amazon, I think, built a long time ago. Great Expectations is one that a lot of teams use as well. So, we've looked at a few of those things. I think as we start to turn more towards having quality, observability, monitoring, all those aspects, you'll need some type of platform for that. So whether it's built in-house or whether it's using one of those tools, I guess that remains to be seen. But yeah, I've been really impressed by some of the developments that have been going on in that area.

Benjamin (11:43.346): That's awesome. Um, cool. So one other thing that also always comes up when, when we talk to people on the show is cost, right? So data quality, of course, is on everyone's mind. We're serving the business in the end and to make good business decisions, we need to make sure we look at the right data. And the other part is we need to provide that insight in a cost-effective way. Um, what, what are, what's your, like, what are your thoughts on that basically?

Matt Weingarten (12:12.12): Well, I think it's kind of funny that you asked that because that's something I'm very passionate about actually, that whole area. And I think a lot of companies have been over the last year or so, if not before then, because you know, 2023, if you were going to kind of summarize that, you know, there were layoffs happening almost everywhere. Even right now, they're still happening in a few places. But you know, 2023 was really when a lot of that took hold. And we kind of realized that, you know, especially during the beginning of the pandemic, you know, the stock market was in a great place. Everybody was hiring like crazy. And now things have to be scaled back a little bit. And one of the things that we can look at first is data products, because data applications, and this is something you kind of overlook when you're talking about costs and optimization and that whole area, which is kind of referred to as FinOps for those who are familiar with the term. But big data applications are one of the most expensive things usually, because you're working with big data. You set up big servers. You have a lot of data storage. You need to make sure you're doing that in a cost-effective manner, or you'll see a very big bill at the end of the month from whatever your cloud provider is. So we've definitely done a lot of work in that space. And I would say we have, and we still have to continue to do so because it's just something you have to keep working on. You can't just work on this in one month and then say you're done. This whole thing is a continuous effort, and we've made a lot of good progress in that space through some various practices, which I'm happy to dive into more. But yeah, there's still a lot of work that needs to be done.

Benjamin (13:43.21): So let's talk a bit about practices, right? Say I'm a data engineer. My manager is coming to me saying, Hey, like we're spending 500K a year on this data pipeline. Like let's figure out how we can reduce costs here while retaining certain level of data freshness or, or whatever. Uh, like what are tools in my toolbox now as a data engineer? And where can I learn more about that?

Matt Weingarten (14:09.428): Yeah, so the way I approach it is there's two different aspects that I think are critical when you're looking at how to fine-tune these data applications. First side is the storage side. Effectively, almost always, you have to store data somewhere because it needs to be accessible. If you're using, for example, some file system like S3, coming from an AWS background, that's the first one that comes to mind. So with that, you can see some really expensive costs in S3 because by default, all files are being stored in standard storage, which is great because you can retrieve those things really quickly, but it's also the most expensive form of storage. Now, if I'm storing data there from five years ago that I'm only gonna refer to once maybe every few years for auditing purposes, then you don't wanna keep that in standard storage and you wanna put that in Glacier or something like that, which is much cheaper. And when you start to apply that to tens, hundreds of terabytes of your data, that cost can drop really quickly. We saw a lot of improvements just from looking at that. The other side, of course, is compute. Now, compute, there's a lot of progress that's gone on in this space. It just feels like every other week, you're seeing like Databricks or Snowflake or any of the internal tools within any of the cloud companies. They're bragging about how they've kind of helped in this aspect, whether it's having serverless technologies or optimizing compute through some other means. There's been a lot of great development in that space. There's a bunch of different fine points that we've kind of just followed by looking at some of their overall best practices. And we've made a lot of good progress there as well. Just making sure we're using only what we need. That's a big one because sometimes by default, you'll just like copy, clone something and it'll end up being like a 100-node cluster that you could be doing with 10 nodes. So just that type of making sure we're doing things correctly, optimized as much as possible, that's really helped us get to where we want to. Although like I said earlier, still a lot more to come in that space, but a lot of progress to this point.
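
Editor's note: the storage point above can be made concrete with a small sketch. The first part is plain arithmetic showing how the monthly bill changes when a fraction of the data moves to a colder tier; the per-GB prices are placeholder parameters, not actual AWS pricing. The second part is a lifecycle rule written as a dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; the rule ID, the `events/` prefix, and the 365-day cutoff are assumptions for illustration, and the API call itself is left out so the sketch runs without AWS credentials.

```python
def monthly_storage_cost(total_gb, cold_fraction, hot_price, cold_price):
    """Estimate the monthly storage bill when part of the data sits in a colder tier.

    hot_price and cold_price are per-GB-month prices supplied by the caller.
    """
    cold_gb = total_gb * cold_fraction
    hot_gb = total_gb - cold_gb
    return hot_gb * hot_price + cold_gb * cold_price


# Hypothetical rule: move objects under events/ to Glacier one year after creation.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-events",
            "Filter": {"Prefix": "events/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
        }
    ]
}

if __name__ == "__main__":
    # Placeholder prices, chosen only to show the shape of the calculation.
    before = monthly_storage_cost(100_000, 0.0, hot_price=0.02, cold_price=0.004)
    after = monthly_storage_cost(100_000, 0.8, hot_price=0.02, cold_price=0.004)
    print(f"before: {before:.2f}, after: {after:.2f}")
```

Retrieval from archival tiers is slower and can carry its own fees, so this kind of rule fits data that, as in Matt's example, is only consulted occasionally for audits.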

Benjamin (16:06.348): Right?

Benjamin (16:25.622): Interesting. How about things like modeling, right? So tools are the one thing. Sure. We want to optimize our compute. We want to optimize our storage bill, right? These types of things. But then you also have things like, okay, on the serving side, right? Like, do I denormalize or have like a highly normalized schema to optimize for query performance, which should in the end kind of go into spend, like, do you feel like also in the community as a whole, um, people think about these cost aspects enough when they're approaching how to model certain scenarios, for example, or do you think modeling is actually not that important there?

Matt Weingarten (17:02.452): I feel like that one does get overlooked, but that's certainly a very relevant point. I think the primary consideration when it comes to modeling, and sometimes what happens first, is you try to think about your consumers, your stakeholders, how you're modeling that data for them. We often see that you try to stuff a lot of information into one table or like some super big view, because that just makes it easier for consumers to query. They're not necessarily going to be as experienced with joining a bunch of tables together, exploding nested elements, all those things. So you try to keep it very simple. But of course, that can have implications for the cost. But I feel like that definitely does get overlooked. So I think when designing models, if you want to really do it the effective way, you've got to balance your consumers' needs.

Benjamin (17:38.316): Right.

Matt Weingarten (17:58.92): Along with what you think is the proper architecture. Because of course, yes, you could go for some super normalized form, which if you were in a database class in college, it would mark off all the checkboxes. You would get an A+. But that's not always the most business-effective solution. So you kind of have to weigh those considerations and see what makes the most sense for everything.

Benjamin (18:15.423): Yes.

Benjamin (18:21.326): Nice. Awesome. One thing that then often comes up in these cost debates is, right, this idea of data ROI, because it might be perfectly fine for a certain pipeline to cost a whole lot of money if it's providing that value again, with a certain markup, of course, for the organization. Um, do you have any thoughts on that? Right. It's like, cause as a data engineering team, you should be able to kind of say, hey, our data pipelines, or the dashboards we're providing, the analytics we're providing, and so on, are contributing at the end of the day to the bottom line of the company. Um, what's your take here?

Matt Weingarten (19:01.725): Yeah. Yeah, I think we do need to get to a better definition on that. And I know that's something we're trying to enable. One thing that I feel like has been a struggle, and this is a primary question that sometimes we don't even have all the answers to, is who is using our data? We have this available, and we know some of our stakeholders, but do we know about every single one of them? So you can't really go and nail down an ROI unless you can trace that back to everybody who's using your data, for what purposes, and what areas they're contributing to. Our team, we're more of like a middle man, or a middle person rather, for data within the company, in that we provide this for other teams and then they're using it for a variety of analytics-related use cases. But we don't necessarily see into it too much beyond that. So we can't really get a reasonable calculation on what type of value and ROI our data provides without seeing more details along that. Obviously there's a lot that comes out of it, but we kind of really need to draw that all together through proper lineage and kind of just like proper ownership to really figure out like what our value is overall to the company.

Benjamin (20:13.79): Yeah. My feeling is that in many cases, or at least in, well, many is maybe too much, but in some cases in software, that's easier, right? It's like you're a company, you have a prospect, you want to sell your software to, and then there's okay, like here's like three features that are missing to unlock this much revenue, right? And then it becomes very clear of, Hey, okay. Do we deploy the engineering resources to be able to then win that deal in the end for data, which is mostly about improving decisions, right?

Matt Weingarten (20:25.654): Yeah.

Benjamin (20:40.654): Like this lineage aspect just becomes so much harder because a lot of the decisions in the company that in the end you're powering, uh, like you can't directly trace it back to, Oh, this person like looked at that dashboard, uh, during that day and then made a decision, which saved the company, I don't know, like $10 million or, or whatever.

Matt Weingarten (21:02.996): Yeah, yeah, no, I definitely agree. Like, of course, if you're working at, like, your typical SaaS company where you're shipping something to clients, then of course, you can really trace that pretty easily. But when you're working within a bigger organization and you serve different teams from the data standpoint, it's a little harder to put all those details together. So that's why you really need to have that type of collaboration in place with stakeholders, which we've certainly done a much better job of over the last few years, I feel. But yeah, there still needs to be some work to really get to a proper number of, you know, what is the ROI of our data? Because I'd be curious to know that ourselves. I couldn't give you that answer if you asked me.

Benjamin (21:39.618): Yeah. So we talked a lot about kind of tools now, right? Testing data quality, kind of ROI on data, data cost, all of those things. What do you feel is something that's missing in the ecosystem? Right? Kind of, if you started a company today to work on some tool to make your life as a data engineer easier. Do you feel like there's something where there's a lot of opportunity or kind of not a very mature tool yet?

Matt Weingarten (22:09.3): Um, I feel like we've made a lot of great progress in this space within the last few years. Um, you know, one area that first came to mind, and then my questions were kind of answered there, is the area of cost, actually. So, you know, if you're a data engineer, of course you can always think of the days when you would have to debug some slowly performing pipeline, where you'd have to look into the Spark plan, look at all the details, is there any data skew or spill, ask yourself all those questions, and then you would have to put together some approach and then just kind of see what the overall improvement of that is. Then when it comes to costs, you kind of have to do that debugging as well and see if you can optimize your pipelines in any way. But, and this was kind of something that caught me by surprise, at the beginning of, I think, 2022 was when I first heard about it. But there's this company by the name of Sync Computing. I'm not sure if you've ever heard of them before or not, but one thing that they do, Sync, S-Y-N-C. Yeah, well, not like sink the kitchen appliance, but like sync the syncing process. Right. So, yeah, so, you know, I first heard about this tool, I think actually through Reddit,

Benjamin (23:13.034): What was the first word? Like what computing? Sync, like data sync. Okay. Haven't heard of the... Yes, data sync like the data. Yeah.

Matt Weingarten (23:34.476): I use Reddit mainly for work purposes at this point. I'm a loser like that. So, I saw about this tool, but essentially, they could actually look at the logs for your EMR application or your Databricks job, and they could automatically recommend an optimized configuration for it. That didn't really exist before. This was a lot of fine tuning you would have to do. And then of course, you might see some costs that made sense and then you would just leave it be.

Benjamin (23:50.114): Nice.

Matt Weingarten (24:04.128): But of course, perhaps it could even be better. So to have something like that, that can actually apply, you know, that proper modeling and artificial intelligence that didn't really exist in that area, to kind of tell you how you could be doing things better, I think that's a great use case for data engineers, because that was one of the ones that, you know, we had kind of struggled on in the past. Yes, data quality and a lot of those other aspects, tools already existed to answer those questions, or it wasn't as difficult to put those tools together. But especially with cost becoming such a big concern for companies in the last few years, it's great to see that there's some emergence in the space of those sorts of tools. So I guess to kind of answer your question, I'm not sure, I mean, yes, of course, if you really think about it, I'm sure there's a lot of unanswered questions in the data world, but it just feels like whenever you have an unanswered question, you just wait a few weeks and then you find out, oh, this thing is out there or it's coming out. That's how fast this space evolves. So I think you just kind of take it week by week and evolve with the times. What you think are big things at the beginning of the year are going to be completely different at the end of the year. I think going into 2023, for example, did we really know what to expect with ChatGPT, generative AI, LLMs? I hadn't heard of any of those terms before, and now they're everywhere. And all the companies are going to probably emerge into that space this year if they haven't already.

Benjamin (25:22.068): Right.

Matt Weingarten (25:32.224): Yeah, I think it's really exciting. And that's why I said being a part of the data world is so interesting right now. So I guess we'll see what's in store.

Benjamin (25:42.73): Yeah, definitely. I think you mentioning these like cost optimization tools, that's, that's super interesting. Sync, which you mentioned, seems to be mainly focused on Databricks. I know that there's also Keebo on Snowflake, for example, who offer these types of things as well. Uh, one thing I'm always curious about is like,

Matt Weingarten (25:55.161): Yeah, yeah, well... Yes. And I think it's great to have all those aspects covered, of course, because there are companies who go with Databricks, there's companies who go with Snowflake, there's companies who even use both, just depending on what works best in various use cases. So having those tools in place to kind of figure out how you can be doing things smarter is great because Databricks and Snowflake are great tools, but if you don't know how to use them properly, you're going to get hit with a very big bill. They're very good about that. Just as any... You know, cloud application would be, because that's how it's designed. So, got to make sure you're doing it the right way.

Benjamin (26:33.314): Yeah, definitely. Um, awesome. Cool, Matt. So all of this was awesome. Hey, I learned a lot about how you think about data engineering. I think all of the things you mentioned around data quality and cost are on everyone's mind at the moment, right? Like everyone is struggling with these things, working towards solutions, kind of trying to implement things, um, in their own organization there. So hearing your thoughts and how you think about that, uh, was, was super interesting. Any closing words from your end, any things you wanted to mention or bring up?

Matt Weingarten (27:04.54): Uh, no, I mean, thank you. Once again, thank you for having me. Um, one thing that hopefully everybody who, you know, listens to this program knows is that the data community on LinkedIn and kind of just like that whole social media forum is very, very strong. Um, so there's a lot of good thought leaders in the space who put out some thoughts, you know, we've touched on some names there and some companies. But, you know, if anybody has any suggestions or any names, um, or would like to know about any of those names, you know, feel free to reach out to me, I'd be happy to connect you to the right people. Cause I've definitely spent a good portion of like the last two years really trying to expand my network. And that's how I ended up here and in various venues like this. So we're all a very tight-knit community and it's great to see the evolution that this space has had during that time. But yeah, thank you, Ben, for having me. Thank you to the Data Engineering Show. It was great to be on this program today.

Benjamin (27:54.41): Awesome, it was great having you Matt and see you around!

Matt Weingarten (27:57.301): All right, thank you so much.
