This time on The Data Engineering Show, we host Xiaoxu Gao, an inspiring Python and data engineering expert with 10.6K followers on Medium. She’s a data engineer at Adyen with a software engineering background, and she met the bros to talk about why both software and data engineering skills are so important. Without software engineering skills you’ll be limited to the rigid capabilities of your stack. But without data engineering skills you’ll find it hard to be cost effective and see the bigger picture.
Benjamin (00:03.101) Hi, everyone. Welcome back to the Data Engineering Show. We have Xiaoxu joining us today. She's a very well-known data engineering blogger, would you say that's correct, Xiaoxu, and thought leader, influencer. You can tell us all about that in a second. And she's a data engineer at Adyen, which is a financial technology platform, basically, which I'm sure we'll also learn a lot about. Yeah, do you want to say a few words as an intro, Xiaoxu?
Xiaoxu (01:06.814) Yeah, so I actually started my career in 2017. I first started my career as a software engineer, and then I joined an international bank for a few years. And then I moved to a startup as an official data engineer. And then right now, I'm at Adyen, which is a fintech company.
What we do is we do payment worldwide. So as you can imagine, data is everywhere in the company. We need data for reporting, fraud detection, machine learning model, et cetera. And on the side, I'm also blogging, as you mentioned. And right now I'm busy with my first data engineering online course, which hopefully can be finished soon. But yeah, I love writing content and then do knowledge sharing with the community.
Yeah, that's in a nutshell about myself.
Benjamin (02:04.958) So you're hosting the data engineering course basically like you're or you're taking it
Xiaoxu (02:11.026) It's an online, let's say, self-paced set of learning materials. It will be published on a platform called Educative, where you can basically learn things through reading content, not really video, but through reading. So yeah, I'm busy with the final reviews, et cetera, so hopefully it can be finished soon.
Benjamin (02:34.617) Awesome sounds awesome. We look forward to looking at that once it's out. Nice. Sounds great.
Eldad (02:40.327) Obviously if there is any material we can link and share to our listeners.
Xiaoxu (02:44.53) Yeah, thank you, really appreciate it, but not yet.
Eldad (02:47.989) Okay.
Benjamin (02:49.509) We'll just have you back. We'll have you back in a couple of months once everything's ready. So what got you into blogging basically, right? So you have more than 10,000 followers on Medium. Like tell us about that journey kind of going from software engineer to data engineer to then being like a thought leader in this space.
Xiaoxu (02:51.762) Yeah. Thank you so much.
Xiaoxu (03:09.054) Yeah, sure. So I can start with my writing journey. I started writing in May 2020, basically right at the beginning of the pandemic. I always call my Medium blog my pandemic baby. So the thing is, before the pandemic our team had this weekly knowledge sharing session where, you know, everybody shared their insights, findings, products in this meeting.
And during the pandemic, this type of meeting became extremely important because we couldn't really see each other anymore. And that meeting was basically the only moment that we could learn from each other. So whenever it was my turn to do a presentation, I always put in a lot of effort to make the presentation by giving a bit more context, and then giving more examples and preparing a Jupyter notebook so people can reproduce it at home whenever they want. So I remember it all started with one presentation. I was doing the thing about dataclass and namedtuple in Python. I did a session because I couldn't really find the things I wanted online, so I did my own experiments and then I did a session. It went pretty well.
And after the session, I thought, okay, maybe I can share with a larger audience online because nothing was really confidential. And also at that time, since we were staying at home, I was also trying to find a way to be connected with the rest of the world. So I quickly summarized my notes and then published my article in just two days. And yeah, sorry.
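For readers unfamiliar with the dataclass and namedtuple comparison mentioned above, here is a minimal sketch of the kind of difference such a session might cover. The `PointTuple` and `PointData` types are hypothetical and are not taken from Xiaoxu's presentation or article.

```python
from collections import namedtuple
from dataclasses import dataclass

# Hypothetical example types, chosen only for illustration.
PointTuple = namedtuple("PointTuple", ["x", "y"])


@dataclass
class PointData:
    x: int
    y: int


as_tuple = PointTuple(1, 2)
as_data = PointData(1, 2)

# A namedtuple is still a tuple: it unpacks and compares equal to a plain tuple.
x, y = as_tuple
tuple_equal = as_tuple == (1, 2)

# A namedtuple is immutable: assigning to a field raises AttributeError.
try:
    as_tuple.x = 10
    tuple_mutable = True
except AttributeError:
    tuple_mutable = False

# A dataclass is a regular class: mutable by default and never equal to a tuple.
as_data.x = 10
data_equal = as_data == (10, 2)
```

The practical takeaway in this sketch is that a namedtuple behaves like a lightweight immutable record, while a dataclass behaves like an ordinary class with generated boilerplate.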
Benjamin (04:54.873) So this is back when you were at ING, right? And back then you were still a software engineer. So was your blog about software engineering initially, and then you transitioned to data engineering at some point?
Xiaoxu (05:06.398) Yes, exactly. So the first few blogs were written during my ING time. So it was all about Python and those little details about Python, comparing different packages, optimization, etc. So indeed, it all started with the software engineering content. And yeah, after I published my first article, I was really surprised because like...
On the next day, it got so many views. Like, I don't know if it should work like that or not. I was super excited and also motivated. And then I collect my other notes and then I published two other articles in the first week. So that was the most productive week in terms of my writing. And then the more I write, the more I enjoy it. So I kind of like make it as a habit and I start producing more and more content since then.
Benjamin (06:02.357) So you went viral right away basically.
Xiaoxu (06:04.91) I don't know, it's still a puzzle for me. But as I wrote more articles, I realized that not every article would be like that. So it was definitely luck. Maybe other engineers also had that problem at that time, I guess. Yeah, I was really surprised.
Benjamin (06:31.801) Eldad, your mic's not working.
Eldad (06:35.199) Thank you for that, Benjamin. Tell us, so when did you start feeling like, no, actually no, I haven't said anything. When did you start feeling like a data engineer and less of a software engineer, and what did it feel like?
Benjamin (06:39.429) He was talking all the time. He was talking.
Xiaoxu (06:49.502) Yeah, that's a very good question. I actually intended to be a data engineer. So my career change was the moment I moved from ING to DOT, which is the first time I became a data engineer. So the thing is, when I was at ING, I was working with a data integration platform team as a software engineer. And then my main responsibility was to help the team
build the data integration platform from scratch. So although we were busy with the platform work, like setting up infrastructure, writing software, I was also writing software to integrate data into different systems. And as you can tell, it was already a sort of data engineering work, although I didn't really realize that. But then a few years later, I started to read a bit more about data engineering. I'm trying to recall why, but I think it was because of all the news from the cloud providers, or maybe just more people on my LinkedIn starting to have a data engineer title. And then I started to read a bit more and I was really interested in those cloud services. And then I followed some tutorials here and there. I was really surprised because essentially what I was...
doing in the data integration platform can be requested as a service from the cloud providers. And it was kind of mind-blowing for me. I don't mean that it can be technically, like physically, replaced, but on a conceptual level, because we were doing a Kafka cluster. We were building our own Kafka cluster streaming engine. And then we were building our own scheduling system.
And then we were doing our own monitoring, alert, et cetera. All these can actually be requested as a service from the cloud providers. And then I really wanted to get more into that area. So I started to look for like jobs, which uses the cloud services, and then like also provide those official data engineer title.
Xiaoxu (09:10.334) And that's how I, yeah, transitioned from a software engineer to a data engineer.
Benjamin (09:17.921) How, how... Go ahead, Eldad.
Eldad (09:18.003) Would you say that, go ahead Benji, would you say that being a software engineer gave you some advantage in becoming a data engineer? Is that related to the project you've been involved with as a software engineer or it's just your journey? What have you seen out there? What can you share?
Xiaoxu (09:36.55) Yeah, that's a good question. So I think software engineer and data engineer are very similar. So if people are thinking about changing their career from one to the other, it's definitely possible as many people have done this in my friend circle and I'm also an example. But at the same time, they are like also very different. They have their own strengths and skills.
and they can also learn from each other. And for me, I feel like there are two things which really helps me in my data engineering journey. One is that as a software engineer, we have this mentality of building things from scratch. We love coding and we love writing unit tests to make sure our software works perfect. But.
The testing part is not really a standard in data engineering, especially the unit testing part. That is what I felt. People would write more data validation tests as part of the production pipeline rather than unit tests as part of the CI/CD. But when you start to create really complicated transformation logic, you need unit tests to make sure that you know what you have written and that you are confident in your code. And when I was at DOT, we were mostly using SQL to write transformation logic. And writing unit tests for SQL was such a pain. But I did see the value of doing that. And also, we were using dbt, so everything was in SQL. So I came up with this unit testing framework in dbt that allows us to do unit testing
in dbt, which worked really well in the end. This is one, let's say, advantage I can see: we can introduce more software engineering best practices into the data engineering field, which makes it more robust and more correct. And I'm also happy to see that dbt will soon support unit testing natively as well.
It is also a trend, and it means that people want to introduce more software engineering best practices to data engineering.
Benjamin (12:09.253) So when you're talking about unit testing, dbt, then like, this is something like, okay, here's a part of the pipeline. Then here's some input data. Here's some output data and kind of making sure that the transformation works as expected.
Xiaoxu (12:16.988) Exactly.
Yeah, it doesn't really run in the production pipeline. It's a task next to the regular data transformation, but it runs in the CI/CD. And it can block the release if something is wrong there. So indeed, yeah.
Eldad (12:37.263) Benji, it's like user spinning up engines.
Xiaoxu (12:40.746) Yep.
Benjamin (12:41.197) And it runs on the same data warehouse that you're running on anyways, just with like a then significantly reduced data set basically.
Xiaoxu (12:51.398) So for the unit test, I usually prepare my own input and output. So I don't usually use the production data for unit test because one, it can change, and two, it's usually quite big, the data set, so it can take a longer time. So I prepare my own data set and I also know what is testing in the data set. So I have more control over it.
Benjamin (13:17.213) But it runs on the same system. So if you're running on BigQuery, for example, your unit tests would also then be a dbt job running kind of on BigQuery, orchestrated by whatever CI system you use.
Xiaoxu (13:19.102) Yeah, run another system. Yeah.
Xiaoxu (13:25.351) Exactly.
Xiaoxu (13:29.726) Yeah, so for me, it will create a data set dedicated to unit testing in BigQuery, for example.
Benjamin (13:37.073) That's good. So we talked a bit about then, I love this angle by the way, kind of your like software background, kind of giving you like allowing you to think about testing and all of those things kind of maybe in a rigorous way. Like you finished university in 2017, right? Like how would you say your like traditional computer science degree in a sense prepared you for like a data engineering job now? Like, do you think universities are doing a good job here? Do you think...
It didn't help at all. What are your thoughts here?
Xiaoxu (14:08.99) Yeah, I think university did help me because in my bachelor's I didn't really study computer science. I was studying electrical engineering. So it was more like on the hardware side. Master program was the first time that I started to know about computer science and learn coding, et cetera. So it definitely helped me for the long term. But of course,
what we learned in university was very, let's say, low level and very, let's say, basic. And a lot of things I still learned through the work itself. So learning by doing.
Eldad (14:52.996) Life, Benjamin.
Benjamin (14:54.641) life.
Xiaoxu (14:55.44) Yep. Learning by doing. Yeah.
Eldad (14:56.243) You don't learn much at school. But it's great, it's good for the CV. Where did you study?
Xiaoxu (15:06.282) Where? My bachelor was in China in Shanghai and then my master was in the Netherlands in Delft, Delft University.
Benjamin (15:19.105) And you fell in love with the Netherlands, and now you're in Amsterdam. That's awesome.
Xiaoxu (15:23.834) Yeah, except for the weather. The weather is crazy these days. But yeah, for the rest, so far so good.
Benjamin (15:26.449) Thank you.
Eldad (15:29.223) You know, Amsterdam has a very long tradition of innovation on databases. CWI in Amsterdam, MonetDB, like a lot of stuff, DuckDB.
Benjamin (15:33.729) Peace.
Xiaoxu (15:41.234) And also DuckDB, and Python as well. Yeah, Dutch people like to program.
Eldad (15:46.655) Yes, yes.
Eldad (15:52.719) Wait till you visit Munich, by the way. But it is, yeah, there is a lot of innovation coming from there. We have a big office in Munich and we love those places. And so, yeah, it brings back warm memories. You've mentioned bringing software engineering practices to data engineering. Is that a good thing?
Xiaoxu (15:54.908) Okay.
Eldad (16:21.827) Is there a way to actually rethink software engineering practices, given the fact that data engineers now interact a lot with new stakeholders, with users, with the business? I've always thought about data engineering as an evolution in many ways of software engineering, because you're much closer to the final outcome, the value of the stuff you're building.
Is it changing? Are we still kind of doing the same data engineering stuff all over? Or what do you see there? Are there new trends coming on how data engineering should evolve? Or is it basically, yes, we trust software engineering? Because if you take a subset of the data and run a unit test on it, will that actually reflect what happens when you run it in production? The reason I'm mentioning it is that we get those questions all the time. And frankly.
We don't know what to answer. So usually we don't try to intervene in how people perceive the data lifecycle. We just try to build a product or kind of to support their journey. There's so many ways to get things done right, uh, with data. But I was wondering, are we trying to rebuild our software engineering stack? Or are we also going to improve it or simplify it so we can have a bigger data engineering community because it's a lot about the community as well.
Xiaoxu (17:46.014) Yeah, so what I see is they are not really conflicting. I feel like the software engineering skills only gives me benefits rather than, let's say I had a stereotype on something and I couldn't accept what is being done in the data engineering side. But well, to be fair, when I just...
became a data engineer in the first two months, I was indeed really surprised by the actual work the data engineer was doing. Because before I was doing this low level programming, hardcore everyday, but when I just switched, I actually spent most of the time like reading and understanding the systems rather than coding. And...
At the end of the day, what you need to do is maybe just change a few configurations. And for me, this change at the beginning was a bit strange. I used to do this coding, I don't know how many lines per day, this type of thing. But in the end, it was just like, okay, changing it. Yeah, like changing a YAML file or clicking a button in the cloud. But then the more I learn, I feel like there's another dimension in it.
Eldad (18:52.939) PRs.
Xiaoxu (19:08.046) the theory and how the system works under the hood, which is also very interesting for me. So that's why in the end, I also really enjoyed this work. But coming back to your question, how these software engineering skills influence data engineering, another point I can see is that
the software engineering skills bring a lot of potential to a data engineering team, because of course we have the modern data stack that we can leverage without reinventing the wheel or without having a lot of engineering effort. But sometimes we are also constrained by these toolings. And when we are constrained, then we need people with coding skills to expand their capabilities.
And I can give you one example. When I first joined, the team was trying to make a connection between dbt Cloud and Airflow. At that time, the dbt Cloud operator was not there yet. So I implemented our own custom operator and stuff like that. In the end, it worked well. And of course, a few months later, the operator came out and then we replaced that one with the official one.
But we could make this possible a few months earlier because we had someone in the team who could do this for us. So I feel like a data team should have at least one data engineer with a software engineering background because they can really bring a lot of potential to the data team. Yeah, this is how I feel about it.
Eldad (20:55.531) Absolutely, absolutely. Data apps, data platforms, they run 24-7. They're very unpredictable, even though we think of them as totally predictable with the modeling, and we've tested everything. But they're not. They serve other users, and those users do take those data apps to their extremes. And you're right. Keeping business continuity by having this depth in the team.
even though it's not needed on a daily basis, like a software engineering team, that can make the whole difference. And I've seen that many times spot on.
Xiaoxu (21:32.906) Yeah, and another good example is the on-call culture. A while ago, I also wrote an article about it. In the software engineering world, we do a lot of on-call duties because those software systems are mostly like APIs. They are exposed to the outside world. And if the API is down, then we need to be on duty, et cetera. But in the data engineering world, the on-call
culture isn't really a thing there. But at the same time, I feel like we do need to have this culture. It doesn't mean that we need to be on duty 24-7. If a dashboard breaks, probably it doesn't really matter. But it's more about how data engineers should handle on-demand requests, should handle unplanned work, should handle all those kinds of requests from the other teams, like incidents.
On that point we can learn a lot from how software engineering teams do on-call, for example.
Benjamin (22:43.177) I loved your comment on the dashboard being broken not being important. Our last guest, Wim Wischwichte, said he never saw a dashboard that had positive ROI, so it seems like everyone's...
Xiaoxu (22:46.102) I'm sorry.
Eldad (22:49.395) Hahaha
Eldad (22:53.28) He never saw a dashboard that's not broken. Even if it returns results, it's deep inside broken.
Xiaoxu (22:56.034) Ha ha!
Xiaoxu (23:00.808) Well, that's a true story everywhere, I guess.
Benjamin (23:07.229) The consistent data engineering show theme. No one cares about dashboards. Nice. I love that. I love that angle. So one thing I'm curious about is you have quite a journey behind you now in terms of data engineering. How do you notice today that you've actually gotten better? So for a software engineer, you get better at designing clean interfaces, getting into new code bases, all of those things.
What does growth actually look like as a data engineer to you? Like, how do you know that today you're better or more senior than you were a couple of years back?
Xiaoxu (23:43.872) You mean like how do I know if the data becomes better or I become better?
Benjamin (23:49.361) with you as a data engineer, so like kind of personal growth as a data engineer.
Xiaoxu (23:53.966) Ah, okay, personal growth, not the data itself, okay. Yeah, that's a very good question. I evaluate myself from different dimensions. One is, let's say, the landscape I know of the data technologies, although I know there is so much stuff out there. It's basically impossible to catch up with everything. But I see a few...
core technologies which are really important for me. And I try to learn them and go deep into them as far as I can. So for example, the cloud stuff, like you don't really need to learn every single cloud provider, but getting production experience with one cloud provider can really help you learn a lot as a data engineer. And also another thing is Spark.
I found Spark a really interesting and powerful data processing engine. This is also what we do a lot at Adyen, like we have a really big Spark cluster, and that is my first time learning Spark as well. So there is so much new stuff out there as well. And another stack is Airflow, like the OG.
for data orchestration, you must master it. There's no question. And among the other data transformation, let's say, tools, I love dbt. I feel really lucky. When I joined Dot, which was my first data engineer company, I was exposed to most of the tools that I wanted. Even now, I learned cloud, dbt, and Airflow, etc.
So one is on the tooling side and another dimension is that I think in the end we try to make data as a product. So there's also a lot of connection with the business, like how I would make sure that my data fit its purpose, how it helps the business, how it impacts the business. This is something that I'm trying to.
Xiaoxu (26:09.93) yeah, be better on it as well, either by developing toolings. Like I was dreaming about this data status page. I did a POC when I was at DOT. So basically, every data pipeline or data product has its own SLA and SLO. And then we have a status page which can show the status. And then we can communicate with the stakeholders. So either by developing toolings or building my own
soft skills, talking to stakeholders and being better at storytelling, etc. So yeah, two parts. One is on the technology side and one is on, let's say, data acceptance. Yeah, and also improving the data quality so the data can be better used by the users.
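The data status page idea described above (each data product has an SLO, and a page shows whether it is currently met) could look roughly like the sketch below. It is not a reconstruction of the POC Xiaoxu built; the `DataProduct` class, the freshness-based SLO, and the product names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataProduct:
    name: str
    freshness_slo: timedelta  # assumed SLO: maximum age of the last successful run
    last_success: datetime


def status(product: DataProduct, now: datetime) -> str:
    """Return 'OK' while the last successful run is within the freshness SLO."""
    age = now - product.last_success
    return "OK" if age <= product.freshness_slo else "SLO BREACHED"


def render_status_page(products, now):
    """Produce one human-readable status line per data product."""
    return [f"{p.name}: {status(p, now)}" for p in products]


# Hypothetical products and timestamps, fixed so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0)
products = [
    DataProduct("daily_revenue", timedelta(hours=24), datetime(2024, 1, 1, 6, 0)),
    DataProduct("fraud_features", timedelta(hours=1), datetime(2024, 1, 1, 9, 0)),
]
page = render_status_page(products, now)
```

In this sketch `daily_revenue` is six hours old against a 24-hour SLO and shows as OK, while `fraud_features` is three hours old against a one-hour SLO and shows as breached, which is the kind of signal stakeholders could read without asking the data team.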
Benjamin (26:59.365) So those like business skills of understanding how your data is used, kind of understanding what value it provides and so on. Would you say getting better at that is transferable between companies or like it gets hard reset every time you switch companies?
Xiaoxu (27:04.364) Yeah.
Xiaoxu (27:13.022) Yeah, it definitely gets better. So when I first joined DOT, it was like a small data team. We had six people, a data engineer, and then a few analytics scientists. So the number of stakeholders was not that large. So for me, it was more about learning new technologies rather than, let's say, scaling my business skills. But now I moved to a company which has 250 data people and a lot of stakeholders, because I work in a product team. So in a product team I need to talk to my stakeholders on a daily basis. So I really need to understand their feelings and to make sure that what I do fits their purposes, and also talk to the other data people within the company, so the scale is much bigger. So I feel like in my current role I'm building my business skills much more than the technical skills.
Eldad (28:18.223) By the way, I've heard, I've seen, and I know people that actually also moved to product. So they started in software engineering, they went to data engineering, and then they ended up in product because data engineering puts you in front of users much more frequently than software engineering. As you said, it puts you in front of stakeholders, you find yourself talking product and doing product-led growth. And the product is the business. So
more and more companies operate like that. I think this is the big revolution in how we run businesses. We see a move from consensus over dashboards to really driving the business with a data product. And I think that's where data engineering will be pivotal, and you're one of them. To me, at least, you're the perfect definition of the future of data engineering. Really, I'm sorry. I have to say it.
Xiaoxu (29:11.531) Oh, oh, oh my gosh.
Eldad (29:16.14) I agree with everything you're saying and with kind of your philosophy and mindset.
What can I say, I hope many more will follow
Xiaoxu (29:22.834) Yeah, but I also. Yeah, thank you. I also agree with what you say, that some data people will become product people, because in the end, like you said, right now I'm in a product team. So although I'm a data engineer, I need to know a lot of stuff about the product. So my team is doing reporting for the financial controller. So I need to know all the financial products that we do at the company, and all that business logic is maintained by us. So maybe one day when I get enough business knowledge, then I can easily switch to a product team and then do product. Yeah, it's definitely possible.
Benjamin (30:06.441) Awesome. Sounds good. Xiaoxu, any closing words on your end?
Eldad (30:06.527) Sounds good.
Xiaoxu (30:14.243) Yeah, so maybe, we were talking about how I switched from software engineer to data engineer. But maybe some of the audience want to do it the other way around, like switching to a software engineer from a data engineer. And they might wonder how their past experience can help them in the future.
So maybe two closing remarks on that. One is, I think as a data engineer, we have very strong end-to-end ownership because we don't just look at one single component, we look at the entire chain from end to end. And I feel like the end-to-end ownership sometimes is lacking in a software engineer. But having that really helps, because we've been discussing data contracts. It's mostly a contract between a backend team and a data team. But it can also be between a backend team and a backend team. So having that close connection between your upstream and downstreams can help you and help the company prevent a lot of issues. This is one, let's say, advantage. And another, maybe not that common, is that if you work with cloud, you probably, let's say, know a lot about the cost optimization side. And you can also bring this mentality to a software engineering team by optimizing the resources, and having this kind of mindset can also help the team grow as well. So it works both ways. Because it's never the case that one is the subset of the other. People can always switch around and learn from each other.
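A data contract of the kind mentioned above, an agreement between an upstream producer and a downstream consumer about what a record looks like, can be sketched in a few lines. The contract fields and the `contract_violations` helper below are hypothetical; real contracts usually also cover semantics, ownership and SLAs, which this sketch leaves out.

```python
# Hypothetical contract: field name mapped to the Python type the consumer expects.
PAYMENT_CONTRACT = {"payment_id": str, "amount_cents": int, "currency": str}


def contract_violations(record: dict, contract: dict) -> list:
    """Return a list of human-readable violations; an empty list means the record conforms."""
    violations = []
    for field, expected_type in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return violations


# One conforming record and one that breaks the contract in two ways.
good_record = {"payment_id": "p-1", "amount_cents": 1250, "currency": "EUR"}
bad_record = {"payment_id": "p-2", "amount_cents": "12.50"}

good_result = contract_violations(good_record, PAYMENT_CONTRACT)
bad_result = contract_violations(bad_record, PAYMENT_CONTRACT)
```

Running a check like this on the producer side, before data is published, is one way the upstream and downstream connection described above can catch a breaking change early.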
Eldad (31:59.619) Impossible. Data engineers have cost consciousness embedded in their DNA. Software engineers, they work in budgets and come and have a lot of software excuses. Just kidding. I love software engineers. Thank you. Thank you for the insightful closing comments.
Xiaoxu (32:05.474) Hehehehe
Xiaoxu (32:09.331) Yeah.
Benjamin (32:19.237) Definitely. Thanks. Thanks so much for being on the show. We had a great time. Enjoy your evening. And yeah, we look forward to reading your learning course soon.
Xiaoxu (32:30.334) Yeah, thank you so much. Thanks for the podcast. Thank you.