Diving Into GitHub's Data Stack
October 21, 2021


It’s the mother of all development projects. You use it daily, and so do 65M developers around the world. This time on the Data Engineering Show: a deep dive into GitHub’s data stack. Arfon Smith and KimYen (Truong) Ladia share GitHub’s data engineering challenges and solutions and explain why every developer should know and adopt the ADR protocol.


Boaz: Welcome everybody.

Boaz: Hi, Eldad.

Eldad: Hey, Boaz.

Boaz: We are here with another podcast of the Data Engineering Show. That's the name of our show? Right?

Eldad: Yes.

Boaz: This is it. And with us today, we have two great guests. We have Kim Ladia and Arfon Smith, both from GitHub. So, thank you, Kim and Arfon, for joining us.

Boaz: Welcome our guests.

Eldad: Welcome, welcome everyone.

Boaz: So actually, you know, we are very excited to have you on the show. When we were discussing this in advance, we wondered how GitHub deals with data from an engineering perspective and, you know, how the way we work with software has trickled into the way we work with data. That is an interesting topic in itself, but for starters, how about you two introduce yourselves? Kim, let's go with you first. Tell us a little bit about yourself. What do you do at GitHub? And then we will switch to Arfon.

Kim: Hi everybody. I am Kim. I am a software engineer at GitHub, and I have been at GitHub for almost two years now and I am working on the data platform team. Before joining GitHub, I was helping to build a small data warehouse for a nonprofit.

Boaz: Arfon, now it is your turn.

Arfon: Yes. Hi, my name is Arfon Smith. I am a product manager for data at GitHub, and I support data engineering, data science teams, a whole bunch of, sort of, data-focused teams internally. I have a background in astronomy, so actually, previously, I was running a data archive on behalf of NASA for the Hubble Space Telescope in Baltimore. Before that, I was actually at GitHub. So, I seem to have liked working at GitHub. This is my second time at the company. I was at GitHub back in 2013 for about three years as well.

Boaz: Awesome, awesome. So, we want to dive into, you know, the data world. What do you guys do? Let us start with your data stack. Tell us about it and the kind of challenges it is intended to serve.

Kim: Okay. So, our data stack: at the storage layer, we use Azure Data Lake Storage, and then we use Azure HDInsight, the managed Hadoop distribution service. For the catalog, we have a metastore for transformed data. For ingesting data, we use Airflow. We have people using AML for the machine learning pipeline. For compute, we have Presto, Spark, and Azure Data Explorer. And for BI, we have some custom tools for individual customers, like Product 360, which you will hear more about later, and we have more general tools, like Looker and Power BI, as well. So, that's a lot.

Boaz: Yeah.

Kim: We have different things going on.

Boaz: Maybe I should have asked what you don't have.

Eldad: S3, no S3 buckets.

Boaz: I think you're the first guests who are not on AWS. We have done, I think, six episodes or so so far, and all have been on AWS. I wonder, you know, if we can talk a little bit about Azure. Were you always on Azure? Was there a transition at some point? Was AWS on the table? If you could share some of the thought process there.

Kim: Yeah, we used to be on AWS, and we made the move to Azure recently. And during that move, we were not just moving from one cloud to another cloud, we were actually changing the kinds of services that we use as well. On AWS, we were operating hundreds of nodes in the Hadoop cluster, and it cost us tons of money. When we moved to Azure, we started using Azure HDInsight, which manages all of that Hadoop infrastructure for us, and we pay for the service rather than paying for the machines. So, it is a lot better in terms of the bill we pay. So, yeah.

Boaz: And was it a complete transition as in Azure only at this point or do you still keep both in, sort of, a multi-cloud strategy?

Kim: The whole data warehouse is fully operational on Azure right now.

Arfon: That was a big lift for the team.

Eldad: Who was the person who deleted the S3 bucket? Kind of, when all the tests have passed, someone must have pressed the button, right? “Drop bucket,” which probably still runs.

Boaz: Terrifying moments, probably. How long did the transitional process take?

Kim: There were a lot of transitions to carry out. We had a lot of things to move, so we might take a quarter to transition one thing and keep everything else the same. And then, every quarter, we moved another product over. I was not involved in the beginning, but I was involved in moving Airflow at the very end. So, yeah, we took at least 3 months to move Airflow from operating on AWS to Azure.

Arfon: Overall, it took over a year to move. I think it was a long time. It was a big lift.

Boaz: Are all the mental scars from that transition healed by now? You know, people typically dread these things. It sounds tough. It typically is tough, but you have done it and it worked well, so kudos! Awesome! Let's talk about all the people involved with data. Could you tell us how the data teams are structured? You know, which software teams deal with data? Are they data engineering teams, data science teams, or everything in between? We would love to understand how that works at GitHub.

Arfon: Yeah. So we have a fairly new structure, actually. So, I am going to talk about what we have just reorged to and maybe about why we have made that change. We have a data platform team, which encompasses, sort of, a collection of data engineers; the way we think about that is that they sort of run the common fabric of the data warehouse. So if you are a team that wants to provision storage or, you know, some kind of infrastructure for your service, then they sort of build that paved path for provisioning compute, storage, role-based access control, that kind of stuff. That team also owns the BI experience. So, the tools that you would use as a member of staff to interrogate the warehouse and run your own self-service kind of questions. And then, we have a collection of what we call “verticals.” These are teams that are sort of close to full stack in terms of skills, with analysts, data scientists, data engineers, and their managers, and they are typically focused on product areas. At GitHub internally, the way that we build the product, the sort of strategy around the service that we are building, is broken into a few different verticals. And so we have, effectively, data teams aligned with those. And then, there are other data teams that are not within this core unit, which is sort of a centralized data team for the company. There are other verticals that are outside. They are sort of more like satellite data teams, maybe in revenue, sales, that kind of thing. They are specifically serving their customers, who have different types of questions. So, this is, I think, an idea that has been pretty hot recently, the data mesh idea: building products for a particular set of customers, really owning that relationship, and having a long-term relationship with the teams that you support.
So, we have been working this way without the reorg for about a year now, and it is working pretty well. And so, we sort of fully embraced this model. Actually, just late last week, we sort of finally dotted the i's and crossed the t's.

Boaz: I mean, you made a lot of changes recently: moving to Azure, reorging the data teams. I wonder what is next in terms of change.

Arfon: Wow. It is stability, I hope. So, you know, I think part of what we have been doing is kind of growing up, like reorging the structure to be ready for the next level of growth of the business. The demand for data is very high, and I think we have found it a little bit hard to support some areas of the business traditionally. And so, you know, especially with a centralized data team, it could be hard to prioritize across the whole business. So now, we are sort of saying, this is a team that is focused on this area of the business or the product, and if it needs more capacity, then it should go and get headcount and fund it that way. So it gives us as a data team a little bit more of a logical scaling unit.

Boaz: From a headcount perspective, across all the teams we have now covered, all the people involved with data, how many people are we talking about?

Arfon: So, in terms of data engineering and data science, I think we are probably at about 35 or 40. Not very big, I think, compared to other companies. That excludes all of the people who store Git on file servers; that is all completely separate. So of course, there is a whole data infrastructure team, a set of teams around running large-scale distributed Git, all the application servers, all of that stuff. And in fact, system observability is also separate from that. So, if you think about it without those headcounts, I think it's about 35, maybe 40 now.

Boaz: So let’s get back to the two of you. Now that you have mapped out the structure of all the people involved with data, back to you guys. How do you fit in with that? You know, both of you have different roles. Kim, you are an engineer, and Arfon, you bring sort of a product hat. So, if you could elaborate on what you do within that.

Kim: Yeah, so for me, I have been on the data platform team from the beginning so this reorg doesn't actually affect me.

Boaz: Got it. Okay. Can you share one of the recent projects that was more memorable or exciting for you, beyond transitioning Airflow to Azure?

Kim: It is more of transitioning Airflow from one version to another version. So, Airflow is an open-source product. It is continually being developed, and there are new features right now, and we do want to keep on top of all the bug fixes. So, the most recent project for me was just making sure that we have Airflow 2.0 running.

Boaz: Within, you know, the data platform team, what is the distinction between what a software engineer would do and what a data engineer would do?

Kim: I think all of my team members are software engineers, and we consider ourselves data engineers. We solve any problem that our users have or come to us with, from operating the platform for them, to helping them migrate from one technology to another, or even building new features. For example, for Airflow, I have been building the backfill tools that the open-source project does not have, to meet the needs that we do have.

Boaz: And Arfon, let's get back to your responsibilities.

Arfon: I have sort of, a bit of, a weird product manager role, I would say. So, we have lots of customers for data internally.

Boaz: Arfon, you're not weird. You're super cool. Come on!

Arfon: You're very kind. It is a bit of a mixed bag, honestly. We do have some products that we serve customers with. For some product teams, we have this service called Product 360, which is a thing we built with software engineers and data engineers. It is a data product that teams use internally to understand common engagement metrics: acquisition, retention, churn, that kind of thing, around how people are using different parts of GitHub, the product. So, sometimes this is quite traditional product work, where it is about figuring out what we should do, managing that backlog, and understanding what the product should do and who the customer is. But then, there is also a much more general kind of work around all these potential customers we could serve internally. Where is the lowest-hanging fruit? Where is the highest return on investment? We have quite a lot of autonomy about how we spend our resources. We have got 15 data scientists. What should they work on? You know, I think that is a really good question. And so sometimes I spend my time developing ideas with other teams, getting work to the point where it's ready for a data scientist to go invest some time, where you are sort of doing speculative work. We think there is some potential here. We should spend some time doing some R&D to mature an idea, to then go to a product team and say, “Hey, look, we think we could, you know, use data to supercharge our product in this way or something.” And so, it is a mixed bag. It is pretty varied, and I enjoy that.

Boaz: It is very interesting. How many pure product managers or product people are there that have a data-specific role?

Arfon: It is just me. So, we are hiring. We just hired a second person, I am delighted to say, because there is a lot. One PM to 40 engineers is way off a good ratio from my perspective. So, just two.

Eldad: So now you have high availability at least.

Arfon: Yes! Yes!

Eldad: What about metadata, by the way? Like, what is the most amazing metadata you have been playing with? I am thinking GitHub, I am thinking it is at best semi-structured. How does that turn into a Looker dashboard? If you can share something on the metadata side.

Arfon: Yeah. We have a collection of data streams that we put in the warehouse. We have a thing we call hydro, which is, I think, just Kafka, you know; a collection of services write events there. This is the paved path if you are a product team. If there is a particular part of the product you want to instrument, to measure some behavior in the product, then you write an event in the product. There is a pretty good abstraction as a developer for writing there, in Ruby or in Go; you know, there are mature client libraries that developers use. And anything that is in hydro, in this Kafka stream, ends up in the warehouse automatically, and then we have got, you know, query tools on top of that, Presto or something else. And then, we have a lot of DAGs that run and turn those into things that are more consumable. So, yeah, I think a good example of where we have had to invest a lot is something like security scanning as a product: I have a vulnerability in my project, or Dependabot, that kind of thing, where there is some dependency that I have in my project and I want to be alerted if, you know, there are some vulnerabilities. We care a lot about how successful that product is: whether people are responding to those alerts, whether they saw them, whether they did anything, whether they resolved the alert by pushing new code. And so, building a set of dashboards that the product team could use to make data-driven decisions about how well the product is performing took a lot of data engineering work to design those tables, collect the right data, and build a performant data model that they could query easily. Because there are a lot of Dependabot alerts, and there are a lot of vulnerabilities that get exposed and that people get alerted to, there is actually a lot of work to make a single part of the product understandable for a product team.
So, yeah, that is probably a good example of a uniquely GitHub problem. Yeah.

Boaz: What are the key architectural lessons learned? I mean, you know, when you did the shift to Azure, there was probably a huge architecture redo in a lot of places. Can you talk about Hadoop and how you changed your approach there? What are the kinds of things that you do not miss from the old architecture and are smarter about now?

Kim: One of the bad things that I do not miss is operating EC2 instances. I do not like having to check whether they are dead or alive, whether they need a restart, or how healthy they are. When we moved to Azure, we also adopted Kubernetes and Terraform. We use Terraform to deploy our infrastructure, and we use Kubernetes to run our containers. So, it is a lot easier to bring up a new container or deployment. It takes less time for a user. So yeah, those are the two big things that I am really happy that we are relying on.

Boaz: What data volumes are we talking about at GitHub?

Kim: About 40 storage accounts, and they range from a couple of megabytes to a couple of petabytes, and they hold different data sets. So, it is really the topic or the kind of data in the storage that determines how big they are; but yeah, we have a wide range of data set sizes.

Arfon: One good example is the hydro topics. I think they are the most voluminous. About 4 billion events per day get written to the warehouse, just to give you an idea. I think that is our biggest hydro topic, pretty sure.

Boaz: What are some of the use cases that are more real time in nature that are running today?

Arfon: This is a satellite data team, so in that sense this isn't us, but platform health would be a good example. So, spam protection. Those teams use these hydro events, you know, very low-latency system events, and they are running their own ML algorithms on the fly to detect spammers and quickly remove them. One of the things that I think is pretty awesome about the GitHub experience is that it is really rare to see spam, and that's amazing. That team is incredible, and they are just laser-focused on keeping spammers off the platform. So that is a real-time thing. I think there is also all the stuff that other platform teams do around availability. GitHub still receives a lot of denial-of-service attacks and stuff like that. That is not our world, but I think those are some real-time things that are pretty important.

Boaz: It does sound like, you know, at GitHub, there is a culture around being data-driven. I mean, there are so many initiatives going around data, data teams are embedded within different services. Is there any standardization of how to approach becoming data-driven within, sort of, new services being launched or a new initiative or is it all very different across different teams with their own analysts and data people?

Arfon: I think this is part of the challenge of the work that we do. I think I am correct in saying that product operations as a concept is quite new at GitHub. This idea of what you should do when you launch a new product, right? Like, there is a playbook, a sort of due diligence. I mean, I think we are good at making sure the docs are in a good state, but what sort of instrumentation should you do in the product to measure success? Do you have success metrics? All that kind of stuff. I think, actually, GitHub is getting much better at that, but that is not a historical strength, I think. So, I think there are playbooks now. We built services, like the Product 360 service. Really, all you have to do as a product team is change a configuration file, and then we effectively ingest that as part of the DAG when it runs in Airflow. We ask product teams to define engagement with their product, so if a customer is active in these areas of the product, then that counts as engagement. This counts as a contribution. This counts as acquisition, that kind of thing. And then if they do that, they get all these sorts of free dashboards that are built every night. And so I think our general tactic is to try and encourage people to do the right thing by building them good tools, so that if they just follow the playbook, you know, they get carrots. But more and more, I think the expectation is that a product manager can talk about their product, can talk about how many users they have, and can make intelligent comments about, you know, users that are being particularly successful and those less so. And so I think the culture of being able to reason with data is much, much stronger at GitHub today than it was, let's say, five years ago. So I think you are being very nice when you say it sounds like we have a very strong data culture. I think we are growing a stronger data culture.
I think traditionally observability has been amazing at GitHub. The whole ChatOps thing and the ability to really manage a service from a chat channel was, you know, foundational in GitHub's engineering culture. And that's a decade old. GitHub has been doing that forever.
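
The configuration-driven approach Arfon describes for Product 360 can be illustrated with a small sketch. This is not GitHub's implementation: the config layout, the event names, and the function `count_users_by_metric` are all hypothetical, and the real service runs as part of an Airflow DAG against the warehouse rather than over an in-memory list. The idea shown is only that a product team declares which events count as engagement, contribution, and so on, and a shared job turns that declaration into per-metric user counts.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical configuration a product team might own: for each metric,
# the product events that count towards it. All names are invented.
PRODUCT_CONFIG: Dict[str, List[str]] = {
    "engagement": ["view_alert", "dismiss_alert"],
    "contribution": ["merge_fix"],
}


def count_users_by_metric(
    config: Dict[str, List[str]],
    events: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Count distinct users per metric from (user_id, event_name) pairs."""
    # Invert the config so each event name maps to the metrics it feeds.
    event_to_metrics: Dict[str, List[str]] = defaultdict(list)
    for metric, event_names in config.items():
        for event_name in event_names:
            event_to_metrics[event_name].append(metric)

    # Collect the distinct users seen for each configured metric.
    users = {metric: set() for metric in config}
    for user_id, event_name in events:
        for metric in event_to_metrics.get(event_name, []):
            users[metric].add(user_id)

    return {metric: len(seen) for metric, seen in users.items()}


sample_events = [
    ("u1", "view_alert"),
    ("u2", "view_alert"),
    ("u1", "merge_fix"),
    ("u3", "unrelated_event"),  # not in the config, so it is ignored
    ("u1", "dismiss_alert"),
]
print(count_users_by_metric(PRODUCT_CONFIG, sample_events))
```

The design point is that the product team only edits the configuration; the counting logic stays shared, which is what lets every team that follows the playbook get the same dashboards.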

Boaz: Let us spend a couple of minutes on observability, even though it's a decade old. Tell us a little bit more about that.

Arfon: I don’t know. I can tell you some. We don't own that, but I know we use services like Centrify, Splunk, that kind of thing. Those are services I know about. I know we are moving towards OpenTelemetry for logs and metrics, that kind of thing.

Boaz: I actually, yeah, I caught that on your engineering blog.

Arfon: Yeah. There was a nice post about that recently. One thing I would say, though, is that we have a divide in the way that we think about that data, which is not good. Observability is seen as an engineering problem, and, you know, metrics and instrumentation are seen as a product thing. In some sense, it doesn't make sense to think of those as separate data streams. You know, product managers could build a dashboard from logs if they wanted to, but they traditionally haven't.

Eldad: This is a great point, actually. We see that a lot. Engineering needs data to figure out how to build something, and product needs data to figure out what to build; and when they collide, there are usually sparks, you know, because each side just looks at its own needs and it is really hard to understand the other side. So engineering will say, we cannot wait for a full-fledged data warehouse, we cannot wait for this data pipeline, and we need it now, now, now. And product will say, well, we have figured out most of the stuff on how to do it. Let us figure out what to do. I think we are entering this kind of era where, as you said, those two parts will eventually converge into one big stream where everyone can take their part. But it is interesting to see that. We see it all the time. Even internally at Firebolt, there is a space race between engineering and product over who can build a better system.

Boaz: We managed a big list of things that usually go through the most horrible grunt work. So what is the worst grunt work in your day-to-day job that you wish could magically disappear?

Kim: That would be Python dependencies again. Not waking up in the morning and figuring out that something broke because of a transitive, transitive, transitive dependency that we did not pin the version on.

Eldad: Say no to Python, just kidding.

Kim: I have been using Python for a while, and it has been the single thing about Python that I have complained about over and over again.

Boaz: It is a love-hate relationship, I guess.

Kim: Yeah.
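
The pinning problem Kim describes can be guarded against with a small check. What follows is a minimal sketch, not GitHub's tooling: it assumes dependencies live in a fully resolved, requirements-style file that lists transitive packages too, and the function name `unpinned_requirements` and the sample lines are hypothetical. It flags any requirement that is not pinned with `==`.

```python
from typing import Iterable, List


def unpinned_requirements(lines: Iterable[str]) -> List[str]:
    """Return requirement lines that are not pinned to an exact version."""
    loose = []
    for raw in lines:
        # Drop inline comments; pip treats " #" as the start of a comment.
        line = raw.split(" #", 1)[0].strip()
        # Skip blanks, whole-line comments, and option lines such as "-r base.txt".
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        # Assumption for this sketch: "pinned" means an exact "==" specifier.
        if "==" not in line:
            loose.append(line)
    return loose


sample_lines = [
    "apache-airflow==2.0.0",
    "requests>=2.0  # too loose",
    "# a whole-line comment",
    "",
    "pandas",
    "-r base.txt",
]
print(unpinned_requirements(sample_lines))
```

A check like this only helps if the file really contains the whole dependency tree; a top-level list with exact pins can still pull in unpinned transitive packages.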

Boaz: Tell us about, you know, a great win from the last year or so, something that you are proud of or had a good experience with.

Kim: I would say the backfill tools for Airflow. Airflow is great at managing a workflow, but backfill is an operation that people do day-to-day. At GitHub, we have so many people writing DAGs, writing workflows, and shipping data, and the data can change: the schema changes, or a metric or a calculation needs to be updated. There are more factors that they learn about, and they want a more accurate model, things like that. Then we do a backfill, where you run the whole workflow over a long period of time, which could be years of data. Airflow does not have a good story around that, so we had to build our own tooling to let users run backfills on Airflow easily. We are reaching out to the open-source project, and hopefully, in new versions of Airflow, I will engage more in that and help share our learnings, share our use case, and hopefully share the new backfill tools in Airflow.
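
To make the backfill idea concrete, here is a minimal sketch of one common approach: splitting a long date range into smaller windows and re-running the workflow window by window, so a failure only costs one window. It is an illustration under stated assumptions, not the tooling Kim's team built; `backfill_windows`, `run_backfill`, and the `run_window` callable are hypothetical names, and in practice `run_window` would trigger the actual workflow run for that date range.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List, Tuple

Window = Tuple[date, date]


def backfill_windows(start: date, end: date, days_per_chunk: int) -> Iterator[Window]:
    """Yield inclusive (window_start, window_end) pairs covering start..end."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    cursor = start
    while cursor <= end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(cursor + timedelta(days=days_per_chunk - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


def run_backfill(
    start: date,
    end: date,
    days_per_chunk: int,
    run_window: Callable[[date, date], None],
) -> List[Window]:
    """Run every window in order and return the windows that failed."""
    failed: List[Window] = []
    for window_start, window_end in backfill_windows(start, end, days_per_chunk):
        try:
            run_window(window_start, window_end)
        except Exception:
            # Record the failure and keep going; failed windows can be retried later.
            failed.append((window_start, window_end))
    return failed


def print_window(window_start: date, window_end: date) -> None:
    # Stand-in for re-running the workflow over this date range.
    print(f"re-running workflow for {window_start} to {window_end}")


failed_windows = run_backfill(date(2021, 1, 1), date(2021, 1, 10), 4, print_window)
```

Chunking keeps each run small enough to retry cheaply, which matters when the backfill spans years of data.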

Boaz: Awesome, super interesting, but stop bragging. Now tell us about a failure. What didn't work so well? What can you warn others about, or what was a lesson learned?

Arfon: I was going to say one thing. I will give you a success and then I will give you a quick failure, because I do want to tell you about one thing. We did mention the 4-billion-events-per-day stream that we have. One thing I think we have done that is really good is that we have managed to use this common data source. That 4-billion-per-day event stream is actually a log of every request to the Rails application, the GitHub monolith, so it is effectively the Rails log. Every request, and it is about 4 and a bit billion per day. And we have managed to figure out how to drive the majority of our core company metrics, things like monthly engaged users, the top-level business things that are tracked by the leadership team and Microsoft, using this event stream, and the majority of the individual product areas as well. So they use a common data source, and it felt really, really good to have this common way of approaching metric definitions. It felt very fragmented beforehand. It was very satisfying to be able to see them rolling up to some total numbers, so I would say that was a big success in terms of trying to take a rational approach to thinking about metrics across the business.

Boaz: Any special tips? I mean, as the experts in everything Git, is there any unique way in which you approach all the data projects from a Git perspective, something all the regular Git users can learn from?

Arfon: I have one suggestion. It is not really about Git. It is more about the process, which is that I appreciate the ADR format. I do not know if you know it; that is an Architectural Decision Record. Is that what it stands for?

Kim: Right.

Arfon: Yeah. So one of the things I do enjoy is that GitHub still uses the product extensively. So the idea is you can just go browse any team and what they are working on. There is this folder with all the ADRs in it, which is really lovely. So if you are trying to understand, like, why is the service built this way? What decision-making process did we go through? There is usually an ADR. If it was a big decision for a team, there is a thing you can go and read with the reasons and what else we considered. I like that a lot.

Boaz: Who owns the ADR? Like who in engineering?

Arfon: Usually it is the principal engineers responsible for building, you know, the decision-makers. But the fact that it is in the repository means it has been through an extensive review. You can go and look at the pull request associated with that review, you know!

Boaz: Awesome!

Eldad: Is that somehow related to the change in how, you know, we are working together in a distributed way, where time zones are different and people are moving across places? Getting to those basic disciplines suddenly becomes necessary because you can't just hop into a room and wrap it up quickly. I think, you know, for huge and small companies alike, if you want to work with a distributed workforce, with engineers especially, those ADRs are gold.

Boaz: I think we should definitely look into it ourselves. I mean, we've been scaling and starting with a remote culture, and, you know, collaborative decision-making in a remote way is not as easy as it sounds, so super! Yeah! Thanks. Okay! Now it's the advice corner. Anything inspirational? What can you give back to the community in terms of what to look out for, who to follow, things like that? And any last famous words from the data people at GitHub?

Arfon: I guess a project, as well as a set of technologies, that really inspires me is the Jupyter project. So this is Jupyter notebooks, you know, one of the things that they do. This is a project that originally came out of IPython, the scientific Python ecosystem. I just think they have built a really compelling set of technologies, cross-platform, cross-language, sort of the de facto way to work as a data scientist today. And the way that the project runs is really incredible as well. They've got a really, really enlightened approach to governance and participation and stuff. So yeah, I think Jupyter is, you know, an exemplar for how really good open-source projects can be run. And of course, you know, that toolchain is adopted by all the cloud vendors. You know, SageMaker, Azure Notebooks, Google Colaboratory are all using the same thing, right? That just seems amazing to me. And I think, yeah, that's a great project. I think they're doing great work.

Boaz: Okay. Now for a quick blitz round:

Eldad: Open source or commercial?

Kim: Open source or commercial? Depends. For me, I'm a little bit skeptical of either one until I have used it. So I want to get my hands on it before I recommend it.

Boaz: AWS or Azure?

Kim: Azure.

Arfon: Azure.

Boaz: Work from home or from the office?

Kim: Work from home.

Arfon: A bit of both.

Boaz: Yeah. Okay. That has been awesome.  Arfon and Kim, thank you so much for joining us. It has been super interesting and we will see you around.

Eldad: Thank you.

Arfon: That is cool.  Thanks for the time.

Kim: Thank you.

Boaz: Our pleasure.

