How Amplitude Engineers Process 5 Trillion Real-time Events

January 5, 2023

Weichen Wang, Senior Engineering Manager at Amplitude, came to meet the bros to talk about Amplitude's cutting-edge data stack and how it processes 5 trillion real-time events while dealing with mutable data and massive scale.

Listen on Spotify or Apple Podcasts

Benjamin: Welcome back everyone! Today, both Eldad and I are in Tel Aviv. We have the pleasure of having Weichen Wang here, who's a Senior Engineering Manager at Amplitude.

A few words about Amplitude and Weichen. Amplitude is actually at this point a public company. Before that, they raised more than $300 million in private rounds. Weichen is based in Vancouver, British Columbia. He leads the data connections team there and built and scaled their Vancouver site. So, we're super happy to have you. Do you want to give us a high-level view of what Amplitude does and what service you guys deliver to your customers?

Weichen: Sure. First of all, I just want to thank you for having me today, super thrilled! My name is Weichen. Thanks for the introduction!

Maybe I can talk about Amplitude and the team that I'm working with.

Amplitude is a digital analytics platform. We started out serving customers for product analytics. We give this self-service visibility into the entire customer journey. We have all those tools for funnel analysis, behavior graphs, event segmentation, and all of that. The key takeaway here is we try to help across the entire spectrum: data collection, instrumentation, governance, all the way to getting insights out and sharing them. So, it's kind of an all-in-one solution, if you will.

As for the team I'm working on: we are the data connections team and, as the name suggests, our mission is to send data in and out, to the right place at the right time. If you think about the entire space out there, there are all different sorts of services: data warehouses, ads, marketing, CRM, CDP. Sometimes the data has to be sent in real time, when time is of the essence. Other times, the data is sent in batches of different natures. So, that's where our team comes in, building the pipelines and also the tools and the user interface for our customers to handle that data.

Benjamin: All right, sounds awesome. We'll have a lot to talk about. How big is your team? And data connectors, can you give some examples of...?

Eldad: What this means?

Benjamin: Exactly. Like, where's the data coming from? Where's it going?

Weichen: If you talk about Amplitude as a whole, we're about 700 people altogether. And for engineering, I would say it's less than 200 at the moment, so still a fairly small team. Data connections is roughly less than 30 people altogether. So again, it's a fairly small team. You want an example of the connections; I already named a few of the categories. There are different places people want to bring data in from, and these years the data warehouse has been a really prevalent choice, where it serves as the first stop for all the data, in relation to governance and everything, and acts as the source of truth. In other cases, some clients might prefer to get data directly from instrumentation, using our SDKs and sending directly from devices or websites, where data hits us directly in a streaming fashion. So, that's already a different type of ingestion.

When it comes to exports, they are different again. For example, sometimes people use us in a CDP capacity where they do real-time event propagation: as soon as we receive the data, we send it on to their data warehouse or anywhere else. In other cases, maybe they generate insights. They might run queries on user segmentation and build an audience, say a new group of users, and they want to start a new ad campaign, so they want to sync that audience to Google Ads, for example. That's another problem. We're trying to support as much as we can; at the moment, about a hundred sources and destinations.

Benjamin: Okay. Awesome. Can you give us a feel for the data sizes or the data volume you guys are dealing with?

Weichen: Yeah. We measure the data we ingest in number of events. We process somewhere around 5 trillion events altogether, and most of that at the moment is real-time streaming. But we're also ingesting lots of batch data as well.

Eldad: 5 trillion is nice. Very nice.

Benjamin: Yeah, definitely.

Eldad: Is that on a daily basis or a monthly one?

Weichen: No, that's a monthly basis.

Benjamin: Awesome. You manage the data connections team, and this is where you have the most experience, but do you want to give us a high-level overview of the internal data stack at Amplitude and where the data goes after it has made it through your connectors?

Weichen: Sure. I can give you a high-level architectural overview, although I'll try not to disclose too much beyond what we've already talked about publicly.

Benjamin: Of course.

Weichen: The most important part is that when Amplitude started, there were a few principles. Again, these came from the early engineers and the founders; I joined at a much later time. One of the key points is that we want to make data queries super fast.

Even with real-time data coming in, how can you update existing queries in real time with as low latency as possible? So, basically, they chose a columnar storage architecture. Because if you think about the queries you make, especially for product analytics, like say DAUs, or users who did this action after that action, usually there are just one or two properties you are interested in among all the event properties or user properties in the data. So when you index things in columns, queries get processed fast. We actually have an in-house columnar storage system. It's proprietary, and at the end of the day it just stores the raw events, and then we can index them. Then we have a query system we call Nova that builds on top of it. It works in a kind of divide-and-conquer fashion, where it tries to utilize a cluster of computing resources based on those indexes so that you can get a query out super fast. That's kind of the essence of the system.
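
[Editor's note: as a rough illustration of the columnar idea Weichen describes, here is a minimal, hypothetical sketch of an in-memory column store. It is not Amplitude's proprietary system or the Nova engine, just a toy showing why a query that touches one or two properties only needs to scan those columns.]

```python
from collections import defaultdict

# Toy columnar event store: instead of storing each event as a row,
# every property gets its own list ("column"). A query that touches
# only one property scans just that column and ignores the rest.
class ColumnarEventStore:
    def __init__(self):
        self.columns = defaultdict(list)  # property name -> column of values
        self.size = 0

    def append(self, event: dict):
        # Keep every column the same length; absent properties become None.
        for name in set(self.columns) | set(event):
            col = self.columns[name]
            while len(col) < self.size:  # pad columns first seen just now
                col.append(None)
            col.append(event.get(name))
        self.size += 1

    def count_where(self, prop: str, value) -> int:
        # Reads a single column; all other properties are never touched.
        return sum(1 for v in self.columns[prop] if v == value)

store = ColumnarEventStore()
store.append({"event": "add_to_cart", "user": "u1", "platform": "ios"})
store.append({"event": "checkout", "user": "u1", "platform": "ios"})
store.append({"event": "add_to_cart", "user": "u2", "platform": "web"})
print(store.count_where("event", "add_to_cart"))  # -> 2
```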

For the streaming, of course, Kafka and various things. We also use Postgres for lots of things. We started with MySQL, but we're trying to move away from it. And everything is on AWS, so we use S3 for storing those raw indexes and events. That's probably the most important thing at a high level. But of course there's a bunch of things under the hood, like Redis and a bunch of other stuff.

Benjamin: Got you. Okay.

Eldad: Who should take credit for inventing columnar storage and databases?

Benjamin: Boom. Okay. Asking me database research questions now.

Eldad: Is it the Stanford guys or is it the CWI?

Benjamin: That's a good question.

Eldad: Is it C-Store or is it MonetDB?

Benjamin: So as someone from Central Europe, let's give it to Amsterdam.

Eldad: Boom! Amsterdam wins.

Benjamin: Amsterdam wins, easy. But I mean, that's super cool. So, basically, at the core of your stack you have this in-house query engine, in a sense. It's not talking SQL, but this domain-specific language actually powers all of the stack. That's actually super interesting. Cool.

Now, what's the hardest thing about your internal data stack? Is it low latency? Is it high concurrency? What are the biggest data challenges you guys are dealing with while trying to give your users a great experience?

Weichen: If I put it in one word, it's just scale. Everything is easy when the scale is small. But as more and more demand comes in, even the original architecture will not scale; when queries span so much that you cannot scale vertically, you just add more clusters and all that. And the other part is cost.

Real-time all sounds really good, but of course there is always a trade-off. It's an intentional design that we have, but it also puts a recurring burden on us to make everything super fast. There are use cases where it matters. I don't know if you order food, like with DoorDash. Maybe you put something in your shopping basket, but you didn't check out and you moved away. DoorDash might want to send you a coupon or a push right away, within 30 seconds, before you give up, to get you to finish that order. That is super time sensitive, so we need to respond really fast. In other cases, maybe it's monthly inventory data and stuff; there's no sense of time urgency, and then cost is the actual concern.
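
[Editor's note: to make the latency point concrete, here is a toy sketch of such a time-sensitive trigger. The event names, the 30-second window, and the send_push function are all hypothetical; this is not a real Amplitude or DoorDash API.]

```python
import time

def send_push(user_id: str, message: str):
    # Hypothetical delivery call; a real system would hit a push service.
    print(f"push to {user_id}: {message}")

pending_carts = {}          # user_id -> timestamp of last add_to_cart
ABANDON_AFTER_SECONDS = 30  # this window is useless if ingestion lags by minutes

def on_event(event: dict):
    # Called for every incoming event, as in a streaming consumer loop.
    user, name = event["user_id"], event["event_type"]
    if name == "add_to_cart":
        pending_carts[user] = event["ts"]
    elif name == "checkout":
        pending_carts.pop(user, None)  # order finished, nothing to nudge

def check_abandoned(now: float):
    for user, ts in list(pending_carts.items()):
        if now - ts >= ABANDON_AFTER_SECONDS:
            send_push(user, "Still hungry? Here's a coupon for your cart.")
            del pending_carts[user]

on_event({"user_id": "u1", "event_type": "add_to_cart", "ts": time.time() - 31})
check_abandoned(time.time())  # u1 stalled 31s ago -> the coupon fires
```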

The other example is data mutability. If you think about events ingested from instrumentation SDKs, most people think, oh, the data is immutable, which is the assumption the system was built on. But there are lots of cases where that assumption is not true. The biggest example is GDPR: data deletion requests, where we actually have to remove every trace of certain users. That is a big pain point for us. We actually have a dedicated team to handle that.
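
[Editor's note: one generic way such deletion requests are served on top of append-only storage, not necessarily Amplitude's implementation, is to rewrite each immutable segment without the affected user's rows, as in this sketch.]

```python
# Append-only stores can't delete in place: a GDPR deletion request is
# served by rewriting every immutable segment minus the user's events.
def scrub_segment(segment: list[dict], user_id: str) -> list[dict]:
    # Produce a new segment with every trace of the user removed.
    return [e for e in segment if e.get("user_id") != user_id]

segments = [
    [{"user_id": "u1", "event": "login"}, {"user_id": "u2", "event": "login"}],
    [{"user_id": "u1", "event": "purchase"}],
]

# One request forces a rewrite of every segment holding the user's data,
# which is why this hurts when the layout wasn't designed for mutability.
segments = [scrub_segment(s, "u1") for s in segments]
print(segments)  # [[{'user_id': 'u2', 'event': 'login'}], []]
```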

Eldad: Thank you for that.

Weichen: Yeah. Then we are of course rethinking our architecture.

Eldad: Also European.

Benjamin: Exactly. As a European, you take credit for the second time.

Eldad: Sure, the delete feature in the database and data warehouse.

Weichen: If you ever doubted it, it's a definite challenge if the system was not designed for it to begin with.

Eldad: I remember that period when delete was dead and everyone was happy with append-only, immutable data. There were these two years where everyone was happy and life was presumably becoming simple. And then GDPR came and everyone woke up.

Tell me, ecosystem-wise, input, output, you said Kafka and Postgres. We get that a lot. What do you see? Do you see more Kafka or more Postgres, one overtaking the other? We see a lot of comeback of Postgres. What's your take on that?

Weichen: Kafka is mostly for the streaming process. Postgres we use primarily for storing metadata; because we already have our separate columnar indexing system, Postgres is not used for that purpose. I was actually listening to some of your shows, and lots of people come from data engineering backgrounds where they're familiar with just SQL or just the relational database kind of thing. For us, we started from the streaming perspective, where the data comes in and we see it as a stream, whereas batch processing actually came at a different stage, with a separate architectural fit, because the system was not designed to handle that to begin with. Nevertheless, I think there are benefits to designing it at a later stage: it means we have learned the use cases, so we know how to optimize for them. So back to your question: Postgres is a popular database, yes, and we use it all the time. There's a reason we moved away from MySQL, I suppose. At the same time, they are just serving different purposes.

Eldad: What about output? How do you export your data? Do you see a lot of customers using your platform as a way to cleanse and prepare and then share data with others? How does it play with the existing data? Is it the Snowflakes of the world?

Weichen: That's an interesting one. There are a few parts to it. The first is just data governance, and there are again multiple parts to that. But, you know, one phrase I really like is "garbage in, garbage out." You have to really control your data ingestion. If you think about all the instrumentation tools out there, there are certain solutions that basically offer no instrumentation: that's auto-tracking, which tracks every button click, every event you get from the app. Of course, that's easy to set up, but then at some point you have to dig really deep. Imagine you're in this pile of garbage trying to find the gold in it. It may work, but it's time-consuming.

Then there's the other part, which is what Amplitude does, at least in its current stance: we do tracking plans, and then we do versioning of those. There are lots of tools we build so that you have a really well-protected data schema, with that infrastructure shared across the entire organization. All the teams benefit from it. It might sound like a lot of overhead to start with, and we try to make it easier for the smaller, low-maturity customers. But if you are one of the bigger customers out there, it's actually beneficial from the start, especially for the peace of mind of the data leaders. That's one part.

The other part is how we enrich the existing raw events with all the user properties. And then there's the part where we actually send data out; as I touched on a little bit, there's real-time propagation, which we call streaming, whereas at other times there are batch exports, or we send out the insights you generated.

So how does that reconcile with the source of truth? I think this is a great question. I guess nowadays, because Snowflake and the other data warehouses are becoming prevalent, our mission is that people don't have to think about that source of truth problem.

Eldad: People should always think, I'm telling you, I've learned that. People have been trying to build tools so people don't need to think, and those projects always end up badly. But first, your passion about your product is amazing.

Weichen: Thank you.

Eldad: We love that. Listening to you for a few minutes, you quickly learn how complex it gets when you move from theory to real life and try to put data into practice to drive your business, and this is why all of these little things are so, so important.

Tell me, how do you deal with data politics internally? Your team, would that be considered an engineering team, a data engineering team, a data team? How do they work with the rest of the engineering organization? What can you tell us about remote work and the evolution over the last few years? And then some lessons learned, maybe.

Benjamin: Tell us everything.

Weichen: You've asked multiple questions; I'm hearing two. One is the internal data politics. The second is working remotely.

I will start with the first one. Amplitude comes with a unique edge because we are a data analytics company. Drinking our own champagne is actually a motto for us. We hold ourselves accountable whenever we release something: when a new feature comes to Amplitude, we are the first ones trying it out. If you are shipping a customer-facing feature, you'd better be using A/B testing and experimentation. You'd better use feature flags and have all the other instrumentation in place so people know. Let's say you launch a new export destination for data: how many actually click on it? What's the funnel from the user clicking to actually finishing the configuration with all the credentials and actually getting data through? We have all this tracking and we hold ourselves accountable. It's never just shipping features out there. So, in that sense, everyone is a data engineer, or a growth engineer so to speak.

I think that definitely provides us with a unique edge, that's one.

Two, on remote. I am remote in a sense, because Amplitude's headquarters is in San Francisco. It's a blessing to me because I was really looking for it. I didn't even know what Amplitude was before someone reached out to me when they opened their office here. I was hired as one of the earliest managers here, without a single engineer, and a lot of the work I was doing last year was just building the team here.

Eldad: Drinking champagne and drinking a lot of ...

Weichen: Hopefully, yes.

Benjamin: Much nicer than dogfooding. I would much rather drink...

Eldad: He talks and drinks water here. But, some companies are more involved.

Weichen: It's just a different way of putting it, but you get what it is. So, working remotely: I guess for us there was lots of internal debate as well about how Amplitude wants to position itself, because when Covid started, the situation was here to stay. There is a new reality and a new paradigm, and things would never go back. Certain people would just prefer going 100% remote, and they would not even consider working hybrid. We actually had an internal vote and an internal discussion.

At this moment, Amplitude is operating hybrid remote, where we require engineers to be in the office two days a week. We still cherish that in-person time, especially for collaboration; there's a lot of quality time in brainstorming face-to-face, which cannot be replaced in my opinion. But yes, there are also times when people definitely want to focus and not get disrupted by meetings and stuff.

So, that's where we are right now. Of course, there's also another debate about whether teams should be clustered geographically. For example, maybe all the teams in Vancouver focus on data connections, and maybe teams in London work on something else. I don't think that debate is settled.

My personal opinion is that I do not think geographic location should be a parameter in that equation at all, but you know there are people who think otherwise. I do believe that having the flexibility to tap into the talent pool globally definitely provides us a unique edge, and the challenges that come with managing such a team can be managed, in my opinion.

Eldad: You see Benjamin, it can be done.

Benjamin: It can be done. In terms of your team, where are people actually located?

Weichen: At the moment, maybe I have to dive into a bit of the structure. We run the EPD triad: Engineering, Product, and Design. We have pillars and pods. In data connections, we have three pods and I run two of them: streaming and integrations. For integrations, the majority of the engineers are in Vancouver. For streaming, it's probably half and half.

Eldad: Tell us about the most embarrassing failure you've had, something that was a big lesson for you. And answering "I'm perfect, I've never made a mistake in my life" is also valuable.

Benjamin: I told you that in interviews, Eldad.

Eldad: That's what Benjamin answered in his interview, but we proved him wrong many times. That's why we went on this adventure.

But data teams. We talked to many teams, engineering teams, and when it comes to data, people make a lot of mistakes. So I was just wondering about any project that went in the wrong direction, any technology that had huge promise and turned out to be a huge flop; something smart for our listeners to learn from in terms of past experiences, the good or the bad. Usually, we love the bad ones.

Weichen: Okay. There are lots of things I'd like to share, but I think there is one example I can share. It's about third-party authentication. If you work with a SaaS company that is big enough, they will have an authentication system where you don't actually hand over the raw keys; you authorize on their behalf.

Before going there: Amplitude is in a really interesting position where lots of our relationships are mutual, meaning we are their customers and they are our customers. We also share common customers; there are hundreds of customers who are customers of both of us. In the early days, when Amplitude started, we were still a startup, and when we set up those vendor accounts we just had one, and there were also legal requirements where you can only have one account per entity. But later on, as the system expanded and we were building more and more integrations, the problem occurred that the same account was used for both purposes: for internal teams managing their own data and needs, and for managing the OAuth app used by our common customers. And this became a burden, sometimes a really heavy burden for us. Any migration would come with risk; the internal-facing and external-facing uses interfere with each other, and a migration might break the data connections for hundreds of customers. It's a burden that has shown a repeated pattern, and I wish I had known that years ago, so we would have separated responsibilities between those internal and external apps from the start. Yes, that would have come with some cost, but that's the lesson I have. So, my advice for you: try to separate your internal and external systems as much as you can.

Eldad: And a cold shower.

Benjamin: And a cold shower. So Eldad, you are a bit of the bad cop to my good cop. You asked about a failure; I'll do the opposite thing. 2022 is coming to an end. Tell us about something you're super proud of that your team accomplished this year, something that really is huge and that you thought beforehand was, I don't know, impossible.

Weichen: Interesting! Two things come to mind. First, we launched a developer portal early in the year. Basically, it's an effort to scale the number of connections we support by enabling third-party developers to build integrations towards us. Because whenever a large client comes to us saying, "Hey, can you add these 20 integrations?", there's only so much we can do; that never scales. So, we have to rely on our partners to build that bridge, and we provide this generic framework that allows them to submit integrations in a configuration-based manner. We have a system that parses those configurations and basically sets everything up. That was quite successful. We probably had almost 50 partners submitting their integrations this year. I think that's a huge success for us. That's one thing.
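
[Editor's note: in spirit, a configuration-based integration framework might look like the hypothetical sketch below, where a partner submits a declarative config and one generic engine turns it into a working export connector. The field names are illustrative, not Amplitude's actual submission schema.]

```python
# A partner submits configuration instead of code; one generic engine
# turns any such config into a connector. Field names are hypothetical.
partner_config = {
    "name": "acme_crm",
    "endpoint": "https://api.example.com/events",
    "field_mapping": {"user_id": "contact_id", "event_type": "action"},
}

def build_payload(config: dict, event: dict) -> dict:
    # Rename event fields according to the partner's declared mapping.
    return {dst: event[src] for src, dst in config["field_mapping"].items()}

def deliver(config: dict, event: dict):
    payload = build_payload(config, event)
    # A real engine would POST to config["endpoint"]; we just print here.
    print(f"-> {config['name']} @ {config['endpoint']}: {payload}")

deliver(partner_config, {"user_id": "u42", "event_type": "signup"})
# -> acme_crm @ https://api.example.com/events: {'contact_id': 'u42', 'action': 'signup'}
```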

The other is for streaming, actually a new product we launched in Q3. Real-time data propagation is in increasing demand, because at the beginning we thought of ourselves as an analytics tool: we'd get data in, but we didn't think we'd need to get data out. That's apparently not true, especially for the real-time cases. It's basically a pass-through. Customers use our governance features, back to that earlier point, but it's again a different system that comes with its own challenges around reliability, latency, and cost overall. So, a huge shout-out to all the engineers if you're listening to this. Thank you for this great achievement.

Benjamin: Wow!

Eldad: Wow!

Benjamin: Awesome!

Eldad: The first example, by the way, was amazing. As a startup, we meet a lot of partners and platforms, and we divide them into two kinds: those that have built connectivity, so third parties can build their own stuff into the system, and those that haven't; they're black boxes, and a long tail of startups cannot connect to those platforms. So it's really important. You never know how and when someone will use your system, and you learn a lot by just opening it up. So congrats on that, especially to your engineers. It's an amazing effort.

Weichen: At Amplitude, one of the benefits of being a public company is that it earns the trust of lots of the partners out there. So, they're willing to link to us. But of course, there are times when the bigger players never think about us, so we still have to do our job. But it definitely makes our life much, much easier.

Benjamin: For us as a database vendor, this is similar in terms of the SQL dialect and wanting to be Postgres-compliant, to make it easy for the ecosystem to adopt us. It's interesting that you guys face similar challenges or similar problems in that regard.

In terms of the second thing, you guys provide these very specific data experiences, to analyze certain user interactions and those types of things. How does this work in terms of schema management? Do you basically say, okay, data has to come to our connectors in a certain schema, and then we can offer these use cases or these types of analytics products? Or do you actually manage the schema in your connectors, in your system, and accept data in whatever form or shape?

Weichen: You'll see the pattern in all my answers: "it really depends." As Amplitude evolves, we get more and more customers from different parts of the spectrum: the really mature customers who have their own stuff, the low-maturity customers who don't have anything, which is actually easier, or the ones somewhere in the middle who have something. So, we try to be flexible. If they want to use our governance features, they can, and we can block events. Again, to us everything is an event. If an event doesn't fall into the tracking plan, we just block it; that's the easier way. And we can use transformations and filters, so certain data only goes to certain destinations, and different solutions like that. But, again, we try to accommodate as much as we can. That's why it evolved a lot over time with all the different use cases coming in. Hopefully, that answers your question.
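
[Editor's note: the blocking and routing behavior described here could be sketched as below. The plan contents and destination filters are hypothetical, purely to illustrate "if an event doesn't fall into the tracking plan, we just block it."]

```python
# A tracking plan acts as an allowlist of event types; events outside it
# are blocked, and the rest can be filtered per destination.
TRACKING_PLAN = {"signup", "add_to_cart", "checkout"}

DESTINATION_FILTERS = {
    "warehouse": lambda e: True,                   # warehouse gets everything
    "ads": lambda e: e["event_type"] == "signup",  # ads only sees signups
}

def route(event: dict) -> list[str]:
    if event["event_type"] not in TRACKING_PLAN:
        return []  # blocked: not in the tracking plan
    return [dest for dest, keep in DESTINATION_FILTERS.items() if keep(event)]

print(route({"event_type": "signup"}))      # ['warehouse', 'ads']
print(route({"event_type": "rage_click"}))  # [] -> blocked
```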

Eldad: It's all about surface area.

Benjamin: Yeah.

Eldad: Perfect. Okay.

Benjamin: Awesome.

Eldad: I think we've nailed it.

Benjamin: Definitely.

Weichen: All right.

Benjamin: All right, Weichen. Thank you so much for your time. We super appreciate it. This was super interesting, learning more about what you and your team are doing at Amplitude.

Eldad: We are hopefully going to check out the connector framework next year.

Benjamin: Definitely.

Weichen: All right.

Eldad: Come work for us.

Weichen: Thank you for having me.

Eldad: Thank you for coming.

Benjamin: Awesome. Take care and have a great day!

Weichen: Take care.

Benjamin: Bye.

Eldad: Bye.
