How Statsig Engineers Do Observability Right
November 29, 2022

Making Observability a Key Business Driver

80% of the code that you write doesn’t work on the first try. And that’s fine. But knowing which 80% is not working and which 20% is working is the actual challenge. After 10 years at Facebook, managing and scaling the Seattle site to over 6000 engineers(!) Vijaye Raji founded Statsig to make observability automated and real-time. How is the semantic layer managed? How was the Statsig team able to build an observability product that handles real-time ever-changing metadata? What are Vijaye’s main takeaways from engineering at Facebook? Tune in.

Listen on Spotify or Apple Podcasts

Eldad: Just before we start, I don't know if you've noticed, I've adopted a new brother, his name is Benjamin, as you can see here. I tried to get the hair close to the family's DNA, but it didn't work very well. 

I'm joining Benjamin. Benjamin will run Season 2. He is also our query processing tech lead based in Munich, building and scaling the office there. I met Benjamin two or three years ago now. How long ago was it, Benji?

Benjamin: Probably two.

Eldad: Two years ago, we got in touch. He was an intern, thinking about his PhD at Munich University, and one thing led to another. And here we are today with you, talking about data. So I'm really looking forward to this, and with that, Benji, it's all yours.

Benjamin: Awesome! So, you're introducing me and I'll just go on, and introduce our guest, Vijaye.

Thank you so much for joining us today. Vijaye is the CEO and founder of Statsig, and he founded Statsig in February 2021, if I'm not mistaken. They're based in Bellevue, Washington, and their original office was in Kirkland, where Firebolt also has an engineering office. Statsig is building a product observability platform, and we'll talk more about it later. Basically, it makes it super easy to manage feature flags and experimentation, and yeah, we're excited to have you. They raised a Series A in August 2021, and now, in April of this year, a 42 million Series B led by Sequoia and Madrona. So, that's awesome.

Vijaye didn't start out as a startup founder; he actually started out in Big Tech. He spent his first 10 years at Microsoft doing a variety of different things, and then joined Facebook in 2011, where he stayed for another decade. There he led Facebook video and gaming, and was also the site lead for Facebook Seattle. He scaled the site there from just a handful to more than 5,000 engineers, which is pretty impressive.

Eldad: So, one out of those 5,000 used to work for you. He works for Firebolt now, and we got great, great feedback. So, we are happy and excited.

Vijaye: I love it. I love it.

Benjamin: Awesome. Perfect! So Vijaye, do you want to give a two minute high-level overview about yourself, about what Statsig is doing, before we dive into some of the nitty-gritty details.

Vijaye: Absolutely! Hey guys! Thanks for having me. I'm honoured to be here on the first episode of Season 2. Excited to be here and chatting with you both. So, Benjamin, you covered the history a little bit. I started Statsig about 20 months ago, and this is after spending 10 years at Facebook. So, I joined Facebook in 2011 when Facebook was still considered a small startup. And then, obviously the company grew, both globally as well as in Seattle. So I also spent my time growing the Seattle offices from, as you said, a handful of folks to, I think, about 6,500 people when I left.

One of the things that struck me as I moved from Microsoft to Facebook, I was always an engineer, so I grew up and learned programming and wanted to be an engineer and I spent 10 years at Microsoft as an engineer. And, during these 10 years, it's formative when you see the software processes evolve.

And it went from the shrink-wrapped box software that sits on Best Buy shelves to shipping every single day and then eventually continuously shipping code. Those are the things that I learned first hand at Facebook. And I was fascinated by that whole concept of how you decouple feature launches from code launches, and the tools that were associated with it, the tools that kept this chaos from getting out of control.

And also, I started understanding how you measure the impact of every single feature you ship, and then throw away code that doesn't work, because just because you thought it would work doesn't mean it will. Measuring every single change actually gives you a lot of insights, a lot of humility, I think, and also a lot of cultural changes. And so, those definitely resonated very strongly, to the point where I said, okay, this is important, we should go build a company around these sets of tools. And that's why in February 2021, I left Facebook and started Statsig. I brought with me a few folks who joined and built the tools, and now the company is going strong. We have 52 people as of yesterday. So, pretty exciting growth.

Benjamin: Awesome! Yeah, that's super cool! So, I mean...

Eldad: How was it switching from this huge, huge engineering organization to a startup that just now got to 52 people? How does it feel?

Vijaye: It's crazy. Going into a startup, I thought I knew some things, but then over time I realized I had absolutely no idea what I was doing. It's really jumping off a cliff without a parachute, but with just instructions to assemble a parachute, and then you have to figure it all out before you land. And that's kind of how it feels.

Managing an org of thousands of folks is a very different skill than building a startup from scratch, hiring the first person, the first engineer, the first product manager, the first designer, and then getting a product out. When you're building in the early days, you're building in a vacuum. You have this vision, you believe in this thing, and then you go build and then you put it out. You're really vulnerable, right? So you build out your product and you put it out for somebody to come and take a look at it, and you hope somebody will like it, will use it, and will give some feedback. And then for six months you sit there and nobody is touching your product, and you're wondering, oh, what have I done, I've made the worst decision of my life. But then slowly, the traction happens, slowly people start to notice, and then they are like, oh, this is actually a pretty useful product. And then you go from this one stage to another stage where people are now even trying out your product for free. And then eventually they want to actually use it, and then eventually they want to actually pay for the software, and then eventually they love the product and they actually want to evangelize the product. So you go through these stages, and at each of these stages you're constantly questioning yourself and questioning your decisions. It is a fascinating journey that all startup founders go through. And, like you said, I spent all my life in large companies where you don't have to worry about this kind of stuff, and then when you're thrown into a startup, every little thing matters.

It's been a fun journey, very much a learning journey, very much humbling, just to look around and see how many people have gone through the same journey before me and been successful.

Eldad: Nice! And you just started, there's a long journey ahead. Yeah, it is.

Vijaye: I know, definitely.  And it's only been 20 months.

Eldad: You picked great timing as well, because I think the timing of raising your round B was perfect. Perfect, perfect, perfect! Those are crazy times for building startups. It is always crazy building startups, but these times are especially crazy.

Benjamin is young, so everything is new, but we've been there through previous recessions and it's always surprising how different each time it is.

Vijaye: Yeah.

Eldad: But I wouldn't replace it for anything. And, really, we look forward to hearing more about what you're building.

Vijaye: Yeah. Thanks.

Benjamin: Awesome. So let's dive right into that. So maybe for our listeners who weren't exposed to, I don't know, A/B testing, experimentation, all those things yet, do you want to give the two minute pitch of why this matters and what it actually is?

Vijaye: Yeah. One of the most important things is that when you're building a product, you have a hypothesis: you think that something you're building is going to be beneficial for your users or your customers or for your business. And that hypothesis is baked into every single feature that you're building. The idea that we build those things out and then ship them to people without really instrumenting or measuring whether that particular belief is actually true or not is no longer valid. It's no longer okay.

And, at Facebook, I remember this, where 80% of the code that you write, or the products or the features that you build, don't work on the first try, which means that you have to go back to the drawing board and iterate on the product until it works, or sometimes throw away your idea. And it's okay, because you can't be right all the time.

But knowing which 80% is not working, which 20% is working is the hard part. So, just understanding and measuring the impact of the changes that you're making to your product, to your users, to your customers is extremely important.

The stage number one is always about measuring, measuring things, measuring things that you build and ship to people.

Stage number two is taking the measurement back and then actually addressing some of these causal inferences. Why did the metrics do what they're doing? Can we attribute it to the changes that we made? And then make the decision, Okay, well what do we need to do in order to... whether double down on it because it's doing really well or wind it down because it's not doing well.

Those are the kinds of tools that people need access to in order to be able to make the right decisions on the product on a daily basis. The state of the art in the industry today for experimentation is still A/B testing. And A/B testing is the gold standard for causal inference. However, it's very, very time consuming, and there's a lot of manual overhead, because you have to first come up with a hypothesis, then you have to build the variants and ship out the variants, and then you have to allocate the samples, isolate the experiment, and run it for however long it takes to reach statistical significance, maybe two weeks, maybe three weeks. And then you have your data science team go back and look at and analyze all of these results.
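
To make the "statistical significance" step above concrete, here is a minimal sketch of how a difference between a control group and a test group might be checked with a 95% confidence interval. It assumes a simple normal approximation; the function names and the sample numbers are hypothetical and are not taken from Statsig's stats engine.

```python
import math
from statistics import mean, variance


def diff_confidence_interval(control, test, z=1.96):
    """Return (delta, low, high) for mean(test) - mean(control).

    Uses a normal approximation with unpooled sample variances,
    which is only reasonable for fairly large samples.
    """
    delta = mean(test) - mean(control)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(variance(control) / len(control) + variance(test) / len(test))
    return delta, delta - z * se, delta + z * se


def is_significant(low, high):
    """The result is significant when the interval excludes zero."""
    return low > 0 or high < 0


# Hypothetical minutes of video watched per user in each group.
control = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11] * 20
test = [11, 13, 12, 10, 12, 12, 13, 11, 10, 12] * 20

delta, low, high = diff_confidence_interval(control, test)
print(f"delta={delta:.2f}, 95% CI=({low:.2f}, {high:.2f}), "
      f"significant={is_significant(low, high)}")
```

The longer an experiment runs, the more samples it collects and the narrower the interval becomes, which is why a test may need weeks before the interval stops straddling zero.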

This whole process takes a long time. Most people don't run that many experiments. They pick some of the features that are really, really important, and they run those through the A/B testing and experimentation pipeline, and the rest of the features just go out untested.

So, this is where I think companies like Facebook, Uber, Airbnb, and even LinkedIn have sophisticated tools that basically take every single code change you're making, every feature change you're making, and automatically run these A/B tests. And then they give you back the results, the impact of those changes. That way you don't have to do the previously mentioned long process of A/B testing. You get these results on a daily basis and then you make decisions, which will be much better decisions than before. And that's what we are trying to build with Statsig; it's where we're going with this product observability.

Benjamin: Gotcha. I mean, you have a background in gaming at Facebook, for example. Do you have one or two gripping examples of things where this mattered or where these types of experiments led to interesting insights?

Vijaye: A lot of them. Like I say, one of the key metrics that I share is that 80% of the code, or 80% of the features that we think are going to be beneficial, are not. And so just going back to the drawing board is a regular thing, and it's okay. And that way people don't get attached to the code they write.

So, I'll give you one anecdote. In video specifically, we have run plenty of experiments, and the general belief is that the higher the quality of video that you're able to provide, the more consumption will happen, because obviously people like higher quality videos. So, at Facebook, we built this automatic bitrate detection code, which basically constantly analyzes your network stability, last-mile bandwidth and all of that stuff, to be able to say, okay, we're going to give you the best possible bitrate for the connection that you have. So far, that's been experimented and proven, and we continued to increase and increase and increase. Encoding has gone up from 480p to 720p to 1080p and then eventually 4K, and the belief is, yes, that is always good.

And then, in 2020, the pandemic hit, and when most people stayed home, media consumption grew pretty heavily. Over the course of almost a month, it doubled and tripled. And some of the backbones, the carriers and the ISPs, basically couldn't handle the load, and so some of these folks reached out to Facebook and said, hey, could you help? Because, you know, a lot of the traffic that is going out is going to Facebook, and a lot of it is for video, could you throttle this for us? And we were like, oh my goodness. If we start throttling, that's actually going to reduce the consumption. That's actually going to leave people not feeling great about the quality and the experience. But we also understood that these were not normal times. It's crazy, crazy times that we're dealing with. So, we went ahead and started tweaking our ABR algorithm to actually give a couple of clicks lower bitrate than what was actually possible to give. What we found...

Eldad: I always knew this 4K wasn't real 4K when I saw it. Something felt fishy there.

Benjamin: Yeah.

Vijaye: No, we actually showed that we are dropping and then if people wanted to pick 4k, they could go ahead and pick the higher bandwidth. It's just the automatic default that we pick.

When we change the default, normally, obviously, we experiment with everything. This is one of those things: when we changed the defaults, we were actually looking at the metrics that were coming in. We were shocked because the usage actually went up. People were using video a lot more and consuming video a lot more with the lower bitrate. Obviously, this was surprising, this was new. It's like, okay, well, we would've only known because we ran these experiments for every single change we make. Then, we went back and we looked and we did some case studies and we talked to folks, and then later on, we understood that there's a segment of the population that is very bandwidth sensitive. Whether they're throttled or whether they have limited bandwidth and so on, generally, they get constrained by how much they can use per day.

Now all of a sudden, a whole batch of people were able to consume more video than they did before. Then, there's a whole segment of the population that were on old, outdated phones where 4K or 1080p doesn't actually make a difference. And so, trying to push more and more bitrate was actually not necessary. So there's a set of learnings that we had that was very interesting. We wouldn't have had those learnings if not for this experimentation, constant experimentation. Anyway, it was pretty eye opening for me as I was running the video org.

Benjamin: That's pretty interesting, because when I looked at this first, I was like, oh, maybe this is about, I don't know, the color of your 'pay now' button on the shopping cart, but these are really deep algorithmic, core parts of your infrastructure to which you then apply those experiments. So that's cool to see.

Vijaye: Yeah, that's absolutely right. I mean, a lot of people, when they think about A/B testing, naturally gravitate towards, oh, well, let's change the text of the button or the color of the button, or change the layout of where things go in order to get more conversion. I think those are also valid A/B tests, but they're more marketing centric and more shallow, because the metrics that you're trying to move are directly correlated to the product changes. Whereas when you say, okay, well, I'm going to change the parsing library that's in the backend, then I want to see if that increases any latency on a large segment of the population, especially at scale.

Those are the kinds of things that you actually care about when you're modifying a product, and you want to verify that it's actually doing what you expect it to do. And then secondarily, you want to verify that there are no extreme side effects. Sometimes you don't know what's going to happen when you actually make these changes. So monitoring those things is important.

Benjamin: Right. So let's talk a bit about what actually powers this, infrastructure-wise. So, the actual experimentation infrastructure.

The first question I would have is, say I want to run this experiment now, right? I'm an IC at a company, I have this type of experimentation software, and now I'm saying, okay, my video encoding algorithm, I want to change that and see if XYZ increases. Do I have to be the one formulating the hypothesis? Do I have to be the one saying, I changed things and I think XYZ might increase? Or is it that there are some core metrics that someone else from the product team defined, and then each experiment is actually tested against these same core metrics? How does it work?

Vijaye: Generally, what you want to be keeping track of is this: there are some business critical metrics. You don't want to drop those business critical metrics. You want to make sure that they're healthy across all of the product changes that you are making. And then there are the primary metrics that you expect to change, expect to move. Okay, well, if I'm changing my video encoding algorithm, I expect more video watches to happen or more time spent on videos, and then perhaps even less lag or fewer interruptions in the video because of bandwidth issues. So those are the kinds of things that you would normally be watching as a direct result of the change you're making.

I think those are two separate sets of metrics that you should be watching. And one of them is probably picked by your company level product or growth leader. You're kind of like, okay, well I don't want to ever drop my DAU or engagement or retention, or obviously, revenue and things like that.

And then the other one is generally determined by the product team that is making product changes very close to those metrics. I think both are extremely important, and generally, more sophisticated tools like Statsig let you set this up. You can basically specify that if this particular metric, say, my DAU metric, drops by more than 3%, then I want an alert to fire and people to be notified. And so any time anybody is changing any product, if that drops the metric by 3%, you want to be aware of that and you want to be able to get right to the problem of what's happening.
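
A minimal sketch of the kind of guardrail rule described above, assuming the alert compares a metric for the exposed group against a baseline and fires when the relative drop exceeds a threshold. The function name, the 3% default and the DAU numbers are illustrative only.

```python
def guardrail_breached(baseline, current, max_drop_pct=3.0):
    """Return True when `current` is more than `max_drop_pct` percent below `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    drop_pct = (baseline - current) / baseline * 100
    return drop_pct > max_drop_pct


# Hypothetical daily active user counts: baseline versus the group seeing the change.
for current_dau in (98_000, 96_000):
    if guardrail_breached(baseline=100_000, current=current_dau):
        print(f"ALERT: DAU fell to {current_dau}, notify the owning team")
    else:
        print(f"OK: DAU at {current_dau} is within the guardrail")
```

In practice a check like this would be applied to the experiment's measured effect and its confidence interval rather than to raw counts, so that ordinary day-to-day noise does not page anyone.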

Obviously you can also have these trade off conversations whether it's worth it or not. But that's up to you.

Benjamin: What actually powers these things in the background? There must be some pretty sophisticated statistics or something going on. And one thing I just thought about: what about correlation between experiments, for example? I, on my team, am changing the video encoding, another team is changing how much video gets buffered, and in the end, we are dropping this into production at the same time. Somehow you have to figure out which change actually changes the core metrics you're interested in.

Vijaye: Yeah, this is a very good question. So obviously there are a lot of teams that are changing a lot of things. One of the things, obviously, is that you would not launch every product change you're making to 100% of users. You're splitting the audience into smaller portions and rolling it out.

Typically, we use this exponential rollout model where you take a new feature, you roll it out to 1% of people and then see how it affects the metrics, and then 2%, 5%, 10%, and then 20%, 50% and 100%. So that's the sequence that you normally follow, making sure that at every stage, you're not dropping any critical metric more than necessary.
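
The staged sequence above can be sketched as a small helper that only advances when the critical metrics are healthy. The stage percentages come from the conversation; the function name and the boolean health signal are assumptions for illustration.

```python
# Rollout percentages as listed in the conversation.
ROLLOUT_STAGES = [1, 2, 5, 10, 20, 50, 100]


def next_stage(current_pct, metrics_healthy):
    """Return the next rollout percentage, or the current one if the rollout should pause."""
    if not metrics_healthy:
        # Pause: do not expand exposure while a critical metric is down.
        return current_pct
    later = [stage for stage in ROLLOUT_STAGES if stage > current_pct]
    return later[0] if later else 100


pct = 0
while pct < 100:
    pct = next_stage(pct, metrics_healthy=True)
    print(f"rolled out to {pct}% of users")
```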

When multiple people are doing those, the salt for the randomization of each of these rollouts is different. And so different sets of the population will get these different sets of features. There will be some intersection, and the rest of them are getting their unique experiences, and that's how our system determines how to attribute a metric change back to the actual feature change. And that's why we also have error bars, our confidence intervals. Once the error bars show statistical significance, we turn them either green or red, depending on whether the change is positive or negative. And so that's how our stats engine determines the causal inference back to the features.
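
A minimal sketch of per-rollout salted randomization, assuming users are assigned by hashing the salt together with the user ID. The hash choice (SHA-256), the bucket count and the salt strings are assumptions; the transcript does not say how Statsig implements this.

```python
import hashlib

BUCKETS = 10_000


def bucket(user_id, salt):
    """Deterministically map a user to a bucket in [0, BUCKETS) for a given salt."""
    digest = hashlib.sha256(f"{salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(user_id, salt, rollout_pct):
    """True when the user falls inside the first `rollout_pct` percent of buckets."""
    # Integer arithmetic avoids floating-point edge cases at the boundary.
    return bucket(user_id, salt) * 100 < rollout_pct * BUCKETS


users = [f"user-{i}" for i in range(10_000)]
# Two independent rollouts, each at 10%, with different (hypothetical) salts.
encoding = {u for u in users if in_rollout(u, "video_encoding_salt", 10)}
buffering = {u for u in users if in_rollout(u, "video_buffering_salt", 10)}
print(len(encoding), len(buffering), len(encoding & buffering))
```

Because the salts differ, the two 10% groups are chosen independently, so only a small share of users lands in both. Keeping the same salt while raising the percentage keeps earlier users in the rollout, since their bucket number does not change.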

Now, obviously, there are lots of tools that you could also employ to validate some of these. So, if you have multiple features that are highly interactive in nature, say I'm changing things in the product that are very closely tied to each other, then what you do is you have isolated experiments. So you create a layer, which is also called a universe. And then you actually say, I'm going to allocate 25% of this universe or layer to this particular experiment and another 25% to this other experiment. And so you go like that, and that way no single user will be getting two different experiences. They'll actually be isolated and getting one particular experiment. That way you can also make the confidence interval better, just narrowing the confidence interval, and at the same time also attribute it really clearly: okay, well, this experiment is doing these things to my metric. Those are the kinds of tools that are available.
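
One way to picture a layer (or universe) is as a single hashed bucket space that experiments carve into disjoint slices. The sketch below assumes 100 buckets per layer and two experiments holding 25% each, as in the example above; the names and the hashing scheme are hypothetical.

```python
import hashlib


def layer_bucket(user_id, layer_salt):
    """Map a user to one of 100 buckets within a layer."""
    digest = hashlib.sha256(f"{layer_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


# Each experiment owns a disjoint slice of the layer's buckets.
LAYER_ALLOCATION = {
    "experiment_a": range(0, 25),   # 25% of the layer
    "experiment_b": range(25, 50),  # another 25%
}


def assign_experiment(user_id, layer_salt="video_layer"):
    """Return the single experiment this user belongs to, or None if unallocated."""
    b = layer_bucket(user_id, layer_salt)
    for name, owned in LAYER_ALLOCATION.items():
        if b in owned:
            return name
    return None


assignments = [assign_experiment(f"user-{i}") for i in range(1_000)]
print({name: assignments.count(name) for name in ("experiment_a", "experiment_b", None)})
```

Because every user has exactly one bucket per layer and the slices never overlap, no user can see both experiments at once.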

Now, there's also something that you mentioned, which is, okay, what if I launch these two things at the same time? What happens with the interaction effects, the cumulative impact of multiple sets of features? So, we have something called holdouts. Long-term holdouts are able to actually measure the cumulative impact of multiple sets of features that you're launching over a course of time. We use this extensively. So basically what that means is that at the beginning of a quarter or the beginning of a half, you specify 1% or 2% of the population that are in the holdout group, and then you launch a whole set of features, and then you go back and analyze, comparing all of the metrics against these 1% or 2% of the people that have been held out from these features. That gives you the cumulative impact.
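
A minimal sketch of a holdout comparison, assuming a fixed 2% of users is selected by a salted hash at the start of the period and a per-user metric is compared at the end. The salt, the function names and the synthetic metric values are hypothetical.

```python
import hashlib
from statistics import mean


def in_holdout(user_id, salt="h1_holdout", holdout_pct=2):
    """True for the small, fixed share of users who receive none of the new features."""
    digest = hashlib.sha256(f"{salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < holdout_pct


def cumulative_impact(metric_by_user, salt="h1_holdout", holdout_pct=2):
    """Mean metric of exposed users minus mean metric of held-out users."""
    held = [v for u, v in metric_by_user.items() if in_holdout(u, salt, holdout_pct)]
    exposed = [v for u, v in metric_by_user.items() if not in_holdout(u, salt, holdout_pct)]
    if not held or not exposed:
        raise ValueError("both groups need at least one user")
    return mean(exposed) - mean(held)


# Synthetic data: pretend everything shipped this half lifted the metric from 10 to 12.
users = [f"user-{i}" for i in range(5_000)]
metric = {u: (10 if in_holdout(u) else 12) for u in users}
print(cumulative_impact(metric))
```

A real analysis would also put a confidence interval around this difference, since a 1% or 2% holdout is a small sample.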

So you have all of these tools at your disposal to get all of that information.

Benjamin: All right, super interesting! Sorry, Go ahead, Eldad.

Eldad: How do you manage the semantic layer of all of that? Because it gets so complicated. I wouldn't consider myself an engineer anymore, unfortunately, but with observability, it always seemed to me that it's easy to collect the events, but it's impossible to connect those events to a semantic model where you can actually drive the business based on that. And with so many cloud native companies driving their business through their data, through their product, it seems those platforms are really not just about observability anymore, right? It's really about driving your business. Is there a standardized way to do it? In the industry, is there a shift towards something semantic, like a dbt for observability, or something that will make it easier for engineers to communicate internally?

Vijaye: Yeah, that's basically what we're trying to build, which is, you know, the sophistication is actually in the stats engine, and that should not make your product building process any more complicated than it already is. And so the idea behind everything that we're building is that engineers should still be building features, and when you decorate a feature with a feature flag, we then take care of the attribution and the analysis, and also try to do it as close to real time as possible, because you want to get to diagnostics as quickly as possible. You do not want to put out something that is a broken experience for your users for longer than is necessary, so you catch that as quickly as possible and then fix those things.

While it seems like, okay, there's lots of stuff that is happening, all of that complexity is encapsulated by the stats engine. But there is also a layer of, okay, well, if you make it complicated for people to understand what's really happening, then you've lost. You have not actually achieved anything.

So how do you take all of that data and simplify it? Some of the visualization innovation that we're doing is about how you make engineers, product managers, even designers, able to understand the insights easily. However, when a data scientist wants to come in and dig into the code, you want to have this progressive disclosure of complexity; you want to actually have the ability to dive into the data if somebody is inclined to do so. So that's the challenge, right? The tooling should make it extremely easy for everyone, but at the same time not block you from getting into the details. That's precisely what we're building.

Product observability is, if you think about it, an extension of that. I always say, tools in the data observability space have become so sophisticated, and then if you look at ops observability, you have tools like Datadog, where you'll know in real time when one server is misbehaving among a forest of a thousand servers or 10,000 servers, and that's pretty amazing. So systems have gotten so much more sophisticated than what I remember from 10-20 years ago.

And then when you talk to product folks, you're still in the, oh, yeah, we're going to launch V2 three weeks from now, and then when V2 launches we're going to wait three weeks for product adoption to happen, and then we're going to have analysts or data scientists go and look at the analytics. And then have some way of correlating: well, are the metrics going up after the feature launch or the product launch, or are they going down? Why are they doing those things? You see how slow and...

Eldad: And it's done manually outside of the system, so it takes your alpha feature eight months to go into beta while trying to make observability work.

Vijaye: Yeah.

Eldad: I think it's amazing. I think the next step is really getting observability to be impactful.

Vijaye: Yeah.

Eldad: It's always been a black box for engineers to do their stuff, and seeing observability move away from just helping engineers do their day job to really driving your business, I think it's a huge step forward. And also, if you think of engineering, the more data they deal with, and the more impactful that data is, the less optional those kinds of tools become. It's really changing how you build products, and we've seen that firsthand at Firebolt, by the way, how it affects things. This is not a hack. It starts with, okay, let's put in a feature flag, but it quickly turns into a way to drive your engineering culture, as you mentioned at the beginning. Amazing stuff!

Now tell us how complicated these are... like building observability is the hardest part, right? I mean, it's data driven, it's real time, metadata is actually changing unlike your product, which makes it even more challenging. Can you share a bit with us, how does it work? How do you do it?

Vijaye: Obviously, the credit goes to the data science and the infrastructure engineers that we have at Statsig. These guys are pretty amazing. I'll give you the overview of how we ingest the data and the infrastructure that we rely on, and then the details of that are actually beyond me.

So, let's talk about… We have a set of SDKs that help you get started relatively quickly. Now, obviously building these SDKs is also a pretty interesting problem because as an engineer you're excited, okay, well there's all these technologies. It's like iOS and Android and React, and React Native. And so you got to build SDKs in every single technology because people have all of those out there. And this matrix continues to grow, especially if you have a new feature, you have to test it across. We have 25 SDKs, I think. So, 25 SDKs...

Eldad: The ecosystem team is going crazy.

Vijaye: Oh, Yeah! We have this on this whiteboard, I'm not kidding. There's this huge matrix where people draw, Okay, well have I implemented this feature in this SDK? And then, the matrix continues to grow. So, that's fascinating. That is step number one.

And then once the SDKs are integrated, they're obviously collecting event data and then sending it back to our servers. On the server, we have a queuing system called Event Hub. So basically the Event Hub is the first stage that actually acquires all of these events and queues them, so we don't lose the sequence. And then we actually have a redundant system as well. So once the Event Hub is receiving all of these things, it actually throws them into a file system, a big, large file system, so that if anything happens to the Event Hub, we can go back to the data that's sitting on the file system and replay it when necessary.

Beyond that it goes all of this...

Benjamin: Vijaye, if I can interrupt, can we take a step back maybe? What is an event actually in this world?

Vijaye: Yeah. No, absolutely. So, an event could be as granular as somebody purchasing a product in an e-commerce setup, or someone adding to cart, again in an e-commerce setup. In a gaming setup, somebody reached a new level or unlocked an achievement, and in a FinTech scenario, someone used a credit card. Those are all events, and those are the ones that we're able to gather and turn into metrics, really. Imagine if you have a purchase event and that purchase event happens to have a value, which is actually the dollar amount, then you can aggregate a sum of all that stuff, and then you can see revenue as a metric. And now you can verify, okay, if I'm making all of these product changes, is it changing my revenue up or down? And that's how we correlate it.
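
The step from events to a metric can be sketched in a few lines: sum the value carried by purchase events and you have revenue. The event field names (`user_id`, `event`, `value`) and the sample events are assumptions for illustration, not Statsig's actual schema.

```python
from collections import defaultdict

# Hypothetical raw events as an SDK might log them.
events = [
    {"user_id": "u1", "event": "purchase", "value": 20.0},
    {"user_id": "u2", "event": "add_to_cart", "value": None},
    {"user_id": "u1", "event": "purchase", "value": 5.5},
    {"user_id": "u3", "event": "purchase", "value": 10.0},
]


def revenue_per_user(events):
    """Sum the dollar value of purchase events for each user."""
    totals = defaultdict(float)
    for e in events:
        if e["event"] == "purchase" and e["value"] is not None:
            totals[e["user_id"]] += e["value"]
    return dict(totals)


per_user = revenue_per_user(events)
total_revenue = sum(per_user.values())
print(per_user, total_revenue)
```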

Benjamin: These events are then basically annotated with the specific feature flag configuration of that customer or user at that time?

Vijaye: No. The events only have user ID information, and then on the other side, when we actually deliver the features, you ask: for this user ID, which features should we show?

Benjamin: Ah, got you.

Vijaye: So that's the correlation. The join is actually...

Benjamin: Join then in the background. Okay.

Vijaye: Yeah. That's why...

Benjamin: I love joins as someone building database systems.

Vijaye: Oh, I love it too because that actually..

Eldad:  There's always a join Benjamin.

Benjamin: Always a join. Wow!

Eldad: Always.

Vijaye: Everywhere. Join is actually powerful because it simplifies the two responsibilities. So if I'm a feature builder, if I'm an engineer, I only have to care about, okay, which feature flag am I using? I don't have to worry about all the metrics and annotating the metrics and we do that for you. And so it's part of actually simplifying how we use the system.
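
A minimal sketch of the join described in this exchange, assuming one log of exposures (which variant each user ID was served) and a separate log of events that carry only the user ID. The record layouts and the flag name are hypothetical.

```python
# Logged when a flag is checked: which variant this user ID was served.
exposures = [
    {"user_id": "u1", "flag": "new_encoder", "variant": "test"},
    {"user_id": "u2", "flag": "new_encoder", "variant": "control"},
    {"user_id": "u3", "flag": "new_encoder", "variant": "test"},
]

# Logged by product code: no flag information, only the user ID.
events = [
    {"user_id": "u1", "event": "purchase", "value": 20.0},
    {"user_id": "u2", "event": "purchase", "value": 15.0},
    {"user_id": "u3", "event": "purchase", "value": 10.0},
    {"user_id": "u4", "event": "purchase", "value": 99.0},
]


def revenue_by_variant(exposures, events, flag):
    """Join events to exposures on user_id and total purchase value per variant."""
    variant_of = {x["user_id"]: x["variant"] for x in exposures if x["flag"] == flag}
    totals = {}
    for e in events:
        variant = variant_of.get(e["user_id"])
        if variant is None or e["event"] != "purchase":
            continue  # user never saw this flag, so the event is not attributed to it
        totals[variant] = totals.get(variant, 0.0) + e["value"]
    return totals


print(revenue_by_variant(exposures, events, "new_encoder"))
```

The engineer who adds the flag and the engineer who logs the purchase never need to coordinate; the user ID is the only thing the two logs share.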

So, once the events are in our backend, then we throw them into a data lake. And so the data lake is where we do deduping, because sometimes, especially in React setups with constant re-rendering, you have multiple events generated at the same time, and so you want to de-dupe, do some sanitization of the data and stuff like that, and so we do all of that on the data lake side of things.
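
The transcript does not say how the deduplication works, so the following is only one plausible rule: treat a repeat of the same user and event name within a short time window as a duplicate. The window length and the field names are assumptions.

```python
def dedupe(events, window_ms=1000):
    """Drop events repeating the same (user_id, event) within `window_ms` of the last kept one."""
    last_kept = {}
    kept = []
    for e in sorted(events, key=lambda e: e["ts_ms"]):
        key = (e["user_id"], e["event"])
        prev = last_kept.get(key)
        if prev is not None and e["ts_ms"] - prev < window_ms:
            continue  # same user and event fired again almost immediately
        last_kept[key] = e["ts_ms"]
        kept.append(e)
    return kept


# Hypothetical burst of events from a component that re-rendered several times.
raw = [
    {"user_id": "u1", "event": "page_view", "ts_ms": 0},
    {"user_id": "u1", "event": "page_view", "ts_ms": 50},
    {"user_id": "u2", "event": "page_view", "ts_ms": 60},
    {"user_id": "u1", "event": "page_view", "ts_ms": 120},
    {"user_id": "u1", "event": "page_view", "ts_ms": 5000},
]
clean = dedupe(raw)
print(len(raw), "->", len(clean))
```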

Then what happens every hour, we have this gigantic Databricks job. This Databricks job is where all of the statistical correlation, causation, all of the analysis is encoded, in Python. Jobs pick back up and then they basically spin. Can you guys hear me?

Benjamin: All good.

Vijaye: Yeah, they basically spin off thousands of servers to go orchestrate all the analysis, and then the outcome is basically a set of analytics. Those analytics go back to our data warehouse, where we actually break them down into lots and lots of pre-computed metrics, so that when you actually go to our console, you'll be able to get these things in real time. So you don't have to recompute anything. The UI is extremely responsive. You can click around and actually have things pop up very fast, because that experience is important. Otherwise, if you're sitting for even one minute waiting for every one of these queries to run, you lose the whole fluidity.

We take a lot of care about making sure that those things are working as quickly as possible.

Eldad: Nice.

Benjamin: Nice. It's not about querying anymore. It's just a feature. The fact that there is a query running in the background, users don't care.

Benjamin: All good. Cool. You have these hourly batch processes running, which means, say, I launch something, this experiment starts. After an hour or two hours, I'll start seeing some data on my experiment. What if something goes terribly wrong? I did the worst thing ever and it just breaks everything? Is the safety then that it's only with 1% of my customers, and so these couple of hours don't hurt a lot, or is there a fail-safe or trap door if things go terribly wrong?

Vijaye: Yeah.

Eldad: Benjamin, you want them to fix your bugs as well? I mean, come on.

Benjamin: I'm sorry.

Eldad: Don't stretch it. Don't stretch it too far. Sorry, go ahead.

Vijaye: No. I think that's important. I think you've hit the nail on the head, because it's important to roll out slowly in stages. So typically we see most customers, when they create a feature flag, it's open to only employees, and so there's usually this dogfooding process. If something is terribly wrong, you catch it with just employees before it even hits your customers or users. And then what you do is you roll it out further. Sometimes customers have these early adopters that are a little bit more okay to get some beta experiences, and they're okay to tolerate some of these things that are broken. And so you turn it on for the next ring, which will be the early adopters. And then obviously you are monitoring all these metrics. Sometimes you get a little bit of scale, which is good. So it's stress testing your product, and so if something is broken, you pause the rollout and you go back and fix it.

Then the next stage is, okay, well now I'm going to open it up to 1% of the general public, or you can pick a country. Okay, I'm going to open it up only to the United States. Sometimes people do that because they don't want to invest in localization yet, and they want to test it in English-speaking countries first, and if everything works, then you invest in localization. And so you roll it out to 1% and see how things are operating.

The good thing about this is if something is broken, you catch it within an hour, which is as real time as we can get. And so then you are going back and fixing things versus waiting and waiting for your customers to tell you, or for someone to file a ticket through a support channel, things like that. So that's, I think, why it's extremely important to catch these early and fix them.
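The staged rollout described above can be sketched as a gate check. This is an illustrative sketch only, not Statsig's SDK or evaluation logic: the `gate_open` function, the user fields, the stage numbering, and the choice to combine the US restriction with the 1% slice are all assumptions. The one real technique it shows is deterministic hashing, so the same user always lands in the same bucket.

```python
import hashlib


def bucket(user_id: str, salt: str) -> int:
    """Map a user deterministically to a bucket in [0, 10000)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000


def gate_open(user: dict, stage: int, salt: str = "new_checkout") -> bool:
    """Hypothetical rings: 1 = employees, 2 = early adopters, 3 = 1% of US users."""
    if stage >= 1 and user.get("is_employee"):
        return True
    if stage >= 2 and user.get("early_adopter"):
        return True
    if stage >= 3 and user.get("country") == "US":
        return bucket(user["id"], salt) < 100  # 100 of 10,000 buckets = 1%
    return False


employee = {"id": "e1", "is_employee": True}
adopter = {"id": "a1", "early_adopter": True}
print(gate_open(employee, stage=1), gate_open(adopter, stage=1), gate_open(adopter, stage=2))

# Roughly 1% of a synthetic US population should pass at stage 3.
us_users = [{"id": f"user-{i}", "country": "US"} for i in range(10_000)]
exposed = sum(gate_open(u, stage=3) for u in us_users)
print(exposed, "of", len(us_users), "US users exposed at stage 3")
```

Pausing a rollout then amounts to lowering `stage`; no user's bucket changes, so re-opening later exposes the same people.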

One interesting thing, and I'll add this here too: a lot of people think that, oh, you need lots of samples in order to even catch these things. But if something is broken, you don't need thousands of samples. You need 10 samples to tell you, oh my gosh, this particular metric is dropping, so let's go figure out what's going on.
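A quick back-of-the-envelope check shows why ten samples can be enough when the effect is large. The numbers are made up for illustration: assume a healthy checkout succeeds 95% of the time, and compare a badly broken rollout (2 successes in 10) with a subtle regression (9 successes in 10) using an exact binomial tail.

```python
from math import comb


def binom_tail_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


baseline_success = 0.95  # assumed healthy success rate (illustrative)

# Broken feature: only 2 of the first 10 users succeed.
p_broken = binom_tail_at_most(2, 10, baseline_success)

# Subtle regression: 9 of 10 succeed, which looks almost normal.
p_subtle = binom_tail_at_most(9, 10, baseline_success)

print(f"P(<=2 successes in 10 | healthy) = {p_broken:.2e}")
print(f"P(<=9 successes in 10 | healthy) = {p_subtle:.2f}")
```

The first probability is vanishingly small, so ten samples are already a loud alarm; the second is large, which is why small effects are the ones that need thousands of samples.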

Benjamin: What volumes of data are we actually talking about here? How much data are you processing every day?

Vijaye: Oh my gosh, terabytes. We have about 15 to 20 billion events every single day, give or take. One of our large customers ran a sale for three days, and what happened was their volume tripled for a week, and our system was able to scale and take care of that. But give or take 15 to 20 billion, and it's been growing. Just eight months ago, we had a million or so events a day, and now we're talking 20 billion events. Those things are raw events, right? And so the job that we need to do is to quickly dedupe and reduce that as much as possible before we toss it over to the Databricks jobs.

Eldad: Benjamin, this is a huge insert into the group by, by the way.

Benjamin: I see.

Eldad: Translating to relational algebra, insert into group.

Benjamin: Ah, now I understand...

Eldad: Billions, billions of unique events.

Benjamin: I always need those relational algebra trees to understand what's going on.

Eldad: That's why I'm here. You're welcome.

Vijaye: And then what happens after that is we actually store all the raw events for when our customers need them and want to be able to go download them, and when you download, you get this terabyte file, which is pretty crazy. It takes days to even download.

Then, obviously, we offer several ways to slice and dice and filter, so you can get your data as quickly as possible. But we store all of that stuff, and so we're not quite at a petabyte yet, but we're quickly approaching that.

Benjamin: Awesome! So what are actually the scalability limits on those pipelines? What's harder, basically: growing to 10X more customers or a single customer of 10X the size? Is it similar? Is it very different? How is that looking?

Vijaye: Yeah, it is very different. I think 10X customers is actually easier than one customer growing 10X, especially if that customer is extremely large. Because what happens is the joins and the analysis that we do are within the same customer's data. We actually have separation of data. These things take much longer and require a lot more orchestration. It's interesting, some of these things that I never thought we would run into, we are running into. For example, when the Databricks job runs, it spins up thousands and thousands of spot instances, spot VMs, that then actually go and do the actual heavy lifting. And then to communicate with each of these spot instances you need local IP addresses, and you run out of IP address allocation. Really! I never thought that you would actually run out of those kinds of things, and apparently those are resources, and then you're like, okay, well, how are we going to solve this problem? And then you have to start thinking about, okay, well...

Benjamin: IPv6.

Vijaye: Pretty much, right. It is fascinating, when you're running some of these jobs at scale with so much data. Even the choice of how we store the data, what kind of databases we use, and how we run these orchestrations has been affected, and it's actually evolving too. Obviously, we didn't foresee all of this stuff, so we built something that worked six months ago, and then you throw it all away and then you rebuild based on what we know now, and I am pretty sure six months from now, we'll probably throw away all of what we have right now and then rebuild something for whatever scale we will be dealing with. Because those are the times we're actually going to identify some new problem that we haven't even thought about.
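The IP exhaustion problem is easy to reproduce on paper. The sketch below uses Python's standard `ipaddress` module to count how many workers a private IPv4 block can address; the CIDR ranges and worker counts are illustrative assumptions, not Statsig's actual network layout, and real cloud subnets usually reserve a few more addresses than the two subtracted here.

```python
import ipaddress


def usable_addresses(cidr: str) -> int:
    """Addresses in an IPv4 block minus network and broadcast (prefixes shorter than /31)."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


def fits(cidr: str, workers: int) -> bool:
    """True if every worker can get its own private address in the block."""
    return workers <= usable_addresses(cidr)


for cidr in ("10.0.0.0/24", "10.0.0.0/20", "10.0.0.0/16"):
    print(f"{cidr}: {usable_addresses(cidr)} usable addresses")

# A job that fans out to 5,000 spot instances does not fit in a /20.
print(fits("10.0.0.0/20", 5_000), fits("10.0.0.0/16", 5_000))
```

A subnet sized comfortably for a few hundred workers becomes the bottleneck once a single job fans out to thousands.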

Benjamin: Right. So, what's, for example, on your current stack, an experience where you're saying, wow, we would really like to provide this, but it's something we just can't do on the architecture we have?

Vijaye: Well, I mean, wouldn't it be awesome if we were able to give you something more real time than one hour? So imagine if we were able to give you results in minutes, 5 minutes, 10 minutes, that would be amazing. Currently, with our systems and architecture, the expense, the cost of infrastructure, is prohibitive enough that we cannot provide that. So we have to work with an hour, which is as real time as we can get at this point, but I would love to keep pushing that limit as much as we can, because I think we've gone from many, many weeks to an hour, which is a pretty huge step forward. But we're not going to stop there.

Benjamin: Awesome!

Eldad: I smell subscriptions coming along soon.

Vijaye: Yeah. I think especially as we go into this Web 3 world, I'm assuming that real time is going to become an important issue. So we will have to keep investing.

Benjamin: Awesome!

To wrap up the conversation about your internal data stack, I assume you guys are dogfooding basically, so that you're also using Statsig internally to try out the things you are launching. Do you want to quickly talk about that maybe?

Vijaye: Oh yeah. Everything we do is behind a feature flag. So we don't ever just throw out any code that is even a little bit sensitive or that we want to validate. And so we throw everything behind a feature flag.

Obviously that gives us the separation of when we can roll out a feature. Sometimes we do an announcement to everyone, and then when we do the announcement is when we open up the feature flag. Obviously, we are also monitoring how people use these new features, and when we build a new feature, is there adoption of this feature? And then we also constantly consider, okay, should we invest more in this feature or not. Those are product decisions that we make on a daily basis based on the data we have about how people are using our own product.

Our marketing site is full of feature flags. If you go to our marketing site, there's the button text that says "Get a demo." That text is actually different for every person, and that is driven by a multi-armed bandit experiment, which is automatically deciding which text actually yields the most conversions. And then we'll start to show more and more of the winning variants. So we have 8 different variants that we throw into that system. So, we use Statsig pretty extensively.
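For readers unfamiliar with bandits, here is a small simulation of the idea. The conversation does not say which bandit algorithm Statsig uses, so Thompson sampling is a stand-in, and the eight conversion rates are invented for the simulation. Each round, the code draws a plausible conversion rate for every variant from its Beta posterior and shows the variant with the highest draw.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return how often each variant was shown."""
    rng = random.Random(seed)
    n = len(true_rates)
    wins = [0] * n    # conversions observed per variant
    losses = [0] * n  # non-conversions observed per variant
    for _ in range(rounds):
        # Sample a plausible rate per variant from Beta(wins + 1, losses + 1).
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
        arm = samples.index(max(samples))
        # Simulated visitor converts with the variant's (hidden) true rate.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


# Eight hypothetical button texts; variant 4 is the clear winner.
rates = [0.02, 0.03, 0.02, 0.04, 0.30, 0.03, 0.02, 0.03]
shows = thompson_sampling(rates, rounds=5000)
print(shows)
```

Early on, every variant gets traffic; as evidence accumulates, the winning variant is shown more and more, which matches the behavior described above.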

Benjamin: Awesome! Cool!

Maybe to close out today's podcast episode, we'll shift gears a bit, more to the advice side for our listeners. On my end, as Eldad said earlier, I have more of an academic background, so I'm relatively new to industry, and you've seen all of it, right? Over three decades, or well, starting your third decade, at Big Tech and now at startups. Do you have any advice for aspiring engineers, basically?

Vijaye: Yeah, I think, as an early engineer, I always chased technology. So in the beginning I was like, okay, for me, the compiler engineers are like gods. And I was like, oh, I want to be a compiler engineer. So, I went in... I was at Microsoft, I worked on compiler technologies. I worked on language services, incremental compilation and stuff like that.

Eldad: Benjamin's upcoming paper is on the query compilation, I'm sorry.

Vijaye: Oh, really.

Benjamin: I haven't told anyone yet.

Eldad: We cut it out, but we don't cut anything out. We'll list it after we release the paper. Sorry, go ahead.

Benjamin: Three months.

Vijaye: Yeah, and I went in, worked on compilers for a while and then over time I started realizing technical problems actually ladder up to user problems. Okay. What user problems are you solving? And so you have a view on, okay, this is what the problems are, and those translate into technical problems. And so you just start chasing specific user problems that you're interested in solving. So developer problems and so I worked on lots of developer frameworks in Windows and such.

And then over time, you start to realize a lot of the user problems after you peel the first couple layers are about the same, the technologies that you deal with, the problems that you solve are repeated over and over and over. You start to pattern match and stuff.

Then it starts to be, okay, there are people that I want to follow. You go from technology to user problems to people. You start following people because you pattern match, like, this person has done things that I want to do. I aspire to be like these people. And so, pick those folks. Because you don't have to chart your own path, some of these things have already been done. So, follow them. Pick good mentors. Pick good managers. And learn from them, because there's so much to learn from people that came before you.

As you grow, then there's an element of turn around and pay back, because just as you grew on the coattails of other great folks, now there are a set of folks that are joining and learning the trade, turn around and mentor them and coach them, creating that followership. I think that's the journey of an engineer as you grow in your scope, as you grow in your craft.

Eldad: Boom! I love it! It doesn't matter where you work. If you are an engineer, and if you love being an engineer, and if you actually think about the problems that you are solving as an engineer, there's no limit. And Benjamin, I hope I've been mentoring you enough, and if not, I will try harder, because, yeah, that's what the future of engineering is all about.

Benjamin: What am I supposed to say now, Eldad? Awesome!

Well, I think those were great closing words. Thank you so much Vijaye, for joining us. This was awesome. I think we've all learned a lot of new stuff. Cool!

Vijaye: Well, thanks guys.

Eldad: Thank you Vijaye.

Vijaye: Thanks, Eldad. Thanks Benjamin. Thanks for having me on your show. I'm looking forward to watching this.

Benjamin: Awesome!

Eldad: Absolutely!

Vijaye: All right, guys. Take care.

Benjamin: Take care.

Eldad: Bye-bye. Take care.
