Substack Data Platform Deep Dive

Substack is an amazing — if not the most amazing — content publishing platform out there. Essentially, it allows anyone to become a journalist or to start their own newsletters and charge subscriptions for them. So how did they build a data stack that can support all of their 500K paying subscribers?

Boaz: We are very proud and lucky to have Mike Cohen with us today. Mike is a super talented data engineer at Substack. Now, for those of you who don't know Substack, you should! Substack is an amazing, if not the most amazing, content publishing platform out there. Essentially, it allows people like you and me to become journalists or to start our own newsletters and charge subscriptions for them. And Substack has been growing. In February, they reported that they have 500,000 paying subscribers.

Eldad: That's old news.

Boaz: Yeah, that's probably old news by now. Every month that number grows like crazy. They've been getting a lot of media attention. I think that we should consider stopping this podcast and moving to Substack or something. That's the place to be right now. So, Mike has been working in the data space for quite some time prior to Substack, which we'll hear all about. He also spent time at companies like Capax, Venmo, and a lot of other exciting places.

Boaz: At Substack, how much data, in terms of data volumes, do you guys deal with?

Mike: I don't know what big data means when people say “big,” but I think we're in the small to medium data camp still. We're in the tens of terabytes, not in the petabytes or anything like that quite yet. But the data volumes are growing quickly as more and more people come on. We have a lot of event data that we're logging and that's where we're at today.

Boaz: I think the rule of thumb for “big data” is that everybody starts by apologizing that they have data, but it could be bigger. “Yeah, we only have hundreds of terabytes. We're not petabyte scale yet.”

Mike: Yeah.

Boaz: I think that is considered big data. And what's the headcount at Substack these days?

Mike: We just surpassed 40. In the last couple of weeks, we broke through the 40 mark. We're on a big hiring spree. Come check out our jobs page!

We're trying to scale up the team. Fingers crossed, we'd love to be somewhere in the seventies by the end of the year.

Boaz: And how many people deal with data?

Mike: We are a very small team in an already small company. We're a two-person team at the moment. So, it was just me for the first 15 months or so and then I recently brought on someone, so there have been two of us since March.

Boaz: So, you’ve grown really fast. In terms of subscribers and visitors, you've also grown dramatically in a little over a year. What does that look like from the retention perspective?

Mike: It's been exhilarating. It's been super fun to watch. And the problems to tackle have grown with the growth of the company in general. Exhilarating is the best way to describe it. Constantly thinking about “this thing that we were doing a week ago: now how do we do it at a much bigger rate and faster pace? How do we design our systems for a couple of weeks and months from now?” Stuff like that. So, it's been super fun.

Boaz: So, let's talk about your data world. Tell us what it looks like and what kinds of things you do with it.

Mike: We have a Postgres production database and what we call our events pipeline, which is effectively a Kinesis stream of data that gets processed in parts into S3 and subsequently then dumped into Snowflake. Separately, we also have a process that will mirror our data from production into the data warehouse. Other data sources are getting piped in there too. And so, everything ultimately lands there. Then we have our BI tooling set up on top of that. From there we do transformations internal to Snowflake, and then we pipe that back out to places. I think the phrase I've seen a lot these days is reverse ETL. But we send our derived or transformed data back out to a separate Postgres database such that the data can be accessed in the product with indexes and be super-fast. So that's our high-level structure today.
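
To make the "stream processed in parts into S3" step concrete, here is a minimal sketch of grouping streamed events into hourly batches before they are written out as objects. The key layout (`events/dt=.../hour=.../`), the event fields, and the function names are assumptions made for illustration; the interview does not describe how Substack actually partitions its data, and the sketch does not call Kinesis or S3 at all.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event: dict) -> str:
    """Build an hourly object-key prefix from the event's Unix timestamp.

    The prefix layout is a hypothetical convention, not Substack's.
    """
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return f"events/dt={ts:%Y-%m-%d}/hour={ts:%H}/"


def group_events(events):
    """Group events by partition prefix so each batch can be written as one object."""
    batches = defaultdict(list)
    for event in events:
        batches[partition_key(event)].append(event)
    return dict(batches)


# Three made-up events: two in the same hour, one in the next hour.
events = [
    {"ts": 1620000000, "name": "email_opened"},
    {"ts": 1620000060, "name": "post_viewed"},
    {"ts": 1620003600, "name": "email_opened"},
]
batches = group_events(events)
for prefix, batch in sorted(batches.items()):
    print(prefix, len(batch))
```

Time-based prefixes like these are a common choice because a warehouse loader can then pick up one partition at a time.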

Eldad: What about BI? Which BI tools are you using there?

Mike: We use Periscope Data, which was acquired by a company called Sisense.

Boaz: So how much of the data stack at Substack do you consider legacy versus modern? How much has it changed since you joined or is it something that's already built to scale into the future?

Mike: That's a good question. A lot has changed since I joined. When I first joined, we had very little in terms of BI tooling or any data warehousing. So, all of that was in the last 12 to 15 months and that's gotten us to where we are today. As a company, today we are thinking about how we start to really put the data to work now that it's much easier to work with and accessible, and we have the systems in place to put it back into the product or to do BI and analytics. And then I think after that, there'll be the next chapter of asking “What do we do next? How do we build towards more real-time? How do we build towards faster insights?” And unfortunately, as a small team, we have to take things in little chapters. That's how I kind of think of it. So, I think we're on chapter two now and then chapter three will be about figuring out how to ramp this up even faster and do even more.

Boaz: Who is driving the requirements for BI? With 40 people, is it specific departments or cross-departmental?

Mike: It's a mixture of data exploration work, which can be driven by either the product or data team. We also have other internal teams, whether it's our support team or what we call our partnerships team, and they have data questions. As a data team, we will help them answer those questions by giving them reports that they can monitor in Periscope.

Boaz: So how much of a bottleneck do you end up with? It sounds like you have a lot of supporting to do.

Mike: Yeah. That's one of the reasons for doubling and why we're seeking to double again, hopefully this year. I'd love to end the year with around four people on the team. And so it's definitely a factor. And I would say it was only somewhat recently we found our team chemistry... we used to be more technical users than non-technical users. In other words, more SQL users than non-SQL users. And only very recently has our growth shifted that equation to where now we have more non-SQL users. And so that bottleneck has started to become more apparent than it once was.

Boaz: What does your morning routine look like? Which tool do you open the most to check in on things every day?

Mike: That's a good question. I check Slack to make sure there's nothing in the data channel that someone has reported or asked about. Then I have about six Periscope dashboards that I look at in order every single day that are basically pinned in one of my Chrome browsers. They're high-level company stuff. And then I have a spam publication detection dashboard where I look for any bad actors and try to handle that, too.

Boaz: It’s an interesting use case. How do you find the right actors in the spam publications?

Mike: I can’t answer that. That'll give away my fraud rules and everyone will know how to beat them.

Boaz: Good point, good point. I was just testing you, but okay. Let's do what we call a lightning round...

Mike: Okay.

Boaz: So don't overthink. Shoot straight. Let's see what you come up with. Are you ready?

Mike: Yes.

Boaz: Commercial or open source?

Mike: Commercial.

Boaz: Batch or streaming?

Mike: Streaming.

Boaz: Write your own SQL or use a drag-and-drop vis tool?

Mike: Write my own.

Boaz: Work from home or from the office?

Mike: That one is hard.

Eldad: Both. You can have both.

Mike: Yeah, I think three days at home, two days in the office.

Eldad: Yeah, exactly.

Boaz: There are hybrid models now, so it's legit to say both. AWS, GCP or Azure?

Mike: AWS.

Eldad: So, you can pick one. To DBT or not to DBT?

Mike: Controversial, not to DBT.

Boaz: To delt delete or not to delt delete.

Mike: Not to delt delete.

Boaz: Not to DBT. I think this is the first time we've had somebody say no.

Eldad: This is the first time.

Mike: I know.

Boaz: Let's talk about that.

Eldad: You're probably mistaken. You probably got that answer wrong.

Mike: All your listeners just turn this episode off.

Eldad: We put it in the trailer.

Boaz: So, elaborate on that a little bit. So, what's your take on DBT and why not?

Mike: No.

Eldad: Big no. It was a big no, that's why.

Mike: It's not, it's not. I don't have that strong of an opinion. I've used it, but I have another system that I kind of put together that does similar things and I think allows a little bit more control. Ultimately, with DBT, my understanding is you still need to have something that schedules and orchestrates the jobs and so I’ve just kind of created a system that does a lot of that. I don't want to say it's anywhere near as comprehensive as DBT, but it does a lot of that. And it's all just based in Airflow, so it’s Python-based.
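
One core piece of any such system is running SQL transformations in dependency order. The sketch below shows that idea using only the Python standard library (`graphlib`, Python 3.9+). It is not Mike's system and it does not use Airflow; the table names and the `execute` callback are hypothetical stand-ins for whatever would actually submit SQL to the warehouse.

```python
from graphlib import TopologicalSorter

# Hypothetical transformations, each mapped to the upstream tables it reads from.
DEPENDENCIES = {
    "stg_events": set(),
    "stg_subscriptions": set(),
    "publication_daily_stats": {"stg_events", "stg_subscriptions"},
    "dashboard_summary": {"publication_daily_stats"},
}


def run_order(dependencies):
    """Return transformation names so that every upstream table comes first."""
    return list(TopologicalSorter(dependencies).static_order())


def run_all(dependencies, execute):
    """Call execute(name) once per transformation, in dependency order."""
    for name in run_order(dependencies):
        execute(name)


# Stand-in for "submit this transformation's SQL": just record the name.
executed = []
run_all(DEPENDENCIES, executed.append)
print(executed)
```

A scheduler such as Airflow expresses the same graph as task dependencies and adds retries, scheduling, and monitoring on top.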

Boaz: You said you guys are on Snowflake. How much processing do you do there? Do you guys do ELT exclusively in Snowflake, or do you also do a lot of processing outside of Snowflake?

Mike: No. A hundred percent of the processing is happening in Snowflake.

Eldad: Have you ever considered using Spark to do that? Or did you just start from a clean sheet with Snowflake? No need to migrate anything.

Mike: That was my thought. Yeah, it was a clean sheet. Start from scratch and we'll see when we need to go bigger than Snowflake “can handle.” I'm sure they don't want to hear that, but I'm sure there's a point where using something like Spark in a more distributed fashion, where you can have a lot more control, might make a lot of sense. But we're not there yet at least.

Boaz: Looking at your pie chart of time spent on which activities, how much time do you spend supporting the BI users and the BI tools versus supporting the warehouse or supporting the pipeline, and so forth?

Mike: Yeah, not to cop out on the question, but I'm pretty evenly split at the moment. And there's like another administrative chunk, which is hiring and building the team out so that I can be more places all the time.

Eldad: So basically, like most high-growth startups, 70% of your time goes to hiring and then the 30% that's left goes on real stuff, which is great.

Mike: Yeah, I would say I'm at 35% hiring, 35% support of different people or functions and meetings, and then the remaining 30% is split evenly between data engineering work and my own data analytics work.

Boaz: Always hiring is also always a good excuse, because you can tell people, “I'm sorry, it will be fixed the moment we hire another person, so I'm actively hiring.” It's not your fault, essentially.

Eldad: Boaz loves hiring. He discovered hiring a few months ago.

Boaz: Eldad is always complaining, “Why didn’t you do this? Why didn’t you do that?” And I tell him I'm hiring for it.

Okay. So, tell us about an awesome win at Substack.

Mike: I mentioned before we want to get to a place where there’s more real-time analytics and more real-time insights in the product. But right now, we're living in a batch world where our definition of “real-time” is really every 20 minutes. We're kind of updating data in place. But I'm pretty proud of the system that we have. We're running a bunch of interesting, complex queries that create meaningful tables that are de-normalized and great for analytics, but also great for serving up things in the product. And then the way we're piping that back to Postgres with indexes in a way that is efficient and scalable is pretty neat. So I'm happy with that system that we have in place.
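
"Updating data in place" on the serving side is often done with an upsert: each batch inserts new rows and overwrites rows that already exist. The sketch below illustrates that pattern under loud assumptions. It uses SQLite from the Python standard library as a stand-in so it runs anywhere (Postgres accepts a very similar `INSERT ... ON CONFLICT ... DO UPDATE` statement), and the table, columns, and index are invented for illustration rather than taken from Substack's schema.

```python
import sqlite3

# In-memory SQLite database standing in for the serving Postgres database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publication_stats ("
    " publication_id INTEGER PRIMARY KEY,"
    " subscribers INTEGER NOT NULL,"
    " opens INTEGER NOT NULL)"
)
# A secondary index so the product can look rows up quickly by another column.
conn.execute("CREATE INDEX idx_stats_subscribers ON publication_stats (subscribers)")

# Insert new rows; overwrite existing rows that share the same primary key.
UPSERT = (
    "INSERT INTO publication_stats (publication_id, subscribers, opens) "
    "VALUES (?, ?, ?) "
    "ON CONFLICT (publication_id) DO UPDATE SET "
    "subscribers = excluded.subscribers, opens = excluded.opens"
)


def sync(connection, rows):
    """Apply one batch of derived rows inside a single transaction."""
    with connection:
        connection.executemany(UPSERT, rows)


# Two batch runs, e.g. 20 minutes apart: the second updates id 2 and adds id 3.
sync(conn, [(1, 100, 40), (2, 250, 90)])
sync(conn, [(2, 260, 95), (3, 10, 2)])
rows = conn.execute(
    "SELECT publication_id, subscribers, opens "
    "FROM publication_stats ORDER BY publication_id"
).fetchall()
print(rows)
```

Because each run is idempotent for a given batch, a failed sync can simply be retried on the next cycle.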

Eldad: Connecting data back to the product and feeding the product experience with data is huge. And you're right, it is super satisfying to get there.

Mike: Yeah. And when you send a newsletter, a lot of people want to just refresh, refresh, refresh, refresh, and watch the numbers tick. We're not there yet and I want to get there. There'll be other exciting things between now and then, but that will be a really exciting day for me.

Boaz: Now, enough with the self-gratification and the win stories. Tell us about an epic failure.

Mike: Okay. That's a bigger list.

Eldad: Everything that happened before we managed to get data back to the product.

Mike: I guess I should say, thankfully, there haven't been catastrophic errors that we can attribute to the data team, but there have been things we've done poorly. For example, we were writing data too aggressively to this Postgres thing I keep talking about and we ended up filling up the write-ahead log and knocking over the database. All queries started to time out and then the site went down. So, we've had a number of learning experiences about how to do things that keep the site running and how to test things a little bit better. We also use some of our metrics and monitoring tools like Honeycomb to have a sense of when things might be going wrong and then we try to prevent that from happening in the first place. So, there have been a lot of small disasters, nothing too catastrophic yet. The keyword is “yet” because I'm sure that it's coming.
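
Mike does not say how the team fixed the write-ahead log incident. One common way to write less aggressively is to split a large load into bounded chunks and pause between them, so the database has time to catch up. The sketch below shows only that general idea; the chunk size, pause length, and the `write_chunk` callback are hypothetical placeholders, not values or functions from Substack.

```python
import time
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_throttled(rows, write_chunk, chunk_size=1000, pause_seconds=0.5,
                    sleep=time.sleep):
    """Write rows chunk by chunk, pausing after each chunk.

    `write_chunk` stands in for the real database write. `sleep` is
    injectable so the pause can be skipped in tests.
    """
    written = 0
    for chunk in chunked(rows, chunk_size):
        write_chunk(chunk)
        written += len(chunk)
        sleep(pause_seconds)
    return written


# Demo: "write" 2,500 rows into a list, with no pause so it finishes instantly.
received = []
total = write_throttled(range(2500), received.append,
                        chunk_size=1000, pause_seconds=0)
print(total, [len(chunk) for chunk in received])
```

In practice the pause would be tuned against monitoring data, and a fixed sleep could be replaced by a check on replication lag or log size.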

Boaz: What's the top challenge for data engineers or the data team in general at Substack?

Mike: I prefer to keep the surface area small. So, in many organizations, there might be a data warehouse or something like that and then data is sent back out to a lot of different other services. And I am trying to not send it out to too many places because my fear is that ends up leading to a situation where you have one person looking at something in Google Sheets, or Excel, or Airtable and saying, “Oh, I see this number here, but over in the BI tool, the number is different.” There are a lot of ways to try to control for that, but one way I think to control for that is to try to centralize and keep things in one location.

And so, the thing I'm thinking about a lot recently is how to make our data and our analytics more self-service, and more self-service for users who are not necessarily technical. Whether it's building canonical data sets that are easy to query and we give everyone a little SQL lesson, or we get some sort of tooling that doesn't require SQL knowledge at all. How do we democratize the data access a bit more? So, I think about that a lot, and that's maybe more on the analytics side than on the engineering side, but I think that they're two sides of the same coin, really.

Boaz: How much of your responsibilities are on the analytics side as well?

Mike: All of it.

Boaz: Building out company dashboards and stuff like that. All of it? Wow. What gets on your nerves the most in your daily work with data?

Mike: Reconciling data from different data sources. For example, yesterday, I spent a long time trying to reconcile data that we're seeing from a test that we're running in Optimizely with our own data events logging and it's hard. It's an example of what I was talking about just a moment ago where Optimizely is a bit of a black box. You want to be able to put trust in the tool but sometimes you verify it yourself and in verifying it yourself, you end up in a rabbit hole. And so that can be kind of frustrating but it's important.
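
A first pass at this kind of reconciliation is often just a per-key comparison of counts from the two sources, with a tolerance for small expected differences. The sketch below assumes each source has already been reduced to a dictionary of counts per experiment arm. The numbers, the 1% tolerance, and the function name are made up for illustration, and nothing here calls Optimizely or any real API.

```python
def reconcile(source_a, source_b, tolerance=0.01):
    """Return keys whose counts differ by more than `tolerance`.

    The difference is measured relative to the larger of the two counts.
    A key missing from one source is treated as a count of zero.
    """
    mismatches = {}
    for key in sorted(set(source_a) | set(source_b)):
        a = source_a.get(key, 0)
        b = source_b.get(key, 0)
        baseline = max(a, b)
        if baseline and abs(a - b) / baseline > tolerance:
            mismatches[key] = (a, b)
    return mismatches


# Made-up visitor counts per experiment arm from two systems.
experiment_tool = {"control": 10000, "variant": 10050}
event_log = {"control": 10020, "variant": 9400}

diff = reconcile(experiment_tool, event_log, tolerance=0.01)
print(diff)
```

Narrowing the comparison to the keys that disagree is usually where the rabbit hole starts: the next step is checking definitions, time windows, and deduplication rules for just those keys.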

Boaz: Yeah. I feel you on that one. It's important, it's frustrating.

Mike: Yeah.

Boaz: For sure. Okay, so we're close to reaching the end. We want to get your advice on which companies, leaders or people to follow that inspire you or that you find interesting online.

Mike: Data Engineering Weekly is a cool Substack newsletter. Tristan Handy has a great newsletter, which is not Substack based but is still a big newsletter.

I'm also interested in what you guys are doing at Firebolt. I love Snowflake, but one of the reasons we're sending data back out to Postgres is because you lose the ability to index or to have functional aggregations that are very snappy. And so, I think getting the data warehousing or the OLAP databases to look more like databases where you can index is going to be very, very interesting and compelling in the near future. So that's very interesting to me.

Boaz: Thank you so much. I think this is it. Eldad, what do you think?

Eldad: I think his interests are really spot on.

Boaz: Imagine Mike is not here with us. What would you tell me about him behind his back?

Eldad: So, I think the stack is perfect and frustration is hard, but it's a daily frustration we all deal with. And really, I think I wish you all the best. I wish you can scale fast. I wish you can hire a team that you love and can work with and accomplish meaningful things together. Always great to see you.

Boaz: And congrats on being a part of the success of Substack. So, keep in touch and thank you everybody for hopping on this episode of the Data Engineering Show. See you soon.

Eldad: See you soon. Bye bye.

Mike: Thank you.
