February 28, 2025

How Similarweb Serves 100s of TBs to Worldwide Users in Milliseconds

At Similarweb, we analyze internet traffic on a global scale, empowering clients with actionable insights for digital market research, campaign tracking, and market share analysis.

What makes our work unique is the scale and complexity of our data. Unlike classic data engineering, we collect data from the entire internet - hundreds of terabytes daily - and make it instantly accessible to our users. Clients can ask questions like: what is the size of a specific market? Which search terms drive the most traffic? How are campaigns performing against competitors? Who captures the most market share? We also provide a Sales Intelligence solution, similar to ZoomInfo but tailored for advertising companies looking for leads.

Similarweb’s Data Pipeline: Serving Layer

This is the typical architecture my group uses to make data accessible from our data lake to production.

Our back-end developers build the services behind both the UI and the API, but they also manage many pipelines that take raw data, transform it, and load it into our production databases for end-user consumption.

Key-Value Systems vs SQL Databases: When to Use Each

How do key-value systems compare to SQL databases or execution engines, and which do we use when? I’m going to focus on DynamoDB and Firebolt as specific examples. 

In terms of raw performance, there's no real competition between key-value stores and query engines, for the simple reason that a key-value store accesses the relevant key directly. Think of it as a very simple hash map: there is no query to plan or execute. A query engine, on the other hand, runs more complex queries that require more CPU and RAM, and as the data grows, memory has to hold larger volumes, resulting in longer processing times.

In a key-value store, there's no issue accessing, deleting, or updating a record by its key. The same operations become very dangerous once you're holding terabytes or more - certainly in relational databases, and especially in any big database solution. They always say large data volumes are supported, but I wouldn't recommend relying on it.

When we encounter complex queries, SQL is significantly favored. Anyone who has experimented with the more advanced MongoDB-style solutions and tried to run really complex queries - especially JOINs, or building all sorts of views - knows how complicated things become. This is where query engines enter the picture, allowing us to get much more complex and interesting insights from the data.
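To make that concrete, here's a rough sketch of the kind of query a SQL engine handles in a single statement; the table and column names are hypothetical. Answering the same question on top of a key-value store would mean fetching both datasets into application code and joining them by hand.

```python
# Illustrative only: hypothetical tables (daily_visits, domains) and columns.
# A query engine resolves the JOIN, GROUP BY and window function in one pass;
# a key-value store would force us to reimplement all of this in code.
MARKET_SHARE_SQL = """
SELECT d.category,
       d.domain,
       SUM(v.sessions) AS total_sessions,
       SUM(v.sessions) * 100.0
           / SUM(SUM(v.sessions)) OVER (PARTITION BY d.category) AS share_pct
FROM   daily_visits AS v
JOIN   domains      AS d ON d.domain = v.domain
WHERE  v.visit_date BETWEEN '2023-01-01' AND '2023-06-30'
GROUP BY d.category, d.domain
ORDER BY d.category, total_sessions DESC;
"""
```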

Simple Analytics in Key-Value

As an example of key-value, let’s look at DynamoDB, a fully-managed AWS product. There’s no need to deal with optimizing the database or the DevOps behind it. It’s truly serverless - you don’t even need to configure anything related to the engine. From my experience, it’s also one of the most reliable services available.

A classic example of when we'll use key-value is time-series data. This is a typical graph Similarweb uses to show time series for many metrics:

Openai.com, Dec '22 - Jun '23, worldwide, daily granularity - the number of user sessions a domain had in a given time period

In this case, the domain and date serve as the key, and the number of user sessions is the value. This structure makes it easy to segment data over time.
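A minimal sketch of that lookup with boto3; the table name ("daily_domain_sessions") and attribute names are hypothetical, but the shape is the point: the domain is the partition key, the date is the sort key, and reading a data point is a direct key access.

```python
import boto3

# Hypothetical table: partition key = domain, sort key = date (ISO string),
# value = number of user sessions for that domain on that day.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("daily_domain_sessions")

def get_sessions(domain: str, date: str) -> int:
    """Direct key lookup - no scan and no query planning, just a hash-map-style get."""
    response = table.get_item(Key={"domain": domain, "date": date})
    item = response.get("Item")
    return int(item["sessions"]) if item else 0

print(get_sessions("openai.com", "2023-06-01"))
```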

Serving time-series data

Of course, we analyze longer time frames and compare competitors. We retrieve data in batches, another feature supported by DynamoDB. But here’s the catch: all manipulation—filtering, aggregation, and even the simplest operations—happens in the code. We don’t run scans in DynamoDB because it’s expensive and inefficient.

At these volumes, it works very fast. DynamoDB handles it without issue, so we always choose this option for simple datasets where users need raw data as it exists in the database.
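Here is a sketch of that batch pattern, again with hypothetical names: the keys for the requested date range are fetched with BatchGetItem, and the filtering and aggregation happen in our code rather than in DynamoDB.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
TABLE = "daily_domain_sessions"  # hypothetical table name

def total_sessions(domain: str, dates: list[str]) -> int:
    """Fetch a date range in one batch and aggregate it in application code."""
    keys = [{"domain": domain, "date": d} for d in dates]
    response = dynamodb.batch_get_item(
        RequestItems={TABLE: {"Keys": keys}}  # BatchGetItem allows up to 100 keys per call
    )
    items = response["Responses"].get(TABLE, [])
    # In production you would also retry response["UnprocessedKeys"].
    # The aggregation happens here, in code - no scans against DynamoDB.
    return sum(int(item["sessions"]) for item in items)

june = [f"2023-06-{day:02d}" for day in range(1, 31)]
print(total_sessions("openai.com", june))
```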

DynamoDB Pricing - not for the faint-hearted

There are two payment options in DynamoDB:

  • On-Demand: $1.25 per million writes and $0.25 per million reads. This works well for handling peaks but is expensive at scale.
  • Provisioned: Lower cost per operation, but with write limits.

At Similarweb, we write in batches—daily or monthly—so we increase the write limit before writing and reduce it afterward. However, there's a daily limit on how many times you can scale capacity back down, and switching between modes can get messy when data issues arise.
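In code, the pattern looks roughly like the sketch below (the table name and capacity numbers are made up): raise the provisioned write capacity, run the bulk load, then scale back down.

```python
import boto3

client = boto3.client("dynamodb")
TABLE = "daily_domain_sessions"  # hypothetical table name

def set_capacity(write_units: int, read_units: int = 100) -> None:
    """Adjust provisioned throughput and wait for the table to become ACTIVE again."""
    client.update_table(
        TableName=TABLE,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
    client.get_waiter("table_exists").wait(TableName=TABLE)

set_capacity(write_units=20_000)   # raise the write limit ahead of the batch
# ... run the bulk-write job here (for us, a Spark job writing in parallel) ...
set_capacity(write_units=100)      # scale back down; note the daily limit on decreases
```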

However you look at it, writing seven million records a day comes out super expensive, and we really suffered with DynamoDB tables that kept growing. For at least two years we asked Amazon for a true batch write, or a way to upload a file directly to DynamoDB. In the meantime, we ran a Spark job that wrote records one by one; it finished in a reasonable amount of time, but it cost a ton.

Optimizing costs with S3 import

To address this, we switched to S3 imports. Instead of expensive per-record writes, we now load bulk data as JSON or CSV files at $0.15/GB. The result is significant cost reduction without maintaining Spark clusters or dealing with on-demand pricing issues.

The catch? Bulk S3 imports are only supported for new tables. We’ve adapted by creating a new table for each month’s data and managing API calls per table. While the infrastructure works, this approach has its frustrations.
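A sketch of that monthly import with boto3's import_table API follows; the bucket, prefix, and key schema are hypothetical. Because imports can only target new tables, the table name carries the month.

```python
import boto3

client = boto3.client("dynamodb")

def import_month(year: int, month: int) -> str:
    """Create a brand-new table for the month and bulk-load it from S3."""
    table_name = f"search_terms_{year}_{month:02d}"   # one table per month
    response = client.import_table(
        S3BucketSource={
            "S3Bucket": "example-dynamodb-exports",            # hypothetical bucket
            "S3KeyPrefix": f"search_terms/{year}/{month:02d}/",
        },
        InputFormat="DYNAMODB_JSON",
        InputCompressionType="GZIP",
        TableCreationParameters={
            "TableName": table_name,
            "KeySchema": [
                {"AttributeName": "domain", "KeyType": "HASH"},
                {"AttributeName": "keyword", "KeyType": "RANGE"},
            ],
            "AttributeDefinitions": [
                {"AttributeName": "domain", "AttributeType": "S"},
                {"AttributeName": "keyword", "AttributeType": "S"},
            ],
            "BillingMode": "PAY_PER_REQUEST",
        },
    )
    return response["ImportTableDescription"]["ImportArn"]

print(import_month(2023, 6))
```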

From Simple Queries to Complex Analytics

When queries became more complex—like analyzing search terms for retail giants like Amazon and Walmart—we needed a solution that could handle large-scale analytics efficiently. Here’s an example of a table containing ten million search terms for a specific month in the U.S. 

For this use case, we needed to aggregate statistics, apply filters, and run multiple manipulations—operations that quickly become infeasible to handle in-memory. At the scale we required, the data simply couldn't be pulled into application code and processed there. After extensive testing and POCs, we adopted Firebolt about four years ago to overcome these challenges.

Breaking Through Limitations of Speed and Scale with Firebolt

Firebolt is a highly optimized columnar database powered by an efficient query engine. It scans terabytes of data and returns results in milliseconds. While DynamoDB excels at small-scale operations, Firebolt offers the performance and flexibility needed for large-scale analytics. Here’s how Firebolt enables us to operate effectively at this scale:

Compute-storage separation

With Firebolt, we write data once and access it through multiple clusters for different needs—reading, writing, or ad-hoc querying—with total separation between them. One cluster can fail while the others remain unaffected.

Built-in warm-up mechanism

Querying S3 directly is slow due to download times. Firebolt automatically loads the accessed data onto local SSDs during the first query, so subsequent queries run at millisecond speeds - the latency is practically non-existent. Firebolt gives us this out of the box, instead of us having to build all sorts of caching layers.

Sparse indexing

An important concept for analytics, Firebolt’s sparse indexing scans only relevant portions of data, enabling millisecond query execution. Of course, it’s also a managed service. We choose the cluster size, but don't deal with any maintenance beyond that. 

We typically load ORC or Parquet files from S3 to Firebolt, but there are many other supported formats. 
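The ingestion itself follows Firebolt's external-table pattern: declare an external table over the Parquet files in S3, then INSERT the rows into the internal table. The SQL below is only a sketch - the table, bucket, and columns are hypothetical, and the exact DDL can differ between Firebolt versions - and it can be run through any Firebolt SQL client, such as the Python SDK.

```python
# Illustrative SQL, executed via any Firebolt client (names and paths are hypothetical).
INGEST_STATEMENTS = [
    """
    CREATE EXTERNAL TABLE IF NOT EXISTS ex_search_terms (
        domain   TEXT,
        keyword  TEXT,
        clicks   BIGINT
    )
    URL = 's3://example-data-lake/search_terms/2023/06/'
    OBJECT_PATTERN = '*.parquet'
    TYPE = (PARQUET)
    """,
    """
    INSERT INTO search_terms
    SELECT domain, keyword, clicks
    FROM ex_search_terms
    """,
]

def ingest(cursor) -> None:
    """Run the ingestion statements on an open Firebolt (DB-API style) cursor."""
    for statement in INGEST_STATEMENTS:
        cursor.execute(statement)
```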

Firebolt engines

The idea behind Firebolt engines is that as soon as data is ingested into Firebolt's S3 storage, I can spin up engines for whatever purpose I want. For example, in our case one engine costs $5/hour and is active 24/7 to serve queries for the UI. A second, more powerful engine serves queries for the REST API, because there I need slightly faster performance and higher concurrency. And once a day, I launch a huge engine that writes for an hour. Everything runs simultaneously without the engines interfering with one another.

Each Firebolt engine (cluster) is separate in terms of resources and network
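A small sketch of how that separation can be expressed in code; the engine names, sizes, and schedules here are illustrative, not our real configuration. Each workload connects to its own engine, so the nightly ingestion never competes with UI or API queries for resources.

```python
# Hypothetical engine layout: one always-on engine per serving workload,
# plus a large engine that is only started for the daily write job.
ENGINES = {
    "ui":     {"engine": "serving_ui",   "size": "S",  "always_on": True},
    "api":    {"engine": "serving_api",  "size": "M",  "always_on": True},   # higher concurrency
    "ingest": {"engine": "daily_ingest", "size": "XL", "always_on": False},  # ~1 hour per day
}

def engine_for(workload: str) -> str:
    """Return the Firebolt engine a given workload should connect to."""
    return ENGINES[workload]["engine"]

print(engine_for("api"))  # -> "serving_api"
```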

In some cases, you’ll prefer to define your primary index as a combination of multiple columns. This is highly valuable when you’re constantly querying your table with those combined filters.
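For example (hypothetical table and columns; the DDL syntax may vary by Firebolt version), a search-terms table that is almost always filtered by keyword and domain together could declare both columns in its primary index:

```python
# Illustrative DDL only - in Firebolt the primary index is declared with the table.
CREATE_SEARCH_TERMS_SQL = """
CREATE FACT TABLE search_terms (
    keyword  TEXT,
    domain   TEXT,
    date     DATE,
    clicks   BIGINT
) PRIMARY INDEX keyword, domain;
"""
```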

Real-World Example: Querying 300 TBs in Milliseconds

In this example, we query the keyword ‘iPhone’ across 300 TB of data. Thanks to sparse indexing, Firebolt scans only 80 MB of relevant data and returns results in 70 milliseconds. By preloading the necessary data into SSDs, Firebolt eliminates download delays and avoids reliance on external caching layers.

Firebolt queries 300 TBs in 0.07 seconds.
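The query behind that example looks something like the sketch below (the table and columns are hypothetical). Because keyword leads the sparse index, Firebolt can skip everything except the few data ranges that contain 'iPhone'.

```python
# Illustrative SQL - sparse indexing prunes ~300 TB down to ~80 MB of scanned data.
IPHONE_QUERY_SQL = """
SELECT domain,
       SUM(clicks) AS clicks
FROM   search_terms
WHERE  keyword = 'iPhone'
GROUP BY domain
ORDER BY clicks DESC
LIMIT 100;
"""
```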

Advanced Firebolt Features

Data Ranges: Optimizing Query Readiness

Data Ranges is a useful concept when you don’t want to expose all your data upfront to customers. Instead, you allow for a slightly longer query time for the first user, after which the data is cached and ready for true sub-second querying.

Let’s say we have datasets comprising many domains, each containing numerous URLs for companies like Amazon and Walmart. I want to find all the pages containing the term 'shoes' within all of Amazon’s URLs. This is how the index is built:

Firebolt indexes the Amazon data, preloading it into the engine’s SSD. Walmart data remains in S3, ready for access if queried. Now, if I switch my query to search for ‘laptops’ within the Amazon data, it’s already in the engine’s SSD—no download time required. All I experience is the net query time, which is extremely fast. Other query engines, such as Presto, need additional caching layers for similar functionality, but with Firebolt this optimization works out of the box.

Indexing: Optimizing for Large Datasets

As a Firebolt design partner, we’ve seen multiple iterations of the sparse indexing mechanism—an essential feature for massive datasets. For smaller tables (less than a terabyte), performance isn’t usually a concern. However, when dealing with tables in the 200-300 TB range, query design becomes critical.

Here’s a simple example: If I have a table with product data and need to analyze by brand name, I use the brand as the primary index. This dramatically improves query performance for operations filtered by brand.

Using the Primary Index in Firebolt for performance enhancement
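As a sketch (hypothetical table and columns; exact syntax depends on the Firebolt version), declaring the brand as the primary index means a query filtered on brand only touches the relevant ranges of the table:

```python
# Illustrative DDL and query for a brand-indexed products table.
CREATE_PRODUCTS_SQL = """
CREATE FACT TABLE products (
    brand_name  TEXT,
    product_id  TEXT,
    category    TEXT,
    price       DOUBLE PRECISION
) PRIMARY INDEX brand_name;
"""

BRAND_QUERY_SQL = """
SELECT category,
       COUNT(*)   AS products,
       AVG(price) AS avg_price
FROM   products
WHERE  brand_name = 'ExampleBrand'   -- pruned via the sparse primary index
GROUP BY category;
"""
```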

Aggregating Indexes: Reducing ETL and Boosting Query Speed

Firebolt also supports aggregating indexes, which precompute and store query results. This is particularly useful when querying by categories or filters where performance is critical. For example, if I need to analyze products grouped by category, I can create an aggregating index on the category column. Each time new data is ingested, Firebolt updates the index in the background. While this consumes some additional space, it eliminates the need for ETL processes and delivers maximum query performance.

The Aggregating Index can be created to define multiple indexes
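Here is a sketch of such an index on the hypothetical products table from above (syntax may differ by Firebolt version): the index pre-aggregates by category at ingest time, so the grouped query reads precomputed results instead of raw rows.

```python
# Illustrative DDL - the aggregating index is maintained automatically on ingest.
CREATE_AGG_INDEX_SQL = """
CREATE AGGREGATING INDEX products_by_category ON products (
    category,
    COUNT(*),
    AVG(price)
);
"""

CATEGORY_QUERY_SQL = """
SELECT category,
       COUNT(*)   AS products,
       AVG(price) AS avg_price
FROM   products
GROUP BY category;   -- can be served from the aggregating index
"""
```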

Where Tools and Strategy Converge

Scaling systems to handle the complexity of internet-wide datasets isn’t just about tools—it’s about how those tools work together. By combining DynamoDB’s simplicity for small-scale operations with Firebolt’s speed for terabyte-scale analytics, we’ve built an infrastructure that balances cost, speed, and scalability—delivering millisecond performance to users worldwide.
