Abstract
Similarweb, hosted on AWS, provides detailed analytics on how end customer audiences interact with websites. This requires ingestion and processing of large volumes of clickstream data in an AWS data lake on a daily basis. The challenges of performing segmentation analysis on big data combined with the need for sub-second end user response times for customer dashboards led Similarweb to evaluate analytics platforms. In this session, Similarweb and Firebolt, an AWS Technology Partner, will share how they delivered sub-second, high concurrency analytics cost effectively.
About Similarweb
Imagine you could analyze how the Internet is used by all of us. Sort of website analytics for the entire internet. This is Similarweb.
Similarweb is a big data powerhouse, collecting enormous amounts of web-related data to help marketers, brands, salespeople, and many others analyze how their audiences interact with websites. For example, you can easily track which keyword searches drive traffic to your website, which sub-pages visitors land on, how often they also land on a competitor's website, where they're from, which mobile OS they're using, whether they clicked on organic or paid links, and much more.
If you go to their website, the first thing you see is a search box allowing you to directly research any website:
For Similarweb, being able to deliver a great user experience for end-users, where they can analyze, slice & dice, and find insights in data through a broad spectrum of analyses, is what it’s all about.
For example, here you can see an overview of the behavior of godaddy.com:
You can go deeper into the analysis with more views:
Something very cool is the ability to compare multiple sites head to head, in this case, godaddy.com vs wix.com:
Delivering these capabilities is not easy and requires a purpose-built data stack that consistently ingests, stores, processes and analyzes large amounts of data.
Similarweb data stack
Similarweb runs its analytics on an AWS-based data lake. This data lake unifies raw data from a variety of sources, including public, partner, and direct data. Centralizing this data into a data lake gives Similarweb the ability to process, clean, and remove private information efficiently in preparation for further analytics. Spark and Airflow form the core of this data pipeline, ingesting 5 TB of data per day. Machine learning is then used to create predictive models from the cleansed data, helping draw conclusions about internet-wide behavior from partial data points. With this foundational architecture, Similarweb addresses a variety of use cases for its customers.
The challenge of the ‘Segment Analysis’ use case
Imagine that you're a marketer at FootLocker, and you want to understand how FootLocker.com performs compared to Amazon.com. Obviously it's not an apples-to-apples comparison, because shoes are just one of many things Amazon sells. Similarweb wanted to allow its users to analyze 'segments' within bigger websites, so that you could compare FootLocker.com's traffic with only the shoe-related searches on Amazon.com. Similarweb understood that this would be one of the most complex features they had ever tackled from an analytics perspective.
Some of the challenges Similarweb ran into when implementing this use case are not hard to guess: data volumes are huge, which makes everything tougher, and ETLs are costly and take a long time to develop and maintain.
But an even bigger challenge was the dynamic input from users that the analysis needs to take into account. Pre-processing every combination users might want to compare would mean tackling an exponential number of combinations, which is infeasible.
Each day, Amazon.com alone generates 150 GB of data in Similarweb's pipeline, and users want to analyze up to two years' worth of data. Looking up URLs that match dynamic patterns therefore requires scanning a lot of data, which becomes painfully slow. To make matters worse, each entry groups all of a session's URLs into a single array, so pattern matching has to inspect every array element as well.
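To make the shape of the problem concrete, here is a minimal, purely illustrative Python sketch. The record layout, field names, and URL patterns are assumptions for illustration, not Similarweb's actual schema; the point is that matching a user-supplied pattern means scanning every URL inside every session's array.

```python
import re

# Hypothetical shape of one clickstream entry: each session row carries
# an array of the URLs visited in that session.
sessions = [
    {"session_id": 1, "urls": ["https://amazon.com/s?k=shoes",
                               "https://amazon.com/dp/B01EXAMPLE"]},
    {"session_id": 2, "urls": ["https://amazon.com/s?k=toaster"]},
]

def match_segment(sessions, pattern):
    """Return the ids of sessions where any URL matches the dynamic pattern.

    Every element of every array must be checked, which is what makes
    this scan expensive at terabyte scale.
    """
    rx = re.compile(pattern)
    return [s["session_id"] for s in sessions
            if any(rx.search(u) for u in s["urls"])]

print(match_segment(sessions, r"k=shoes"))  # → [1]
```

At two rows this is trivial; at two years of 150 GB/day it is exactly the "painfully slow" scan described above.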
Solutions Considered
Similarweb considered Presto, which they already used for internal analytics, but quickly ruled it out because it could not deliver the sub-second latency needed for a fast end-user experience. They also ruled out a key-value NoSQL database: even though it is a fast document store, it does not work well with SQL and does not support dynamic grouping.
Another approach Similarweb tested was auto-scaled serverless compute. The idea was to trigger one serverless function for every day in the range a query covers. This required the data to be stored as JSON, so an ETL process was implemented to convert the existing ORC format into JSON. That essentially means a full copy of the 1 PB of data is required to support all the possible segments users might request. Second, performance wasn't great: even though serverless compute is parallelized, a few functions typically run slower than the rest, making the overall wait time too long. And lastly, the approach doesn't support SQL for the further grouping and aggregation needed.
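The fan-out idea can be sketched as follows. This is a local simulation with a thread pool standing in for serverless invocations; the function names and the per-day result are assumptions for illustration, not the actual implementation Similarweb tested.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def scan_partition(day):
    # Stand-in for one serverless invocation scanning one day's JSON
    # partition; in reality each invocation would read from object
    # storage and filter URLs against the requested segment.
    return (day, 42)  # pretend 42 matching rows were found for this day

def query_range(start, end):
    """Fan out one task per day in the range, then wait for all of them."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(scan_partition, days))
    # Overall latency is gated by the slowest invocation (the straggler
    # problem noted above), and any further grouping or aggregation of
    # the partial results must be re-implemented by hand instead of SQL.
    return sum(count for _, count in results)

print(query_range(date(2021, 1, 1), date(2021, 1, 31)))  # → 1302
```

The sketch makes the two drawbacks visible: the caller always waits for the slowest of N per-day tasks, and the final aggregation step lives in application code rather than in SQL.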
Selection of Firebolt
Firebolt is a modern data warehouse that combines the cost and efficiency benefits of a cloud-native architecture with sub-second performance at terabyte scale to help data engineers and dev teams deliver data applications that end-users love.
Built on AWS, Firebolt is designed with the cloud-native principles of decoupled storage and compute, eliminating traditional challenges associated with provisioning, scale, and resource utilization.
Similarweb narrowed down the competition to BigQuery and Firebolt and conducted a benchmark. These were the results:
Firebolt showed the best performance and didn't require any additional pre-processing: raw data was loaded and immediately ready for dynamic querying at sub-second performance.
Additionally, Similarweb saw value in Firebolt’s approach to decoupling of storage and compute, which allows for easy workload isolation. This meant that they could isolate the workloads for the new feature set, and deliver queries that are consistently fast and predictable. It also allowed for easy continued development on different compute clusters (called “engines” in Firebolt) over the same data, without affecting the production experience.
The last consideration was cost, and Similarweb found Firebolt to have the best price-performance and lower TCO compared to the alternative.
Firebolt was therefore selected, and within a few weeks it was fully programmatically orchestrated in production with Airflow and Firebolt's REST APIs. Here's an example of the Similarweb app showing traffic and engagement over time for PlayStation 5 ("PS5") on amazon.com, using Firebolt as the backend. Behind the scenes, multiple TBs of data were scanned and dynamically filtered for "PS5" while delivering a ~1-second load time in the UI.
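As a rough sketch of what such orchestration looks like, the snippet below builds a parameterized segment query and posts it over HTTP. The endpoint URL, token, table, and column names are all placeholders invented for illustration; they are not Firebolt's actual API shape or Similarweb's schema.

```python
import json
from urllib import request

# Placeholder endpoint and token — NOT Firebolt's real API surface.
FIREBOLT_URL = "https://api.example-firebolt-endpoint.com/query"
API_TOKEN = "<token>"

def build_segment_query(domain, keyword, start, end):
    """Assemble SQL in the spirit of the "PS5 on amazon.com" example.

    Table and column names (clickstream, visit_date, visits, url) are
    hypothetical; a production version would also bind parameters
    rather than interpolate strings.
    """
    return (
        "SELECT visit_date, SUM(visits) AS visits "
        "FROM clickstream "
        f"WHERE domain = '{domain}' "
        f"AND url LIKE '%{keyword}%' "
        f"AND visit_date BETWEEN '{start}' AND '{end}' "
        "GROUP BY visit_date ORDER BY visit_date"
    )

def run_query(sql):
    """POST the query to the (placeholder) REST endpoint."""
    req = request.Request(
        FIREBOLT_URL,
        data=json.dumps({"query": sql}).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

sql = build_segment_query("amazon.com", "ps5", "2021-01-01", "2021-12-31")
```

In an Airflow setting, a call like `run_query(sql)` would typically sit inside a task so that query submission and downstream processing are scheduled and retried automatically.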