
High Volume Ingestion: Scalable and Cost-Effective Data Loading



Introduction

"Bulk" or "batch" ingestion is a common challenge in data warehouses. With increasing volume and velocity of data, the need for scalable batch ingestion is crucial to dealing with shrinking batch processing windows. As customers bring on new data sources, the need to backfill large volumes of data quickly is a challenge as well. Through these benchmarks we look at the ability of Firebolt's elastic infrastructure to scale ingestion.

The benchmarking data and queries in this blog are available to readers on GitHub to further explore or to reproduce the benchmark numbers locally. This blog is part of a series that looks at the end-to-end query capabilities of Firebolt. For a deep dive into the efficiency and concurrency performance of our query engine, you can read this blog. We also provide a benchmark analysis for incremental ingestion scenarios, sometimes referred to as “trickle ingestion,” which cover inserting, updating, or deleting data in existing tables. You can read more on incremental ingestion and data manipulation benchmarking here.

Summary of results

Parquet (Snappy compressed) emerged as the top performer in our benchmarks, offering the fastest ingestion times, the best price-to-performance ratio ($2.46/TB), and the greatest throughput (15.3 TB/hour), all with no tuning, simply by adjusting the number of nodes in our engine configuration (scaling out).

We also benchmarked CSV (gzip) and CSV (uncompressed) formatted files, both of which showed near-linear scaling. They averaged $4.54/TB across all configurations (a difference of only $0.56 between min and max). The following charts summarize the data from our benchmarks:

Graphic 1 - Ingestion Throughput
Graphic 2 - Ingestion Time
Graphic 3 - Ingestion Price Performance

Benchmark setup

Data

Our data is based on the AMPLab Big Data Benchmark, established at the University of California, Berkeley. AMPLab was originally designed for evaluating large-scale data processing frameworks, and we have extended the benchmark to better capture real-world workloads we see on Firebolt. For the bulk ingestion benchmark, we adapted the UserVisits table from one of our datasets, slightly modifying it to reach a round 1.0000 TB size and simplify the ingestion analysis. The table was created with the following data definition language (DDL):

CREATE FACT TABLE uservisits (
  sourceip TEXT NOT NULL,
  destinationurl TEXT NOT NULL,
  visitdate DATE NOT NULL,
  adrevenue REAL NOT NULL,
  useragent TEXT NOT NULL,
  countrycode TEXT NOT NULL,
  languagecode TEXT NOT NULL,
  searchword TEXT NOT NULL,
  duration INTEGER NOT NULL
) PRIMARY INDEX visitdate, destinationurl, sourceip;
Framework | Table      | # of Columns | # of Rows     | Size (uncompressed)
AMPLab    | UserVisits | 9            | 6,616,741,117 | 1 TB

Hardware configuration

Firebolt engines can scale up, by increasing node size, or scale out, by adding nodes. For data ingestion scenarios we recommend scaling out, as this takes advantage of the parallelism offered by distributing input files across nodes for processing. Firebolt engines scale out linearly, maintaining a consistent cost-to-performance ratio while providing flexibility to optimize workloads. Our bulk ingestion benchmark uses the following three engine configurations:

  1. Engine with 4 small nodes (4S)
  2. Engine with 8 small nodes (8S)
  3. Engine with 16 small nodes (16S)

We use various engine configurations with small node sizes to keep the comparison across engine configurations simple, showcase the cost effectiveness of bulk data ingestion, and demonstrate linear scaling. You can learn more about Firebolt engines, and other aspects of our architecture, in our recently published whitepaper.
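
For readers following along, the sketch below shows roughly how an engine like our 8S configuration could be created and later scaled out. This is illustrative only: the engine name is a placeholder, and the property syntax should be verified against the Firebolt DDL reference.

-- Illustrative sketch: engine name is a placeholder; verify property syntax
-- against the Firebolt documentation before running.
CREATE ENGINE ingest_engine WITH TYPE = S NODES = 8;   -- the "8S" configuration

-- Scale out to 16 small nodes (the "16S" configuration) for larger backfills
ALTER ENGINE ingest_engine SET NODES = 16;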

Dataset Generation

All three formats were created using the same source table, as described above (our AMPLab UserVisits table, trimmed to 1.0000 TB). We used COPY TO to export the dataset with different formats and compression, and then our benchmark measures COPY FROM queries that ingest files into an empty table with an identical schema. The exact queries we used in this process can be found in our GitHub repository.
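
As a rough illustration, the export and ingest steps look like the following. The bucket paths are placeholders and the option syntax is approximate; refer to the GitHub repository for the exact queries we ran.

-- Illustrative sketch: bucket paths are placeholders and option syntax is
-- approximate; the exact queries are in the GitHub repository.

-- Export the 1.0000 TB source table as Snappy-compressed Parquet files
COPY (SELECT * FROM uservisits)
TO 's3://<your-bucket>/uservisits/parquet/'
TYPE = PARQUET COMPRESSION = SNAPPY;

-- Load the exported files into an empty table with an identical schema
COPY INTO uservisits_parquet
FROM 's3://<your-bucket>/uservisits/parquet/'
WITH PATTERN = '*.parquet' TYPE = PARQUET;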

Benchmark scenarios

We explored three different scenarios for our bulk ingestion benchmarks. In all cases, the base table is identical, enabling easier comparison across file formats and compression algorithms. The following combinations of format and compression keep the benchmark simple while covering common scenarios internal to Firebolt, as well as common public baseline scenarios.

Scenario 1: Parquet with Snappy compression (Parquet-Snappy)

Parquet is the most popular format among Firebolt customers, and for Parquet's internal column-level compression, Snappy is the most popular option. Parquet provides a well-defined schema, columnar storage, rich metadata, and other qualities that make it a good fit for the analytical workloads we commonly see on Firebolt. Snappy compression provides a balance between compression speed and efficiency, which is particularly useful for scenarios with high volumes of data.

Scenario 2: CSV with gzip compression (CSV-GZIP)

CSV with gzip is a popular combination of file format and compression, both within Firebolt and more broadly. Its compression efficiency and processing overhead don't match Parquet with Snappy, but CSV remains a straightforward and simple format for shipping data. The CSV format is semi-structured, lacking a defined schema, metadata, and other conveniences of structured formats. However, it's human-readable, lightweight, and widely compatible, making it a popular choice.

Scenario 3: CSV, uncompressed (CSV)

Uncompressed CSV is not common, but it's an interesting test case with no optimizations on storage size or data layout. This scenario helps us establish a baseline for resource utilization and performance when comparing against the other combinations of format and compression. In other words, we can consider this a "worst case" scenario.

Full results and additional insights

Our basic benchmark metric is the time elapsed to ingest the files, measured in seconds. We use this metric to calculate throughput and cost for each configuration. Throughput measurements are normalized to terabytes per hour, which shows how much data you can expect Firebolt to ingest per hour at different configurations. The cost metric is normalized to dollars per terabyte, showing how much it costs per terabyte to ingest data with different engine configurations. For ease of comparison, all values are calculated using the size of the uncompressed, deserialized table data, which is 1.0000 TB. The engine configurations, described in detail above, are summarized as follows: an engine with four small nodes is “4S,” eight small nodes is “8S,” and 16 small nodes is “16S.”
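
To make the normalization concrete, here is the arithmetic for the 8-node Parquet run in the table that follows (397 seconds to load the 1.0000 TB table), written as a simple SQL expression you can run anywhere:

-- Throughput = uncompressed table size / elapsed time, normalized to TB/hour.
-- Cost per TB follows the same pattern: node count * runtime in hours * hourly node price.
SELECT 1.0000 / (397.0 / 3600) AS throughput_tb_per_hr;  -- ≈ 9.1 TB/hr, matching the table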

Scenario           | # of Files² | Total Size, Compressed (TB) | Load Time (seconds) 4S / 8S / 16S | Throughput (TB/hr) 4S / 8S / 16S | Price-Performance ($/TB)¹ 4S / 8S / 16S
Parquet-Snappy     | 478         | 0.43                        | 802 / 397 / 235                   | 4.5 / 9.1 / 15.3                 | $2.49 / $2.46 / $2.93
CSV-GZIP           | 475         | 0.42                        | 1448 / 762 / 392                  | 2.5 / 4.7 / 9.2                  | $4.48 / $4.77 / $4.87
CSV (uncompressed) | 479         | 1.1³                        | 1359 / 696 / 363                  | 2.6 / 5.2 / 9.9                  | $4.31 / $4.31 / $4.53

Parquet performs the best across our key metrics: ingestion took the least time and offered the best price-to-performance ratio, averaging $2.63/TB across all three engine configurations. The eight-node configuration was the most cost-efficient for Parquet, taking only 6 minutes and 37 seconds to ingest the files, which comes to $2.46 per terabyte at 9.1 TB/hr. Parquet, as a columnar format, integrates well with our query execution pipeline, which is optimized for vectorized processing. Reading and writing column chunks directly makes ingestion more efficient than parsing CSV files.

Our benchmarks demonstrate near-linear scaling across most configurations. Parquet's ingestion throughput more than doubles going from the four-node to the eight-node configuration, and increases by roughly another two-thirds when doubling again to 16 nodes. Our other file formats also performed well, with CSV-gzip scaling nearly linearly across configurations: a scaling factor of 0.88 from 4S to 8S and 0.96 from 8S to 16S. For uncompressed CSV, going from four to eight nodes and from eight to 16 nodes, ingestion throughput scales by factors of 1.0 and 0.9, respectively.

Note: scaling performance is directly linked to the engine's ability to distribute input files across nodes for processing. Having many files of similar size to spread across nodes improves scalability, leading to near-linear performance gains as nodes are added. Conversely, datasets with a small number of files of dissimilar sizes may not distribute evenly across nodes, hindering workload balancing and scale-out performance. Our files were produced with COPY TO, which generates files of similar sizes by default; users can adjust the output file size manually if desired, as sketched below.
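
A minimal sketch of that adjustment follows. The MAX_FILE_SIZE option name and the 256 MB value are assumptions on our part; check the COPY TO reference for the exact parameter and its units.

-- Assumed option name: verify MAX_FILE_SIZE against the COPY TO documentation.
-- Caps each exported file at roughly 256 MB so files distribute evenly across nodes.
COPY (SELECT * FROM uservisits)
TO 's3://<your-bucket>/uservisits/parquet/'
TYPE = PARQUET COMPRESSION = SNAPPY MAX_FILE_SIZE = 268435456;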

Conclusion

The benchmarks in this blog demonstrate that high volumes of data can be loaded into Firebolt quickly and cheaply. Near-linear scaling lets users dial in their ingestion workloads to meet price-to-performance goals without the cost per terabyte creeping up. Bulk ingestion is not only efficient but also straightforward: the benchmark demonstrates ingestion speeds above 15 TB/hr simply by adding more nodes.

As mentioned at the outset, this blog is part of a series. To read more about how Firebolt achieves sub-second analytics over terabytes of data, see our blog, Unleashing Efficiency: Benchmarking Data Applications for Low Latency and High Concurrency. Firebolt also excels at incremental data ingestion, typically defined as singleton or small-batch INSERT statements. You can read about our benchmarks for incremental ingest scenarios, including an analysis of UPDATE and DELETE performance, on our blog, Low Latency Incremental Ingestion: Benchmarking Fast and Efficient DML Operations.

DIY Benchmark

These benchmarks are reproducible. All of the files, scripts, and documentation necessary to recreate them are available in this GitHub repository. Whether you want to discuss these benchmarks, learn more about them, or share your experiences experimenting with them, we look forward to hearing from you.

1 Cost is normalized to the uncompressed size of the table used in all of the scenarios: 1.0000 TB.
2 We used COPY TO, Firebolt’s export function, to create different compressed versions of these datasets. COPY TO processes rows in non-deterministic order, which can result in small variations in the number of files produced.
3 Serializing the raw table into CSV format produces files that are slightly larger than 1.0000 TB due to factors such as quoting fields, delimiters, and the extra bytes needed to encode numbers into strings.