Simplifying Data Ingestion
The Challenges of Data Ingestion
As data volumes and sources grow, so do the challenges of keeping data ingestion simple and performant. Managing table schemas, maintaining performance-optimizing structures like indexes, and scaling infrastructure to handle fluctuating workloads are all common hurdles that data engineers face. Firebolt provides a comprehensive solution to these challenges, offering advanced features such as automatic schema discovery, automated index maintenance, and scalable infrastructure. With Firebolt, you can write simple SQL that points at your S3 bucket and start ingesting data. Firebolt engines auto start and stop to save cost, and tuning performance is as simple as updating your engine size - no complicated options to dial in. This blog explores a few of the ways that Firebolt simplifies data ingestion, making it more efficient and scalable. For most of the following examples, we’ll use Firebolt’s Ultra Fast Gaming dataset.
Automatic Schema Discovery: Streamlining Data Onboarding
One particularly time-consuming task in data ingestion is defining the schema for your target table, which often means specifying column names and data types line by line. Firebolt addresses this common problem with automatic schema discovery. When enabled, the AUTO_CREATE option detects the schema from the source data file and automatically creates the corresponding target table if it does not exist. For source files that aren’t metadata-rich (Parquet is metadata-rich, while CSV is not), Firebolt infers each column’s data type by scanning a sample of rows in the source file. By reducing the need for manual schema definition, this feature simplifies the onboarding of new datasets and is ideal for data discovery and rapid prototyping.
In this example, we use COPY FROM to ingest a small CSV file from our Ultra Fast Gaming dataset into a table called ‘levels’. Data ingestion workloads require a running engine; for more information on creating an engine, refer to the documentation.
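If you don’t already have a running engine, a minimal sketch of creating one might look like the following (the AUTO_STOP parameter is illustrative; check the engine documentation for the exact options available in your account):
--Create a small, single-node engine; AUTO_STOP (idle minutes) is shown for illustration
CREATE ENGINE IF NOT EXISTS my_engine
WITH TYPE = S NODES = 1 AUTO_STOP = 20;
With an engine running, the COPY FROM statement looks like this: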
COPY levels
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/levels.csv'
WITH TYPE = CSV HEADER = TRUE AUTO_CREATE = TRUE;
In this example, the `AUTO_CREATE = TRUE` option tells Firebolt to automatically detect the source file’s schema and insert the data into a table called “levels.” Because HEADER = TRUE is set, Firebolt treats the first row of our CSV as the header containing the column names. AUTO_CREATE is set to TRUE by default, so the query above still works if the option is left off; and if the target table already exists, the option is simply ignored.
After ingestion is complete, query the table’s information schema to see some useful information about the table, such as the data definition language (DDL) used to create it, the number of rows, and its compressed and uncompressed size:
SELECT DDL, number_of_rows, compressed_bytes, uncompressed_bytes
FROM information_schema.tables
WHERE table_name = 'levels'
The above query outputs the following data, which can be a helpful check to understand how the schema discovery feature created and populated the “levels” table:
Automatic Maintenance for Newly Ingested Data: Aggregating Indexes
Aggregating indexes (AIs) in Firebolt are a powerful feature designed to optimize query performance. AIs are fully synchronous: as data is inserted into tables, aggregating indexes update automatically and transactionally, so that all queries are correct and fast. Additionally, AIs aren’t affected by concurrent DELETE or UPDATE operations and don’t require full recalculation, providing predictable, consistent performance enhancement. In short, aggregating indexes deliver fresh data fast, without blocking other operations or getting out of sync.
In the following query, we’ll ingest a different set of data from the Ultra Fast Gaming dataset, taking advantage of COPY FROM’s PATTERN option to grab all the Parquet files present at that Amazon S3 location (N.B.: the playstats files are publicly accessible and you can run this query, but it scans about 10 GB of data, so it will take a few minutes).
--Import playstats table
COPY playstats
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/'
WITH PATTERN = '*.parquet' TYPE = PARQUET AUTO_CREATE = TRUE;
We know that a frequent query for this data looks at the total play time and accumulated score by player, so we’ll define an aggregating index that automatically computes this data for faster queries. While we recommend creating an aggregating index before ingesting data, it’s not strictly necessary:
--create aggregating index
CREATE AGGREGATING INDEX aggidx_total_playtime_score_per_player
ON playstats
(
"PlayerID",
SUM("CurrentPlayTime"),
SUM("CurrentScore")
);
As we insert new rows in our playstats table, the aggregating index will update transactionally, requiring no intervention or management. This not only removes management overhead, but ensures predictable query performance after updates. If we run a query using these aggregations and then take a look at the query plan, we’ll see our results and verify that Firebolt referenced our new index:
SELECT "PlayerID",
SUM("CurrentPlayTime") as playtime_sum,
SUM("CurrentScore") as currentscore_sum FROM Playstats WHERE "PlayerID" = 1
GROUP BY "PlayerID
This query’s output:
Next we’ll run an EXPLAIN query to see the query plan:
EXPLAIN SELECT "PlayerID",
SUM("CurrentPlayTime") as PlayTime_Sum,
SUM("CurrentScore") as CurrentScore_SUM FROM Playstats WHERE "PlayerID" = 1
GROUP BY "PlayerID"
The query plan (abbreviated to save space):
[0] [Projection] 1, sum2merge(aggidx_total_playtime_score_per_player(CurrentPlayTime))...
The following query adds 10 rows that specifically increment the CurrentPlayTime and CurrentScore columns (which are cumulative in nature).
INSERT INTO playstats ("GameID", "PlayerID", "Timestamp", "SelectedCar", "CurrentLevel", "CurrentSpeed", "CurrentPlayTime", "CurrentScore", "Event", "ErrorCode")
VALUES
(1, 1, '2022-11-02 03:50:15', 'Mustang', 11, 285, 14575.504467, 763456, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:51:15', 'Mustang', 12, 286, 14585.504467, 763556, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:52:15', 'Mustang', 13, 287, 14595.504467, 763656, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:53:15', 'Mustang', 14, 288, 14605.504467, 763756, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:54:15', 'Mustang', 15, 289, 14615.504467, 763856, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:55:15', 'Mustang', 16, 290, 14625.504467, 763956, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:56:15', 'Mustang', 17, 291, 14635.504467, 764056, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:57:15', 'Mustang', 18, 292, 14645.504467, 764156, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:58:15', 'Mustang', 19, 293, 14655.504467, 764256, 'Move', 'NoError'),
(1, 1, '2022-11-02 03:59:15', 'Mustang', 20, 294, 14665.504467, 764356, 'Move', 'NoError');
When we rerun the same SELECT query, we can see that the values have incremented:
And when we take a look at the plan, we see it invokes the same aggregating index we initially defined:
[0] [Projection] 1, sum2merge(aggidx_playerstats_sum_score_byplayer(CurrentPlayTime))...
With no maintenance or optimization required, Firebolt updates the aggregating indexes as you ingest data, and automatically uses the index to optimize queries.
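To confirm which indexes are defined on a table, you can also query the indexes view in the information schema (shown here with SELECT *, since the exact column set may vary; see the documentation for details):
--List the indexes defined on the playstats table
SELECT * FROM information_schema.indexes WHERE table_name = 'playstats';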
Scaling Infrastructure to Match Ingestion Workload
Firebolt's architecture decouples compute, storage, and metadata, allowing each of these key components to scale independently (dig into the details of Firebolt’s architecture in our white paper). This decoupled design makes it simple to scale for any specific workload - ingest, ELT, INSERT, SELECT, and others - and achieve the desired price-to-performance ratio. With respect to data ingestion, we recommend scaling out your engine (adding nodes) if you want to improve ingestion time, as doing so takes advantage of Firebolt’s parallelism, particularly when ingesting many files. You can learn more about engines here.
The following example syntax demonstrates how scaling out a Firebolt engine improves ingest performance on a small set of data. The ingestion query is identical to the previous example; the first statement sets the smallest engine configuration possible: a small engine with one node. To run this syntax, replace “my_engine” with the name of your engine.
--Changes engine size to type Small with one node
ALTER ENGINE my_engine SET TYPE = S NODES = 1;
-- Drop the playstats table if it exists. To reimport data, we'll need to drop the aggregating index as well, which is handled by the CASCADE parameter.
DROP TABLE IF EXISTS playstats CASCADE;
-- Ingest data
COPY playstats
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/'
WITH PATTERN = '*.parquet' TYPE = PARQUET AUTO_CREATE = TRUE;
The ingestion took about four minutes to succeed (the same query may perform slightly differently based on network conditions and Amazon S3 bucket bandwidth limits):
The information schema for Firebolt engines provides a variety of helpful data. Engine metrics history shows that “cpu_used,” a measure of utilization, hits “1” frequently, i.e. 100% utilized.
SELECT * FROM information_schema.engine_metrics_history
This means the engine’s CPU was fully utilized for most of the query and was a bottleneck for the ingestion job. This is beneficial from a cost management perspective, as engine resources are fully utilized. However, if the goal is ingestion speed rather than cost management, high utilization suggests an opportunity to add resources and improve performance. The next query scales out the processing power of our engine by adding nodes using simple SQL. Our playstats dataset comprises many files of roughly equal size, making it well suited for distributing the ingestion workload across multiple nodes. After scaling out the engine to four nodes, this example drops the table and re-ingests the data to see the difference:
--adds nodes to our engine
ALTER ENGINE my_engine SET TYPE = S NODES = 4;
--drops table
DROP TABLE playstats;
-- Ingest data
COPY playstats
FROM 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/playstats/'
WITH PATTERN = '*.parquet' TYPE = PARQUET AUTO_CREATE = TRUE;
After adding nodes, there is approximately a 4x improvement (again, exact results will vary slightly on each run):
Examine the information schema to check on resource utilization:
SELECT * FROM information_schema.engine_metrics_history
The engine CPU was only briefly fully utilized. Dialing in the price-to-performance ratio is as simple as writing SQL statements and checking the information schema to see the resources used.
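To zero in on CPU utilization over time, you can select just the relevant columns from the metrics view (cpu_used appears above; event_time is an assumed column name, so verify it against the view’s schema in the documentation):
--CPU utilization over time (event_time is an assumed column name)
SELECT event_time, cpu_used
FROM information_schema.engine_metrics_history
ORDER BY event_time DESC;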
In the above example we saw near-linear scaling: nearly a 4x improvement with 4x the infrastructure (thus maintaining a linear cost curve as well). Adding nodes to an engine is a simple way to speed up data ingestion, and depending on price-to-performance goals, engines can be further tuned with a single line of SQL. Learn more about sizing engines for all different types of workloads in our documentation.
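As an example of that kind of tuning, the same ALTER ENGINE syntax shown above can also scale an engine up (a larger node type) rather than out - a sketch worth trying when a workload is bound by per-node resources rather than parallelism:
--Scale up: keep four nodes but move to a larger node type
ALTER ENGINE my_engine SET TYPE = M NODES = 4;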
Monitoring and Optimizing Performance
An essential aspect of managing data ingestion workflows is monitoring performance to identify and resolve bottlenecks. Firebolt provides a host of engine observability metrics that give you visibility into engine utilization. Use the information_schema.engine_metrics_history view to understand how much CPU, RAM, and disk your ingestion workloads use, or query information_schema.engine_query_history to take a closer look at specific queries: how long they ran, how much data they read, and other helpful metrics.
For long-running queries, you can query information_schema.engine_running_queries to get more detail about the amount of data these queries are reading, the query text itself, and other information. These views offer different perspectives on your ingestion queries, helping you identify bottlenecks or dial in the amount of resources to spend given your specific ingestion performance targets. Learn more about information schema views in our documentation.
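As a starting point, the following sketches show how two of these views might be queried; the specific column names (query_text, status, duration_us, scanned_bytes) are assumptions, so check the documentation for the exact schema:
--Recent queries, longest-running first (column names assumed; see the documentation)
SELECT query_text, status, duration_us, scanned_bytes
FROM information_schema.engine_query_history
ORDER BY duration_us DESC
LIMIT 10;
--Queries currently in flight
SELECT * FROM information_schema.engine_running_queries;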
Simple & Fast
Firebolt's advanced features, such as automatic schema inference, synchronous aggregating indexes, and SQL-based operations, provide a powerful solution for modern data ingestion challenges. These capabilities not only streamline the ingestion process but also enhance data quality and performance, enabling businesses to derive insights quickly and efficiently.
For more detailed insights and to explore how Firebolt can transform your data ingestion processes, visit our website and access our comprehensive resources and documentation.