On-demand
Hands-on
Workshop

Start Firebolt’s hands-on workshop, where you’ll learn how to deliver sub-second analytics over TB-scale datasets.

Requirements:

  • A Firebolt account. If you don’t have one yet, request one here and get $200 in credits.

  • Basic database and SQL knowledge

  • A laptop with web browser access

  • Knowledge of data warehousing and cloud concepts is a plus, but not required.

Sign up for Firebolt

To start the workshop, sign up for Firebolt and get $200 in credits.

  • Register an organization on go.firebolt.io
  • Verify that you can log in
  • Configure billing by connecting your organization to an AWS Marketplace subscription

Click the “Develop” icon ( >_ ) on the left side of the screen to open the development workspace and access the SQL editor.


Intro to Firebolt

What is Firebolt and why should you care?

Lab introduction 

Fast Gaming Inc. is a fictional gaming company. They’ve recently released their first game and are collecting statistics on game plays, but ran into performance problems on the data platform they chose for prototyping. The database collects player statistics across various games and tournament participation. Players, Playstats, Tournaments, Games, Levels and Rankings are the primary data structures used to gather granular data. Playstats is the primary fact table; it is growing rapidly and currently has over 1 billion records. At any given time, Fast Gaming needs to provide fast access to both in-game and BI analytics.

To help Fast Gaming achieve their objectives, we will complete the following four labs:

  • Organize data warehousing resources
  • Data Modeling and Rapid Ingestion
  • Query your data… fast!
  • Working with semi-structured data

Lab 1+2: Organize resources and ingest data

Lab 1: Organize data warehousing resources

Fast Gaming’s development team needs to organize its resources so that Dev, QA and Production can be managed independently, with isolation. Each of these environments will have its own copy of the database, used for development, QA and Production access respectively.

In the Firebolt world the “Fast Gaming” organization can be set up with multiple Accounts - Dev, QA and Production. Accounts are regional objects that organize and manage secure access to databases and compute resources within each environment. In the example below, we will create an account “Workshop” on “us-east-1” to house the gaming database. Within the “Workshop” account you will create a database “workshop_dev” to host the data tables mentioned above.

You can create multiple accounts, with each account housing de-coupled compute and storage resources in the form of “Engines” and “Databases” respectively. Compute resources on Firebolt are organized as “Engines”, and each engine can scale out from 1 to 128 compute nodes. Firebolt’s 2D scaling lets multiple engines, each with its own compute type and number of nodes, access the “workshop_dev” database. In this lab, we will create an engine and associate it with the “workshop_dev” database using the “ATTACH” command.

Please note: Creating databases and engines, and attaching databases to engines, can all be performed using the Firebolt System Engine. The System Engine is a built-in Firebolt service that lets the data warehouse architect perform metadata operations (Data Definition Language statements such as CREATE DATABASE, CREATE USER and CREATE ROLE, among others) without incurring any costs. You incur compute costs only while engines are actively running. The ability to start engines is configured by linking your Firebolt account to your AWS account in the “Configure” workspace under the “Billing” option. This is a prerequisite for the ingestion and query portions of this lab.

1.1 Create a Database

On your default account, create a database to house the data. This can be done in the SQL editor using SQL, as shown below.
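A minimal statement for this step, using the database name from the lab (run it in the SQL editor):

```sql
-- Create the lab database; a metadata operation that provisions no compute
CREATE DATABASE IF NOT EXISTS workshop_dev;
```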

Or in the WebUI, by clicking the “New database” button, as shown below.

[Screenshot: the “New database” button in the WebUI]

1.2 Create an Engine

Firebolt Engines represent the compute resources in the de-coupled compute and storage architecture. You can have multiple engines connected to the same “workshop_dev” database, and each engine can be tuned to specific workloads, configured with its own engine spec and engine scale (number of nodes). This 2D scaling provides flexible options to scale while keeping costs under control. As workload needs grow beyond available capacity, Firebolt’s granular 2D scaling lets you fine-tune your cluster size and avoid steep cost increases, unlike other cloud data warehouse offerings. Additionally, each engine provides workload isolation, so your customer-facing analytics application is not negatively impacted by internal BI access.

To create compute resources for the “workshop_dev” database, the SQL is executed in two steps in the development workspace. The first step is to create the engine (“my_workshop_engine_general_purpose” below); the second is to attach that engine to “workshop_dev”. If you have multiple engines, you can attach each of them to “workshop_dev”. Finally, you can start the engine.

Please note: Creating and attaching an engine to a database are metadata operations and no compute resources are provisioned with the completion of these steps. Compute resources are provisioned when the engine is started with the “START ENGINE” command or in the WebUI.
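A sketch of the three steps described above. The engine name, node type (M4), node count (2) and 120-minute auto-stop come from the lab text; exact option names can vary between Firebolt versions, so check the CREATE ENGINE documentation for your release:

```sql
-- 1. Create the engine: two 'M4' nodes, auto-stop after 120 minutes of inactivity
CREATE ENGINE IF NOT EXISTS my_workshop_engine_general_purpose
WITH SPEC = 'M4' SCALE = 2 AUTO_STOP = 120;

-- 2. Attach the engine to the database (still a metadata-only operation)
ATTACH ENGINE my_workshop_engine_general_purpose TO workshop_dev;

-- 3. Start the engine; compute is provisioned (and billed) from this point on
START ENGINE my_workshop_engine_general_purpose;
```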

In the above engine specification, we have created an engine with two Firebolt ‘M4’ nodes; the engine is set up to stop automatically after 120 minutes of inactivity. Optionally, the engine creation can be performed in the WebUI as shown below, when the “New database” option is selected in the “Databases” screen.

[Screenshot: creating a database and engine in the WebUI]

When configuring the engine, you can select a period of inactivity after which it shuts down automatically. We’ve set this to 120 minutes so the engine won’t shut down accidentally during the training.

Once the engine has been created and started, select the workshop_dev database in the drop down in the upper left corner of the “Develop” workspace.

[Screenshot: database selection drop-down]

Then select your new general purpose engine in the engine dropdown:

[Screenshot: engine selection drop-down]

For additional information on how engines and databases work together, see the documentation here.

1.3 Examine Firebolt information schema

The Firebolt information schema provides an overview of metadata. It consists of a set of system views that describe the objects defined in the current database, plus database usage views such as query history and running queries. It follows the standard SQL information schema, with additional Firebolt-specific information and views covering accounts, applicable roles, columns, databases, engines, indexes, logins, network policies, object privileges, query history, running queries, service accounts, tables, users and views. First, look at the existing databases and engines in your account.
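For example, the following queries list the databases and engines visible to your account, using the information schema views named above:

```sql
-- Databases defined in the account
SELECT * FROM information_schema.databases;

-- Engines and their current status
SELECT * FROM information_schema.engines;
```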

Lab 2: Data modeling and rapid ingestion

In this exercise, we’ll implement Fast Gaming’s game data solution on Firebolt. As you go through the next set of labs, pay close attention to how data is ingested and how primary indexes are defined and used. We will ingest our sample data from Amazon S3 so we can move on to the next chapters. As data is ingested, it is indexed on the fly. The remaining labs will use this data for performance optimization.

[Diagram: external tables in S3 feeding internal Firebolt tables]

Firebolt has two main classes of table objects: internal tables and external tables.

External tables are metadata pointers to files stored in S3; the files can be in many formats, including Parquet, CSV and JSON. Once created, external tables function much like any other table: you can select from them, join them to other tables, apply functions to them, and so on. They are, however, unmanaged, unoptimized and slow. To track data in external tables, Firebolt exposes the file name and file timestamp through metadata fields called source_file_name and source_file_timestamp. When querying external tables, the location of each row of data (the external file name and its associated timestamp) is made available, which can be used for incremental ingestion.
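As an illustration (assuming the ex_levels external table defined later in this lab), the metadata fields can be selected like any other column:

```sql
-- Inspect which files rows come from; the basis for incremental loads
SELECT source_file_name, source_file_timestamp, LevelID
FROM ex_levels
LIMIT 5;
```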

Internal tables are highly optimized, efficient and managed. Firebolt’s managed data simplifies the data management lifecycle and eliminates the day-to-day undifferentiated heavy lifting associated with data lakes. Internal tables come in two flavors: fact tables and dimension tables. This doesn’t limit you to dimensional modeling, but the two types are treated differently. Dimension tables are effectively mirrored to all nodes in an engine, which allows very fast joins; fact tables are automatically sharded across the nodes of an engine, which allows efficient parallel aggregation.

As data is ingested from external tables into internal tables, it is transformed into Firebolt’s columnar format, F3 (“Triple F”). Data is sorted, indexed and compressed based on the defined primary index, which is far more efficient than partitioning alone. Columnar compression, combined with range-level access through primary indexes, enables fast, granular and efficient access.

2.1 Create external tables

First, we need to create a connection to data stored in S3. This is done through creating external tables. Further information on external tables can be found in the documentation here.

Solution:
DROP TABLE IF EXISTS ex_games;


CREATE EXTERNAL TABLE IF NOT EXISTS ex_games 
(src TEXT) 
URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/' 
OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/games.json'
TYPE = (JSON PARSE_AS_TEXT = TRUE);


DROP TABLE IF EXISTS ex_levels;


CREATE EXTERNAL TABLE IF NOT EXISTS ex_levels (
  LevelID INTEGER,
  GameID INTEGER,
  Level INTEGER,
  Name TEXT,
  LevelType TEXT,
  NextLevel INTEGER NULL,
  MinPointsToPass INTEGER,
  MaxPoints INTEGER,
  NumberOfLaps INTEGER,
  MaxPlayers INTEGER,
  MinPlayers INTEGER,
  PointsPerLap REAL,
  MusicTrack TEXT,
  SceneDetails TEXT,
  MaxPlayTimeSeconds INTEGER,
  LevelIcon TEXT
) 
URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/' 
OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/levels.csv'
TYPE = (CSV SKIP_HEADER_ROWS = 1);


DROP TABLE IF EXISTS ex_players;


CREATE EXTERNAL TABLE IF NOT EXISTS ex_players 
(src TEXT) 
URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/' 
OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/players.json'
TYPE = (JSON PARSE_AS_TEXT = TRUE);


DROP TABLE IF EXISTS ex_tournaments;


CREATE EXTERNAL TABLE IF NOT EXISTS ex_tournaments (
  TournamentID INTEGER,
  Name TEXT,
  GameID INTEGER,
  TotalPrizeDollars INTEGER,
  StartDateTime TIMESTAMP,
  EndDateTime TIMESTAMP,
  RulesDefinition TEXT
) 
URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/' 
OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/tournaments.csv'
TYPE = (CSV SKIP_HEADER_ROWS = 1);


DROP TABLE IF EXISTS ex_playstats;


CREATE EXTERNAL TABLE IF NOT EXISTS ex_playstats (
  "GameID" INTEGER,
  "PlayerID" INTEGER,
  "Timestamp" TIMESTAMP,
  "SelectedCar" TEXT,
  "CurrentLevel" INTEGER,
  "CurrentSpeed" REAL,
  "CurrentPlayTime" BIGINT,
  "CurrentScore" BIGINT,
  "Event" TEXT,
  "ErrorCode" TEXT
) 
URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/' 
OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/playstats/*.parquet'
TYPE = (PARQUET);


DROP TABLE IF EXISTS ex_rankings;


CREATE EXTERNAL TABLE IF NOT EXISTS ex_rankings (
  "GameID" INTEGER,
  "PlayerID" INTEGER,
  "MaxLevel" INTEGER,
  "TotalScore" BIGINT,
  "PlaceWon" INTEGER
) 
URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/' 
OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/rankings/*.parquet'
TYPE = (PARQUET);
	

Now that the external tables are set up, let’s create the managed internal tables as shown below. Note the primary indexes defined for each table.

2.2 Create internal tables

Solution:
DROP TABLE IF EXISTS games CASCADE;


CREATE DIMENSION TABLE IF NOT EXISTS games (
  GameID INTEGER,
  Title TEXT,
  Abbreviation TEXT,
  Series TEXT,
  Version NUMERIC(10, 2),
  GameDescription TEXT,
  Category TEXT,
  LaunchDate DATE,
  Author TEXT,
  SupportedPlatforms ARRAY (TEXT),
  GameConfiguration TEXT,
  SOURCE_FILE_NAME TEXT,
  SOURCE_FILE_TIMESTAMP TIMESTAMPNTZ
) PRIMARY INDEX GameID,
Title;


DROP TABLE IF EXISTS levels CASCADE;


CREATE DIMENSION TABLE IF NOT EXISTS levels (
  LevelID INTEGER,
  GameID INTEGER,
  Level INTEGER,
  Name TEXT,
  LevelType TEXT,
  NextLevel INTEGER NULL,
  MinPointsToPass INTEGER,
  MaxPoints INTEGER,
  NumberOfLaps INTEGER,
  MaxPlayers INTEGER,
  MinPlayers INTEGER,
  PointsPerLap REAL,
  MusicTrack TEXT,
  SceneDetails TEXT,
  MaxPlayTimeSeconds INTEGER,
  LevelIcon BYTEA,
  SOURCE_FILE_NAME TEXT,
  SOURCE_FILE_TIMESTAMP TIMESTAMPNTZ
) PRIMARY INDEX LevelID;


DROP TABLE IF EXISTS players CASCADE;


CREATE DIMENSION TABLE IF NOT EXISTS players (
  PlayerID INTEGER,
  Nickname TEXT,
  Email TEXT,
  AgeCategory TEXT,
  Platforms ARRAY (TEXT),
  RegisteredOn PGDATE,
  IsSubscribedToNewsletter BOOLEAN,
  InternalProbabilityToWin DOUBLE PRECISION,
  SOURCE_FILE_NAME TEXT,
  SOURCE_FILE_TIMESTAMP TIMESTAMPNTZ
) PRIMARY INDEX PlayerID,
Nickname,
AgeCategory,
RegisteredOn;


DROP TABLE IF EXISTS tournaments CASCADE;


CREATE DIMENSION TABLE IF NOT EXISTS tournaments (
  TournamentID INTEGER,
  Name TEXT,
  GameID INTEGER,
  TotalPrizeDollars INTEGER,
  StartDateTime TIMESTAMPNTZ,
  EndDateTime TIMESTAMPNTZ,
  RulesDefinition TEXT,
  SOURCE_FILE_NAME TEXT,
  SOURCE_FILE_TIMESTAMP TIMESTAMPNTZ
) PRIMARY INDEX TournamentID;


DROP TABLE IF EXISTS rankings CASCADE;


CREATE FACT TABLE IF NOT EXISTS rankings (
  GameID INTEGER,
  PlayerID INTEGER,
  MaxLevel INTEGER,
  TotalScore BIGINT,
  PlaceWon INTEGER,
  TournamentID INTEGER,
  SOURCE_FILE_NAME TEXT,
  SOURCE_FILE_TIMESTAMP TIMESTAMPNTZ
) PRIMARY INDEX GameID,
TournamentID,
PlayerID;


DROP TABLE IF EXISTS playstats CASCADE;


CREATE FACT TABLE IF NOT EXISTS playstats (
  GameID INTEGER,
  PlayerID INTEGER,
  StatTime TIMESTAMPNTZ,
  SelectedCar TEXT,
  CurrentLevel INTEGER,
  CurrentSpeed REAL,
  CurrentPlayTime BIGINT,
  CurrentScore BIGINT,
  Event TEXT,
  ErrorCode TEXT,
  TournamentID INTEGER,
  SOURCE_FILE_NAME TEXT,
  SOURCE_FILE_TIMESTAMP TIMESTAMPNTZ
) PRIMARY INDEX GameID,
TournamentID,
PlayerID,
StatTime;


DROP AGGREGATING INDEX IF EXISTS playstats_agg_idx;


CREATE AGGREGATING INDEX playstats_agg_idx ON playstats (
  TournamentID,
  SelectedCar,
  COUNT(DISTINCT PlayerID)
);


DROP AGGREGATING INDEX IF EXISTS playstats_agg_idx_errors;


CREATE AGGREGATING INDEX playstats_agg_idx_errors ON playstats (
TournamentID, 
ErrorCode, 
COUNT(ErrorCode)
);

DROP TABLE IF EXISTS players_200million;

CREATE DIMENSION TABLE IF NOT EXISTS players_200million (
  playerid integer,
  nickname text,
  email text,
  agecategory text,
  platforms array (text),
  registeredon pgdate,
  issubscribedtonewsletter boolean,
  internalprobabilitytowin double precision,
  source_file_name text,
  source_file_timestamp timestampntz
) PRIMARY INDEX PlayerID, Nickname, AgeCategory, RegisteredOn;

	

2.3 Data ingestion from data lake

The final step (that will take approximately 10 minutes) is to ingest from the external tables into the managed internal tables. Please note how source_file_timestamp is used to perform incremental ingestion into Firebolt internal tables.

Solution:
INSERT INTO games
SELECT JSON_EXTRACT(src, '/GameID', 'INT') AS GameID
	,JSON_EXTRACT(src, '/Title', 'TEXT') AS Title
	,JSON_EXTRACT(src, '/Abbreviation', 'TEXT') AS Abbreviation
	,JSON_EXTRACT(src, '/Series', 'TEXT') AS Series
	,JSON_EXTRACT(src, '/Version', 'FLOAT')::DECIMAL(10, 2) AS Version
	,JSON_EXTRACT(src, '/Description', 'TEXT') AS "Description"
	,JSON_EXTRACT(src, '/Category', 'TEXT') AS Category
	,JSON_EXTRACT(src, '/LaunchDate', 'TEXT')::DATE AS LaunchDate
	,JSON_EXTRACT(src, '/Author', 'TEXT') AS Author
	,NEST(JSON_EXTRACT(platforms_unnest, 'Name', 'TEXT')) AS SupportedPlatforms
	,JSON_EXTRACT_RAW(src, '/GameConfiguration') AS GameConfiguration
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
FROM ex_games UNNEST(JSON_EXTRACT_ARRAY_RAW(src, '/SupportedPlatforms') AS platforms_unnest)
WHERE SOURCE_FILE_TIMESTAMP > (
		SELECT COALESCE(MAX(SOURCE_FILE_TIMESTAMP), '1980-01-01'::TIMESTAMP)
		FROM games
		)
GROUP BY src
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP;

SELECT COUNT(*)
FROM games;--1

INSERT INTO levels
SELECT LevelID
	,GameID
	,LEVEL
	,Name
	,LevelType
	,NextLevel
	,MinPointsToPass
	,MaxPoints
	,NumberOfLaps
	,MaxPlayers
	,MinPlayers
	,PointsPerLap
	,MusicTrack
	,replace(SceneDetails, '''', '"')
	,MaxPlayTimeSeconds
	,DECODE(REPLACE(LevelIcon, '"', ''), 'BASE64')
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
FROM ex_levels
WHERE SOURCE_FILE_TIMESTAMP > (
		SELECT COALESCE(MAX(SOURCE_FILE_TIMESTAMP), '1980-01-01'::TIMESTAMP)
		FROM levels
		);

SELECT COUNT(*)
FROM levels;--10

INSERT INTO players
SELECT JSON_EXTRACT(src, '/playerID', 'INT') AS PlayerID
	,JSON_EXTRACT(src, '/details/nickname', 'TEXT') AS Nickname
	,JSON_EXTRACT(src, '/details/email', 'TEXT') AS Email
	,JSON_EXTRACT(src, '/details/ageCategory', 'TEXT') AS AgeCategory
	,JSON_EXTRACT(src, '/details/platforms', 'ARRAY(TEXT)') AS Platforms
	,JSON_EXTRACT(src, '/details/registeredOn', 'TEXT')::PGDATE AS RegisteredOn
	,
	--temporary until we have TEXT to BOOL casting
	CASE 
		WHEN LOWER(JSON_EXTRACT_RAW(src, '/details/isSubscribedToNewsletter')::TEXT) = 'true'
			THEN 1::BOOLEAN
		ELSE 0::BOOLEAN
		END AS IsSubscribedToNewsletter
	,JSON_EXTRACT_RAW(src, '/details/internalProbabilityToWin')::DOUBLE AS InternalProbabilityToWin
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
FROM ex_players
WHERE SOURCE_FILE_TIMESTAMP > (
		SELECT COALESCE(MAX(SOURCE_FILE_TIMESTAMP), '1980-01-01'::TIMESTAMP)
		FROM players
		);

SELECT COUNT(*)
FROM players;--5,420

INSERT INTO tournaments
SELECT *
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
FROM ex_tournaments
WHERE SOURCE_FILE_TIMESTAMP > (
		SELECT COALESCE(MAX(SOURCE_FILE_TIMESTAMP), '1980-01-01'::TIMESTAMP)
		FROM tournaments
		);

SELECT COUNT(*)
FROM tournaments;--157

INSERT INTO rankings
SELECT *
	,REPLACE(SPLIT('/', source_file_name) [4], 'TournamentID=', '')
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
FROM ex_rankings
WHERE SOURCE_FILE_TIMESTAMP > (
		SELECT COALESCE(MAX(SOURCE_FILE_TIMESTAMP), '1980-01-01'::TIMESTAMP)
		FROM rankings
		);

SELECT COUNT(*)
FROM rankings;--49,298

INSERT INTO playstats
SELECT *
	,REPLACE(SPLIT('/', source_file_name) [4], 'TournamentID=', '')
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
FROM ex_playstats
WHERE SOURCE_FILE_TIMESTAMP > (
		SELECT COALESCE(MAX(SOURCE_FILE_TIMESTAMP), '1980-01-01'::TIMESTAMP)
		FROM playstats
		);

SELECT COUNT(*)
FROM playstats;--1,036,270,144

INSERT INTO players_200million (
	playerid
	,nickname
	,email
	,agecategory
	,platforms
	,registeredon
	,issubscribedtonewsletter
	,internalprobabilitytowin
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
	)
SELECT playerid + (n - 1) * 5420
	,nickname
	,email
	,agecategory
	,platforms
	,registeredon
	,issubscribedtonewsletter
	,internalprobabilitytowin
	,SOURCE_FILE_NAME
	,SOURCE_FILE_TIMESTAMP
FROM players
CROSS JOIN (
	SELECT n
	FROM GENERATE_SERIES(1, 35000, 1) s(n)
	);
	

Each INSERT above is followed by a SELECT COUNT(*) so you can verify that all the data was loaded and indexed as expected.

The statements above already follow the incremental-loading pattern (filtering on source_file_timestamp), even though the first run performs a full load; take a look at the docs about loading data incrementally for details.

Lab 3: Query your data… fast!

Firebolt provides multiple options to boost query performance and use infrastructure efficiently. Three capabilities are key to achieving sub-second latency and high concurrency:

  1. Range-level, highly granular data pruning through primary indexes
  2. Pre-emptive acceleration of join operations through in-memory join accelerators
  3. Automatically maintained, efficient pre-aggregations with query redirect through aggregating indexes

Primary indexes

Primary indexes (PIs) should be defined on every internal table, on the fields commonly used in WHERE clauses and joins. A primary index determines the sort order of a table, resulting in improved data compression, efficient data scans and better performance. This can often make workloads on Firebolt as much as 10x more efficient.

Some considerations for primary indexes:

  • Defined at table creation time
  • One Primary Index per table
  • Include columns used in WHERE clauses
  • Consider including columns used in GROUP BY clauses
  • Order columns in the index definition by cardinality
  • Include as many columns as you need
  • Consider how you alter values in WHERE clauses
  • Consider including often used foreign key columns in the index

Join accelerators

Firebolt provides built-in functionality to automatically accelerate joins between tables. Join operations are often the most resource-intensive part of a query, slowing execution and consuming engine resources. Firebolt reads the join structure cached in engine RAM rather than rebuilding the join at query runtime. Join accelerators are created and held in RAM automatically for eligible queries, reducing disk I/O and compute at query runtime: queries run faster, and you use engine resources more efficiently to reduce cost.

Aggregating indexes

Aggregating indexes are a unique feature of Firebolt. They often remove the need for costly secondary pipelines that can introduce data quality issues and rob teams of staff hours, while dramatically accelerating queries. Aggregating indexes can be added at any time, every table can have multiple aggregating indexes, and complex functions can also be part of an aggregating index definition.

3.1 Sample queries

Firebolt supports standard SQL syntax in the form of sub-queries and common table expressions. These are useful in tuning performance by decomposing a larger query into segments that are readable and can be tuned individually. Feel free to explore the following queries in the SQL Editor.

3.2 Accelerate performance and efficiency with PIs

First, we want to compare the performance of queries running against the data lake with queries against Firebolt’s internal, managed data. The external data in our data lake isn’t indexed, while our internal table has a highly effective primary (sparse) index. If we run the following query, we’ll notice it consumes significant resources and runs for about two and a half minutes:
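The workshop’s original query is not reproduced here; a hypothetical query of this shape against the external playstats table (column names from the DDL in Lab 2; the GameID filter value is illustrative) would be:

```sql
-- Full scan of raw Parquet files in S3: no indexes, no compression benefits
SELECT "PlayerID", SUM("CurrentScore") AS total_score
FROM ex_playstats
WHERE "GameID" = 1
GROUP BY "PlayerID"
ORDER BY total_score DESC
LIMIT 10;
```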

But we’ve also created and loaded an indexed internal table in Firebolt. Most of the queries we expect have similar predicates, so the primary index was created to support them. If you run the following, you’ll notice it completes in under half a second. Primary indexing and compression are a game changer: roughly a 300x improvement.
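As an illustration (the lab’s exact query isn’t shown here), an equivalent aggregation against the managed playstats table can prune data via the primary index on GameID:

```sql
-- Same shape of query against the internal table:
-- primary-index pruning plus F3 columnar compression
SELECT PlayerID, SUM(CurrentScore) AS total_score
FROM playstats
WHERE GameID = 1
GROUP BY PlayerID
ORDER BY total_score DESC
LIMIT 10;
```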

3.3 Fast joins through accelerators

Join accelerators in Firebolt are implemented as part of the query optimizer and require no configuration. In this lab, we will demonstrate their impact with the following query, which shows total scores by age category between specified dates.
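The lab’s exact query isn’t reproduced here; a sketch matching that description joins the playstats fact table to the players dimension (the date range is hypothetical):

```sql
-- Join accelerator candidate: fact-to-dimension join with an aggregation
SELECT p.AgeCategory, SUM(ps.CurrentScore) AS total_score
FROM playstats ps
JOIN players p ON p.PlayerID = ps.PlayerID
WHERE ps.StatTime BETWEEN '2022-10-01' AND '2022-10-31'
GROUP BY p.AgeCategory;
```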

Run this query and note down the execution time.

The first time this query is run, the query optimizer builds an in-memory data structure of the players table.

Re-run the query and you will notice that the execution time is lower than the first run. This performance boost is a direct result of join accelerators.

3.4 Boost aggregation performance

To look at game performance, we’re curious why players are crashing in this racing game. Run the following query, which bucketizes ‘Crash’ events by speed. If you look closely at the playstats table, you’ll notice its primary index will not benefit this query.
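This is the query from the EXPLAIN example later in this lab, without the EXPLAIN prefix:

```sql
-- Bucketize crash events into 10-unit-wide speed buckets
SELECT COUNT(*) AS cnt
	,CAST(currentspeed / 10 AS INTEGER) AS bucket
FROM playstats
WHERE errorcode = 'NoError'
	AND event = 'Crash'
GROUP BY bucket
ORDER BY bucket ASC;
```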

The execution time and the amount of data scanned are shown in the “Results” section of the “Develop” workspace. The primary index that supports most of our queries helps, but not much here: run time should be about 16 seconds. Still far better than querying the external data.

[Screenshot: query results showing execution time and data scanned]

To understand how this query was executed, you can also run it prefixed with EXPLAIN, as shown below.
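That is, the same bucketization query with the EXPLAIN keyword in front (the same statement appears again in the solution further below, where it is used to verify index usage):

```sql
EXPLAIN SELECT COUNT(*) AS cnt
	,CAST(currentspeed / 10 AS INTEGER) AS bucket
FROM playstats
WHERE errorcode = 'NoError'
	AND event = 'Crash'
GROUP BY bucket
ORDER BY bucket ASC;
```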

As seen in the explain plan output, the query runs against the ‘playstats’ table without any index-based optimizations.

[Screenshot: explain plan output]

Now, let’s create an aggregating index that improves the query above, taking the errorcode, event, truncated speed and the COUNT(*) function into account. See Using aggregating indexes in the documentation.

Solution:
DROP AGGREGATING INDEX IF EXISTS playstats_by_bucket;
	
CREATE aggregating INDEX playstats_by_bucket ON playstats (
	errorcode
	,event
	,CAST(currentspeed / 10 AS INTEGER)
	,COUNT(*)
	);
	

Use EXPLAIN to verify that the aggregating index is used for the query.

Solution:
EXPLAIN SELECT COUNT(*) AS cnt
	,CAST(currentspeed / 10 AS INTEGER) AS bucket
FROM playstats
WHERE errorcode = 'NoError'
	AND event = 'Crash'
GROUP BY bucket
ORDER BY bucket ASC;
	

Run the query without EXPLAIN to see the impact of the aggregating index on execution time and total data scanned.

[Screenshot: results after adding the aggregating index]

You’ll notice that this query no longer scans the base table. Instead, the query optimizer knows an aggregating index is available that can answer the query from pre-aggregated data, saving both time and cost.

With the aggregating index in place, you should notice the query is over 1000x more efficient.

Lab 4: Working with semi-structured data 

4.1 Arrays

Arrays are powerful data structures designed for efficient data access. For additional information, see the array function documentation. Now, we want to do some user-segmentation discovery on our game users. Normally, user-segmentation queries are extremely costly, with numerous self-joins, many CTEs and long WHERE clauses.

In this case, we want to know the count of users by age group who’ve played a game on Nintendo or Xbox but never on a PC. With simple array functions, this becomes easy and efficient. And because the base query is an aggregate, an aggregating index could support it if it isn’t fast enough on its own.

Solution:
SELECT agecategory
	,COUNT(*)
FROM players
WHERE NOT CONTAINS (
		platforms
		,'PC'
		)
	AND CONTAINS (
		platforms
		,'Xbox'
		)
	AND CONTAINS (
		platforms
		,'Nintendo'
		)
GROUP BY ALL;
	

4.2 Lambda expressions

Lambda expressions are similar to array functions, but allow for more variable manipulation and iteration. In this case, we’re simply looking for the count by age category where any platform in the array is PC or Xbox.

Solution:
SELECT agecategory
	,COUNT(*)
FROM players
WHERE ANY_MATCH(x -> x = 'PC' OR x = 'Xbox', platforms)
GROUP BY ALL;
	

4.3 JSON filtering and transformation

After discussion with the game developers, it was discovered that the attributes available in a given level’s scene details may change at any time. This would normally create a schema-evolution challenge: as a new attribute is added to a game, we’d have to add a new column to the table. Instead, Fast Gaming has agreed to record scene details as a JSON document, allowing a mix of schema-on-read and schema-on-write. You can manipulate JSON data using Firebolt’s built-in JSON functions.

An example would be the following:

We want to know the average speed in game play for levels with dry scene details.

Hint: A combination of JSON extraction, and array functions should get you there (not necessarily in that order).

Solution:

SELECT playstats.selectedcar
	,AVG(playstats.currentspeed) AS averagespeed
FROM playstats
INNER JOIN levels ON levels.gameid = playstats.gameid
	AND levels.LEVEL = playstats.currentlevel
WHERE CONTAINS (
		json_extract(scenedetails, '/details', 'ARRAY(STRING)')
		,'dry'
		)
GROUP BY ALL
ORDER BY 2 DESC;
	

Core takeaways

Contact Firebolt

Thanks for doing Firebolt’s Hands-on Workshop! Have any questions? Reach out below.

Contact us