UltraFast Gaming Inc. is a gaming company that publishes online games across a variety of platforms, including PlayStation, Xbox, PC, iOS, and Nintendo. They collect a wide range of information about these games, including information about the different games, levels, players, and tournaments, and statistics from player sessions and rankings in tournaments. UltraFast has a company mission to use data-driven decision-making, so they need to analyze and understand this data in order to make decisions about future development, tournaments, community initiatives, and building interactive leaderboards that players can consume. This primer introduces the data engineer with some sample use cases to this goal.
To achieve this, we will be using the Firebolt Cloud Data Warehouse to analyze all of the collected data in the UltraFast Gaming dataset. We’ll be starting from scratch, so this is a great way to get started with using Firebolt.
In order to run queries in Firebolt, you need to create an engine. This can be done with the Firebolt UI or with SQL, whichever you prefer. With the UI, it’s a couple of clicks away:
Fill in all the fields as you see below - make sure to click on “advanced settings” to open the dropdown. Once it’s filled in, you can click “Create new engine.”
Or, if you want to use SQL, you can run the following command to create an engine:
CREATE ENGINE "ultra_fast_engine" WITH
TYPE = "S"
NODES = 1
AUTO_STOP = 10
INITIALLY_STOPPED = false
AUTO_START = true
CLUSTERS = 1;
Either way you did it, you’ll then need to select the engine with the engine dropdown:
On freshly-created Firebolt accounts, the UltraFast Gaming dataset should be automatically loaded for you in the ultra_fast database. You can select it with the database dropdown menu in your query editor or by running this query in your workspace:
USE ultra_fast;
If you have an old Firebolt account without the dataset or have removed it, you can create a new database named “ultra_fast” with SQL or by using the UI:
With SQL, you would simply run:
CREATE DATABASE ultra_fast;
After creating the database, make sure you’ve selected it as described earlier. Then you can ingest this dataset by running the following SQL script in the query editor:
COPY INTO games FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/'
WITH PATTERN='*.snappy.parquet' TYPE = PARQUET;
COPY INTO levels FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/levels/'
WITH PATTERN='*.snappy.parquet' TYPE = PARQUET;
COPY INTO players FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/'
WITH PATTERN='*.snappy.parquet' TYPE = PARQUET;
COPY INTO playstats FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/'
WITH PATTERN='*.snappy.parquet' TYPE = PARQUET;
COPY INTO rankings FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/rankings/'
WITH PATTERN='*.snappy.parquet' TYPE = PARQUET;
COPY INTO tournaments FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/tournaments/'
WITH PATTERN='*.snappy.parquet' TYPE = PARQUET;
SHOW TABLES;
Now we can get querying!
It’s a simple question, and you can get a simple answer:
SELECT
Title AS GameTitle,
ARRAY_CONTAINS(SupportedPlatforms, 'PlayStation') AS SupportsPlayStation,
FROM
Games;
Our results should show “Johnny B. Quick” supports it, along with results of rows for games that haven’t been filled in yet, and thus don’t support it.
What if you want to figure out how many total platforms “Johnny B. Quick” supports?
SELECT
Title AS GameTitle,
ARRAY_LENGTH(SupportedPlatforms) AS NumberOfPlatforms
FROM
Games
WHERE
Title = 'Johnny B. Quick';
This tells us that Johnny B. Quick supports 5 total platforms.
Functions used:
Let’s say you want to look at all of our games, what month they launched, how many days it’s been since then, and also do a quick check to see if that’s within the last year. You can do all of this!
SELECT
Title AS GameTitle,
DATE_DIFF('day', LaunchDate, CURRENT_DATE) AS DaysSinceLaunch,
DATE_TRUNC('month', LaunchDate) AS LaunchMonth,
CASE
WHEN LaunchDate >= CURRENT_DATE - INTERVAL '1 year' THEN 'Yes'
ELSE 'No'
END AS LaunchedWithinLastYear
FROM
Games;
You only have the one game with valid data, and you can see it launched in November 2020, which is not within the last year.
Functions used:
SQL is a powerful tool: you can ask a lot of questions, and with Firebolt, you’ll get those results back very quickly. So if you want to ask, say, what the highest average score per player per game is, filtered by specific tournaments, and also see which players consistently achieve top scores. You can use window functions to make this happen:
WITH PlayerScores AS (
SELECT
ps.GameID,
ps.PlayerID,
AVG(ps.CurrentScore) AS AvgScore
FROM
PlayStats ps
WHERE
ps.TournamentID IN (56, 16, 98) -- Replace with your specific TournamentIDs
GROUP BY ALL
),
RankedScores AS (
SELECT
ps.GameID,
ps.PlayerID,
ps.AvgScore,
RANK() OVER (PARTITION BY ps.GameID ORDER BY ps.AvgScore DESC) AS ScoreRank
FROM
PlayerScores ps
)
SELECT
g.Title AS GameTitle,
p.Nickname AS PlayerNickname,
rs.AvgScore,
rs.ScoreRank
FROM
RankedScores rs
JOIN
Games g ON rs.GameID = g.GameID
JOIN
Players p ON rs.PlayerID = p.PlayerID
WHERE
rs.ScoreRank <= 50;
This query should return us the top 50 players for the tournament IDs you looked at: 56, 16, and 98. You can edit this query to look at different tournament IDs, or change the last line to look at different slices of the top of the leaderboard.
You have a lot of players, some of whom pay for a subscription, some of whom don’t. You also support five platforms, and maybe players on different platforms are playing the game differently. How does the subscription status and platform type influence how much they play the game? You can ask that question, and to avoid analyzing all of the data when a subset should tell you what you need to know, let’s filter by play sessions between specific dates and for specific tournament IDs:
WITH PlayerPlayTime AS (
SELECT
ps.PlayerID,
ps.GameID,
p.IsSubscribedToNewsletter,
UNNEST(p.Platforms) AS Platform,
AVG(ps.CurrentPlayTime) AS AvgPlayTime,
RANK() OVER (PARTITION BY ps.GameID ORDER BY AVG(ps.CurrentPlayTime) DESC) AS PlayTimeRank
FROM
PlayStats ps
JOIN
Players p ON ps.PlayerID = p.PlayerID
WHERE
ps.StatTime BETWEEN '2020-12-01' AND '2021-02-01'
AND ps.TournamentID > 100
GROUP BY ALL
),
FilteredPlayers AS (
SELECT
ppt.PlayerID,
ppt.GameID,
ppt.IsSubscribedToNewsletter,
ppt.Platform,
ppt.AvgPlayTime,
ppt.PlayTimeRank,
CASE
WHEN ppt.IsSubscribedToNewsletter = TRUE THEN 'Subscribed'
ELSE 'Not Subscribed'
END AS SubscriptionStatus
FROM
PlayerPlayTime ppt
WHERE
ppt.AvgPlayTime > 0
)
SELECT
fp.GameID,
fp.Platform,
fp.SubscriptionStatus,
AVG(fp.AvgPlayTime) AS AvgPlayTime,
AVG(fp.PlayTimeRank) AS AvgPlayTimeRank
FROM
FilteredPlayers fp
GROUP BY ALL
HAVING
AVG(fp.AvgPlayTime) > 0;
It looks like players on PC are playing the most, regardless of subscription status.
But it’s a little unclear whether subscription status is making an impact. On three out of five platforms, subscribed players are playing more, but on Nintendo and iOS, unsubscribed players play for longer. If you remove fp.Platform from the SELECT statement on line 37, the query will no longer select for or group by platform, and when you run that modified query, you can see that on the whole, subscribed players do have more playtime.
SQL features used:
With this guide, you should hopefully be comfortable with engines, databases, ingesting databases, querying, and accessing data from multiple tables. You can analyze all the tables in the dataset with a couple queries:
SELECT * FROM information_schema.tables;
SHOW COLUMNS IN <table>; -- replace <table> with each table
From there, you can write your own queries and do some more analysis. You can try slightly modifying various queries to see how quickly they speed up, even when different, thanks to subresult reuse.
Or, if you have your own data, you can ingest that from S3 and get rolling.
Tune in to the 'Data Engineering Show' to see how the fastest growing tech companies handle their data challenges
Real talk, no fluff.
Real talk, no fluff.