Support for semi-structured data enables cloud data warehouses to address data that
does not fit the normal tabular model. Addressing machine generated data with evolving, nested data structures is simplified through this capability. In this section, we will take a look at JSON formatted data. Following are common ways to deal with JSON data.
Each approach has its advantages. Storing JSON structure as raw TEXT and processing provides flexibility for evolving data. Flattening JSON data provides the ability to address specific performance needs. This tutorial provides a walk-through on how to leverage Firebolt’s JSON functions.
Firebolt provides various functions to help with processing semi-structured data in the form of JSON. These are shown below.
This JSON based dataset covers 175 years of NYC Philharmonic performances, covering seasons, concerts, composers, soloists and more. This data structure is an ideal example for JSON processing as it has nested structures. Sample JSON structure is shown below.
1{
2 "programs":[
3 {
4 "season":"1842-43",
5 "orchestra":"New York Philharmonic",
6 "concerts":[
7 {
8 "Date":"1843-02-18T05:00:00Z",
9 "eventType":"Subscription Season",
10 "Venue":"Apollo Rooms",
11 "Location":"Manhattan, NY",
12 "Time":"8:00PM"
13 }
14 ],
15 "programID":"5178",
16 "works":[
17 {
18 "workTitle":"I PURITANI",
19 "composerName":"Bellini, Vincenzo",
20 "conductorName":"Hill, Ureli Corelli",
21 "ID":"8838*2",
22 "soloists":[
23 {
24 "soloistName":"Otto, Antoinette",
25 "soloistRoles":"S",
26 "soloistInstrument":"Soprano"
27 }
28 ],
29 "movement":"Elvira (aria): \"Qui la voce...Vien, diletto\""
30 }
31 ]
32 }
33 ]
34 }
35
For your convenience this JSON data is already available at the following S3 location:
s3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nycphilharmonic
We will explore this data directly from S3 using the external table definition, to familiarize us with JSON related functions. We will create an external table definition that will treat the JSON data as raw text as shown below.
CREATE EXTERNAL TABLE ex_nyc_phil (
raw_data TEXT
)
URL ='s3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nycphilharmonic'
PATTERN = '*.json'
TYPE = (JSON PARSE_AS_TEXT = TRUE);
Once the external table definition is created, now we can leverage the following examples to answer popular questions that might arise with this data.
SELECT LENGTH(JSON_POINTER_EXTRACT_ARRAY(raw_data,'/programs')) AS programs_arrays FROM ex_nyc_phil;
This example shows the extraction of raw arrays to calculate the number of programs using JSON_POINTER_EXTRACT_ARRAY.
WITH programs AS (
SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays
FROM ex_nyc_phil),
works AS (
SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
FROM programs, UNNEST(programs_arrays) AS r(program))
SELECT
JSON_VALUE( JSON_POINTER_EXTRACT(work, '/composerName'))
as composer_name, count(*)
FROM works, UNNEST(works_array) AS f(work)
WHERE JSON_VALUE(JSON_POINTER_EXTRACT(work, '/composerName')) IS NOT NULL
GROUP BY ALL
ORDER BY count(*) DESC;
Here we extract the “programs” array first, followed by the extraction of the “works” array using the “UNNEST” command. Composer names are stored as a part of the “works” arrays and can be retrieved using the “JSON_POINTER_EXTRACT” command as shown.
Concerts are stored as arrays within programs. Leveraging the same approach above, extracting “Time” from the “concert” array, followed by a count(*) and group by provides the most popular concert start times
WITH programs AS (
SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays
FROM ex_nyc_phil
),
concerts AS (
SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_arrays
FROM programs, UNNEST(programs_arrays) AS r(program)
)
SELECT JSON_POINTER_EXTRACT(concert, '/Time') as concert_time, count(*)
FROM concerts ,UNNEST(concerts_arrays) AS f(concert)
GROUP BY ALL
ORDER BY count(*) DESC;
While these examples provide direct access to semi-structured data in S3, there are reasons for flattening and ingesting this data in a structured format. For this example, we will use a “Create Table as Select” (CTAS) statement to flatten the data into a Firebolt table.
CREATE TABLE nyc_phil AS
WITH programs AS (
SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays
FROM ex_nyc_phil
), concerts_works AS (
SELECT
JSON_POINTER_EXTRACT(program, '/season') AS season,
JSON_POINTER_EXTRACT(program, '/orchestra') AS orchestra,
JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_array,
JSON_POINTER_EXTRACT(program, '/programID') as program_id,
JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
FROM programs,
UNNEST(programs_arrays) AS r(program)
), concerts_works_soloists AS (
SELECT
season,
orchestra,
JSON_VALUE(JSON_POINTER_EXTRACT(concert, '/Date'))::timestamptz as concert_date,
JSON_POINTER_EXTRACT(concert, '/eventType') as concert_event_type,
JSON_POINTER_EXTRACT(concert, '/Venue') as concert_venue,
JSON_POINTER_EXTRACT(concert, '/Location') as concert_location,
JSON_POINTER_EXTRACT(concert, '/Time') as concert_time,
program_id,
JSON_POINTER_EXTRACT(work, '/workTitle') as work_title,
JSON_POINTER_EXTRACT(work, '/ID') as work_id,
JSON_POINTER_EXTRACT(work, '/conductorName') as conduct_name,
JSON_POINTER_EXTRACT(work, '/composerName') as composer_name,
CASE WHEN JSON_POINTER_EXTRACT_ARRAY(work, '/soloists') = [] THEN ['No soloists'] ELSE JSON_POINTER_EXTRACT_ARRAY(work, '/soloists') END as soloists_array -- replacing empty entries where there are no soloists
FROM concerts_works,
UNNEST (concerts_array) as f(concert),
UNNEST (works_array) as p(work)
)
SELECT
JSON_VALUE(season) as season,
JSON_VALUE(orchestra)as orchestra,
concert_date,
JSON_VALUE(concert_event_type) as concert_event_type,
JSON_VALUE(concert_venue) as concert_venue,
JSON_VALUE(concert_location)as concert_location,
JSON_VALUE(concert_time) as concert_time,
JSON_VALUE(program_id) as program_id,
JSON_VALUE(work_title) as work_title,
JSON_VALUE(work_id) as work_id,
JSON_VALUE( conduct_name) as conduct_name,
JSON_VALUE(composer_name) as composer_name,
JSON_POINTER_EXTRACT(soloist, '/soloistName') as soloist_name,
JSON_POINTER_EXTRACT(soloist, '/soloistRoles') as soloist_roles,
JSON_POINTER_EXTRACT(soloist, '/soloistInstrument') as soloist_instrument
FROM concerts_works_soloists, UNNEST (soloists_array) AS t(soloist);
Once this data is ingested, it can be directly queried referencing the columns shown.