Processing semi-structured data with NYC Philharmonic Data

Support for semi-structured data enables cloud data warehouses to address data that
does not fit the normal tabular model. Addressing machine generated data with evolving, nested data structures is simplified through this capability. In this section, we will take a look at JSON formatted data. Following are common ways to deal with JSON data.

  • Store entire JSON structure as raw TEXT
  • Partial Flattening of JSON data into columns
  • Extract and flatten all JSON data into columns

Each approach has its advantages. Storing JSON structure as raw TEXT and processing provides flexibility for evolving data. Flattening JSON data provides the ability to address specific performance needs. This tutorial provides a walk-through on how to leverage Firebolt’s JSON functions.

Key functions for working with JSON

Firebolt provides various functions to help with processing semi-structured data in the form of JSON. These are shown below.

  • JSON_POINTER_EXTRACT(<json>, <json_path_expression)
    • Extracts value for key with specific JSON pointer expression and expected data type
  • JSON_POINTER_EXTRACT_ARRAY(<json>, <json_path_expression>)
    • Extracts array of strings from  at the path provided
  • JSON_VALUE(<json>)
    • Converts to a scalar value
  • UNNEST
    • Converts array object to set of rows
  • Additionally, you can process arrays using Lambda functions

NYC Philharmonic Dataset

This JSON based dataset covers 175 years of NYC Philharmonic performances, covering seasons, concerts, composers, soloists and more. This data structure is an ideal example for JSON processing as it has nested structures. Sample JSON structure is shown below.

1{
2        "programs":[
3          {
4              "season":"1842-43",
5              "orchestra":"New York Philharmonic",
6              "concerts":[
7                {
8                    "Date":"1843-02-18T05:00:00Z",
9                    "eventType":"Subscription Season",
10                    "Venue":"Apollo Rooms",
11                    "Location":"Manhattan, NY",
12                    "Time":"8:00PM"
13                }
14              ],
15              "programID":"5178",
16              "works":[
17                {
18                    "workTitle":"I PURITANI",
19                    "composerName":"Bellini, Vincenzo",
20                    "conductorName":"Hill, Ureli Corelli",
21                    "ID":"8838*2",
22                    "soloists":[
23                      {
24                          "soloistName":"Otto, Antoinette",
25                          "soloistRoles":"S",
26                          "soloistInstrument":"Soprano"
27                      }
28                    ],
29                    "movement":"Elvira (aria): \"Qui la voce...Vien, diletto\""
30                }
31              ]
32          }
33        ]
34      }
35

Exploring NYC Philharmonic data in Amazon S3

For your convenience this JSON data is already available at the following S3 location:

s3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nycphilharmonic

We will explore this data directly from S3 using the external table definition, to familiarize us with JSON related functions. We will create an external table definition that will treat the JSON data as raw text as shown below.

CREATE EXTERNAL TABLE ex_nyc_phil (
	raw_data TEXT
)

URL ='s3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nycphilharmonic'
PATTERN = '*.json'
TYPE = (JSON PARSE_AS_TEXT = TRUE);

Once the external table definition is created, now we can leverage the following examples to answer popular questions that might arise with this data.

Example 1: How many programs has the NYC Philharmonic played ?

SELECT LENGTH(JSON_POINTER_EXTRACT_ARRAY(raw_data,'/programs')) AS programs_arrays FROM ex_nyc_phil;

This example shows the extraction of raw arrays to calculate the number of programs using JSON_POINTER_EXTRACT_ARRAY.

Example 2: Who are the most popular composers?

WITH programs AS (
        SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
        FROM ex_nyc_phil), 
      works AS (
        SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
        FROM programs, UNNEST(programs_arrays) AS r(program))
      SELECT 
        JSON_VALUE( JSON_POINTER_EXTRACT(work, '/composerName'))
 as composer_name, count(*)
      FROM works, UNNEST(works_array) AS f(work)
      WHERE JSON_VALUE(JSON_POINTER_EXTRACT(work, '/composerName')) IS NOT NULL
      GROUP BY ALL
      ORDER BY count(*) DESC;

Here we extract the “programs” array first, followed by the extraction of the “works” array using the “UNNEST” command. Composer names are stored as a part of the “works” arrays and can be retrieved using the “JSON_POINTER_EXTRACT” command as shown.

Example 3: What time do concerts usually start?

Concerts are stored as arrays within programs.  Leveraging the same approach above, extracting “Time”  from the “concert” array, followed by a count(*) and group by provides the most popular concert start times

WITH programs AS (
          SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
          FROM ex_nyc_phil
      ), 
      concerts AS (
          SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_arrays
          FROM programs, UNNEST(programs_arrays) AS r(program)
      )
      SELECT JSON_POINTER_EXTRACT(concert, '/Time') as concert_time, count(*)
      FROM concerts ,UNNEST(concerts_arrays) AS f(concert)
      GROUP BY ALL
      ORDER BY count(*) DESC;

Ingestion of semi-structured data directly into Firebolt

While these examples provide direct access to semi-structured data in S3, there are reasons for flattening and ingesting this data in a structured format.  For this example, we will use a “Create Table as Select” (CTAS) statement to flatten the data into a Firebolt table.

CREATE TABLE nyc_phil AS 
WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), concerts_works AS (
    SELECT 
        JSON_POINTER_EXTRACT(program, '/season') AS season,
        JSON_POINTER_EXTRACT(program, '/orchestra') AS orchestra,
        JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_array,
        JSON_POINTER_EXTRACT(program, '/programID') as program_id,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
), concerts_works_soloists AS (
    SELECT 
        season,
        orchestra,
        JSON_VALUE(JSON_POINTER_EXTRACT(concert, '/Date'))::timestamptz as concert_date,
        JSON_POINTER_EXTRACT(concert, '/eventType') as concert_event_type,
        JSON_POINTER_EXTRACT(concert, '/Venue') as concert_venue,
        JSON_POINTER_EXTRACT(concert, '/Location') as concert_location,
        JSON_POINTER_EXTRACT(concert, '/Time') as concert_time,
        program_id,
        JSON_POINTER_EXTRACT(work, '/workTitle') as work_title,
        JSON_POINTER_EXTRACT(work, '/ID') as work_id,
        JSON_POINTER_EXTRACT(work, '/conductorName') as conduct_name,
        JSON_POINTER_EXTRACT(work, '/composerName') as composer_name,
        CASE WHEN JSON_POINTER_EXTRACT_ARRAY(work, '/soloists') = [] THEN ['No soloists'] ELSE JSON_POINTER_EXTRACT_ARRAY(work, '/soloists') END as soloists_array -- replacing empty entries where there are no soloists
        
    FROM concerts_works,
        UNNEST (concerts_array) as f(concert),
        UNNEST (works_array) as p(work)
)
SELECT
    JSON_VALUE(season) as season,
    JSON_VALUE(orchestra)as orchestra,
    concert_date,
    JSON_VALUE(concert_event_type) as concert_event_type,
    JSON_VALUE(concert_venue) as concert_venue,
    JSON_VALUE(concert_location)as concert_location,
    JSON_VALUE(concert_time) as concert_time,
    JSON_VALUE(program_id) as program_id,
    JSON_VALUE(work_title) as work_title,
    JSON_VALUE(work_id) as work_id,
    JSON_VALUE( conduct_name) as conduct_name,
    JSON_VALUE(composer_name) as composer_name,
    JSON_POINTER_EXTRACT(soloist, '/soloistName') as soloist_name,
    JSON_POINTER_EXTRACT(soloist, '/soloistRoles') as soloist_roles,
    JSON_POINTER_EXTRACT(soloist, '/soloistInstrument') as soloist_instrument
FROM concerts_works_soloists,    UNNEST (soloists_array) AS t(soloist);

Once this data is ingested, it can be directly queried referencing the columns shown.

Send me as pdf