<- Back to all posts

semi structured data

May 15, 2023

September 13, 2024

A primer on analyzing semi-structured data

Multiple contributors

No items found.

Listen to this article

Powered by NotebookLM

Listen to this article

In today's data-driven world, organizations are constantly generating, analyzing, and utilizing massive amounts of data to fuel their businesses. As the data landscape evolves, businesses encounter a wide range of data types, from structured to unstructured and everything in between. One such data type is semi-structured data, which holds a unique position in the data ecosystem, blending the features of both structured and unstructured data. In this introductory guide, we will explore the realm of semi-structured data, focusing on its handling and management in the context of databases.

Specifically, we will examine the JSON format, a widely adopted standard for representing semi-structured data, and discuss how to harness its potential in building data applications.

‍

Overture

The dataset

In this blog post, we will explore some of the key JSON and array functions available in SQL, including Lambda functions, and how to use them to manipulate semi-structured data. For this, we will leverage the New York Philharmonic dataset containing all their concerts since the Philharmonic’s first concert at the Apollo Room conducted by Ureli Corelli Hil on December 7, 1842.

The dataset spans over 175 years of performance history, covering 11.000 unique works and 20.000 performances. The full dataset, when flattened, has over 800,000 rows.

Which composers are most popular? Have there been any trends in popularity by a conductor or by season? How many concerts were performed each year? We will answer all these questions and more using JSON and array functions.

Apollo Rooms venue on the left. The first page of the Philharmonic’s first concert is on the right.

Setup

The tech stack used in this primer is composed of the following:

Storage: S3 – Object storage on AWS
Database: Firebolt – Cloud Data Warehouse for data apps

The raw JSON file, “raw_nyc_phil.json”, is loaded into AWS S3. This file is accessed from the Firebolt data warehouse in the form of an external table, “ex_nyc_phil”, to enable direct querying. This data can be ingested “as-is” or flattened to the “nyc_phil” table in Firebolt to be analyzed and visualized.

Loading data into Firebolt

Step 1a: Upload to S3. Since there is only one input file and no preprocessing is needed, we will use the AWS management console to upload the JSON file to our bucket. The json data can be found in the file ”raw_nyc_phil.json”. This is the only file that is used as a part of this primer.

‍

Step 1b: Create a database and a Firebolt engine for this demo. Firebolt provides various types of granular engine sizes to address various workloads. A single node engine of type “S” is sufficient for this exercise.

CREATE DATABASE semistruct_tutorial;

CREATE ENGINE semistruct_engine 
WITH TYPE=S NODES=1;

USE DATABASE semistruct_tutorial;
USE ENGINE semistruct_engine;

Step 2: Create an external table pointing to the file we just uploaded to S3.

CREATE EXTERNAL TABLE ex_nyc_phil (
  raw_data TEXT
)
URL = 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nycphilharmonic'
PATTERN = '*.json'
TYPE = (JSON PARSE_AS_TEXT = TRUE);

Here’s a breakdown of the syntax:

Define the table as external. External tables serve as connectors to your data sources (in this case, S3). External tables contain no data.
URL and PATTERN specify the S3 location and the file naming pattern. For your convenience this data is staged at the following S3 location: “s3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nycphilharmonic”
TYPE specifies the file type Firebolt expects to ingest given the PATTERN.
PARSE_AS_TEXT: If True, Firebolt wraps the entire JSON object literal into a single column of TYPE TEXT. If False, we must map each key in a JSON object literal to a column in the table definition.

Step 3. We can either: a) ingest the data as is (one-single field containing the JSON object) in a Firebolt-managed table or b) transform the raw_data using array and JSON functions before inserting it into a Firebolt-managed table. We will go with option b).

‍

Sonata: Analyzing the raw data

First, let’s ensure we can read the input data via the external table we just defined:

SELECT raw_data FROM ex_nyc_phil;

‍

‍The schema of the JSON object looks like this:

"root":{
	"programs":[52 items
		0:{...}6 items
		1:{6 items
			"season": "1842-43"
            "orchestra":"New York Philharmonic"
            "concerts":[1 item
	            0:{5 items
                    "Date":"1843-02-18T05:00:00Z"
                    "eventType":"Subscription Season"
                    "Venue":"Apollo Rooms"
                    "Location":"Manhattan, NY"
                    "Time":"8:00 PM"
	            }
			]
            "programID":"5178"
            "works":[9 items
	            0:{...}5 items
	            1:{6 items
                    "workTitle":"I PURITANI"
                    "composerName":"Bellini, Vincenzo"
                    "conductorName":"Hill, Ureli Corelli"
                    "ID":"8838*2"
                    "soloists":[1 item
            			0:{3 items
                          "soloistName":"Otto, Antoinette"
                          "soloistRoles":"S"
                          "soloistInstrument":"Soprano"
            			}
            		]
            		"movement":"Elvira (aria): "Qui la voce...Vien, diletto""
				}
			]
		}
	]
}

At the top level, we have an array of programs. Each program has multiple attributes:

season ("1842-43")
orchestra ("New York Philharmonic")
concerts, ARRAY of:some text
- Date ("1843-02-18T05:00:00Z")
- eventType ("Subscription Season")
- Venue ("Apollo Rooms")
- Location ("Manhattan, NY")
- Time ("8:00PM")
programID ("5178")
works, ARRAY of:some text
- workTitle ("I PURITANI")
- composerName ("Bellini, Vincenzo")
- conductorName ("Hill, Ureli Corelli")
- ID ("8838*2")
- soloists, ARRAY of:some text
  - soloistName ("Otto, Antoinette")
  - soloistRoles ("S")
  - soloistInstrument ("Soprano")
- movement ("Elvira (aria): \"Qui la voce...Vien, diletto\"")

Minuet

Array manipulations in SQL, including Lambda functions

Moving on to the next part of our play, we want to extract the top element: programs. To accomplish this, we will leverage the JSON_POINTER_EXTRACT_ARRAY function:

JSON_POINTER_EXTRACT_ARRAY(<json>, '<json_pointer_expression>')

SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data,'/programs') AS programs_arrays FROM ex_nyc_phil;

Next, we will answer some questions about the data using ARRAY functions

‍

How many programs has the Philharmonic played?

To find out, we can use the LENGTH function:

SELECT LENGTH(JSON_POINTER_EXTRACT_ARRAY(raw_data,'/programs')) AS programs_array_length FROM ex_nyc_phil;

Let’s now convert the above array to a standard tabular format, having each program on a separate line. To do this, we will need to use the UNNEST clause:

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
)
SELECT program
FROM programs, UNNEST(programs_arrays) AS r(program);

‍

If we perform a COUNT(*) in the outer SELECT, we expect to see the same number of rows (13954) matching the LENGTH() of the array from above.

What time do the concerts usually start?

The first step is to extract the “Time” attribute from the “concerts” array. We’ll begin by extracting the concerts array using the JSON_POINTER_EXTRACT_ARRAY function and UNNEST it - similar to the programs array above. We will extract the Date, eventType, Season, Venue, Location, and Time for each concert using the JSON_POINTER_EXTRACT function.

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), concerts AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_arrays
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    JSON_VALUE(JSON_POINTER_EXTRACT(concert, '/Date'))::timestamptz as concert_date,
    JSON_POINTER_EXTRACT(concert, '/eventType') as concert_event_type,
    JSON_POINTER_EXTRACT(concert, '/Venue') as concert_venue,
    JSON_POINTER_EXTRACT(concert, '/Location') as concert_location,
    JSON_POINTER_EXTRACT(concert, '/Time') as concert_time
FROM concerts, 
    UNNEST(concerts_arrays) AS r(concert);

Back to our question (What time do concerts usually start):

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), concerts AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_arrays
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    JSON_VALUE(JSON_POINTER_EXTRACT(concert, '/Time')) as concert_time,
    count(*) as count
FROM concerts, 
    UNNEST(concerts_arrays) AS r(concert)
GROUP BY ALL
ORDER BY count(*) DESC;

8:00 and 8:30 pm are the most frequent options. Oddly enough, there were four times when the concert time was between 2:00 AM and 3:00 AM

Most popular venue?

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), concerts AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_arrays
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    JSON_VALUE(JSON_POINTER_EXTRACT(concert, '/Venue')) as concert_venue,
    count(*) as count
FROM concerts, 
    UNNEST(concerts_arrays) AS r(concert)
GROUP BY ALL
ORDER BY count(*) DESC;

‍

Which composers are most popular?

Similarly, we need to access the works array now instead of the concerts array.

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), works AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    JSON_VALUE(JSON_POINTER_EXTRACT(work, '/composerName')) as composer_name,
    count(*) as count
FROM works, 
    UNNEST(works_array) AS r(work)
WHERE JSON_VALUE(JSON_POINTER_EXTRACT(work, '/composerName')) IS NOT NULL
GROUP BY ALL
ORDER BY count(*) DESC;

Extract all conductors' names per season into an array.

We can either:

UNNEST, extract the composer name, and ARRAY_AGG back
Use TRANSFORM Lambda Function.

Let’s explore both options.

Option A: UNNEST, extract the composer name, and ARRAY_AGG back.

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), works AS (
    SELECT 
        JSON_VALUE(JSON_POINTER_EXTRACT(program, '/season')) AS season,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    season,
    JSON_VALUE(JSON_POINTER_EXTRACT(work, '/composerName')) as composer_name
FROM works, 
    UNNEST(works_array) AS r(work)

Continuing on, we will group by season, and nest (into an array) the composers. Lastly, we will remove any duplicate names using the ARRAY_DISTINCT function. We can optionally return the number of unique values in an array using the ARRAY_COUNT_DISTINCT function.

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), works AS (
    SELECT 
        JSON_VALUE(JSON_POINTER_EXTRACT(program, '/season')) AS season,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    season,
    ARRAY_DISTINCT(ARRAY_AGG(JSON_VALUE(JSON_POINTER_EXTRACT(work, '/composerName')))) as composer_names,
    ARRAY_COUNT_DISTINCT(ARRAY_DISTINCT( ARRAY_AGG(JSON_VALUE(JSON_POINTER_EXTRACT(work, '/composerName'))))) as composer_number
FROM works, 
    UNNEST(works_array) AS r(work)
GROUP BY ALL
ORDER BY season;

‍

Option B: Use TRANSFORM Lambda Function

TRANSFORM applies a function to each element of an array. Therefore, we don’t need to UNNEST anymore in the FROM clause. We can directly work with the array and extract the composer name in the SELECT clause.

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), works AS (
    SELECT 
        JSON_VALUE(JSON_POINTER_EXTRACT(program, '/season')) AS season,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    season,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/composerName')), works_array) as composer_name
FROM works ;

Similarly to Option A, we can apply ARRAY_DISTINCT, ARRAY_FLATTEN AND ARRAY_AGG functions. Please note the above query doesn’t contain the GROUP BY clause. If you need to obtain one row per season, you will need to GROUP BY season, and ARRAY_FLATTEN with ARRAY_AGG with the composer_name column before applying ARRAY_COUNT_DISTINCT.

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), works AS (
    SELECT 
        JSON_VALUE(JSON_POINTER_EXTRACT(program, '/season')) AS season,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
   
)
SELECT 
    season,
    ARRAY_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(TRANSFORM(x -> JSON_VALUE( JSON_POINTER_EXTRACT(x, '/composerName')), works_array)))) as composer_names,
    ARRAY_COUNT_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/composerName')), works_array)))) as composer_number
FROM works 
GROUP BY season
ORDER BY 1 ASC

‍

Lambda functions can be chained together and become extremely powerful when applying changes iteratively to each element of the array. In the above case, we can apply another transformation and replace all the letters from each composer's last name with asterisks except the first one.

WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), works AS (
    SELECT 
        JSON_POINTER_EXTRACT(program, '/season') AS season,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    season,
    TRANSFORM(x -> REGEXP_REPLACE(JSON_POINTER_EXTRACT(x, '/composerName'),'([A-Za-z])[a-z]+,\s(.+)', '\1***\,\2'), works_array) as composer_name
FROM works

‍

If you are still wrapping your head around how Lambda functions work or are applied, you can take a look at this blog post that covers Lambda functions in depth.

Identify the works which were played at piano

To find out which works were played at piano, we have to extract the instruments from the soloists' array. We will wrap the extraction of the soloists’ array into a view.

View definition:

CREATE VIEW soloists AS 
WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), works AS (
    SELECT 
        JSON_VALUE(JSON_POINTER_EXTRACT(program, '/season')) AS season,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
)
SELECT 
    season,
    JSON_VALUE(JSON_POINTER_EXTRACT(work, '/workTitle')) AS work_title,
    JSON_POINTER_EXTRACT_ARRAY(work, '/soloists') AS soloists_array
FROM works, 
    UNNEST (works_array) AS r(work)

Query to extract the soloists' details:

SELECT 
    season,
    work_title,
    soloists_array,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistName')), soloists_array) as soloist_names,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistRoles')), soloists_array) as soloist_role,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistInstrument')), soloists_array) as soloist_instruments
FROM soloists
WHERE LENGTH(soloists_array)>0

‍

To find out the works played at Piano, we can leverage the CONTAINS array function that returns a boolean if it’s a match or not.

WITH soloists_data AS (
SELECT 
    season,
    work_title,
    soloists_array,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistName')), soloists_array) as soloist_names,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistRole')), soloists_array) as soloist_role,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistInstrument')), soloists_array) as soloist_instruments
FROM soloists
)
SELECT * 
FROM soloists_data
WHERE CONTAINS(soloist_instruments,'Piano')

‍

Sort the soloists' names for each work by their instruments

To answer this question, we can reuse the view and CTE from above. We will turn our attention to a different array function: ARRAY_SORT, which can take two arrays as input and sort one by the values in the other. Given the following entry:soloist_names: Otto, Boulard, Munson, Mayer soloist_instruments: Soprano, Alto, Tenor, Bass It will resort the soloist_names array based on soloist_instruments values, as such: Boulard, Mayer, Otto, Munson

WITH soloists_data AS (
SELECT 
    season,
    work_title,
    soloists_array,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistName')), soloists_array) as soloist_names,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistRole')), soloists_array) as soloist_role,
    TRANSFORM(x -> JSON_VALUE(JSON_POINTER_EXTRACT(x, '/soloistInstrument')), soloists_array) as soloist_instruments
FROM soloists
)
SELECT 
    soloist_names,
    soloist_instruments,
    ARRAY_SORT(x, y -> y, soloist_names, soloist_instruments) soloist_names_sorted_by_instruments
FROM soloists_data
WHERE LENGTH(soloist_names)>0

‍

Firebolt supports a wide range of array and lambda functions similar to the above ones, that you can check by following this link to our docs.

Flattening data

Next, we’ll tackle ingestion. We will apply the same concepts from above to UNNEST all the arrays, starting with programs, and continuing to concerts, and works, and finally, UNNEST the soloists' arrays within works.

Finally, we will wrap the SELECT query and create a CTAS (create table as select):

CREATE TABLE nyc_phil AS 
WITH programs AS (
    SELECT JSON_POINTER_EXTRACT_ARRAY(raw_data, '/programs') AS programs_arrays 
    FROM ex_nyc_phil
), concerts_works AS (
    SELECT 
        JSON_POINTER_EXTRACT(program, '/season') AS season,
        JSON_POINTER_EXTRACT(program, '/orchestra') AS orchestra,
        JSON_POINTER_EXTRACT_ARRAY(program, '/concerts') as concerts_array,
        JSON_POINTER_EXTRACT(program, '/programID') as program_id,
        JSON_POINTER_EXTRACT_ARRAY(program, '/works') as works_array
    FROM programs, 
        UNNEST(programs_arrays) AS r(program)
), concerts_works_soloists AS (
    SELECT 
        season,
        orchestra,
        JSON_VALUE(JSON_POINTER_EXTRACT(concert, '/Date'))::timestamptz as concert_date,
        JSON_POINTER_EXTRACT(concert, '/eventType') as concert_event_type,
        JSON_POINTER_EXTRACT(concert, '/Venue') as concert_venue,
        JSON_POINTER_EXTRACT(concert, '/Location') as concert_location,
        JSON_POINTER_EXTRACT(concert, '/Time') as concert_time,
        program_id,
        JSON_POINTER_EXTRACT(work, '/workTitle') as work_title,
        JSON_POINTER_EXTRACT(work, '/ID') as work_id,
        JSON_POINTER_EXTRACT(work, '/conductorName') as conduct_name,
        JSON_POINTER_EXTRACT(work, '/composerName') as composer_name,
        CASE WHEN JSON_POINTER_EXTRACT_ARRAY(work, '/soloists') = [] THEN ['No soloists'] ELSE JSON_POINTER_EXTRACT_ARRAY(work, '/soloists') END as soloists_array -- replacing empty entries where there are no soloists
        
    FROM concerts_works,
        UNNEST (concerts_array) as f(concert),
        UNNEST (works_array) as p(work)
)
SELECT
    JSON_VALUE(season) as season,
    JSON_VALUE(orchestra)as orchestra,
    concert_date,
    JSON_VALUE(concert_event_type) as concert_event_type,
    JSON_VALUE(concert_venue) as concert_venue,
    JSON_VALUE(concert_location)as concert_location,
    JSON_VALUE(concert_time) as concert_time,
    JSON_VALUE(program_id) as program_id,
    JSON_VALUE(work_title) as work_title,
    JSON_VALUE(work_id) as work_id,
    JSON_VALUE( conduct_name) as conduct_name,
    JSON_VALUE(composer_name) as composer_name,
    JSON_POINTER_EXTRACT(soloist, '/soloistName') as soloist_name,
    JSON_POINTER_EXTRACT(soloist, '/soloistRoles') as soloist_roles,
    JSON_POINTER_EXTRACT(soloist, '/soloistInstrument') as soloist_instrument
FROM concerts_works_soloists,    UNNEST (soloists_array) AS t(soloist);

SELECT * FROM nyc_phil;

So far, we have covered loading and analyzing semi-structured data with various JSON and Array functions. Now, all the flattened data can be directly accessed through a Firebolt table.

‍

Conclusion

In this article, we walked through leveraging JSON functions to parse and process data along with leveraging Lambda expressions to operate on array data type. In a follow-up article we will cover visualizing this data.

Table of Contents

This is some text inside of a div block.

This is some text inside of a div block.

The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft

Lyft's Ritesh Varyani details their polyglot data strategy unifying Spark, Trino, and ClickHouse with AI.

Firebolt Team

Unlocking Faster Iceberg Queries: The Writer Optimizations You're Missing

Your Apache Iceberg tables are slow because of how your data was written into them.

John Kennedy

"Where Do I Put My Logs?" A Conversation with TLDCRM's CEO on Solving the Impossible

A conversation with TLDCRM how they built a customer facing API observability solution at scale.

Sergio Ferragut

Intrigued? Want to read some more?