The advent of cheap, infinite cloud storage and massive data collection from the web and IoT has dramatically transformed the nature of raw data subject to analytical query workloads. With endless streams of JSON log files and other nested serialization formats like Avro and Parquet, the ability to handle semi-structured data is a must.
This paper reviews the challenges posed by semi-structured data and demonstrates different approaches, from NoSQL document stores and Snowflake’s VARIANT to Firebolt’s tabular schema with support for nested types. Based on concrete implementations, it will walk you through making semi-structured data analytics fast, real-time and cost-effective as an alternative to complex and expensive ETL-like transformations that flatten the data into columns, or to raw unstructured data, neither of which perform as well.
While NoSQL solutions like MongoDB are designed to support the varying nature of the schema, running queries usually requires full scans on the data. The use of complex aggregations and filters often result in unacceptable response times. A partial solution might be to add a search technology like Elasticsearch. This may improve the query response time for searches but not general-purpose analytics for two reasons:
Snowflake provides two options to deal with semi-structured data:
Firebolt allows SQL experts to combine native array manipulation functions with partial flattening for the best performance.
Array manipulation functions directly embed objects with their full structure in the database and transform them into Firebolt’s native nested types. Users no longer work with objects as a whole, they transform them into an efficient format which maintains their nested structures. This enables more compact queries that run in SQL. Since the arrays are stored in a columnar format, querying does not require loading the entire object into memory.
Firebolt enables flattening on the fly. Users have the flexibility to balance flattening and manipulation functions, and work on a table which has parts that are fully flattened, parts that are partially flattened and parts that are left as they are.
When partially flattened data is natively supported in your database, it also results in fewer ETL steps. JSON manipulation functions are used to seamlessly cleanse, fix and organize semi-structured data as it’s ingested through the Firebolt pipeline. Partial flattening can then be performed to transform the nested types to an ordinary table that can be further manipulated with familiar SQL techniques, and indexed on each column or join to improve performance. It also enables creating views without nested types on tables containing nested types, which is useful when working with BI tools that cannot access nested types directly. Partial flattening as opposed to full flattening provides the benefits mentioned above while avoiding huge denormalized tables.
Sub-second performance: Firebolt provides extreme performance across structured and semi-structured data, regardless of the amount of attributes and arrays it contains. For flattened data this is achieved with the various optimizations of a modern columnar database, including vectorized execution, aggressive compression and encoding techniques as well as indexing. For semi-structured data, Firebolt provides native Lambda expressions that can be invoked within SQL. These functions are optimized to traverse the nested structures without loading the data into RAM or performing full scans. The end result is sub-second performance at scale without requiring nearly as many resources and therefore costs.
Substantial cost reduction: Firebolt’s flattening is extremely cost-efficient compared to Snowflake’s, since there's no need to duplicate a column’s data for each nested value. Although this duplication can have less of an impact on storage with column-based compression, it may require additional GROUP BY or FILTER operations, which can have a huge impact on performance. With arrays and direct array manipulation functions, GROUP BY and FILTER operations can be used only during the final result set generation. In addition, since the arrays are already stored in an efficient, columnar and compressed format, the UNNESTING is just a matter of presentation, not a transformation of the underlying storage representation.
Consider data collected from an e-commerce website. Each visit (session) is a record. Each visit has an Id, Start Time, and Duration (regular scalar values), but also zero or more tags, various properties of the user agent and zero or more “events of interest” like registration, adding an item to cart and leaving a review. Such events, in addition to scalar values (Id, Type and Timestamp) may include zero or more additional properties which will vary according to the event type and may differ even between events of the same type.
Assuming that the raw data is stored in a JSON format, here are examples of two records (usually they will appear in the same file in a “JSON per line” format) :
In order to enable efficient analytical queries which will be used by BI tools (and BI users) who natively speak SQL, the nested structure of the raw data has to be at least partially transformed into a semi-structured, SQL-friendly schema.
Here’s an illustration of the first record in Firebolt’s semi-structured table (some data omitting for brevity):
In order to illustrate the querying of the data in both of Firebolt’s approaches (UNNEST construct and special array manipulation functions), let’s consider the following query written first in plain English: List all the IDs of the visits whose tags contained “sale” or “fashion” in descending order of duration.
UNNEST flattens the nested data and adds a column having one row for each element of the nested type.
SELECT id
FROM visits
UNNEST (tags)
WHERE tags = ‘sale’ OR tags = ‘fashion’
GROUP BY id
ORDER BY duration DESC
The GROUP BY clause here is needed because each row will appear as many times as the number of elements in its tags column
Here we use the array function “has” which returns “true” if the array supplied as the first argument contains an element equal to the second argument.
SELECT id
FROM visits
WHERE has(tags, ‘sale’) OR has (tags,’fashion’)
ORDER BY duration DESC
Note that we do not need the GROUP BY clause here because each row appears just once and we operate directly on the array without “exploding” it.
The following examples will use array functions and UNNEST interchangeably. While the same query can typically be written in either one of the approaches, there is usually a more suitable way to do it, and of course it’s a matter of personal preference.
Count all the visits whose user agent string contained “safari”, group by event type in descending order of number of visits.
Here we’ll use both UNNEST and an array function. For the grouping, UNNEST is a more suitable option, while for the filtering the Array function is more compact and readable.
SELECT COUNT(id) as number_of_visits
FROM visits
UNNEST (events_type)
Here we introduce the concept of Lambda expressions in array functions. The function any_match accepts a Lambda expression, and one or more arrays. It then runs over the elements of the arrays (which have to be of the same length), passes the current elements to the Lambda expression and if for any reason there’s at least one call to Lambda that evaluates to true - the whole function will evaluate to true, otherwise it will evaluate to false.
A simpler example would be to rewrite our first example where we used the disjunction of two “has” functions: “has(tags, ‘sale’) OR has(tags,’fashion’)”, with a single call to any_match, as follows:
In the inventory of items below, there are several sizes for each item, and prices in different currencies for each size. So, for a single item, its prices object will look like this:
In order to convert it into a fully flattened table like the one below, Snowflake has to run the “LATERAL FLATTEN” operator twice - first to get sizes, then to get prices, and finally JOIN three tables.
With Firebolt, using arrays, you simply run (the subquery is here just for readability). The query may look complex, but it entirely eliminates the need to materialize intermediate tables and join them.
If you want a fully flattened result, you can augment the query with a double UNNEST clause (left as an exercise to the reader) - once again, no intermediate materialization, no JOINs and a huge performance gain.
Firebolt allows SQL experts to combine native semi-structured storage and Lambda expressions with partial flattening to achieve a seamless analytics experience delivering sub-second performance. It provides:
Native Lambda expressions that can be involved directly in SQL
A flexible combination of nested storage and flattening that can be queried in SQL without the need to create huge, costly tables
Sub-second performance across structured and semi-structured data, regardless of the amount of attributes and arrays it contains
Substantial cost reduction by avoiding the need to load all JSON or other semi-structured data entirely in RAM and do full scans or complex nested queries.