Semi-structured data modeling

Listen to this article

Semi-structured data processing is a common data engineering topic. How to ingest, store and query JSON data, for example, is a consistent question on the minds of customers. The ability to process JSON data varies by data warehouse platforms. In the video below, we review design considerations when processing JSON data.

For starters, the use of JSON in data integration projects is common. Question is, what is the right approach to storing JSON ? From the Firebolt perspective, Firebolt Cloud Data Warehouse provides a spectrum of options that are available to the customer. Let's take a look at some sample JSON. The decision on how to store this JSON should be driven by how users plan to access and use the data. In other words, store it for flexibility or efficiency of access.

In the simplest form, a customer can choose to leave the entire JSON as a single text column. With this approach, there is no need to think about parsing the JSON data as it is loaded. It results in fast ingestion. Now, since the JSON is stored as raw text, this also gives the maximum flexibility. Think of clay that can be molded into any shape or format as needed. What is the downside ? There is no direct indexing that can be applied to various parts of the JSON data within the raw text. On the flexibility front, you can get any element of the JSON structure using any of the extensive set of processing functions that are built into Firebolt. For the ultimate flexibility, a database view can be created on top of the text using aforementioned functions.

The next option would be to break the JSON object into multiple columns while leaving the data in JSON format. Use of Lambda functions to place results in arrays is an option. While this will require less processing at query time, it can be cumbersome to get aggregates from arrays.

Take this another step further using JSON related functions to extract and store JSON in a format that does not require additional parsing at query time. From an ingestion standpoint, it will be worth investigating the speed of ingestion due to the additional processing required. But since the parsing is done ahead of time, this provides the foundation for fast performance during query time. With this foundation, the option to filter on the field, aggregate, join etc are all possible. Additionally, leveraging arrays and arrays of arrays can be used to persist data within the same table. With this nesting, there is storage optimization and the ability to unnest data or flatten it on the fly. For data purists, normalizing this data model into multiple tables is an option as well. While this is an approach that conventional cloud data warehouses avoid, on Firebolt it is possible to adopt and optimize this approach by leveraging JOIN indexes.

Now that the data has been persisted in some shape or form, it is worth looking into optimizing performance by leveraging specialized indexes that Firebolt provides. For example, leveraging aggregating indexes against specific fields will deliver much faster results than running against the entire table. Using views to query against a combination of JSON and flattened data in a table provides the flexibility to ride through underlying schema changes.

To sum it up, the spectrum of choices should be evaluated based on how the data will be accessed or used. For example, will there be a need to filter, sort, aggregate, join based on a specific data element within the JSON ? Will filtering on a specific JSON field prune the amount of data analyzed ? Extract it out of JSON and persist it into its own column. Is the JSON component a simple array ? If so, move to an array. Is the JSON component a key-value pair ? If so, move it to a pair of key-value arrays. Is the data less structured ? Then, keep it as raw JSON. With columnar compression and low cost of storage, maintaining the raw data does not cost an arm and a leg anymore. Persisting raw data as well as extracted fields will give ultimate flexibility and performance. Firebolt’s technical team is always available to provide you with the guidance.

‍

Semi-structured data modeling

Listen to this article

Technical Deep Dive: Automated Column Statistics

Hans-Peter Lehmann

Why 99% of Data Teams Give Up on Real-Time And How Artie Changes That

Firebolt Team

The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft

Firebolt Team

Intrigued? Want to read some more?

Semi-structured data modeling

Listen to this article

Technical Deep Dive: Automated Column Statistics

Hans-Peter Lehmann

Why 99% of Data Teams Give Up on Real-Time And How Artie Changes That

Firebolt Team

The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft

Firebolt Team

Intrigued? Want to read some more?

Don't miss a post, subscribe to the Fireblog