E-Commerce Analytics Primer

In the world of data engineering, e-commerce stands out as a dynamic and data-intensive domain. To thrive in this space, data engineers must understand the foundational principles of data management, analysis, and optimization. This primer introduces the data engineer to sample use cases for data-driven decision-making in e-commerce. We will also walk through ingesting data and explore performance optimizations.

To help with this, we will use the Firebolt Cloud Data Warehouse to analyze data collected from the Open Customer Data Platform (CDP) Project. Please note that this is not a detailed data modeling exercise covering every aspect of building a data warehouse for e-commerce; it aims to familiarize the reader with common use cases and optimization concepts in a simple, approachable way.

Understanding the E-commerce Data Model

A solid data model serves as the foundation for structuring a data warehouse, and this is especially important in e-commerce, where data is plentiful and diverse: user interactions, product details, transaction records, and more. A well-thought-out schema is key to efficiently managing and retrieving this data, as it directly impacts the speed, efficiency, and cost of performing analytics.

Additionally, a well-structured schema supports scalability, allowing for the smooth addition of new data and features when needed. It also helps maintain data integrity by enforcing constraints and validation rules, ensuring that incorrect or incomplete data doesn't enter the database. This commitment to data quality is essential for building trust among users who rely on the data platform.

Shown below is a schema with a single table. Complementary datasets and attributes can be easily added to extend the functionality of this schema. 

Property        Description
event_time      Event time in UTC
event_type      Customer event type (view / cart / purchase)
product_id      ID of a product
category_id     Product's category ID
category_code   Product's category code
brand           Product brand
price           Price of a product
user_id         Permanent user ID
user_session    Temporary user session ID

This schema provides the basics needed for user identification, event tracking, product purchase trends, and user session details. The user_id field allows businesses to associate interactions with specific users, enabling personalization, customer segmentation, and targeted marketing efforts. E-commerce revolves around user interactions with products and services through events. The event_time and event_type fields capture critical information about each user interaction. The user_session field tracks user sessions, grouping interactions within a single session. User sessions are invaluable for analyzing user flow, cart abandonment, and session-based recommendations. Finally, the product_id, category_id, category_code, brand, and price fields provide essential product details. These attributes can be used for catalog management, inventory control, and pricing strategies.
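
For illustration, once the dataset is loaded (see the ingestion steps below), a simple query over user_session shows how the schema groups raw events into sessions:

-- Illustrative sketch: group events into sessions using the user_session field.
-- Assumes the 'ecommerce' table created in the ingestion section below.
SELECT
    user_session,
    COUNT(*) AS events_in_session,
    MIN(event_time) AS session_start,
    MAX(event_time) AS session_end
FROM ecommerce
WHERE user_session IS NOT NULL
GROUP BY user_session
LIMIT 10;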

Working with the E-Commerce Dataset

We have downloaded the sample e-commerce data from here to our Amazon S3 bucket. This S3 bucket will serve as the ingestion point for batch or micro-batch-based ingest as required. Running the 'aws s3 ls' command on the bucket provides a list of files, as shown below. We have seven months of e-commerce data (approximately 57 GB of uncompressed CSV files) in the bucket, ranging from October 2019 to April 2020. This sample dataset provides us with enough data for experimentation.

2023-10-26 12:31:54 5668612855 2019-Oct.csv
2023-10-26 12:31:54 9006762395 2019-Nov.csv
2023-11-15 10:01:46 9357988804 2019-Dec 2.csv
2023-11-15 10:01:46 7777991113 2020-Jan.csv
2023-11-15 10:01:47 7675550422 2020-Feb.csv
2023-11-15 10:01:46 7823838608 2020-Mar.csv
2023-11-15 10:01:46 9267234550 2020-Apr.csv


For this exercise, we will log in to the Firebolt WebUI at go.firebolt.io and proceed to the SQL Editor (by clicking on the 'Develop' workspace in the toolbar). Running the following SQL commands in the SQL Editor will create the database, create and attach the compute resource for the database as a single-node (NODES = 1) engine, and start the engine. Approximately 49 GB of data will be ingested; you can experiment with the node type and number of nodes to control the speed of ingestion.

CREATE DATABASE ecommercedb WITH DESCRIPTION = 'ECommerce Analytics Primer';

CREATE ENGINE ecommerceEngine WITH NODES = 1;

START ENGINE ecommerceEngine;

USE ENGINE ecommerceEngine;

USE ecommercedb;
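
Engine sizing affects ingestion speed. If you want to experiment with more nodes, the engine can be resized after creation; a minimal sketch, assuming the NODES option of ALTER ENGINE in current Firebolt syntax:

-- Assumed syntax: scale the engine out to two nodes for faster ingestion,
-- then scale back down after the load completes.
ALTER ENGINE ecommerceEngine SET NODES = 2;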

With the database up and running, we use the SQL Editor to run the 'CREATE TABLE' command, as shown. Please note that the following command is executed without defining any indexes; we will recreate the table with a Primary Index in the optimization section.

CREATE FACT TABLE IF NOT EXISTS "ecommerce" (
  "event_time" TIMESTAMPTZ NOT NULL,
  "event_type" TEXT NOT NULL,
  "product_id" BIGINT NOT NULL,
  "category_id" TEXT NULL,
  "category_code" TEXT NULL,
  "brand" TEXT NULL,
  "price" NUMERIC(38, 9) NULL,
  "user_id" TEXT NULL,
  "user_session" TEXT NULL
);

We can use the 'COPY' command to ingest the data from S3 into Firebolt. The 'COPY' command provides schema inference and supports various formats, including Parquet, CSV, Avro, ORC, TSV, and JSON. The example below reads Parquet copies of the dataset from a public Firebolt sample bucket; if you host the dataset yourself, substitute your bucket for <my-bucket-name> and store the files in an 'ecommerce' folder. This command copies data from the Parquet files in the S3 bucket into the 'ecommerce' table we created in the previous step. Please note that the Firebolt engine needs access to your bucket for ingestion. Access permissions to the S3 bucket can be configured as shown here

COPY INTO ecommerce FROM 's3://firebolt-sample-datasets-public-us-east-1/ecommerce/parquet/'
WITH PATTERN='*.parquet.gz' TYPE = PARQUET;

SHOW TABLES;

Once the COPY is complete, run 'SHOW TABLES' to see the number of records loaded. As seen below, the 'ecommerce' table has approximately 412 million records and is approximately 49 GB uncompressed / 21 GB compressed. There are no indexes defined.
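
If you host the original CSV files in your own bucket, the same command applies with CSV options. A sketch, assuming the files keep the header row from the source dataset (again substituting your bucket for <my-bucket-name>):

-- Hypothetical CSV variant of the ingest; HEADER = TRUE skips the header row.
COPY INTO ecommerce
FROM 's3://<my-bucket-name>/ecommerce/'
WITH PATTERN = '*.csv' TYPE = CSV HEADER = TRUE;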

You can also look at the compression ratio to review the savings from columnar compression. At this point, we are ready to run our queries using the internal ecommerce table.
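
The compression ratio can be pulled from information_schema.tables; a sketch, assuming the size columns exposed by your Firebolt version:

-- Assumed column names; check information_schema.tables in your version.
SELECT
    table_name,
    uncompressed_bytes,
    compressed_bytes,
    uncompressed_bytes::DOUBLE PRECISION / compressed_bytes AS compression_ratio
FROM information_schema.tables
WHERE table_name = 'ecommerce';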

Sample Analytics on E-Commerce Dataset

Below are sample use cases and the queries that support them. Even though these are simple queries, they can trigger large scans that impact performance and cost. To illustrate this, review the response time and the total amount of data scanned as you run these queries; we will capture these numbers and revisit them in the optimization section. As mentioned above, these queries run against Firebolt's columnar storage but without any indexes defined. We will define indexes later.
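
These numbers can also be reviewed in information_schema.engine_query_history, the same view we use later for recommend_ddl. A sketch, assuming the duration and scanned-bytes columns exposed by your version:

-- query_text and end_time also appear later in this primer; duration_us and
-- scanned_bytes are assumed column names, check your version's documentation.
SELECT
    query_text,
    duration_us,
    scanned_bytes
FROM information_schema.engine_query_history
WHERE query_text ILIKE 'select%'
ORDER BY end_time DESC
LIMIT 10;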

Customer Lifetime Value (LTV)

Calculating customer lifetime value involves assessing the total revenue generated by customers over their engagement with your brand. It helps in understanding the long-term value of customers.

-- Customer LTV
SELECT
    user_id,
    SUM(price) AS total_revenue,
    COUNT(DISTINCT user_session) AS total_sessions,
    (SUM(price) / COUNT(DISTINCT user_session)) AS average_revenue_per_session
FROM ecommerce
WHERE event_type = 'purchase'
AND user_session IS NOT NULL
GROUP BY user_id
ORDER BY total_revenue DESC;

Funnel Analysis for Conversion Optimization

Funnel analysis tracks the customer journey, from initial views to cart additions and purchases. It identifies drop-off points and helps optimize conversion rates.

-- Funnel Analysis
SELECT
    event_type,
    COUNT(DISTINCT user_id) AS unique_users
FROM ecommerce
WHERE event_type IN ('view', 'cart', 'purchase')
GROUP BY event_type
ORDER BY 2 DESC;

Product Recommendations and Cross-Selling

Analyzing co-purchase data can reveal products that are frequently bought together, enabling practical product recommendations and cross-selling opportunities.

-- Product Recommendations
WITH CoPurchaseCounts AS (
    SELECT a.product_id AS product_A, b.product_id AS product_B, COUNT(*) AS purchase_count
    FROM ecommerce a
    JOIN ecommerce b ON a.user_session = b.user_session AND a.product_id < b.product_id
    WHERE a.event_type = 'purchase' AND b.event_type = 'purchase'
    GROUP BY a.product_id, b.product_id
)
SELECT
    *
FROM CoPurchaseCounts cp
WHERE purchase_count > 2
ORDER BY purchase_count DESC
LIMIT 10;

Seasonal Sales Trends and Insights

Analyzing sales trends by month helps businesses understand seasonal variations in customer behavior and tailor their strategies accordingly.

-- Seasonal Sales Trends
SELECT
    EXTRACT(MONTH FROM event_time) AS sales_month,
    SUM(price) AS total_revenue
FROM ecommerce
WHERE event_type = 'purchase'
GROUP BY sales_month
ORDER BY sales_month;

Customer Segmentation for Targeted Marketing

Segmenting customers based on activity levels and average revenue allows for targeted marketing efforts tailored to different customer groups.

-- Segmentation
WITH CustomerSegments AS (
    SELECT
        user_id,
        COUNT(DISTINCT user_session) AS session_count,
        SUM(price) AS total_revenue
    FROM ecommerce
    WHERE event_type = 'purchase'
    GROUP BY user_id
)
SELECT
    CASE
        WHEN session_count <= 5 THEN 'Low Activity'
        WHEN session_count <= 20 THEN 'Medium Activity'
        ELSE 'High Activity'
    END AS segment,
    COUNT(user_id) AS user_count,
    AVG(total_revenue) AS avg_revenue
FROM CustomerSegments
GROUP BY segment;

Cart Abandonment Rate: A Key Metric

Tracking cart abandonment rates helps identify where potential customers drop off in the purchase process, allowing for targeted recovery efforts.

-- Cart Abandonment Rate
WITH CartAbandonment AS (
    SELECT event_time::DATE as event_date,
        user_id,
        COUNT(CASE WHEN event_type = 'cart' THEN 1 ELSE NULL END) AS cart_count,
        COUNT(CASE WHEN event_type = 'purchase' THEN 1 ELSE NULL END) AS purchase_count
    FROM ecommerce
    WHERE event_type IN ('cart', 'purchase')
    GROUP BY event_date, user_id
)
SELECT
    event_date, COUNT(*) AS total_users,
    SUM(CASE WHEN cart_count > 0 AND purchase_count = 0 THEN 1 ELSE 0 END) AS abandoned_carts,
    (SUM(CASE WHEN cart_count > 0 AND purchase_count = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS abandonment_rate
FROM CartAbandonment
GROUP BY event_date
ORDER BY event_date;

User Purchase Frequency Distribution

Analyzing the distribution of user purchase frequencies helps understand user behavior and plan marketing campaigns.

-- User Purchase Frequency Distribution
WITH UserPurchaseFrequency AS (
    SELECT
        user_id,
        COUNT(DISTINCT event_time::DATE) AS purchase_frequency
    FROM ecommerce
    WHERE event_type = 'purchase'
    GROUP BY user_id
)
SELECT
    purchase_frequency,
    COUNT(user_id) AS user_count
FROM UserPurchaseFrequency
GROUP BY purchase_frequency
ORDER BY purchase_frequency;

Optimizing Performance 

To find a Primary Index that benefits our workload of e-commerce queries, we can use the recommend_ddl command. It takes two parameters, a table name and a list of queries of interest from an optimization standpoint; it analyzes those queries and recommends the optimal Primary Index for them.

Once the analytical queries have been run, we can call 'recommend_ddl' from the SQL Editor. The queries are selected from information_schema.engine_query_history and represent the workload of interest. The example below focuses on queries run in the last 30 minutes; depending on your workloads, this window can vary, and you can focus the recommendations on the specific queries that matter most to your organization.

CALL recommend_ddl (
  ecommerce,
  (
    SELECT
      query_text
    FROM
      information_schema.engine_query_history
    WHERE query_text ILIKE 'select%'
    AND end_time > NOW() - INTERVAL '30 minutes'
  )
);

The output of the recommend_ddl command shows that the event_type and user_session columns are a good choice for the Primary Index on this table, with average data pruning of 94%.

To test this out, we recreate a copy of the 'ecommerce' table with a Primary Index on event_type and user_session using the 'CREATE TABLE' command.

CREATE FACT TABLE IF NOT EXISTS "ecommerce_pi" (
  "event_time" TIMESTAMPTZ NOT NULL,
  "event_type" TEXT NOT NULL,
  "product_id" BIGINT NOT NULL,
  "category_id" TEXT NULL,
  "category_code" TEXT NULL,
  "brand" TEXT NULL,
  "price" NUMERIC(38, 9) NULL,
  "user_id" TEXT NULL,
  "user_session" TEXT NULL
) PRIMARY INDEX event_type, user_session;

INSERT INTO ecommerce_pi SELECT * FROM ecommerce;

We can now rerun the previous set of queries on the new table with the primary index configured. 
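
For example, rerunning the funnel query against 'ecommerce_pi' lets you compare the response time and data scanned directly with the earlier run:

-- Funnel Analysis against the table with the Primary Index
SELECT
    event_type,
    COUNT(DISTINCT user_id) AS unique_users
FROM ecommerce_pi
WHERE event_type IN ('view', 'cart', 'purchase')
GROUP BY event_type
ORDER BY 2 DESC;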

Conclusion

Data analytics is a critical part of e-commerce operations today. In this example, we took a sample dataset from a Customer Data Platform (CDP) and ingested it rapidly into Firebolt. The role of performance and efficiency cannot be overstated when it comes to cloud-based analytics: inefficient data processing results in slower performance and cost overruns in the cloud. We walked through potential ways to address these performance challenges using the Firebolt Cloud Data Warehouse Platform.

Appendix 

In the example above, we ingested data into Firebolt to run our queries. However, there may be cases where we want to explore data in the data lake without the need for fast performance. Firebolt provides the ability to explore data stored on S3 directly. This federated data lake access is helpful for ad-hoc analysis, transformation, and validation before loading the data into the data warehouse. Firebolt supports external tables, which are pointers to the files in the data lake. The step below illustrates creating an external table called 'ex_ecommerce' pointing directly to the data stored on Amazon S3.

CREATE EXTERNAL TABLE IF NOT EXISTS ex_ecommerce (
  event_time TIMESTAMPTZ NOT NULL,
  event_type TEXT NOT NULL,
  product_id BIGINT NOT NULL,
  category_id TEXT NULL,
  category_code TEXT NULL,
  brand TEXT NULL,
  price NUMERIC(38, 9) NULL,
  user_id TEXT NULL,
  user_session TEXT NULL
) URL = 's3://firebolt-sample-datasets-public-us-east-1/ecommerce/parquet/'
OBJECT_PATTERN = '*.parquet.gz' TYPE = (PARQUET);

Now the data can be explored directly on S3 using SQL. For example, you can query the count of event types directly against the Parquet files in the data lake, as shown below.

SELECT event_type, count(*)
FROM ex_ecommerce
GROUP BY ALL;

Querying directly against the data lake does not leverage Firebolt's performance optimizations in the form of columnar storage and indexes. 
