NYC Traffic Dataset Analysis: SQL Warm Up

New York City is known for its bustling streets and heavy traffic. To gain insights into traffic patterns and volumes, the NYC Department of Transportation maintains a comprehensive dataset of traffic records. In this tutorial, we'll walk you through the NYC traffic dataset's schema and provide sample SQL queries with explanations to help you get started. Hope this warms up your SQL skills.

Understanding the Dataset Schema

Before diving into the analysis, it's essential to understand the dataset's schema. Here's a tabular representation of the dataset schema:

Column Name	Data Type	Description
requestid	Bigint	Unique identifier for each record
boro	Text	Borough where the traffic data is from
yr	Int	Year of the data
month	Int	Month of the data
dd	Int	Day of the data
hh	Int	Hour of the data
mm	Int	Minute of the data
vol	Int	Traffic volume
segmentid	Bigint	Unique identifier for each segment
wktgeom	Text	Well-Known Text representation of geometry
street	Text	Street name
fromstreet	Text	Starting street
tostreet	Text	Ending street
direction	Text	Traffic direction

‍

Now that you're familiar with the dataset's schema, let's explore some sample SQL queries to extract useful information.

Loading NYC Traffic data into Firebolt

Loading data into Firebolt can be done one of two ways. You can use the ‘COPY FROM’ command to perform schema inference and load the data.

Option 1:

The following command copies data from the S3 bucket into an internal table called ‘nyc_traffic’. This command automatically inferences the schema by sampling the dataset and creates the appropriate table structure.

1COPY INTO nyc_traffic FROM 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyctraffic/parquet/'
2WITH PATTERN="*.parquet" AUTO_CREATE=TRUE TYPE=PARQUET;

‍

The ‘CREATE TABLE’ statement generated by the ‘COPY FROM’ command is shown below.

1CREATE TABLE “nyc_traffic” (“requestid” bigint NULL, “boro” text NULL, “yr” integer NULL, “month” integer NULL, “dd” integer NULL, “hh” integer NULL, “mm” integer NULL, “vol” integer NULL, “segmentid” bigint NULL, “wktgeom” text NULL, “street” text NULL, “fromstreet” text NULL, “tostreet” text NULL, “direction” text NULL)

Option 2:

The second way is to use the concept of an external table to move data into Firebolt. You can review this approach in the Appendix.

Sample SQL Queries

Query 1: Total Traffic Volume by Segment and Borough

1SELECT boro, street, segmentid, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY boro, street, segmentid
4ORDER BY TotalTrafficVolume DESC
5LIMIT 10;

This query identifies the top 10 segments with the highest traffic volume by borough, helping to pinpoint areas with significant traffic congestion.

Query 2: Traffic Volume by Direction

1SELECT direction, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY direction;

This query provides a breakdown of traffic volume by direction, helping to understand the flow of traffic.

Query 3: Count of Records by Month and Year

1SELECT yr, month, COUNT(*) AS RecordCount
2FROM nyc_traffic
3GROUP BY yr, month
4ORDER BY yr, month;

This query provides a count of records for each month and year combination, allowing you to identify patterns and trends over time.

Query 4: Busiest Streets by Total Traffic Volume

1SELECT street, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY street
4ORDER BY TotalTrafficVolume DESC
5LIMIT 10;

This query identifies the top 10 busiest streets by total traffic volume, helping to prioritize infrastructure improvements.

Query 5: Identify the Hour of the Day with Peak Traffic Volume

1SELECT hh, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY hh
4ORDER BY TotalTrafficVolume DESC
5LIMIT 1;

This query finds the hour of the day with the highest traffic volume, aiding in traffic management and scheduling.

Query 6: Total Traffic Volume by Year, Borough, Street, From Street, To Street, and Direction

1SELECT yr, boro, street, fromstreet, tostreet, direction, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY ALL
4ORDER BY yr, TotalTrafficVolume DESC;

This query provides a comprehensive view of total traffic volume, including multiple attributes like year, borough, streets, and direction.

These sample SQL queries should help you get started with analyzing the NYC traffic dataset. Depending on your specific analysis goals, you can customize and expand these queries to gain deeper insights into traffic patterns and volumes. As a data engineer, your skills in data manipulation and SQL will play a crucial role in extracting valuable information from the dataset.

Appendix

Using External Table for Adhoc exploration and Ingestion

Create an external table as a connector from Firebolt to your external data source. In the table definition, you specify credentials that allow Firebolt to read from the data source. For more information, see CREATE EXTERNAL TABLE and Using AWS roles to access S3. Data that you ingest must be in an Amazon S3 bucket in the same AWS Region as the Firebolt database.

1CREATE EXTERNAL TABLE ex_nyc_traffic (
2  requestid bigint,
3  boro Text,
4  yr int,
5  month int,
6  dd int,
7  hh int,
8  mm int,
9  vol int,
10  segmentid bigint,
11  wktgeom Text,
12  street Text,
13  fromstreet Text,
14  tostreet Text,
15  direction Text
16) URL = 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyctraffic/parquet/' OBJECT_PATTERN = '*.parquet'
17TYPE = (PARQUET);

‍

Create a table to store the data in Firebolt to be queried.

1CREATE  TABLE nyc_traffic (
2  requestid bigint,
3  boro Text,
4  yr int,
5  month int,
6  dd int,
7  hh int,
8  mm int,
9  vol int,
10  segmentid bigint,
11  wktgeom Text,
12  street Text,
13  fromstreet Text,
14  tostreet Text,
15  direction Text) PRIMARY INDEX yr,month,dd, hh, mm, boro, street;

‍

Use the INSERT sql statement using a general purpose engine to load data from the external data source into the fact or dimension table.

1INSERT INTO nyc_traffic SELECT * FROM ex_nyc_traffic;

Send me as pdf

Where should we send the PDF to?