New York City is known for its bustling streets and heavy traffic. To gain insights into traffic patterns and volumes, the NYC Department of Transportation maintains a comprehensive dataset of traffic records. In this tutorial, we'll walk you through the NYC traffic dataset's schema and provide sample SQL queries with explanations to help you get started. Hope this warms up your SQL skills.
Before diving into the analysis, it's essential to understand the dataset's schema. Here's a tabular representation of the dataset schema:
Column Name | Data Type | Description |
---|---|---|
requestid | Bigint | Unique identifier for each record |
boro | Text | Borough where the traffic data is from |
yr | Int | Year of the data |
month | Int | Month of the data |
dd | Int | Day of the data |
hh | Int | Hour of the data |
mm | Int | Minute of the data |
vol | Int | Traffic volume |
segmentid | Bigint | Unique identifier for each segment |
wktgeom | Text | Well-Known Text representation of geometry |
street | Text | Street name |
fromstreet | Text | Starting street |
tostreet | Text | Ending street |
direction | Text | Traffic direction |
Now that you're familiar with the dataset's schema, let's explore some sample SQL queries to extract useful information.
Loading data into Firebolt can be done one of two ways. You can use the ‘COPY FROM’ command to perform schema inference and load the data.
Option 1:
The following command copies data from the S3 bucket into an internal table called ‘nyc_traffic’. This command automatically inferences the schema by sampling the dataset and creates the appropriate table structure.
1COPY INTO nyc_traffic FROM 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyctraffic/parquet/'
2WITH PATTERN="*.parquet" AUTO_CREATE=TRUE TYPE=PARQUET;
The ‘CREATE TABLE’ statement generated by the ‘COPY FROM’ command is shown below.
1CREATE TABLE “nyc_traffic” (“requestid” bigint NULL, “boro” text NULL, “yr” integer NULL, “month” integer NULL, “dd” integer NULL, “hh” integer NULL, “mm” integer NULL, “vol” integer NULL, “segmentid” bigint NULL, “wktgeom” text NULL, “street” text NULL, “fromstreet” text NULL, “tostreet” text NULL, “direction” text NULL)
Option 2:
The second way is to use the concept of an external table to move data into Firebolt. You can review this approach in the Appendix.
1SELECT boro, street, segmentid, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY boro, street, segmentid
4ORDER BY TotalTrafficVolume DESC
5LIMIT 10;
This query identifies the top 10 segments with the highest traffic volume by borough, helping to pinpoint areas with significant traffic congestion.
1SELECT direction, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY direction;
This query provides a breakdown of traffic volume by direction, helping to understand the flow of traffic.
1SELECT yr, month, COUNT(*) AS RecordCount
2FROM nyc_traffic
3GROUP BY yr, month
4ORDER BY yr, month;
This query provides a count of records for each month and year combination, allowing you to identify patterns and trends over time.
1SELECT street, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY street
4ORDER BY TotalTrafficVolume DESC
5LIMIT 10;
This query identifies the top 10 busiest streets by total traffic volume, helping to prioritize infrastructure improvements.
1SELECT hh, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY hh
4ORDER BY TotalTrafficVolume DESC
5LIMIT 1;
This query finds the hour of the day with the highest traffic volume, aiding in traffic management and scheduling.
1SELECT yr, boro, street, fromstreet, tostreet, direction, SUM(vol) AS TotalTrafficVolume
2FROM nyc_traffic
3GROUP BY ALL
4ORDER BY yr, TotalTrafficVolume DESC;
This query provides a comprehensive view of total traffic volume, including multiple attributes like year, borough, streets, and direction.
These sample SQL queries should help you get started with analyzing the NYC traffic dataset. Depending on your specific analysis goals, you can customize and expand these queries to gain deeper insights into traffic patterns and volumes. As a data engineer, your skills in data manipulation and SQL will play a crucial role in extracting valuable information from the dataset.
1CREATE EXTERNAL TABLE ex_nyc_traffic (
2 requestid bigint,
3 boro Text,
4 yr int,
5 month int,
6 dd int,
7 hh int,
8 mm int,
9 vol int,
10 segmentid bigint,
11 wktgeom Text,
12 street Text,
13 fromstreet Text,
14 tostreet Text,
15 direction Text
16) URL = 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyctraffic/parquet/' OBJECT_PATTERN = '*.parquet'
17TYPE = (PARQUET);
1CREATE TABLE nyc_traffic (
2 requestid bigint,
3 boro Text,
4 yr int,
5 month int,
6 dd int,
7 hh int,
8 mm int,
9 vol int,
10 segmentid bigint,
11 wktgeom Text,
12 street Text,
13 fromstreet Text,
14 tostreet Text,
15 direction Text) PRIMARY INDEX yr,month,dd, hh, mm, boro, street;
1INSERT INTO nyc_traffic SELECT * FROM ex_nyc_traffic;