New York City is known for its diverse culinary scene, with thousands of restaurants serving a wide array of cuisines. To ensure the safety and hygiene of these establishments, the NYC Department of Health conducts regular inspections and maintains a comprehensive dataset of inspection records. In this tutorial, we'll walk you through the NYC restaurant inspections dataset's schema and provide sample SQL queries with explanations to help you get started.
Before diving into the analysis, it's crucial to understand the dataset's schema. Here's a tabular representation of the dataset.
Column Name | Data type | Description |
---|---|---|
camis | Int | Unique identifier for each restaurant |
dba | Text | Doing Business As (DBA) name of the restaurant |
boro | Text | Borough where the restaurant is located |
building | Text | Building number of the restaurant |
street | Text | Street address of the restaurant |
zipcode | Text | ZIP code of the restaurant |
phone | Text | Phone number of the restaurant |
cuisine_description | Text | Description of the restaurant's cuisine |
inspection_date | Text | Date of the inspection |
action | Text | Action taken during the inspection |
violation_code | Text | Code indicating the type of violation |
violation_description | Text | Description of the violation |
critical_flag | Text | Flag indicating if the violation is critical |
score | Numeric | Inspection score |
grade | Text | Inspection grade (A,B,C,Z-Pending,N-Not yet graded, Z-Pending,P-Pending issued on reopening) |
grade_date | Date | Date of the inspection grade |
record_date | Date | Date when the record was created |
inspection_type | Text | Type of inspection |
latitude | Numeric | Latitude coordinates of the restaurant |
longitude | Numeric | Longitude coordinates of the restaurant |
community_board | Text | Community board where the restaurant is located |
council_district | Text | Council district where the restaurant is located |
census_tract | Text | Census tract where the restaurant is located |
bin | Text | Building Identification Number (BIN) |
bbl | Text | Borough-Block-Lot (BBL) |
nta | Text | Neighborhood Tabulation Area (NTA) |
location_point1 | Text | Location coordinates (point) in text format |
Now that you're familiar with the dataset's schema, let's explore how to load the data from an S3 bucket and run some sample SQL queries to extract useful information.
Loading data into Firebolt can be done one of two ways. You can use the “COPY FROM” command to perform schema inference and load the data.
Option 1: The first approach is to use the “COPY FROM” statement to copy data from the external bucket.
1COPY INTO nyc_restaurant_inspections FROM 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyc_restaurant_inspections/parquet/'
2WITH PATTERN="*.parquet" AUTO_CREATE=TRUE TYPE=PARQUET;
Option 2: The second way is to use the concept of an external table to move data into Firebolt. We will use the second approach here using the following three steps.
1CREATE EXTERNAL TABLE ex_nyc_restaurant_inspections (
2 camis INTEGER,
3 dba TEXT,
4 boro TEXT,
5 building TEXT,
6 street TEXT,
7 zipcode TEXT,
8 phone TEXT,
9 cuisine_description TEXT,
10 inspection_date TEXT,
11 action TEXT,
12 violation_code TEXT,
13 violation_description TEXT,
14 critical_flag TEXT,
15 score NUMERIC,
16 grade TEXT,
17 grade_date DATE,
18 record_date DATE,
19 inspection_type TEXT,
20 latitude NUMERIC,
21 longitude NUMERIC,
22 community_board TEXT,
23 council_district TEXT,
24 census_tract TEXT,
25 bin TEXT,
26 bbl TEXT,
27 nta TEXT,
28 location_point1 TEXT
29)URL = 's3://firebolt-sample-datasets-public-us-east-1/nyc_sample_datasets/nyc_restaurant_inspections/parquet/'
30OBJECT_PATTERN = '*.parquet'
31TYPE = (PARQUET);
1CREATE TABLE nyc_restaurant_inspections(
2 camis INTEGER,
3 dba TEXT,
4 boro TEXT,
5 building TEXT,
6 street TEXT,
7 zipcode TEXT,
8 phone TEXT,
9 cuisine_description TEXT,
10 inspection_date TEXT,
11 action TEXT,
12 violation_code TEXT,
13 violation_description TEXT,
14 critical_flag TEXT,
15 score NUMERIC,
16 grade TEXT,
17 grade_date DATE,
18 record_date DATE,
19 inspection_type TEXT,
20 latitude NUMERIC,
21 longitude NUMERIC,
22 community_board TEXT,
23 council_district TEXT,
24 census_tract TEXT,
25 bin TEXT,
26 bbl TEXT,
27 nta TEXT,
28 location_point1 TEXT
29);
1INSERT INTO
2 nyc_restaurant_inspections
3SELECT
4 camis,
5 dba,
6 boro,
7 building,
8 street,
9 zipcode,
10 phone,
11 cuisine_description,
12 inspection_date,
13 action,
14 violation_code,
15 violation_description,
16 critical_flag,
17 score,
18 grade,
19 grade_date,
20 record_date,
21 inspection_type,
22 latitude,
23 longitude,
24 community_board,
25 council_district,
26 census_tract,
27 bin,
28 bbl,
29 nta,
30 location_point1
31FROM
32 ex_nyc_restaurant_inspections;
Query 1: Find the Total Number of Inspections by Borough
1SELECT boro, COUNT(*) as total_inspections
2FROM nyc_restaurant_inspections
3GROUP BY boro
4ORDER BY total_inspections DESC;
This query calculates the total number of inspections conducted in each borough and presents the results in descending order of inspection counts.
Query 2: Identify the Most Common Cuisine Types
1SELECT cuisine_description, COUNT(*) as count
2FROM nyc_restaurant_inspections
3GROUP BY cuisine_description
4ORDER BY count DESC
5LIMIT 10;
This query identifies the top 10 most common cuisine types in NYC restaurants based on the dataset.
Query 3: Find the Restaurants with the Highest Scores
1SELECT dba, boro, max(score) as score
2FROM nyc_restaurant_inspections
3GROUP BY ALL
4ORDER BY score DESC
5LIMIT 10;
This query retrieves the top 10 restaurants with the highest inspection scores, along with their boroughs.
Query 4: Analyze Violations by Critical Flag
1SELECT critical_flag, COUNT(*) as count
2FROM nyc_restaurant_inspections
3GROUP BY critical_flag;
This query provides an overview of the number of inspections categorized by critical and non-critical violations.
Query 5: Calculate the Average Inspection Score by Cuisine Type
1SELECT cuisine_description, AVG(score) as avg_score
2FROM nyc_restaurant_inspections
3GROUP BY cuisine_description
4ORDER BY avg_score DESC
5LIMIT 10;
This query calculates the average inspection scores for different cuisine types and presents the top 10 with the highest average scores.
These sample SQL queries should help you get started with analyzing the NYC restaurant inspections dataset. Depending on your specific analysis goals, you can customize and expand these queries to gain deeper insights into the data. As a data engineer, your skills in data manipulation and SQL will play a crucial role in uncovering meaningful patterns and trends in the dataset.
Tune in to the 'Data Engineering Show' to see how the fastest growing tech companies handle their data challenges
Real talk, no fluff.
Real talk, no fluff.