Querying and Ingestion with Firebolt (Github dataset)
January 17, 2023

Analyzing the GitHub Events Dataset using Firebolt - Querying with Streamlit


TL;DR: In this multi-part blog series we will take a look at the GithubArchive dataset, which consists of public events happening on GitHub. We will cover exploring the data using Streamlit, Jupyter and the Firebolt Python SDK, writing a small data app using the Firebolt JDBC driver, as well as leveraging Apache Airflow workflows to keep our data up to date. So stay tuned for all the blog posts!

Introduction

Before talking about the nitty-gritty of Firebolt, let's share a few words about the GithubArchive dataset. The GithubArchive is an ever-growing collection of public events happening on GitHub. Whenever you watch a repository, comment on an issue, open a pull request or commit some code, such an event is generated. The GithubArchive streams those events and creates a new JSON file for each hour of events. Each of those files is gzipped and can exceed 100MB compressed and 750MB uncompressed, containing more than 200,000 events.
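To get a feel for the raw data, here is a minimal sketch in Python (assuming the requests package is installed; the hourly file naming follows the pattern used by the project) that downloads a single hour of events and counts them by type:

```python
import gzip
import json
from collections import Counter

import requests

# Each hour of public GitHub events is published as a gzipped,
# newline-delimited JSON file, e.g. 2023-01-17-10.json.gz for
# 10:00-11:00 UTC on January 17, 2023.
url = "https://data.gharchive.org/2023-01-17-10.json.gz"

response = requests.get(url, timeout=60)
response.raise_for_status()

# One JSON document per line, each describing a single event.
events = [
    json.loads(line)
    for line in gzip.decompress(response.content).splitlines()
    if line
]

print(f"{len(events):,} events in this hour")
print(Counter(event["type"] for event in events).most_common(5))
```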

The data since 2015 contains more than 5 billion events. Of course we want to take a look at all of this data and see if we can get any insights out of that!

If you want to follow along, take a look at the first part of the YouTube video here or just keep reading!

In order to not get overwhelmed by the sheer amount of data, the first session gives an overview of the data: more than 5 billion events, more than 267 million repositories and more than 51 million users. Activity has only risen over the years, and judging by the event type, roughly half of the actions are commits. This also means that the other half is less about the code and more about interactions - and this is what makes GitHub a platform.

Github archive data

Going into more detail on a per-repository basis and looking at top committers and commenters, there is one outstanding trend: bots, and thus automation! Keep this in mind, as we will get to this later. The video shows a few more details about Elasticsearch, the Spring Framework, Spring Boot, Prometheus, Grafana as well as HashiCorp Vault.

Github archive data - repository info

Github archive data

Github archive data

Looking at the data from a single committer's perspective and their events over the years, you can sometimes infer when people handed over projects, became parents, switched jobs or started working on internal code.

Github archive data - user info

Equipped with this overview, it's time to take a look at a few queries to retrieve the above results, as every graph or metric basically reflects a query.

Streamlit allows for really fast prototyping of data applications, with very short and concise code for visualizations. In order to use those visualizations, there must be a connection to a Firebolt instance.

Setting up the Firebolt connection in one place so it can be reused across all the dashboards makes sense, and looks like this:
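The exact snippet isn't reproduced here, but a minimal sketch could look like the following, assuming the Firebolt Python SDK and Streamlit secrets for the credentials - the auth class and connect() parameters may differ slightly depending on your SDK version:

```python
import streamlit as st
from firebolt.client.auth import UsernamePassword
from firebolt.db import connect


# Cache the connection so every page of the app reuses it
# (use st.experimental_singleton on older Streamlit versions).
@st.cache_resource
def get_firebolt_connection():
    # Credentials live in .streamlit/secrets.toml, not in the code.
    return connect(
        auth=UsernamePassword(
            st.secrets["firebolt_user"], st.secrets["firebolt_password"]
        ),
        account_name=st.secrets["firebolt_account"],
        database=st.secrets["firebolt_database"],
        engine_name=st.secrets["firebolt_engine"],
    )
```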

Now we can reuse the above connection in every page of our Streamlit app, like in the Home.py main file:
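Again, this is a hedged sketch rather than the exact code from the app - the table and column names (github_events, repo_name) and the connection module are assumptions:

```python
# Home.py
import streamlit as st

# Hypothetical module holding the get_firebolt_connection() helper from above.
from connection import get_firebolt_connection

st.title("GitHub Archive events in Firebolt")

cursor = get_firebolt_connection().cursor()

# One query per headline metric.
cursor.execute("SELECT count(*) FROM github_events")
total_events = cursor.fetchone()[0]

cursor.execute("SELECT count(DISTINCT repo_name) FROM github_events")
total_repos = cursor.fetchone()[0]

# Render the metrics once all the queries have returned.
col1, col2 = st.columns(2)
col1.metric("Events", f"{total_events:,}")
col2.metric("Repositories", f"{total_repos:,}")
```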

The above code is just a small snippet of the whole app, but it shows the usage of get_firebolt_connection and how a SQL query is executed to store the result in a variable that is then used in a Streamlit visualization, once all the queries have returned.

It'd probably be a boring exercise to walk through every SQL statement in the Streamlit app, so instead of doing that, let's do something fun and talk about performance! Why is querying 5 billion events and retrieving the number of distinct users, repositories or event types so fast? By taking a shortcut and querying less data!

All of the queries used above use some sort of aggregation, be it a `count(*)` or a `count(distinct ...)`, as well as a date truncation down to the day for the WHERE-based filter.

The use of aggregating indexes prevents a full table scan. By accessing the precalculated data structures defined in the aggregating index, the amount of data that needs to be read is significantly reduced and thus the queries become much faster!

Imagine a `count(*)` grouped by the event type and the timestamp truncated to the day. Instead of having millions of rows looking like this:

Github archive data - table 1

This can be aggregated into an index like this:

Github archive data - table 2

Using such a data structure only requires on the order of event types times days summations, and there is no need to look at single events while the query is executing.
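As a hedged sketch of such an index - the table and column names (github_events, event_type, created_date) are assumptions, and the exact syntax may vary between Firebolt versions - the definition and a query it serves could look like this:

```sql
-- Precalculate daily counts per event type so that queries
-- never have to scan the raw events.
CREATE AGGREGATING INDEX agg_events_per_day ON github_events (
    event_type,
    created_date,
    COUNT(*)
);

-- A query like this can then be answered from the index alone.
SELECT event_type, created_date, COUNT(*)
FROM github_events
WHERE created_date >= '2022-01-01'
GROUP BY event_type, created_date;
```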

The third video of this series gives a first overview of the required tables in Firebolt. In order to ingest data into Firebolt, a so-called external table referring to the S3 bucket is required to access the data in S3. The actual ingestion into the table holding the data is basically a SELECT statement against the external table that extracts the JSON into separate fields for each row. In order to do so, Firebolt supports dedicated functions to extract data from JSON; a simple example looks like this:

JSON_EXTRACT(raw_json, '/id', 'TEXT')

Extracting JSON is more powerful than this, as you may also need to extract data from arrays or split the content of fields - all of which can be done by using functions during ingestion. While you could do this during querying, doing complex transformations during ingestion ensures that your queries stay fast.
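To make the ingestion flow a little more concrete, here is a hedged sketch; the bucket URL, table and column names are assumptions, and the external table options may differ slightly between Firebolt versions:

```sql
-- External table pointing at the gzipped, newline-delimited JSON files in S3.
CREATE EXTERNAL TABLE ex_github_events (
    raw_json TEXT
)
URL = 's3://my-gharchive-bucket/events/'
OBJECT_PATTERN = '*.json.gz'
TYPE = (JSON PARSE_AS_TEXT = 'TRUE')
COMPRESSION = GZIP;

-- Fact table that holds the extracted fields.
CREATE FACT TABLE github_events (
    id         TEXT,
    event_type TEXT,
    repo_name  TEXT,
    created_at TIMESTAMP
) PRIMARY INDEX created_at;

-- Ingestion is "just" a SELECT against the external table,
-- extracting each field from the raw JSON.
INSERT INTO github_events
SELECT
    JSON_EXTRACT(raw_json, '/id', 'TEXT'),
    JSON_EXTRACT(raw_json, '/type', 'TEXT'),
    JSON_EXTRACT(raw_json, '/repo/name', 'TEXT'),
    CAST(JSON_EXTRACT(raw_json, '/created_at', 'TEXT') AS TIMESTAMP)
FROM ex_github_events;
```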

The three videos in this blog post showed the first steps with the GitHub dataset, but we're far from done - there are three more posts coming soon!

If you have any questions, head over to help.firebolt.io and ask!

Also take a look at the Streamlit app in the GitHub repository.
