Listen to this article
In part I of this series of blog posts, we talked about the differences between GEOMETRY and GEOGRAPHY, as well as the S2 library that powers Firebolt’s GEOGRAPHY data type. In this post we’ll take a deeper look into how Firebolt processes and stores geospatial data to enable fast and efficient query execution.
This post will cover:
1. How geospatial data is validated and normalized upon ingestion
2. How Firebolt structures and stores GEOGRAPHY objects using S2 cells and shape indexes
3. How Firebolt optimizes query performance with geospatial data pruning and partitioning
By the end, you'll have a comprehensive understanding of how Firebolt prepares and optimizes geospatial data behind the scenes. Let’s dive in.
Validating and normalizing inputs
Geospatial data is ingested into Firebolt through one of three standard formats: Well Known Text (WKT), Well Known Binary (WKB), or GeoJSON. These inputs are immediately converted into Firebolt’s native representation based on the S2 geometry library.
A GEOGRAPHY object in Firebolt can contain different types of geometric components, such as points, line strings, and polygons, and in the case of a GEOMETRYCOLLECTION, a combination of these. All components of a single GEOGRAPHY object are stored inside an S2ShapeIndex. A shape index is S2’s most generic abstraction. It can store any combination of points, line strings, or polygons. S2 also provides functions to compute relations between two shape indexes, such as containment, and intersection which we use to implement the functions exposed through Firebolt’s SQL interface.
However, many functions working on shape indexes require that within a shape index, no component intersects a polygon’s interior. Firebolt ensures this in a normalization step by merging intersecting polygons, removing any component that is completely contained in a polygon, and cutting line strings at polygon boundaries. This ensures that later operations can be as fast as possible because they don’t need to validate the input shapes. In some cases, like wrong polygon orientation, or self-intersecting polygons, Firebolt also fixes invalid inputs while converting them into the S2 format. Our documentation covers the cases where Firebolt fixes invalid inputs or applies normalization in more detail.
The example below shows an example of a GEOGRAPHY object before and after normalization. Both polygons are merged and the line string is cut into three pieces.
data:image/s3,"s3://crabby-images/e8527/e8527233bb271f9efc2f87bc87571473e9067caf" alt=""
Storing S2 cells to enable fast functions
Geospatial queries, especially those involving intersection, containment, and distance calculations, can be computationally expensive. A naïve approach would require scanning every geospatial object in a dataset to determine which ones match a given query—an operation that does not scale well as data volume grows. Efficient indexing and filtering mechanisms are essential to keep query execution fast and scalable.
Many of the performance optimizations Firebolt uses for geospatial data use S2 cells and coverings. S2 cells are a hierarchical partitioning of the Earth's surface, designed to represent spatial data at various levels of precision. The example below shows two of 6 cells of the highest level of the hierarchy. The right cell is further subdivided 4 times into lower level cells.
data:image/s3,"s3://crabby-images/10bee/10bee98ea740e75634e5605d415f1facadaf8cb7" alt=""
A covering, in this context, refers to a set of S2 cells that together encompass a given region. For example, in the image below, a polygon outlining Florida is approximated by the shown covering of 22 S2 cells at different levels of the hierarchy.
data:image/s3,"s3://crabby-images/ce3b8/ce3b8eed3a3c7fe4f18dc086439761b2fd623093" alt=""
One of the key advantages of using S2 cells is the efficiency of intersection and containment tests. Since each cell is represented by a single integer, determining whether one cell intersects with another can be done using simple integer comparisons, which are extremely fast.
After creating a valid shape index, we compute a single S2 cell that fully covers the entire shape index (footnote: Finding a single cell that fully covers a shape index is not always possible because the largest cells cover only 1/6th of earth. In these rare cases, Firebolt can also store multiple cells for a single shape index). This cell is persisted in our storage format and used to speed up function evaluation on GEOGRAPHY objects. You can learn more about these optimizations in part III of this blog post series.
Data pruning for geospatial predicates
Data pruning is one of the most important performance optimizations in a database. By using aggregated information about large parts of the data, Firebolt can often quickly deduce that this data does not need to be read at all. For types like integers, timestamps or even strings, we compute statistics such as the minimum and maximum value of a column within a tablet while writing to Firebolt storage. These statistics can then be used for pruning during query execution, i.e. skipping entire tablets if all the data inside is filtered out by the WHERE condition of a query. Because geospatial objects cannot be ordered, there are no minimums or maximums for GEOGRAPHY columns that could be used for pruning. Instead, we compute a covering of S2 cells that covers the entire column within the tablet.
The coverings stored for each tablet can be used for queries using geospatial functions as filters. Currently, these are ST_COVERS(A, B), ST_CONTAINS(A, B), and ST_INTERSECTS(A, B), where either A or B is a constant, and the pattern ST_DISTANCE(A, B) < C, where C is a constant number and either A or B is a constant GEOGRAPHY object. Firebolt’s query optimizer recognizes these patterns and adds an additional filter that skips a tablet if its covering does not intersect the covering of the constant input. For the ST_DISTANCE(A, B) < C pattern with constant A and C, the covering for A is constructed such that it contains everything within a distance C of A.
The image below illustrates two tablets that contain all of the displayed points. The purple and gray boxes represent the coverings stored for the tablets. When a query is executed to retrieve all points within the polygon shown, the red covering is computed. The purple covering corresponding to the first tablet overlaps the red covering, so the first tablet is scanned. The gray covering corresponding to the second tablet does not overlap the red covering, so the second tablet is completely skipped.
data:image/s3,"s3://crabby-images/43d22/43d22b8d3048e863589c4aca90c36203fffab267" alt=""
Improving pruning performance using partitions
Depending on how you ingest data, tablets might have objects scattered across the globe which can severely impact the effectiveness of pruning entire tablets. In order to improve this, an option is to partition the data by a spatial attribute using the PARTITION BY clause when creating a table. This could be a zip code (or a prefix of the zip code), or the ID of a point’s S2 cell at a given level using the ST_S2CellIdFromPoint function. By partitioning the data spatially, every tablet only contains objects from a single partition so the tablet pruning conditions can work at full effectiveness.
However, it's important to avoid creating too many partitions, as this can negatively impact performance. Ideally, the partitioning key should have a moderate number of values—around 100 partitions is a reasonable target. For zip codes, this could involve using the first two digits, while for S2 cell IDs, the appropriate cell level can be determined based on the average area of cells at that level. For example, when dealing with points distributed across the land mass of the Earth, a cell level of 3 would be suitable, whereas for points within the USA, a cell level of 5 would be more appropriate.
Outlook: GEOGRAPHY in the Primary Index
Currently, Firebolt does not support GEOGRAPHY columns in the primary index. In the future, when this feature is added, geography data will be sorted using a method called a space-filling curve. Space-filling curves work by assigning every point on Earth a single number in a way that keeps nearby points assigned to similar numbers. This organization helps group related data together, making it faster to perform operations like checking whether areas overlap or whether a point is within a specific region. The same space-filling curve is also what powers S2 cells and enables fast intersection and containment checks using simple integer comparisons.
data:image/s3,"s3://crabby-images/826d7/826d76e5353479ae91e82dfe0efdddf89dba1435" alt=""
By using this ordering within a tablet, each range contains only a small region. We can then store an S2 cell covering of all rows within the range. This covering is then used to efficiently prune ranges in the same way as we prune tablets. In the future, we will also use S2 cell coverings to support fast spatial joins where optimized data structures on top of S2 cell coverings can help to quickly find join partners.
Now that we have learned about the fundamentals and data format of Firebolt’s GEOGRAPHY data type, we will dive into the details of snap rounding and performance optimizations for spatial operations in part III. In the meantime you can try out Firebolt’s geospatial functionality yourself using our demo project.