Accelerating GIS Workflows with Geodesic

January 16, 2024

Link to Storymap with additional details

The Geodesic Platform

A very typical use of GIS software is management of land records such as tax parcels. By applying spatial analysis techniques, a GIS analyst can gain new insights about a set of geospatial features such as parcels, building footprints, municipal boundaries, and more.

For instance, say you wish to associate information from many sources with these parcels. This could include elevation for a flood risk model, or pollution data such as methane or other greenhouse gases to understand where the major polluters in an area are. These datasets may come from disparate sources, formats, and APIs. This poses a major challenge for a GIS analyst or data scientist, who would be much more effective if they could focus on answering their question rather than spending significant time gathering, reformatting, reprojecting, and transforming data.

Traditionally, this work is done by first obtaining the data as files (e.g., Esri Shapefiles or GeoJSON) and loading them into a desktop GIS environment or a geospatially enabled database through some ETL (Extract, Transform, Load) process. Obtaining the data can take considerable time and effort, and what’s worse, the analyst is stuck with the limited compute resources of their desktop or database server to run what could be very expensive and complex analytics. To compound these challenges, many modern datasets are ballooning in size, ranging from many gigabytes to terabytes or even petabytes of data.

Geodesic is the world’s first and only Data Mesh optimized for spatiotemporal data

While a Data Mesh means different things to different people, a few aspects are critical:

  1. Decentralized Data Model that allows uniform data access. Rather than pooling all data in a centralized repository, work with authoritative data at rest.
  2. Semantic Layer to enable discovery. Just being able to list or search for datasets is not enough: users need to be able to understand what the data is useful for.
  3. Data as a Product. Too often, the results of analysis fall on the floor or sit stagnant on an analyst’s desktop. All data, including data created through analysis, should be available in the same way as any other data source.

Geodesic is a cloud-native platform that implements a Data Mesh architecture through three main components: Boson, Entanglement, and Tesseract.

  1. Boson acts as an access layer for virtually any data source, particularly optimized for GIS data.
  2. Entanglement helps us organize Data Products in a knowledge graph so that data is discoverable and put into an appropriate use-case context.
  3. Tesseract lets us run massive scale analytics jobs. This includes anything from basic GIS workflows like zonal statistics and spatial joins, to advanced analytics such as machine learning.

With Geodesic, we do things a bit differently from the traditional file-and-ETL approach. In this article, we focus specifically on using Boson to access data and Tesseract to scale up a simple GIS workflow: zonal statistics on 1.4 million parcels with a digital elevation model (DEM).

Unless otherwise noted, all feature layers on the maps are served via the Boson Data Mesh and are direct results of analytics run in Tesseract.

Parcel Enrichment

One very common application of GIS is management of land records called parcels. Parcels give governments and other organizations a way to manage land by assigning a geographic boundary and a way to register them with other attributes for purposes of taxation, right-of-way management, and more.

Beyond traditional usage of parcels, it’s fairly common to join other datasets to parcel records in order to begin asking questions about the parcels for a region. Which parcels are at risk of flooding? Where is there high pollution and which parcels may be affected? Here we used Geodesic to begin to answer these questions for Cook County, Illinois.

The process of summarizing a geospatial dataset, such as a digital elevation model, over a polygon feature, such as a parcel, is called zonal statistics. For instance, if we wanted to know the average elevation of the ground inside a land parcel, we could use zonal statistics to estimate it.
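To make the idea concrete, here is a minimal pure-Python sketch of zonal statistics for a single polygon. It is not how Geodesic or Tesseract computes the result; it assumes a tiny in-memory raster, a square pixel size, and a "pixel center inside the polygon" rule, and the function names and toy values are made up for illustration.

```python
from statistics import mean


def point_in_polygon(x, y, ring):
    """Ray-casting test: is (x, y) inside the polygon ring [(x0, y0), ...]?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zonal_stats(raster, origin, pixel_size, ring):
    """Summarize the raster cells whose centers fall inside the polygon.

    raster:     list of rows, with row 0 at the northern edge
    origin:     (x, y) of the raster's upper-left corner
    pixel_size: cell width and height in map units (assumed square)
    """
    x0, y0 = origin
    values = []
    for r, row in enumerate(raster):
        for c, value in enumerate(row):
            cx = x0 + (c + 0.5) * pixel_size  # pixel center, x
            cy = y0 - (r + 0.5) * pixel_size  # pixel center, y (rows go south)
            if point_in_polygon(cx, cy, ring):
                values.append(value)
    if not values:
        return None
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Toy 4x4 elevation grid (meters), 10 m pixels, upper-left corner at (0, 40).
dem = [
    [180.0, 181.0, 182.0, 183.0],
    [181.0, 182.0, 183.0, 184.0],
    [182.0, 183.0, 184.0, 185.0],
    [183.0, 184.0, 185.0, 186.0],
]
# Toy square "parcel" covering the four central pixels.
parcel = [(10.0, 10.0), (30.0, 10.0), (30.0, 30.0), (10.0, 30.0)]
stats = zonal_stats(dem, (0.0, 40.0), 10.0, parcel)
print(stats)
```

At the scale of 1.4 million parcels, the same per-polygon summary has to be repeated for every feature, which is why the work benefits from being split up and run in parallel.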

Parcels for Cook County, Illinois overlaid on a Hillshade Terrain Layer. The goal of zonal statistics here is to calculate statistics of the digital terrain model for each parcel. Source

Cook County, Illinois has approximately 1.4 million parcels. In GeoJSON format, this equates to approximately 3.5 gigabytes of data. This isn’t a particularly large dataset, but it poses a non-trivial challenge for quickly computing zonal statistics for Parcel Enrichment.

Data Access

One challenge nearly all organizations have is data fragmentation. Some data is stored within enterprise databases, some locally on desktops, and some in file shares, cloud warehouses, and so on. Oftentimes large organizations don’t have a coherent data strategy that spans divisions within the company. The GIS, Data Science, Business Intelligence, and other teams do not talk to each other, even though much of their data could be shared.

One of the goals of a Data Mesh architecture is to break down silos within and across organizations and unify the way that data are managed. We built Geodesic because this problem is especially difficult when working with geospatial data. The GIS industry has slowly begun to migrate from file-based or point geodatabase solutions to a way of doing business centered on web services and cloud deployment. Unfortunately, for many organizations the migration path isn’t easy, and the need to support legacy technology means it is neither straightforward nor scalable. Geodesic dramatically simplifies the process of combining data from many disparate sources into a unified system that does not require users to dramatically alter their current workflows.

As mentioned above, our parcel dataset for Cook County consists of a 3.5 GB GeoJSON file. This is hardly an ideal format for analytics or for sharing on a web map. Fortunately, Boson makes using any geospatial data source easy and scalable. Boson acts as an interoperability layer for geospatial data flowing both in and out. In many ways, it can be thought of as a CDN for geospatial data. For all examples shown here, Boson is emulating the GeoServices REST API (ArcGIS REST Service compatible), but none of the data it serves is stored in ArcGIS Services. Of course, Boson is perfectly capable of connecting to data stored in ArcGIS Services as well. This flexibility means that Boson acts as a glue binding all geospatial data sources together without knowing or caring where the source data is located. Organizations can store data in separate systems as appropriate for each source while still breaking down internal data silos.

Boson: The unification of geospatial data access and usage with a massively scalable architecture to back it.

All we need to do is point Boson at whichever data sources we wish to use. No database or ETL required! We can now access these three datasets in Geodesic:

  • Cook County Parcels (GeoJSON) – 1.4 million polygon features
  • World Terrain (ArcGIS Living Atlas)

Boson can use data from many different sources, including various APIs and databases. This includes data in STAC APIs/Catalogs, Google Earth Engine, any ArcGIS Source, the Microsoft Planetary Computer, Snowflake, Elasticsearch, and more. If Boson does not currently support your data source, or you are using a particularly esoteric data source, Boson can be extended with a Python SDK.

The above code is all that it takes to add a dataset to Boson. Each of these datasets can now be used as if they all came from the same source, without doing any ETL! In addition, all of these datasets can be mixed together using Views, Unions, and Joins, as well as transforms on the pixels or features, without requiring any change to the underlying data. This means that the data authority does not need to make changes for individual users; instead, users can request data in the form they need.
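The idea of uniform access plus non-destructive views can be illustrated with a small, generic Python sketch. This is not the Geodesic or Boson API: every class name here (`FeatureSource`, `GeoJSONSource`, `TableSource`, `UnionSource`, `View`) is hypothetical, and the toy records are invented. It only shows the pattern of wrapping different sources behind one interface and filtering lazily without touching the underlying data.

```python
from typing import Callable, Iterable, Iterator, Protocol

Feature = dict  # GeoJSON-style feature: {"type", "geometry", "properties"}


class FeatureSource(Protocol):
    """Hypothetical common interface: anything that can yield features."""

    def features(self) -> Iterator[Feature]: ...


class GeoJSONSource:
    """Wraps an already-parsed GeoJSON FeatureCollection."""

    def __init__(self, collection: dict):
        self._collection = collection

    def features(self) -> Iterator[Feature]:
        yield from self._collection["features"]


class TableSource:
    """Wraps tabular rows with lon/lat columns and exposes them as points."""

    def __init__(self, rows: Iterable[dict], x: str = "lon", y: str = "lat"):
        self._rows, self._x, self._y = list(rows), x, y

    def features(self) -> Iterator[Feature]:
        for row in self._rows:
            props = {k: v for k, v in row.items() if k not in (self._x, self._y)}
            yield {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [row[self._x], row[self._y]]},
                "properties": props,
            }


class UnionSource:
    """Concatenates several sources behind the same interface."""

    def __init__(self, *sources: FeatureSource):
        self._sources = sources

    def features(self) -> Iterator[Feature]:
        for source in self._sources:
            yield from source.features()


class View:
    """A lazy filter over another source; the source itself is never modified."""

    def __init__(self, source: FeatureSource, predicate: Callable[[Feature], bool]):
        self._source, self._predicate = source, predicate

    def features(self) -> Iterator[Feature]:
        return (f for f in self._source.features() if self._predicate(f))


# Toy data, invented for illustration only.
parcels = GeoJSONSource({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [-87.63, 41.88]},
         "properties": {"id": "p1", "class": "residential"}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [-87.70, 41.90]},
         "properties": {"id": "p2", "class": "commercial"}},
    ],
})
sensors = TableSource([{"lon": -87.65, "lat": 41.85, "id": "s1", "class": "sensor"}])

everything = UnionSource(parcels, sensors)
residential = View(everything, lambda f: f["properties"]["class"] == "residential")

all_ids = [f["properties"]["id"] for f in everything.features()]
residential_ids = [f["properties"]["id"] for f in residential.features()]
print(all_ids, residential_ids)
```

Because the view only filters at read time, two users can ask for different slices of the same authoritative source without either one requiring a copy or a schema change.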

For this article, we’ll focus on enriching parcels with a digital elevation model, but in principle, any dataset we can connect to with Boson will work with no material changes to the process.

The above graphic shows the workflow we will be demonstrating. Boson is used to access both data sources and Tesseract uses Boson to generate a fused layer of elevation and parcel geometries.

Analysis

Now that we have access to the data we need, we can run our analysis. For this we use a tool in Geodesic called Tesseract.

Tesseract allows the user to do arbitrary spatiotemporal (time and space) processing on as many input datasets as the user likes. This can be anything from geoprocessing, such as zonal statistics or spatial joins, to advanced analytics, such as deep learning.

Tesseract is a massive-scale Data Fusion tool that can run arbitrary processing on spatiotemporal tile workflows.

Tesseract works by dividing a region and time range into small, optionally overlapping spatiotemporal extents called tiles. Each of these tiles can be processed in parallel by hundreds of workers.
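The tiling idea can be sketched in a few lines of Python. This is a simplified illustration rather than Tesseract's implementation: it tiles only in space (a time range would add another axis), uses a local thread pool as a stand-in for distributed workers, and the function names `make_tiles` and `process_tile` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil


def make_tiles(bbox, tile_size, overlap=0.0):
    """Split bbox = (xmin, ymin, xmax, ymax) into a grid of tiles.

    Each tile is grown by `overlap` on every side and clipped to the bbox,
    so neighboring tiles share a margin when overlap > 0.
    """
    xmin, ymin, xmax, ymax = bbox
    nx = ceil((xmax - xmin) / tile_size)
    ny = ceil((ymax - ymin) / tile_size)
    tiles = []
    for j in range(ny):
        for i in range(nx):
            x0 = xmin + i * tile_size
            y0 = ymin + j * tile_size
            tiles.append((
                max(xmin, x0 - overlap),
                max(ymin, y0 - overlap),
                min(xmax, x0 + tile_size + overlap),
                min(ymax, y0 + tile_size + overlap),
            ))
    return tiles


def process_tile(tile):
    """Stand-in for real per-tile work; here it just returns the tile's area."""
    x0, y0, x1, y1 = tile
    return (x1 - x0) * (y1 - y0)


bbox = (0.0, 0.0, 4.0, 2.0)
tiles = make_tiles(bbox, tile_size=1.0)

# Tiles are independent, so they can be handed to workers in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    areas = list(pool.map(process_tile, tiles))

print(len(tiles), sum(areas))
```

Overlap matters for operations that need context beyond a tile's edge, such as a parcel that straddles two tiles or a model with a spatial receptive field.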

For basic tasks, the user only needs to specify the area over which the job should run, the data inputs, and properties such as spatial references and pixel sizes. For more advanced workflows, the user may specify a model container that executes arbitrary Python code or choose a preexisting model to run, such as Zonal Statistics.

The Results

As mentioned above, Boson is capable of emulating a number of different Geospatial APIs to use for output, including ArcGIS compatible REST services. This means that even though we ran a processing job on potentially hundreds of machines in the cloud, the results can be used in ArcGIS Pro, Online, and Enterprise (or anything else that can consume ArcGIS services) without further copying or moving of the data.
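As an illustration of what "ArcGIS compatible" means for a client, the sketch below builds a feature query URL using standard GeoServices REST query parameters (`where`, `geometry`, `geometryType`, `inSR`, `spatialRel`, `outFields`, `f`). The host, service path, and the `min_elevation` field name are placeholders and assumptions, not actual Geodesic endpoints or output fields.

```python
from urllib.parse import urlencode


def build_query_url(layer_url, bbox, out_fields="*", where="1=1"):
    """Build a GeoServices-style /query URL for features intersecting a bbox.

    layer_url: URL of a feature layer, e.g. .../FeatureServer/0
    bbox:      (xmin, ymin, xmax, ymax) in WGS84 longitude/latitude
    """
    xmin, ymin, xmax, ymax = bbox
    params = {
        "where": where,
        "geometry": f"{xmin},{ymin},{xmax},{ymax}",
        "geometryType": "esriGeometryEnvelope",
        "inSR": 4326,
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": out_fields,
        "f": "json",
    }
    return f"{layer_url.rstrip('/')}/query?{urlencode(params)}"


# Placeholder URL and hypothetical field name, for illustration only.
url = build_query_url(
    "https://example.com/rest/services/parcels/FeatureServer/0",
    (-87.7, 41.8, -87.6, 41.9),
    out_fields="min_elevation",
)
print(url)
```

Any client that already speaks this protocol can consume such a layer without knowing where the underlying data lives.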

Individual parcels colored by the minimum elevation of each parcel.

Summary

In this article, we demonstrated how to perform a simple zonal statistics workflow on 1.4 million parcels using Geodesic.

  • At no point did we need to write data to a database
  • Results are available to web maps through OGC, ArcGIS, and other APIs
  • We were able to leverage the power of as many computers as needed without managing our own complex cloud processing architecture
  • We could use inputs from any source without ETL