Tom's Blog

My Real-World Match / Case

Ned Batchelder recently shared Real-world match/case, showing a real example of Python’s Structural Pattern Matching. These real-world examples are a great complement to the tutorial, so I’ll share mine. While working on some STAC + Kerchunk stuff, in this pull request I used the match statement to parse some nested objects: for k, v in refs.items(): match k.split("/"): case [".zgroup"]: # k = ".zgroup" item.properties["kerchunk:zgroup"] = json.loads(v) case [".zattrs"]: # k = "....

STAC Updates I'm Excited About

I wanted to share an update on a couple of developments in the STAC ecosystem that I’m excited about. It’s a great sign that even 2 years after its initial release, the STAC ecosystem is still growing and improving how we can catalog, serve, and access geospatial data. STAC and Geoparquet A STAC API is a great way to query for data. But, like any API serving JSON, its throughput is limited....

Gone Rafting

Last week, I was fortunate to attend Dave Beazley’s Rafting Trip course. The pretext of the course is to implement the Raft Consensus Algorithm. I’ll post more about Raft, and the journey of implementing it, later. But in brief, Raft is an algorithm that lets a cluster of machines work together to reliably do something. If you had a service that needed to stay up (and stay consistent), even if some of the machines in the cluster went down, then you might want to use Raft....

National Water Model on Azure

A few colleagues and I recently presented at the CIROH Training and Developers Conference. In preparation for that, I created a Jupyter Book. You can view it at https://tomaugspurger.net/noaa-nwm/intro.html. I created a few cloud-optimized versions for subsets of the data, but those will be going away since we don’t have operational pipelines to keep them up to date. But hopefully the static notebooks are still helpful. Lessons learned Aside from running out of time (I always prepare too much material for the amount of time), I think things went well....

Jupyter, STAC, and Tool Building

Over in Planetary Computer land, we’re working on bringing Sentinel-5P into our STAC catalog. STAC items require a geometry property, a GeoJSON object that describes the footprint of the assets. Thanks to the satellites’ orbit and the (spatial) size of the assets, we started with some…interesting… footprints: That initial footprint, shown in orange, would render the STAC collection essentially useless for spatial searches. The assets don’t actually cover (most of) the southern hemisphere....

py-spy in Azure Batch

Today, I was debugging a hanging task in Azure Batch. This short post records how I used py-spy to investigate the problem. Background Azure Batch is a compute service that we use to run container workloads. In this case, we start up a container that processes a bunch of GOES-GLM data to create STAC items for the Planetary Computer. The workflow is essentially a big for url in urls: local_file = download_url(url) stac....

Dask-GeoPandas Spatial Partitioning Performance

A colleague reached out yesterday about a performance issue they were hitting when working with the Microsoft Building Footprints dataset we host on the Planetary Computer. They wanted to get the building footprints for a small section of Turkey, but noticed that the performance was relatively slow and it seemed like a lot of data was being read. This post details how we debugged what was going on, and the steps we took to fix it....

Planetary Computer Release: January 2023

The Planetary Computer made its January 2023 release a couple of weeks back. The flagship new feature is a really cool new ability to visualize the Microsoft AI-detected Buildings Footprints dataset. Here’s a little demo made by my teammate, Rob: [demo video] Currently, enabling this feature requires converting the data from its native geoparquet to a lot of protobuf files with Tippecanoe....

Cloud Optimized Vibes

Over on the Planetary Computer team, we get to have a lot of fun discussions about doing geospatial data analysis on the cloud. This post summarizes some work we did, and the (I think) interesting conversations that came out of it. Background: GOES-GLM The instigator in this case was onboarding a new dataset to the Planetary Computer, GOES-GLM. GOES is a set of geostationary weather satellites operated by NOAA, and GLM is the Geostationary Lightning Mapper, an instrument on the satellites that’s used to monitor lightning....

Queues in the News

I came across a couple of new (to me) uses of queues recently. When I came up with the title to this article I knew I had to write them up together. Queues in Dask Over at the Coiled Blog, Gabe Joseph has a nice post summarizing a huge amount of effort addressing a problem that’s been vexing demanding Dask users for years. The main symptom of the problem was unexpectedly high memory usage on workers, leading to crashing workers (which in turn caused even more network communication, and so more memory usage, and more crashing workers)....