We had the pleasure of interviewing Aimee Barciauskas, a Data Engineer at Development Seed who is passionate about leveraging technology to make a positive social impact. Aimee specializes in developing cloud-native solutions for analyzing and processing geospatial data. She has contributed to NASA’s Multi-Mission Algorithm and Analysis Platform and The Visualization, Exploration, and Data Analysis (VEDA) Project, and is part of the Pangeo project and the Earth Science Information Partners (ESIP) Cloud Computing Cluster. We spoke with Aimee about her expertise in cloud-native geospatial solutions and her thoughts on the future of managing and processing geospatial data.
What inspired you to pursue a career in data engineering, and how did you become interested in working with geospatial data in the cloud?
I think my non-linear path demonstrates how following your values and personal interests can lead to fulfillment, even when the destination is one you never predicted. I started with a degree in economics, with a development focus, and never considered software development as a career. But I was excited about using data-backed empirical evidence to drive policymaking.
After graduating, I took a job as a project assistant at a nonprofit consulting firm, but it turned out to be an administrative role. While the consultants were smart and thoughtful, I found the work boring. It mostly involved making cold calls and arranging travel. I knew I wanted to do something more creative, so I switched to web design and front-end development. The first time I touched HTML and saw a webpage appear, it changed my life forever.
I returned to my love of statistics by earning a master’s in data science from the Barcelona Graduate School of Economics. Through this intensive program, I realized I wanted to be a developer who worked with scientists rather than a data scientist. The challenge of solving technical problems is something I enjoy.
I think I am no different from many in understanding that climate change is humanity’s biggest challenge. Discovering Development Seed and how it is using open-source technologies to tackle Earth’s biggest challenge felt like a perfect fit. I get to work with some of the smartest and most thoughtful people and the largest datasets in Earth science by working with NASA IMPACT.
Can you give us an overview of what cloud-native means and how it differs from traditional approaches to data?
Archival formats such as NetCDF, HDF5, and GeoTIFF require users to download entire files before being able to do any analysis. These files may be rich in metadata, but users are limited to their personal machines’ network and storage capacities.
Only so much information about the planet can be stored and analyzed by one machine. And data is growing in a variety of ways across the Earth observation sector. There has been a rise in private satellite companies alongside the coming launch of space agency missions like NASA and ISRO’s NISAR. The NISAR mission alone is estimated to generate 85 TB of data a day and is one of the missions that motivated NASA’s move to cloud storage.
But cloud-native means more than just storing data in the cloud: it means access to data without dependence on your local machine’s storage, so anyone with a network connection can run the same analysis.
There are two core implementations of cloud-native: data stored with co-located compute and data stored in cloud-optimized formats. Both help minimize the amount of data that needs to be transferred across networks.
Could you elaborate on the difference between data stored with co-located compute and data stored in cloud-optimized formats?
Data stored with co-located compute means storage and compute servers are in close physical proximity – often in the same data center, or even the same server. This reduces the latency to access and process the data because there is less physical distance for data to travel.
We did some cloud vs. on-premise data access performance testing for NASA’s Multi-Mission Algorithm and Analysis Platform (MAAP). MAAP provides access to data from the AWS cloud and data stored on-premise at NASA’s data centers via its JupyterHub development environment. We found that downloading a file from a NASA data center could take up to 13 times longer than transferring the same file within the AWS region (us-west-2) where the notebook hub was running.
In addition to where the data is stored, we need to address how the data is stored. Cloud-optimized data formats have metadata identifying data chunks based on various parameters (typically spatial extent, temporal extent, and data variable). The chunk structure defined by metadata enables client libraries to “lazily load” data: first reading the metadata, then fetching only the needed subset of the raw data via HTTP range requests. These formats are great because they enable parallel access, and with the advent of cloud services, they put computing that was previously only available via supercomputers within anyone’s reach.
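The pattern can be sketched in a few lines of plain Python. The store layout, chunk size, and function names below are all illustrative, not any real library’s API; real clients like zarr and rasterio implement the same idea with far more care:

```python
# Sketch of lazy loading against a chunked, cloud-optimized store.
# All names and numbers here are illustrative stand-ins.

CHUNK = 100  # each chunk covers 100 rows of a 1-D dataset

# The "metadata" read first: chunk index -> (byte offset, byte length).
index = {i: (i * 400, 400) for i in range(10)}  # 10 chunks of float32 rows

def ranges_for(start_row, stop_row):
    """Return the HTTP byte ranges needed for rows [start_row, stop_row)."""
    first = start_row // CHUNK
    last = (stop_row - 1) // CHUNK
    return [index[i] for i in range(first, last + 1)]

# Reading rows 150-249 touches only chunks 1 and 2, not the whole file.
print(ranges_for(150, 250))  # [(400, 400), (800, 400)]
```

The key point is that the client never downloads the file: it reads the small metadata index, computes which byte ranges intersect the request, and issues range requests for only those chunks.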
Looking ahead, what do you see as the biggest challenges and opportunities for innovation in cloud-native geospatial?
Just putting the data in the cloud is insufficient. NASA is moving its Earthdata archives to AWS Simple Storage Service (S3). While moving data to the cloud makes it accessible to anyone with a network connection, the major challenges of big data and data discovery remain significant obstacles for cloud-native geospatial applications.
I think the biggest opportunity to solve this challenge is convergence on STAC. I’m excited STAC is seeing wide adoption! STAC is the foundation by which many varied and rich applications are being built.
But there is still much work to do. First, it is still hard for people to find the data that is relevant to them. We are close to a world where someone can ask questions like “How much hotter is it today than the average for the past 100 years?” But we won’t get to answers without a shared approach to curating and accessing the datasets that best answer that question. The metadata I have seen in STAC catalogs can be inconsistent or insufficient, making cross-catalog search challenging. Further, STAC collection search has yet to be implemented. I would love to see STAC collection search and more consistency in STAC catalogs, perhaps through wider adoption of STAC validation tools like stac_pydantic.
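A small illustration of the consistency problem: before indexing items from an unfamiliar catalog, you might check that each one carries the fields a cross-catalog search would rely on. This is a hand-rolled sketch for illustration only; real validation should use a library like stac_pydantic or pystac, which validate the full STAC model:

```python
# Illustrative check for the metadata a cross-catalog STAC search needs.
# Field lists are simplified; use stac_pydantic/pystac for real validation.

REQUIRED = ("id", "geometry", "bbox", "properties")
REQUIRED_PROPS = ("datetime",)

def missing_fields(item: dict) -> list:
    """Return the names of required fields a STAC item lacks."""
    missing = [f for f in REQUIRED if f not in item]
    props = item.get("properties", {})
    missing += [f"properties.{p}" for p in REQUIRED_PROPS if p not in props]
    return missing

item = {
    "id": "scene-001",  # hypothetical item from an inconsistent catalog
    "geometry": {"type": "Point", "coordinates": [2.17, 41.38]},
    "properties": {},  # no datetime, so temporal search would miss it
}
print(missing_fields(item))  # ['bbox', 'properties.datetime']
```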
And once you have found the data, what about access?
Users today must register with multiple data providers (NASA, NOAA, USGS, academic institutions) archiving data on varied data centers (on-premise data centers, AWS, Microsoft, Google). We need to find a solution for federated access to data and compute. I think it’s ok for users to start their exploration on different platforms, but they should then be able to execute offline processing on the system closest to the data. Systems should have similar APIs for processing data. The OGC Best Practice for Earth Observation Application Package references STAC in its implementation, so perhaps that specification for running application packages is a place to start.
Moving data to the cloud solves the scalability and high-availability challenges, but users will still have to download whole files if the data and its corresponding metadata and services are not cloud-optimized. There is consolidation around COGs and Zarr for subsetted access to gridded/raster data; however, a lot of data is still stored in archival formats. Kerchunk offers a way to provide subsetted access to these archival formats, but access to chunks is constrained by the chunk structure of the original files, and archival formats are often written without an optimal (or any) chunking. So we need to get in on the “ground floor” by having data producers create cloud-optimized data stores.
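The core of the Kerchunk idea can be shown with a toy reference mapping: a small index points each logical chunk key at a byte range inside the unmodified archival file, so a client can issue range requests instead of downloading the whole thing. The keys, offsets, and layout below are simplified stand-ins, not Kerchunk’s actual reference format:

```python
# Toy Kerchunk-style reference index: logical chunk keys mapped to
# (path, offset, length) inside an unmodified archival file. Real
# Kerchunk references follow the fsspec "reference filesystem" spec.
import io

archival_file = bytes(range(256)) * 4  # stand-in for a NetCDF/HDF5 file

refs = {
    "temp/0.0": ("file.nc", 0, 128),
    "temp/0.1": ("file.nc", 128, 128),
    "temp/1.0": ("file.nc", 512, 128),
}

def read_chunk(key):
    """Fetch one chunk's bytes; over HTTP this would be a range request."""
    _path, offset, length = refs[key]
    f = io.BytesIO(archival_file)  # stands in for remote object storage
    f.seek(offset)
    return f.read(length)

print(len(read_chunk("temp/0.1")))  # 128
```

Note that the available subsets are exactly the `(offset, length)` pairs the original file happens to contain, which is why poorly chunked archival files limit what Kerchunk can offer.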
There is still a lot to be done to consolidate on cloud-optimized data access approaches. We see pockets of success, such as the large-scale development of Zarr archives by the pangeo-forge project and heavy use of COGs across the industry. But ideally, we can help data producers to create cloud-optimized formats from the outset. This is to eliminate the need for additional cloud-optimization pipelines or heavy subsetting APIs. And it must be mentioned, we are still working on consensus on formats for point and vector data. The proliferation of different approaches means users are having to learn many different tools and formats, which is not sustainable or scalable.
Is normalizing Earth data possible?
Perhaps not, and in that case converging on a single data format doesn’t make sense. Because use cases are so varied, we should instead converge on the cloud-native format paradigm (chunk-defining metadata). My colleague Alex Mandel and I put together a presentation about cloud-optimized formats, which details how point and vector formats serve different purposes (for example, columnar formats may be good for large-scale analytics across files, whereas FlatGeobuf provides a spatial index that makes it easy to stream a single file).
However, if input datasets are at least published in STAC, we can extend tooling to support reproducible and shareable science. I believe that a rich and common metadata standard can serve the interoperability of data across platforms for analysis and visualization, making reproducible science a reality.
Where can people learn more about your work on cloud-native geospatial technologies?
These projects are great places for people to get started and learn more.
- NASA’s Visualization, Exploration, and Data Analysis (VEDA) Project - VEDA is championing the use of STAC as the configuration engine for its dashboard and APIs, as well as integrating its STAC catalog with other NASA projects, like NASA Worldview.
- stac-utils/pgstac on GitHub - Schema, functions, and a Python library for storing and accessing STAC collections and items in PostgreSQL. My colleagues at Development Seed have developed this PostgreSQL backend for STAC, making it possible to store and query rich STAC metadata like any other database. We store some basic statistics in STAC, which means we often don’t need to access the files at all for many use cases.
- Zarr Enhancement Proposals (ZEPs) - These Zarr proposals get at some of the most interesting considerations when adopting cloud-native formats - such as how to store aggregations and performantly create and access many, many small chunks of data (see the sharding storage transformer spec).
- There’s also the GeoZarr spec which “aims to provide a geospatial extension to the Zarr specification.”
Thank you for sharing your insights on cloud-native geospatial solutions. Your answers have provided valuable information for our audience. Before we sign off, what communities do you recommend joining for people interested in learning more in this space?
A lot is evolving in this space, and if you are interested in learning more or how to solve a specific problem, I encourage readers to join the following communities:
- The ESIP Cloud Computing Cluster meets monthly to discuss innovations in cloud computing and cloud-native geospatial. Browse the ESIP Community Calendar and join the mailing list.
- The Pangeo Community meets weekly in the Pangeo Showcase and also hosts other community meetings; you can learn about them in the Meeting Schedule and Notes.
- There are even communities for specific tools, like Zarr and Jupyter. I haven’t joined those myself yet, but I would encourage readers to join if they’re of interest.