The Naive Origins of the Cloud-optimized GeoTIFF
In our November post about democratization of Earth observation technologies, I promised to share some history of the Cloud-optimized GeoTIFF (COG) in a future post. This post fulfills that promise and includes some insights into the thinking behind the COG and how it emerged as a community standard. My hope is that sharing this history can help explain what it means to be “cloud-native” or “cloud-optimized” and the impact cloud computing has on how people interact with data.
As I said in November:
Back in 2014, thanks to a tip from Chris Holmes, I started looking into making Landsat data available on the AWS cloud. One of the first things I discovered was that people got Landsat data from USGS by downloading gigabyte-sized TAR files that contained 12 TIFF files, but most users only ever used about 3 of the TIFFs. It was immediately apparent that we could save people a ton of time by letting them access data on a per-TIFF basis from Amazon S3 instead of a TAR file. Together with many generous people (including Frank Warmerdam at Planet, Peter Becker at Esri, and Charlie Loyd and Chris Herwig at Mapbox), we took this idea even further and added internal tiling to the TIFF and launched what we called Landsat on AWS. This allowed people to not only access only the TIFFs they wanted, but allowed them to access specific tiles within the TIFFs they wanted. Not only did this save time, but it also let people build apps that could interact with a massive corpus of Landsat data in real time. Thanks to the work of many people in the geospatial data community, this approach has since evolved into a widely-used best practice called the Cloud-optimized GeoTIFF (COG).
The truth is that I knew very little about Earth observation in 2014. I did, however, know quite a bit about building Internet software, and I knew with every fiber of my being that you should avoid moving data over networks whenever you can. Being mindful of data transfer is good for your budget, and it’s good for your users, who benefit from snappy, low-latency interfaces. Fun fact: the “TAR” extension is derived from “tape archive”, which is a pretty big clue that this method of sharing data was rooted in a pre-WWW era. It was clear to me that we could use a more modern approach to sharing data that wouldn’t require users to download large TAR files.
My proposal to improve data access was to unpack the TAR files and make the GeoTIFFs available as individual files on S3. We could use Landsat’s naming conventions to allow people to use S3’s APIs as a read-only API. This would allow people to download only the files they needed. Peter Becker suggested that we go further and add internal tiling to the GeoTIFFs as well as provide overview files. Reader, if you don’t know what that means, don’t feel bad. I didn’t know either. Peter explained that internal tiling and overviews would allow users to quickly preview the imagery or request specific imagery within a GeoTIFF rather than downloading the whole thing. I did the math and found that doing this would slightly increase the overall volume of data we would host, but it would significantly reduce the volume of data users would have to transfer in order to work with the data. It seemed like a good idea for our customers, and on March 19th, 2015, we launched Landsat on AWS with 85,000 Landsat 8 scenes. We added new data as fast as the team at Planet (and by “team,” I mean Amit Kapadia) could get it.
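To make the benefit of internal tiling concrete, here is a minimal sketch of the bookkeeping a client does when it only wants part of an image. Everything specific in it is an assumption chosen for illustration rather than a detail of Landsat on AWS: the 512-pixel tile size, the 7,600 x 7,800 pixel band, and the function names are all hypothetical.

```python
from math import ceil


def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return (tile_row, tile_col) indices of the internal tiles that
    overlap a pixel window. tile_size is an assumed value."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        (r, c)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


def tile_count(image_width, image_height, tile_size=512):
    """Total number of tiles needed to cover the full image."""
    return ceil(image_width / tile_size) * ceil(image_height / tile_size)


# Hypothetical band of 7,600 x 7,800 pixels and a 1,000 x 1,000 pixel
# area of interest starting at column 3000, row 3000.
needed = tiles_for_window(3000, 3000, 1000, 1000)
total = tile_count(7600, 7800)
print(f"{len(needed)} of {total} tiles overlap the window")
```

Under these assumptions the client touches only a small fraction of the tiles in the file, which is the trade Peter was describing: a little more data stored, far less data transferred.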
As I was working on this, I had to deflect questions from well-intentioned colleagues who would ask me what all this cool satellite data would “look like” when we got it onto the cloud. The awkward truth is that the data looked like a bunch of files in an S3 bucket. Paul Ramsey accurately characterized the approach as “not user friendly at all,” but “simple” and “computer friendly” in an excellent presentation called Data Pipes and Relevance he gave a few months after we’d launched Landsat on AWS. We didn’t offer any tools to visualize the data, and we didn’t offer any APIs other than S3’s API. These decisions were deliberate. What we wanted to do was deliver data efficiently in the most flexible way possible and then get out of people’s way. Because we were making this data open to anyone, we had very little insight into what tools they would have access to, so we made no assumptions about what people should do. We simply adopted a commonly-used pattern for sharing GeoTIFFs with plain off-the-shelf HTTP APIs backed by a cloud object storage service.
This approach turned out to be useful. Many smart people throughout the geospatial community adopted it, documented why it was a good idea, made a website for it, and named it the Cloud-optimized GeoTIFF, or COG. The COG is now the standard way to get Landsat data from USGS. They even made a funny video describing why they adopted it (the animation at the top of this post is taken from this video).
I’m glad it worked out, but there was no guarantee that it would. When people told me that our approach would break existing workflows, I believed them. That was kind of the point. I knew our approach would be useful, but explaining it was sometimes embarrassing. As Jeff Bezos has said, “I believe you have to be willing to be misunderstood if you’re going to innovate.” I remember being in a meeting with some executives from NASA sometime later in 2015. I showed them what we’d done, and I remember someone saying something like, “This isn’t impressive.” I’m sure the comment was more polite than that, but the truth is that the innovation was hard to see. On the surface, it’s accurate to say that all we did was put a bunch of files into S3. But if you zoom in on that “bunch of files” and “S3,” there’s an enormous amount of innovation that we can learn from.
With years of hindsight, here are a few of the innovative technologies that have made the COG as useful as it is:
First and foremost, Landsat data is inherently extremely compelling and useful! None of this would have been interesting if Landsat wasn’t already recognized as a gold standard for Earth observation data. It is far beyond the scope of this post to list the innovations that made Landsat possible.
The World Wide Web and the Hypertext Transfer Protocol (HTTP)
Here’s how the COG is described at cogeo.org:
A Cloud Optimized GeoTIFF (COG) is a regular GeoTIFF file, aimed at being hosted on a HTTP file server, with an internal organization that enables more efficient workflows on the cloud. It does this by leveraging the ability of clients issuing HTTP GET range requests to ask for just the parts of a file they need.
This leads to the question: if a COG isn’t hosted on an HTTP file server, is it really a COG?
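For readers who haven’t seen a range request before, here is a minimal sketch of one using only Python’s standard library. The helper names and the example URL are hypothetical, and a real COG reader (GDAL, for instance) does a good deal more: it first reads the file’s header to learn where each tile lives, then asks for just those byte ranges.

```python
import urllib.request


def range_header(offset, length):
    """Build the value of an HTTP Range header for `length` bytes
    starting at byte `offset`. Byte ranges are inclusive on both ends."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url, offset, length):
    """Create a GET request that asks for only part of a remote file."""
    return urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )


def fetch_range(url, offset, length):
    """Fetch a byte range. A server that honors the request answers
    with status 206 (Partial Content) rather than 200."""
    with urllib.request.urlopen(build_range_request(url, offset, length)) as resp:
        if resp.status != 206:
            raise RuntimeError("server did not honor the range request")
        return resp.read()


# Hypothetical URL: ask for the first 16 KiB of a GeoTIFF.
request = build_range_request("https://example.com/scene_B4.TIF", 0, 16384)
print(request.get_header("Range"))
```

Nothing here is specific to geospatial data, and that is the point: the only thing the client needs from the server is ordinary HTTP.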
Open data policy
We couldn’t have done this if USGS hadn’t adopted an open data policy for Landsat in 2008 that allowed anyone to copy and reproduce it. We owe immense gratitude to Barbara Ryan and all of her co-conspirators who made this happen, particularly because this innovative policy was subsequently adopted by the European Union’s Copernicus program.
The Geospatial Data Abstraction Library (GDAL) is the most important piece of software for the Earth observation community that no one ever talks about. It’s the open source library used to create COGs and read them efficiently. Just about every geospatial software application worth using relies on GDAL. This is also the point where I should reemphasize that GeoTIFFs with internal tiling and overviews were not a new idea in 2015! They were an established best practice that GDAL already supported.
Cloud-based object storage
Public cloud-based object storage is another incredibly consequential invention we take for granted. All major cloud providers maintain object storage services that are absolute marvels of technology that let people efficiently store and retrieve effectively any volume of data from anywhere in the world using HTTP. They can handle just about all the traffic you can throw at them, and they’re all close to scalable computing resources that can be used to analyze the data.
Public cloud business model
The technology behind object storage wouldn’t matter if AWS hadn’t built a business that had scaled to the point where they could give away petabytes of free storage. The scale of the public cloud business is so large that all major cloud providers are now dedicating multiple petabytes of object storage for Earth observation data. In case you don’t know what a petabyte is, consider this: it would take you 2.5 years to download a single petabyte of data at a 100 megabit per second rate. We’re lucky that competition in the cloud industry has enabled this kind of support for the scientific community.
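If you want to check the petabyte figure yourself, the arithmetic is short. This sketch assumes a decimal petabyte (10^15 bytes), a link that sustains exactly 100 megabits per second with no protocol overhead, and a 365.25-day year; the function name is made up for the example.

```python
PETABYTE_BYTES = 10**15           # decimal petabyte (assumption)
LINK_BITS_PER_SECOND = 100 * 10**6  # 100 megabits per second
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def download_years(num_bytes, bits_per_second):
    """Years needed to transfer num_bytes over a perfectly sustained link."""
    seconds = num_bytes * 8 / bits_per_second
    return seconds / SECONDS_PER_YEAR


years = download_years(PETABYTE_BYTES, LINK_BITS_PER_SECOND)
print(f"{years:.1f} years")
```

That works out to roughly two and a half years of uninterrupted downloading, before accounting for any real-world slowdowns.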
The evolution of the COG and the subsequent development of the Spatiotemporal Asset Catalog (STAC) specification have both been possible because of the generosity of many members of the geospatial community. We did all of this in collaboration with individuals with a mind toward creating something that would be the most useful to the most people. When we launched Landsat on AWS, we didn’t set out to solve everyone’s problems or “own the geospatial market.” Instead, we set out to make it quite a bit easier for our customers to hack on their own problems and share their findings with their community. All of this collaboration was made possible by technologies as prosaic as telephones, email, mailing lists, air travel, and GitHub.
This is a decent (and incomplete) list of what made the COG possible. My point in listing all these things is not to specify who should get credit, but rather to point out that innovation is often emergent – it arises out of many interacting things. So while the COG is relevant today, we can be confident that something better (faster, easier to access, more flexible) will come along someday. For that to happen, we must create circumstances in which many things can interact.
The good news is that this is already happening. Several fantastic cloud-native initiatives are underway that make geospatial data easier to work with (see GeoZarr, Cloud-optimized Point Clouds, and GeoParquet). There are even Cloud-optimized TAR files! Competition among public cloud providers has reliably led to a reduction in object storage costs, continual improvement in performance, and a commodification of services that ensures that object storage services support generic RESTful / HTTP data transfer protocols. These trends combine to make cloud-native approaches to data a powerful way to expand access to planetary-scale volumes of data. As more data becomes available in these formats, we’re seeing increased diversity of tools and applications being developed.
From our perspective at Radiant, diversity of tools and applications isn’t merely nice to have; it’s core to our mission of increasing shared understanding of our world. Different organizations and individuals need different tools because they differ in budget, staff size, language requirements, accessibility requirements, security requirements, available equipment, and Internet access. The cloud-native approach enables the creation of new tools that are entirely browser-based, which can make data more accessible to people who may not have access to proprietary software licenses or significant compute resources.
In 2016, Paul Ramsey (yes, I’m a fan) gave a presentation titled The Undiscovered Country in which he predicted an “interesting” future would emerge as the result of interactions between utility computing (cloud), free software, cheap data, and increased availability of ML models. It’s fair to say we’re living in a pretty interesting future right now.
We’re currently working on some ideas to support the cloud-native data community. If you’re interested in getting involved, we’d love to hear from you: firstname.lastname@example.org.