We don’t talk about open data much at Radiant Earth. This might seem strange given how much work we do to expand access to data, but the term “open data” has become increasingly unhelpful over recent years.
The Open Knowledge Foundation has long proposed what they call the Open Definition, which states that open data is data that “can be freely used, modified, and shared by anyone for any purpose.” This is fine. Open data by this definition sounds like a nice thing, and I agree that more data should be made available under the terms described by the Open Definition. It’s hard to argue with “open,” which is part of the problem. “Open” increasingly appears in white papers, marketing copy, and press releases as a magic word that makes any project appear altruistic, collaborative, or democratic.
There are two reasons why this doesn’t work for us:
- “Open” is an imprecise term that doesn’t capture anything about data’s value, intended use, or usability. It’s possible to meet the simple version of the Open Definition without making data truly available or useful to people.
- Focusing exclusively on open data can lead organizations to discount or ignore data that should not be made available openly.
Let’s address these in order:
The imprecision of open
At Radiant Earth, rather than talking about open data, we talk about making more data more available to more people – let’s call it the “More Data framework.” This framework allows us to be pragmatic and work incrementally to achieve our mission of increasing shared understanding of our world. The framework explicitly states that there is always more to do.
Also, rather than talking about “data” in the abstract, we talk about “data products.” I have sat in many meetings in which people talk about data as if it were a magical substance that would solve everyone’s problems if only they had more of it. I have never been in a meeting in which someone said the same thing about software. When we talk about software, we talk about specific software products – and when we talk about products, we talk about what they do, who they’re built for, and how they work. The same framing is very useful for data.
By simply appending the word “product” to “data,” we’re forced to think more practically. When talking about a data product, we have to explain who we expect to use it, the value it provides, and the cost to produce and maintain it. Within this context, we can also consider all of the attributes that might impact the “openness” of the data product. Is the data product available under some kind of open license? Is it available for download? Can people interact with it programmatically? Is it documented? Is it available in a commonly-used format? Is it hosted on infrastructure that is practically usable by many people?
Instead of talking about open data, we talk about openness as an attribute of a data product, and we expect quite a bit of variance in degrees of openness across different data products.1
The case against openness
Should we be working to expand access to data products that might help poachers hunt endangered species? What about data products that could be used by militaries to identify and attack migrant populations? These are easy questions to answer. The answer is… maybe!
Note that I didn’t say “working to open” – I said “working to expand access.” We absolutely should work on ways to make data products available to people who can put them to good use (rather than get into the weeds about what “good use” means, let’s assume I mean “make progress toward the Sustainable Development Goals”). The point is that there are many instances when it makes sense to restrict access to data. There are people who work in conservation and law enforcement who can use data to protect habitats of endangered species. There are humanitarian groups who can use data to better protect migrant populations. We want to make sure they have access to the data they need to do their jobs, while not accidentally making it available to bad actors.
This is another place where the More Data framework comes in handy. These are use cases where open data simply isn’t an option, but we still want to ensure that the right people can access the data that will help them do their jobs. We should be on the lookout for problems that specific actors can solve with data. Once we identify a problem, we can determine which data products can help solve it and who should have access to those products. In some cases, the data should be completely free and open to everyone. In other cases, the data might be so expensive to produce that it’s infeasible to give it away for free. And in others, it might be made available for free, but only to trusted stakeholders.
A challenge that we’re working on right now is how to accommodate these various use cases. In recent years, a facile approach taken by many governments has been to declare all data “open by default” without answering how to provide controlled access to data that shouldn’t be opened. We’re exploring how data products can be made available in a way that accommodates a broad spectrum of openness. Forgive me, but we believe that supporting a wider variety of user needs is, ultimately, a more open approach.
This approach is similarly helpful when thinking about the FAIR data principles, which call for data to be findable, accessible, interoperable, and reusable. Ryan Abernathey gave a great presentation called Beyond FAIR in October 2022 that explores some of the reasons why calls for FAIR data haven’t made a meaningful impact on scientific data infrastructures. ↩︎