Source Cooperative Update: 1PB and Growing
Since launching Source Cooperative in 2023, we’ve grown to host over a petabyte of data.
We are proud to have reached this milestone, but we’re still barely getting started. We’re now at a point where we have enough data and enough users to meaningfully guide future product development. We recently conducted some user research, and our users have made it clear that we need to balance the power of Source with greater accessibility.
Here’s where we stand and what we’re doing to make Source usable by a wider range of people.
Source hosts over one petabyte of data & transfers half a petabyte per month
Source now hosts over 1 petabyte of data – more than doubling the 450 terabytes we were hosting when we announced our support from Navigation Fund in October 2024.
We host over 300 data products. Some exciting additions to Source in recent months have been cloud-optimized global forecast data from dynamical.org, global satellite imagery embeddings from Earth Genome, the Ocean Carbon Dioxide Removal Atlas from [C]Worthy, and the WaterNet dataset from Bridges to Prosperity. About 10% of the data products listed on Source are over 1 terabyte in volume.
Data usage has increased alongside our growth. We now log an average of 126 million data requests and about half a petabyte of data transfer per month. A search for our data proxy endpoint on GitHub currently yields 464 results in code repositories, up from 186 in October of last year. This indicates not only that we’re hosting a lot of data, but also that the data is useful.
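To put these figures in proportion, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in this post; the decimal unit convention (1 PB = 1,000 TB) and the treatment of "over 1 petabyte" as exactly 1 PB are assumptions, so the results are rough lower bounds rather than official metrics. Under those assumptions it works out to roughly 2.2x growth in hosted volume, about 2.5x growth in GitHub search results, and an average of about 4 MB transferred per request.

```python
# Figures quoted in this post. Decimal units (1 PB = 1000 TB) are an assumption.
TB = 10**12
PB = 10**15

hosted_now = 1 * PB              # "over 1 petabyte", treated as a lower bound
hosted_oct_2024 = 450 * TB       # volume hosted in October 2024
monthly_transfer = 0.5 * PB      # "about half a petabyte" per month
monthly_requests = 126_000_000   # average data requests per month
github_hits_now = 464
github_hits_before = 186

storage_growth = hosted_now / hosted_oct_2024
github_growth = github_hits_now / github_hits_before
avg_bytes_per_request = monthly_transfer / monthly_requests

print(f"Storage growth: at least {storage_growth:.2f}x")
print(f"GitHub search result growth: {github_growth:.2f}x")
print(f"Average transfer per request: about {avg_bytes_per_request / 10**6:.1f} MB")
```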
The growth has been somewhat organic, driven primarily by word-of-mouth recommendations within our own circles. We’ve also been heavily involved in data rescue operations, helping preserve and make accessible datasets that might otherwise be lost due to recent funding cuts.
Our users like Source, but we need to make it easier to use
Between April and May of this year, we conducted surveys and interviews with 40 users and potential users to better understand needs and adoption barriers. These users face common problems: data scattered across systems, unclear documentation and metadata, complex authentication barriers, and poor data lineage information. They need access to large datasets for analysis and modeling but are often budget-sensitive and wary of commercial cloud costs and vendor lock-in.
Many existing solutions are very limited in data capacity, charge high fees for simplified (and rigid) offerings, or require complex technical expertise to set up and maintain. We want Source to occupy the middle ground – an accessible service that enables the publication of any kind of data with reduced complexity and at an appropriate price point.
Our research revealed three primary user types: data engineers, software developers, and researchers/scientists – particularly those working in nonprofit or government-funded settings with increasingly limited resources. We found that 64% of Source users report that it meets their needs, and 86% plan to continue using it. This tells us we’re onto something, but 64% leaves a lot of room for improvement!
We found clear areas to work on: 41% want easier data upload processes, 36% want better documentation, and 32% want a larger community of users. Non-users cited technical limitations (89%), time to learn new tools (67%), and security concerns (56%) as key barriers to adopting any data repository.
What we’re working on now
We’re working with Development Seed on the next generation of Source. Here’s what we’re focused on for the rest of the summer.
Building a better data proxy
Based on user feedback, we’re rethinking our approach to the Source Data Proxy.
The Source Data Proxy is essential: it provides an interface that can support multiple object stores while allowing us to meter traffic, throttle access, and manage authentication. This allows us to create durable public endpoints, manage bandwidth costs, and control access to sensitive data products. However, we’re committed to finding ways to ensure researchers have direct access to data whenever possible for large-scale analysis and application development.
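For readers unfamiliar with what metering and throttling mean in practice, here is a minimal sketch of one common technique, a token bucket. This is an illustration only and not a description of how the Source Data Proxy is implemented; the class name, parameters, and the byte-based budget are all hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rate` bytes per second, with bursts up to `capacity` bytes.

    Hypothetical illustration; not the Source Data Proxy's actual logic.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()
        self.bytes_served = 0  # running meter of traffic that was allowed

    def allow(self, nbytes):
        """Return True and record the transfer if `nbytes` fits in the budget."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if nbytes > self.tokens:
            return False  # a real proxy might answer HTTP 429 here
        self.tokens -= nbytes
        self.bytes_served += nbytes
        return True


# Usage with a simulated clock so the behavior is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=100, capacity=200, clock=lambda: fake_now[0])
first = bucket.allow(150)    # fits within the initial burst budget
second = bucket.allow(100)   # only 50 tokens remain, so this is refused
fake_now[0] = 1.0            # one simulated second refills 100 tokens
third = bucket.allow(100)    # 150 tokens are available again
print(first, second, third, bucket.bytes_served)
```

The same object does both jobs: the refill logic throttles bursts, while the `bytes_served` counter acts as a simple traffic meter.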
We want to support users who only need to publish a few files as well as those who need to publish millions of objects. The flexibility and scalability of object storage benefit users at either end of the spectrum, but Source currently favors power users who are already familiar with the cloud. We can make basic interface changes to help new users.
Users consistently identified proxy documentation as a pain point during our interviews, and we need to make it clearer how users can access data directly from cloud storage whenever possible without using the proxy. We will be updating Source documentation in the coming months to remedy this.
Creating a more flexible frontend
We’re developing a new frontend for Source, which you can preview at s2.source.coop. The new interface will be more flexible, allowing us to iterate more quickly based on user feedback and to share much more information about data products and how they’re being used.
Goals we are working toward with the new frontend include:
- Making it easier for non-expert users to publish data
- Adding more metadata to our web pages, which will improve search engine results and help guide AI models
- Sharing data usage metrics on data product detail pages
- Incorporating ORCID IDs in user profiles, ROR IDs in organizational profiles, and DOIs in data product detail pages
The new frontend will be released this summer.
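As an aside on the identifier work listed above: ORCID iDs carry a built-in check character computed with the ISO 7064 MOD 11-2 algorithm, which makes it possible to catch typos before an iD is saved to a profile. The sketch below shows that check in Python. The function names are hypothetical and this is not code from the Source frontend; the sample iDs are examples commonly cited in ORCID's own documentation.

```python
import re

# Four hyphen-separated groups of four; the last character may be a digit or X.
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if `orcid` is well formed and its check character matches."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD with a numeric check digit
print(is_valid_orcid("0000-0002-1694-233X"))  # sample iD whose check character is X
print(is_valid_orcid("0000-0002-1825-0098"))  # last digit altered
```

A check like this only confirms that an iD is internally consistent; it does not confirm that the iD is registered or belongs to the person entering it.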
Writing more case studies
The art of creating great data products is understudied, and we think Source provides a great lab to experiment. We’re lucky to work with people committed to open science who are remarkably good at this, but we need to be more deliberate about showcasing what makes a great data product. There are many things to consider when sharing data, including metadata, documentation, thoughtful file structure, thoughtful choice of file formats, and community outreach.
We’re working with a few of our users to produce case studies throughout the rest of the year to showcase what good looks like.
Funding and Sustainability
As we’ve said before, we believe financial self-sustainability should be a core feature of Source. We plan to share more details about our revenue model before the end of the year.
Unfortunately, because of proposed cuts to research funding and university overhead, many of our users are increasingly concerned about the durability of the data archiving systems they’ve previously relied upon. We aim to position Source as an affordable option for organizations that are no longer able to fund their own data management systems.
Our approach will likely include:
- Free tiers for smaller datasets and human-scale usage
- Predictable fees based on storage volume for data publishers
- Direct cloud access options for users needing high-bandwidth programmatic access
The goal is predictable, fair pricing that covers our costs while keeping the service accessible to the academic and nonprofit communities we primarily serve. In the interim, funding from Navigation Fund and in-kind support from AWS allow us to continue offering free hosting while we develop a service worth paying for.
Advisory Board
To help guide Source Cooperative’s development and ensure we’re building something that truly serves the research community, we’ve assembled an advisory board of experts from across the data science, policy, and research landscape. Our advisory board members include:
- Denice Ross – Senior Fellow, Federation of American Scientists & Former U.S. Chief Data Scientist
- Fernando Pérez – Faculty Director, Berkeley Institute for Data Science
- Josh Peek – Head of Data Science Mission Office, Space Telescope Science Institute
- Marshall Moutenot – CEO, Upstream Tech & Founder, dynamical.org
- Millie Chapman – Assistant Professor of Environmental Policy, ETH Zürich
- Drew Breunig – Technical Advisor at large
- Carl Boettiger – Associate Professor, University of California, Berkeley
- Anna Greenwood – Neuroscientist at large
- Eli Fenichel – Professor of Natural Resource Economics, Yale School of the Environment
- Mark Otterlee – Senior Director of Engineering, Allen Institute for Artificial Intelligence
We are honored to have the support of this remarkable group of people. Their special blend of scientific, policy, economic, business, and product design experience will help us realize the full potential of Source.
More frequent development updates will be published on the Source Cooperative Documentation site moving forward.