We’d like to share the Data Curation Network (DCN) response to the White House Office of Science and Technology Policy (OSTP) request for public comment on the Draft Desirable Characteristics of Repositories for Managing and Sharing Data Resulting From Federally Funded Research, posted January 17, 2020, as document 85 FR 3085 in the Federal Register.

Our team, a collaboration of 10 academic and general data repositories that share data curator expertise to overcome common challenges, generally agrees with all of the desirable characteristics. So rather than echo what SPARC and CORE have expertly framed in their response, we would like to drill down on two characteristics in greater detail:

  • Curation & Quality Assurance
  • Free & Easy to Access and Reuse

Curation & Quality Assurance: Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.

“Curation” is a term that may have multiple meanings depending on context and perspective. We recommend that the OSTP adopt a clear definition of “curation” to aid both researchers and repositories in understanding and adhering to expectations in managing and sharing data. Our preferred definition of curation is the “activity of managing and promoting the use of data from their point of creation to ensure that they are fit for contemporary purpose and available for discovery and reuse” from the CoreTrustSeal Trustworthy Data Repositories Requirements: Glossary 2020–2022 (Version v02_00-2020-2022). 

Our research has shown that researchers view themselves as playing a key role in providing curation and quality assurance for their data, often starting when the data are created. As such, data repository curators must bring in multiple perspectives, including that of the originating author, when providing additional curation and quality assurance services.

Curators employed by data repositories should be recognized as trained professionals who combine subject matter expertise with an educational foundation in digital archives. For example, data curators in the DCN often hold a PhD in a discipline along with a terminal degree in library and information science (e.g., an MLS or MLIS) and supplement this with ongoing professional development in digital curation practice (e.g., the SAA Digital Archives Specialist certification).

Not every general or multi-disciplinary data repository can hire an expert for every data type and discipline-specific format it receives (such as spatial data, code, databases, chemical spectra, 3D images, and genomic sequencing data). Therefore, the Data Curation Network, in addition to establishing a shared staffing model among our partner repositories, also created a platform for sharing expertise through Data Curation Primers: freely available, interactive, living documents, written by teams of experts, that each detail a specific subject, disciplinary area, or curation task and can be used as a reference when curating research data.

Curation should protect the chain of custody of a dataset and ensure the authenticity of the data. We therefore recommend that data repositories strive for transparency about the curation actions they take, both in general and for each individual dataset. Such transparency would benefit data depositors when selecting a repository, as well as data consumers when deciding whether to use the data. For example, members of the Data Curation Network take generalized actions on our datasets, called the CURATE(D) steps, as well as specific actions that are detailed in a curation log (a minimal code sketch of an inventory and curation log follows the list below). The CURATE(D) steps include (briefly):

  • Check – Create an inventory of the files and review received metadata
  • Understand – Run the data/code, read documentation, assess for QA/QC red flags
  • Request – Work with the author to address any missing information or changes needed
  • Augment – Enhance metadata for discoverability and contextualize data with appropriate linkages (e.g., a persistent identifier pointing to the related paper or published code)
  • Transform – Convert files to non-proprietary formats, if appropriate
  • Evaluate – Review overall data package for FAIRness
  • Document – Record all curation activities in a log file.
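
To make the “Check” and “Document” steps concrete, below is a minimal sketch, in Python, of a file inventory with fixity checksums and an append-only curation log. The directory layout, field names, and log format are hypothetical illustrations of the approach, not DCN production tooling.

```python
# Hypothetical sketch of the "Check" and "Document" CURATE(D) steps:
# inventory a deposit's files with SHA-256 checksums, then append each
# curation action to a JSON-lines log that travels with the dataset.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def inventory(dataset_dir):
    """Record every file in the deposit with its size and checksum,
    so later readers can verify that the files have not changed."""
    records = []
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file():
            records.append({
                "file": str(path),
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            })
    return records

def log_action(log_path, step, note):
    """Append one curation action (e.g., "Check", "Transform") to the log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "note": note,
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")

# Example usage on a hypothetical deposit directory:
# files = inventory("deposit_0042/")
# log_action("deposit_0042/curation_log.jsonl", "Check",
#            f"Inventoried {len(files)} files with SHA-256 checksums.")
```

A checksummed inventory recorded at deposit time, together with a running log of actions, is one way to support the chain-of-custody and transparency goals described above.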

Levels of curation vary from repository to repository (see Table 1). Based on an examination of the work we do in the Data Curation Network, we recommend that federally funded research data be shared in repositories that practice enhanced curation, so that datasets are complete, understandable to someone with qualifications similar to the data creator’s, and stored in formats that allow for long-term use.

Table 1: CoreTrustSeal Levels of curation mapped to descriptions provided by the Data Curation Network

A. Content distributed as deposited
Datasets are accepted into the repository with no curator intervention.
e.g., FigShare, Zenodo, OpenICPSR, many institutional data repositories

B. Basic curation (e.g., brief checking; addition of basic metadata or documentation)
Basic curation is often applied at the level of the metadata record. Descriptive metadata, such as keywords drawn from a controlled vocabulary, are reviewed, verified, and/or added to improve discoverability (a brief code sketch follows the table).
e.g., Springer ($), Mendeley Data ($), some institutional data repositories

C. Enhanced curation (e.g., conversion to new formats; enhancement of documentation)
Enhanced curation is often applied at the file level: data files are checked for completeness, and documentation is reviewed and/or enhanced so that it is understandable to someone with qualifications similar to the data creator’s.
e.g., Data Curation Network institutions, which follow the CURATE(D) steps to apply enhanced curation to a wide variety of data types

D. Data-level curation (as in C above, but with additional editing of deposited data for accuracy)
Data-level curation is often applied by a subject matter expert who reviews the contents of the files in a process analogous to peer review. This deeper level of curation may involve quality control, harmonization to increase interoperability with other datasets, and domain-specific metadata augmentation.
e.g., many domain repositories, such as the Protein Data Bank, ICPSR, dbGaP, and GenBank
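
To illustrate the difference between the levels above, here is a minimal sketch, in Python, of the kind of metadata-record check described under basic curation (level B): comparing depositor-supplied keywords against a controlled vocabulary and flagging unmatched terms for curator review. The vocabulary and keywords are invented placeholders, not any repository’s actual data or API.

```python
# Hypothetical sketch of a basic (level B) metadata check: accept
# keywords found in a controlled vocabulary and flag the rest for a
# curator to review, verify, or replace.

CONTROLLED_VOCABULARY = {"hydrology", "water quality", "remote sensing"}

def review_keywords(submitted_keywords):
    """Split submitted keywords into vocabulary matches and terms
    needing curator review."""
    accepted = [k for k in submitted_keywords
                if k.lower() in CONTROLLED_VOCABULARY]
    needs_review = [k for k in submitted_keywords
                    if k.lower() not in CONTROLLED_VOCABULARY]
    return accepted, needs_review

accepted, needs_review = review_keywords(["Hydrology", "stream gage data"])
print("accepted:", accepted)          # ['Hydrology']
print("needs review:", needs_review)  # ['stream gage data']
```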

Free & Easy to Access and Reuse: Makes datasets and their metadata accessible free of charge in a timely manner after submission and with broadest possible terms of reuse or documented as being in the public domain.

We agree that data should be available to users without cost. However, there are significant costs attached to providing long-term discovery, access, curation, preservation, and stewardship for data. As data repository managers, we ask that OSTP address this criterion by recognizing how data repositories fund these services. For example:

  • In many cases, academic data repositories are supported through federal grants via indirect costs as well as through state and tuition funds for general operating costs.
  • If sufficient funding is not available from the federal agencies or publishers that mandate the deposit of data into repositories, a repository may need to charge reasonable, cost-recovery fees to depositing researchers to cover operating expenses. For example, the Dryad Digital Repository charges authors a $120 deposit fee, which may be covered by a researcher’s institution through an annual membership.

Sincerely, representatives of the member institutions of the Data Curation Network

Lisa Johnston, University of Minnesota

Kathryn Wissel, New York University

Elizabeth Hull, Dryad

Mara Blake, Johns Hopkins University

Cynthia Hudson Vitale, Pennsylvania State University 

Joel Herndon, Duke University

Hoa Luong, University of Illinois

Wendy Kozlowski, Cornell University Library

Jake Carlson, University of Michigan