How metadata and persistent identifiers help the research process
Summary of the basic metadata concepts, how persistent identifiers tie in, and how they can help
Just because you can’t fit an econometric model on metadata doesn’t mean it’s not important. Metadata are a key part of making a research project reproducible. Metadata is a big topic, and the How To FAIR guide gives more detail. The essential piece we’ll focus on here is using persistent identifiers to refer unambiguously to the inputs and outputs of research (citation metadata). Aspects of this have already been covered in citing code, citing data, publishing research compendia, licensing, and citing digital objects, so this section is about how these topics relate to each other, as summarized in Figure 1.
Identifying the inputs
The first thing to identify is the input data used for a research project. This is probably the most difficult part of a research project to reference and, unfortunately, also one of the most important. Having a clear understanding of the source of the input data for a project is key to being able to reproduce it. Unambiguously referring to input data is easy when the data come from an open dataset in a research archive (like Zenodo), as the DOI gives a persistent identifier for the input data to a project. At the other extreme are proprietary data, for which it is not always straightforward to identify the inputs.
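As a small illustration of why a DOI works as an unambiguous reference, the sketch below normalizes the different ways a DOI is commonly written into a single resolver URL at https://doi.org/. The function name doi_to_url and the example DOI are hypothetical, and the pattern is a simplified approximation of DOI syntax rather than a full validator.

```python
import re

# Simplified pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an approximation for illustration, not a complete DOI validator.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is prefixed in reference lists and web pages.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(reference: str) -> str:
    """Return the https://doi.org/ URL for a DOI written in any common form."""
    doi = reference.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            # Drop the prefix and any whitespace that followed it.
            doi = doi[len(prefix):].strip()
            break
    if not _DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {reference!r}")
    return "https://doi.org/" + doi


# Hypothetical DOI, used only to show the three spellings collapsing to one URL.
for text in ("10.5281/zenodo.1234567",
             "doi:10.5281/zenodo.1234567",
             "https://doi.org/10.5281/zenodo.1234567"):
    print(doi_to_url(text))
```

Whichever spelling appears in a paper or README, the reference points at the same object, which is what makes it usable as citation metadata.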
The next thing to identify is the process used to conduct the research. This is fairly straightforward, as covered by publishing research compendia, and can be done by using a research archive that mints a DOI. In some cases a source-code repository like GitHub also works, although it does not mint DOIs and, consequently, it is more difficult to refer to a GitHub repo at a point in time.
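One partial workaround, sketched below, is to refer to a specific commit rather than to the repository as a whole, since a commit hash identifies the exact state of the code. The owner, repository name, and hash are hypothetical placeholders, and the sketch assumes a full 40-character SHA-1 hash. Unlike a DOI, such a link carries no guarantee of persistence if the repository is renamed or deleted.

```python
import re


def commit_permalink(owner: str, repo: str, commit_sha: str) -> str:
    """Build a GitHub URL that points at the repository tree at one commit."""
    # Require the full hash: abbreviated hashes can become ambiguous over time.
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character lowercase hex commit hash")
    return f"https://github.com/{owner}/{repo}/tree/{commit_sha}"


# Placeholder values for illustration only.
example_sha = "0123456789abcdef0123456789abcdef01234567"
print(commit_permalink("example-owner", "example-repo", example_sha))
```

The hash of the current commit can be obtained locally with git rev-parse HEAD.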
An often neglected piece of metadata is the software and tools used as part of a research project. As with identifying input data, citing research software is only as easy as the authors make it, and in some cases tools like {grateful} can help automate this process. Licensing is an important piece of metadata here, as depending on software that is not freely available greatly impedes reproducibility.
Finally, references to previous work are a well-established piece of metadata for any research project. Most published work now has a DOI, but conference papers, for example, often do not, making them more difficult to reference. An important part of identifying the authors of previous work is having a persistent identifier for you, the researcher. This is the role of an ORCID, which uniquely identifies a researcher and is connected to their research.
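An ORCID iD is 16 characters long, and its last character is a check character computed with the ISO 7064 MOD 11-2 algorithm, so a mistyped iD can often be caught before it ends up in a paper's metadata. The sketch below is a minimal implementation of that check. The function names are hypothetical, and the iD in the example is the fictitious sample iD used in ORCID's own documentation.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    remainder = total % 11
    result = (12 - remainder) % 11
    # A result of 10 is written as "X".
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, digits, and check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    base, check = compact[:15], compact[15:]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == check


print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD from ORCID's docs
print(is_valid_orcid("0000-0002-1825-0098"))  # last character altered
```

A passing check only means the iD is well formed; it does not confirm that the iD is registered or belongs to a particular person.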
Good metadata isn’t just about attribution; it can also help to make the different parts of a research project more accessible and interoperable. For example, research archives on Zenodo can be downloaded directly from R based on the DOI.
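To give a sense of how a DOI becomes something a script can act on, here is a minimal sketch in Python (used for all examples in this section, rather than R). It assumes that Zenodo DOIs have the form 10.5281/zenodo.<record id> and that record metadata is served at https://zenodo.org/api/records/<record id>; both assumptions should be checked against Zenodo's current documentation. The function name and the DOI are hypothetical, and no network request is made.

```python
import re

# Assumed shape of a Zenodo DOI: the 10.5281 prefix followed by "zenodo.<id>".
_ZENODO_DOI = re.compile(r"^10\.5281/zenodo\.(\d+)$", re.IGNORECASE)


def zenodo_record_api_url(doi: str) -> str:
    """Map a Zenodo DOI to the (assumed) REST API URL for that record."""
    match = _ZENODO_DOI.match(doi.strip())
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    record_id = match.group(1)
    return f"https://zenodo.org/api/records/{record_id}"


# Hypothetical DOI for illustration; it may not correspond to a real record.
print(zenodo_record_api_url("10.5281/zenodo.1234567"))
```

A download tool would fetch the JSON at that URL and follow the file links it lists, so the DOI recorded in a project's metadata is all a reader needs to retrieve the same inputs.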
Metadata for data are different
If a dataset is the output of a project, rather than an input to the project, then metadata become more complex: it is no longer just about proper attribution but about effectively describing the dataset (structural metadata). Both SDMX and DDI are standards for documenting statistical data, although the details are beyond the scope of this guide. But these only matter for producing data and making it usable by others; consumers of data need only reference those data (ideally with a DOI).