How to approach reproducibility

Published

2025-04-28

What is reproducibility?

For robust scientific progress, new methods or approaches should be confirmed independently before they are widely adopted. The goal of appropriately structuring and sharing research objects in a transparent fashion is to simplify this peer review process.1 The primary way this is achieved is by creating and publicly publishing a research compendium along with the paper. A research compendium is a collection of the digital parts of a research project that supports reuse, including data, code, protocols, documentation, metadata, etc.2

Why does it help?

The main principle of the research compendium is to provide all the information about the project publicly, structured in a clear and logical way so that its use is straightforward. This helps researchers themselves, who can easily return to a previous project. It also simplifies the task for reviewers and for those who want to extend the research, and it helps those simply looking to learn. If done properly, the research compendium will help:3

  • Improve the transparency, reliability and reproducibility of research.
  • Simplify peer review.
  • Facilitate data and code sharing.
  • Allow easy extension of the research.
  • Enable learners to understand the research.
  • Make it much easier to transition a new method to production.4
  • Increase research visibility and citations.

Key components in a research project

To make a research project reproducible, it is key to observe that the project produces not one object (i.e. the paper) but four: the data, the research compendium, the key software, and the paper. Figure 1 provides a visual that can help in understanding what each is intended for and how to approach each for reproducibility.

Figure 1: Research process

(1) The data

It is key to differentiate input data and output data:

  • Input data is the data used to answer research questions. It may be open, closed (proprietary or sensitive), or synthetic. Each is approached separately. Some notable aspects:

    • Ideally, open data is used, as this ensures the research can be fully reproducible and the dataset can act as a benchmark if other methods are tried on it. We list a few available open datasets in the discipline-specific catalogue.

    • While input data is ingested and processed by the research code, the data itself is excluded from the research compendium (for example, by listing the data files in the .gitignore). Read more in the input data guide.

  • Output data is data created as part of the research process. Whether (and how) it is shared depends on the circumstances and on its value to later research projects (for example, as an input dataset for other projects). Read about this process here.
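
As a minimal sketch of keeping input data out of the compendium, the Python helper below appends ignore patterns to a .gitignore file when they are not already listed. The function name ensure_ignored and the folder data/input/ are hypothetical and used for illustration only; adapt them to your own project layout.

```python
from pathlib import Path


def ensure_ignored(patterns, gitignore=".gitignore"):
    """Append any patterns missing from the .gitignore file.

    Returns the list of patterns that were actually added.
    """
    gitignore = Path(gitignore)
    # Read the current entries, if the file already exists.
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    present = {line.strip() for line in existing}
    missing = [p for p in patterns if p not in present]
    if missing:
        # Keep existing entries and add the new ones at the end.
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return missing
```

Calling ensure_ignored(["data/input/"]) from the project root would add the pattern once; repeated calls leave the file unchanged.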

(2) The research compendium

The research compendium is a structured object that contains everything needed to recreate the output (i.e. the results shared in the paper). Some key sub-components include:

  • Its logical structure ensures maturity and helps others (or you, later) to peer review, reproduce or extend the research. Read more about how and why to set up an appropriate structure here.

  • The code and workflows ingest and process input data to create relevant outputs (output data, figures, etc.) in an automated fashion (i.e. avoiding manual steps). As mentioned above, however, the input data itself is not part of the compendium and is not republished.

  • The source for the paper is included so that every step from raw input to final paper can be reproduced in an automated fashion. This allows you to regenerate and change the paper as needed.
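
The idea of an automated workflow with no manual steps can be sketched as a single entry point that reads the input data, computes a result, and writes the output. This is only an illustration under stated assumptions: the function names, the CSV layout, and the price column are all hypothetical, and a real project would replace the summary step with its own methodology.

```python
import csv
from pathlib import Path
from statistics import mean


def load_prices(path):
    """Read the (hypothetical) 'price' column from an input CSV file."""
    with open(path, newline="") as f:
        return [float(row["price"]) for row in csv.DictReader(f)]


def summarise(prices):
    """Compute a small summary; stands in for the real analysis step."""
    return {"n": len(prices), "mean_price": mean(prices)}


def write_summary(summary, path):
    """Write the summary as a one-row CSV, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)


def run_pipeline(input_path, output_path):
    """Run every step from input data to output data in one call."""
    summary = summarise(load_prices(input_path))
    write_summary(summary, output_path)
    return summary
```

Because every step is a function call, rerunning run_pipeline on the same input regenerates the same output without any manual intervention.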

It is key to differentiate between active and archived versions of the compendium:

2A. Active research compendium

When a project is active, it is best to use a version control system (such as Git, hosted on GitHub or GitLab) to store and version the whole object. This is a sensible default for research projects.

There is a risk, however, in relying on GitHub/GitLab for long-term archival, as they are not permanent records: a repository can be deleted or changed, and its URL could change.

2B. Archived research compendium

When a research project is complete, the compendium should thus be archived in a repository. The snapshot created at this time should align with the paper that was published. Repositories (such as Zenodo) mint DOIs (Digital Object Identifiers) that are unique and immutable (i.e. don’t change). This ensures that you can find that same snapshot at any point in the future.

Note: Adoption depends on the organization

Use of repositories such as Zenodo depends on the researchers and the organization they are in. It is best to follow your organization’s rules on which repository to publish to. The main idea is to publish to a repository that creates a DOI which is (ideally) immutable.

(3) Key software

Not all code in the research compendium will be written by the researchers from scratch using vanilla code. The code is likely to use packages to abstract away some or most of the complexity and automate the work. There are two sensible rules of thumb that can be applied:

  • If the code is pivotal to the methodology (for instance the package was used to calculate price indices), then it is best to acknowledge this in the paper (by citing it).

  • A snapshot of the computational environment should be taken and included in the research compendium. This ensures that the exact computational requirements can be replicated and that dependencies on specific versions of other packages are recorded.
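
For a project written in Python, one way to take such a snapshot is to record the interpreter version and the exact version of every installed package. The sketch below uses the standard library's importlib.metadata; the function names and the file name requirements.lock.txt are hypothetical. Other ecosystems have their own tooling, so treat this as one possible approach and not a prescribed one.

```python
import sys
from importlib.metadata import distributions
from pathlib import Path


def snapshot_environment():
    """Return sorted 'name==version' pins for all installed packages."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        # Skip installs with incomplete metadata.
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


def write_snapshot(path="requirements.lock.txt"):
    """Write the Python version and package pins to a text file."""
    header = f"# Python {sys.version.split()[0]}"
    lines = [header] + snapshot_environment()
    Path(path).write_text("\n".join(lines) + "\n")
    return lines
```

Committing the resulting file alongside the code lets a later reader see which package versions were in use when the results were produced.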

(4) The paper

Finally, the paper is generated as the output. As the research compendium should (ideally) contain the source needed to generate the paper (such as by using Quarto to render a paper.qmd as a PDF), outputs or images in the paper can be traced back from the paper to the code that generated them.
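
As a rough sketch, the rendering step can itself be scripted so that it runs as the last stage of the workflow. The helper below wraps the Quarto command line call quarto render paper.qmd --to pdf; the function names are hypothetical, and it assumes the Quarto CLI is installed and available on the PATH.

```python
import shutil
import subprocess


def build_render_command(source="paper.qmd", output_format="pdf"):
    """Build the Quarto command that renders the paper source."""
    return ["quarto", "render", source, "--to", output_format]


def render_paper(source="paper.qmd", output_format="pdf"):
    """Render the paper, failing loudly if Quarto is missing or errors."""
    if shutil.which("quarto") is None:
        raise RuntimeError("Quarto CLI not found on PATH")
    # check=True raises CalledProcessError if rendering fails.
    subprocess.run(build_render_command(source, output_format), check=True)
```

Calling render_paper() after the analysis has finished regenerates the PDF from the current code and outputs.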


Footnotes

  1. A good overview of reproducible research is provided by The Turing Way. The book also covers open research and many other topics.

  2. See Research Compendia in The Turing Way for more detail on this concept.

  3. See a more in-depth overview of the benefits of reproducible research, as well as common barriers.

  4. Research compendia are conceptually quite similar to Reproducible Analytical Pipelines (RAPs), although the latter focus more on production processes. Hence, if adopting a compendium structure makes the research easy to reproduce, operationalizing a new method could be dramatically simplified. For more on RAPs in price statistics, see a RAP class recently done by ESCAP, as well as a good paper by Price and Marques (2023) showcasing RAP for production processes.