How to approach reproducibility
What is reproducibility?
For robust scientific progress, new methods or approaches should be confirmed independently before they are widely adopted. The goal of appropriately structuring and sharing research objects in a transparent fashion is to simplify this peer review process.1 The primary way this is achieved is by creating and publicly publishing a research compendium alongside the paper. A research compendium is a collection of the digital parts of the research project that support reuse, including data, code, protocols, metadata, etc.2
Why does it help?
The main principle of the research compendium is to make all information about the project publicly available and structured in a clear, logical way so that it is straightforward to use. This helps the researchers themselves, who can easily return to a previous project; it simplifies the task for reviewers and for those who want to extend the research; and it supports those simply looking to learn. If done properly, the research compendium will help:3
- Improve the transparency, reliability and reproducibility of research.
- Simplify peer review.
- Facilitate data and code sharing.
- Allow easy extension of the research.
- Enable learners to understand the research.
- Make it much easier to transition a new method to production.4
- Increase research visibility and citations.
How to structure the research compendium
The guidance on research compendia is at an interim stage. The project team will flesh it out to provide more clarity on the needs of the discipline and on how researchers can ramp up gradually (it is not an all-or-nothing task).
Overview of the structure
In a nutshell, a compendium includes all research objects necessary to reproduce the research. In practice, it is often a git repository (hosted on, say, GitHub) with a structure similar to the one in Figure 1 below.
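As a rough sketch (the folder and file names below are illustrative rather than prescriptive), such a repository might be laid out as follows:

```
compendium/
├── data/              # raw input data (not version controlled)
├── src/               # functions and other processing code
├── output/            # derived data and results
├── docs/              # project documentation
├── LICENSE            # terms of reuse
├── .gitignore         # files and folders git should not track
├── requirements.txt   # environment specification (Python example)
└── README.md          # introduction and usage instructions
```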
A little about each aspect
A data folder that indicates where to store the raw dataset used for the research project. Ideally the researcher uses an open dataset (which makes the whole process reproducible), but they may also use a proprietary dataset.
The dataset that acts as the main input to the research should not be version controlled. The folder is created so that, when researchers work from a local copy of the compendium, they know where to put the data so that the code replicates the results in full. Technically, this means making sure that the .gitignore file skips this data file.
A folder for functions (or other code) that help process your data into the relevant outputs. This could include the code to clean and prepare the raw dataset for research purposes, the code that runs the processing and research experiments, as well as notebooks where the data is explored and the various elements that feed the research paper are generated.
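For illustration only, code in this folder might look like the following minimal sketch in Python; the file paths and column names are hypothetical placeholders, not part of any exemplar.

```python
import pandas as pd


def clean_prices(raw_path: str) -> pd.DataFrame:
    """Turn the raw price file into an analysis-ready table.

    Illustrative sketch only: the column names are placeholders.
    """
    df = pd.read_csv(raw_path)

    # Keep only the columns the analysis needs and drop incomplete rows
    df = df[["product_id", "period", "price"]].dropna()

    # Enforce consistent types so downstream code behaves predictably
    df["period"] = pd.to_datetime(df["period"])
    df["price"] = df["price"].astype(float)

    return df


if __name__ == "__main__":
    # Raw data lives in the data folder; derived data goes to the output folder
    clean_prices("data/raw_prices.csv").to_csv("output/clean_prices.csv", index=False)
```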
A folder for the output data. This data can be versioned with the repository (if it is not sensitive), allowing researchers to verify that they have replicated the process. Note that if the output data can be used for research in its own right, it may be appropriate to register it on a public repository (such as Zenodo) that mints a DOI.
A folder for documentation that explains key aspects of the research. This folder stores project documentation, such as the project design; the code itself should also be well documented.
A license. This will tell users how they can use the code.
A .gitignore file. Some files and folders should not be version controlled; a notable example is datasets.
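A minimal sketch of such a file, assuming the illustrative folder names used above, might be:

```
# Raw input data is not version controlled
data/*
# An empty .gitkeep file is a common convention to keep the folder itself in the repository
!data/.gitkeep

# Local clutter that should never be committed
__pycache__/
.ipynb_checkpoints/
```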
A file to recreate the environment in which the code runs, so that it runs identically. A shift from one package version to another may change the output, so it is key to record exactly how to replicate the same environment and get the same result.
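For a Python project, for example, this could be a pinned requirements.txt (the packages and versions below are purely illustrative); other ecosystems have equivalents such as renv.lock for R or environment.yml for conda.

```
# requirements.txt -- illustrative pins only
pandas==2.1.4
numpy==1.26.3
matplotlib==3.8.2
```

The environment can then be recreated with a single command, e.g. pip install -r requirements.txt.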
Finally, a README to introduce the project. This will be the landing place when someone navigates to the repository, hence it should describe the project at a glance.
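As a sketch, a README might cover sections along these lines (the headings are only a suggestion):

```markdown
# Project title

One-paragraph summary of the research question and approach.

## Data
Where the raw data comes from and where to place it locally (e.g. in data/).

## Reproducing the results
How to recreate the environment and run the code, step by step.

## Repository structure
What each folder contains.

## License and citation
How the work may be reused and how to cite it.
```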
Notable example
To showcase an exemplar for price statistics, we created a mock-up price index pipeline that researchers can explore.
Footnotes
A good overview of reproducible research is provided by The Turing Way. The book also covers open research and many other topics.↩︎
See Research Compendia in The Turing Way for more detail on this concept.↩︎
See a more in-depth overview of the benefits of reproducible research, as well as common barriers.↩︎
Research compendia are conceptually quite similar to Reproducible Analytical Pipelines (RAPs), although the latter focus more on production processes. Hence, if the research is made easy to reproduce by adopting a compendium structure, operationalizing a new method could be dramatically simplified. For more on RAPs in price statistics, see a RAP class recently run by ESCAP, as well as a paper by Price and Marques (2023) showcasing RAP for production processes.↩︎
See overview and explanations of version control in The Turing Way for more info.↩︎