How can you contribute?
If you have a dataset that you would like to register on the catalogue, the following process outlines how to do this. Figure 1 outlines the process at a high level, with details on each step below.
Before you start
Requirements to contribute to the data catalogue
In order to contribute to the catalogue, the following criteria must be met:
- The dataset should be publicly available for researchers. Proprietary datasets could in theory also be listed; however, until the price statistics reproducibility project establishes a process for this, we request that only fully open datasets are registered. We welcome requests for valuable proprietary datasets, but these will not be catalogued until the process is fleshed out.
- The dataset must be related to the price statistics discipline. Price statisticians most typically track changes in prices, such as through price index methods, so the dataset should support this use case. Other use cases, such as machine learning applications for classification, can also be submitted, but should be as close to the needs of the discipline as possible.
- The dataset must be of value to the discipline. Data catalogues that are too lax with the cataloguing process become filled with datasets of only incremental value. As a result, users struggle to find the highly valuable datasets, which eventually causes a drop-off in use of the catalogue. To avoid this, the value of the dataset to researchers in the discipline should be clear. The reproducibility team will review and approve each proposed submission during its meetings.
- The contributor must document the dataset in full when the dataset is registered. Partially documented datasets on the catalogue would detract from the user experience and thus from the push to be open. A maturity model for registered datasets is provided below to show how to document a dataset.
- The dataset should be real, although artificial and modified datasets are accepted if they are of value to reproducibility. Specifically, synthetic datasets generated as part of a research process may not need to be registered if they can be reproduced through code published with the research; in that case we recommend that the code that generated them be made publicly available as part of that research project's research compendium.
- Some synthetic/artificial datasets may still be proposed, for example when an artificial dataset has become highly valuable and is used widely as if it were a real dataset (such as the Turvey dataset).
- The dataset must already be published somewhere easy to download from, in a machine-readable format. Make sure that users can download and use the dataset easily. Several options for hosting a dataset are possible (see the next section on where to host datasets).
Where to host the dataset itself
As the price statistics data catalogue describes datasets already hosted elsewhere (in other words, it is not a data repository), the first step is to host the dataset somewhere. The choice of host is fundamentally up to the researcher and the institution they work at. Ideally, a data repository is used that mints a Digital Object Identifier (DOI) so that the dataset can be easily cited and found. Data repositories often allow a researcher to host private datasets and lock down access while still creating a DOI. Regardless of where the dataset is hosted, the DOI should be ready when the dataset is registered in the catalogue so that researchers can find the dataset itself and know how to cite it.
Read more about data repositories in the Turing Way.
How to register a dataset on the catalogue?
Once the above aspects are considered, a dataset can be registered to the catalogue.
The catalogue uses the open-source Python package datacontract-cli to render each dataset record as a static HTML site from a structured YAML file that stores the relevant information. In short, if a dataset's YAML file is present in the /datasets/ folder of the data catalogue repository, a GitHub workflow renders it as an HTML page alongside the other datasets.
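As a rough illustration of that rendering step, a workflow along the following lines could do the job. This is a sketch only: the workflow file name, output layout, and exact datacontract-cli invocation are assumptions for illustration, not the catalogue's actual configuration.

```yaml
# .github/workflows/render-catalogue.yml -- illustrative sketch only;
# the catalogue's real workflow may differ in names, paths, and options.
name: Render data catalogue
on:
  push:
    branches: [main]
jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # datacontract-cli is the package the catalogue uses for rendering
      - run: pip install datacontract-cli
      - name: Render each dataset record to HTML
        run: |
          mkdir -p site
          for f in datasets/*.yaml; do
            datacontract export --format html \
              --output "site/$(basename "${f%.yaml}").html" "$f"
          done
```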
Read more about the configuration of the YAML file and the open standard it is based on here.
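For illustration, a minimal record could look like the following. The dataset, file name, and field values here are entirely hypothetical; the exact fields to use should follow the Data Contract Specification linked above.

```yaml
# datasets/example-price-quotes.yaml -- hypothetical record for illustration
dataContractSpecification: 1.1.0
id: example-price-quotes
info:
  title: Example Price Quotes Dataset
  version: 1.0.0
  description: >
    Monthly price quotes for a basket of consumer goods, suitable for
    testing price index methods.
  owner: example-research-group
terms:
  usage: Open for research use.
models:
  price_quotes:
    description: One row per product per month.
    type: table
    fields:
      product_id:
        type: string
        description: Unique identifier for the product.
      month:
        type: string
        description: Observation month in YYYY-MM format.
      price:
        type: decimal
        description: Observed price in the local currency.
```

Before proposing a record, you can check that a file like this parses correctly with datacontract-cli's `lint` command (`datacontract lint <file>`).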
There are two ways to request that a dataset be registered to the catalogue:
Option 1: Document the dataset yourself
The best way is to directly propose a dataset and start the process of registering it (such as by fully describing the dataset):
- Fork the catalogue repo and mock up a new dataset in the /datasets/ directory of your fork.
- Submit a PR from your fork to the main catalogue repository. Please tag the PR with the dataset label.
- We will review your request and coordinate with you as appropriate (such as if we need more information), or help further flesh out the metadata so that the dataset is well defined before it is published. We may work with you to propose more detailed documentation based on the maturity levels (see below).
- For instance, if you see value in writing a data exploration blog post to describe the dataset, we can work with you to do so and make it part of the reproducibility project site.
When we are ready, we will either merge your PR, and thus register your dataset, or reject the dataset if it is not appropriate.
Option 2: Request that we consider a dataset
You can also just give us a ping to let us know that a good dataset exists, and we can review and register it when we get a chance.
- Create a new issue and describe the dataset you would like us to register. Include the relevant details that we can use to find out more about the dataset. Please also tag the issue with the dataset label.
Maturity model of registered datasets
As not all datasets can be documented to the same level, we've structured the documentation requirements as a maturity model to aspire to for each dataset record, ranging from the baseline up to the top 'gold standard' level.
Level 1
This level sets the bare minimum a dataset needs to meet to be registered to the catalogue.
Level 2
Level 2 implies a higher level of maturity, simplifying the process for data users.
Level 3
This level implies a 'gold standard' for a dataset record.