How can you contribute?
If you have a dataset that you would like to register on the catalogue, the following process outlines how to do this. Figure 1 outlines the process at a high level, with details on each step below.
Before you start
Requirements to contribute to the data catalogue
In order to contribute to the catalogue, the following criteria must be met:
- The dataset should be publicly available for researchers. Proprietary datasets could in theory also be listed; however, until the price statistics reproducibility project establishes a process for this, we request that only fully open datasets are registered. We welcome requests for valuable proprietary datasets, but these will not be catalogued until the process is fleshed out.
- The dataset must be related to the price statistics discipline. Price statisticians most typically track changes in prices, such as through price index methods, so the dataset should support this use case. Other use cases, such as machine learning applications for classification, can also be submitted, but should be as close to the needs of the discipline as possible.
- The dataset must be of value to the discipline. Data catalogues that are too lax with the cataloguing process become filled with datasets of only incremental value. As a result, users struggle to find the highly valuable datasets, which eventually causes a drop-off in use of the catalogue. To avoid this, the value of the dataset to researchers in the discipline should be clear. The reproducibility team will review and approve each proposed submission during its meetings.
- The contributor must document the dataset in full when the dataset is registered. Partially documented datasets on the catalogue would detract from the user experience and thus from the push to be open. A maturity model for registered datasets is provided below to show how to document a dataset.
- The dataset should be real, although artificial and modified datasets are accepted if they are of value to reproducibility. Specifically, synthetic datasets generated as part of a research process may not need to be registered if they can be reproduced through code published with the research; in that case we recommend that the code that generated them be made publicly available as part of that research project's research compendium.
- Some synthetic/artificial datasets may still be proposed, for example when an artificial dataset has become highly valuable and is used widely as if it were a real dataset (such as the Turvey dataset).
- The dataset must already be published somewhere easy to download from, in a machine-readable format. Make sure that users can download and use the dataset easily. Several options for hosting a dataset are possible (see the next section on where to host datasets).
Where to host the dataset itself
As the price statistics data catalogue describes datasets already hosted elsewhere (in other words, it is not a data repository), the first step is to host the dataset somewhere. The choice of host is fundamentally up to the researcher and the institution they work at. Ideally, a data repository is used that mints a Digital Object Identifier (DOI) so that the dataset can be easily cited and found. Data repositories often allow a researcher to host private datasets and lock down access while still creating a DOI. Regardless of where the dataset is hosted, the DOI should be ready when the dataset is registered in the catalogue so that researchers can find the dataset itself and know how to cite it.
Read more about data repositories in the Turing Way.
How to register a dataset on the catalogue?
Once the above aspects are considered, a dataset can be registered to the catalogue.
The catalogue uses the open-source Python package datacontract-cli to render each dataset record as a static HTML site from a structured YAML file that stores the relevant information. In short, if a dataset's YAML file is present in the /datasets/ folder of the data catalogue repository, a GitHub workflow renders it as an HTML page alongside the other datasets.
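As a rough illustration of that rendering step, a workflow along the following lines could do the job. This is a sketch only: the workflow file name, output layout, and exact datacontract-cli invocation are assumptions for illustration, not the catalogue's actual configuration.

```yaml
# .github/workflows/render-catalogue.yml -- illustrative sketch only;
# the catalogue's real workflow may differ in names, paths, and options.
name: Render data catalogue
on:
  push:
    branches: [main]
jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # datacontract-cli is the package the catalogue uses for rendering
      - run: pip install datacontract-cli
      - name: Render each dataset record to HTML
        run: |
          mkdir -p site
          for f in datasets/*.yaml; do
            datacontract export --format html \
              --output "site/$(basename "${f%.yaml}").html" "$f"
          done
```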
Read more about the configuration of the YAML file and the open standard it is based on here.
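For illustration, a minimal record could look like the following. The dataset, file name, and field values here are entirely hypothetical; the exact fields to use should follow the Data Contract Specification linked above.

```yaml
# datasets/example-price-quotes.yaml -- hypothetical record for illustration
dataContractSpecification: 1.1.0
id: example-price-quotes
info:
  title: Example Price Quotes Dataset
  version: 1.0.0
  description: >
    Monthly price quotes for a basket of consumer goods, suitable for
    testing price index methods.
  owner: example-research-group
terms:
  usage: Open for research use.
models:
  price_quotes:
    description: One row per product per month.
    type: table
    fields:
      product_id:
        type: string
        description: Unique identifier for the product.
      month:
        type: string
        description: Observation month in YYYY-MM format.
      price:
        type: decimal
        description: Observed price in the local currency.
```

Before proposing a record, you can check that a file like this parses correctly with datacontract-cli's `lint` command (`datacontract lint <file>`).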
There are two ways to request that a dataset be registered to the catalogue:
Option 1: Document the dataset yourself
The best way is to directly propose a dataset and start the process of registering it (such as by fully describing the dataset):
- Fork the catalogue repo and mock up a new dataset in the /datasets/ directory of your fork.
- Submit a PR from your fork to the main catalogue repository. Please tag the PR with the dataset label.
- We will review your request and coordinate with you as appropriate (such as if we need more information), or help further flesh out the metadata so that the dataset is well defined before it is published. We may work with you to propose more detailed documentation based on the maturity levels (see below).
- For instance, if you see value in writing a data exploration blog post to describe the dataset, we can work with you to do so and make it part of the reproducibility project site.
When we are ready, we will either merge your PR, and thus register your dataset, or reject the dataset if it is not appropriate.
Option 2: Request that we consider a dataset
You can also just give us a ping to let us know that a good dataset exists, and we can review and register it when we get a chance.
- Create a new issue and describe the dataset you would like us to register. Include the relevant details that we can use to find out more about the dataset. Please also tag the issue with the dataset label.
Maturity model of registered datasets
As not all datasets can be documented to the same level, we've structured the documentation requirements as a maturity model to aspire to for each dataset record, ranging from the baseline up to the top 'gold standard' level.
Level 1
This level sets the bare minimum a dataset needs to meet to be registered to the catalogue.
Level 2
Level 2 implies a higher level of maturity, simplifying the process for data users.
Level 3
This level implies a 'gold standard' for a dataset record.