Draft

What to do with input data to your project

Common best practices, whether data is sensitive or not

Published

November 25, 2025

Caution: WIP

This page is still in the works. Overview of possible topics to summarize:

  • Summarize the main types of input data:

    • Open data - data that is openly available and can be downloaded and used for analysis. We aim to document the most valuable open datasets in the open data catalogue. Open data makes reproducibility easy because anyone can obtain the same inputs, and these datasets can become benchmarks for empirical studies.

    • Proprietary data - this data is available in principle, but it may need to be purchased. If the researcher's organization has access to the data, the dataset can be used for research. However, most researchers do not have access to it, and the structure and exact details (i.e. metadata) of the dataset are often not summarized.

    • Sensitive data - most data at NSOs is provided to the organization by data owners in that country, and is unlikely ever to be available to researchers outside the NSO that holds the data.

      • NOTE: This is the most common case. Here, creating a synthetic dataset that simulates the sensitive one could help make the project at least replicable.
  • Either download the raw data directly, or save downloaded input data into the /data/ folder and use .gitignore to ensure that the raw data is not committed to git - as per the deeper dive on structure:

    • If the data and analysis are simple, the analysis scripts in /src/analysis/ take the data and generate the relevant outputs (data and visuals) in /output/.

    • If data cleaning is more complex, you can create a /data/raw/ folder, a /data/clean/ folder, and a /src/data_cleaning.py script that converts raw data to clean data (before analysis). This way anyone can reproduce the process and rerun the analysis with new data, because they can see exactly how the data is preprocessed before analysis.

  • Use pre-commit hooks to ensure that analysis notebooks are committed without rendered output, which may be sensitive (see: pre-commit hooks to ensure privacy).

  • Make sure you don’t commit other sensitive information, such as access tokens or secrets, along with the code and write-up. These can be set up so that others can repeat the process without the values ever being committed to git.

  • Of course, avoid mentioning sensitive details in the prose (for example, the documentation).
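The notebook item above could be implemented in several ways. As a minimal sketch, assuming notebooks are stored in the standard .ipynb JSON layout, the script below clears outputs and execution counts from code cells. The function names `strip_outputs` and `main` are hypothetical, and how the script is wired into a hook is left open. Established tools such as nbstripout cover more edge cases and are likely a better choice in practice.

```python
import json
import sys
from pathlib import Path


def strip_outputs(notebook: dict) -> bool:
    """Clear outputs and execution counts of code cells, in place.

    Returns True if the notebook was changed.
    """
    changed = False
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no outputs
        if cell.get("outputs"):
            cell["outputs"] = []
            changed = True
        if cell.get("execution_count") is not None:
            cell["execution_count"] = None
            changed = True
    return changed


def main(paths: list[str]) -> int:
    """Strip every notebook in paths; return 1 if any file was rewritten."""
    modified = False
    for name in paths:
        path = Path(name)
        notebook = json.loads(path.read_text(encoding="utf-8"))
        if strip_outputs(notebook):
            text = json.dumps(notebook, indent=1, ensure_ascii=False) + "\n"
            path.write_text(text, encoding="utf-8")
            print(f"stripped outputs from {path}")
            modified = True
    # A non-zero exit code makes a git hook stop the commit, so the
    # cleaned files can be reviewed and staged again.
    return 1 if modified else 0


# Only act when notebook paths are passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Saved under a hypothetical name such as strip_notebooks.py, it would be called with the staged notebook paths as arguments. Rewriting the file and failing the commit is a deliberate choice here: the author sees that outputs were removed before anything reaches the repository.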

NOTE: These best practices are key when you use sensitive or confidential data. For public data, .gitignore is still good practice so that you don’t repost the raw data. This page should also touch upon how researchers should approach proprietary datasets (#46).
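For the sensitive-data case noted earlier, a synthetic stand-in dataset could start from something like the sketch below. Everything in the schema (the column names, categories and value ranges) is invented for illustration and would need to be replaced with the documented metadata of the real dataset. Values are drawn independently per column, so the result mimics only the structure of the data, not its distributions or correlations.

```python
import csv
import random
from pathlib import Path

# Hypothetical schema: each column maps to a function that produces one
# value from a random generator and the row index. None of these names
# or ranges come from a real dataset.
SCHEMA = {
    "household_id": lambda rng, i: i + 1,
    "region": lambda rng, i: rng.choice(["North", "South", "East", "West"]),
    "household_size": lambda rng, i: rng.randint(1, 8),
    "monthly_income": lambda rng, i: round(rng.uniform(200.0, 5000.0), 2),
}


def make_synthetic_rows(n_rows: int, seed: int = 0) -> list[dict]:
    """Generate n_rows random records that follow SCHEMA."""
    rng = random.Random(seed)  # fixed seed so the same rows are regenerated
    return [
        {name: make(rng, i) for name, make in SCHEMA.items()}
        for i in range(n_rows)
    ]


def write_csv(rows: list[dict], path: Path) -> None:
    """Write the rows to a CSV file, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(SCHEMA))
        writer.writeheader()
        writer.writerows(rows)


# Example use (the output path is an assumption, not a project convention):
# write_csv(make_synthetic_rows(100, seed=42), Path("data/synthetic/households.csv"))
```

Because the generator is seeded, the synthetic file itself does not need to be committed: anyone can regenerate the same rows from the script.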
