How to cite open data

Published

2025-04-28

Citing an open dataset for a research project is straightforward if it is archived in an open research repository like Zenodo. In this case it will have the relevant metadata, along with a DOI, to make it easily citable, even if access to these data is restricted. Unfortunately, open datasets do not always reside in a repository like Zenodo. A common place for open datasets in statistics, and in particular price statistics, is as part of an R package on CRAN. Citing these datasets is not difficult, as all CRAN packages have a DOI and are easily cited, but data bundled in a package are more difficult to access, particularly from outside R.

Example: {PriceIndices}

The {PriceIndices} R package comes with several datasets that can be useful for comparing different index-number methods. Citing these data is easy because R packages are highly citable, for example with a BibTeX entry like this one.

@Manual{,
  title = {PriceIndices: Calculating Bilateral and Multilateral Price Indexes},
  author = {Jacek Białek},
  year = {2025},
  doi = {10.32614/CRAN.package.PriceIndices},
  url = {https://cran.r-project.org/package=PriceIndices},
  note = {R package version 0.2.3}
}

Using these data is easy enough with R. Simply install the package and the datasets become available from within R.

# install.packages("PriceIndices")
head(PriceIndices::coffee)

Using these datasets is less convenient with, say, Python. One option is to export them from R in an interoperable format like Apache Parquet, here using the {arrow} R package.

arrow::write_parquet(PriceIndices::coffee, "coffee.parquet")

Now it is simple to use these data with Python, provided pandas has a Parquet engine such as pyarrow installed.

import pandas as pd

pd.read_parquet("coffee.parquet").head()

Another approach, and one that doesn’t use R, is to download the {PriceIndices} source package from CRAN, extract its data directory, and use the {rdata} Python package to read the datasets in Python.

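# -f makes curl fail on an HTTP error instead of saving the error page;
# once a newer version is released, look under src/contrib/Archive/PriceIndices/
curl -fsS https://cran.r-project.org/src/contrib/PriceIndices_0.2.3.tar.gz -o PriceIndices_0.2.3.tar.gz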
tar -vxzf PriceIndices_0.2.3.tar.gz PriceIndices/data
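
For readers who would rather stay in Python for this step too, the download and extraction can be sketched with the standard library alone. This is only a minimal sketch, not part of the workflow above: the function names `extract_data_dir` and `fetch_package_data` are made up for illustration, and the URL is simply the one used in the shell commands.

```python
import tarfile
import urllib.request
from pathlib import Path

# Same tarball as in the curl command; assumed to still be at this location.
URL = "https://cran.r-project.org/src/contrib/PriceIndices_0.2.3.tar.gz"


def extract_data_dir(tarball, dest=".", prefix="PriceIndices/data/"):
    """Extract only the regular files under `prefix` from a .tar.gz into `dest`."""
    dest = Path(dest)
    extracted = []
    with tarfile.open(tarball, "r:gz") as tar:
        for member in tar.getmembers():
            # Skip directories, links, and anything outside the data directory.
            if not (member.isfile() and member.name.startswith(prefix)):
                continue
            # Refuse member names that try to climb out of `dest`.
            if ".." in Path(member.name).parts:
                continue
            target = dest / member.name
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            extracted.append(target)
    return extracted


def fetch_package_data(url=URL, dest="."):
    """Download the source tarball, then extract its data directory."""
    tarball, _ = urllib.request.urlretrieve(url, Path(url).name)
    return extract_data_dir(tarball, dest=dest)


# Example (needs network access):
# fetch_package_data()
```

Calling `fetch_package_data()` should leave the same `PriceIndices/data` directory on disk as the `tar` command, ready to be read with {rdata}.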

import rdata
import pandas as pd


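# R stores Date values as days since 1970-01-01; convert them to pandas
# datetimes. The attributes argument is not needed for this conversion.
def date_constructor(obj, attr):
    return pd.to_datetime(obj, unit="D")

# Extend rdata's default class map so objects of R class "Date" are
# passed through the constructor above.
constructor_dict = {**rdata.conversion.DEFAULT_CLASS_MAP,
                    "Date": date_constructor}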

coffee = rdata.read_rda("PriceIndices/data/coffee.rda",
                        constructor_dict=constructor_dict)

pd.DataFrame(coffee["coffee"]).head()
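
The `unit="D"` argument in `date_constructor` works because R stores a Date as a count of days since 1970-01-01. A minimal standard-library sketch of the same conversion makes the arithmetic explicit. The helper name `r_date_to_python` is hypothetical, and mapping a missing value to `None` is an assumption made for this illustration.

```python
import math
from datetime import date, timedelta

R_EPOCH = date(1970, 1, 1)  # origin that R uses for Date values


def r_date_to_python(days):
    """Convert an R Date (days since 1970-01-01) to a datetime.date.

    Missing values (None or NaN) are returned as None.
    """
    if days is None or (isinstance(days, float) and math.isnan(days)):
        return None
    # R keeps dates as doubles, so drop any fractional day before adding.
    return R_EPOCH + timedelta(days=math.floor(days))


print(r_date_to_python(0))      # 1970-01-01
print(r_date_to_python(18262))  # 2020-01-01
```

In practice `pd.to_datetime(obj, unit="D")` does this for a whole column at once, so the helper is only meant to show what the constructor relies on.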