How to cite open data
Citing an open dataset for a research project is straightforward if it is archived in an open research repository like Zenodo. In this case it will have the relevant metadata, along with a DOI, to make it easily citeable, even if access to these data is restricted. Unfortunately open datasets do not always reside in a repository like Zenodo. A common place for open datasets in statistics, and in particular price statistics, is as part of an R package on CRAN. Citing these datasets is not difficult—all CRAN packages have a DOI and are easily cited—but, as part of a package, data are more difficult to access.
Example: {PriceIndices}
The {PriceIndices} R package comes with several datasets that can be useful for comparing different index-number methods. Citing these data is easy because R packages are highly citeable.
@Manual{,
title = {PriceIndices: Calculating Bilateral and Multilateral Price Indexes},
author = {Jacek Białek},
year = {2025},
doi = {10.32614/CRAN.package.PriceIndices},
url = {https://cran.r-project.org/package=PriceIndices},
note = {R package version 0.2.3}
}
Using these data is easy enough with R. Simply install the package and the datasets become available from within R.
# install.packages("PriceIndices")
head(PriceIndices::coffee)
Using these datasets is less convenient with, say, Python. One option is to simply export them in an interoperable format like Apache Parquet from R.
::write_parquet(PriceIndices::coffee, "coffee.parquet") arrow
Now it is simple to use these data with Python.
import pandas as pd
"coffee.parquet").head() pd.read_parquet(
Another approach, and one that doesn’t use R, is to simply download the {PriceIndices} package from CRAN and use the {rdata} Python package to read the datasets in Python.
curl -s https://cran.r-project.org/src/contrib/PriceIndices_0.2.3.tar.gz -o PriceIndices_0.2.3.tar.gz
tar -vxzf PriceIndices_0.2.3.tar.gz PriceIndices/data
import rdata
import pandas as pd
def date_constructor(obj, attr):
return pd.to_datetime(obj, unit="D")
= {**rdata.conversion.DEFAULT_CLASS_MAP,
constructor_dict "Date": date_constructor}
= rdata.read_rda("PriceIndices/data/coffee.rda",
coffee =constructor_dict)
constructor_dict
"coffee"]).head() pd.DataFrame(coffee[