CPI Production Systems

Interim Report with Survey Analysis

Collin Brown

Statistics Canada

Steve Martin

Statistics Canada

Upfront Admin

This work was done by Workstream 4 (CPI Systems and Architecture), part of the UN Task Team on Scanner Data under the UN Committee of Experts on Big Data and Data Science for Official Statistics.

We would like to acknowledge the team members of Workstream 4 who provided insights, helped draft survey questions, and supported this empirical assessment.

Note: Please consider reading the full interim report that accompanies this presentation.

CPI Production Systems Background

  • CPI Production Systems involve significant amounts of code, documentation, and other non-code artifacts (e.g., Excel Workbooks).

  • These systems carry out complex business logic in order to transform input data into output data.

  • These systems are often developed entirely or in large part by people with domain expertise but without training in software engineering.

Why Run This Survey?

  1. Many CPI Production Systems teams struggle with managing system complexity.

  2. The state of CPI Production Systems around the world is unknown (e.g., how systems are organized, how often they are updated).

  3. Provide practical advice based on the current state of systems.

Survey Concepts

  • The goal is to characterize system layout, team organization, tool use, and system performance metrics.

  • How can important aspects of system architecture be articulated in a short survey?

We need a simple conceptual model to communicate key system ideas.

Survey Concepts - Systems

We define a system as any indivisible (atomic) software component that takes one or more data inputs and produces one or more data outputs.

Survey Concepts - Flow of Change

From left to right, we go from raw data to the production of the CPI.

Figure 1: Flow of Change

The flow of change is based loosely on the Generic Statistical Business Process Model (GSBPM).

Survey Concepts - Monolithic vs. Modular Architectures

Figure 2: Perfect Monolith Example

Figure 3: Perfect Modular Example
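
The two extremes in Figures 2 and 3 can be sketched in code. This is only an illustration under stated assumptions: it takes a "perfect monolith" to be a single system covering every step in the flow of change, and a "perfect modular" architecture to be exactly one system per step. The function name `describe_architecture` and the label "mixed" are hypothetical and are not survey terms.

```python
# Steps in the flow of change, left to right (names as used in the survey results).
FLOW = (
    "Data ingestion",
    "Data processing",
    "Elementary indexes",
    "Aggregation",
    "Finalization",
)


def describe_architecture(systems):
    """Label an architecture given the steps each system covers.

    `systems` is a list of sets of step names, one set per system.
    ASSUMPTION: "perfect monolith" = one system covering every step;
    "perfect modular" = exactly one system per step; anything else is "mixed".
    """
    covered = set().union(*systems)
    unknown = covered - set(FLOW)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    if len(systems) == 1 and systems[0] == set(FLOW):
        return "perfect monolith"
    one_step_each = all(len(steps) == 1 for steps in systems)
    if one_step_each and len(systems) == len(FLOW) and covered == set(FLOW):
        return "perfect modular"
    return "mixed"
```

For example, a single system spanning all five steps is labelled a perfect monolith, while two systems that split the flow after elementary indexes fall into the "mixed" category.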

Survey Concepts - Teams

A team is defined as a group of individuals who maintain one or more systems.

Team Type | Description
Corporate IT | IT professionals not in the price statistics team.
Domain-Embedded IT | IT professionals in the price statistics team.
Domain-Embedded Analysts | Non-IT professionals in the price statistics team.
Non-Domain Analysts | Non-IT professionals not in the price statistics team.
External Consultants or Contractors | Professionals outside of the organization to whom work is contracted.

Survey Concepts - 4 Team Types

We define 4 team types for the purposes of this survey.

Team Type | Description
Stream-Aligned | Domain-Embedded IT and/or Domain-Embedded Analysts.
IT-Only | Corporate IT and/or Domain-Embedded IT.
Analyst-Only | Domain-Embedded Analysts and/or Non-Domain Analysts.
Other Mix | Something other than the above (e.g., domain analysts and corporate IT).
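
The four team types can be read as a classification rule over the member types listed earlier. The sketch below is one possible reading, not the survey's actual coding procedure. The definitions overlap (a team made up only of Domain-Embedded IT fits both Stream-Aligned and IT-Only), so the code assumes a precedence order of Stream-Aligned, then IT-Only, then Analyst-Only. That precedence, the `Member` enum, and the function name `classify_team` are assumptions introduced for illustration.

```python
from enum import Enum


class Member(Enum):
    CORPORATE_IT = "Corporate IT"
    DOMAIN_IT = "Domain-Embedded IT"
    DOMAIN_ANALYSTS = "Domain-Embedded Analysts"
    NON_DOMAIN_ANALYSTS = "Non-Domain Analysts"
    EXTERNAL = "External Consultants or Contractors"


def classify_team(members):
    """Map the set of member types maintaining a system to one of 4 team types."""
    members = set(members)
    if not members:
        raise ValueError("a team needs at least one member type")
    # ASSUMPTION: the definitions overlap, so the first matching rule wins.
    if members <= {Member.DOMAIN_IT, Member.DOMAIN_ANALYSTS}:
        return "Stream-Aligned"
    if members <= {Member.CORPORATE_IT, Member.DOMAIN_IT}:
        return "IT-Only"
    if members <= {Member.DOMAIN_ANALYSTS, Member.NON_DOMAIN_ANALYSTS}:
        return "Analyst-Only"
    return "Other Mix"
```

Under this reading, the example from the table (domain analysts together with corporate IT) is classified as Other Mix, as is any team that includes external consultants or contractors.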

Results - Which Steps are Systems Coupled Across?

  • Hypothesis: Monolithic architectures don’t scale with the complexity introduced by alternative data sources (ADS).

  • Systems coupled across only data ingestion, data processing, and elementary indexes might be associated with a piece of a monolithic system “breaking off”.

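Counts (#) by combination of steps that systems are coupled across (Data ingestion, Data processing, Elementary indexes, Aggregation, Finalization); the step combination for each count is not shown: 21, 6, 0, 6, 12, 8, 11, 4, 20.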

Results - Which Team Combinations are Common Within the Flow of Change?

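Counts (#) by combination of team member types (Corporate IT, Domain IT, Domain Analysts, Other Analysts, Consultants); the combination for each count is not shown: 120, 25, 33, 20, 11, 45.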

Results - Version Control System (VCS) Use

Figure 4: Version Control Software Use
  • 40 respondents reported not using a VCS at all or using file-naming conventions only. E.g.,

    • analysis_v5_final_2025_03.py

    • analysis_v6_FINAL_2025_03_edits.py

  • 21 respondents reported using some combination of GitHub/GitLab, or just using Git locally.

  • Recommendation: Use Git to version code, documentation, and configuration (but not Excel Workbooks)!

Results - Commercial Software Use

Figure 5: Commercial Software Use
  • Over two thirds of respondents reported using Microsoft Excel in their CPI Production Systems.

  • Spreadsheets are useful for tabular data analysis; however, they are not an ideal tool for expressing complex business logic.

  • Excel Workbooks encourage a high degree of coupling between business logic and data.
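
To illustrate the coupling point above: in a workbook, a formula typically lives in the same cells as the data it operates on, whereas in code the business logic can be a function that is independent of any particular dataset. The sketch below uses the Jevons elementary index (the geometric mean of price relatives) purely as an example of business logic; the survey does not say which formulas respondents use, and the function name and input shape are assumptions.

```python
import math


def jevons_index(base_prices, current_prices):
    """Jevons elementary index: geometric mean of price relatives.

    Both arguments map a product identifier to a positive price.
    Only products present in both periods (matched products) are used.
    """
    matched = base_prices.keys() & current_prices.keys()
    if not matched:
        raise ValueError("no matched products between the two periods")
    log_relatives = [
        math.log(current_prices[k] / base_prices[k]) for k in matched
    ]
    return math.exp(sum(log_relatives) / len(log_relatives))
```

Because the function takes its data as arguments, the same logic can be versioned, reviewed, and tested on small hand-made inputs, and then applied unchanged to production data.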

Results - Project Management Software Use

  • 28 respondents report using a shared Excel workbook for project management.

  • 20 report using no software for project management.

  • 14 report using some other software for project management, such as Jira, GitLab, or GitHub.

Results - Programming Language Use

  • 29 respondents reported using SQL in their CPI Production Systems.

  • 31 reported using at least one of Python, R, or SAS.

  • 14 reported using Python or R, but not SAS.

  • 8 reported using no programming language.

Results - Storage Use

Choice of Storage (Overall)
  • Only 3 respondents use analytics-optimized file formats such as Apache Parquet.

  • Approximately equal split between use of filesystem and database management system (DBMS) for data storage.

  • Only 3 respondents use an object storage solution such as Azure Data Lake Storage or AWS S3.

Results - System Age

System Age Distribution (Entire Sample)
  • Monolithic architectures are most likely to have systems aged Between 6-10 years or Between 11-20 years.

  • IT-Only teams are less likely than Stream-Aligned teams to have systems more than 20 years old.

  • By far the most common answer for Other Mix teams is a system age of Between 6-10 years.

Results - Update Frequency

Update Frequency Distribution (Overall)
  • 10 respondents never update the majority of systems.

  • Monolithic systems are more likely to never be updated than modular systems.

  • IT-Only teams are more likely to update systems frequently than other team types.

  • Other Mix teams are the most likely to never update or to update very infrequently.

  • All team types reported multiple Never updated answers.

  • Among national statistical offices (NSOs) that use ADS, 7 respondents update at least every six months and 16 update once per year or less.

Results - Number of Individuals

  • Large changes require more people than small changes.

Number of Individuals for Small Changes (Overall)

Number of Individuals for Large Changes (Overall)

Results - Lead Time

  • Lead time: the amount of time required to get an end-to-end change to a CPI Production System implemented.

  • 38 respondents reported lead times of Within 1 day or Within 1 week for small changes.

  • 11 reported lead times of Within 1 month for small changes.

  • Lead time responses for large changes range anywhere from Within 1 week to Can't be modified.

  • Takeaway: Modest changes with short lead times are less risky than substantial changes with long lead times.

Results - Alternative Data Source (ADS) Usage

  • Just under two thirds of respondents don’t use Alternative Data Sources (ADS) at all.

  • Of the respondents who do not use ADS, most would like Between 10% and 30% or Between 30% and 70% of their CPIs, by expenditure weight, to be composed of ADS.

Results - System Development Challenges (In General)

Practical Suggestions

  • Think explicitly about system boundaries.

  • Think explicitly about data interchange between systems (e.g., Data Contracts).

  • Embed software engineering technical expertise within business domain teams.

  • Version control all the things!

  • Aim for high cohesion and low coupling in system development.

  • Use analytics-optimized file formats like Apache Parquet.

  • Organize systems around one-way data flows and idempotent operations.

  • Update systems frequently.

  • Have a separate development environment for rapid iteration.
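
Two of the suggestions above, data contracts and idempotent operations, can be sketched together. This is a minimal illustration using only the Python standard library, not a recommended implementation. The contract contents (`product_id`, `period`, `price`), the function names, and the output file naming are all hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical data contract: column name -> parser every value must satisfy.
PRICE_CONTRACT = {"product_id": str, "period": str, "price": float}


def validate(rows, contract):
    """Return rows parsed according to the contract; raise ValueError on violation."""
    parsed = []
    for i, row in enumerate(rows):
        missing = set(contract) - set(row)
        if missing:
            raise ValueError(f"row {i}: missing columns {sorted(missing)}")
        try:
            parsed.append({col: cast(row[col]) for col, cast in contract.items()})
        except ValueError as exc:
            raise ValueError(f"row {i}: {exc}") from exc
    return parsed


def write_period_output(rows, period, out_dir):
    """Idempotent step: the output for a period is fully replaced on every run.

    Expects rows already passed through `validate`. Re-running with the same
    inputs leaves exactly the same single file, rather than appending to it.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"prices_{period}.csv"
    selected = sorted(
        (r for r in rows if r["period"] == period),
        key=lambda r: r["product_id"],
    )
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(PRICE_CONTRACT))
        writer.writeheader()
        writer.writerows(selected)
    return target
```

The design choice here is that the upstream system's output is checked against the contract before anything is written, and the write is keyed by period and opened in overwrite mode, so a failed or repeated run can simply be executed again without manual clean-up.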

Future Work

Three main follow-up questions arise from our survey.

  1. Which system structures and team organizations enable the best business outcomes? We have some hypotheses based on prior knowledge and the data in this survey, but more research is needed.

  2. How can software engineering knowledge be communicated efficiently to a business domain audience, and business domain knowledge to a software engineering audience? Effective system delivery requires imparting the relevant knowledge to both audiences.

  3. Do our conclusions about CPI production systems have external validity? We have some evidence that they do, but more research is needed.

Thanks for Listening!





References

Dehghani, Z. 2022. Data Mesh: Delivering Data-Driven Value at Scale. O’Reilly. https://books.google.ca/books?id=M5J5zgEACAAJ.
Forsgren, N., J. Humble, and G. Kim. 2018. Accelerate: The Science Behind DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution. https://books.google.ca/books?id=85XHAQAACAAJ.
NHS. 2017. “Reproducible Analytical Pipelines (RAP).” https://nhsdigital.github.io/rap-community-of-practice/.
Skelton, M., M. Pais, and R. Malan. 2019. Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution. https://books.google.ca/books?id=oFdRuAEACAAJ.

Appendix

Results - Which Steps are Systems Coupled Across?

Conceptual diagram of loose and tight coupling. Source: https://en.wikipedia.org/wiki/Coupling_(computer_programming)

Question: To what extent do distinct software modules depend on each other?

Results - Programming Language Use (Monolith vs. Modular)

Results - System Age (Monolith vs. Modular)

System Age Distribution (Monolith)

System Age Distribution (Modular)

Results - Update Frequency (Monolith vs. Modular)

Update Frequency Distribution (Monolith)

Update Frequency Distribution (Modular)

Results - Update Frequency (4 Team Types)

Update Frequency Distribution (Stream-Aligned)

Update Frequency Distribution (IT-Only)

Update Frequency Distribution (Analyst-Only)

Update Frequency Distribution (Other Mix)

Results - Lead Time (Monolith vs. Modular)

  • Monolithic systems are more likely than modular systems to report Within 1 year or More than 1 year for large changes.

  • Modular systems never reported Can't be modified or Too complex for the large changes lead time question, whereas monolithic systems reported one of these values 4 times.

Results - Lead Time (4 Team Types)

Lead Time for Large Changes (Stream-Aligned)

Lead Time for Large Changes (IT-Only)

Lead Time for Large Changes (Analyst-Only)

Lead Time for Large Changes (Other Mix)

Results - Methods Applied to ADS

Method | #
GEKS | 6
Time Dummy Hedonic | 4
Hedonic | 6
Other Multilateral | < 3
Dynamic Sample | 5
Fixed Sample | 16
Other | 7
  • Fixed Sample is the most commonly used method.

  • GEKS and Hedonic (6 each) are the next most common named methods.

Results - Challenges in Using ADS