Interim Report with Survey Analysis
Statistics Canada
Work was done by Workstream 4 (CPI Systems and Architecture), which is part of the UN Task Team on Scanner Data under the UN Committee of Experts on Big Data and Data Science for Official Statistics.
We would like to acknowledge the team members of Workstream 4 who provided insights, helped draft survey questions, and supported this empirical assessment.
Note: Please consider reading the full interim report that accompanies this presentation.
CPI Production Systems involve significant amounts of code, documentation, and other non-code artifacts (e.g., Excel Workbooks).
These systems carry out complex business logic in order to transform input data into output data.
These systems are often developed entirely or in large part by people with domain expertise but without training in software engineering.
Many CPI Production Systems teams struggle with managing system complexity.
State of CPI Production Systems around the world is unknown (e.g., how are systems organized, how often are systems updated).
Provide practical advice based on the current state of systems.
A lot of related work comes from the world of software engineering. Notable examples:
Team Topologies (Skelton, Pais, and Malan 2019) looks at how to optimally organize teams.
Accelerate: The State of DevOps Report (Forsgren, Humble, and Kim 2018) shares ideas around how to measure software delivery performance.
Data Mesh Architecture (Dehghani 2022) introduces architecture concepts oriented around domain-aligned data product teams.
Reproducible Analytical Pipelines (RAP) Community of Practice (NHS 2017) shares tools, principles, and techniques to create more robust analytical systems.
Goal is to characterize system layout, team organization, tool use, and system performance metrics.
How to articulate important aspects of system architecture in a short survey?
Need to introduce a simple conceptual model to communicate key system ideas.
We define a system as any indivisible (atomic) software component that takes one or more data inputs and produces one or more data outputs.
From left to right, we go from raw data to the production of the CPI.
Flow of change based loosely on the General Statistical Business Process Model (GSBPM).
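As a rough illustration of this conceptual model, a system can be sketched as a record of the stages it covers plus its data inputs and outputs. The stage names below are taken from the survey's five stage columns; the `System` class, the helper functions, and the example system names are hypothetical and are not part of the survey instrument.

```python
from __future__ import annotations

from dataclasses import dataclass

# Stage names assumed from the survey's stage columns, ordered left to right.
STAGES = (
    "ingestion",
    "processing",
    "elementary_indexes",
    "aggregation",
    "finalization",
)


@dataclass(frozen=True)
class System:
    """An atomic component: one or more data inputs, one or more data outputs."""

    name: str
    stages: tuple[str, ...]
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def covers_contiguous_stages(system: System) -> bool:
    """True if the system covers an unbroken left-to-right run of stages."""
    positions = sorted(STAGES.index(stage) for stage in system.stages)
    if not positions:
        return False
    return positions == list(range(positions[0], positions[0] + len(positions)))


def feeds(upstream: System, downstream: System) -> bool:
    """True if at least one output of `upstream` is an input of `downstream`."""
    return bool(set(upstream.outputs) & set(downstream.inputs))


# Hypothetical two-system layout: raw data on the left, the CPI on the right.
scanner_system = System(
    name="scanner_system",
    stages=("ingestion", "processing", "elementary_indexes"),
    inputs=("raw_scanner_data",),
    outputs=("elementary_indexes",),
)
aggregation_system = System(
    name="aggregation_system",
    stages=("aggregation", "finalization"),
    inputs=("elementary_indexes",),
    outputs=("cpi",),
)
```

Under this sketch, a monolithic layout would be a single `System` covering all five stages, while a modular layout would be several systems linked by `feeds`.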
A team is defined as a group of individuals who maintain one or more systems.
Team Type | Description |
---|---|
Corporate IT | IT professionals not in the price statistics team. |
Domain-Embedded IT | IT professionals in the price statistics team. |
Domain-Embedded Analysts | Non-IT professionals in the price statistics team. |
Non-Domain Analysts | Non-IT professionals not in the price statistics team. |
External Consultants or Contractors | Professionals outside of the organization to whom work is contracted. |
We define 4 team types for the purposes of this survey.
Team Type | Description |
---|---|
Stream Aligned | Domain-Embedded IT and/or Domain-Embedded Analysts. |
IT-Only | Corporate IT and/or Domain-Embedded IT. |
Analyst-Only | Domain-Embedded Analysts and/or Non-Domain Analysts. |
Other Mix | Something other than the above (e.g., domain analysts and corporate IT) |
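The mapping from team members to team type can be sketched as a small classification function. Note that the table's definitions overlap (a team of only Domain-Embedded IT fits both Stream Aligned and IT-Only), so the precedence used below, checking Stream Aligned first, is an assumption made for illustration and may not match how the survey responses were actually coded. The function name and constants are hypothetical.

```python
# Member categories, named as in the first table.
CORPORATE_IT = "Corporate IT"
DOMAIN_IT = "Domain-Embedded IT"
DOMAIN_ANALYSTS = "Domain-Embedded Analysts"
NON_DOMAIN_ANALYSTS = "Non-Domain Analysts"
CONSULTANTS = "External Consultants or Contractors"


def classify_team(members):
    """Map a set of member categories to one of the four team types.

    Assumption: rules are checked in table order, so overlapping cases
    (e.g. a Domain-Embedded IT only team) resolve to the first match.
    """
    members = set(members)
    if not members:
        raise ValueError("a team needs at least one member category")
    if members <= {DOMAIN_IT, DOMAIN_ANALYSTS}:
        return "Stream Aligned"
    if members <= {CORPORATE_IT, DOMAIN_IT}:
        return "IT-Only"
    if members <= {DOMAIN_ANALYSTS, NON_DOMAIN_ANALYSTS}:
        return "Analyst-Only"
    return "Other Mix"


# The table's own example of an Other Mix team.
print(classify_team({DOMAIN_ANALYSTS, CORPORATE_IT}))
```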
Hypothesis: Monolithic architectures don’t scale with the complexity introduced by Alternative Data Sources (ADS).
A system covering ingestion, processing, and elementary indexes might be associated with a piece of a monolithic system “breaking off”.
Data ingestion | Data processing | Elementary indexes | Aggregation | Finalization | # |
---|---|---|---|---|---|
✅ | ✅ | ❌ | ❌ | ❌ | 21 |
❌ | ✅ | ✅ | ❌ | ❌ | 6 |
❌ | ❌ | ✅ | ✅ | ❌ | 0 |
❌ | ❌ | ❌ | ✅ | ✅ | 6 |
✅ | ✅ | ✅ | ❌ | ❌ | 12 |
❌ | ✅ | ✅ | ✅ | ❌ | 8 |
❌ | ❌ | ✅ | ✅ | ✅ | 11 |
❌ | ✅ | ✅ | ✅ | ✅ | 4 |
✅ | ✅ | ✅ | ✅ | ✅ | 20 |
Corporate IT | Domain IT | Domain Analysts | Other Analysts | Consultants | # |
---|---|---|---|---|---|
❌ | ❌ | ✅ | ❌ | ❌ | 120 |
❌ | ✅ | ✅ | ❌ | ❌ | 25 |
✅ | ❌ | ❌ | ❌ | ❌ | 33 |
❌ | ✅ | ❌ | ❌ | ❌ | 20 |
✅ | ✅ | ❌ | ❌ | ❌ | 11 |
✅ | ❌ | ✅ | ❌ | ❌ | 45 |
40 respondents reported not using a VCS at all or using file-naming conventions only. E.g.,
analysis_v5_final_2025_03.py
analysis_v6_FINAL_2025_03_edits.py
21 respondents reported using some combination of GitHub/GitLab, or just using Git locally.
Recommendation: Use Git to version code, documentation, and configuration (but not with Excel Workbooks)!
Over 2/3 of the respondents listed that Microsoft Excel was used in their CPI Production Systems.
Spreadsheets are useful for tabular data analysis; however, they are not an ideal tool for expressing complex business logic.
Excel Workbooks encourage a high degree of coupling between business logic and data.
28 respondents report using a shared Excel workbook for project management.
20 report using no software for project management.
14 report using some other software for project management, such as Jira, GitLab, or GitHub.
29 respondents reported using SQL in their CPI Production Systems.
31 reported using Python or R or SAS.
14 reported using Python or R, but not SAS.
8 reported using no programming language.
Only 3 respondents use analytics-optimized file formats such as Apache Parquet.
Approximately equal split between use of filesystem and database management system (DBMS) for data storage.
Only 3 respondents use an object storage solution such as Azure Data Lake Storage or AWS S3.
Monolithic architectures most likely to have systems aged "Between 6-10 years" or "Between 11-20 years" old.
IT-Only teams are less likely than Stream Aligned teams to have systems "> 20 years" old.
By far most common answer for Other Mix teams is system age "Between 6-10 years" old.
10 respondents never update the majority of their systems.
Monolithic systems more likely to never be updated than modular systems.
IT-Only teams more likely to update systems frequently than other team types.
Other Mix teams most likely to never update or update very infrequently.
All team types reported multiple "Never updated" answers.
Among NSOs that use ADS, 7 respondents update at least every six months and 16 update once per year or less.
Lead time: the amount of time required to get an end-to-end change to a CPI Production System implemented.
38 respondents reported lead times of "Within 1 day" or "Within 1 week" for small changes.
11 reported lead times of "Within 1 month" for small changes.
Lead time responses for large changes range anywhere from "Within 1 week" to "Can't be modified".
Takeaway: Modest changes with short lead times are less risky than substantial changes with long lead times.
Just under two thirds of respondents don’t use Alternative Data Sources (ADS) at all.
Of those respondents who do not use ADS, most of them would like "Between 10% and 30%" or "Between 30% and 70%" of their CPIs to be comprised of ADS by expenditure weight.
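To make "share of the CPI comprised of ADS by expenditure weight" concrete, the calculation can be sketched as the weight of ADS-sourced components divided by total basket weight. The component names, weights, and ADS flags below are invented purely for illustration and do not come from the survey or from any real CPI basket.

```python
def ads_share(weights, uses_ads):
    """Fraction of total expenditure weight covered by ADS-sourced components.

    weights:  mapping of component name -> expenditure weight
    uses_ads: mapping of component name -> True if priced from ADS
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total expenditure weight must be positive")
    covered = sum(w for name, w in weights.items() if uses_ads.get(name, False))
    return covered / total


# Hypothetical basket (weights in percent) and hypothetical ADS coverage.
weights = {"food": 16, "shelter": 30, "transport": 17, "other": 37}
uses_ads = {"food": True, "shelter": False, "transport": True, "other": False}

share = ads_share(weights, uses_ads)
print(f"{share:.0%} of the basket by expenditure weight uses ADS")
```

In this made-up basket the share falls in the "Between 30% and 70%" response band.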
Think explicitly about system boundaries.
Think explicitly about data interchange between systems (e.g., Data Contracts).
Embed software engineering technical expertise within business domain teams.
Version control all the things!
Aim for high cohesion and low coupling in system development.
Use analytics-optimized file formats like Apache Parquet.
Organize systems around one-way data flows and idempotent operations.
Update systems frequently.
Have a separate development environment for rapid iteration.
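The recommendation on one-way data flows and idempotent operations can be sketched as a processing step that only reads its input, writes to an output path determined by the reference period, and produces identical output no matter how many times it is rerun. This is a minimal sketch using only the Python standard library; the function name, the column names (`period`, `product_id`, `price`), and the file layout are assumptions for illustration.

```python
import csv
import os
from pathlib import Path

# Assumed column layout of the raw input file.
FIELDS = ["period", "product_id", "price"]


def build_period_extract(raw_path: Path, out_dir: Path, period: str) -> Path:
    """Write the rows for one reference period to a period-keyed file.

    Idempotent: rerunning with the same inputs overwrites the same file
    with the same bytes. One-way: the raw input is read, never modified.
    """
    with raw_path.open(newline="") as src:
        rows = [row for row in csv.DictReader(src) if row["period"] == period]

    # Deterministic ordering so repeated runs give byte-identical output.
    rows.sort(key=lambda row: row["product_id"])

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"prices_{period}.csv"
    tmp_path = out_path.with_suffix(".tmp")

    with tmp_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

    # Replace the previous output in one step instead of appending to it.
    os.replace(tmp_path, out_path)
    return out_path
```

Because the output path depends only on the period, a failed or repeated run can simply be rerun without manual cleanup, which is part of what makes frequent, small updates less risky.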
Three main follow-up questions from our survey.
Which system structures and team organizations enable the best business outcomes? We have some hypotheses based on prior knowledge and the data in this survey, but more research is needed.
How to efficiently communicate software engineering knowledge to a business domain audience, and business domain knowledge to a software engineering audience? Effective system delivery requires imparting the relevant knowledge to both audiences.
Do our conclusions about CPI production systems have external validity? We have some evidence that they do, but more research is needed.
Please consider reading the full interim report that accompanies this presentation.
Question: To what extent do distinct software modules depend on each other?
Monolithic systems more likely to report "Within 1 year" or "More than 1 year" for large changes compared to modular systems.
Modular systems never reported "Can't be modified" or "Too complex" for the large changes lead time question, whereas Monolithic systems reported one of these values 4 times.
Method | # |
---|---|
GEKS | 6 |
Time Dummy Hedonic | 4 |
Hedonic | 6 |
Other Multilateral | < 3 |
Dynamic Sample | 5 |
Fixed Sample | 16 |
Other | 7 |
Fixed Sample is the most commonly used method.
GEKS and hedonic methods second most common.