Building reproducible pipelines: the stack that doesn't break in six months
Most analysis code is write-once. Here's how to build workflows your team can maintain, audit, re-run six months later, and actually trust when a regulator asks for them.
Micah Thornton, MS — Thornton Statistical Consulting
The problem with most analysis code
The analysis for a Phase II trial ships. The CSR goes out, the data package is archived, and everyone moves on. Six months later, a reviewer asks a question. Or the FDA requests a re-run with one variable recoded. Or a new statistician joins and needs to understand what was done and why.
In most shops, the answer is a folder of scripts with names like final_v3_REALFINAL_use-this-one.R, dependencies that were never recorded, package versions nobody remembers, and data transformations scattered across files that reference each other in undocumented order.
The good news: the tooling has matured. A small number of well-chosen tools, configured correctly from day one, eliminate most reproducibility problems. This article covers the stack I use and why each piece earns its place.
What reproducibility actually requires
A pipeline is reproducible if a different person, on a different machine, at a different time, can run it and get the same results. That deceptively simple definition has four concrete requirements:
1. Fixed inputs. The exact data that entered the analysis — versioned, timestamped, and read-only.
2. Recorded environment. Every package and its version, the R or Python version, and ideally the OS.
3. Deterministic execution. Run the pipeline twice, get the same output. No uncontrolled randomness, no side effects that depend on execution order.
4. Documented intent. Someone reading the code six months later can understand what each step does and why — without asking the original author.
Most teams get the first one by accident (the data doesn't change after database lock) and fail at the other three. The stack below addresses each directly.
The stack
These are the tools I use on every engagement that involves more than a single analysis script. Each solves a specific failure mode.
| Tool | Role | Why it earns its place |
|---|---|---|
| git | Version control | Every change is tracked, attributed, and reversible. Branches let you explore without touching the main analysis. |
| renv / uv | Environment management | Locks package versions to the project, not the machine. A collaborator gets your exact dependency tree. |
| targets | Workflow orchestration (R) | Declares the dependency graph explicitly. Only re-runs what changed. Caches results automatically. |
| DVC | Data versioning | Tracks large data files outside git. Ties each analysis run to the exact data version that produced it. |
| Quarto | Literate reporting | Analysis and narrative in one file. Re-running the document re-runs the analysis. No copy-paste from script to Word. |
| Docker | Environment portability | When renv isn't enough — captures system libraries, OS version, and R/Python version in a container. |
You don't need all six on every project. The minimum viable stack is git + renv (or uv) + targets. Add the rest as complexity grows.
Git: the non-negotiable foundation
If your analysis isn't in version control, it isn't reproducible in any meaningful sense. Git is the foundation everything else builds on.
The key practice is committing at meaningful checkpoints, not continuously. Each commit message should explain why the change was made, not just what changed. "Recode age variable per SAP amendment 2" is useful. "Update script" is noise.
```
# A useful git log for a clinical analysis project
git log --oneline
a3f1c9d Final TLFs for CSR section 14.3
b82e441 Address FDA query Q7: add sensitivity excluding site 04
c991f70 Correct baseline covariate specification per SAP amendment 3
d3a20b5 Add imputation model for missing LOCF endpoints
e7b1234 Locked analysis dataset v2.1 ingested
f42dc91 Initial project scaffold
```
Branching strategy for analysis projects is simpler than software development. A main branch that always runs cleanly. A dev branch for work in progress. Feature branches for substantial changes (protocol amendments, new endpoints) that get reviewed before merging. That's enough.
One firm rule: never commit data to git. Raw datasets, even de-identified ones, belong in a data management system with access controls. Git tracks the code that transforms them, not the data itself.
renv: locking the R environment
Package updates break analyses silently. A function changes behavior in a new version. A dependency gets dropped. An argument is deprecated. None of this is visible until you re-run the analysis and something looks different.
renv solves this by creating a project-local package library and recording exact versions in a lockfile. When a collaborator clones the project and runs renv::restore(), they get your exact package tree — not whatever happens to be installed on their machine.
```r
# Initialize renv in a new project
renv::init()

# After installing or updating packages, snapshot the state
renv::snapshot()

# On a new machine, restore the locked environment
renv::restore()

# Check for divergence between current state and lockfile
renv::status()
```
The renv.lock file is committed to git. It records the package name, version, and source (CRAN, Bioconductor, GitHub) for every dependency. Anyone with git access can restore the environment in minutes.
targets: making the pipeline explicit
The hardest reproducibility problem isn't packages — it's execution order. In a script-based workflow, you have to remember which scripts to run, in what order, and which ones depend on which. Change a data cleaning step and you have to manually re-run everything downstream.
targets replaces that implicit knowledge with an explicit dependency graph. You define each step as a target with declared inputs and outputs. targets figures out the graph, runs steps in the right order, and skips anything that hasn't changed since the last run.
```r
# _targets.R — the pipeline definition
library(targets)

# Packages the targets need, plus the functions defined in R/
tar_option_set(packages = c("haven", "quarto"))
tar_source()

list(
  # Ingest locked dataset
  tar_target(raw_data, read_sas("data/locked/adsl_v2.1.sas7bdat")),
  # Apply SAP-specified exclusions
  tar_target(analysis_pop, filter_itt(raw_data)),
  # Primary endpoint model
  tar_target(primary_model, fit_mmrm(analysis_pop, endpoint = "cfb_week12")),
  # Generate Table 14.3.1
  tar_target(table_14_3_1, make_tlf_primary(primary_model)),
  # Render the CSR statistical sections. The report reads the model
  # with tar_read(), so referencing primary_model here declares the
  # dependency and triggers a re-render whenever the model changes.
  tar_target(
    csr_report,
    {
      primary_model
      quarto_render("report/csr-stats.qmd")
      "report/csr-stats.pdf"
    },
    format = "file"
  )
)
```

Now change the exclusion criteria in filter_itt() and run tar_make(). The pipeline re-runs everything from that point forward and skips what it doesn't need to recompute. The primary model, the table, and the report all update automatically.
You can visualize the current state of the pipeline with tar_visnetwork() — green nodes are up to date, orange nodes are outdated, and grey nodes have never been run. During active analysis work, that visualization is the first thing I look at in the morning.
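For scripted checks, tar_outdated() reports the same information as a character vector. A minimal status check, assuming a built _targets/ store in the working directory:

```r
library(targets)

# Names of targets whose code or upstream data changed since last run
tar_outdated()

# Dependency graph, coloured by up-to-date status
tar_visnetwork(targets_only = TRUE)
```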
Data versioning with DVC
git tracks code. DVC (Data Version Control) tracks data. The pattern is the same: each dataset gets a content hash, changes are tracked, and you can check out any historical version of the data alongside the code that processed it.
The immediate use case is database amendments. When the locked dataset is updated — a site audit, a query resolution, a data management correction — DVC captures the new version, timestamps it, and keeps the old one accessible. Every analysis run is tied to the exact data version that produced it.
```
# Track a locked SAS dataset
dvc add data/locked/adsl_v2.1.sas7bdat

# This creates adsl_v2.1.sas7bdat.dvc — commit that to git
git add data/locked/adsl_v2.1.sas7bdat.dvc
git commit -m "Lock analysis dataset v2.1"

# When v2.2 arrives after site 04 audit
dvc add data/locked/adsl_v2.2.sas7bdat
git add data/locked/adsl_v2.2.sas7bdat.dvc
git commit -m "Update to analysis dataset v2.2 post site-04 audit"

# Reproduce results from v2.1 exactly
git checkout <v2.1 commit>
dvc checkout
```
For smaller projects or those without large binary datasets, DVC is optional — storing a SHA-256 hash of each input file in a plain text manifest file achieves much of the same traceability at zero infrastructure cost. But for anything with multi-gigabyte datasets or frequent data amendments, DVC earns its place.
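Here is a minimal sketch of that manifest approach, assuming the digest package is available; write_manifest is an illustrative helper, not a standard function:

```r
library(digest)

# Record a SHA-256 hash per input file; a later run can recompute the
# hashes and confirm the data is byte-identical to the locked version
write_manifest <- function(files, manifest = "data/MANIFEST.txt") {
  hashes <- vapply(files, digest, character(1),
                   algo = "sha256", file = TRUE)
  writeLines(paste(hashes, files), manifest)
}

write_manifest("data/locked/adsl_v2.1.sas7bdat")
```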
Quarto: analysis and reporting in one place
Separating analysis from reporting creates a synchronization problem. You update a table in R, then paste the numbers into Word, then someone requests a change, and you're not sure if the Word document reflects the latest analysis or the version before last.
Quarto eliminates that problem by putting the analysis and the narrative in the same file. Code chunks run inline; their output appears in the rendered document. Re-render the document, get updated tables and figures automatically. The rendered output — PDF, HTML, Word — is always a direct function of the code.
````qmd
---
title: "Primary Efficacy Analysis"
format: pdf
params:
  dataset: "data/locked/adsl_v2.1.sas7bdat"
  cutoff_date: "2026-03-15"
---

## Primary Endpoint

```{r}
#| label: primary-model
#| echo: false
model <- tar_read(primary_model)
tbl <- make_tlf_primary(model)
gt::gtsave(tbl, "output/table_14_3_1.rtf")
tbl
```

The primary analysis used a mixed model for repeated measures (MMRM)
with treatment, visit, treatment × visit interaction, baseline, and
site as covariates. The least-squares mean difference at Week 12 was
`r fmt_est(model)` (95% CI `r fmt_ci(model)`; p `r fmt_p(model)`).
````

Project structure that scales
Good tooling in a chaotic folder structure still produces confusion. A consistent project layout means anyone joining the project can find what they need without asking.
```
study-xyz/
├── _targets.R            # pipeline definition
├── renv.lock             # locked R environment
├── .gitignore            # excludes data/, output/raw/
│
├── data/
│   ├── locked/           # DVC-tracked, read-only after lock
│   │   └── adsl_v2.1.sas7bdat.dvc
│   └── derived/          # targets-managed intermediate datasets
│
├── R/
│   ├── data-prep.R       # cleaning and derivations
│   ├── models.R          # analysis functions
│   ├── tables.R          # TLF generation functions
│   └── utils.R           # shared helpers
│
├── report/
│   ├── csr-stats.qmd     # main statistical report
│   └── appendices/       # supplementary analyses
│
├── output/
│   ├── tables/           # RTF/PDF TLFs
│   ├── figures/          # publication-quality plots
│   └── models/           # serialized model objects
│
└── docs/
    ├── SAP.pdf           # statistical analysis plan
    └── data-dictionary.xlsx
```

A few principles embedded in this structure: data/locked/ is read-only after database lock — nothing in the pipeline writes to it. R/ contains only function definitions, never analysis scripts that run top-to-bottom. All execution flows through _targets.R.
The SAP lives in docs/ with the code. When a reviewer asks "where is this analysis specified?" the answer is one folder up from the script that implements it.
Handling randomness reproducibly
Any analysis that uses random number generation — bootstrap confidence intervals, MCMC, multiple imputation, permutation tests — will give different results each run unless the seed is fixed. This is one of the most common silent reproducibility failures.
```r
# In R: set the seed at the top of every function that uses randomness
fit_imputation_model <- function(data, m = 20, seed = 2024L) {
  set.seed(seed)
  mice::mice(data, m = m, method = "pmm", printFlag = FALSE)
}

# In targets: every target automatically gets its own deterministic
# seed, derived from the target's name and the pipeline-level seed,
# so re-running the pipeline reproduces the same draws
tar_option_set(seed = 0L)
tar_target(imputed_data, fit_imputation_model(analysis_pop, m = 20))
```

One additional discipline: if the analysis uses parallelism (parallel bootstraps, parallel chains in MCMC), document the parallelism strategy and confirm that switching from serial to parallel execution doesn't change results. It shouldn't, but verify it.
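One way to run that verification, sketched with the future.apply package (boot_means is an illustrative helper): with future.seed = TRUE, each iteration draws from its own pre-generated L'Ecuyer-CMRG stream, so the result should not depend on the backend.

```r
library(future.apply)

# Bootstrap the mean with a fixed, parallel-safe RNG seed
boot_means <- function(x, n_boot = 1000) {
  future_sapply(seq_len(n_boot),
                function(i) mean(sample(x, replace = TRUE)),
                future.seed = 1234L)
}

set.seed(1)
x <- rnorm(100)

plan(sequential);   serial_res   <- boot_means(x)
plan(multisession); parallel_res <- boot_means(x)
identical(serial_res, parallel_res)  # TRUE: backend doesn't change results
```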
Common failure modes
Failure 1: The analysis works on your machine.
"Works on my machine" means you haven't tested reproducibility — you've tested that you can run it today. The standard is: clone the repo on a clean machine with no pre-installed packages, run renv::restore() and tar_make(), and get the same output. If you haven't done that, you don't know.
Failure 2: The analysis file is the documentation.
Comments that describe what the code does are less useful than comments that describe why. "Exclude patients with TRTFL = 'N'" is obvious from the code. "Per SAP section 4.2: ITT excludes patients who never received study drug (TRTFL)" is the information someone needs six months later.
Failure 3: Hard-coded paths.
/Users/micah/Desktop/study-xyz/data/adsl.sas7bdat breaks the moment anyone else runs it. Use relative paths from the project root, managed by here::here() in R or pathlib.Path in Python. The project root is always deterministic; a home directory never is.
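A minimal sketch with here::here(), which resolves paths against the project root (located via markers like a .git directory or an .Rproj file) rather than the current working directory:

```r
library(here)

# Same code runs on any machine, wherever the project is cloned
adsl <- haven::read_sas(here("data", "locked", "adsl_v2.1.sas7bdat"))
```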
Failure 4: Interactive analysis that never got scripted.
You found the answer in the console and never wrote it down. The analysis exists only in your session history, which is gone when you close R. Any analysis step that matters must be in a file. Anything that's only in your console didn't happen.
Failure 5: One massive script.
A 2,000-line analysis script that runs top-to-bottom is fragile. Every run re-runs everything. It's impossible to test components in isolation. It's nearly impossible to review. Break it into functions, put functions in R/, and let targets call them.
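The extraction pattern looks like this; fit_primary_model and the model formula are placeholders, since the point is the shape: a pure function in R/, called by a target.

```r
# R/models.R — one self-contained, testable function per analysis step
fit_primary_model <- function(data, endpoint) {
  stopifnot(endpoint %in% names(data))
  lm(reformulate(c("trt", "baseline", "site"), response = endpoint),
     data = data)
}

# _targets.R then wires it in:
# tar_target(primary_model, fit_primary_model(analysis_pop, "cfb_week12"))
```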
When to add Docker
renv locks R package versions and records the R version in the lockfile, but it cannot reproduce the R installation itself, system libraries (like libgsl for certain Bayesian packages), or OS-level differences that affect numerical output.
For most clinical analysis projects, renv + git is sufficient. Docker becomes worth the overhead when:
- the analysis uses packages with non-trivial system dependencies,
- the analysis needs to run in a validated computing environment (GxP),
- you need to guarantee identical results across Windows/Mac/Linux, or
- you're building an automated pipeline that runs in CI/CD.
```dockerfile
# Dockerfile for a clinical analysis environment
FROM rocker/r-ver:4.4.1

# Install system dependencies
RUN apt-get update && apt-get install -y \
    libgsl-dev libssl-dev libcurl4-openssl-dev

# Restore the locked package library first, so this layer is cached
# and rebuilds are fast when only analysis code changes
WORKDIR /analysis
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# Copy the rest of the project
COPY . .

# Run the full pipeline on container start
CMD ["Rscript", "-e", "targets::tar_make()"]
```

Practical checklist
Before handing off any analysis project, run through this (a sketch automating a few of these checks follows the list):
1. Clone the repo on a clean machine and run it end-to-end. If it fails, fix it before delivery.
2. Every randomized step has a pre-specified, documented seed. Verify the output matches with that seed on two different machines.
3. No hard-coded paths. Search the codebase for home directory references and replace with here::here() or equivalent.
4. The SAP amendment version matches the code. If an amendment changed the analysis, the git log should show a commit for it.
5. renv.lock is committed and current. Run renv::status() — it should show no discrepancies.
6. All intermediate outputs are reproducible from targets. Delete the _targets/ cache and re-run. Same results.
7. Every number in the report was programmatically generated. No manually typed statistics in the Quarto document or Word output.
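A sketch automating items 3, 5, and 6 in reduced form; the script name and the path patterns are illustrative, not a standard tool:

```r
# scripts/check-handoff.R

# Item 5: lockfile in sync with the installed library
renv::status()

# Item 6 (light version): nothing in the pipeline is stale
stopifnot(length(targets::tar_outdated()) == 0)

# Item 3: flag absolute home-directory paths anywhere in R/
code <- unlist(lapply(list.files("R", full.names = TRUE), readLines))
hits <- grep("(/Users/|/home/|C:\\\\Users)", code, value = TRUE)
if (length(hits) > 0) {
  stop("Hard-coded paths found:\n", paste(hits, collapse = "\n"))
}
```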
Bottom line
Reproducible pipelines aren't about perfectionism — they're about not having to rebuild your analysis from memory when a reviewer asks a question in month seven.
The minimum viable stack is git + renv + targets. Those three tools handle version control, environment management, and execution order. They cost maybe half a day to set up correctly on a new project and save days of reconstruction work later.
If your team is working in Python instead of R, the same principles apply: git + uv + Snakemake (or Prefect, or DVC pipelines) gets you to the same place. The tools are different; the underlying requirements are identical.
Need help setting up a reproducible analysis environment?
I can help you set up infrastructure for clinical trials, observational studies, and regulatory submissions — from scratch or on an existing project.