Software · May 2026 · 14 min read

Building reproducible pipelines: the stack that doesn't break in six months

Most analysis code is write-once. Here's how to build workflows your team can maintain, audit, re-run six months later, and actually trust when a regulator asks for them.

Micah Thornton, MS — Thornton Statistical Consulting


The problem with most analysis code

The analysis for a Phase II trial ships. The CSR goes out, the data package is archived, and everyone moves on. Six months later, a reviewer asks a question. Or the FDA requests a re-run with one variable recoded. Or a new statistician joins and needs to understand what was done and why.

In most shops, the answer is a folder of scripts with names like final_v3_REALFINAL_use-this-one.R, dependencies that were never recorded, package versions nobody remembers, and data transformations scattered across files that reference each other in undocumented order.

Reproducibility isn't just about academic replication. It's about being able to answer the next question without rebuilding the analysis from memory. In a regulatory context, it's about demonstrating that your results weren't accidental.

The good news: the tooling has matured. A small number of well-chosen tools, configured correctly from day one, eliminate most reproducibility problems. This article covers the stack I use and why each piece earns its place.

What reproducibility actually requires

A pipeline is reproducible if a different person, on a different machine, at a different time, can run it and get the same results. That deceptively simple definition has four concrete requirements:

  1. Fixed inputs. The exact data that entered the analysis — versioned, timestamped, and read-only.
  2. Recorded environment. Every package and its version, the R or Python version, and ideally the OS.
  3. Deterministic execution. Run the pipeline twice, get the same output. No uncontrolled randomness, no side effects that depend on execution order.
  4. Documented intent. Someone reading the code six months later can understand what each step does and why — without asking the original author.

Most teams get the first one by accident (the data doesn't change after database lock) and fail at the other three. The stack below addresses each directly.

The stack

These are the tools I use on every engagement that involves more than a single analysis script. Each solves a specific failure mode.

Tool · Role · Why it earns its place

git · Version control · Every change is tracked, attributed, and reversible. Branches let you explore without touching the main analysis.
renv / uv · Environment management · Locks package versions to the project, not the machine. A collaborator gets your exact dependency tree.
targets · Workflow orchestration (R) · Declares the dependency graph explicitly. Only re-runs what changed. Caches results automatically.
DVC · Data versioning · Tracks large data files outside git. Ties each analysis run to the exact data version that produced it.
Quarto · Literate reporting · Analysis and narrative in one file. Re-running the document re-runs the analysis. No copy-paste from script to Word.
Docker · Environment portability · When renv isn't enough — captures system libraries, OS version, and R/Python version in a container.

You don't need all six on every project. The minimum viable stack is git + renv (or uv) + targets. Add the rest as complexity grows.

Git: the non-negotiable foundation

If your analysis isn't in version control, it isn't reproducible in any meaningful sense. Git is the foundation everything else builds on.

The key practice is committing at meaningful checkpoints, not continuously. Each commit message should explain why the change was made, not just what changed. "Recode age variable per SAP amendment 2" is useful. "Update script" is noise.

# A useful git log for a clinical analysis project
git log --oneline

a3f1c9d  Final TLFs for CSR section 14.3
b82e441  Address FDA query Q7: add sensitivity excluding site 04
c991f70  Correct baseline covariate specification per SAP amendment 3
d3a20b5  Add imputation model for missing LOCF endpoints
e7b1234  Locked analysis dataset v2.1 ingested
f42dc91  Initial project scaffold

That log is an audit trail. It answers "what changed after the amendment" without reading every line of code. It also means that if the FDA query introduces an error, you can revert to the pre-query state in one command.

Branching strategy for analysis projects is simpler than software development. A main branch that always runs cleanly. A dev branch for work in progress. Feature branches for substantial changes (protocol amendments, new endpoints) that get reviewed before merging. That's enough.
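That workflow is a handful of git commands. A sketch, assuming main and dev already exist — the branch name here is illustrative:

```shell
# Day-to-day work happens on dev
git checkout dev

# A substantial change (e.g. a SAP amendment) gets its own branch
git checkout -b feature/sap-amendment-3
# ...edit, commit, get the change reviewed...

# Merge back into dev after review, keeping an explicit merge commit
git checkout dev
git merge --no-ff feature/sap-amendment-3

# Promote to main once the pipeline runs cleanly end-to-end
git checkout main
git merge --no-ff dev
```

The --no-ff flag forces a merge commit even when a fast-forward is possible, so the audit trail shows exactly when each reviewed change landed.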

One firm rule: never commit data to git. Raw datasets, even de-identified ones, belong in a data management system with access controls. Git tracks the code that transforms them, not the data itself.
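One way to enforce that rule mechanically is the project .gitignore. A sketch (the re-inclusion dance for data/locked/ keeps small pointer files like DVC's .dvc stubs in git while excluding the datasets themselves):

```
# .gitignore sketch: datasets stay out of git, small pointer files stay in
data/*
!data/locked/
data/locked/*
!data/locked/*.dvc
output/raw/
_targets/
.Rhistory
.RData
```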

renv: locking the R environment

Package updates break analyses silently. A function changes behavior in a new version. A dependency gets dropped. An argument is deprecated. None of this is visible until you re-run the analysis and something looks different.

renv solves this by creating a project-local package library and recording exact versions in a lockfile. When a collaborator clones the project and runs renv::restore(), they get your exact package tree — not whatever happens to be installed on their machine.

# Initialize renv in a new project
renv::init()

# After installing or updating packages, snapshot the state
renv::snapshot()

# On a new machine, restore the locked environment
renv::restore()

# Check for divergence between current state and lockfile
renv::status()

The renv.lock file is committed to git. It records the package name, version, and source (CRAN, Bioconductor, GitHub) for every dependency. Anyone with git access can restore the environment in minutes.

For Python projects, uv does the same job with better performance than pip or conda. The uv.lock file is the equivalent artifact. Both approaches give you a fully pinned, reproducible environment without Docker overhead.

targets: making the pipeline explicit

The hardest reproducibility problem isn't packages — it's execution order. In a script-based workflow, you have to remember which scripts to run, in what order, and which ones depend on which. Change a data cleaning step and you have to manually re-run everything downstream.

targets replaces that implicit knowledge with an explicit dependency graph. You define each step as a target with declared inputs and outputs. targets figures out the graph, runs steps in the right order, and skips anything that hasn't changed since the last run.

# _targets.R — the pipeline definition
library(targets)

tar_option_set(packages = c("haven", "quarto"))  # packages every target can use
tar_source()  # load the project's function definitions from R/

list(
  # Ingest locked dataset
  tar_target(raw_data, read_sas("data/locked/adsl_v2.1.sas7bdat")),

  # Apply SAP-specified exclusions
  tar_target(analysis_pop, filter_itt(raw_data)),

  # Primary endpoint model
  tar_target(primary_model, fit_mmrm(analysis_pop, endpoint = "cfb_week12")),

  # Generate Table 14.3.1
  tar_target(table_14_3_1, make_tlf_primary(primary_model)),

  # Render the CSR statistical sections; tarchetypes::tar_quarto() scans the
  # .qmd for tar_read()/tar_load() calls and re-renders when those targets change
  tarchetypes::tar_quarto(csr_report, path = "report/csr-stats.qmd")
)

Now change the exclusion criteria in filter_itt() and run tar_make(). The pipeline re-runs everything from that point forward and skips what it doesn't need to recompute. The primary model, the table, and the report all update automatically.

This is the single biggest workflow improvement I've seen in practice. The discipline of declaring dependencies forces clarity about what each step does and what it depends on. It also makes the pipeline self-documenting — the graph is the specification.

You can visualize the current state of the pipeline with tar_visnetwork(), which color-codes each target as up to date, outdated, or never run. During active analysis work, that visualization is the first thing I look at in the morning.

Data versioning with DVC

git tracks code. DVC (Data Version Control) tracks data. The pattern is the same: each dataset gets a content hash, changes are tracked, and you can check out any historical version of the data alongside the code that processed it.

The immediate use case is database amendments. When the locked dataset is updated — a site audit, a query resolution, a data management correction — DVC captures the new version, timestamps it, and keeps the old one accessible. Every analysis run is tied to the exact data version that produced it.

# Track a locked SAS dataset
dvc add data/locked/adsl_v2.1.sas7bdat

# This creates adsl_v2.1.sas7bdat.dvc — commit that to git
git add data/locked/adsl_v2.1.sas7bdat.dvc
git commit -m "Lock analysis dataset v2.1"

# When v2.2 arrives after site 04 audit
dvc add data/locked/adsl_v2.2.sas7bdat
git commit -m "Update to analysis dataset v2.2 post site-04 audit"

# Reproduce results from v2.1 exactly
git checkout <v2.1 commit>
dvc checkout

For smaller projects or those without large binary datasets, DVC is optional — storing a SHA-256 hash of each input file in a plain text manifest file achieves much of the same traceability at zero infrastructure cost. But for anything with multi-gigabyte datasets or frequent data amendments, DVC earns its place.
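The hash-manifest alternative is two commands with standard tooling. A sketch using the project layout from this article (paths are illustrative):

```shell
# Record a SHA-256 manifest of the locked inputs (run once at database lock)
sha256sum data/locked/*.sas7bdat > data/locked/MANIFEST.sha256
git add data/locked/MANIFEST.sha256

# Any time afterwards: verify the inputs are byte-identical to what was locked
sha256sum --check data/locked/MANIFEST.sha256
```

If a file has been altered since lock, the check prints a FAILED line and exits nonzero — which also makes it easy to run as a pre-flight step in the pipeline.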

Quarto: analysis and reporting in one place

Separating analysis from reporting creates a synchronization problem. You update a table in R, then paste the numbers into Word, then someone requests a change, and you're not sure if the Word document reflects the latest analysis or the version before last.

Quarto eliminates that problem by putting the analysis and the narrative in the same file. Code chunks run inline; their output appears in the rendered document. Re-render the document, get updated tables and figures automatically. The rendered output — PDF, HTML, Word — is always a direct function of the code.

---
title: "Primary Efficacy Analysis"
format: pdf
params:
  dataset: "data/locked/adsl_v2.1.sas7bdat"
  cutoff_date: "2026-03-15"
---

## Primary Endpoint

```{r}
#| label: primary-model
#| echo: false
model <- tar_read(primary_model)
tbl <- make_tlf_primary(model)
gt::gtsave(tbl, "output/table_14_3_1.rtf")
tbl
```

The primary analysis used a mixed model for repeated measures (MMRM)
with treatment, visit, treatment × visit interaction, baseline, and
site as covariates. The least-squares mean difference at Week 12 was
`r fmt_est(model)` (95% CI `r fmt_ci(model)`; p `r fmt_p(model)`).

The critical habit: every number in a clinical report that came from an analysis should be programmatically generated, not typed. If you find yourself typing a p-value into a Word document, you've introduced a transcription error risk that will eventually cause a problem.

Project structure that scales

Good tooling in a chaotic folder structure still produces confusion. A consistent project layout means anyone joining the project can find what they need without asking.

study-xyz/
├── _targets.R          # pipeline definition
├── renv.lock           # locked R environment
├── .gitignore          # excludes data/, output/raw/
│
├── data/
│   ├── locked/         # DVC-tracked, read-only after lock
│   │   └── adsl_v2.1.sas7bdat.dvc
│   └── derived/        # targets-managed intermediate datasets
│
├── R/
│   ├── data-prep.R     # cleaning and derivations
│   ├── models.R        # analysis functions
│   ├── tables.R        # TLF generation functions
│   └── utils.R         # shared helpers
│
├── report/
│   ├── csr-stats.qmd   # main statistical report
│   └── appendices/     # supplementary analyses
│
├── output/
│   ├── tables/         # RTF/PDF TLFs
│   ├── figures/        # publication-quality plots
│   └── models/         # serialized model objects
│
└── docs/
    ├── SAP.pdf         # statistical analysis plan
    └── data-dictionary.xlsx

A few principles embedded in this structure: data/locked/ is read-only after database lock — nothing in the pipeline writes to it. R/ contains only function definitions, never analysis scripts that run top-to-bottom. All execution flows through _targets.R.

The SAP lives in docs/ with the code. When a reviewer asks "where is this analysis specified?" the answer is one folder up from the script that implements it.

Handling randomness reproducibly

Any analysis that uses random number generation — bootstrap confidence intervals, MCMC, multiple imputation, permutation tests — will give different results each run unless the seed is fixed. This is one of the most common silent reproducibility failures.

# In R: set seed at the top of every function that uses randomness
fit_imputation_model <- function(data, m = 20, seed = 2024L) {
  set.seed(seed)
  mice::mice(data, m = m, method = "pmm", printFlag = FALSE)
}

# In targets: every target automatically gets its own reproducible seed,
# derived from the target's name and the global seed in tar_option_set(),
# so re-running the pipeline is deterministic by default
tar_option_set(seed = 20240101L)
tar_target(
  imputed_data,
  fit_imputation_model(analysis_pop, m = 20)
)

Pre-specify seeds in the SAP. "All randomized procedures were seeded with 20240101" is a defensible statement. It means the analysis can be independently replicated and any discrepancy is a bug to find, not randomness to accept.

One additional discipline: if the analysis uses parallelism (parallel bootstraps, parallel chains in MCMC), document the parallelism strategy and confirm that switching from serial to parallel execution doesn't change results. It shouldn't, but verify it.

Common failure modes

Failure 1: The analysis works on your machine.

"Works on my machine" means you haven't tested reproducibility — you've tested that you can run it today. The standard is: clone the repo on a clean machine with no pre-installed packages, run renv::restore() and tar_make(), and get the same output. If you haven't done that, you don't know.

Failure 2: The analysis file is the documentation.

Comments that describe what the code does are less useful than comments that describe why. "Exclude patients with TRTFL = 'N'" is obvious from the code. "Per SAP section 4.2: ITT excludes patients who never received study drug (TRTFL)" is the information someone needs six months later.

Failure 3: Hard-coded paths.

/Users/micah/Desktop/study-xyz/data/adsl.sas7bdat breaks the moment anyone else runs it. Use relative paths from the project root, managed by here::here() in R or pathlib.Path in Python. The project root is always deterministic; a home directory is never.

Failure 4: Interactive analysis that never got scripted.

You found the answer in the console and never wrote it down. The analysis exists only in your session history, which is gone when you close R. Any analysis step that matters must be in a file. Anything that's only in your console didn't happen.

Failure 5: One massive script.

A 2,000-line analysis script that runs top-to-bottom is fragile. Every run re-runs everything. It's impossible to test components in isolation. It's nearly impossible to review. Break it into functions, put functions in R/, and let targets call them.

When to add Docker

renv handles R package versions. It does not handle the R version itself, system libraries (like libgsl for certain Bayesian packages), or OS-level differences that affect numerical output.

For most clinical analysis projects, renv + git is sufficient. Docker becomes worth the overhead when: the analysis uses packages with non-trivial system dependencies, the analysis needs to run in a validated computing environment (GxP), you need to guarantee identical results across Windows/Mac/Linux, or you're building an automated pipeline that runs in CI/CD.

# Dockerfile for a clinical analysis environment
FROM rocker/r-ver:4.4.1

# Install system dependencies
RUN apt-get update && apt-get install -y \
    libgsl-dev libssl-dev libcurl4-openssl-dev

# Restore the locked package library inside the project directory
WORKDIR /analysis
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# Copy the rest of the project after restoring, so Docker's layer cache
# only re-installs packages when renv.lock itself changes
COPY . .

# Entry point: run the full pipeline
CMD ["Rscript", "-e", "targets::tar_make()"]

If you use Docker, pin the base image tag to a specific version — rocker/r-ver:4.4.1, not rocker/r-ver:latest. "Latest" will change. Your analysis shouldn't.

Practical checklist

Before handing off any analysis project, run through this:

  1. Clone the repo on a clean machine and run it end-to-end. If it fails, fix it before delivery.
  2. Every randomized step has a pre-specified, documented seed. Verify the output matches with that seed on two different machines.
  3. No hard-coded paths. Search the codebase for home directory references and replace with here::here() or equivalent.
  4. The SAP amendment version matches the code. If an amendment changed the analysis, the git log should show a commit for it.
  5. renv.lock is committed and current. Run renv::status() — it should show no discrepancies.
  6. All intermediate outputs are reproducible from targets. Delete the _targets/ cache and re-run. Same results.
  7. Every number in the report was programmatically generated. No manually typed statistics in the Quarto document or Word output.

Bottom line

Reproducible pipelines aren't about perfectionism — they're about not having to rebuild your analysis from memory when a reviewer asks a question in month seven.

The minimum viable stack is git + renv + targets. Those three tools handle version control, environment management, and execution order. They cost maybe half a day to set up correctly on a new project and save days of reconstruction work later.

Every hour you spend on project structure at the beginning is worth four hours you don't spend answering questions about what you did and why. The infrastructure is not overhead — it's the work.

If your team is working in Python instead of R, the same principles apply: git + uv + snakemake (or Prefect, or DVC pipelines) gets you to the same place. The tools are different; the underlying requirements are identical.
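As a sketch of what that looks like on the Python side, here is a minimal Snakefile mirroring the targets pipeline above. Script names and file paths are illustrative, not a prescribed layout:

```
# Snakefile — declares the same kind of dependency graph as _targets.R
rule all:
    input: "output/tables/table_14_3_1.rtf"

rule analysis_pop:
    input: "data/locked/adsl_v2.1.sas7bdat"
    output: "data/derived/analysis_pop.parquet"
    script: "scripts/filter_itt.py"

rule primary_table:
    input: "data/derived/analysis_pop.parquet"
    output: "output/tables/table_14_3_1.rtf"
    script: "scripts/primary_model.py"
```

As with targets, changing an upstream script or input invalidates only the rules downstream of it; snakemake re-runs those and leaves the rest cached.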


Need help setting up a reproducible analysis environment?

I can help you set up infrastructure for clinical trials, observational studies, and regulatory submissions — from scratch or on an existing project.