Key Takeaways

  • Containers solve one problem. They freeze the software environment, but they do not track what changed in your data or how your analysis evolved.
  • Data versioning adds a missing layer. Tools like DVC and DataLad bring Git-style version control to datasets, making it easier to reproduce results with exact inputs, parameter files, and selected outputs.
  • Provenance is the audit trail. It records every transformation that happened to your data, from raw files to final figures.
  • Containers, data versioning, and provenance tracking are complementary. Together, they cover code, environment, data, and lineage.

What To Know First

Containers like Docker and Singularity have become a near-mandatory part of reproducible research. They solve a real problem: if you send someone a Docker image, that person should be able to run the code and get the same results on another machine.

This is why journals and reviewers increasingly ask for container images alongside papers. But containers leave two critical gaps open.

First, containers do not track changes to your data. A dataset evolves over time as you clean, filter, and reprocess it. When you reproduce a result six months later, you need to know which version of the data produced that figure. Containers do not record that by default.

Second, containers do not capture lineage. If an analysis script is wrong, a parameter was accidentally modified, or a preprocessing step was applied inconsistently, there may be no automated record of what happened. You are left relying on memory, manual notes, or luck.

This article explains how modern scientific workflows address those gaps through data versioning and provenance tracking. Data versioning tracks dataset changes over time. Provenance tracking records every transformation applied to the data.

Containers Are Not Enough

Before discussing data versioning, it helps to understand what containers do and do not do.

A container captures:

  • The operating system, usually a Linux distribution.
  • Installed packages and their versions.
  • Your code and configuration files.
  • The command used to execute the analysis.

A container does not automatically capture:

  • The exact files used as inputs, unless they are baked into the image.
  • Runtime parameters, unless they are explicitly logged.
  • Intermediate outputs generated during a pipeline.
  • The order and logic of transformations applied to the data.

The practical consequence is simple. You can run a container on another machine and get the same output only if the inputs and runtime context are also controlled. If you want to reproduce a specific analysis from months ago, especially one with several steps and a changing dataset, you need more than a container.

Consider a common scenario. You calibrate a phase-field simulation against experimental microstructure images. The images are cleaned during preprocessing, then split into calibration and validation sets. Six months later, someone asks you to reproduce the result.

You open the container, run the command, and get the wrong result. The reason may be that the dataset was reorganized, new samples were added, or the entry point used whatever files happened to be in a directory. Without versioned inputs and an audit trail, the container alone cannot prove what happened.
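One way to make this failure visible is to record a checksum manifest of the inputs when a result is produced and to compare against it before any rerun. The sketch below is a minimal illustration using only the Python standard library; the function names and the JSON manifest layout are assumptions made for this example, not part of Docker, DVC, or DataLad.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to input_dir) to its checksum."""
    return {
        p.relative_to(input_dir).as_posix(): sha256_of(p)
        for p in sorted(input_dir.rglob("*"))
        if p.is_file()
    }


def verify_inputs(input_dir: Path, manifest_file: Path) -> list[str]:
    """Compare input_dir with a saved manifest; an empty list means no differences."""
    recorded = json.loads(manifest_file.read_text())
    current = build_manifest(input_dir)
    problems = []
    for name in sorted(recorded.keys() | current.keys()):
        if name not in current:
            problems.append(f"missing: {name}")
        elif name not in recorded:
            problems.append(f"unexpected: {name}")
        elif recorded[name] != current[name]:
            problems.append(f"changed: {name}")
    return problems
```

An entry point could call verify_inputs first and refuse to run when the list is non-empty. The manifest file should live outside the input directory so that it is not checksummed itself.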

This is where data versioning and provenance tracking become necessary.

What Is Data Versioning?

Data versioning brings the same idea that makes Git useful for code into the world of datasets. Instead of tracking every large binary file directly, it creates lightweight snapshots or pointers to specific states of data at specific moments.

The core idea is simple:

  1. You make a change to the dataset, such as adding files, modifying files, or reorganizing folders.
  2. You commit a snapshot that records what changed and where the current data lives.
  3. Later, you can check out that snapshot to restore the exact data state that produced a specific result.
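The three steps above can be sketched as a toy content-addressed store: file contents are copied into a cache keyed by checksum, and a small pointer file records which checksum belongs to which path. This is only an illustration of the idea under simplifying assumptions (everything fits on local disk, files are small enough to read into memory); the function names and the JSON pointer format are hypothetical and do not match how DVC or git-annex store data.

```python
import hashlib
import json
import shutil
from pathlib import Path


def snapshot(data_dir: Path, cache_dir: Path, pointer_file: Path) -> None:
    """Copy each file into a content-addressed cache and write a small pointer file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entries = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            cached = cache_dir / digest
            if not cached.exists():  # identical content is stored only once
                shutil.copy2(path, cached)
            entries[path.relative_to(data_dir).as_posix()] = digest
    # The pointer file is tiny, so it can be committed to Git next to the code.
    pointer_file.write_text(json.dumps(entries, indent=2, sort_keys=True))


def restore(pointer_file: Path, cache_dir: Path, data_dir: Path) -> None:
    """Rebuild data_dir so it matches the state recorded in pointer_file."""
    entries = json.loads(pointer_file.read_text())
    if data_dir.exists():
        shutil.rmtree(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for relative, digest in entries.items():
        target = data_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cache_dir / digest, target)
```

Checking out an older pointer file and calling restore brings back the dataset exactly as it was when that pointer was written, which is the behavior the real tools provide at scale.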

Two leading tools in the scientific Python ecosystem are DVC and DataLad.

DVC: Data Version Control

DVC is a widely adopted data versioning tool for Python-based workflows. It writes small metadata files (.dvc files) that record file paths and checksums, while the data itself lives in a local cache and, once a remote is configured, in remote storage such as cloud buckets, SSH servers, or network shares.

Several DVC features matter for scientific workflows:

  • Pipeline definition. dvc.yaml files let you describe the analysis pipeline, including inputs, outputs, processing steps, and dependencies.
  • Experiment tracking. dvc exp run creates isolated experiment namespaces so you can test parameter configurations without cluttering Git history.
  • Remote storage. DVC can push data to configured remotes such as S3, Google Cloud Storage, Azure, SSH servers, or shared filesystems.
  • Time travel. You can check out a Git commit and run dvc checkout to restore the data state from that point, or dvc pull if the files are not yet in the local cache.
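The pipeline feature rests on a simple idea: a stage only needs to run again when the checksums of its dependencies have changed since the last run. The sketch below illustrates that idea with the standard library. It is not DVC's implementation (DVC also tracks the command, parameters, and outputs), and the names run_if_changed and fingerprint are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def fingerprint(paths: list[Path]) -> dict[str, str]:
    """Checksum every dependency file of a stage."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}


def run_if_changed(deps: list[Path], lock_file: Path, stage: Callable[[], None]) -> bool:
    """Run stage() only when dependency checksums differ from the lock file.

    Returns True if the stage ran and False if it was skipped.
    """
    current = fingerprint(deps)
    previous = json.loads(lock_file.read_text()) if lock_file.exists() else None
    if current == previous:
        return False
    stage()
    # Record the checksums only after the stage finished without raising.
    lock_file.write_text(json.dumps(current, indent=2, sort_keys=True))
    return True
```

The lock file plays the role of a recorded pipeline state: committing it alongside the code ties a result to the exact dependency contents that produced it.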

A typical scientific workflow can look like this:

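# Initialize DVC in your project (inside an existing Git repository)
dvc init

# Add your dataset
dvc add data/raw_microstructures/

# Commit only the lightweight metadata
git add data/raw_microstructures.dvc data/.gitignore
git commit -m "Initial microstructure dataset"

# Define a processing stage in dvc.yaml
# (assumes params.yaml defines preprocessing.threshold)
dvc stage add -n preprocess \
    -d src/preprocess.py -d data/raw_microstructures \
    -p preprocessing.threshold \
    -o data/cleaned \
    python src/preprocess.py

# Run the pipeline and record the result in dvc.lock
dvc repro

# Run experiments
dvc exp run --set-param preprocessing.threshold=0.5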

DataLad

DataLad is designed for scientific datasets and uses Git-Annex under the hood. It is especially useful for large datasets that may span many files, such as simulations, experimental measurements, imaging data, or institutional research collections.

DataLad provides several advantages for scientific work:

  • Dataset-centric design. A DataLad dataset is a directory managed by Git and git-annex with its own versioned history, and datasets can be nested as subdatasets.
  • Data retrieval on demand. Instead of pulling everything, DataLad can fetch specific files or subdirectories when needed.
  • Built-in replication. DataLad can manage sibling repositories across institutions for redundancy and compliance.
  • HPC integration. Extensions can support batch scheduling systems such as Slurm and PBS.

A basic DataLad workflow looks like this:

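# Create a dataset and enter it
datalad create my-dataset
cd my-dataset

# Copy data in (the source path is a placeholder)
mkdir -p data
cp -r /path/to/raw_images data/raw_images

# Save the new state; large files go to git-annex under the hood
datalad save -m "Add raw imaging data, batch 2024-01"

# On another machine: clone, then fetch file content on demand
# (<source-url-or-path> is a placeholder for where the dataset is hosted)
datalad clone <source-url-or-path> my-dataset
cd my-dataset
datalad get data/raw_images/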

When to Use Which

DVC is often better for teams working in Python data science environments that need pipeline tracking and experiment management. It integrates naturally with tools such as scikit-learn, PyTorch, and other Python workflows.

DataLad is often better for long-term data curation projects, especially when datasets are large, distributed across institutions, or expected to survive for many years.

They are not mutually exclusive. Both use Git as part of their version control backbone, and both can be combined with containerized execution environments.

What Is Provenance Tracking?

Provenance is the systematic record of where data came from and what happened to it. In scientific workflows, provenance answers practical questions:

  • Which version of the input data was used?
  • What parameters were applied during processing?
  • What intermediate files were generated?
  • What software versions were active when the output was produced?
  • Who ran the workflow, and when?

Two types of provenance matter in scientific workflows.

Prospective provenance describes what will happen. It is the planned workflow specification: the recipe for how data should move through analysis. Tools such as CWL and WDL use declarative formats to specify what should happen when a workflow runs.

Retrospective provenance describes what actually happened. It records execution logs, environment snapshots, data checksums, and exact runtime files. This is the audit trail that supports reproducibility.
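A retrospective record can be as simple as a JSON document written at the start of each run. The sketch below collects answers to the questions listed earlier: input checksums, parameters, software versions, user, and time. The field names and the provenance_record function are assumptions made for this example and do not follow any particular provenance standard.

```python
import getpass
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def current_user() -> str:
    """Return the login name, or 'unknown' where it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def provenance_record(inputs: list[Path], params: dict, command: list[str]) -> dict:
    """Collect a minimal retrospective provenance record for one run."""
    return {
        "started_at": datetime.now(timezone.utc).isoformat(),
        "user": current_user(),
        "command": command,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "parameters": params,
        "inputs": {
            str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in inputs
        },
    }


def write_record(record: dict, path: Path) -> None:
    """Store the record as JSON next to the run's outputs."""
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
```

A script might call provenance_record with sys.argv as the command and write the result into the output directory, so that every figure carries a description of the run that produced it.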

RO-Crate: Packaging Research Artifacts with Provenance

RO-Crate is a standard for packaging research outputs, data files, workflow definitions, parameters, and logs into a machine-readable archive.

An RO-Crate archive is usually a directory or zip file that contains:

  • Data files and analysis outputs.
  • An ro-crate-metadata.json file with JSON-LD metadata.
  • Workflow definitions such as CWL, Nextflow, or Snakemake files.
  • Execution logs and environment captures.

RO-Crate is useful because it is lightweight and portable. You do not need a special database or service to inspect it. Anyone with the files can read the metadata and understand the provenance.
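To show how little machinery is involved, the sketch below writes a bare-bones ro-crate-metadata.json for a directory using only the json module. It follows the general shape of RO-Crate 1.1 (a metadata descriptor, a root dataset, and one entity per file), but it is a simplified illustration: a crate meant for publication also needs fields such as a license, file names with special characters need URL encoding, and the write_minimal_crate helper is hypothetical. For real projects, a maintained library such as ro-crate-py and the RO-Crate specification are the better reference.

```python
import json
from datetime import date
from pathlib import Path


def write_minimal_crate(crate_dir: Path, name: str, description: str) -> Path:
    """Describe every file under crate_dir in a minimal ro-crate-metadata.json."""
    files = sorted(
        p.relative_to(crate_dir).as_posix()
        for p in crate_dir.rglob("*")
        if p.is_file() and p.name != "ro-crate-metadata.json"
    )
    graph = [
        {   # Metadata descriptor: identifies this file as RO-Crate metadata
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: the crate as a whole
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date.today().isoformat(),
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    # One data entity per file, referenced from hasPart above
    graph.extend({"@id": f, "@type": "File"} for f in files)
    metadata = {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}
    out = crate_dir / "ro-crate-metadata.json"
    out.write_text(json.dumps(metadata, indent=2))
    return out
```

Because the result is plain JSON-LD, it can be inspected with any text editor or JSON tool.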

The format supports both prospective provenance, which describes what the workflow plans to do, and retrospective provenance, which records what actually ran.

RO-Crate is gaining traction in scientific computing. Galaxy can export workflow runs as RO-Crates, WorkflowHub uses RO-Crate to package workflows, and crates can be deposited in repositories such as Zenodo, which makes the format useful for publication-ready research artifacts.

Putting It All Together: A Practical Workflow Pattern

Containers, data versioning, and provenance tracking work best when used together. Each one covers a different reproducibility layer.

Step 1: Version Your Data with DVC or DataLad

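Start by versioning raw inputs. Use dvc add or datalad save for every dataset that feeds into the analysis. With DVC, commit the generated .dvc metadata files to Git alongside the code; with DataLad, datalad save records the git-annex metadata in the dataset's Git history for you.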

Step 2: Define Your Pipeline

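If you use DVC, write a dvc.yaml file that describes each processing step as a stage. If you use DataLad, run each step through datalad run, which records the command together with its declared inputs and outputs in the dataset history; keep parameter files consistent and use sibling repositories where needed.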

Step 3: Containerize the Execution

Wrap analysis scripts in a container that can run across machines. The container keeps Python packages, C++ libraries, and binaries consistent. It does not replace versioned inputs.

Step 4: Capture Provenance with RO-Crate or CWLProv

At the end of each workflow run, package artifacts into an RO-Crate archive. This gives you:

  • A single directory or zip file containing data, code, logs, and metadata.
  • Machine-readable provenance linked to datasets.
  • Persistent identifiers when deposited into repositories such as Zenodo or Figshare.

The Workflow in Practice

Your Project Directory
├── .git/                     # Code and metadata
├── .dvc/                     # DVC config and local cache
├── src/                      # Analysis scripts
├── data/                     # Versioned datasets
├── results/                  # Output data
├── dvc.yaml                  # Pipeline definition
├── params.yaml               # Parameter file
└── Dockerfile                # Container definition

Running the workflow can look like this:

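# Restore exact data state (run dvc pull first if the data is not in the local cache)
dvc checkout

# Run with pinned container
docker run --rm -v "$(pwd)":/workspace -w /workspace my-analysis:1.2 python src/run.py

# Package results with provenance, here using the rocrate CLI from the
# ro-crate-py package (see rocrate --help for the options in your version)
cp params.yaml results/
rocrate init -c results/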

Common Mistakes

These are common pitfalls when researchers adopt data versioning and provenance tracking.

1. Tracking Every Intermediate File

Versioning every intermediate file is usually unnecessary and can become harmful. It inflates storage and metadata without improving reproducibility.

Track only:

  • Raw inputs, which are the original acquired files.
  • Preprocessed data, which is the curated version actually used for analysis.
  • Final output files, such as figures, tables, and published simulation outputs.

Do not version every temporary CSV or scratch file unless it is necessary to reproduce the final result.

2. Forgetting to Version Parameter Files

If parameters are passed only as command-line arguments, they are not recorded anywhere and cannot be recovered from Git. Parameter files are just as important as data files.

Parameter files are small text files, so commit them directly to Git rather than through DVC:

git add params.yaml
git commit -m "Record preprocessing parameters"
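When some values still arrive on the command line, one way to keep them recoverable is to merge them with the parameter file and write the effective parameters next to the outputs. The sketch below assumes a JSON parameter file so that it needs only the standard library (a YAML file would need a third-party parser); the option names and the effective_params.json file name are hypothetical.

```python
import argparse
import json
from pathlib import Path


def resolve_params(argv: list[str]) -> dict:
    """Merge file parameters with CLI overrides and record the result."""
    parser = argparse.ArgumentParser(description="Run with recorded parameters")
    parser.add_argument("--params", type=Path, required=True)
    parser.add_argument("--out-dir", type=Path, required=True)
    parser.add_argument("--threshold", type=float, default=None)
    args = parser.parse_args(argv)

    params = json.loads(args.params.read_text())
    if args.threshold is not None:
        params["threshold"] = args.threshold  # a CLI override wins over the file

    # Store what was actually used, so the run can be reconstructed later.
    args.out_dir.mkdir(parents=True, exist_ok=True)
    (args.out_dir / "effective_params.json").write_text(
        json.dumps(params, indent=2, sort_keys=True)
    )
    return params
```

In a script this would be called as resolve_params(sys.argv[1:]). The recorded file can then be versioned together with the outputs it describes.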

3. Assuming Containers Solve Everything

Containers are valuable, but they are only one layer. A container without versioned inputs and provenance is not enough to prove how a specific result was produced.

Think of containers as the environment layer. You still need data versioning for inputs and provenance tracking for the workflow history.

4. Ignoring Remote Storage Configuration

DVC and DataLad work best when remote storage is configured early. Without a remote, the data exists only in one machine's local cache or annex. The metadata in Git still records the checksums, but a collaborator, a cluster node, or a cloud instance has nowhere to fetch the files from.

Configure remotes at the start of the project, not during a deadline crisis.

Choosing the Right Tool for Your Project

The tool landscape is diverse. Use this decision framework as a starting point.

Scenario | Recommended Tool | Why
--- | --- | ---
Python machine learning pipeline with experiments | DVC | Native dvc exp run support for experiment tracking
Large scientific datasets across institutions | DataLad | On-demand retrieval and sibling replication
Publication-ready artifact packaging | RO-Crate | Lightweight, portable, and repository-friendly
Multi-step pipeline with caching | DVC with dvc.yaml, or Snakemake | Automatic dependency resolution and rerun logic
Long-term archival for 10+ years | DataLad plus Zenodo | Git history plus persistent identifiers

You can also combine tools. For example, use DVC for data versioning, Docker for environment reproducibility, and RO-Crate to package the final workflow run for publication.

What To Do Next

If you are starting a new project, the path is straightforward:

  1. Initialize a Git repository for the code.
  2. Add DVC with dvc init and version raw datasets.
  3. Write analysis scripts as Python files or Bash scripts.
  4. Containerize the workflow with Docker or Singularity.
  5. Package results with RO-Crate at each major milestone.

If you are working on an existing project, start small:

  1. Add DVC to the repository and version key datasets.
  2. Write a dvc.yaml file that describes the pipeline.
  3. Containerize the main execution step.
  4. Create one RO-Crate archive for the most important result.

This incremental approach lets you add reproducibility layers without rewriting the entire project.

This article covers data versioning and provenance tracking as practical tools for reproducible scientific workflows. DVC, DataLad, and RO-Crate are actively maintained and widely used in the scientific Python ecosystem. For the latest setup details, always refer to the official documentation for each tool.