Reproducible research workflows ensure that simulation results can be exactly recreated by others (or your future self) using the same data, code, and computational environment. Docker provides complete system-level containerization for maximum consistency across platforms, while Conda offers lightweight package and environment management ideal for Python-based scientific computing. For simulation projects, we recommend: (1) use Conda for daily development and dependency management, (2) build Docker images to “freeze” working environments for publication and collaboration, and (3) always pair with version control and thorough documentation. Avoid the common mistake of relying solely on one tool—combine both for robust reproducibility.
Introduction: The Reproducibility Gap in Simulation Research
Scientific simulation projects often suffer from a silent failure mode: the code worked yesterday, but today it produces different results. The underlying physics hasn’t changed—your computational environment has. Missing package versions, altered library dependencies, operating system updates, or even different Python interpreters can silently alter simulation outputs, sometimes in ways that are difficult to detect.
This is more than an inconvenience. Reproducibility is the cornerstone of cumulative science—the ability for other researchers to verify claims and build upon your work. When simulation results cannot be reliably recreated, trust erodes, papers are retracted, and valuable research time is wasted debugging environmental issues instead of advancing knowledge.
In this guide, we’ll examine how to implement reproducible research workflows for simulation projects using Docker and Conda—two complementary tools that, when used together, provide a robust solution for environment consistency, dependency management, and long-term preservation of computational methods.
What Is Reproducible Research? Definitions and Core Components
At its core, reproducible research means that an independent researcher can regenerate the published results (tables, figures, quantitative findings) using only the original data, code, and documentation. As defined by The Turing Way and widely adopted in computational science, this requires:
- Data Availability: Raw input data is accessible (with appropriate privacy considerations)
- Code Transparency: All analysis and simulation scripts are shared
- Documentation: Complete records of software versions, parameters, and environment configurations
- Compute Environment Control: The exact runtime environment is captured and reconstructable
Computational reproducibility differs from replicability (obtaining similar conclusions using new data or independent methods). Reproducibility is the minimum standard—it’s about getting the same numbers from the same inputs, not validating the underlying scientific claims.
Why it matters for simulation projects: PDE solvers, finite element methods, and other simulation frameworks often involve complex dependency chains. A minor version change in a numerical library can alter discretization behavior, convergence criteria, or rounding—producing measurably different results. Reproducible workflows eliminate this source of uncertainty.
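A cheap first step toward this control is to record the interpreter and key package versions alongside every set of results, so any later discrepancy can be traced to an environment change. A minimal sketch (the `capture_environment` helper and default package list are illustrative, not a standard API):

```python
import json
import platform
import sys
from importlib import metadata

def capture_environment(packages=("numpy", "scipy")):
    """Snapshot interpreter, platform, and package versions for provenance."""
    env = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {},
    }
    for pkg in packages:
        try:
            env["packages"][pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            env["packages"][pkg] = None  # not installed in this environment
    return env

# Write the snapshot next to your outputs so every result is traceable
print(json.dumps(capture_environment(), indent=2))
```

Saving this JSON with each simulation run costs almost nothing and makes “which environment produced this figure?” answerable years later.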
Docker for Reproducible Scientific Simulations
What Docker Provides
Docker is a containerization platform that packages an application and its entire runtime environment—operating system, libraries, dependencies, configuration files—into a portable, immutable image. When you run a Docker container, you’re executing the exact same environment that was built and tested, regardless of the host system.
For simulation projects, Docker delivers:
- Environmental Consistency: No more “it works on my machine.” The container includes specific versions of compilers, MPI implementations, Python interpreters, and numerical libraries.
- Platform Portability: A Docker image built on a laptop can run on an HPC cluster, cloud instance, or colleague’s workstation without modification.
- Isolation: Simulation dependencies don’t conflict with system libraries or other projects.
- Versioned Snapshots: Each Docker image is immutable and can be tagged (e.g., `my-sim:paper-v1`) for exact future reproduction.
Docker Limitations to Consider
Despite its strengths, Docker has important constraints for scientific computing:
- Not a silver bullet: As analyses such as “Docker Does Not Guarantee Reproducibility” have noted, containers can still exhibit non-deterministic behavior if the underlying hardware differs (CPU architecture, floating-point optimizations) or if external services (databases, file systems) vary.
- Kernel Dependency: Docker containers share the host kernel. This means container behavior can still be affected by host kernel version and configuration.
- Size Overhead: Full OS images can be large (hundreds of MB to GB), though slim base images like `alpine` help.
- HPC Restrictions: Many HPC centers don’t allow Docker directly due to security concerns; they use Singularity or Apptainer instead. However, you can build Singularity images from Docker images, making Docker a viable development tool.
Writing Dockerfiles for Simulation Projects
The Dockerfile defines how to build your container. Following “Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science” (Nüst et al., 2020), key practices include:
```dockerfile
# Start from a minimal, pinned base image (exact version, not :latest)
FROM ubuntu:22.04

# Set environment variables for reproducibility
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8

# Install system dependencies in one layer to minimize cache issues
RUN apt-get update && apt-get install -y \
        python3 \
        python3-pip \
        libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*

# Create and set working directory
WORKDIR /simulation

# Copy dependency specifications first (for better caching)
COPY requirements.txt environment.yml ./

# Install Python packages with pinned versions
RUN python3 -m pip install --no-cache-dir -r requirements.txt

# Copy simulation code
COPY src/ ./src/
COPY scripts/ ./scripts/

# Copy small reference data only (mount large datasets at runtime)
COPY data/ ./data/

# Define entry point or command
ENTRYPOINT ["python3", "scripts/run_simulation.py"]
```
Key rule: Pin all versions explicitly—base image, OS packages, Python packages, and even the pip version. Use `requirements.txt` or `environment.yml` with exact versions (`package==1.2.3`, not `package>=1.0`).
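Pinning only helps if the running environment actually matches the pins. A lightweight startup check can catch drift early; the sketch below is illustrative (the `check_pins` helper is hypothetical, and in practice you would parse the pins from your `requirements.txt`):

```python
from importlib import metadata

def check_pins(pins):
    """Compare installed distribution versions against exact pins.

    `pins` maps distribution name -> expected version string.
    Returns a list of human-readable mismatch descriptions.
    """
    problems = []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{name}: installed {installed}, want {wanted}")
    return problems

# Fail fast at simulation startup if the environment drifted
# (package name below is a deliberate placeholder)
problems = check_pins({"nonexistent-demo-package": "1.0.0"})
print(problems)
```

Calling this before any simulation step turns a silent environment mismatch into an explicit, loggable error.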
Conda for Environment Management in Research
What Conda Provides
Conda is a cross-platform package and environment manager that handles not just Python packages but also non-Python dependencies (C/C++ libraries, compilers, MPI). This makes it particularly well-suited for scientific simulation where you might need specific versions of OpenMPI, FFTW, or HDF5.
Conda delivers:
- Language Agnostic: Install Python, R, C/C++ libraries, and system tools in one environment.
- Binary Dependencies: Pre-compiled packages avoid compilation hell on different systems.
- Isolated Environments: Each project gets its own environment with no cross-contamination.
- Export/Import: `conda env export > environment.yml` captures the entire environment for exact reproduction.
Conda Best Practices
Based on guidance from the Minnesota Supercomputing Institute (MSI) and Anaconda:
- Never install into `base`: Create a new environment for each project:

  ```bash
  conda create --name my-simulation python=3.11
  conda activate my-simulation
  ```

- Use community channels: Prefer `conda-forge` over `defaults` for more up-to-date scientific packages:

  ```bash
  conda config --add channels conda-forge
  conda config --set channel_priority strict
  ```

- Install all packages at once: This avoids dependency conflicts:

  ```bash
  conda install numpy scipy matplotlib fipy
  ```

- Don’t modify existing environments: If you need new packages, either update the environment specification or create a fresh environment from the updated YAML file.
- Export cleanly: When sharing, remove platform-specific build strings, the machine-specific prefix, and pip-installed items that aren’t essential:

  ```bash
  conda env export --no-builds | grep -v "prefix:" > environment.yml
  ```
Docker vs Conda: When to Choose Which
The question isn’t “Docker or Conda?”—they solve different problems and are complementary.
| Aspect | Docker | Conda |
|---|---|---|
| Scope | Entire OS + runtime | Package & environment manager |
| Isolation | System-level (kernel namespace) | User-space environment |
| Size | Large (100MB–1GB+) | Small (MBs) |
| Speed | Slower to build/transfer | Instant activation |
| HPC Support | Limited (Singularity works) | Excellent (native) |
| Use Case | Publishing, sharing, deployment | Daily development, exploration |
Practical Recommendation
- Use Conda for development: Quickly create isolated environments, test dependencies, iterate on code. It’s lightweight and fast.
- Use Docker for preservation: Once your simulation works, build a Docker image to “freeze” the exact environment. Share this image with collaborators, attach it to publications, or use it for CI/CD.
- Combine both: Develop in Conda, then create a Dockerfile that either:
- Copies the Conda environment into the image, or
- Recreates the environment using the exported `environment.yml`
This layered approach gives you development agility and publication robustness.
Integrating Docker and Conda in Scientific Workflows
Strategy 1: Conda Inside Docker
The most common integration is to install and use Conda within a Docker container. This gives you Conda’s fine-grained package management inside Docker’s system isolation.
```dockerfile
FROM ubuntu:22.04

# Install Miniconda (pin a specific installer release for strict reproducibility)
RUN apt-get update && apt-get install -y wget \
    && rm -rf /var/lib/apt/lists/* \
    && wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda \
    && rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH=/opt/conda/bin:$PATH

# Create the conda environment and put it first on PATH
COPY environment.yml .
RUN conda env create -f environment.yml
ENV PATH=/opt/conda/envs/my-sim/bin:$PATH
```
Pros: Leverages Conda’s extensive scientific package ecosystem; consistent with local development workflows.
Cons: Larger image size; Conda activation nuances in Docker.
Strategy 2: Docker for Environment Capture
Develop locally with Conda, then export the environment and bake it into a Docker image without running Conda at runtime:
```dockerfile
FROM python:3.11-slim

# Copy pre-built packages or use pip from a frozen requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```
This is simpler but loses Conda’s non-Python dependency handling.
Strategy 3: Docker-to-Singularity Conversion for HPC
For HPC clusters using Singularity, build the Docker image locally, then convert to Singularity:
```bash
# Build locally, then convert from the local Docker daemon
# (docker:// would pull from a registry instead)
docker build -t my-sim:latest .
singularity build my-sim.sif docker-daemon://my-sim:latest
```
This workflow lets you develop with Docker (easy testing) and deploy on HPC (Singularity).
Common Pitfalls and How to Avoid Them
Based on analysis of reproducibility challenges in simulation research, here are critical mistakes to avoid:
1. Missing or Incomplete Documentation
Problem: You have a working Docker image, but no record of how to use it, what inputs it expects, or how to interpret outputs.
Solution: Include a README.md in the container (or alongside the image) with:
- How to run the simulation (command-line arguments)
- Expected input file formats
- Output file descriptions
- Hardware requirements (CPU, memory, GPU)
- Known limitations
2. Unpinned Dependencies
Problem: Using `numpy>=1.0` or `python=3.x` allows automatic updates that can change behavior.
Solution: Pin exact versions in both `environment.yml` and `requirements.txt`:
```yaml
dependencies:
  - python=3.11.8
  - numpy=1.26.4
  - scipy=1.11.4
  - pip:
      - my-package==0.3.2
```
3. Non-Deterministic Simulations
Problem: Even with identical environments, simulations produce slightly different results due to floating-point non-associativity, parallel race conditions, or uninitialized memory.
Solution:
- Set deterministic flags where available (e.g., `OMP_NUM_THREADS=1` for OpenMP, single-threaded BLAS)
- Use fixed random seeds and document them
- Test reproducibility by running the container multiple times on the same host
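To illustrate the seeding point, here is a toy example: with an isolated, seeded generator, repeated runs are bit-for-bit identical (the `noisy_simulation` function is a stand-in for a real stochastic model, not code from this guide's project):

```python
import random

def noisy_simulation(seed, steps=1000):
    """Toy stochastic 'simulation': a seeded one-dimensional random walk."""
    rng = random.Random(seed)  # isolated generator; avoids hidden global state
    x = 0.0
    for _ in range(steps):
        x += rng.uniform(-1.0, 1.0)
    return x

# Identical seeds give identical trajectories; different seeds diverge
assert noisy_simulation(seed=42) == noisy_simulation(seed=42)
assert noisy_simulation(seed=42) != noisy_simulation(seed=7)
```

Using a per-run `random.Random` instance (rather than the module-level functions) also keeps library code from silently consuming your random stream.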
4. Large Data Inside Containers
Problem: Baking large simulation datasets into Docker images bloats them and makes distribution slow.
Solution: Keep data external. Use Docker volumes or bind mounts to attach data at runtime:
```bash
docker run -v /path/to/data:/data my-sim:latest
```
Document data expectations clearly.
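One way to document data expectations precisely is to record checksums of external inputs and verify them at startup, so a run can prove it used the exact data. A minimal sketch (the helper name and file paths are illustrative):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Checksum an input file in streaming chunks (safe for large datasets)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Compare against a recorded manifest before running the simulation, e.g.:
# assert sha256_of("/data/input.h5") == EXPECTED_SHA256
```

Committing a small manifest of `filename -> sha256` pairs to the repository keeps the data itself out of Git while still pinning its identity.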
5. Ignoring HPC Constraints
Problem: Docker images that work on a laptop fail on an HPC cluster due to missing MPI implementations, incompatible drivers, or security restrictions.
Solution:
- Test in a cluster-like environment early
- Use Singularity compatibility when targeting HPC
- Avoid Docker-in-Docker patterns; build on a base image that matches cluster OS (e.g., CentOS/Rocky if cluster uses those)
6. No Version Control for Dockerfiles and Environment Files
Problem: You have a working image but no history of changes to the Dockerfile or environment.yml.
Solution: Treat the Dockerfile and environment specifications as code. Store them in Git alongside your simulation code. Tag releases (e.g., `git tag -a v1.0 -m "Paper submission"`).
7. Overlooking External Services
Problem: Your simulation pulls data from a database or API that changes over time, breaking reproducibility.
Solution: Either:
- Snapshot external data and include it in your repository or container, or
- Use versioned API endpoints and document the exact version/date accessed
HPC and Cluster Considerations
High Performance Computing environments introduce additional reproducibility challenges:
Singularity/Apptainer Instead of Docker
Most HPC centers prohibit Docker for security reasons. Instead, they provide Singularity (or its fork Apptainer). Singularity containers are built from Docker images:
```bash
# On your local machine, build the image as usual
docker build -t my-sim:latest .

# Export it and transfer the archive to the HPC system
docker save my-sim:latest -o my-sim.tar

# On the HPC login node, convert with Singularity/Apptainer
singularity build my-sim.sif docker-archive://my-sim.tar
```
Key difference: Singularity runs containers as the invoking user (no root), so package installation paths differ. Test your Dockerfile with Singularity to catch issues early.
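To catch such issues early, it can help to keep a minimal Singularity/Apptainer definition file next to your Dockerfile. This sketch (the image name and script path are illustrative assumptions) bootstraps from a local Docker image and sets a runscript:

```singularity
# my-sim.def — minimal definition bootstrapping from a local Docker image
Bootstrap: docker-daemon
From: my-sim:latest

%runscript
    exec python3 /simulation/scripts/run_simulation.py "$@"
```

Build with `singularity build my-sim.sif my-sim.def` and test on a workstation before moving to the cluster.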
Module Systems
Many HPC clusters use Environment Modules (Lmod) to manage software versions. You can either:
- Load required modules before running your container (if Singularity can access them), or
- Build your container on a base image that already includes needed libraries
Parallel I/O and MPI
If your simulation uses MPI (Message Passing Interface), ensure your container includes a compatible MPI implementation. For Singularity, you can bind-mount the host’s MPI libraries:
```bash
singularity run -B /usr/lib/x86_64-linux-gnu/openmpi:/usr/lib/x86_64-linux-gnu/openmpi my-sim.sif
```
Alternatively, install MPICH or OpenMPI inside the container and ensure it’s configured to use the host’s network fabric (Infiniband, etc.).
GPU Support
For GPU-accelerated simulations, both Docker and Singularity require special flags:
- Docker: `--gpus all` (requires the NVIDIA Container Toolkit)
- Singularity: `--nv`
Test GPU functionality thoroughly in your container.
Step-by-Step Implementation Guide
Here’s a practical, phased workflow for making a simulation project reproducible:
Phase 1: Project Setup
1. Initialize version control (Git):

   ```bash
   git init
   git add .
   git commit -m "Initial project structure"
   ```

2. Create a Conda environment:

   ```bash
   conda create --name my-sim python=3.11
   conda activate my-sim
   ```

3. Install dependencies and record them:

   ```bash
   conda install numpy scipy matplotlib fipy   # example for PDE simulations
   conda env export --no-builds | grep -v "prefix:" > environment.yml
   ```

4. Create the project structure:

   ```
   my-simulation/
   ├── src/              # Source code
   ├── scripts/          # Run scripts, entry points
   ├── data/             # Input data (git-ignored if large)
   ├── outputs/          # Generated results (git-ignored)
   ├── docs/             # Documentation
   ├── environment.yml   # Conda environment
   ├── requirements.txt  # Pip-only dependencies (if any)
   ├── Dockerfile        # Container definition
   ├── README.md         # Usage instructions
   └── .gitignore        # Exclude outputs, large data
   ```
Phase 2: Development with Conda
- Develop and test inside the Conda environment
- Commit code changes frequently
- Update `environment.yml` when adding/removing packages
- Use `.gitignore` to exclude generated outputs and large data files
Phase 3: Building the Docker Image
1. Create a Dockerfile (see example above)
2. Build the image:

   ```bash
   docker build -t my-sim:latest .
   ```

3. Test the container (with the `ENTRYPOINT` above, arguments go straight to `run_simulation.py`):

   ```bash
   docker run -v "$(pwd)/data":/data my-sim:latest --input /data/input.h5
   ```

4. Tag for publication (include the registry name if you plan to push):

   ```bash
   docker tag my-sim:latest my-registry.example.com/my-sim:paper-v1.0
   ```

5. Push to registry (optional, for sharing):

   ```bash
   docker push my-registry.example.com/my-sim:paper-v1.0
   ```
Phase 4: Verification and Sharing
- Test reproducibility: Have a colleague pull and run the image. They should get identical results (bit-for-bit identical if the simulation is deterministic).
- Document: Ensure `README.md` includes:
  - How to obtain the image (Docker Hub, registry, or `.sif` file)
  - How to run it (full command)
  - Input file specifications
  - Expected output files and their formats
  - Citation information
- Archive: Deposit the Docker image (or Singularity `.sif`) in a long-term archive like Zenodo or Figshare, and include the link in your paper’s methods or data availability statement.
Phase 5: Long-Term Maintenance
- When making code changes, update the Docker image and tag it with a new version (e.g., `v1.1`)
- Keep old images/tags for as long as you need to reproduce old results
- Use Git tags to correlate code commits with Docker image versions
Conclusion and Next Steps
Implementing reproducible research workflows isn’t a single tool decision—it’s a layered strategy:
- Conda for lightweight, fast environment management during development
- Docker for immutable, portable environment snapshots suitable for publication and collaboration
- Git for version control of code, Dockerfiles, and environment specifications
- Documentation to make the workflow understandable and usable by others
For simulation projects where correctness and verifiability are paramount, this combination provides a robust foundation. Start with Conda for your next project, and once the simulation is working, invest the time to create a Docker image. The upfront cost pays dividends when you (or others) need to rerun the simulation months or years later with confidence.
Next steps you can take today:
- Audit your current simulation projects: Are environments documented? Are dependencies pinned?
- Convert an existing project to use Conda environments with `environment.yml`
- Explore your HPC center’s Singularity policies and convert a Docker image to Singularity format
- Include environment specifications and container images in your next paper’s supplementary materials
Related Guides
- Managing Research Software Through Tickets – Structure reproducibility work as tracked issues and enhancements
- Managing Large-Scale PDE Problems: Strategies, Solvers, and HPC Case Studies – Scale your simulations while maintaining reproducibility
- Tracking Long-Term Technical Debt in Research Software – Use CI/CD practices to maintain reproducibility over time
- Collaboration Between Developers and Researchers – Ensure reproducibility across team boundaries
References and Further Reading
- Boettiger, C. (2015). An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review.
- Nüst, D., et al. (2020). Ten simple rules for writing Dockerfiles for reproducible data science. PLOS Computational Biology.
- The Turing Way. Definitions of reproducible research.
- Fitzpatrick, B. G., et al. (2018). Issues in Reproducible Simulation Research. Frontiers in Computer Science.
- Conda Documentation. Managing environments.