Unit testing is non-negotiable for trustworthy scientific software. Unlike commercial applications, research code often lacks formal testing, leading to irreproducible results and wasted effort. This guide covers pytest strategies specifically for scientific Python projects: handling numerical precision with pytest.approx, isolating external dependencies with mocking, using fixtures and parametrization efficiently, and integrating tests into continuous integration pipelines. You’ll learn when to use black-box validation against published results and how to design tests that survive code evolution without becoming brittle.
Why Unit Testing Matters in Research Software
Scientific software exists in a challenging space. It must be flexible enough to explore new hypotheses, yet reliable enough that published results can be reproduced months or years later. Unlike commercial software with clear specifications, research code often evolves alongside experiments, with requirements changing as new discoveries emerge.
The consequences of inadequate testing in research are severe:
- Irreproducible results: Different researchers get different outputs from the same code
- Silent bugs: Numerical errors that seem small individually compound into significant inaccuracies
- Knowledge loss: When original developers leave, tests serve as executable documentation
- Wasted effort: Debugging becomes a detective hunt instead of a systematic process
Unit testing addresses these problems by validating small, isolated components of your code. Each test verifies that a specific function or class behaves as expected given defined inputs. When tests pass consistently across environments, you have evidence that your code produces reliable results.
But testing scientific code presents unique challenges that standard software testing approaches don’t fully address.
Unique Challenges of Testing Scientific Code
Scientific and numerical code differs from typical business applications in several ways that affect testing strategy.
Numerical Precision and Floating-Point Errors
Floating-point arithmetic is inherently imprecise. Due to how computers represent decimal numbers, 0.1 + 0.2 is not exactly 0.3 in binary floating-point representation. In scientific simulations involving thousands or millions of operations, these tiny errors accumulate.
A naive test that uses exact equality (==) will fail intermittently or on different hardware:
```python
def test_numerical_computation():
    result = complex_simulation()  # returns 1.0000000000000002
    assert result == 1.0  # FAILS! even though the difference is negligible
```
The solution is to use tolerance-based comparisons. Pytest provides pytest.approx() for this purpose:
```python
import pytest

def test_numerical_computation():
    result = complex_simulation()
    assert result == pytest.approx(1.0, rel=1e-9, abs=1e-12)
```
rel (relative tolerance) scales with the magnitude of the expected value; abs (absolute tolerance) handles cases where the expected value is at or near zero, where any relative tolerance collapses to nothing. The defaults are rel=1e-6 and abs=1e-12, but scientific code often needs tolerances tuned to the accuracy of the numerical method rather than left at the defaults.
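A minimal illustration of the two kinds of tolerance, using only pytest itself:

```python
import pytest

def test_relative_vs_absolute_tolerance():
    # rel scales with the magnitude of the expected value:
    # tolerance here is 1e-6 * 1000 = 1e-3, so a 1e-4 difference passes
    assert 1000.0001 == pytest.approx(1000.0, rel=1e-6)
    # any relative tolerance times an expected value of zero is zero,
    # so comparisons against zero need an absolute tolerance
    assert 1e-13 == pytest.approx(0.0, abs=1e-12)
```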
Unknown “Correct” Answers
In many research scenarios, you don’t have a known correct output. The simulation might be exploring uncharted territory. How do you test code when you don’t know what the answer should be?
Several strategies work:
- Inverse operations: If your code computes `B = f(A)`, also test that `A ≈ f⁻¹(B)`.
- Conservation laws: For physical simulations, verify that mass, energy, or momentum is conserved within tolerance.
- Limiting cases: Test behavior in simplified limits where analytical solutions exist.
- Regression testing: Store outputs from a trusted run and detect unexpected changes.
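The inverse-operation strategy can be sketched with a round-trip through NumPy's FFT, where the forward and inverse transforms should cancel up to floating-point error:

```python
import numpy as np
import pytest

def test_fft_round_trip():
    """Inverse-operation check: ifft(fft(x)) should recover x."""
    rng = np.random.default_rng(seed=0)
    x = rng.normal(size=64)
    # .real discards the negligible imaginary residue of the inverse transform
    recovered = np.fft.ifft(np.fft.fft(x)).real
    assert recovered == pytest.approx(x)
```

The same pattern applies to any invertible pair in your own code: interpolation and restriction, encoding and decoding, forward and adjoint operators.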
External Dependencies and Heavy Computation
Scientific code often depends on:
- Large datasets (terabytes of simulation input)
- External solvers or libraries (HPC packages, Fortran libraries)
- File I/O with complex formats
- Database connections or APIs
Running the full system in every unit test is impractical. You need isolation.
pytest Features That Solve Research Testing Problems
Pytest provides several features that are particularly valuable for scientific code.
Fixtures for Setup and Teardown
Fixtures encapsulate setup code that runs before tests. For scientific testing, fixtures can:
- Create temporary test data or meshes
- Initialize simulation objects with known parameters
- Clean up temporary files after tests
- Provide reusable test configurations
```python
import pytest

@pytest.fixture
def simple_mesh():
    """Create a small 1D mesh for testing."""
    from fipy import Grid1D
    return Grid1D(dx=0.1, nx=10)

@pytest.fixture
def diffusion_solver(simple_mesh):
    """Set up a diffusion solver on the test mesh."""
    from fipy import CellVariable, DiffusionTerm
    var = CellVariable(name="concentration", mesh=simple_mesh, value=1.0)
    eq = DiffusionTerm(coeff=1.0) == 0
    return var, eq
```
Parametrization for Multiple Scenarios
Instead of writing separate test functions for similar cases, use @pytest.mark.parametrize to run the same test logic with different inputs.
```python
import pytest
from fipy import Grid1D

@pytest.mark.parametrize("dx,nx,expected_volume", [
    (0.1, 10, 1.0),
    (0.01, 100, 1.0),
    (0.001, 1000, 1.0),
])
def test_mesh_volume(dx, nx, expected_volume):
    """Test that mesh volume matches domain size."""
    mesh = Grid1D(dx=dx, nx=nx)
    assert mesh.cellVolumes.sum() == pytest.approx(expected_volume)
```
Parametrization is especially useful for:
- Testing edge cases (zero values, very small/large numbers)
- Verifying behavior across different mesh resolutions
- Validating multiple boundary condition types
- Checking various material property values
For research code, you can parametrize against known benchmark results from published papers.
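For instance, tabulated values of a special function can serve as the parametrized benchmark. The values below are standard error-function results (standing in for numbers you would take from a published table):

```python
import math
import pytest

# Benchmark values for erf(x); in research code these would come from a
# published table or a trusted reference implementation
@pytest.mark.parametrize("x,expected", [
    (0.0, 0.0),
    (1.0, 0.8427007929497149),
    (2.0, 0.9953222650189527),
])
def test_erf_matches_tabulated_values(x, expected):
    assert math.erf(x) == pytest.approx(expected, abs=1e-12)
```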
Mocking External Dependencies
Mocking replaces real dependencies with controlled fakes. This isolates the unit under test and makes tests faster and more reliable.
When to mock in scientific code:
- External data files: Replace large datasets with minimal synthetic data that exercises the same code paths
- HPC solvers: Mock expensive Fortran libraries with pure Python implementations that return known results
- Network APIs: Stub remote services that provide parameters or configuration
- Random number generators: Seed them to produce deterministic sequences
```python
import numpy as np
from unittest.mock import patch

from mycode import run_simulation

def test_simulation_with_external_data():
    # Mock the data-loading function to return small, known data
    with patch('mycode.load_large_dataset') as mock_load:
        mock_load.return_value = np.array([1.0, 2.0, 3.0])
        result = run_simulation()
        assert result.converged
```
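Deterministic random numbers usually need no mocking at all: pass an explicitly seeded generator into the code under test. A minimal sketch (`noisy_measurement` is a hypothetical function standing in for your stochastic code):

```python
import numpy as np

def noisy_measurement(rng):
    """Hypothetical function under test that consumes randomness."""
    return 1.0 + rng.normal(scale=0.01)

def test_noisy_measurement_is_reproducible():
    # Two generators with the same seed produce identical sequences,
    # so the "random" code path becomes deterministic in tests
    a = noisy_measurement(np.random.default_rng(seed=42))
    b = noisy_measurement(np.random.default_rng(seed=42))
    assert a == b
```

Accepting the generator as a parameter, rather than using the global NumPy random state, is what makes this testable.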
Important guideline: Mock your own code’s dependencies, not third-party libraries you don’t control. Follow the principle “don’t mock what you don’t own.”
Using pytest.approx for Numerical Comparisons
Floating-point comparisons must account for rounding errors. Pytest’s approx object handles this elegantly:
```python
import pytest

def test_diffusion_solution():
    """Test that diffusion reaches expected steady state."""
    concentration = solve_diffusion(time=100.0)
    expected = 0.5  # analytical steady state for this boundary condition
    assert concentration.mean() == pytest.approx(expected, rel=1e-6)
```
You can also use approx with arrays:
```python
import numpy as np
import pytest

def test_array_computation():
    result = compute_field()
    expected = np.array([1.0, 2.0, 3.0])
    assert result == pytest.approx(expected)
```
For scientific code, choose tolerances based on:
- The numerical accuracy of your methods (e.g., second-order finite difference has truncation error O(dx²))
- The precision required by your application (engineering tolerance vs. exploratory research)
Test-Driven Development for Research Projects
Test-Driven Development (TDD) follows a simple cycle: write a failing test, then write minimal code to make it pass, then refactor. While TDD is well-established in commercial software, research projects often resist it due to perceived time constraints.
The reality: TDD saves time in research by catching bugs before they propagate through experiments. Writing tests first forces you to clarify the interface and expected behavior of each function before implementation.
TDD adapted for scientific exploration:
1. Start with a simple model or algorithm you understand analytically
2. Write tests that validate against known results (analytical solutions, limiting cases)
3. Implement the code to pass those tests
4. Extend the model incrementally, adding tests for each new capability
5. When you discover a bug, write a test that reproduces it first, then fix it
TDD works well for:
- Utility functions (mesh generation, coordinate transformations)
- Mathematical operations (matrix manipulations, special functions)
- Data processing pipelines (parsing, filtering, normalization)
- Configuration validation
TDD is less suitable for:
- Highly exploratory code where the interface itself is uncertain
- One-off scripts that won’t be reused
- Code that depends on external resources not yet available
In practice, a hybrid approach works best: write tests for stable, foundational components; use lighter integration tests for experimental sections.
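The red-green cycle can be illustrated with a small numerical utility. The test below is written first, against the analytically known integral of x² on [0, 1]; the `trapezoid` function (a sketch, not from any particular library) is then implemented to make it pass:

```python
import pytest

def test_trapezoid_matches_analytical_integral():
    # Written first: the integral of x**2 on [0, 1] is exactly 1/3.
    # The composite trapezoid rule has O(h**2) error, hence the tolerance.
    result = trapezoid(lambda x: x * x, 0.0, 1.0, n=1000)
    assert result == pytest.approx(1.0 / 3.0, rel=1e-5)

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n intervals, implemented to pass the test."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h
```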
Organizing Tests for Scientific Projects
Where should test files live? Pytest offers flexibility:
```text
my_research_project/
├── src/
│   └── mypackage/
│       ├── __init__.py
│       ├── solver.py
│       └── mesh.py
├── tests/
│   ├── __init__.py
│   ├── test_solver.py
│   ├── test_mesh.py
│   └── conftest.py        # shared fixtures
├── data/
│   └── reference_results/ # stored outputs for regression tests
├── .github/
│   └── workflows/
│       └── ci.yml         # GitHub Actions CI configuration
├── pyproject.toml
└── README.md
```
Key conventions:
- Keep tests in a separate `tests/` directory parallel to `src/` (or `lib/`)
- Name test files `test_*.py` or `*_test.py`
- Name test functions `test_*()` to allow pytest auto-discovery
- Use `conftest.py` for fixtures shared across multiple test files
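A conftest.py might look like the sketch below; fixtures defined here are discovered automatically by every test file in the directory tree, with no import needed (the fixture names are illustrative):

```python
# tests/conftest.py
import numpy as np
import pytest

@pytest.fixture
def rng():
    """Deterministic random generator shared across test modules."""
    return np.random.default_rng(seed=0)

@pytest.fixture
def small_field(rng):
    """Tiny synthetic 2D field, a stand-in for large real datasets."""
    return rng.normal(size=(4, 4))
```

Any test function that names `rng` or `small_field` as a parameter receives the fixture's return value.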
For FiPy-based projects, structure tests to match the module hierarchy:
```text
fipy_project/
├── fipy/
│   ├── meshes/
│   │   └── grid1d.py
│   └── terms/
│       └── diffusion.py
├── tests/
│   ├── meshes/
│   │   └── test_grid1d.py
│   └── terms/
│       └── test_diffusion.py
```
Integration with Continuous Integration
Unit tests only provide value if they run consistently. Continuous Integration (CI) automates test execution whenever code changes.
GitHub Actions provides a straightforward CI setup for Python projects:
```yaml
# .github/workflows/ci.yml
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[test]
      - name: Run tests with pytest
        run: |
          pytest --cov=src --cov-report=xml --cov-report=html
      - name: Upload coverage
        uses: codecov/codecov-action@v3
```
CI ensures:
- Tests pass on multiple platforms (extend the matrix with macOS and Windows runners)
- Tests pass on multiple Python versions
- Code coverage is tracked over time
- Pull requests are validated before merging
For research software, consider adding:
- Tests that run with different versions of key dependencies (NumPy, SciPy, FiPy)
- Performance regression checks (ensure algorithms don’t slow down)
- Documentation builds to verify examples still work
Common Mistakes and How to Avoid Them
Based on research and industry best practices, here are the most common unit testing mistakes in scientific code:
1. Testing Implementation Details Instead of Behavior
Testing internal implementation makes tests brittle. When you refactor code, tests should still pass if the external behavior is correct.
```python
# ❌ Bad: tests internal state
def test_algorithm_updates_counter():
    obj = MyAlgorithm()
    obj.step()
    assert obj.counter == 1  # fragile if the counter implementation changes

# ✅ Better: tests observable outcome
def test_algorithm_produces_correct_result():
    obj = MyAlgorithm()
    result = obj.run()
    assert result == expected
```
2. Ignoring Numerical Tolerance
Exact comparisons on floating-point results cause flaky tests that fail randomly or on different hardware. Always use pytest.approx() or similar tolerance-based assertions for numerical outputs.
3. Writing Slow Tests
Unit tests should run quickly (milliseconds, not seconds). If a test is slow:
- It won’t be run frequently enough
- Developers will skip running the full test suite
- CI becomes expensive and slow
Solutions:
- Use small synthetic datasets instead of large real ones
- Mock expensive external computations
- Separate slow integration tests from fast unit tests
- Use parametrization wisely: don’t run thousands of variations in every test pass
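Separating slow tests from fast ones is often just a custom marker. A sketch (the `slow` marker name is a convention, not built in; register it in pyproject.toml under `[tool.pytest.ini_options]` to avoid warnings):

```python
import pytest

@pytest.mark.slow
def test_full_resolution_convergence():
    # stand-in for a minutes-long, high-resolution computation
    assert sum(i * i for i in range(100_000)) > 0

def test_cheap_sanity_check():
    assert 2 + 2 == 4
```

Then `pytest -m "not slow"` runs the fast suite during development, while CI (or a nightly job) runs everything.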
4. Not Isolating Tests
Tests should not depend on each other or on global state. Each test should:
- Create its own test data (use fixtures)
- Clean up after itself
- Not rely on execution order
```python
# ❌ Bad: shared mutable state
results = []

def test_first():
    results.append(1)

def test_second():
    assert results == [1]  # fails if tests run in the wrong order

# ✅ Better: independent tests
def test_first():
    result = compute_something()
    assert result == 1

def test_second():
    result = compute_something_else()
    assert result == 2
```
5. Skipping Tests Without Good Reason
@pytest.mark.skip should be used sparingly. If a test is skipped because the environment lacks something, use pytest.importorskip() at module level or make the dependency optional in CI configuration.
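The importorskip pattern is one line at the top of a test module. The sketch below uses numpy as the stand-in optional dependency; in FiPy-based projects it would typically be `pytest.importorskip("fipy")`:

```python
import pytest

# At module import time: returns the module if it is installed,
# otherwise skips every test in this file instead of erroring
np = pytest.importorskip("numpy")

def test_uses_optional_dependency():
    assert np.isclose(np.pi, 3.14159, atol=1e-4)
```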
6. Writing Vague Assertions
Tests should clearly express what is being verified and why.
```python
# ❌ Unclear: what's being tested?
assert result != None

# ✅ Clear: specific expectation with context
assert result.converged is True, "Solver should converge for well-posed problem"
```
7. Hardcoding Paths and Environment Assumptions
Use temporary directories and fixtures rather than fixed paths. Pytest’s tmp_path fixture provides a fresh temporary directory for each test.
Practical Example: Testing a FiPy Diffusion Solver
Let’s put these strategies together with a concrete example relevant to MatForge’s audience.
```python
# tests/test_diffusion.py
import pytest
from fipy import Grid1D, CellVariable, DiffusionTerm

@pytest.fixture
def simple_1d_grid():
    """Create a uniform 1D grid for diffusion testing."""
    return Grid1D(dx=0.1, nx=50)

@pytest.fixture
def steady_state_diffusion(simple_1d_grid):
    """Set up diffusion with Dirichlet boundaries at both ends."""
    mesh = simple_1d_grid
    var = CellVariable(name="concentration", mesh=mesh, value=0.0)
    var.constrain(1.0, mesh.facesLeft)
    var.constrain(0.0, mesh.facesRight)
    eq = DiffusionTerm(coeff=1.0) == 0
    return var, eq

def test_mesh_volume(simple_1d_grid):
    """Total domain length should equal nx * dx."""
    expected_length = 50 * 0.1
    assert simple_1d_grid.cellVolumes.sum() == pytest.approx(expected_length)

def test_diffusion_conservation(steady_state_diffusion):
    """Steady diffusion with no sources: flux through every face is equal."""
    var, eq = steady_state_diffusion
    eq.solve(var=var)
    mesh = var.mesh
    # Flux is -D * dC/dx with D = 1; with no sources, the flux entering
    # any interior face must equal the flux leaving it
    grads = var.faceGrad.value[0][~mesh.exteriorFaces.value]
    assert grads.mean() < 0  # concentration falls from left (1) to right (0)
    assert grads.max() == pytest.approx(grads.min(), rel=1e-4)

def test_diffusion_solution_shape(steady_state_diffusion):
    """Steady-state profile should be linear from 1 at the left to 0 at the right."""
    var, eq = steady_state_diffusion
    eq.solve(var=var)
    x = var.mesh.cellCenters.value[0]
    expected = 1.0 - x / (50 * 0.1)  # linear from 1 to 0 over the domain
    assert var.value == pytest.approx(expected, rel=1e-5)

@pytest.mark.parametrize("dx,nx", [(0.1, 50), (0.05, 100), (0.02, 250)])
def test_mesh_independence(dx, nx):
    """Solution at the midpoint should stay near 0.5 as the mesh refines."""
    mesh = Grid1D(dx=dx, nx=nx)
    var = CellVariable(name="c", mesh=mesh, value=0.0)
    var.constrain(1.0, mesh.facesLeft)
    var.constrain(0.0, mesh.facesRight)
    eq = DiffusionTerm(coeff=1.0) == 0
    eq.solve(var=var)
    mid_idx = nx // 2
    assert var.value[mid_idx] == pytest.approx(0.5, rel=0.1)
```
This example demonstrates:
- Fixtures for reusable test setup
- Parametrization to test multiple resolutions
- Tolerance-based numerical assertions
- Testing physical principles (conservation, linearity)
- Clear, descriptive test names and assertions
When Unit Testing Isn’t Enough
Unit tests validate individual components, but research software also needs:
- Integration tests: Verify that multiple modules work together correctly
- System tests: Run complete simulations end-to-end and compare to known outputs
- Performance tests: Ensure algorithms meet computational complexity expectations
- Visualization checks: Spot obvious rendering errors (automated image comparison where feasible)
A complete testing strategy for research projects includes multiple test levels, with unit tests forming the foundation.
What We Recommend: A Pragmatic Testing Strategy for Research Projects
Based on the evidence from scientific software best practices, here’s our recommended approach:
Start with Foundational Unit Tests
Begin by writing tests for:
- Core mathematical functions (special functions, coordinate transformations)
- Mesh generation and manipulation utilities
- Boundary condition implementations
- Data input/output routines (validation, formatting)
These components are stable, have clear expected behaviors, and are reused across many simulations.
Adopt pytest.approx as Standard
Never use == for floating-point results. Always use pytest.approx() with appropriate tolerances. Make this a team convention.
Use Fixtures Extensively
Fixtures reduce duplication and make tests more maintainable. Create fixtures for:
- Common meshes (1D, 2D, 3D test grids)
- Standard boundary condition setups
- Known analytical solutions
- Temporary file/directory management
Integrate CI Early
Set up GitHub Actions (or similar) before the project grows large. Automatically run tests on:
- Every push
- Every pull request
- Scheduled nightly builds (to catch environmental drift)
Measure and Track Code Coverage
Use pytest-cov to measure which parts of your code are exercised by tests. Aim for at least 80% coverage on core modules, but don’t obsess over 100%—the goal is confidence, not a perfect score.
Write Tests When You Fix Bugs
Whenever a bug is reported, write a test that reproduces it before fixing. This guarantees the bug won’t reappear later.
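A regression test of this kind is often tiny. The sketch below uses a hypothetical `stable_mean` function standing in for your fixed code; the empty-input case mirrors a typical bug report:

```python
import pytest

def stable_mean(values):
    """Hypothetical function after the fix: an empty input used to raise
    ZeroDivisionError (the reported bug); it now returns 0.0."""
    if not values:
        return 0.0
    return sum(values) / len(values)

def test_mean_of_empty_input_regression():
    # Written before the fix, reproducing the bug report; it fails on the
    # old code and pins the behavior so the bug cannot silently return
    assert stable_mean([]) == 0.0

def test_mean_ordinary_case():
    assert stable_mean([1.0, 2.0, 3.0]) == pytest.approx(2.0)
```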
Keep Tests Fast
If a test takes more than a few seconds, consider:
- Using smaller test problems
- Mocking expensive operations
- Moving it to an integration test suite that runs less frequently
Related Guides
- Tracking Long-Term Technical Debt in Research Software – Managing testing debt as projects evolve
- Managing Research Software Through Tickets – Using issue tracking to coordinate testing efforts
- Reproducibility and Its Role in Debugging – How tests enable systematic debugging
- How to Write a Clear and Useful Bug Report – Providing the information needed to create regression tests
- Feature Requests vs. Bug Reports: Knowing the Difference – Classifying issues that drive test development
Conclusion
Unit testing transforms research software from fragile scripts into reliable, reproducible instruments. While setting up comprehensive testing requires upfront investment, the payoff comes in reduced debugging time, increased confidence in results, and smoother collaboration.
The pytest framework provides powerful tools—fixtures, parametrization, mocking, and approx()—that directly address the challenges of scientific code: numerical precision, external dependencies, and unknown correct answers. Combined with continuous integration, these practices ensure that tests run consistently across environments.
Remember: testing is not about achieving perfection. It’s about building enough confidence in your code that you can trust its outputs when it matters most. Start with the core components, write tests that express clear expectations, and gradually expand coverage as the project grows.
Your future self—and anyone who inherits your code—will thank you.
Next Steps
Ready to add testing to your research project?
- Install pytest: `pip install pytest`
- Create a `tests/` directory with a simple test file
- Write a test for one core function using `pytest.approx`
- Set up a GitHub Actions workflow to run tests automatically
- Gradually expand coverage as you modify code
For personalized help implementing testing strategies in your specific research software, contact us for a consultation (visit our home page for more information).