Unit testing is non-negotiable for trustworthy scientific software. Unlike commercial applications, research code often lacks formal testing, leading to irreproducible results and wasted effort. This guide covers pytest strategies specifically for scientific Python projects: handling numerical precision with pytest.approx, isolating external dependencies with mocking, using fixtures and parametrization efficiently, and integrating tests into continuous integration pipelines. You’ll learn when to use black-box validation against published results and how to design tests that survive code evolution without becoming brittle.
Why Unit Testing Matters in Research Software
Scientific software exists in a challenging space. It must be flexible enough to explore new hypotheses, yet reliable enough that published results can be reproduced months or years later. Unlike commercial software with clear specifications, research code often evolves alongside experiments, with requirements changing as new discoveries emerge.
The consequences of inadequate testing in research are severe:
- Irreproducible results: Different researchers get different outputs from the same code
- Silent bugs: Numerical errors that seem small individually compound into significant inaccuracies
- Knowledge loss: When original developers leave, tests serve as executable documentation
- Wasted effort: Debugging becomes a detective hunt instead of a systematic process
Unit testing addresses these problems by validating small, isolated components of your code. Each test verifies that a specific function or class behaves as expected given defined inputs. When tests pass consistently across environments, you have evidence that your code produces reliable results.
But testing scientific code presents unique challenges that standard software testing approaches don’t fully address.
Unique Challenges of Testing Scientific Code
Scientific and numerical code differs from typical business applications in several ways that affect testing strategy.
Numerical Precision and Floating-Point Errors
Floating-point arithmetic is inherently imprecise. Due to how computers represent decimal numbers, 0.1 + 0.2 is not exactly 0.3 in binary floating-point representation. In scientific simulations involving thousands or millions of operations, these tiny errors accumulate.
A naive test that uses exact equality (==) will fail intermittently or on different hardware:
```python
def test_numerical_computation():
    result = complex_simulation()  # returns 1.0000000000000002
    assert result == 1.0  # FAILS! even though the difference is negligible
```
The solution is to use tolerance-based comparisons. Pytest provides pytest.approx() for this purpose:
```python
import pytest

def test_numerical_computation():
    result = complex_simulation()
    assert result == pytest.approx(1.0, rel=1e-9, abs=1e-12)
```
rel (relative tolerance) scales with the magnitude of the expected value; abs (absolute tolerance) handles cases where the expected value is at or near zero, where any relative tolerance collapses to nothing. The defaults are rel=1e-6 and abs=1e-12, but scientific code often needs tolerances tuned to the accuracy of the numerical method rather than left at the defaults.
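A minimal illustration of the two kinds of tolerance, using only pytest itself:

```python
import pytest

def test_relative_vs_absolute_tolerance():
    # rel scales with the magnitude of the expected value:
    # tolerance here is 1e-6 * 1000 = 1e-3, so a 1e-4 difference passes
    assert 1000.0001 == pytest.approx(1000.0, rel=1e-6)
    # any relative tolerance times an expected value of zero is zero,
    # so comparisons against zero need an absolute tolerance
    assert 1e-13 == pytest.approx(0.0, abs=1e-12)
```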
Unknown “Correct” Answers
In many research scenarios, you don’t have a known correct output. The simulation might be exploring uncharted territory. How do you test code when you don’t know what the answer should be?
Several strategies work:
- Inverse operations: If your code computes `B = f(A)`, also test that `A ≈ f⁻¹(B)`.
- Conservation laws: For physical simulations, verify that mass, energy, or momentum is conserved within tolerance.
- Limiting cases: Test behavior in simplified limits where analytical solutions exist.
- Regression testing: Store outputs from a trusted run and detect unexpected changes.
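The inverse-operation strategy can be sketched with a round-trip through NumPy's FFT, where the forward and inverse transforms should cancel up to floating-point error:

```python
import numpy as np
import pytest

def test_fft_round_trip():
    """Inverse-operation check: ifft(fft(x)) should recover x."""
    rng = np.random.default_rng(seed=0)
    x = rng.normal(size=64)
    # .real discards the negligible imaginary residue of the inverse transform
    recovered = np.fft.ifft(np.fft.fft(x)).real
    assert recovered == pytest.approx(x)
```

The same pattern applies to any invertible pair in your own code: interpolation and restriction, encoding and decoding, forward and adjoint operators.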
External Dependencies and Heavy Computation
Scientific code often depends on:
- Large datasets (terabytes of simulation input)
- External solvers or libraries (HPC packages, Fortran libraries)
- File I/O with complex formats
- Database connections or APIs
Running the full system in every unit test is impractical. You need isolation.
pytest Features That Solve Research Testing Problems
Pytest provides several features that are particularly valuable for scientific code.
Fixtures for Setup and Teardown
Fixtures encapsulate setup code that runs before tests. For scientific testing, fixtures can:
- Create temporary test data or meshes
- Initialize simulation objects with known parameters
- Clean up temporary files after tests
- Provide reusable test configurations
```python
import pytest

@pytest.fixture
def simple_mesh():
    """Create a small 1D mesh for testing."""
    from fipy import Grid1D
    return Grid1D(dx=0.1, nx=10)

@pytest.fixture
def diffusion_solver(simple_mesh):
    """Set up a diffusion solver on the test mesh."""
    from fipy import CellVariable, DiffusionTerm
    var = CellVariable(name="concentration", mesh=simple_mesh, value=1.0)
    eq = DiffusionTerm(coeff=1.0) == 0
    return var, eq
```
Parametrization for Multiple Scenarios
Instead of writing separate test functions for similar cases, use @pytest.mark.parametrize to run the same test logic with different inputs.
```python
import pytest
from fipy import Grid1D

@pytest.mark.parametrize("dx,nx,expected_volume", [
    (0.1, 10, 1.0),
    (0.01, 100, 1.0),
    (0.001, 1000, 1.0),
])
def test_mesh_volume(dx, nx, expected_volume):
    """Test that mesh volume matches domain size."""
    mesh = Grid1D(dx=dx, nx=nx)
    assert mesh.cellVolumes.sum() == pytest.approx(expected_volume)
```
Parametrization is especially useful for:
- Testing edge cases (zero values, very small/large numbers)
- Verifying behavior across different mesh resolutions
- Validating multiple boundary condition types
- Checking various material property values
For research code, you can parametrize against known benchmark results from published papers.
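For instance, tabulated values of a special function can serve as the parametrized benchmark. The values below are standard error-function results (standing in for numbers you would take from a published table):

```python
import math
import pytest

# Benchmark values for erf(x); in research code these would come from a
# published table or a trusted reference implementation
@pytest.mark.parametrize("x,expected", [
    (0.0, 0.0),
    (1.0, 0.8427007929497149),
    (2.0, 0.9953222650189527),
])
def test_erf_matches_tabulated_values(x, expected):
    assert math.erf(x) == pytest.approx(expected, abs=1e-12)
```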
Mocking External Dependencies
Mocking replaces real dependencies with controlled fakes. This isolates the unit under test and makes tests faster and more reliable.
When to mock in scientific code:
- External data files: Replace large datasets with minimal synthetic data that exercises the same code paths
- HPC solvers: Mock expensive Fortran libraries with pure Python implementations that return known results
- Network APIs: Stub remote services that provide parameters or configuration
- Random number generators: Seed them to produce deterministic sequences
```python
import numpy as np
from unittest.mock import patch

from mycode import run_simulation

def test_simulation_with_external_data():
    # Mock the data-loading function to return small, known data
    with patch('mycode.load_large_dataset') as mock_load:
        mock_load.return_value = np.array([1.0, 2.0, 3.0])
        result = run_simulation()
        assert result.converged
```
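Deterministic random numbers usually need no mocking at all: pass an explicitly seeded generator into the code under test. A minimal sketch (`noisy_measurement` is a hypothetical function standing in for your stochastic code):

```python
import numpy as np

def noisy_measurement(rng):
    """Hypothetical function under test that consumes randomness."""
    return 1.0 + rng.normal(scale=0.01)

def test_noisy_measurement_is_reproducible():
    # Two generators with the same seed produce identical sequences,
    # so the "random" code path becomes deterministic in tests
    a = noisy_measurement(np.random.default_rng(seed=42))
    b = noisy_measurement(np.random.default_rng(seed=42))
    assert a == b
```

Accepting the generator as a parameter, rather than using the global NumPy random state, is what makes this testable.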
Important guideline: Mock your own code’s dependencies, not third-party libraries you don’t control. Follow the principle “don’t mock what you don’t own.”
Using pytest.approx for Numerical Comparisons
Floating-point comparisons must account for rounding errors. Pytest’s approx object handles this elegantly:
```python
import pytest

def test_diffusion_solution():
    """Test that diffusion reaches expected steady state."""
    concentration = solve_diffusion(time=100.0)
    expected = 0.5  # analytical steady state for this boundary condition
    assert concentration.mean() == pytest.approx(expected, rel=1e-6)
```
You can also use approx with arrays:
```python
import numpy as np
import pytest

def test_array_computation():
    result = compute_field()
    expected = np.array([1.0, 2.0, 3.0])
    assert result == pytest.approx(expected)
```
For scientific code, choose tolerances based on:
- The numerical accuracy of your methods (e.g., second-order finite difference has truncation error O(dx²))
- The precision required by your application (engineering tolerance vs. exploratory research)
Test-Driven Development for Research Projects
Test-Driven Development (TDD) follows a simple cycle: write a failing test, then write minimal code to make it pass, then refactor. While TDD is well-established in commercial software, research projects often resist it due to perceived time constraints.
The reality: TDD saves time in research by catching bugs before they propagate through experiments. Writing tests first forces you to clarify the interface and expected behavior of each function before implementation.
TDD adapted for scientific exploration:
1. Start with a simple model or algorithm you understand analytically
2. Write tests that validate against known results (analytical solutions, limiting cases)
3. Implement the code to pass those tests
4. Extend the model incrementally, adding tests for each new capability
5. When you discover a bug, write a test that reproduces it first, then fix it
TDD works well for:
- Utility functions (mesh generation, coordinate transformations)
- Mathematical operations (matrix manipulations, special functions)
- Data processing pipelines (parsing, filtering, normalization)
- Configuration validation
TDD is less suitable for:
- Highly exploratory code where the interface itself is uncertain
- One-off scripts that won’t be reused
- Code that depends on external resources not yet available
In practice, a hybrid approach works best: write tests for stable, foundational components; use lighter integration tests for experimental sections.
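The red-green cycle can be illustrated with a small numerical utility. The test below is written first, against the analytically known integral of x² on [0, 1]; the `trapezoid` function (a sketch, not from any particular library) is then implemented to make it pass:

```python
import pytest

def test_trapezoid_matches_analytical_integral():
    # Written first: the integral of x**2 on [0, 1] is exactly 1/3.
    # The composite trapezoid rule has O(h**2) error, hence the tolerance.
    result = trapezoid(lambda x: x * x, 0.0, 1.0, n=1000)
    assert result == pytest.approx(1.0 / 3.0, rel=1e-5)

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n intervals, implemented to pass the test."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h
```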
Organizing Tests for Scientific Projects
Where should test files live? Pytest offers flexibility:
```text
my_research_project/
├── src/
│   └── mypackage/
│       ├── __init__.py
│       ├── solver.py
│       └── mesh.py
├── tests/
│   ├── __init__.py
│   ├── test_solver.py
│   ├── test_mesh.py
│   └── conftest.py        # shared fixtures
├── data/
│   └── reference_results/ # stored outputs for regression tests
├── .github/
│   └── workflows/
│       └── ci.yml         # GitHub Actions CI configuration
├── pyproject.toml
└── README.md
```
Key conventions:
- Keep tests in a separate `tests/` directory parallel to `src/` (or `lib/`)
- Name test files `test_*.py` or `*_test.py`
- Name test functions `test_*()` to allow pytest auto-discovery
- Use `conftest.py` for fixtures shared across multiple test files
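A conftest.py might look like the sketch below; fixtures defined here are discovered automatically by every test file in the directory tree, with no import needed (the fixture names are illustrative):

```python
# tests/conftest.py
import numpy as np
import pytest

@pytest.fixture
def rng():
    """Deterministic random generator shared across test modules."""
    return np.random.default_rng(seed=0)

@pytest.fixture
def small_field(rng):
    """Tiny synthetic 2D field, a stand-in for large real datasets."""
    return rng.normal(size=(4, 4))
```

Any test function that names `rng` or `small_field` as a parameter receives the fixture's return value.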
For FiPy-based projects, structure tests to match the module hierarchy:
```text
fipy_project/
├── fipy/
│   ├── meshes/
│   │   └── grid1d.py
│   └── terms/
│       └── diffusion.py
├── tests/
│   ├── meshes/
│   │   └── test_grid1d.py
│   └── terms/
│       └── test_diffusion.py
```
Integration with Continuous Integration
Unit tests only provide value if they run consistently. Continuous Integration (CI) automates test execution whenever code changes.
GitHub Actions provides a straightforward CI setup for Python projects:
```yaml
# .github/workflows/ci.yml
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[test]
      - name: Run tests with pytest
        run: |
          pytest --cov=src --cov-report=xml --cov-report=html
      - name: Upload coverage
        uses: codecov/codecov-action@v3
```
CI ensures:
- Tests pass on multiple platforms (extend the matrix with macOS and Windows runners)
- Tests pass on multiple Python versions
- Code coverage is tracked over time
- Pull requests are validated before merging
For research software, consider adding:
- Tests that run with different versions of key dependencies (NumPy, SciPy, FiPy)
- Performance regression checks (ensure algorithms don’t slow down)
- Documentation builds to verify examples still work
Common Mistakes and How to Avoid Them
Based on research and industry best practices, here are the most common unit testing mistakes in scientific code:
1. Testing Implementation Details Instead of Behavior
Testing internal implementation makes tests brittle. When you refactor code, tests should still pass if the external behavior is correct.
```python
# ❌ Bad: tests internal state
def test_algorithm_updates_counter():
    obj = MyAlgorithm()
    obj.step()
    assert obj.counter == 1  # fragile if the counter implementation changes

# ✅ Better: tests observable outcome
def test_algorithm_produces_correct_result():
    obj = MyAlgorithm()
    result = obj.run()
    assert result == expected
```
2. Ignoring Numerical Tolerance
Exact comparisons on floating-point results cause flaky tests that fail randomly or on different hardware. Always use pytest.approx() or similar tolerance-based assertions for numerical outputs.
3. Writing Slow Tests
Unit tests should run quickly (milliseconds, not seconds). If a test is slow:
- It won’t be run frequently enough
- Developers will skip running the full test suite
- CI becomes expensive and slow
Solutions:
- Use small synthetic datasets instead of large real ones
- Mock expensive external computations
- Separate slow integration tests from fast unit tests
- Use parametrization wisely: don’t run thousands of variations in every test pass
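Separating slow tests from fast ones is often just a custom marker. A sketch (the `slow` marker name is a convention, not built in; register it in pyproject.toml under `[tool.pytest.ini_options]` to avoid warnings):

```python
import pytest

@pytest.mark.slow
def test_full_resolution_convergence():
    # stand-in for a minutes-long, high-resolution computation
    assert sum(i * i for i in range(100_000)) > 0

def test_cheap_sanity_check():
    assert 2 + 2 == 4
```

Then `pytest -m "not slow"` runs the fast suite during development, while CI (or a nightly job) runs everything.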
4. Not Isolating Tests
Tests should not depend on each other or on global state. Each test should:
- Create its own test data (use fixtures)
- Clean up after itself
- Not rely on execution order
```python
# ❌ Bad: shared mutable state
results = []

def test_first():
    results.append(1)

def test_second():
    assert results == [1]  # fails if tests run in the wrong order

# ✅ Better: independent tests
def test_first():
    result = compute_something()
    assert result == 1

def test_second():
    result = compute_something_else()
    assert result == 2
```
5. Skipping Tests Without Good Reason
@pytest.mark.skip should be used sparingly. If a test is skipped because the environment lacks something, use pytest.importorskip() at module level or make the dependency optional in CI configuration.
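The importorskip pattern is one line at the top of a test module. The sketch below uses numpy as the stand-in optional dependency; in FiPy-based projects it would typically be `pytest.importorskip("fipy")`:

```python
import pytest

# At module import time: returns the module if it is installed,
# otherwise skips every test in this file instead of erroring
np = pytest.importorskip("numpy")

def test_uses_optional_dependency():
    assert np.isclose(np.pi, 3.14159, atol=1e-4)
```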
6. Writing Vague Assertions
Tests should clearly express what is being verified and why.
```python
# ❌ Unclear: what's being tested?
assert result != None

# ✅ Clear: specific expectation with context
assert result.converged is True, "Solver should converge for well-posed problem"
```
7. Hardcoding Paths and Environment Assumptions
Use temporary directories and fixtures rather than fixed paths. Pytest’s tmp_path fixture provides a fresh temporary directory for each test.
Practical Example: Testing a FiPy Diffusion Solver
Let’s put these strategies together with a concrete example relevant to MatForge’s audience.
```python
# tests/test_diffusion.py
import pytest
from fipy import Grid1D, CellVariable, DiffusionTerm

@pytest.fixture
def simple_1d_grid():
    """Create a uniform 1D grid for diffusion testing."""
    return Grid1D(dx=0.1, nx=50)

@pytest.fixture
def steady_state_diffusion(simple_1d_grid):
    """Set up diffusion with Dirichlet boundaries at both ends."""
    mesh = simple_1d_grid
    var = CellVariable(name="concentration", mesh=mesh, value=0.0)
    var.constrain(1.0, mesh.facesLeft)
    var.constrain(0.0, mesh.facesRight)
    eq = DiffusionTerm(coeff=1.0) == 0
    return var, eq

def test_mesh_volume(simple_1d_grid):
    """Total domain length should equal nx * dx."""
    expected_length = 50 * 0.1
    assert simple_1d_grid.cellVolumes.sum() == pytest.approx(expected_length)

def test_diffusion_conservation(steady_state_diffusion):
    """Steady diffusion with no sources: flux through every face is equal."""
    var, eq = steady_state_diffusion
    eq.solve(var=var)
    mesh = var.mesh
    # Flux is -D * dC/dx with D = 1; with no sources, the flux entering
    # any interior face must equal the flux leaving it
    grads = var.faceGrad.value[0][~mesh.exteriorFaces.value]
    assert grads.mean() < 0  # concentration falls from left (1) to right (0)
    assert grads.max() == pytest.approx(grads.min(), rel=1e-4)

def test_diffusion_solution_shape(steady_state_diffusion):
    """Steady-state profile should be linear from 1 at the left to 0 at the right."""
    var, eq = steady_state_diffusion
    eq.solve(var=var)
    x = var.mesh.cellCenters.value[0]
    expected = 1.0 - x / (50 * 0.1)  # linear from 1 to 0 over the domain
    assert var.value == pytest.approx(expected, rel=1e-5)

@pytest.mark.parametrize("dx,nx", [(0.1, 50), (0.05, 100), (0.02, 250)])
def test_mesh_independence(dx, nx):
    """Solution at the midpoint should stay near 0.5 as the mesh refines."""
    mesh = Grid1D(dx=dx, nx=nx)
    var = CellVariable(name="c", mesh=mesh, value=0.0)
    var.constrain(1.0, mesh.facesLeft)
    var.constrain(0.0, mesh.facesRight)
    eq = DiffusionTerm(coeff=1.0) == 0
    eq.solve(var=var)
    mid_idx = nx // 2
    assert var.value[mid_idx] == pytest.approx(0.5, rel=0.1)
```
This example demonstrates:
- Fixtures for reusable test setup
- Parametrization to test multiple resolutions
- Tolerance-based numerical assertions
- Testing physical principles (conservation, linearity)
- Clear, descriptive test names and assertions
When Unit Testing Isn’t Enough
Unit tests validate individual components, but research software also needs:
- Integration tests: Verify that multiple modules work together correctly
- System tests: Run complete simulations end-to-end and compare to known outputs
- Performance tests: Ensure algorithms meet computational complexity expectations
- Visualization checks: Spot obvious rendering errors (automated image comparison where feasible)
A complete testing strategy for research projects includes multiple test levels, with unit tests forming the foundation.
What We Recommend: A Pragmatic Testing Strategy for Research Projects
Based on the evidence from scientific software best practices, here’s our recommended approach:
Start with Foundational Unit Tests
Begin by writing tests for:
- Core mathematical functions (special functions, coordinate transformations)
- Mesh generation and manipulation utilities
- Boundary condition implementations
- Data input/output routines (validation, formatting)
These components are stable, have clear expected behaviors, and are reused across many simulations.
Adopt pytest.approx as Standard
Never use == for floating-point results. Always use pytest.approx() with appropriate tolerances. Make this a team convention.
Use Fixtures Extensively
Fixtures reduce duplication and make tests more maintainable. Create fixtures for:
- Common meshes (1D, 2D, 3D test grids)
- Standard boundary condition setups
- Known analytical solutions
- Temporary file/directory management
Integrate CI Early
Set up GitHub Actions (or similar) before the project grows large. Automatically run tests on:
- Every push
- Every pull request
- Scheduled nightly builds (to catch environmental drift)
Measure and Track Code Coverage
Use pytest-cov to measure which parts of your code are exercised by tests. Aim for at least 80% coverage on core modules, but don’t obsess over 100%—the goal is confidence, not a perfect score.
Write Tests When You Fix Bugs
Whenever a bug is reported, write a test that reproduces it before fixing. This guarantees the bug won’t reappear later.
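A regression test of this kind is often tiny. The sketch below uses a hypothetical `stable_mean` function standing in for your fixed code; the empty-input case mirrors a typical bug report:

```python
import pytest

def stable_mean(values):
    """Hypothetical function after the fix: an empty input used to raise
    ZeroDivisionError (the reported bug); it now returns 0.0."""
    if not values:
        return 0.0
    return sum(values) / len(values)

def test_mean_of_empty_input_regression():
    # Written before the fix, reproducing the bug report; it fails on the
    # old code and pins the behavior so the bug cannot silently return
    assert stable_mean([]) == 0.0

def test_mean_ordinary_case():
    assert stable_mean([1.0, 2.0, 3.0]) == pytest.approx(2.0)
```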
Keep Tests Fast
If a test takes more than a few seconds, consider:
- Using smaller test problems
- Mocking expensive operations
- Moving it to an integration test suite that runs less frequently
Related Guides
- Tracking Long-Term Technical Debt in Research Software – Managing testing debt as projects evolve
- Managing Research Software Through Tickets – Using issue tracking to coordinate testing efforts
- Reproducibility and Its Role in Debugging – How tests enable systematic debugging
- How to Write a Clear and Useful Bug Report – Providing the information needed to create regression tests
- Feature Requests vs. Bug Reports: Knowing the Difference – Classifying issues that drive test development
Conclusion
Unit testing transforms research software from fragile scripts into reliable, reproducible instruments. While setting up comprehensive testing requires upfront investment, the payoff comes in reduced debugging time, increased confidence in results, and smoother collaboration.
The pytest framework provides powerful tools—fixtures, parametrization, mocking, and approx()—that directly address the challenges of scientific code: numerical precision, external dependencies, and unknown correct answers. Combined with continuous integration, these practices ensure that tests run consistently across environments.
Remember: testing is not about achieving perfection. It’s about building enough confidence in your code that you can trust its outputs when it matters most. Start with the core components, write tests that express clear expectations, and gradually expand coverage as the project grows.
Your future self—and anyone who inherits your code—will thank you.
Next Steps
Ready to add testing to your research project?
- Install pytest: `pip install pytest`
- Create a `tests/` directory with a simple test file
- Write a test for one core function using `pytest.approx`
- Set up a GitHub Actions workflow to run tests automatically
- Gradually expand coverage as you modify code
For personalized help implementing testing strategies in your specific research software, contact us for a consultation (visit our home page for more information).