Your simulation outputs are not just files. They are research assets. When a multi-node run finishes and generates hundreds of gigabytes across thousands of time steps, you are not only producing data. You are creating scientific evidence that needs to survive hardware upgrades, personnel turnover, and institutional storage migrations.

This guide explains how to structure dataset management so your work remains accessible, reproducible, and meaningful years after the simulation campaign ends.

Key Takeaways

  • Tiered storage separates active computation data from archival storage, optimizing both performance and cost.
  • FAIR principles — Findable, Accessible, Interoperable, and Reusable — provide a practical framework for long-term scientific data stewardship.
  • Data Management Plans should be written before the first simulation runs, not after the archive is already full.
  • Metadata and provenance are not optional extras. They determine whether data is reusable or orphaned.

The Problem You Are Actually Facing

The main problem is not storage capacity. It is organizational collapse.

A typical computational materials science or fluid dynamics project generates data in several formats across several stages:

  • Raw simulation output, including checkpoint files, diagnostic dumps, and field variables.
  • Intermediate results, including post-processed quantities, derived values, and analysis-ready data.
  • Final published data, including figures, tables, and selected supplementary datasets.
  • Reproducible artifacts, including input files, scripts, environment definitions, and software versions.

What happens next determines whether the project becomes a durable reference point or disappears. Many research groups manage this poorly for several reasons:

  1. Data is stored on compute nodes. When a cluster is upgraded or decommissioned, data can disappear.
  2. Metadata is absent. Nobody knows what a dataset represents without asking the original researcher.
  3. There is no preservation plan. The assumption that the team will back it up later often fails.
  4. Formats become obsolete. Proprietary binary formats may depend on one software package or version.

This is the core challenge. A dataset management strategy must address scale, organization, accessibility, and longevity at the same time.

Tiered Storage Architecture: The Foundation

Tiered storage is the backbone of scientific data management. Instead of storing everything on the same medium, you separate data across tiers optimized for different access patterns and cost profiles.

Tier 0: Active Computation Storage

Medium: NVMe SSDs and parallel file systems such as Lustre, IBM Storage Scale (formerly Spectrum Scale/GPFS), or BeeGFS.

Purpose: This is where data lives during active computation. It requires:

  • High bandwidth for distributed I/O.
  • Low latency for checkpoint restart and real-time analysis.
  • Automatic failover to reduce the risk of data loss during node failures.

Data belongs here when files are being actively written, read, or modified by running simulations and post-processing jobs.

A common pitfall is treating this tier as permanent storage. Tier 0 systems are designed for throughput, not long-term durability. When the cluster is replaced, data left on Tier 0 may be lost.

Tier 1: Primary Research Storage

Medium: High-capacity HDD arrays and distributed object storage.

Purpose: This is the working archive. Data is accessed regularly during the active project lifecycle, but it does not require sub-millisecond latency.

Data belongs here when it includes completed simulation runs, intermediate analysis results, and datasets actively used by collaborators.

Common access patterns include full-file reads, metadata queries, and selective subsetting.

Best practices include:

  • Automated migration from Tier 0 once write operations cease.
  • Redundant copies following the 3-2-1 rule: three copies, on two media types, with one copy offsite.
  • Integrity checks through checksums and error detection.
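As a minimal sketch of the integrity-check practice above, the following Python records a SHA-256 checksum for every file in a dataset directory and later reports files that are missing or have changed. The manifest filename and the function names are illustrative assumptions, not part of any particular storage system.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large outputs never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest_name: str = "MANIFEST.sha256.json") -> Path:
    """Record a checksum for every file under data_dir (the manifest itself is excluded)."""
    manifest_path = data_dir / manifest_name
    entries = {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file() and p != manifest_path
    }
    manifest_path.write_text(json.dumps(entries, indent=2))
    return manifest_path


def verify_manifest(data_dir: Path, manifest_name: str = "MANIFEST.sha256.json") -> list:
    """Return the relative paths that are missing or whose checksum has changed."""
    expected = json.loads((data_dir / manifest_name).read_text())
    problems = []
    for rel_path, checksum in expected.items():
        target = data_dir / rel_path
        if not target.is_file() or sha256_of(target) != checksum:
            problems.append(rel_path)
    return problems
```

Writing the manifest once when a run is migrated, then calling `verify_manifest` on each copy at regular intervals, is one way to detect silent corruption before the last good copy is gone.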

Tier 2: Cold Storage and Archive

Medium: Object storage such as S3 or Azure Blob, tape libraries, and deep archive systems.

Purpose: This tier supports long-term preservation. Data is rarely accessed but must remain intact for years or decades.

Data belongs here when it includes completed projects, publication supplements, repository deposits, and datasets that are several years old and rarely accessed.

The primary preservation tier should be immutable when possible. Write-once-read-many storage or read-only repository deposits help prevent accidental deletion or modification.
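Where true write-once-read-many storage is not available, a weaker safeguard is to clear the write permission bits on archived files. The sketch below assumes a POSIX-style file system and a hypothetical helper name. It guards against accidental edits only: the file owner or an administrator can still restore write access.

```python
import os
import stat
from pathlib import Path


def make_read_only(root: Path) -> int:
    """Clear all write bits on every file under root; return the number of files changed."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    count = 0
    for path in root.rglob("*"):
        if path.is_file():
            current_mode = stat.S_IMODE(path.stat().st_mode)
            os.chmod(path, current_mode & ~write_bits)
            count += 1
    return count
```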

The Data Lifecycle: From Creation to Preservation

Understanding how data moves through its lifecycle helps you choose the right strategy at each phase.

Phase 1: Plan Before You Run

This is where many researchers fail. You need a Data Management Plan before the first simulation executes.

A Data Management Plan answers several practical questions:

  • What data will be generated, and how much?
  • Which formats will store the data?
  • Which datasets are essential, and which are disposable?
  • Where will data be stored during active research?
  • Who has access, and when does access end?
  • How will metadata document the data?
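The "how much" question can usually be answered with simple arithmetic before any job is submitted. The sketch below is one way to do that, assuming uncompressed output where every saved time step stores the same number of values. All parameter names and the example numbers are hypothetical.

```python
def estimate_campaign_bytes(cells_per_step: int, fields: int, bytes_per_value: int,
                            steps_saved: int, runs: int) -> int:
    """Uncompressed output volume: every saved step stores every field on every cell."""
    return cells_per_step * fields * bytes_per_value * steps_saved * runs


def human_readable(n_bytes: float) -> str:
    """Format a byte count with decimal (power-of-1000) units."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Hypothetical campaign: 500x500 grid, 4 double-precision fields,
# 2000 saved time steps per run, 30 runs in the parameter sweep.
total = estimate_campaign_bytes(500 * 500, 4, 8, 2000, 30)
print(human_readable(total))
```

For these example numbers the estimate is 480 GB, which is already enough to rule out keeping every step of every run on the fastest tier.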

Recommended tools include DMPTool, RDMO, or your institution’s research data management template. Many funders now require Data Management Plans for grant applications.

Phase 2: Active Creation and Validation

During simulation execution, data management focuses on several priorities:

  • Real-time integrity. Validate outputs immediately after checkpoint completion.
  • Selective archiving. Do not archive everything. Archive what is publication-relevant or scientifically valuable.
  • Automated provenance. Log software versions, parameter files, and execution environments alongside the data.
  • Naming conventions. Use consistent, machine-parseable filenames that encode structure and enable automation.

Phase 3: Active Analysis and Sharing

Once simulations finish and data is migrated to Tier 1, the focus shifts to analysis and collaboration.

  • Use subsetting strategies. Store data so you can read individual time steps or spatial regions without loading entire files. HDF5 and NetCDF support this through hyperslab selections.
  • Make compression decisions carefully. Lossless compression can reduce archival storage costs, but decompression uses CPU time.
  • Support collaborative access. Use read-only mounts or curated data catalogs instead of duplicating data across many directories.
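The compression trade-off mentioned above can be measured rather than guessed. The following sketch uses zlib from the Python standard library as a stand-in for whatever lossless codec your format offers. It reports the size ratio alongside the time spent compressing and decompressing a sample of your own data. The function name is an assumption.

```python
import time
import zlib


def measure_compression(payload: bytes, level: int = 6) -> dict:
    """Return the compression ratio and the time spent compressing and decompressing."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    if restored != payload:
        raise ValueError("round trip changed the data")
    return {
        "ratio": len(payload) / len(compressed),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }
```

Smooth or sparse fields tend to compress well, while noisy floating-point output often barely shrinks. Running this on a representative file shows whether the CPU cost buys a meaningful reduction for your data.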

Phase 4: Preservation and Sharing

After project completion, the goal is to preserve data in a form that other researchers can understand and reuse.

  • Deposit curated datasets in domain-specific repositories, institutional repositories, or services such as Zenodo.
  • Obtain persistent identifiers such as DOIs so data can be cited and traced.
  • Add a machine-readable license such as Creative Commons or Open Data Commons.
  • Include a documentation package with README files, data dictionaries, and method descriptions.

FAIR Principles: The Framework

The FAIR principles are not marketing slogans. They are an operational standard for scientific data management and are supported by major funders and institutions.

Findable

Data is findable when:

  1. It has a globally unique persistent identifier (PID), such as a DOI or Handle.
  2. It is described with rich metadata, not only filenames.
  3. Metadata is indexed in searchable catalogs or repositories.

A practical implementation is to register datasets in Zenodo or a domain-specific repository. Use controlled vocabularies for physical quantities instead of inconsistent labels such as “temperature,” “temp,” and “T_field.”
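One lightweight way to enforce a controlled vocabulary is to map every label seen in the wild to a single canonical name and to reject anything unknown. The mapping below is a hypothetical illustration built from the labels above. In practice the canonical names would come from a community standard used in your discipline.

```python
# Hypothetical vocabulary: every known variant maps to one canonical quantity name.
CANONICAL_NAMES = {
    "temperature": "temperature",
    "temp": "temperature",
    "t_field": "temperature",
}


def normalize_quantity(label: str) -> str:
    """Map a free-form label to its canonical name; fail loudly on unknown labels."""
    key = label.strip().lower()
    if key not in CANONICAL_NAMES:
        raise KeyError(f"unknown quantity label {label!r}; add it to the vocabulary first")
    return CANONICAL_NAMES[key]
```

Failing on unknown labels, rather than passing them through, keeps new inconsistencies from entering the metadata unnoticed.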

Accessible

Data is accessible when:

  1. It can be retrieved through standard protocols such as HTTP, FTP, or an S3 API.
  2. Authentication and authorization rules are clearly defined.
  3. Metadata remains accessible even when the raw data is archived.

Interoperable

Interoperability requires:

  1. Open, standardized formats such as HDF5, NetCDF, CSV, or JSON.
  2. Community vocabularies recognized by your discipline.
  3. Qualified references that link data to publications, software, and related datasets.

Reusable

Reusability depends on:

  1. Clear provenance documentation from raw data to published figures.
  2. Explicit licensing that explains what others can do with the data.
  3. Compliance with community standards and domain-specific metadata schemas.

For simulation data, domain-specific repositories can be especially useful. For example, NOMAD provides FAIR-compliant infrastructure for computational materials science. If your work involves molecular dynamics, phase-field modeling, or a related field, check whether a domain repository exists.

Metadata and Provenance: What Actually Matters

Metadata is one of the most common failure points in scientific data management. A dataset without metadata is not reusable data. It is a mystery.

Essential Metadata Fields

| Metadata Field | What It Should Capture | Why It Matters |
| --- | --- | --- |
| Project name | Research project, grant, or campaign identifier | Connects the dataset to its scientific context |
| Simulation purpose | Question, hypothesis, or benchmark being tested | Explains why the dataset exists |
| Input parameters | Configuration files, parameter ranges, random seeds, boundary conditions | Supports reproducibility and reruns |
| Software environment | Solver version, dependencies, compiler, container image, operating system | Prevents environment ambiguity |
| Data format | File type, schema, units, coordinate system, compression method | Helps other tools read and interpret the data |
| Provenance | Command line, workflow step, script version, Git commit | Traces how outputs were generated |
| Ownership and access | Creator, lab, institution, access rights, embargo status | Clarifies responsibility and reuse rules |
| License | Reuse terms such as CC BY, CC0, or another license | Allows others to reuse the data legally |
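The fields in the table above can be checked automatically before a dataset is migrated or deposited. The sketch below assumes the metadata is held in a Python dictionary. The key names are hypothetical spellings of the table's field names, not a published schema.

```python
# Hypothetical key names, one per row of the essential metadata table.
REQUIRED_FIELDS = (
    "project_name",
    "simulation_purpose",
    "input_parameters",
    "software_environment",
    "data_format",
    "provenance",
    "ownership_and_access",
    "license",
)


def missing_metadata_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty, in table order."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or value == "" or value == [] or value == {}:
            missing.append(field)
    return missing
```

A migration script can refuse to archive any dataset for which this function returns a non-empty list.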

Provenance Tracking Strategies

A lightweight approach is to store a JSON metadata file alongside each dataset. It should contain run parameters, software versions, and a reference to the input file.

A more robust approach is to run the campaign through a workflow management system such as HTCondor DAGMan, which logs every job it submits, and to describe the resulting chain of inputs, computation steps, and software invocations in a standard provenance model such as W3C PROV.

The best practice is simple: record the exact command line used to run the simulation. This is often the single most important provenance artifact.
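The lightweight approach can be sketched in a few lines. The function below writes a JSON sidecar next to a dataset and records the exact command line, the input file, the run parameters, and basic environment details. The sidecar suffix, the field names, and the solver command in the usage note are assumptions for illustration. The Git lookup returns nothing if Git or a repository is not available.

```python
import json
import platform
import shlex
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def current_git_commit(repo_dir: Path) -> Optional[str]:
    """Return the checked-out commit hash, or None if git or the repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def write_sidecar(dataset: Path, input_file: Path, parameters: dict, command: list) -> Path:
    """Write <dataset name>.meta.json next to the dataset and return its path."""
    sidecar = dataset.with_name(dataset.name + ".meta.json")
    record = {
        "dataset": dataset.name,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "command_line": shlex.join(command),  # exact, shell-quoted invocation
        "input_file": str(input_file),
        "parameters": parameters,
        "software": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "git_commit": current_git_commit(Path.cwd()),
    }
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

A call such as `write_sidecar(Path("run_001.h5"), Path("input.yaml"), {"seed": 42}, ["solver", "--input", "input.yaml"])` would produce `run_001.h5.meta.json`. For a Python driver script, passing `sys.argv` as the command captures the invocation directly.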

Storage Strategies That Actually Work

Strategy 1: Input-First Preservation

Instead of archiving terabytes of raw output, preserve the data-generating algorithm and its inputs. If you can reproduce the simulation, the outputs are derivable.

Use this strategy for parameter sweeps, validation studies, and algorithm development where the scientific contribution is the methodology rather than every individual output file.

To implement it, archive input files, configuration scripts, and a documented reproduction procedure. Store outputs selectively. Keep representative cases and move the rest to lower-cost tiers if needed.

Strategy 2: Hierarchical Organization by Scientific Question

Structure storage to match research logic, not only the computational workflow.

project_name/
├── 00_methods/
│   ├── grid_setup/
│   ├── boundary_conditions/
│   └── solver_configuration/
├── 01_reference_solutions/
│   ├── analytical/
│   └── benchmark/
├── 02_parameter_sweeps/
│   ├── sweep_1_conductivity/
│   ├── sweep_2_temperature/
│   └── sweep_3_pressure/
├── 03_published_results/
│   ├── figures/
│   ├── supplementary/
│   └── manuscript_data/
└── metadata/
    ├── README.md
    ├── data_dictionary.csv
    └── run_log.csv

This organization helps you find datasets without searching and helps new team members understand the project structure quickly.
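To keep the layout identical across projects, the skeleton can be created by a script instead of by hand. The sketch below reproduces the fixed parts of the tree above and leaves the sweep directories to be added per project. The function and constant names are assumptions.

```python
from pathlib import Path

# Top-level directories and their fixed subdirectories, taken from the tree above.
LAYOUT = {
    "00_methods": ["grid_setup", "boundary_conditions", "solver_configuration"],
    "01_reference_solutions": ["analytical", "benchmark"],
    "02_parameter_sweeps": [],  # sweep directories are project-specific
    "03_published_results": ["figures", "supplementary", "manuscript_data"],
    "metadata": [],
}
METADATA_FILES = ["README.md", "data_dictionary.csv", "run_log.csv"]


def create_project_skeleton(root: Path) -> None:
    """Create the directory tree and empty metadata files; safe to run repeatedly."""
    for top_level, children in LAYOUT.items():
        (root / top_level).mkdir(parents=True, exist_ok=True)
        for child in children:
            (root / top_level / child).mkdir(exist_ok=True)
    for name in METADATA_FILES:
        (root / "metadata" / name).touch(exist_ok=True)
```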

Strategy 3: Automated Lifecycle Policies

Manual file transfers are unreliable. Automated Hierarchical Storage Management frameworks can help move data between tiers safely.

  • Time-based policies migrate files based on age and access frequency.
  • Workflow-aware tiering links storage placement to scientific pipeline stages.
  • Automated integrity checks use checksums to detect data corruption over time.
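Where no Hierarchical Storage Management system is available, an age-based policy can be approximated with a scheduled script. The sketch below moves files that have not been modified for a given number of days from one directory tree to another and verifies each copy by checksum before deleting the source. It uses modification time as a proxy for "writes have ceased", because access times are often not recorded reliably. The function names and directory arguments are assumptions.

```python
import hashlib
import shutil
import time
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_stale_files(tier0: Path, max_age_days: float, now: Optional[float] = None) -> list:
    """List files under tier0 whose last modification is older than max_age_days."""
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    return sorted(p for p in tier0.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff)


def migrate_stale_files(tier0: Path, tier1: Path, max_age_days: float) -> list:
    """Copy stale files to tier1, verify each copy, then remove the source."""
    moved = []
    for source in find_stale_files(tier0, max_age_days):
        target = tier1 / source.relative_to(tier0)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(source) != sha256_of(target):
            target.unlink()
            raise IOError(f"checksum mismatch while migrating {source}")
        source.unlink()  # delete only after the copy has been verified
        moved.append(target)
    return moved
```

Deleting the source only after verification means an interrupted or corrupted transfer leaves the original in place.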

At HPC centers, systems such as JUST at FZ Jülich or LRZ DSS provide automated tiering. If you manage data on institutional clusters, check whether your center offers Hierarchical Storage Management tools.

Common Mistakes and How to Avoid Them

Mistake 1: Storing Data on Compute Nodes

This happens because it is convenient. Data is generated where the simulation runs.

It fails because compute node storage is often ephemeral. When nodes are replaced, reformatted, or decommissioned, data can be lost. Never treat compute storage as archival.

Fix this by writing output directly to a designated storage tier during simulation. Use parallel I/O if running on distributed systems.

Mistake 2: Over-Compressing During Active Use

This happens because teams want to save space.

It fails because decompression overhead can slow analysis. For data that is actively accessed, compression may increase total workflow time.

Fix this by applying compression mainly during archival transfer. Keep active copies uncompressed when performance matters.

Mistake 3: No Naming Convention

This happens because everyone agrees naming is important, but nobody enforces a convention until someone urgently needs to find a file.

It fails because inconsistent naming makes automation fragile. Scripts that rename, search, or migrate files become difficult to maintain.

Fix this by adopting a convention such as {project}_{quantity}_{resolution}_{time_step}.{extension}. For example: bte_thermal_field_500x500_t0123.h5.
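A convention is only machine-parseable if a script can both build and parse the names. The sketch below implements the pattern above under a few stated assumptions: names are lowercase, the project field contains no underscores, the resolution is two integers joined by `x`, and the time step is a `t` followed by digits. Because the quantity field may itself contain underscores, the pattern anchors on the resolution and time-step fields instead of splitting on underscores.

```python
import re

# Assumed grammar: {project}_{quantity}_{nx}x{ny}_t{step}.{extension}
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<quantity>[a-z0-9_]+)_"
    r"(?P<resolution>\d+x\d+)_"
    r"t(?P<time_step>\d+)\."
    r"(?P<extension>[a-z0-9]+)$"
)


def build_name(project: str, quantity: str, nx: int, ny: int, step: int, extension: str) -> str:
    """Build a filename with a zero-padded, four-digit time step."""
    return f"{project}_{quantity}_{nx}x{ny}_t{step:04d}.{extension}"


def parse_name(filename: str) -> dict:
    """Split a conforming filename into its fields; raise ValueError otherwise."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {filename!r}")
    fields = match.groupdict()
    fields["time_step"] = int(fields["time_step"])
    return fields
```

Running `parse_name` over a directory before migration is a cheap way to find files that drifted from the convention.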

Mistake 4: Assuming Backup Equals Preservation

This happens because backups are familiar.

It fails because backups protect against accidental deletion, but they do not solve format obsolescence, incomplete documentation, or broken access dependencies.

Fix this by combining backups with repository deposition, metadata documentation, and format standardization.

Decision Framework: What to Store and When

Not all data deserves equal storage resources. Use this framework to decide what to keep, where to keep it, and for how long.

| Data Type | Recommended Storage | Retention Rule | Reason |
| --- | --- | --- | --- |
| Input files and configuration scripts | Version-controlled repository and archive | Keep permanently | These allow simulations to be reproduced |
| Software environment files | Repository, container registry, or archive | Keep permanently | They document how the workflow was executed |
| Raw checkpoint files | Tier 0 during execution, then Tier 1 or Tier 2 selectively | Keep only critical restart points | They are large and often not all publication-relevant |
| Intermediate derived results | Tier 1 during active analysis | Keep while the project is active | They support analysis but can often be regenerated |
| Published figures and tables | Repository deposit and publication archive | Keep permanently | They support the published record |
| Curated datasets for reuse | Domain repository, Zenodo, or institutional archive | Keep permanently | They are the reusable scientific output |
| Temporary debug output | Local or Tier 0 scratch storage | Delete after validation | It has low long-term scientific value |

A Practical Checklist

Before your next simulation campaign, review this checklist:

  • [ ] Write a Data Management Plan that estimates volume, selects formats, and defines retention rules.
  • [ ] Choose storage tiers and map data types to storage media and migration policies.
  • [ ] Implement consistent, machine-parseable naming conventions.
  • [ ] Set up automated provenance logging for software versions, parameters, and timestamps.
  • [ ] Configure metadata documentation with README files, data dictionaries, and controlled vocabularies.
  • [ ] Verify a backup strategy with 3 copies, 2 media types, and 1 offsite copy.
  • [ ] Plan repository deposition in a domain-specific, institutional, or general-purpose repository.
  • [ ] Test recovery by confirming that archived data can be accessed on a different system.

Recommendations: What We Would Do Differently

Most researchers treat data management as an afterthought. These changes would prevent many long-term problems:

  1. Start with a Data Management Plan. Even a one-page document with storage tiers, retention rules, and metadata standards can prevent organizational collapse.
  2. Store inputs, not only outputs. The algorithm and parameters are often more durable than terabytes of simulation results.
  3. Use open formats. HDF5 or NetCDF is safer than proprietary binaries for long-term reuse.
  4. Document everything. If you cannot explain what a dataset represents in one paragraph, it is not truly usable.
  5. Deposit in repositories. Use Zenodo, domain-specific archives, or institutional repositories with persistent identifiers.

The difference between poorly managed and well-managed datasets is not storage cost. It is reproducibility. Good dataset management turns outputs from temporary files into permanent research assets.

Conclusion

Managing large-scale scientific datasets requires a deliberate strategy that spans hardware, metadata, access patterns, and preservation.

The framework is straightforward:

  • Tiered storage optimizes performance and cost.
  • FAIR principles support long-term utility.
  • Data Management Plans prevent organizational chaos.
  • Metadata and provenance determine whether data survives as reusable science.

The alternative is storing data wherever it is convenient and hoping it survives. That often leads to orphaned datasets that contribute little to the field. With a proper strategy, simulation outputs can become citable, reusable research infrastructure.

That is the difference between temporary files and permanent science.

Further Reading