HPC Modernization

We make legacy Fortran and C++ scientific codes run two orders of magnitude faster on modern GPUs — without rewriting the science.

Decades of scientific computing live in Fortran. Weather models, combustion codes, neutron-transport solvers, climate ensembles, CFD pipelines, molecular-dynamics frameworks — billions of lines, all written for an era of single-threaded CPUs and modest cluster sizes.

Modern NVIDIA hardware (Hopper, Blackwell, Grace Hopper) can deliver 100× or more over the original CPU implementation on the same workloads. But the path from a working Fortran code to a performant CUDA / OpenACC / Kokkos port is full of traps: allocation overhead that dominates wall-clock time, false sharing in OpenMP regions, memory-bound kernels that leave the GPU's compute units idle, MPI communication patterns that serialize at scale.
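
To make one of those traps concrete, here is a minimal, hypothetical C++/OpenMP sketch of false sharing; the function and variable names are illustrative and do not come from any customer code.

    // Hypothetical illustration of false sharing: per-thread partial sums packed
    // into one contiguous array, so neighboring threads contend for the same cache line.
    #include <omp.h>
    #include <vector>

    double sum_forces(const std::vector<double>& f) {
        std::vector<double> partial(omp_get_max_threads(), 0.0);
        #pragma omp parallel
        {
            const int t = omp_get_thread_num();
            #pragma omp for
            for (long i = 0; i < (long)f.size(); ++i)
                partial[t] += f[i];   // each write invalidates neighboring threads' cache lines
        }
        double total = 0.0;
        for (double p : partial) total += p;
        return total;
    }
    // The usual fix is an OpenMP reduction clause (or padding each slot to a cache line):
    //     #pragma omp parallel for reduction(+ : total)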

Prominent Systems specializes in that path. We profile against ground-truth baselines in Nsight Compute, hand-roll SIMD intrinsics where the compiler won't vectorize, restructure kernels for fused two-pass execution to eliminate dynamic allocation, distribute via MPI with GPU rank-to-device pinning, and document every transformation with reproducible benchmarks on the customer's hardware.
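
One way to read that "fused two-pass" restructuring is the count/scan/fill pattern sketched below in CUDA. It is a generic illustration under assumed names (within_cutoff, build_neighbor_list), not the kernel from any specific engagement: the first pass counts neighbors per particle, an exclusive scan sizes a single flat buffer, and the second pass fills it, so nothing allocates inside the hot loops.

    // Hypothetical CUDA sketch of a two-pass (count, scan, fill) neighbor-list build.
    // Nothing allocates inside the hot kernels; one flat buffer is sized up front.
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Assumed cutoff test; a production MD code would use cell lists and periodic images.
    __device__ bool within_cutoff(const float3* pos, int i, int j, float cutoff2) {
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        return dx * dx + dy * dy + dz * dz < cutoff2;
    }

    // Pass 1: count neighbors per particle, writing only counts[i].
    __global__ void count_neighbors(int n, const float3* pos, float cutoff2, int* counts) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int c = 0;
        for (int j = 0; j < n; ++j)
            if (j != i && within_cutoff(pos, i, j, cutoff2)) ++c;
        counts[i] = c;
    }

    // Pass 2: fill one preallocated flat buffer at offsets produced by an exclusive scan.
    __global__ void fill_neighbors(int n, const float3* pos, float cutoff2,
                                   const int* offsets, int* list) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int k = offsets[i];
        for (int j = 0; j < n; ++j)
            if (j != i && within_cutoff(pos, i, j, cutoff2)) list[k++] = j;
    }

    void build_neighbor_list(int n, const float3* d_pos, float cutoff2) {
        thrust::device_vector<int> counts(n), offsets(n);
        int threads = 256, blocks = (n + threads - 1) / threads;
        count_neighbors<<<blocks, threads>>>(n, d_pos, cutoff2,
                                             thrust::raw_pointer_cast(counts.data()));
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
        int total = offsets[n - 1] + counts[n - 1];   // single allocation for the whole list
        thrust::device_vector<int> list(total);
        fill_neighbors<<<blocks, threads>>>(n, d_pos, cutoff2,
                                            thrust::raw_pointer_cast(offsets.data()),
                                            thrust::raw_pointer_cast(list.data()));
        // ... hand `list` and `offsets` off to the force kernel ...
    }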

Demonstrated performance

Reference workload: a parallel molecular-dynamics kernel; work performed at Harvard University Research Engineering (HUIT cluster, 2025). Full methodology and source are available at arvaer.com/writing/ghostly.

Hardware: Harvard HUIT cluster · NVIDIA L40S GPUs · SLURM with GPU rank-to-device pinning. Reproducible from the published source.
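
Rank-to-device pinning is a small detail that matters at scale: without it, several ranks can land on the same GPU while others sit idle. A minimal sketch, assuming one MPI rank per GPU under an srun launch; the exact launch configuration on a given cluster will vary.

    // Minimal MPI + CUDA rank-to-device pinning. SLURM_LOCALID is the per-node rank
    // that srun exports; falling back to global rank modulo device count is a
    // reasonable default when it is absent.
    #include <cstdlib>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int ndev = 0;
        cudaGetDeviceCount(&ndev);

        const char* local = std::getenv("SLURM_LOCALID");
        int local_rank = local ? std::atoi(local) : rank;
        if (ndev > 0)
            cudaSetDevice(local_rank % ndev);   // one rank, one device

        // ... allocate device buffers, launch kernels, exchange halos ...

        MPI_Finalize();
        return 0;
    }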

Engagement methodology

A standard HPC Modernization engagement runs in four phases.

  1. Profile against ground truth.

    We profile the customer's code as it ships today, with Nsight Compute and perf, and benchmark it against scientifically validated reference results. Every subsequent transformation must preserve scientific correctness, measured rather than asserted.

  2. Identify the dominant cost.

    We characterize where wall-clock time actually goes. Memory-allocation overhead, floating-point throughput, memory bandwidth, MPI communication patterns, cache behavior. The right optimization depends on which of these dominates.

  3. Apply the transformation, measure the lift.

    Hand-rolled SIMD intrinsics. Fused multi-pass kernels. OpenMP threading with NUMA awareness. CUDA kernels designed for the target architecture's compute throughput. MPI distribution with explicit GPU pinning. Every change generates a reproducible benchmark in the project record. A representative hand-vectorization sketch follows this list.

  4. Hand back code, hand back receipts.

    Customer receives the modernized code, full benchmark report, methodology documentation, and a reproducible build harness. Reviewable, auditable, transferable.
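
To give a flavor of the phase-three work, the sketch below hand-vectorizes an AXPY-style inner loop with AVX2 FMA intrinsics. The loop itself is deliberately simple (a compiler would normally vectorize it unaided); the realistic targets are loops with strided access, gathers, or mixed precision that defeat auto-vectorization, but the mechanics are the same. The function name and assumptions are illustrative.

    // Hypothetical hand-vectorized update y[i] = a * x[i] + y[i], four doubles per
    // iteration via AVX2 FMA. Assumes n % 4 == 0 and 32-byte-aligned arrays
    // (use _mm256_loadu_pd / _mm256_storeu_pd otherwise).
    #include <immintrin.h>

    void axpy_avx2(long n, double a, const double* x, double* y) {
        const __m256d va = _mm256_set1_pd(a);
        for (long i = 0; i < n; i += 4) {
            __m256d vx = _mm256_load_pd(x + i);   // 4 contiguous doubles from x
            __m256d vy = _mm256_load_pd(y + i);   // 4 contiguous doubles from y
            vy = _mm256_fmadd_pd(va, vx, vy);     // fused multiply-add: a * x + y
            _mm256_store_pd(y + i, vy);           // write back in place
        }
    }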

Target customers

We work with customers who maintain legacy scientific codebases that are slow on current hardware. Typical engagements include:

  • National laboratory teams maintaining DOE-funded simulation codes (combustion, climate, neutron transport, plasma physics, multiphysics).
  • USDA / USFS / NOAA research groups modernizing wildfire, weather, and ocean codes (e.g., QUIC-Fire, WRF, MPAS, HYCOM).
  • Defense CFD teams porting solvers (Kestrel, OVERFLOW, FUN3D, CFL3D) to GPU-accelerated HPC.
  • Academic research engineering groups needing acceleration for a specific scientific kernel before the next funding cycle.
  • Prime contractors holding facilities-support or R&D IDIQ vehicles who need a specialty subcontractor for kernel-level GPU work.

Differentiators

  • Reproducible

    Every claim is backed by a benchmark on the customer's hardware, not vendor-marketing data sheets.

  • Correctness-gated

    Scientific correctness is the gate. We never trade accuracy for speed without explicit customer sign-off.

  • Single-engineer

    Single-engineer engagement, deep accountability. No handoffs, no offshore implementation team, no junior engineer learning on the customer's dime.

  • Stack-fluent

    CUDA, OpenACC, Kokkos, ISO Fortran do concurrent, SIMD intrinsics, MPI, OpenMP: we use whichever maps most cleanly to the customer's compiler and target hardware.

  • Documented

    Documentation as a first-class deliverable. Every transformation is written down in a form the customer's own engineers can audit and extend.

Engagement framing

A typical HPC Modernization engagement runs 3–9 months and produces a documented, reproducible speedup on the customer's reference workload.

If you hold a scientific code that is slow on modern GPUs and you want a structured, benchmark-driven path to fix it — we'd like to hear about it.

michael@prominent-systems.com