HPC Modernization
We make legacy Fortran and C++ scientific codes run two orders of magnitude faster on modern GPUs — without rewriting the science.
Decades of scientific computing live in Fortran. Weather models, combustion codes, neutron-transport solvers, climate ensembles, CFD pipelines, molecular-dynamics frameworks — billions of lines, all written for an era of single-threaded CPUs and modest cluster sizes.
Modern NVIDIA hardware — Hopper, Blackwell, Grace Hopper — can deliver 100× or more on the same workloads. But the path from a working Fortran code to a performant CUDA / OpenACC / Kokkos port is full of traps: allocation overhead that dominates wall-clock time, false sharing in OpenMP regions, memory-bound kernels that leave the GPU's compute units idle, MPI communication patterns that serialize at scale.
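To make one of these traps concrete, here is a minimal C++/OpenMP sketch of false sharing and the usual fix. It is a generic illustration, not customer code; the struct name Padded and the function sum_squares are invented for the example.

```cpp
#include <omp.h>
#include <vector>

// Generic illustration of false sharing (not customer code): adjacent
// per-thread accumulators would share a 64-byte cache line, so every update
// forces that line to bounce between cores. alignas(64) gives each thread's
// accumulator its own line and restores thread scaling.
struct alignas(64) Padded { double value = 0.0; };

double sum_squares(const std::vector<double>& x) {
    std::vector<Padded> partial(omp_get_max_threads());

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < static_cast<long>(x.size()); ++i)
            partial[tid].value += x[i] * x[i];   // each thread touches only its own cache line
    }

    double total = 0.0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

In practice an OpenMP reduction clause is the simpler fix for this particular loop; the padded accumulators just make the cache-line mechanics visible.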
Prominent Systems specializes in that path. We profile against ground-truth baselines in Nsight Compute, hand-roll SIMD intrinsics where the compiler won't vectorize, restructure kernels for fused two-pass execution to eliminate dynamic allocation, distribute via MPI with GPU rank-to-device pinning, and document every transformation with reproducible benchmarks on the customer's hardware.
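The fused two-pass restructuring is easiest to see as a count-then-fill pattern. The sketch below is a generic illustration under assumed names (Pairs, within_cutoff), not the published molecular-dynamics kernel: a naive version would push_back into a growing vector inside the hot loop, while the two-pass version counts first, sizes persistent buffers once, and fills them with no allocation in the inner loop.

```cpp
#include <cstddef>
#include <vector>

// Count-then-fill ("two-pass") restructuring, sketched from the general
// technique rather than any specific customer kernel.
struct Pairs {
    std::vector<int> a, b;          // persistent buffers, reused every timestep

    template <typename Pred>
    void build(int n, Pred within_cutoff) {
        // Pass 1: count how many pairs will be emitted.
        std::size_t count = 0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (within_cutoff(i, j)) ++count;

        // One resize per step; a no-op once steady-state capacity is reached.
        a.resize(count);
        b.resize(count);

        // Pass 2: fill the pre-sized buffers with no further allocation.
        std::size_t k = 0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (within_cutoff(i, j)) { a[k] = i; b[k] = j; ++k; }
    }
};
```

Once the buffers reach steady-state capacity the resize calls stop allocating, which is what moves allocation from more than half of wall-clock time toward a fraction of a percent in a workload like the one in the table below.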
Demonstrated performance
Reference workload: a parallel molecular-dynamics kernel, optimized and benchmarked at Harvard University Research Engineering (HUIT cluster, 2025). Full methodology and source are available at arvaer.com/writing/ghostly.
Optimization stage              Speedup    Notes
─────────────────────────────────────────────────────────────────────────
Baseline (single-thread C++)      1.00×    238 ms wall-clock
Hand-rolled SIMD intrinsics       5.80×    over compiler baseline
Fused two-pass kernel             1.36×    alloc 53.7% → 0.38%
OpenMP, 24 threads               12.98×    thread scaling
CUDA on single GPU                5.10×    53.7% → 84.3% throughput (Nsight)
MPI, 192 ranks across 8 nodes    59.00×    weak-scaling, GPU pinning
─────────────────────────────────────────────────────────────────────────
Cumulative pipeline             287.00×    238 ms → 0.83 ms wall-clock
Hardware: Harvard HUIT cluster · NVIDIA L40S GPUs · SLURM with GPU rank-to-device pinning. Reproducible from the published source.
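For context, rank-to-device pinning usually looks like the sketch below: a hypothetical, minimal MPI + CUDA runtime example (not the cluster's actual launch code) in which ranks sharing a node discover their node-local index and bind to one GPU each.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Minimal sketch of MPI rank-to-device pinning, assuming one MPI rank per GPU.
// Ranks on the same node are grouped with MPI_Comm_split_type, and each rank
// binds to a device chosen by its node-local index.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank = 0, device_count = 0;
    MPI_Comm_rank(node_comm, &local_rank);
    cudaGetDeviceCount(&device_count);
    cudaSetDevice(local_rank % device_count);   // pin this rank to one GPU

    // ... launch kernels, exchange halos, etc. ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Under SLURM this typically pairs with an --ntasks-per-node setting that matches the number of GPUs per node, so every rank lands on its own device.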
Engagement methodology
A standard HPC Modernization engagement runs in four phases.
01 · Profile against ground truth.
We benchmark the customer's code as it ships today, in Nsight Compute and perf, against scientifically validated reference results. Every subsequent transformation must preserve scientific correctness — measured, not asserted. A minimal sketch of this correctness gate follows the phase list.
02 · Identify the dominant cost.
We characterize where wall-clock time actually goes: memory-allocation overhead, floating-point throughput, memory bandwidth, MPI communication patterns, cache behavior. The right optimization depends on which of these dominates.
03 · Apply the transformation, measure the lift.
Hand-rolled SIMD intrinsics. Fused multi-pass kernels. OpenMP threading with NUMA awareness. CUDA kernels designed for the target architecture's compute throughput. MPI distribution with explicit GPU pinning. Every change generates a reproducible benchmark in the project record.
04 · Hand back code, hand back receipts.
The customer receives the modernized code, a full benchmark report, methodology documentation, and a reproducible build harness. Reviewable, auditable, transferable.
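As referenced in phase 01, the correctness gate can be pictured as a small harness that times a candidate kernel and rejects its result unless it matches the validated reference within an agreed tolerance. This is a hypothetical sketch (the names matches_reference and time_and_check are invented for the example), not the engagement tooling itself.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Element-wise comparison against a validated reference within a relative tolerance.
bool matches_reference(const std::vector<double>& result,
                       const std::vector<double>& reference,
                       double rel_tol) {
    if (result.size() != reference.size()) return false;
    for (std::size_t i = 0; i < result.size(); ++i) {
        const double denom = std::max(std::abs(reference[i]), 1e-300);
        if (std::abs(result[i] - reference[i]) / denom > rel_tol) return false;
    }
    return true;
}

// Time a candidate kernel and report milliseconds only if its output passes the gate.
template <typename Kernel>
double time_and_check(Kernel run, const std::vector<double>& reference,
                      double rel_tol) {
    const auto t0 = std::chrono::steady_clock::now();
    const std::vector<double> out = run();
    const auto t1 = std::chrono::steady_clock::now();

    if (!matches_reference(out, reference, rel_tol)) {
        std::fprintf(stderr, "correctness gate failed: result rejected\n");
        return -1.0;                              // speedup is not reported
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```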
Target customers
We work with customers who hold legacy scientific codebases that are slow on current hardware. Typical engagements include:
- National laboratory teams maintaining DOE-funded simulation codes (combustion, climate, neutron transport, plasma physics, multiphysics).
- USDA / USFS / NOAA research groups modernizing wildfire, weather, and ocean codes (e.g., QUIC-Fire, WRF, MPAS, HYCOM).
- Defense CFD teams porting solvers (Kestrel, OVERFLOW, FUN3D, CFL3D) to GPU-accelerated HPC.
- Academic research engineering groups needing acceleration for a specific scientific kernel before the next funding cycle.
- Prime contractors holding facilities-support or R&D IDIQ vehicles who need a specialty subcontractor for kernel-level GPU work.
Differentiators
- Reproducible. Every claim is backed by a benchmark on the customer's hardware, not a vendor marketing data sheet.
- Correctness-gated. Scientific correctness is the gate: we never trade accuracy for speed without explicit customer sign-off.
- Single-engineer. One engineer, deep accountability. No handoffs, no offshore implementation team, no junior engineer learning on the customer's dime.
- Stack-fluent. CUDA, OpenACC, Kokkos, ISO Fortran do concurrent, SIMD intrinsics, MPI, OpenMP — we use whichever maps most cleanly to the customer's compiler and target hardware.
- Documented. Documentation is a first-class deliverable: every transformation is written down in a form the customer's own engineers can audit and extend.
Engagement framing
A typical HPC Modernization engagement runs 3–9 months and produces a documented, reproducible speedup on the customer's reference workload.
If you hold a scientific code that is slow on modern GPUs and you want a structured, benchmark-driven path to fix it — we'd like to hear about it.
michael@prominent-systems.com