HPC Modernization
We make legacy Fortran and C++ scientific codes run two orders of magnitude faster on modern GPUs — without rewriting the science.
Decades of scientific computing live in Fortran. Weather models, combustion codes, neutron-transport solvers, climate ensembles, CFD pipelines, molecular-dynamics frameworks — billions of lines, all written for an era of single-threaded CPUs and modest cluster sizes.
Modern NVIDIA hardware — Hopper, Blackwell, Grace Hopper — can deliver 100× or more on the same workloads. But the path from a working Fortran code to a performant CUDA / OpenACC / Kokkos port is full of traps: allocation overhead that dominates wall-clock time, false sharing in OpenMP regions, memory-bound kernels that leave the GPU's compute units idle, MPI communication patterns that serialize at scale.
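To make one of these traps concrete, here is a minimal C++/OpenMP sketch of false sharing and the usual fix. It is a generic illustration, not customer code; the struct name Padded and the function sum_squares are invented for the example.

```cpp
#include <omp.h>
#include <vector>

// Generic illustration of false sharing (not customer code): adjacent
// per-thread accumulators would share a 64-byte cache line, so every update
// forces that line to bounce between cores. alignas(64) gives each thread's
// accumulator its own line and restores thread scaling.
struct alignas(64) Padded { double value = 0.0; };

double sum_squares(const std::vector<double>& x) {
    std::vector<Padded> partial(omp_get_max_threads());

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < static_cast<long>(x.size()); ++i)
            partial[tid].value += x[i] * x[i];   // each thread touches only its own cache line
    }

    double total = 0.0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

In practice an OpenMP reduction clause is the simpler fix for this particular loop; the padded accumulators just make the cache-line mechanics visible.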
Prominent Systems specializes in that path. We profile against ground-truth baselines in Nsight Compute, hand-roll SIMD intrinsics where the compiler won't vectorize, restructure kernels for fused two-pass execution to eliminate dynamic allocation, distribute via MPI with GPU rank-to-device pinning, and document every transformation with reproducible benchmarks on the customer's hardware.
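The fused two-pass restructuring is easiest to see as a count-then-fill pattern. The sketch below is a generic illustration under assumed names (Pairs, within_cutoff), not the published molecular-dynamics kernel: a naive version would push_back into a growing vector inside the hot loop, while the two-pass version counts first, sizes persistent buffers once, and fills them with no allocation in the inner loop.

```cpp
#include <cstddef>
#include <vector>

// Count-then-fill ("two-pass") restructuring, sketched from the general
// technique rather than any specific customer kernel.
struct Pairs {
    std::vector<int> a, b;          // persistent buffers, reused every timestep

    template <typename Pred>
    void build(int n, Pred within_cutoff) {
        // Pass 1: count how many pairs will be emitted.
        std::size_t count = 0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (within_cutoff(i, j)) ++count;

        // One resize per step; a no-op once steady-state capacity is reached.
        a.resize(count);
        b.resize(count);

        // Pass 2: fill the pre-sized buffers with no further allocation.
        std::size_t k = 0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (within_cutoff(i, j)) { a[k] = i; b[k] = j; ++k; }
    }
};
```

Once the buffers reach steady-state capacity the resize calls stop allocating, which is what moves allocation from more than half of wall-clock time toward a fraction of a percent in a workload like the one in the table below.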
Demonstrated performance
Reference workload: a parallel molecular-dynamics kernel, optimized and benchmarked at Harvard University Research Engineering (HUIT cluster, 2025). Full methodology and source are available at arvaer.com/writing/ghostly.
Optimization stage              Speedup    Notes
─────────────────────────────────────────────────────────────────────────
Baseline (single-thread C++)      1.00×    238 ms wall-clock
Hand-rolled SIMD intrinsics       5.80×    over compiler baseline
Fused two-pass kernel             1.36×    alloc 53.7% → 0.38%
OpenMP, 24 threads               12.98×    thread scaling
CUDA on single GPU                5.10×    53.7% → 84.3% throughput (Nsight)
MPI, 192 ranks across 8 nodes    59.00×    weak-scaling, GPU pinning
─────────────────────────────────────────────────────────────────────────
Cumulative pipeline             287.00×    238 ms → 0.83 ms wall-clock
Hardware: Harvard HUIT cluster · NVIDIA L40S GPUs · SLURM with GPU rank-to-device pinning. Reproducible from the published source.
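For context, rank-to-device pinning usually looks like the sketch below: a hypothetical, minimal MPI + CUDA runtime example (not the cluster's actual launch code) in which ranks sharing a node discover their node-local index and bind to one GPU each.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Minimal sketch of MPI rank-to-device pinning, assuming one MPI rank per GPU.
// Ranks on the same node are grouped with MPI_Comm_split_type, and each rank
// binds to a device chosen by its node-local index.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank = 0, device_count = 0;
    MPI_Comm_rank(node_comm, &local_rank);
    cudaGetDeviceCount(&device_count);
    cudaSetDevice(local_rank % device_count);   // pin this rank to one GPU

    // ... launch kernels, exchange halos, etc. ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Under SLURM this typically pairs with an --ntasks-per-node setting that matches the number of GPUs per node, so every rank lands on its own device.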
Engagement methodology
A standard HPC Modernization engagement runs in four phases.
01 · Profile against ground truth.
We benchmark the customer's code as it ships today, in Nsight Compute and perf, against scientifically validated reference results. Every subsequent transformation must preserve scientific correctness — measured, not asserted. A minimal sketch of this correctness gate follows the phase list.
02 · Identify the dominant cost.
We characterize where wall-clock time actually goes: memory-allocation overhead, floating-point throughput, memory bandwidth, MPI communication patterns, cache behavior. The right optimization depends on which of these dominates.
03 · Apply the transformation, measure the lift.
Hand-rolled SIMD intrinsics. Fused multi-pass kernels. OpenMP threading with NUMA awareness. CUDA kernels designed for the target architecture's compute throughput. MPI distribution with explicit GPU pinning. Every change generates a reproducible benchmark in the project record.
04 · Hand back code, hand back receipts.
The customer receives the modernized code, a full benchmark report, methodology documentation, and a reproducible build harness. Reviewable, auditable, transferable.
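As referenced in phase 01, the correctness gate can be pictured as a small harness that times a candidate kernel and rejects its result unless it matches the validated reference within an agreed tolerance. This is a hypothetical sketch (the names matches_reference and time_and_check are invented for the example), not the engagement tooling itself.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Element-wise comparison against a validated reference within a relative tolerance.
bool matches_reference(const std::vector<double>& result,
                       const std::vector<double>& reference,
                       double rel_tol) {
    if (result.size() != reference.size()) return false;
    for (std::size_t i = 0; i < result.size(); ++i) {
        const double denom = std::max(std::abs(reference[i]), 1e-300);
        if (std::abs(result[i] - reference[i]) / denom > rel_tol) return false;
    }
    return true;
}

// Time a candidate kernel and report milliseconds only if its output passes the gate.
template <typename Kernel>
double time_and_check(Kernel run, const std::vector<double>& reference,
                      double rel_tol) {
    const auto t0 = std::chrono::steady_clock::now();
    const std::vector<double> out = run();
    const auto t1 = std::chrono::steady_clock::now();

    if (!matches_reference(out, reference, rel_tol)) {
        std::fprintf(stderr, "correctness gate failed: result rejected\n");
        return -1.0;                              // speedup is not reported
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```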
Target customers
We work with customers who hold legacy scientific codebases that are slow on current hardware. Typical engagements include:
- National laboratory teams maintaining DOE-funded simulation codes (combustion, climate, neutron transport, plasma physics, multiphysics).
- USDA / USFS / NOAA research groups modernizing wildfire, weather, and ocean codes (e.g., QUIC-Fire, WRF, MPAS, HYCOM).
- Defense CFD teams porting solvers (Kestrel, OVERFLOW, FUN3D, CFL3D) to GPU-accelerated HPC.
- Academic research engineering groups needing acceleration for a specific scientific kernel before the next funding cycle.
- Prime contractors holding facilities-support or R&D IDIQ vehicles who need a specialty subcontractor for kernel-level GPU work.
Differentiators
- Reproducible. Every claim is backed by a benchmark on the customer's hardware, not a vendor marketing data sheet.
- Correctness-gated. Scientific correctness is the gate: we never trade accuracy for speed without explicit customer sign-off.
- Single-engineer. One engineer, deep accountability. No handoffs, no offshore implementation team, no junior engineer learning on the customer's dime.
- Stack-fluent. CUDA, OpenACC, Kokkos, ISO Fortran do concurrent, SIMD intrinsics, MPI, OpenMP — we use whichever maps most cleanly to the customer's compiler and target hardware.
- Documented. Documentation is a first-class deliverable: every transformation is written down in a form the customer's own engineers can audit and extend.
Engagement framing
A typical HPC Modernization engagement runs 3–9 months and produces a documented, reproducible speedup on the customer's reference workload.
If you hold a scientific code that is slow on modern GPUs and you want a structured, benchmark-driven path to fix it — we'd like to hear about it.
michael@prominent-systems.com