This page briefly introduces the DOE Miniapps and their programming models. We also provide single-node performance evaluations for both serial and parallel runs.

ExMatEx (Exascale Co-Design Center for Materials in Extreme Environments)

  • ASPA: Adaptive sampling.
  • CoMD: Extensible molecular dynamics.
  • HILO: Stochastic solutions to the Boltzmann transport equation.
    • CMC = classic Monte Carlo.
    • QDA-MC = quasi-diffusion accelerated Monte Carlo, designed for hybrid architectures.
  • LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics.
  • VPFFT++: Crystal viscoplasticity.

ExaCT (Center for Exascale Simulation of Combustion in Turbulence)

  • Exp_CNS_NoSpec: A simple stencil-based test code that computes the hyperbolic component of a time-explicit advance for the compressible Navier-Stokes equations, using 8th-order finite differences in space and a 3rd-order, low-storage TVD Runge-Kutta (RK) scheme in time.
  • MultiGrid_C: A multigrid-based solver for a model linear elliptic system based on a centered second-order discretization.

CESAR (Center for Exascale Simulation of Advanced Reactors)

  • mocfe_bone: Deterministic neutronics code.
  • nekbone: Solves a Poisson equation using a conjugate gradient iteration with no preconditioner on a block or linear geometry.
  • openmcbone: Monte Carlo neutronics code.
  • XSBench: Calculation of macroscopic cross sections in Monte Carlo particle transport code.

Single Node Performance

All runs were performed on a 48-core AMD Opteron 6174 2.2GHz, with four sockets and eight NUMA nodes, with a total of 128GB memory. Cache sizes are: L1 = 128K/core, L2 = 512K/core, L3 = 12M/socket. We start with single core performance evaluation and then provide parallel (still within a single node) evaluation.

  Serial Performance

All data is for a single complete run of each app, including any initialization code. The run time itself is measured externally for that run.
[Table: serial performance for HILO 1D and HILO 2D, with columns secs, FP/cyc, Vec/cyc, Stall cycs, L2 hit, L? BW, L? BW, Mem VM, Mem RSS; the numeric values were garbled in extraction.]


  • KEYS:
    • FP/cyc : floating point operations per cycle
    • Vec/cyc : vector operations per cycle
    • Stall cycs : % of cycles stalled on any resource
    • L2 hit : L2 data cache hit rate
    • L? BW : bandwidth to L? cache in MB/s
    • Mem VM : peak virtual memory in MB
    • Mem RSS : peak resident set size in MB
    • LOC : lines of code
  • For CoMD, we use the serial version, not the OCL version.
  • For HILO, we use the CMC version.

  Parallel Speedup

We present the performance and scaling behavior of the applications within a single node.

 App          Model        Input                   secs    Speedups
 VPFFT        CPP/OMP      50, 1, 0.01, 1e-5       71.…
 MultiGrid_C  CPP/OMP/MPI  inputs.3d, n_cell 256   70.…
 MultiGrid_C  CPP/MPI      inputs.3d, n_cell 256   70.9    3.9   7.5   14.3   21.5
 MultiGrid_C  CPP/OMP      inputs.3d, n_cell 256   70.…
 mocfe_bone   F90/MPI      48 n 16 16 1 1 1        101     3.2   6.2   9.7    15.3
 nekbone      F90/MPI      lp=48 lelt=2400         2433.…
 XSBench      C/OMP                                960     5.0   9.1   12.3   18.1


  • The time in seconds is for one thread on one core.
  • Speedup is the speedup over one core.
  • We always try to pin threads/processes to NUMA nodes, e.g., cores 0-5 to the first node, 6-11 to the second, etc.
  • For hybrid models, we have one MPI process per NUMA node, and one OpenMP thread per core.
  • For CoMD, we use the OCL version, and only report the SoA results.
  • Note that pinning threads gets overridden for CoMD.
  • nekbone was modified from weak to strong scaling.
  • For XSBench, when running in serial, the initialization time is about four times the actual runtime, so we report the sum of the two values.
  • ASPA and vodeDriver are omitted because they do not have parallel versions.