This page briefly introduces the DOE Miniapps and their programming models. We also provide a single-node performance evaluation for both serial and parallel runs.

ExMatEx (Exascale Co-Design Center for Materials in Extreme Environments)

  • ASPA: Adaptive sampling.
  • CoMD: Extensible molecular dynamics.
  • HILO: Stochastic solutions to the Boltzmann transport equation.
    • CMC = classic Monte Carlo (a minimal sketch of this approach appears after this list).
    • QDA-MC = quasi-diffusion accelerated Monte Carlo, designed for hybrid architectures.
  • LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics.
  • VPFFT++: Crystal viscoplasticity.
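
For the CMC variant of HILO, the essence of classic Monte Carlo transport is to stream particles over exponentially distributed path lengths and sample a collision outcome at each stop. Below is a minimal, self-contained sketch of that idea for a 1D slab; it is not code from HILO, and the cross section, scattering ratio, and slab width are made-up values.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define SIGMA_T 1.0      /* total macroscopic cross section, 1/cm (made up) */
    #define C_SCAT  0.7      /* probability a collision is a scatter (made up)  */
    #define WIDTH   10.0     /* slab thickness, cm (made up)                    */

    static double uniform01(void) {
        return (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* uniform in (0,1) */
    }

    int main(void) {
        const long nhist = 1000000;
        long leaked = 0, absorbed = 0;
        srand(12345);

        for (long n = 0; n < nhist; ++n) {
            double x = 0.0, mu = 1.0;              /* born at x = 0 moving right */
            for (;;) {
                /* stream an exponentially distributed distance to the next collision */
                x += mu * (-log(uniform01()) / SIGMA_T);
                if (x < 0.0 || x > WIDTH) { leaked++; break; }
                /* at the collision site: absorb or scatter isotropically */
                if (uniform01() > C_SCAT) { absorbed++; break; }
                mu = 2.0*uniform01() - 1.0;
            }
        }
        printf("leakage fraction:    %.4f\n", (double)leaked / nhist);
        printf("absorption fraction: %.4f\n", (double)absorbed / nhist);
        return 0;
    }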

ExaCT (Center for Exascale Simulation of Combustion in Turbulence)

  • Exp_CNS_NoSpec: A simple stencil-based test code that computes the hyperbolic component of a time-explicit advance of the compressible Navier-Stokes equations, using 8th-order finite differences in space and a 3rd-order, low-storage TVD Runge-Kutta scheme in time (see the sketch after this list).
  • MultiGrid_C: A multigrid-based solver for a model linear elliptic system based on a centered second-order discretization.
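
To make the Exp_CNS_NoSpec description above concrete, here is a small sketch of its two named ingredients: an 8th-order central finite difference in space and a 3rd-order TVD (SSP) Runge-Kutta step in time, applied here to 1D periodic linear advection rather than the compressible Navier-Stokes system. The grid, advection speed, and time step are made up, and the Runge-Kutta stages are written out plainly rather than in the miniapp's low-storage form.

    #include <stdio.h>
    #include <math.h>

    #define N  64
    #define A  1.0                 /* advection speed (illustrative) */
    #define PI 3.14159265358979323846

    /* 8th-order central difference of du/dx on a periodic grid. */
    static void dudx(const double *u, double *d, double h) {
        const double c1 = 4.0/5.0, c2 = -1.0/5.0, c3 = 4.0/105.0, c4 = -1.0/280.0;
        for (int i = 0; i < N; ++i) {
            int p1 = (i+1)%N, p2 = (i+2)%N, p3 = (i+3)%N, p4 = (i+4)%N;
            int m1 = (i-1+N)%N, m2 = (i-2+N)%N, m3 = (i-3+N)%N, m4 = (i-4+N)%N;
            d[i] = (c1*(u[p1]-u[m1]) + c2*(u[p2]-u[m2])
                  + c3*(u[p3]-u[m3]) + c4*(u[p4]-u[m4])) / h;
        }
    }

    /* Right-hand side of the toy equation u_t = -A u_x. */
    static void rhs(const double *u, double *r, double h) {
        dudx(u, r, h);
        for (int i = 0; i < N; ++i) r[i] = -A * r[i];
    }

    int main(void) {
        double u[N], u1[N], u2[N], r[N];
        double h = 2.0*PI/N, dt = 0.2*h;
        for (int i = 0; i < N; ++i) u[i] = sin(i*h);

        for (int step = 0; step < 100; ++step) {
            /* 3rd-order TVD (SSP) Runge-Kutta, written out stage by stage. */
            rhs(u, r, h);
            for (int i = 0; i < N; ++i) u1[i] = u[i] + dt*r[i];
            rhs(u1, r, h);
            for (int i = 0; i < N; ++i) u2[i] = 0.75*u[i] + 0.25*(u1[i] + dt*r[i]);
            rhs(u2, r, h);
            for (int i = 0; i < N; ++i) u[i] = (u[i] + 2.0*(u2[i] + dt*r[i])) / 3.0;
        }
        printf("u[0] after 100 steps: %f\n", u[0]);
        return 0;
    }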

CESAR (Center for Exascale Simulation of Advanced Reactors)

  • mocfe_bone: Deterministic neutronics code.
  • nekbone: Solves a Poisson equation using a conjugate gradient iteration with no preconditioner on a block or linear geometry (a minimal CG sketch follows this list).
  • openmcbone: Monte Carlo neutronics code.
  • XSBench: Calculation of macroscopic cross sections in Monte Carlo particle transport code.
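
Since nekbone's kernel is an unpreconditioned conjugate gradient iteration (noted above), the sketch below shows that iteration on a tiny dense SPD system. It is only an illustration of the algorithm: nekbone itself applies a matrix-free spectral-element operator and performs the reductions in parallel, neither of which this toy does.

    #include <stdio.h>
    #include <math.h>

    #define N 4

    /* y = A*x; a stand-in for nekbone's matrix-free spectral-element operator. */
    static void apply_A(const double A[N][N], const double *x, double *y) {
        for (int i = 0; i < N; ++i) {
            y[i] = 0.0;
            for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
        }
    }

    static double dot(const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < N; ++i) s += a[i] * b[i];
        return s;
    }

    int main(void) {
        /* small symmetric, diagonally dominant (hence SPD) system; data is made up */
        double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
        double b[N] = {1,2,3,4}, x[N] = {0,0,0,0};
        double r[N], p[N], Ap[N];

        apply_A(A, x, Ap);
        for (int i = 0; i < N; ++i) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
        double rr = dot(r, r);

        for (int it = 0; it < 100 && sqrt(rr) > 1e-12; ++it) {
            apply_A(A, p, Ap);
            double alpha = rr / dot(p, Ap);             /* step length       */
            for (int i = 0; i < N; ++i) { x[i] += alpha*p[i]; r[i] -= alpha*Ap[i]; }
            double rr_new = dot(r, r);
            double beta = rr_new / rr;                  /* no preconditioner */
            for (int i = 0; i < N; ++i) p[i] = r[i] + beta*p[i];
            rr = rr_new;
        }
        for (int i = 0; i < N; ++i) printf("x[%d] = %f\n", i, x[i]);
        return 0;
    }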

Single Node Performance

All runs were performed on a single node with four 2.2 GHz AMD Opteron 6174 processors: 48 cores in total, four sockets, eight NUMA nodes, and 128 GB of memory. Cache sizes are L1 = 128 KB/core, L2 = 512 KB/core, and L3 = 12 MB/socket. We start with a single-core performance evaluation and then provide a parallel evaluation, still within a single node.

Serial Performance

All data is for a single complete run of each app, including any initialization code. The run time is measured externally, so it also covers initialization.
Name | Time (secs) | Mflps | Mips | FP/cyc | Vec/cyc | Stall cycs | L2 hit | L1 BW | L2 BW | Mem VM | Mem RSS | LOC
ExMatEx            
ASPA323.626720.0020.006480.9821634.22211333081
CoMD19222929790.1040.383460.48115.57.937122548
HILO 1D4046921140.2130.604440.9991.20.004135003
HILO 2D5056319000.2560.536401.0008570.014135003
LULESH342107923030.4910.896600.2701293928121892350
VPFFT7062223870.2830.713570.692781672362637
ExaCT
CNS_NoSpec3579519940.3610.538680.50026021307599553787
MultiGrid_C9057722570.2620.434560.280637486255324741704
CESAR
mocfe_bone132109630690.4980.786460.591931380236623236252
nekbone244146031570.6641.044490.35838524692727230105
XSBench961748990.0340.125730.832155726116881661663

 Notes:

  • KEYS:
    • FP/cyc : floating point operations per cycle (see the conversion sketch after these notes)
    • Vec/cyc : vector operations per cycle
    • Stall cycs : % of cycles stalled on any resource
    • L2 hit : L2 data cache hit rate
    • L1 BW / L2 BW : bandwidth to the L1 / L2 cache in MB/s
    • Mem VM : peak virtual memory in MB
    • Mem RSS : peak resident set size in MB
    • LOC : lines of code
  • For CoMD, we use the serial version, not the OCL version.
  • For HILO, we use the CMC version.
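
Because the table reports some rates per cycle and others per second, it can help to remember that on this fixed 2.2 GHz machine the two differ only by the clock rate. The sketch below performs that conversion; the input values are made up, not rows from the table.

    #include <stdio.h>

    #define CLOCK_MHZ 2200.0   /* 2.2 GHz AMD Opteron 6174 test node */

    /* Per-cycle and per-second rates differ only by the clock rate:
     *   Mflps (Mflops) = FP/cyc   * MHz
     *   Mips           = inst/cyc * MHz
     * The inputs below are made-up values, not rows from the table. */
    int main(void) {
        double fp_per_cyc  = 0.25;    /* hypothetical counter reading */
        double ins_per_cyc = 1.10;    /* hypothetical counter reading */

        printf("Mflps = %.1f\n", fp_per_cyc  * CLOCK_MHZ);
        printf("Mips  = %.1f\n", ins_per_cyc * CLOCK_MHZ);
        return 0;
    }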

Parallel Speedup

We present the parallel performance and scaling behavior of the applications within a single node.

Name | Languages | Parameters | Time (secs) | Speedup (6) | Speedup (12) | Speedup (24) | Speedup (48)
ExMatEx
CoMD | CPP/OCL | -e | 53.9 | 5.9 | 11.7 | 21.6 | 35.9
HILO 1D | C/MPI | | 40.8 | 4.6 | 9.3 | 19.0 | 37.6
HILO 2D | C/MPI | iters 1 | 50.4 | 4.8 | 9.4 | 18.7 | 37.3
LULESH | CPP/OMP | | 339 | 4.0 | 5.4 | 4.3 | 3.8
VPFFT | CPP/OMP | 50, 1, 0.01, 1e-5 | 71.3 | 2.6 | 3.2 | 3.5 | 3.8
ExaCT
CNS_NoSpec | F90/OMP/MPI | inputs_3d | 34.1 | 3.9 | 6.6 | 9.7 | 13.6
CNS_NoSpec | F90/MPI | inputs_3d | 34.1 | 3.3 | 5.8 | 6.1 | 6.3
CNS_NoSpec | F90/OMP | inputs_3d | 34.1 | 3.8 | 5.3 | 5.2 | 5.0
MultiGrid_C | CPP/OMP/MPI | inputs.3d, n_cell 256 | 70.9 | 2.2 | 3.8 | 4.8 | 5.9
MultiGrid_C | CPP/MPI | inputs.3d, n_cell 256 | 70.9 | 3.9 | 7.5 | 14.3 | 21.5
MultiGrid_C | CPP/OMP | inputs.3d, n_cell 256 | 70.9 | 2.2 | 1.9 | 1.3 | 1.0
CESAR
mocfe_bone | F90/MPI | 48 n 16 16 1 1 1 | 101 | 3.2 | 6.2 | 9.7 | 15.3
nekbone | F90/MPI | lp=48 lelt=2400 | 243 | 3.6 | 7.1 | 15.1 | 27.0
XSBench | C/OMP | | 960 | 5.0 | 9.1 | 12.3 | 18.1

 Notes:

  • The time in seconds is for one thread on one core.
  • Speedup is measured relative to that one-core run.
  • We always try to pin threads/processes to cores by NUMA node, e.g. cores 0-5, 6-11, etc. (a minimal pinning sketch follows these notes).
  • For hybrid models, we have one MPI process per NUMA node, and one OpenMP thread per core.
  • For CoMD, we use the OCL version, and only report the SoA results.
  • Note that pinning threads gets overridden for CoMD.
  • nekbone was modified from weak to strong scaling.
  • For XSBench, when running in serial, the initialization time is about four times the actual runtime, so we report the sum of the two values.
  • ASPA and vodeDriver are omitted because they do not have parallel versions.
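
As a sketch of the pinning scheme described in the notes (one contiguous block of six cores per NUMA node), the snippet below binds each OpenMP thread of one process to its own core using the Linux sched_setaffinity call. The base_core value is hard-coded here; in the hybrid runs it would be derived from the MPI rank. This only illustrates the idea and is not the launch configuration actually used.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Illustrative: bind the 6 OpenMP threads of this process to cores
         * base_core .. base_core+5, i.e. one six-core NUMA node of the
         * Opteron 6174 machine. In the hybrid runs base_core would come
         * from the MPI rank (rank 0 -> core 0, rank 1 -> core 6, ...). */
        int base_core = 0;

        #pragma omp parallel num_threads(6)
        {
            int tid = omp_get_thread_num();
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(base_core + tid, &mask);
            if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = this thread */
                perror("sched_setaffinity");

            #pragma omp critical
            printf("thread %d pinned to core %d\n", tid, base_core + tid);
        }
        return 0;
    }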