This page briefly introduces the DOE Miniapps and their programming models. We also provide a single-node performance evaluation for both serial and parallel runs.
ExMatEx (Exascale Co-Design Center for Materials in Extreme Environments)
- ASPA: Adaptive sampling.
- CoMD: Extensible molecular dynamics.
- HILO: Stochastic solutions to the Boltzmann transport equation, in two variants:
  - CMC = classic Monte Carlo (a generic sketch follows this list),
  - QDA-MC = quasi-diffusion accelerated Monte Carlo, designed for hybrid architectures.
- LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics.
- VPFFT++: Crystal viscoplasticity.
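HILO's "classic Monte Carlo" variant refers to the standard stochastic treatment of the transport equation: each particle streams an exponentially sampled distance to its next collision, where it is either absorbed or scattered. The function below is only a generic illustration of that step, not code taken from HILO; the 1D slab setting and the cross sections `sigma_t`/`sigma_s` are assumptions made for the example.

```c
/* Generic sketch of one "classic Monte Carlo" (CMC) transport step.
 * Illustration only; not taken from the HILO source. */
#include <math.h>
#include <stdlib.h>

static double urand(void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }

/* Advance one particle; returns 1 if it is absorbed, 0 if it scatters. */
int cmc_step(double *x, double *mu, double sigma_t, double sigma_s)
{
    double dist = -log(urand()) / sigma_t;   /* free-flight distance       */
    *x += *mu * dist;                        /* stream along direction mu  */
    if (urand() < sigma_s / sigma_t) {       /* collision: scatter...      */
        *mu = 2.0 * urand() - 1.0;           /* isotropic new direction    */
        return 0;
    }
    return 1;                                /* ...or absorb               */
}
```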
ExaCT (Center for Exascale Simulation of Combustion in Turbulence)
- Exp_CNS_NoSpec: A simple stencil-based test code that computes the hyperbolic component of a time-explicit advance of the compressible Navier-Stokes equations, using 8th-order finite differences in space and a 3rd-order, low-storage TVD Runge-Kutta scheme in time (sketched below).
- MultiGrid_C: A multigrid-based solver for a model linear elliptic system based on a centered second-order discretization (sketched below).
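Both ExaCT proxies are defined by their discretizations, so a minimal sketch may help. The functions below show the standard 8th-order centered first-difference stencil, of the kind Exp_CNS_NoSpec applies to the hyperbolic terms, and the centered second-order 7-point Laplacian behind MultiGrid_C's model elliptic system. The array layouts, names, and loop bounds are illustrative assumptions, not code from either app.

```c
/* 8th-order centered first derivative of f along one line of n points
 * (interior points only); standard central-difference coefficients. */
void deriv8(const double *f, double *df, int n, double dx)
{
    for (int i = 4; i < n - 4; i++)
        df[i] = ( 4.0/5.0   * (f[i+1] - f[i-1])
                - 1.0/5.0   * (f[i+2] - f[i-2])
                + 4.0/105.0 * (f[i+3] - f[i-3])
                - 1.0/280.0 * (f[i+4] - f[i-4]) ) / dx;
}

/* Centered second-order 7-point Laplacian at an interior point (i,j,k)
 * of a grid with nx*ny points per z-plane and mesh spacing h. */
double laplacian2(const double *u, int i, int j, int k,
                  int nx, int ny, double h)
{
#define U(a,b,c) u[(a) + nx*((b) + ny*(c))]
    return ( U(i+1,j,k) + U(i-1,j,k)
           + U(i,j+1,k) + U(i,j-1,k)
           + U(i,j,k+1) + U(i,j,k-1) - 6.0*U(i,j,k) ) / (h*h);
#undef U
}
```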
CESAR (Center for Exascale Simulation of Advanced Reactors)
- mocfe_bone: Deterministic neutronics code.
- nekbone: Solves a Poisson equation using a conjugate gradient iteration
  with no preconditioner on a block or linear geometry (sketched below).
- openmcbone: Monte Carlo neutronics code.
- XSBench: Calculation of macroscopic cross sections in a Monte Carlo
  particle transport code (sketched below).
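nekbone's kernel is the textbook unpreconditioned conjugate gradient iteration. The sketch below shows that algorithm with a matrix-free operator callback standing in for nekbone's spectral-element operator; names such as `apply_A` and `cg_solve` are assumptions for the example, not nekbone's API.

```c
/* Minimal unpreconditioned conjugate gradient sketch (not nekbone code). */
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef void (*matvec_fn)(const double *x, double *y, int n);

void cg_solve(matvec_fn apply_A, const double *b, double *x, int n,
              int maxit, double tol)
{
    double *r  = malloc(n * sizeof *r);
    double *p  = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    memset(x, 0, n * sizeof *x);
    memcpy(r, b, n * sizeof *r);            /* r = b - A*0 = b            */
    memcpy(p, r, n * sizeof *p);
    double rr = 0.0;
    for (int i = 0; i < n; i++) rr += r[i] * r[i];
    for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
        apply_A(p, Ap, n);                  /* Ap = A*p                   */
        double pAp = 0.0;
        for (int i = 0; i < n; i++) pAp += p[i] * Ap[i];
        double alpha = rr / pAp;
        double rr_new = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rr_new += r[i] * r[i];
        }
        for (int i = 0; i < n; i++)         /* no preconditioner: beta    */
            p[i] = r[i] + (rr_new / rr) * p[i];  /* uses plain residuals  */
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
}
```

XSBench's central quantity, the macroscopic cross section of a material at a given energy, is the sum over its nuclides of number density times microscopic cross section: Sigma(E) = sum_i N_i * sigma_i(E). The helper below is a hypothetical illustration of that sum; the energy-grid lookup of the microscopic cross sections (the memory-bound part XSBench actually stresses) is left as a callback.

```c
/* Illustrative macroscopic cross-section sum; names are placeholders,
 * not XSBench's API. */
double macro_xs(const int *nuclide, const double *density, int n_nuclides,
                double energy, double (*lookup_micro_xs)(int, double))
{
    double sigma = 0.0;
    for (int i = 0; i < n_nuclides; i++)    /* number density times micro xs */
        sigma += density[i] * lookup_micro_xs(nuclide[i], energy);
    return sigma;
}
```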
Single Node Performance
All runs were performed on a 48-core node with four 2.2 GHz AMD Opteron 6174
processors (four sockets, eight NUMA nodes) and 128 GB of memory in total.
Cache sizes are: L1 = 128 KB/core, L2 = 512 KB/core, L3 = 12 MB/socket.
We start with a single-core performance evaluation and then provide a parallel evaluation, still within a single node.
Serial Performance
All data is for a single complete run of the app, including any initialization
code; the run time is measured externally over the entire run.
| Name | Time (s) | Mflops | Mips | FP/cyc | Vec/cyc | Stall cycs | L2 hit | L1 BW | L2 BW | Mem VM | Mem RSS | LOC |
|------|----------|--------|------|--------|---------|------------|--------|-------|-------|--------|---------|-----|
| ExMatEx | | | | | | | | | | | | |
| ASPA | 32 | 3.6 | 2672 | 0.002 | 0.006 | 48 | 0.982 | 163 | 4.2 | 221 | 13 | 33081 |
| CoMD | 192 | 229 | 2979 | 0.104 | 0.383 | 46 | 0.481 | 15.5 | 7.9 | 37 | 12 | 2548 |
| HILO 1D | 40 | 469 | 2114 | 0.213 | 0.604 | 44 | 0.999 | 1.2 | 0.00 | 41 | 3 | 5003 |
| HILO 2D | 50 | 563 | 1900 | 0.256 | 0.536 | 40 | 1.000 | 857 | 0.01 | 41 | 3 | 5003 |
| LULESH | 342 | 1079 | 2303 | 0.491 | 0.896 | 60 | 0.270 | 1293 | 928 | 121 | 89 | 2350 |
| VPFFT | 70 | 622 | 2387 | 0.283 | 0.713 | 57 | 0.692 | 78 | 16 | 72 | 36 | 2637 |
| ExaCT | | | | | | | | | | | | |
| CNS_NoSpec | 35 | 795 | 1994 | 0.361 | 0.538 | 68 | 0.500 | 2602 | 1307 | 599 | 553 | 787 |
| MultiGrid_C | 90 | 577 | 2257 | 0.262 | 0.434 | 56 | 0.280 | 637 | 486 | 2553 | 2474 | 1704 |
| CESAR | | | | | | | | | | | | |
| mocfe_bone | 132 | 1096 | 3069 | 0.498 | 0.786 | 46 | 0.591 | 931 | 380 | 2366 | 2323 | 6252 |
| nekbone | 244 | 1460 | 3157 | 0.664 | 1.044 | 49 | 0.358 | 385 | 246 | 927 | 272 | 30105 |
| XSBench | 961 | 74 | 899 | 0.034 | 0.125 | 73 | 0.832 | 1557 | 261 | 1688 | 1661 | 663 |
Notes:
- KEYS:
- FP/cyc : floating point operations per cycle
- Vec/cyc : vector operations per cycle
- Stall cycs : % of cycles stalled on any resource
- L2 hit : L2 data cache hit rate
- L? BW : bandwidth to L? cache in MB/s
- Mem VM : peak virtual memory in MB
- Mem RSS : peak resident set size in MB
- LOC : lines of code
- For CoMD, we use the serial version, not the OCL version.
- For HILO, we use the CMC version.
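- As a rough consistency check on the keys: FP/cyc multiplied by the 2.2 GHz clock should approximate the Mflops column, e.g. LULESH's 0.491 FP/cyc × 2200 MHz ≈ 1080, close to the reported 1079 Mflops.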
Parallel Speedup
We present the performance and scaling behavior of the applications within a single node.
| Name | Languages | Parameters | Time (s) | 6 cores | 12 cores | 24 cores | 48 cores |
|------|-----------|------------|----------|---------|----------|----------|----------|
| ExMatEx | | | | | | | |
| CoMD | CPP/OCL | -e | 53.9 | 5.9 | 11.7 | 21.6 | 35.9 |
| HILO 1D | C/MPI | | 40.8 | 4.6 | 9.3 | 19.0 | 37.6 |
| HILO 2D | C/MPI | iters 1 | 50.4 | 4.8 | 9.4 | 18.7 | 37.3 |
| LULESH | CPP/OMP | | 339 | 4.0 | 5.4 | 4.3 | 3.8 |
| VPFFT | CPP/OMP | 50, 1, 0.01, 1e-5 | 71.3 | 2.6 | 3.2 | 3.5 | 3.8 |
| ExaCT | | | | | | | |
| CNS_NoSpec | F90/OMP/MPI | inputs_3d | 34.1 | 3.9 | 6.6 | 9.7 | 13.6 |
| | F90/MPI | inputs_3d | 34.1 | 3.3 | 5.8 | 6.1 | 6.3 |
| | F90/OMP | inputs_3d | 34.1 | 3.8 | 5.3 | 5.2 | 5.0 |
| MultiGrid_C | CPP/OMP/MPI | inputs.3d, n_cell 256 | 70.9 | 2.2 | 3.8 | 4.8 | 5.9 |
| | CPP/MPI | inputs.3d, n_cell 256 | 70.9 | 3.9 | 7.5 | 14.3 | 21.5 |
| | CPP/OMP | inputs.3d, n_cell 256 | 70.9 | 2.2 | 1.9 | 1.3 | 1.0 |
| CESAR | | | | | | | |
| mocfe_bone | F90/MPI | 48 n 16 16 1 1 1 | 101 | 3.2 | 6.2 | 9.7 | 15.3 |
| nekbone | F90/MPI | lp=48 lelt=2400 | 243 | 3.6 | 7.1 | 15.1 | 27.0 |
| XSBench | C/OMP | | 960 | 5.0 | 9.1 | 12.3 | 18.1 |
Notes:
- The time in seconds is for one thread on one core.
- The 6/12/24/48-core columns give the speedup over that one-core run.
- We always try to pin threads/processes to the cores of a NUMA node, e.g. cores 0-5,
  6-11, etc. (a pinning sketch follows these notes).
- For hybrid models, we have one MPI process per NUMA node, and one OpenMP
thread per core.
- For CoMD, we use the OCL version, and only report the SoA results.
- Note that pinning threads gets overridden for CoMD.
- nekbone was modified from weak to strong scaling.
- For XSBench, when running in serial, the initialization time is about four
times the actual runtime, so we report the sum of the two values.
- ASPA and vodeDriver are omitted because they do not have parallel versions.
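The pinning described above can be illustrated with the Linux CPU-affinity API. The snippet below binds the calling thread or process to cores 0-5, i.e. one NUMA node under the numbering used here; it is only a sketch of the idea, and the actual runs may instead pin via the launcher or environment variables.

```c
/* Sketch: bind the calling thread/process to the six cores of one NUMA
 * node (cores 0-5 is an assumed numbering) using sched_setaffinity. */
#define _GNU_SOURCE
#include <sched.h>

int pin_to_numa_node0(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu <= 5; cpu++)   /* cores 0-5 = first NUMA node */
        CPU_SET(cpu, &mask);
    /* pid 0 means the calling thread; returns 0 on success */
    return sched_setaffinity(0, sizeof mask, &mask);
}
```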