This page briefly introduces the DOE Miniapps and their programming models. We also provide a single-node performance evaluation for both serial and parallel runs.
ExMatEx (Exascale Co-Design Center for Materials in Extreme Environments)
- ASPA: Adaptive sampling.
- CoMD: Extensible molecular dynamics.
- HILO: Stochastic solutions to the Boltzmann transport equation, in two variants:
  - CMC = classic Monte Carlo (a generic sketch follows this list),
  - QDA-MC = quasi-diffusion accelerated Monte Carlo, designed for hybrid architectures.
- LULESH: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics.
- VPFFT++: Crystal viscoplasticity.
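HILO's "classic Monte Carlo" variant refers to the standard stochastic treatment of the transport equation: each particle streams an exponentially sampled distance to its next collision, where it is either absorbed or scattered. The function below is only a generic illustration of that step, not code taken from HILO; the 1D slab setting and the cross sections `sigma_t`/`sigma_s` are assumptions made for the example.

```c
/* Generic sketch of one "classic Monte Carlo" (CMC) transport step.
 * Illustration only; not taken from the HILO source. */
#include <math.h>
#include <stdlib.h>

static double urand(void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }

/* Advance one particle; returns 1 if it is absorbed, 0 if it scatters. */
int cmc_step(double *x, double *mu, double sigma_t, double sigma_s)
{
    double dist = -log(urand()) / sigma_t;   /* free-flight distance       */
    *x += *mu * dist;                        /* stream along direction mu  */
    if (urand() < sigma_s / sigma_t) {       /* collision: scatter...      */
        *mu = 2.0 * urand() - 1.0;           /* isotropic new direction    */
        return 0;
    }
    return 1;                                /* ...or absorb               */
}
```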
ExaCT (Center for Exascale Simulation of Combustion in Turbulence)
- Exp_CNS_NoSpec: A simple stencil-based test code that computes the hyperbolic component of a time-explicit advance of the compressible Navier-Stokes equations, using 8th-order finite differences in space and a 3rd-order, low-storage TVD Runge-Kutta scheme in time (sketched below).
- MultiGrid_C: A multigrid-based solver for a model linear elliptic system based on a centered second-order discretization (sketched below).
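Both ExaCT proxies are defined by their discretizations, so a minimal sketch may help. The functions below show the standard 8th-order centered first-difference stencil, of the kind Exp_CNS_NoSpec applies to the hyperbolic terms, and the centered second-order 7-point Laplacian behind MultiGrid_C's model elliptic system. The array layouts, names, and loop bounds are illustrative assumptions, not code from either app.

```c
/* 8th-order centered first derivative of f along one line of n points
 * (interior points only); standard central-difference coefficients. */
void deriv8(const double *f, double *df, int n, double dx)
{
    for (int i = 4; i < n - 4; i++)
        df[i] = ( 4.0/5.0   * (f[i+1] - f[i-1])
                - 1.0/5.0   * (f[i+2] - f[i-2])
                + 4.0/105.0 * (f[i+3] - f[i-3])
                - 1.0/280.0 * (f[i+4] - f[i-4]) ) / dx;
}

/* Centered second-order 7-point Laplacian at an interior point (i,j,k)
 * of a grid with nx*ny points per z-plane and mesh spacing h. */
double laplacian2(const double *u, int i, int j, int k,
                  int nx, int ny, double h)
{
#define U(a,b,c) u[(a) + nx*((b) + ny*(c))]
    return ( U(i+1,j,k) + U(i-1,j,k)
           + U(i,j+1,k) + U(i,j-1,k)
           + U(i,j,k+1) + U(i,j,k-1) - 6.0*U(i,j,k) ) / (h*h);
#undef U
}
```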
CESAR (Center for Exascale Simulation of Advanced Reactors)
- mocfe_bone: Deterministic neutronics code.
- nekbone: Solves a Poisson equation using a conjugate gradient iteration
  with no preconditioner on a block or linear geometry (sketched below).
- openmcbone: Monte Carlo neutronics code.
- XSBench: Calculation of macroscopic cross sections in a Monte Carlo
  particle transport code (sketched below).
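nekbone's kernel is the textbook unpreconditioned conjugate gradient iteration. The sketch below shows that algorithm with a matrix-free operator callback standing in for nekbone's spectral-element operator; names such as `apply_A` and `cg_solve` are assumptions for the example, not nekbone's API.

```c
/* Minimal unpreconditioned conjugate gradient sketch (not nekbone code). */
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef void (*matvec_fn)(const double *x, double *y, int n);

void cg_solve(matvec_fn apply_A, const double *b, double *x, int n,
              int maxit, double tol)
{
    double *r  = malloc(n * sizeof *r);
    double *p  = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    memset(x, 0, n * sizeof *x);
    memcpy(r, b, n * sizeof *r);            /* r = b - A*0 = b            */
    memcpy(p, r, n * sizeof *p);
    double rr = 0.0;
    for (int i = 0; i < n; i++) rr += r[i] * r[i];
    for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
        apply_A(p, Ap, n);                  /* Ap = A*p                   */
        double pAp = 0.0;
        for (int i = 0; i < n; i++) pAp += p[i] * Ap[i];
        double alpha = rr / pAp;
        double rr_new = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rr_new += r[i] * r[i];
        }
        for (int i = 0; i < n; i++)         /* no preconditioner: beta    */
            p[i] = r[i] + (rr_new / rr) * p[i];  /* uses plain residuals  */
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
}
```

XSBench's central quantity, the macroscopic cross section of a material at a given energy, is the sum over its nuclides of number density times microscopic cross section: Sigma(E) = sum_i N_i * sigma_i(E). The helper below is a hypothetical illustration of that sum; the energy-grid lookup of the microscopic cross sections (the memory-bound part XSBench actually stresses) is left as a callback.

```c
/* Illustrative macroscopic cross-section sum; names are placeholders,
 * not XSBench's API. */
double macro_xs(const int *nuclide, const double *density, int n_nuclides,
                double energy, double (*lookup_micro_xs)(int, double))
{
    double sigma = 0.0;
    for (int i = 0; i < n_nuclides; i++)    /* number density times micro xs */
        sigma += density[i] * lookup_micro_xs(nuclide[i], energy);
    return sigma;
}
```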
Single Node Performance
All runs were performed on a 48-core node with four 2.2 GHz AMD Opteron 6174
processors (four sockets, eight NUMA nodes) and 128 GB of memory in total.
Cache sizes are: L1 = 128 KB/core, L2 = 512 KB/core, L3 = 12 MB/socket.
We start with a single-core performance evaluation and then provide a parallel evaluation, still within a single node.
Serial Performance
All data is for a single complete run of the app, including any initialization
code; the run time is measured externally over the entire run.
| Name | Time (s) | Mflops | Mips | FP/cyc | Vec/cyc | Stall cycs | L2 hit | L1 BW | L2 BW | Mem VM | Mem RSS | LOC |
|------|----------|--------|------|--------|---------|------------|--------|-------|-------|--------|---------|-----|
| ExMatEx | | | | | | | | | | | | |
| ASPA | 32 | 3.6 | 2672 | 0.002 | 0.006 | 48 | 0.982 | 163 | 4.2 | 221 | 13 | 33081 |
| CoMD | 192 | 229 | 2979 | 0.104 | 0.383 | 46 | 0.481 | 15.5 | 7.9 | 37 | 12 | 2548 |
| HILO 1D | 40 | 469 | 2114 | 0.213 | 0.604 | 44 | 0.999 | 1.2 | 0.00 | 41 | 3 | 5003 |
| HILO 2D | 50 | 563 | 1900 | 0.256 | 0.536 | 40 | 1.000 | 857 | 0.01 | 41 | 3 | 5003 |
| LULESH | 342 | 1079 | 2303 | 0.491 | 0.896 | 60 | 0.270 | 1293 | 928 | 121 | 89 | 2350 |
| VPFFT | 70 | 622 | 2387 | 0.283 | 0.713 | 57 | 0.692 | 78 | 16 | 72 | 36 | 2637 |
| ExaCT | | | | | | | | | | | | |
| CNS_NoSpec | 35 | 795 | 1994 | 0.361 | 0.538 | 68 | 0.500 | 2602 | 1307 | 599 | 553 | 787 |
| MultiGrid_C | 90 | 577 | 2257 | 0.262 | 0.434 | 56 | 0.280 | 637 | 486 | 2553 | 2474 | 1704 |
| CESAR | | | | | | | | | | | | |
| mocfe_bone | 132 | 1096 | 3069 | 0.498 | 0.786 | 46 | 0.591 | 931 | 380 | 2366 | 2323 | 6252 |
| nekbone | 244 | 1460 | 3157 | 0.664 | 1.044 | 49 | 0.358 | 385 | 246 | 927 | 272 | 30105 |
| XSBench | 961 | 74 | 899 | 0.034 | 0.125 | 73 | 0.832 | 1557 | 261 | 1688 | 1661 | 663 |
Notes:
- KEYS:
- FP/cyc : floating point operations per cycle
- Vec/cyc : vector operations per cycle
- Stall cycs : % of cycles stalled on any resource
- L2 hit : L2 data cache hit rate
- L? BW : bandwidth to L? cache in MB/s
- Mem VM : peak virtual memory in MB
- Mem RSS : peak resident set size in MB
- LOC : lines of code
- For CoMD, we use the serial version, not the OCL version.
- For HILO, we use the CMC version.
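- As a rough consistency check on the keys: FP/cyc multiplied by the 2.2 GHz clock should approximate the Mflops column, e.g. LULESH's 0.491 FP/cyc × 2200 MHz ≈ 1080, close to the reported 1079 Mflops.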
Parallel Speedup
We present the performance and scaling behavior of the applications within a single node.
| Name | Languages | Parameters | Time (s) | 6 cores | 12 cores | 24 cores | 48 cores |
|------|-----------|------------|----------|---------|----------|----------|----------|
| ExMatEx | | | | | | | |
| CoMD | CPP/OCL | -e | 53.9 | 5.9 | 11.7 | 21.6 | 35.9 |
| HILO 1D | C/MPI | | 40.8 | 4.6 | 9.3 | 19.0 | 37.6 |
| HILO 2D | C/MPI | iters 1 | 50.4 | 4.8 | 9.4 | 18.7 | 37.3 |
| LULESH | CPP/OMP | | 339 | 4.0 | 5.4 | 4.3 | 3.8 |
| VPFFT | CPP/OMP | 50, 1, 0.01, 1e-5 | 71.3 | 2.6 | 3.2 | 3.5 | 3.8 |
| ExaCT | | | | | | | |
| CNS_NoSpec | F90/OMP/MPI | inputs_3d | 34.1 | 3.9 | 6.6 | 9.7 | 13.6 |
| | F90/MPI | inputs_3d | 34.1 | 3.3 | 5.8 | 6.1 | 6.3 |
| | F90/OMP | inputs_3d | 34.1 | 3.8 | 5.3 | 5.2 | 5.0 |
| MultiGrid_C | CPP/OMP/MPI | inputs.3d, n_cell 256 | 70.9 | 2.2 | 3.8 | 4.8 | 5.9 |
| | CPP/MPI | inputs.3d, n_cell 256 | 70.9 | 3.9 | 7.5 | 14.3 | 21.5 |
| | CPP/OMP | inputs.3d, n_cell 256 | 70.9 | 2.2 | 1.9 | 1.3 | 1.0 |
| CESAR | | | | | | | |
| mocfe_bone | F90/MPI | 48 n 16 16 1 1 1 | 101 | 3.2 | 6.2 | 9.7 | 15.3 |
| nekbone | F90/MPI | lp=48 lelt=2400 | 243 | 3.6 | 7.1 | 15.1 | 27.0 |
| XSBench | C/OMP | | 960 | 5.0 | 9.1 | 12.3 | 18.1 |
Notes:
- The time in seconds is for one thread on one core.
- The 6/12/24/48-core columns give the speedup over that one-core run.
- We always try to pin threads/processes to the cores of a NUMA node, e.g. cores 0-5,
  6-11, etc. (a pinning sketch follows these notes).
- For hybrid models, we have one MPI process per NUMA node, and one OpenMP
thread per core.
- For CoMD, we use the OCL version, and only report the SoA results.
- Note that pinning threads gets overridden for CoMD.
- nekbone was modified from weak to strong scaling.
- For XSBench, when running in serial, the initialization time is about four
times the actual runtime, so we report the sum of the two values.
- ASPA and vodeDriver are omitted because they do not have parallel versions.
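The pinning described above can be illustrated with the Linux CPU-affinity API. The snippet below binds the calling thread or process to cores 0-5, i.e. one NUMA node under the numbering used here; it is only a sketch of the idea, and the actual runs may instead pin via the launcher or environment variables.

```c
/* Sketch: bind the calling thread/process to the six cores of one NUMA
 * node (cores 0-5 is an assumed numbering) using sched_setaffinity. */
#define _GNU_SOURCE
#include <sched.h>

int pin_to_numa_node0(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu <= 5; cpu++)   /* cores 0-5 = first NUMA node */
        CPU_SET(cpu, &mask);
    /* pid 0 means the calling thread; returns 0 on success */
    return sched_setaffinity(0, sizeof mask, &mask);
}
```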