Design Forward

This page describes communication patterns of several DOE full applications and associated mini-applications. It provides a more complete snapshot of the DOE workload requirements than was possible using the mini-apps alone. Where possible, we scaled the application traces up to 100,000 ranks to represent the target scale of an exascale system. Although the data is collected using one MPI rank per core, the correct way of interpreting this data for the purpose of architectural exploration is one MPI rank per network endpoint (e.g. MPI+X model). This interpretation offers sufficient scale for examination of the anticipated 100,000 nodes for an exascale system.

The application analysis includes both a DUMPI trace and a summary IPM analysis for each run. IPM provides a summary of the MPI functions used by each job, while the DUMPI traces record the details of each MPI message sent during the course of execution. Due to space constraints, the DUMPI trace only captures the communication for a single iteration of the code (excluding startup and shutdown communication). The communication pattern repeats for subsequent iterations, so the DUMPI traces should be treated as one iteration of a repeating communication pattern. In most cases, the IPM and DUMPI results come from two runs of the same input, so their results should be consistent. Because the intent of IPM is to give an overview of the communication patterns, some data may be averaged or binned, which prevents exact agreement with the DUMPI data.

Application Summary

The applications were selected to provide coverage of the diversity of communication patterns observed in the DOE workload. We attempted (to the best of our ability) to provide the minimum subset of applications that would provide that coverage. Because of concerns about the fidelity of mini-applications, we have provided two different versions of some of these applications -- the mini-app version (AMG, MiniAMR, SNAP, MiniFE, MiniDFT) and a snapshot of the full application (MultiGrid_C, BoxLib_AMR, PARTISN, FillBoundary). In addition, some traces come from kernels extracted from production applications (BigFFT, CrystalRouter).

Below is a summary of the applications.

  • AMG: Algebraic Multigrid Solver
    • Author: Ulrike Yang (LLNL), trace collected by Matt Cordery (LBNL)
    • Type: MiniApp (simplified version of the real thing)
    • Pattern: Regional communication with decreasing message size for different parts of the multigrid v-cycle.
    • Uses: Unstructured mesh physics packages
  • PARTISN: Discrete Ordinates neutral-particle transport
    • Author: many people, with trace collected by Scott Pakin (LANL)
    • Type: Full Application (this is a trace of the production application). SNAP is a mini-app version of the same algorithm.
    • Pattern: 2D decomposition (nearest neighbor communication) with 1D wavefront communication pattern due to data dependencies.
    • Uses: Radiation Transport / Neutron Transport
  • SNAP: Discrete Ordinates neutral-particle transport
    • Author: Setup and trace collected by Scott Pakin (LANL)
    • Type: Mini-app
    • Pattern: 2D decomposition (nearest neighbor communication) with 1D wavefront communication pattern due to data dependencies.
    • Uses: Radiation Transport / Neutron Transport
  • Big FFT: large 3D FFT with 2D domain decomposition pattern
    • Author: Dave Richards (LLNL), with Matt Cordery (LBNL) collecting the traces.
    • Type: Extracted Kernel from full application.
    • Pattern: One large 3D FFT with 2D domain decomposition (pencils) pattern. Requires two full transposes with many-to-many communication pattern.
    • Uses: Poisson solvers, fluid turbulence, real-to-frequency space transforms
  • MiniDFT: Density Functional Theory / Ab-Initio Electronic Structure
    • Author: Brian Austin (LBNL). Brian also collected the trace
    • Type: Mini Application (represents the VASP electronic structure calculation code, which is the largest single contributor to the SC computing workload)
    • Pattern: Many small 3D FFTs (using spherical excision from real-space), and dense linear algebra for orthogonalization of basis functions (1D communication pattern).
    • Uses: Ab-Initio chemistry / Electronic structure
  • Fill Boundary: Halo update from production PDE solver code (BoxLib)
    • Author: CCSE with trace collected by Vince Beckner (LBNL) and Ann Almgren (LBNL).
    • Type: Snapshot of kernel from Full Application (this is a trace of the production application). MiniGhost from the Mantevo suite provides an analogous mini-app.
    • Pattern: Nearest Neighbor 3D block domain decomposition. Communicates directly with 27 neighbors in 3D lattice.
    • Uses: PDE Solvers on Block Structure Grids
  • MultiGrid: Geometric Multigrid V-Cycle from production elliptic solver (BoxLib)
    • Author: CCSE with trace collected by Vince Beckner (LBNL) and Ann Almgren (LBNL).
    • Type: Snapshot of Full Application (this is a trace of the production application). AMG is a related mini-app, but AMG is Algebraic Multigrid (different algorithm).
    • Pattern: 3D lattice communication pattern (similar to FillBoundary) with decreasing message size and collectives for different parts of the multigrid v-cycle.
    • Uses: Structured grid physics packages
  • AMR Boxlib: Full Adaptive Mesh Refinement (AMR) V-Cycle from production cosmology code (BoxLib/Castro)
    • Author: CCSE with trace collected by Vince Beckner (LBNL) and Ann Almgren (LBNL).
    • Type: Snapshot of Full Application (this is a trace of the production application). MiniAMR is a related mini-app version.
    • Pattern: Irregular (but sparse) communication pattern with adaptation.
    • Uses: Structured grid Adaptive Mesh Refinement PDE solvers (many applications)
  • Crystal Router: scalable many-to-many MPI communication pattern
    • Author: Paul Fischer (ANL), Katherine Heisey (ANL) collected the traces
    • Type: Extracted kernel of Full Application (this is a trace of an extracted communication kernel for the Nek5000 production application).
    • Pattern: All-to-all communication through scalable multi-stage communication process.
    • Uses: Nek5000, with application to other FEM/engineering codes.
  • MiniFE: Finite Element solver
    • Author: Mike Heroux (Sandia) with traces collected by Simon Hammond (Sandia)
    • Type: Mini-application (part of the Mantevo suite).
    • Pattern: Nearest-neighbor halo exchange for the sparse matrix-vector products, plus small global reductions (allreduce) in the conjugate-gradient solver.
    • Uses: Finite Element / Engineering codes.
  • Mini AMR: AMR V-Cycle proxy application
    • Author: Courtenay Vaughan, Richard Barrett (Sandia)
    • Type: Mini-application version of AMR communication pattern
    • Pattern: Irregular (but sparse) communication pattern with adaptation.
    • Uses: Structured grid Adaptive Mesh Refinement PDE solvers

Communication Traces and Analysis

AMG

AMG is an Algebraic Multigrid Solver for unstructured mesh physics packages.

We collected overall communication data and mesh topology data using IPM, and MPI traces using the DUMPI library.

The source code may be downloaded here:

AMG source code tarball

The source code is from the CORAL procurement and is also available at

AMG at CORAL procurement site

The run parameters are: 

  • Problem size: 40 x 40 x 40

Available IPM data:  

AMG 8 MPI tasks 2x2x2 decomposition

AMG 27 MPI tasks 3x3x3 decomposition

AMG 216 MPI tasks 6x6x6 decomposition

AMG 1728 MPI tasks 12x12x12 decomposition

AMG 13824 MPI tasks 24x24x24 decomposition

Available DUMPI MPI traces:

Note that, unlike the IPM data, the DUMPI traces only track the communication for one V-cycle of the multigrid sequence. This was accomplished by turning off profiling for most of the code using the DUMPI API and turning it on only in the single routine in the file parcsr_ls/par_cycle.c. Within this file, the variable maxit_prec was set equal to 1. A sketch of this selective-tracing approach appears after the list of traces below.

AMG 8 MPI tasks 2x2x2 decomposition (133 Kb)

AMG 27 MPI tasks 3x3x3 decomposition (803 Kb)

AMG 216 MPI tasks 6x6x6 decomposition (11 Mb)

AMG 1728 MPI tasks 12x12x12 decomposition (108 Mb)

AMG 13824 MPI tasks 24x24x24 decomposition (954 Mb)
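
The selective tracing described above is typically done by bracketing the region of interest with the DUMPI library's profiling-control calls. The following is a minimal sketch, not the actual AMG modification: it assumes the libdumpi_enable_profiling / libdumpi_disable_profiling entry points provided by the SST DUMPI library, and the solver call is shown only as a placeholder for the code in parcsr_ls/par_cycle.c.

  /* Minimal sketch: restrict a DUMPI trace to a single V-cycle.
   * Assumes the profiling-control functions declared in the SST DUMPI
   * header <dumpi/libdumpi/libdumpi.h>; the solve call is a placeholder. */
  #include <mpi.h>
  #include <dumpi/libdumpi/libdumpi.h>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);

    libdumpi_disable_profiling();   /* skip setup and startup communication */
    /* ... build the grid, assemble the matrix, set up the AMG hierarchy ... */

    libdumpi_enable_profiling();    /* trace exactly one V-cycle */
    /* ... call the solve, with maxit_prec = 1 so only one cycle runs ... */
    libdumpi_disable_profiling();   /* skip teardown communication */

    MPI_Finalize();
    return 0;
  }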

PARTISN

PARTISN solves the discrete-ordinates neutral-particle transport equation. The code is unclassified but contains export-control restrictions. What makes PARTISN interesting from the perspective of Design Forward is that the blocking factor can be tuned to amortize communication overhead or to increase available parallelism. This is important to the underlying KBA sweep algorithm, which communicates as a 1-D diagonal wavefront across a 2-D domain decomposition. In other words, P domains expose only about √P parallelism. Reducing the blocking factor increases parallelism by more aggressively pipelining the computation along each wavefront, but doing so injects a larger number of smaller messages into the network.
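
The sketch below illustrates the resulting communication pattern for one sweep direction. It is a simplified, hypothetical KBA-style sweep rather than PARTISN source: each rank in the 2-D decomposition waits for the upstream faces from its -x and -y neighbors, computes one block of planes, and forwards its downstream faces. The parameter nblk plays the role of nchunk, and the face message size scales with it, so a smaller blocking factor means more, smaller messages per sweep.

  /* Simplified sketch of a KBA-style wavefront sweep on a 2-D rank grid
   * (hypothetical illustration, not PARTISN code).  'nblk' plays the role
   * of PARTISN's nchunk blocking factor; 'face_len' is the number of
   * doubles on one subdomain face for a block of nblk planes, so it also
   * shrinks as the blocking factor is reduced. */
  #include <mpi.h>
  #include <stdlib.h>

  static void kba_sweep(MPI_Comm cart2d, int nz, int nblk, int face_len)
  {
    int west, east, south, north;
    MPI_Cart_shift(cart2d, 0, 1, &west, &east);    /* x-direction neighbors */
    MPI_Cart_shift(cart2d, 1, 1, &south, &north);  /* y-direction neighbors */

    double *xface = calloc((size_t)face_len, sizeof *xface);
    double *yface = calloc((size_t)face_len, sizeof *yface);

    for (int k0 = 0; k0 < nz; k0 += nblk) {        /* pipeline over blocks of planes */
      /* wavefront dependency: wait for the upstream faces */
      if (west  != MPI_PROC_NULL)
        MPI_Recv(xface, face_len, MPI_DOUBLE, west,  0, cart2d, MPI_STATUS_IGNORE);
      if (south != MPI_PROC_NULL)
        MPI_Recv(yface, face_len, MPI_DOUBLE, south, 1, cart2d, MPI_STATUS_IGNORE);

      /* ... perform the transport sweep for planes k0 .. k0+nblk-1 ... */

      /* forward the downstream faces so the next ranks can start this block */
      if (east  != MPI_PROC_NULL)
        MPI_Send(xface, face_len, MPI_DOUBLE, east,  0, cart2d);
      if (north != MPI_PROC_NULL)
        MPI_Send(yface, face_len, MPI_DOUBLE, north, 1, cart2d);
    }
    free(xface);
    free(yface);
  }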

We collected DUMPI traces on LANL's Mustang supercomputer. The first trace represents a typical PARTISN run, using OpenMP within each node, MPI across nodes, and a blocking factor (nchunk in the input deck) set to a reasonable value for a cluster like Mustang. The input deck (included in the tarball) is the sntiming input deck weak-scaled to 1008 cores.

PARTISN, 42 MPI ranks, 1 rank/node, 24 threads/rank, typical blocking factor (14 GB)

The second trace represents how one might run PARTISN at exascale. It uses a blocking factor of 1, which exposes maximal parallelism but at the cost of substantial stress on the network: a much larger number of much smaller messages than in a typical PARTISN run. It was run with MPI everywhere to expose in the trace all of the program's communication. As before, the input deck (included in the tarball) is the sntiming input deck weak-scaled to 1008 cores.

PARTISN, 1008 MPI ranks, 24 ranks/node, 1 thread/rank, minimal blocking factor (4.8 TB)

Note that all that message traffic leads to the trace being big. Really big. You just won't believe how vastly, hugely, mindbogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to an nchunk=1 PARTISN trace with over a thousand MPI ranks. You'll probably need to have about 30 TB of free disk space to hold both the tarball and the uncompressed DUMPI files.

The third trace, like the first, represents a typical PARTISN run. It is in fact more typical than the first trace because the input deck was constructed by someone with substantial PARTISN knowledge (Joe Zerr, the author of the SNAP mini-app; see below). This input deck was further constructed concurrently with a SNAP input deck to support comparisons between PARTISN and SNAP. See Joe Zerr's description of the similarities and differences between the two input decks for more information. The input deck, pin1008, is included in the tarball.

PARTISN, 168 MPI ranks, 4 ranks/node, 6 threads/rank, typical blocking factor (1.3 GB)

Also available for the above: IPM graphs (HTML) and raw IPM data (XML)

SNAP

SNAP, the SN (Discrete Ordinates) Application Proxy, is the mini-app that corresponds to PARTISN. See the preceding description of PARTISN for an explanation of what makes SNAP interesting from a communication perspective.

The source code for SNAP is available from GitHub:

SNAP source code

We collected DUMPI traces on LANL's Mustang supercomputer. The following trace represents a typical SNAP run, using OpenMP within each node, MPI across nodes, and a blocking factor (ichunk in the input deck) set to a reasonable value for a cluster like Mustang. The input deck (sin1008 in the tarball) was designed to mimic as closely as possible the corresponding PARTISN input deck (pin1008 in the third PARTISN tarball above). See Joe Zerr's description of the similarities and differences between the two input decks.

SNAP, 168 MPI ranks, 4 ranks/node, 6 threads/rank, typical blocking factor (DUMPI trace, 3.4 GB)

Also available for the above: IPM graphs (HTML) and raw IPM data (XML)

Big FFT

BigFFT solves a 3D FFT problem (forward and inverse FFTs using FFTW). Two problem sizes are defined for each MPI task count. The 3D problem size is defined by the spatial node counts nx, ny, and nz. The number of rows and columns defines the 2D decomposition of the MPI tasks handling the FFTs.
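
As a rough sketch of the dominant communication (an assumed pattern, not the BigFFT source), each of the two transposes between pencil orientations is an all-to-all among the ranks sharing a row (or column) of the nrow x ncol process grid:

  /* Sketch of one pencil transpose in a 2-D (nrow x ncol) process grid.
   * Illustrative only; BigFFT's actual packing and data layout differ.
   * sendbuf and recvbuf each hold chunk * ncol doubles. */
  #include <mpi.h>

  void transpose_step(MPI_Comm world, int ncol, double *sendbuf,
                      double *recvbuf, int chunk /* doubles per partner */)
  {
    int rank;
    MPI_Comm_rank(world, &rank);

    /* ranks that share a row of the process grid exchange data */
    MPI_Comm row_comm;
    MPI_Comm_split(world, rank / ncol /* row index */, rank, &row_comm);

    /* many-to-many exchange that re-orients the pencils */
    MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                 recvbuf, chunk, MPI_DOUBLE, row_comm);

    MPI_Comm_free(&row_comm);
  }

A full 3D FFT requires two such transposes (for example x-pencils to y-pencils, then y-pencils to z-pencils), which is the source of the many-to-many pattern seen in the traces.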

The source code may be downloaded here:

BigFFT source code tarball

Small problem:

  • 9 MPI tasks: nx=ny=nz=48, nrow=ncol=3
  • 100 MPI tasks: nx=ny=nz=100, nrow=ncol=10
  • 1024 MPI tasks: nx=ny=nz=216, nrow=ncol=32
  • 10000 MPI tasks: nx=ny=nz=462, nrow=ncol=100
  • 99856 MPI tasks: nx=ny=nz=1000, nrow=ncol=316

Medium problem:

  • 9 MPI tasks: nx=ny=nz=210, nrow=ncol=3
  • 100 MPI tasks: nx=ny=nz=462, nrow=ncol=10
  • 1024 MPI tasks: nx=ny=nz=1000, nrow=ncol=32
  • 10000 MPI tasks: nx=ny=nz=2240, nrow=ncol=100
  • 99856 MPI tasks: nx=ny=nz=4608, nrow=ncol=316

Available IPM data (small problem):  

BigFFT 9 MPI tasks

BigFFT 100 MPI tasks

BigFFT 1024 MPI tasks

BigFFT 10000 MPI tasks

BigFFT 99856 MPI tasks

Available DUMPI data (small problem):  

BigFFT 9 MPI tasks (9K)

BigFFT 100 MPI tasks (217K)

BigFFT 1024 MPI tasks (6.6M)

BigFFT 10000 MPI tasks (659M)

BigFFT 99856 MPI tasks (50G)

Available IPM data (medium problem):  

BigFFT 9 MPI tasks

BigFFT 100 MPI tasks

BigFFT 1024 MPI tasks

BigFFT 10000 MPI tasks

BigFFT 99856 MPI tasks

Available DUMPI data (medium problem):  

BigFFT 9 MPI tasks (1K)

BigFFT 100 MPI tasks (220K)

BigFFT 1024 MPI tasks (6.8M)

BigFFT 10000 MPI tasks (664M)

BigFFT 99856 MPI tasks (50G)

MiniDFT

MiniDFT is a plane-wave density functional theory (DFT) code extracted from Quantum Espresso. Given a set of atomic coordinates and pseudopotentials, MiniDFT computes self-consistent solutions of the Kohn-Sham equations. For each iteration of the self-consistent field cycle, the Fock matrix is constructed and then diagonalized. To build the Fock matrix, Fast Fourier Transforms are used to transform orbitals from the plane-wave basis (where the kinetic energy is most readily computed) to real space (where the potential is evaluated) and back. Davidson diagonalization is used to compute the orbital energies and update the orbital coefficients.

The MiniDFT source code may be downloaded from the QE-forge website: MiniDFT source

The input files and the parallel decompositions used to collect these communication profiles may be downloaded here: MiniDFT input

We collected communication data and mesh topology data using IPM. MPI traces were collected using the DUMPI library. These MiniDFT runs include a substantial initialization phase and one iteration of the SCF cycle. Only the SCF iteration is relevant; it is identified as the "Benchmark" region in the IPM output. The DUMPI data measures only the SCF cycle region.
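
For readers reproducing the measurement, IPM's named regions are conventionally delimited with MPI_Pcontrol calls. The short sketch below only illustrates that convention; the region name matches the IPM output described above, but the surrounding code is a hypothetical stand-in, not MiniDFT source.

  /* Sketch of delimiting the IPM "Benchmark" region around the SCF cycle.
   * Assumes the usual IPM convention that MPI_Pcontrol(1, name) enters a
   * named region and MPI_Pcontrol(-1, name) exits it. */
  #include <mpi.h>

  static void scf_iteration(void)
  {
    /* ... hypothetical stand-in for one SCF cycle: FFTs + dense linear algebra ... */
  }

  void run_benchmark_region(void)
  {
    MPI_Pcontrol( 1, "Benchmark");     /* region enter, as reported by IPM */
    scf_iteration();
    MPI_Pcontrol(-1, "Benchmark");     /* region exit */
  }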

Available IPM data:

MiniDFT 9 MPI tasks 3x3x3 supercell           Raw IPM Data (xml)

MiniDFT 30 MPI tasks 4x4x4 supercell           Raw IPM Data (xml)

MiniDFT 76 MPI tasks 5x5x5 supercell           Raw IPM Data (xml)

MiniDFT 125 MPI tasks 6x6x6 supercell           Raw IPM Data (xml)

MiniDFT 192 MPI tasks 7x7x7 supercell           Raw IPM Data (xml)

MiniDFT 315 MPI tasks 8x8x8 supercell           Raw IPM Data (xml)

MiniDFT 424 MPI tasks 9x9x9 supercell           Raw IPM Data (xml)

MiniDFT 648 MPI tasks 10x10x10 supercell           Raw IPM Data (xml)

MiniDFT 900 MPI tasks 11x11x11 supercell           Raw IPM Data (xml)

MiniDFT 2520 MPI tasks 12x12x12 supercell           Raw IPM Data (xml)

MiniDFT 4525 MPI tasks 13x13x13 supercell           Raw IPM Data (xml)

MiniDFT 14650 MPI tasks 14x14x14 supercell           Raw IPM Data (xml) (155 GB)
To avoid inadvertent downloads of very large files, a direct link to the Raw IPM data is not provided.
The file can be obtained using the following filename relative to the current directory.
doe-miniapps-ipm-files/MiniDFT/MDFT_14/MDFT_14.ipm.xml

Available DUMPI data:
Update June 30, 2014: The original DUMPI traces for MiniDFT included only the Benchmark region and excluded the MPI initialization and communicator setup functions. This absence of the MPI setup functions prevented the traces from being used by the DUMPI library. The updated traces include the MPI setup functions and the Benchmark region.

MiniDFT 9 MPI tasks 3x3x3 supercell (Updated) (Original) (2.5 MB)

MiniDFT 30 MPI tasks 4x4x4 supercell (Updated) (Original) (24 MB)

MiniDFT 76 MPI tasks 5x5x5 supercell (Updated) (Original) (96 MB)

MiniDFT 125 MPI tasks 6x6x6 supercell (Updated) (Original) (286 MB)

MiniDFT 192 MPI tasks 7x7x7 supercell (Updated) (Original) (583 MB)

MiniDFT 315 MPI tasks 8x8x8 supercell (Updated) (Original) (1.4 GB)

MiniDFT 424 MPI tasks 9x9x9 supercell (Updated) (Original) (2.6 GB)

MiniDFT 648 MPI tasks 10x10x10 supercell (Updated) (Original) (5.2 GB)

MiniDFT 900 MPI tasks 11x11x11 supercell (Updated) (Original) (9.9 GB)

MiniDFT 2520 MPI tasks 12x12x12 supercell (Updated) (Original) (34 GB)

MiniDFT 4525 MPI tasks 13x13x13 supercell (Updated) (Original) (76 GB)

MiniDFT 14650 MPI tasks 14x14x14 supercell (335 GB)
To avoid inadvertent downloads of this very large file, a direct link is not provided.
The file can be obtained using the following filename relative to the current directory.
doe-miniapps-mpi-traces/MiniDFT/df_MiniDFT_n14650_dumpi.tar.gz

FillBoundary

FillBoundary is a very simple code designed to profile communication patterns associated with the MultiFab::FillBoundary operation, which exchanges ghost cells between the FABs in a single MultiFab.
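
The sketch below shows the shape of such an exchange on a regular, periodic 3-D rank lattice. It is a hypothetical illustration, not BoxLib code; the real FillBoundary exchanges variably sized ghost regions between irregular lists of boxes. Each rank posts nonblocking receives and sends for the ghost data shared with each of its 26 neighbors, then waits for completion.

  /* Sketch of a 26-neighbor ghost-cell exchange on a periodic 3-D rank
   * lattice (illustrative only; not the BoxLib implementation).
   * ghost_send[i] / ghost_recv[i] hold the packed ghost data for the
   * neighbor in direction i, with count[i] doubles each. */
  #include <mpi.h>

  void fill_boundary(MPI_Comm cart3d, double *ghost_send[27],
                     double *ghost_recv[27], const int count[27])
  {
    MPI_Request req[52];               /* 26 sends + 26 receives */
    int nreq = 0, me, my_coords[3];
    MPI_Comm_rank(cart3d, &me);
    MPI_Cart_coords(cart3d, me, 3, my_coords);

    for (int dz = -1; dz <= 1; ++dz)
      for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
          if (dx == 0 && dy == 0 && dz == 0) continue;
          int nbr_coords[3] = { my_coords[0] + dx, my_coords[1] + dy,
                                my_coords[2] + dz };
          int nbr;
          MPI_Cart_rank(cart3d, nbr_coords, &nbr);   /* periodic wrap-around */

          /* tag by direction as seen from the receiver so that messages from
           * the same neighbor in different directions cannot be confused */
          int i = (dz + 1) * 9 + (dy + 1) * 3 + (dx + 1);
          MPI_Irecv(ghost_recv[i], count[i], MPI_DOUBLE, nbr, i,
                    cart3d, &req[nreq++]);
          MPI_Isend(ghost_send[i], count[i], MPI_DOUBLE, nbr, 26 - i,
                    cart3d, &req[nreq++]);
        }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
  }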

The FillBoundary source code may be downloaded from: BoxLib MiniApps source. After untarring the file, look in BoxLib/MiniApps for the code.

Available DUMPI data:

FillBoundary 125 MPI processes (1.8 MB)

FillBoundary 1000 MPI processes (17 MB)

FillBoundary 10648 MPI processes (196 MB)

FillBoundary 110592 MPI processes (2.1 GB)

Available IPM data:

FillBoundary 125 MPI processes

FillBoundary 1000 MPI processes

FillBoundary 10648 MPI processes

FillBoundary 110592 MPI processes

Geometric MultiGrid

The MultiGrid_C code solves

a * alpha * soln - b * div(beta * grad(soln)) = rhs,

where a and b are scalar constants, and alpha, beta, rhs, and soln are arrays. It uses the linear solvers from BoxLib. Only one V-cycle is profiled.
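
As a rough illustration of the pattern this produces (a hypothetical sketch of a geometric multigrid V-cycle's communication, not the BoxLib solver), each level performs a FillBoundary-style halo exchange whose message size shrinks as the grid coarsens, and the coarsest level is dominated by collectives:

  /* Hypothetical sketch of the communication in one geometric multigrid
   * V-cycle (not the BoxLib solver).  Each level exchanges faces with its
   * 6 neighbors on a 3-D rank lattice; the local grid size n halves at
   * every coarser level, so the face messages shrink quadratically. */
  #include <mpi.h>
  #include <stdlib.h>

  void v_cycle_comm(MPI_Comm cart3d, int n_fine, int nlevels)
  {
    int lo[3], hi[3];
    for (int d = 0; d < 3; ++d)
      MPI_Cart_shift(cart3d, d, 1, &lo[d], &hi[d]);

    for (int lev = 0, n = n_fine; lev < nlevels; ++lev, n /= 2) {
      int face = n * n;                      /* doubles on one face of an n^3 subgrid */
      double *snd = calloc((size_t)face, sizeof *snd);
      double *rcv = calloc((size_t)face, sizeof *rcv);

      for (int d = 0; d < 3; ++d) {          /* exchange both faces in each dimension */
        MPI_Sendrecv(snd, face, MPI_DOUBLE, hi[d], d,
                     rcv, face, MPI_DOUBLE, lo[d], d, cart3d, MPI_STATUS_IGNORE);
        MPI_Sendrecv(snd, face, MPI_DOUBLE, lo[d], 3 + d,
                     rcv, face, MPI_DOUBLE, hi[d], 3 + d, cart3d, MPI_STATUS_IGNORE);
      }
      /* ... smooth, form the residual, restrict it to the next coarser grid ... */
      free(snd);
      free(rcv);
    }

    /* coarsest level: the problem is tiny and collectives dominate */
    double local = 0.0, global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, cart3d);

    /* the up-sweep (prolongation and post-smoothing) repeats the halo
     * exchanges above with the message sizes growing back to the fine level */
  }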

The MultiGrid_C source code may be downloaded from: BoxLib MiniApps source. After untarring the file, look in BoxLib/MiniApps for the code.

Available DUMPI data:

MultiGrid_C 125 MPI processes (5.0 MB)

MultiGrid_C 1000 MPI processes (30 MB)

MultiGrid_C 10648 MPI processes (246 MB)

MultiGrid_C 110592 MPI processes (2.2 GB)

Available IPM data:

MultiGrid_C 125 MPI processes

MultiGrid_C 1000 MPI processes

MultiGrid_C 10648 MPI processes

MultiGrid_C 110592 MPI processes

AMR BoxLib

The BoxLibAMR_MiniApp code does a single time step of an AMR run with compressible hydrodynamics and self-gravity. It represents a snapshot of a production AMR simulation code produced by the Center for Computational Sciences and Engineering (CCSE). This example was collected by Vince Beckner of CCSE.

The BoxLibAMR_MiniApp source code may be downloaded from: BoxLibAMR_MiniApp source. Untarring the file will create a directory named BoxLibAMR_MiniApp. A version of BoxLib is included in the directory. See the README file for information on how to build and run the code.

Available DUMPI data:

BoxLibAMR_MiniApp 64 MPI processes (15 MB)

BoxLibAMR_MiniApp 1728 MPI processes (581 MB)

BoxLibAMR_MiniApp 13824 MPI processes (5.5 GB) This file will uncompress to 322 GB.

BoxLibAMR_MiniApp 110592 MPI processes (not available)

Available IPM data:

BoxLibAMR_MiniApp 64 MPI processes

BoxLibAMR_MiniApp 1728 MPI processes

BoxLibAMR_MiniApp 13824 MPI processes

BoxLibAMR_MiniApp 110592 MPI processes (not available)

Crystal Router

The crystal router test code demonstrates the scalable 'many-to-many' MPI communication pattern used in the Nek5000 application code. Unlike MPI_Alltoall, which requires memory that scales with the processor count, the crystal router communication pattern requires only the amount of memory needed to transfer the data.
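
A compact way to see why the memory footprint stays bounded is the hypercube formulation sketched below. This is a hypothetical illustration of the idea, not the Nek5000/gslib implementation, and it assumes a power-of-two rank count: in each of log2(P) stages every rank exchanges with one partner, handing over only the items whose destinations lie in the partner's half of the machine, so a rank never buffers more than the items currently routed through it.

  /* Hypothetical sketch of a crystal-router style many-to-many exchange
   * (not the Nek5000/gslib code).  Items are (destination rank, payload)
   * pairs packed as two doubles; *items must be heap-allocated and is
   * reallocated as items are routed.  Assumes a power-of-two rank count. */
  #include <mpi.h>
  #include <stdlib.h>
  #include <string.h>

  void crystal_route(MPI_Comm comm, double **items, int *nitems)
  {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int bit = 1; bit < size; bit <<= 1) {
      int partner = rank ^ bit;              /* my pair in the other half */

      /* split local items: keep those on my side of this bit, forward the rest */
      int n = *nitems, nkeep = 0, nfwd = 0;
      double *keep = malloc(2 * (size_t)n * sizeof *keep);
      double *fwd  = malloc(2 * (size_t)n * sizeof *fwd);
      for (int i = 0; i < n; ++i) {
        int dest = (int)(*items)[2 * i];
        if ((dest & bit) == (rank & bit)) {
          memcpy(&keep[2 * nkeep], &(*items)[2 * i], 2 * sizeof(double));
          ++nkeep;
        } else {
          memcpy(&fwd[2 * nfwd], &(*items)[2 * i], 2 * sizeof(double));
          ++nfwd;
        }
      }

      /* exchange counts, then the items themselves, with the partner */
      int nrecv = 0;
      MPI_Sendrecv(&nfwd, 1, MPI_INT, partner, 0,
                   &nrecv, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);

      double *next = malloc(2 * (size_t)(nkeep + nrecv) * sizeof *next);
      memcpy(next, keep, 2 * (size_t)nkeep * sizeof(double));
      MPI_Sendrecv(fwd, 2 * nfwd, MPI_DOUBLE, partner, 1,
                   next + 2 * nkeep, 2 * nrecv, MPI_DOUBLE, partner, 1,
                   comm, MPI_STATUS_IGNORE);

      free(keep); free(fwd); free(*items);
      *items  = next;
      *nitems = nkeep + nrecv;
    }
    /* after the last stage every item resides on its destination rank */
  }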

The Crystal Router test source code may be downloaded from: Crystal Router test source.

Available DUMPI data:

Crystal Router 10 MPI processes

Crystal Router 100 MPI processes

Crystal Router 1000 MPI processes

Crystal Router 10000 MPI processes

Available IPM data:

Crystal Router 10 MPI processes

Crystal Router 100 MPI processes

Crystal Router 1000 MPI processes

Crystal Router 10000 MPI processes

Mantevo Mini-Apps

This section contains traces from Mantevo mini-apps that have not yet been scaled to large processor counts. They can be compared against their full production analogues (for example, MiniAMR is a mini-app, while BoxLibAMR is extracted from a full production application and therefore carries additional complexity).

MiniFE

MiniFE is a Finite Element mini-application that implements a couple of kernels representative of implicit finite-element applications. It assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements, and then solves the linear system using a simple unpreconditioned conjugate-gradient algorithm.
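
The resulting communication pattern (a simplified sketch, not MiniFE source) has two main ingredients: a nearest-neighbor exchange of shared boundary entries before each sparse matrix-vector product, and small allreduce collectives for the dot products in every CG iteration.

  /* Simplified sketch of the per-iteration communication in a distributed
   * conjugate-gradient solve (illustrative; not MiniFE source code).
   * Each matvec needs the neighbor-owned entries of the search direction
   * (halo exchange), and each dot product needs a global MPI_Allreduce. */
  #include <mpi.h>

  double dot(MPI_Comm comm, const double *x, const double *y, int n)
  {
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; ++i) local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);  /* small collective */
    return global;
  }

  void exchange_halo(MPI_Comm comm, double *p, int nnbr, const int *nbr,
                     const int *count, const int *send_off, const int *recv_off)
  {
    MPI_Request req[2 * 26];             /* assumes at most 26 neighbors */
    int nreq = 0;
    for (int k = 0; k < nnbr; ++k) {
      MPI_Irecv(p + recv_off[k], count[k], MPI_DOUBLE, nbr[k], 0, comm, &req[nreq++]);
      MPI_Isend(p + send_off[k], count[k], MPI_DOUBLE, nbr[k], 0, comm, &req[nreq++]);
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
  }

Each CG iteration then amounts to one halo exchange plus a local matvec, two dot products, and a few local vector updates, which is why the traces show many small messages to a handful of neighbors interleaved with tiny allreduces.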

The source code may be downloaded here:

MiniFE source code tarball

Available IPM data:  

MiniFE 18 MPI tasks, nx=ny=nz=307

MiniFE 144 MPI tasks, nx=ny=nz=614

MiniFE 1152 MPI tasks, nx=ny=nz=1228

Available DUMPI MPI traces:

MiniFE 18 MPI tasks, nx=ny=nz=307 (1.3 M)

MiniFE 144 MPI tasks, nx=ny=nz=614 (18.1 M)

MiniFE 1152 MPI tasks, nx=ny=nz=1228 (187 M)

MiniAMR

MiniAMR is a mini-app designed to support the study of adaptive mesh refinement (AMR) codes at scale.

We collected communication traces for MiniAMR using DUMPI on the Vulcan BlueGene/Q machine operated by Lawrence Livermore National Laboratory. MiniAMR allows the user to specify a number of parameters affecting refinement, including disabling refinement entirely and running in a uniform mode. When run without refinement, the code mimics simple halo exchanges between MPI ranks. Traces are provided for both modes below.

To obtain the source code or ask questions about MiniAMR, please contact Courtenay Vaughan at Sandia National Laboratories (ctvaugh@sandia.gov).

Available DUMPI Traces (Running in Uniform (Non-Adaptive Mode)):

Coming Soon

Available DUMPI Traces (Running in Adaptive Mesh Mode):

MiniAMR 1024 Ranks (39GB)

MiniAMR 2048 Ranks (87GB)

MiniAMR 4096 Ranks (212GB)