/******************************************************************************
 *
 * Copyright (C) 2009
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation under the terms of the GNU General Public License is hereby
 * granted. No representations are made about the suitability of this software
 * for any purpose. It is provided "as is" without express or implied warranty.
 * See the GNU General Public License for more details.
 *
 * Documents produced by Doxygen are derivative works derived from the
 * input used in their production; they are not affected by this license.
 *
 */
/*! \page decomp Describing decompositions

One of the biggest challenges in working with PIO is setting up the call to
\ref PIO_initdecomp. The user must describe exactly how the data within each
MPI task's memory should be placed on, or retrieved from, disk. PIO provides
several interfaces; we describe the simplest interface first and then progress
to the most complex and flexible one.

\section decomp_bc Block-cyclic interface

The simplest interface assumes that your arrays are decomposed in a
block-cyclic structure and can therefore be described with a simple
start-count approach. A block-cyclic decomposition for a 1-dimensional (1D)
array is illustrated in Figure 1. Note that the contiguous layout of the data
in memory can be easily mapped to a contiguous layout on disk. The \em start
argument gives the starting point of the block of contiguous memory, while
\em count gives the number of words in that block. Note that \em start and
\em count must be arrays whose length equals the dimensionality of the
distributed array. While we use a 1D array for simplicity, PIO currently
supports up to the Fortran limit of 7-dimensional (7D) arrays; for a 7D
array, the start and count arrays would be of length 7.

\image html block-cyclic.png "Figure 1: Setting up the \em start and \em count arrays for a single 1D array distributed across 3 MPI tasks."
\image latex block-cyclic.eps "Setting up the \em start and \em count arrays for a single 1D array distributed across 3 MPI tasks." width=10cm

The call to \ref PIO_initdecomp that implements the decomposition illustrated
in Figure 1 is listed below. The variable \em iosystem is created by the call
to \ref PIO_init. The second argument, \em PIO_double, is the PIO kind and
indicates that this is a decomposition for an 8-byte real. (For a list of
supported kinds see \ref PIO_kinds.) The argument \em dims gives the global
dimensions of the array. The \em start and \em count arrays are 8-byte
integers of type PIO_OFFSET, while \em iodesc is the IO descriptor generated
by the call to PIO_initdecomp.

\verbinclude simple-bc

\section rearr Controlling IO decomposition

The example above represents the simplest way to initialize and use PIO to
write out and read distributed arrays. However, PIO provides additional
features that allow greater control over the IO process. In particular, it
provides the ability to define an IO decomposition. A user-defined IO
decomposition is optional: if one is not provided and rearrangement is
necessary, PIO will compute an IO decomposition internally. The reason an IO
decomposition may be necessary is described in Section \ref decomp_dof below.
This flexibility makes it possible to define an intermediate decomposition,
distinct from the computational decomposition, which can be constructed to
maximize write or read performance to the disk subsystem.
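As a concrete point of reference before the IO decomposition is introduced,
the following is a minimal sketch of the block-cyclic call described in
Section \ref decomp_bc. It is not one of the included examples: the 8-word
global array, its 3/3/2 split across the three tasks, the subroutine wrapper,
and the \em myRank helper are illustrative assumptions, and the argument order
and derived type names (iosystem_desc_t, io_desc_t) should be checked against
the PIO version in use.

\verbatim
subroutine init_bc_decomp(iosystem, myRank, iodesc)
   ! Sketch only: the 8-word array, its 3/3/2 split, and the helper name
   ! myRank are illustrative assumptions rather than the included example.
   use pio
   implicit none
   type(iosystem_desc_t), intent(inout) :: iosystem  ! returned earlier by PIO_init
   integer,               intent(in)    :: myRank    ! this task's MPI rank (0, 1, or 2)
   type(io_desc_t),       intent(out)   :: iodesc    ! IO descriptor filled by PIO_initdecomp

   integer                  :: dims(1)               ! global dimensions of the array
   integer(kind=PIO_OFFSET) :: start(1), count(1)    ! block start and length on this task

   dims(1) = 8                       ! assumed 8-word global 1D array
   select case (myRank)
   case (0)
      start(1) = 1;  count(1) = 3    ! words 1-3 live on task 0
   case (1)
      start(1) = 4;  count(1) = 3    ! words 4-6 live on task 1
   case (2)
      start(1) = 7;  count(1) = 2    ! words 7-8 live on task 2
   end select

   call PIO_initdecomp(iosystem, PIO_double, dims, start, count, iodesc)
end subroutine init_bc_decomp
\endverbatim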
We extend the simple example in Figure 1 to include an IO decomposition in
Figure 1b.

\image html block-cyclic-rearr.png "Figure 1b: Block-cyclic decomposition with rearrangement"
\image latex block-cyclic-rearr.eps "Block-cyclic decomposition with rearrangement" width=10cm

Figure 1b illustrates the creation of an IO decomposition on two of the MPI
tasks. For this decomposition, the 8-word IO decomposition array and the
corresponding disk layout are evenly distributed between PE 0 (yellow) and
PE 2 (blue). The arrows in Figure 1b indicate the rearrangement that is
performed within the PIO library. In this case, PE 0 sends a word to PE 2,
illustrated by the shading from yellow to blue, while PE 1 sends two words to
PE 0, illustrated by the shading from red to yellow. The rearranged array in
the IO decomposition is subsequently written to disk. Note that in this case
only two of the three MPI tasks perform writes to disk. The number of MPI
tasks involved in IO to disk is specified in the call to \ref PIO_init using
a combination of the num_aggregator and stride parameters; for Figure 1b,
num_aggregator=3 and stride=2.

PIO allows the user to specify the IO decomposition using the optional
parameters \em iostart and \em iocount. The following code fragments for
PE 0, PE 1, and PE 2 illustrate the necessary calls to \ref PIO_initdecomp.

\verbinclude simple-bc-rearr
\verbinclude simple-bc-rearr-pe1
\verbinclude simple-bc-rearr-pe2

\section decomp_dof Degree of freedom interface

The interface described in Section \ref decomp_bc, while simple and used by
both pNetCDF (http://trac.mcs.anl.gov/projects/parallel-netcdf) and NetCDF-4
(http://www.unidata.ucar.edu/software/netcdf/), can be insufficient for
applications with non-trivial decompositions. While it is possible to use
multiple calls to construct a file with a non-trivial decomposition, the
performance penalty may be significant. Therefore, PIO provides a more
general interface to \ref PIO_initdecomp based on the degree of freedom
concept. Each word within the distributed array is given a unique value that
corresponds to its placement order in the file on disk: the first word in the
file on disk has a dof of 1, the second 2, and so on. This allows a fully
general specification of the decomposition. We illustrate its use in
Figure 2. Note that in Figure 2, PE 0 and PE 1 do not contain contiguous
pieces of the distributed array. The desired order on disk must be specified
using the compDOF argument to \ref PIO_initdecomp. In this case PE 0 contains
the 2nd, 4th, and 5th elements of the array, PE 1 contains the 1st and 3rd,
and PE 2 contains the 6th, 7th, and 8th elements of the array. The integer
compDOF array for each MPI task is illustrated at the bottom of Figure 2.

\image html dof.png "Figure 2: Setting up the compDOF arrays for a single 1D array distributed across 3 MPI tasks."
\image latex dof.eps "Setting up the compDOF arrays for a single 1D array distributed across 3 MPI tasks." width=10cm

The call to \ref PIO_initdecomp that implements Figure 2 on PE 0 is provided
below.

\verbinclude simple-dof

As with the block-cyclic interface, the degree of freedom interface provides
the ability to specify the IO decomposition through optional arguments to
\ref PIO_initdecomp.
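To make those optional arguments concrete, the sketch below shows the shape
such a call might take on PE 0, combining the compDOF values given above for
Figure 2 with an assumed IO decomposition that places the first four words of
the file on this task. The subroutine wrapper, the IO decomposition values,
and the derived type names are illustrative assumptions; the calls actually
used for Figure 3 are in the example included below.

\verbatim
subroutine init_dof_decomp_pe0(iosystem, iodesc)
   ! Sketch only: compDOF follows the Figure 2 description for PE 0; the IO
   ! decomposition (first 4 words of the file on this task) is an
   ! illustrative assumption, not necessarily the Figure 3 layout.
   use pio
   implicit none
   type(iosystem_desc_t), intent(inout) :: iosystem  ! returned earlier by PIO_init
   type(io_desc_t),       intent(out)   :: iodesc    ! IO descriptor filled by PIO_initdecomp

   integer                  :: dims(1)               ! global dimensions of the array
   integer                  :: compDOF(3)            ! disk position of each local word
   integer(kind=PIO_OFFSET) :: iostart(1), iocount(1)! this task's slice of the IO decomposition

   dims(1)    = 8                  ! 8-word global 1D array, as in Figure 2
   compDOF    = (/ 2, 4, 5 /)      ! PE 0 holds the 2nd, 4th, and 5th words of the file
   iostart(1) = 1                  ! assumed: PE 0 writes the first ...
   iocount(1) = 4                  ! ... 4 words of the file

   call PIO_initdecomp(iosystem, PIO_double, dims, compDOF, iodesc, &
                       iostart=iostart, iocount=iocount)
end subroutine init_dof_decomp_pe0
\endverbatim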
\image html dof-rearr.png "Figure 3: Setting up the compDOF arrays and the IO decomposition for a single 1D array distributed across 3 MPI tasks and written from 2 tasks after rearrangement within the PIO library"
\image latex dof-rearr.eps "Setting up the compDOF arrays and the IO decomposition for a single 1D array distributed across 3 MPI tasks" width=10cm

Figure 3 illustrates the inclusion of an IO decomposition and the associated
rearrangement used to write out the distributed array. The shading of the
array elements shows how the individual PE arrays are blended using the IO
decomposition specifications. The subroutine call to \ref PIO_initdecomp for
PE 0 is illustrated below:

\verbinclude simple-dof-rearr

*/