MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing¶

MANA is an MPI-Agnostic, Network-Agnostic transparent checkpointing tool. It employs a novel split-process approach, in which a single system process contains two programs in its memory address space: the lower half, containing an MPI proxy application with the MPI library, libc, and other system libraries, and the upper half, containing the original MPI application and its data (see the figure below). MANA tracks which memory regions belong to the upper and lower halves. It achieves MPI agnosticism by checkpointing only the upper-half memory, discarding the lower-half memory at checkpoint time, and reinitializing the MPI library upon restart. MANA achieves network agnosticism by draining MPI messages before checkpointing. To ensure that no checkpoint occurs while only some ranks are participating in a collective call, MANA prefaces every collective MPI call with a trivial barrier. The real collective call happens only after all the ranks have arrived at, and exited, the trivial barrier. Checkpointing is disabled during the real collective call, which ensures that no messages are in flight in the network when a checkpoint is initiated.

MANA's MPI Agnosticism

MANA addresses the critical maintenance cost of C/R tools over many combinations of MPI implementations and networks, and is transparent to the underlying MPI, network, libc library, and Linux kernel. MANA is implemented as a plugin for DMTCP, so it lives completely in user space, and it has been shown to scale to a large number of processes.

Starting from a proof-of-concept research code, MANA is under active development for use with production workloads. Its use on Cori is therefore experimental. MANA may incur a high runtime overhead for applications that use MPI collective calls frequently. We have seen up to 24% runtime overhead with MANA on Cori KNL when running VASP, a widely used materials science code that calls MPI collectives thousands of times per second per rank. The developers have made significant progress on reducing the runtime overhead and expect to fix the problem soon. Please report any issues you encounter to NERSC's Help Desk.

Note

If your MPI application has sufficient internal C/R support, you do not need MANA. MANA is for applications that do not have internal C/R support or have limited C/R support.

MANA on Cori¶

MANA is installed as a module on Cori. To access it, run

module load mana

To see what the mana module does, run

module show mana

Output of module show mana command on Cori

cori$  module show mana 
-------------------------------------------------------------------
/global/common/software/nersc/cle7/extra_modulefiles/mana/2021-05-24:

module-whatis    MANA (MPI-Agnostic Network-Agnostic) transparent checkpointing tool, 
                 implemented in DMTCP (Distributed MultiThreaded Checkpointing) 
                 transparently checkpoints MPI applications in user-space -- with 
                 no modifications to user code or to the O/S. 
setenv           MANA_ROOT /usr/common/software/mana/2021-05-24/intel 
setenv           MANA_PMI_LIBDIR /usr/common/software/mana/2021-05-24/intel/lib/pmi 
setenv           DMTCP_DL_PLUGIN 0 
setenv           PMI_NO_PREINITIALIZE 1 
prepend-path     LD_LIBRARY_PATH /usr/common/software/mana/2021-05-24/intel/lib/dmtcp 
prepend-path     PATH . 
prepend-path     PATH /usr/common/software/mana/2021-05-24/intel/bin 
prepend-path     MANPATH /usr/common/software/mana/2021-05-24/intel/share/man 
setenv           mana_VERSION 2021-05-24 
setenv           MANA_DIR /usr/common/software/mana/2021-05-24/intel 
setenv           SITE_MODULE_NAMES mana 
prepend-path     mana_PKGCONFIG_LIBS mpidummy 
prepend-path     PE_PKGCONFIG_PRODUCTS mana 
prepend-path     PKG_CONFIG_PATH /usr/common/software/mana/2021-05-24/intel/lib/dmtcp/pkgconfig 
-------------------------------------------------------------------
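After loading the module, a quick sanity check can confirm that the environment shown above is in place. The following is a minimal sketch, not part of MANA: the function name check_mana_env is hypothetical, and it assumes the plugin lives at $MANA_ROOT/lib/dmtcp/libmana.so, the path that appears on the dmtcp_launch command line later on this page.

```shell
#!/bin/bash
# check_mana_env (hypothetical helper, not shipped with MANA):
# verify that MANA_ROOT is set and that the MANA plugin library is present.
check_mana_env() {
    if [[ -z "${MANA_ROOT:-}" ]]; then
        echo "MANA_ROOT is not set; run 'module load mana' first" >&2
        return 1
    fi
    # Assumed plugin location, taken from the dmtcp_launch example on this page.
    if [[ ! -e "$MANA_ROOT/lib/dmtcp/libmana.so" ]]; then
        echo "libmana.so not found under $MANA_ROOT/lib/dmtcp" >&2
        return 1
    fi
    echo "MANA environment looks usable: $MANA_ROOT"
}
```

Calling check_mana_env at the top of a job script, right after module load mana, would stop the job early if the module did not load as expected.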

Benefits of Checkpointing/Restart¶

You are encouraged to experiment with MANA with your applications, enabling checkpoint/restart in your production workloads. Benefits of checkpointing and restarting jobs with MANA include:

increased job throughput
the capability of running jobs of any length
a 75% charging discount on Cori KNL and a 50% charging discount on Haswell when using the flex QOS
reduced machine time loss due to system failures

Compile to Use MANA¶

To use MANA to checkpoint/restart your applications, you do not need to modify any of your codes. However, you must link your application dynamically, and build shared libraries for the libraries that your application depends on. Additionally, you must link your applications to a library that contains the wrapped MPI APIs from MANA. This may not be required in the future.

The mana module on Cori was written to interact with the compiler wrappers, so the linking to the MANA library can be done automatically if you load the mana module. Here is how you can compile code to use MANA:

module unload darshan xalt
module load mana
ftn -qopenmp -o jacobi.x jacobi.f90

For manual linking, you can provide MANA's MPI wrapper library on the compile/link line as follows, after loading the mana module:

-L$MANA_ROOT/lib/dmtcp -lmpidummy

Note that the darshan and xalt modules (a light-weight I/O profiling tool and a library tracking tool, respectively) are unloaded to avoid any complications they may add to MANA.
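One way to confirm that a binary picked up the wrapper library is to inspect its shared-library dependencies with ldd. The sketch below is an illustration only: has_mana_wrapper is a hypothetical helper, and it assumes that -lmpidummy resolves to a shared library whose file name starts with libmpidummy.so.

```shell
#!/bin/bash
# has_mana_wrapper (hypothetical helper): read `ldd` output on stdin and
# succeed if it lists the MANA MPI wrapper library.
# Assumption: -lmpidummy resolves to a shared library named libmpidummy.so*.
has_mana_wrapper() {
    grep -q 'libmpidummy\.so'
}

# Typical use after building (jacobi.x is the example binary from above):
#   ldd ./jacobi.x | has_mana_wrapper && echo "linked against the MANA wrapper"
```

Because the helper only reads text from standard input, it can be tried on saved ldd output as well as on a live binary.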

C/R MPI Applications with MANA¶

C/R Interactive Jobs¶

You can use MANA to checkpoint and restart your MPI application interactively, which is convenient when testing and debugging MANA jobs. Here are the steps on Cori:

Checkpointing¶

Get on a compute node using the salloc command, e.g., requesting one KNL node for one hour:
```
salloc -N 1 -C knl -t 1:00:00 -q interactive
```
Then load the mana module once you are on the compute node.
```
module load mana 
```
Start the coordinator and specify the checkpoint interval, e.g., 300 seconds (-i300).
```
mana_coordinator -i300
```
Then launch your application (a.out) with the mana_launch command.
```
srun -n 64 -c4 --cpu-bind=cores mana_launch ./a.out [arg1 ...]
```
Then MANA will checkpoint your application every 300 seconds. You can terminate your running job once checkpoint files are generated.

Restarting¶

To restart your job from the checkpoint files, repeat the steps above, but in the last step replace the mana_launch command with the mana_restart command. The mana_restart command line is as follows:

srun -n 64 -c4 --cpu-bind=cores mana_restart

Then MANA will restart from the checkpoint files and continue to run your application, checkpointing once every 300 seconds.

Note that MANA is implemented as a plugin in DMTCP and therefore uses DMTCP's dmtcp_coordinator, dmtcp_launch, dmtcp_restart, and dmtcp_command commands, as described in the DMTCP page, but with additional command line options. Since some of the command lines can be long, MANA provides the bash scripts mana_coordinator, mana_launch, mana_restart, and mana_status, respectively, to keep the command lines short and easy to use.

In the example above, the mana_coordinator is a bash script that invokes the dmtcp_coordinator command as a daemon (--daemon) in the background. The full dmtcp_coordinator command line used in the above example is as follows:

dmtcp_coordinator --mpi --daemon --exit-on-last -q -i300

The --mpi option is required to use MANA. The mana_launch and mana_restart commands are bash scripts that invoke dmtcp_launch and dmtcp_restart, respectively, with added options for MANA. Here are the dmtcp_launch and dmtcp_restart command lines used in this example:

srun -n64 -c4 --cpu-bind=cores dmtcp_launch -h `hostname` --no-gzip --join --disable-dl-plugin --with-plugin $MANA_ROOT/lib/dmtcp/libmana.so ./a.out [arg1 ...]

srun -n64 -c4 --cpu-bind=cores dmtcp_restart --mpi -j  -h `hostname` --restartdir ./

MANA provides the mana_status command, a bash script that invokes dmtcp_command, to send commands to the coordinator remotely. You must first get on the compute node where your job is running (instructions here).

mana_status --checkpoint  # checkpoint all processes 
mana_status --status      # query the status 
mana_status --quit        # kill all processes and quit

All mana_* commands support command line options (use --help to see the list). For instance, you can save checkpoint files in a separate directory using the --ckptdir <directory name> option when invoking the mana_launch command. At restart, you can use the --restartdir <directory name> option to specify the checkpoint files for the mana_restart command.
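Before restarting, it can be worth checking that the directory you plan to pass to --restartdir actually holds checkpoint data. The mana_restart help text on this page says the directory should contain ckpt_rank_* subdirectories. The sketch below counts them; count_rank_ckpts is a hypothetical helper, and comparing the count with the number of MPI ranks assumes one subdirectory per rank, which this page does not state explicitly.

```shell
#!/bin/bash
# count_rank_ckpts (hypothetical helper): print how many ckpt_rank_*
# subdirectories exist in a checkpoint directory (default: current directory).
count_rank_ckpts() {
    local dir=${1:-.} n=0 d
    # The trailing slash restricts the glob to directories.
    for d in "$dir"/ckpt_rank_*/; do
        if [[ -d "$d" ]]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

A count of zero means there is nothing to restart from in that directory.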

C/R Batch Jobs¶

Assume the job you wish to checkpoint is run.slurm as shown below, in which you request two Cori Haswell nodes to run an MPI application for 48 hours. You can checkpoint and restart this job using the C/R job scripts below, run_launch.slurm and run_restart.slurm.

Cori Haswell¶

run.slurm: the job you wish to checkpoint

#!/bin/bash 
#SBATCH -J test
#SBATCH -q regular
#SBATCH -N 2 
#SBATCH -C haswell
#SBATCH -t 48:00:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err

srun -n 64 ./a.out

run_launch.slurm: launches your job under MANA control

#!/bin/bash
#SBATCH -J test_cr
#SBATCH -q flex
#SBATCH -N 2 
#SBATCH -C haswell
#SBATCH -t 48:00:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --time-min=2:00:00

#c/r with mana 
module load mana 

#checkpointing once every hour
mana_coordinator -i 3600

#running under mana control
srun -n 64 mana_launch ./a.out

run_restart.slurm: restarts your job from checkpoint files with MANA

#!/bin/bash
#SBATCH -J test_cr
#SBATCH -q flex 
#SBATCH -N 2 
#SBATCH -C haswell
#SBATCH -t 48:00:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --time-min=2:00:00

#c/r with mana 
module load mana 

#checkpointing once every hour
mana_coordinator -i 3600

#restarting from checkpoint files
srun -n 64 mana_restart

Similarly, in the following KNL example you can checkpoint and restart the job run.slurm using the C/R job scripts below, run_launch.slurm and run_restart.slurm.

Cori KNL¶

run.slurm: the job you wish to checkpoint

#!/bin/bash 
#SBATCH -J test
#SBATCH -q regular
#SBATCH -N 2 
#SBATCH -C knl
#SBATCH -t 48:00:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err

#user setting
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8

srun -n 16 -c32 --cpu-bind=cores ./a.out

run_launch.slurm: launches your job under MANA control

#!/bin/bash
#SBATCH -J test_cr
#SBATCH -q flex
#SBATCH -N 2 
#SBATCH -C knl
#SBATCH -t 48:00:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --time-min=2:00:00

#user setting
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8

#for c/r with mana
module load mana

#checkpointing once every hour
mana_coordinator -i 3600

#run job under mana control
srun -n 16 -c 32 --cpu-bind=cores mana_launch ./a.out

run_restart.slurm: restarts your job from checkpoint files with MANA

#!/bin/bash
#SBATCH -J test_cr
#SBATCH -q flex
#SBATCH -N 2 
#SBATCH -C knl
#SBATCH -t 48:00:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --time-min=2:00:00

#user settings
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8

#for c/r with mana
module load mana

#checkpointing once every hour
mana_coordinator -i 3600

#restart job from checkpoint files
srun -n 16 -c 32 --cpu-bind=cores mana_restart

Now that you can checkpoint/restart the job with MANA, you can run your long job (48 hours) as multiple shorter ones. This increases your job's backfill opportunities, thereby improving your job throughput. You can use the --time-min sbatch flag to specify a minimum time limit for your C/R jobs, allowing your job to run with any time limit between 2 and 48 hours. Note that you must use --time-min=2:00:00 or less to get the 50% charging discount on Haswell and the 75% charging discount on KNL from using the flex QOS. You can use a longer time-min, e.g., 6 hours, if you do not use the flex QOS.
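The two-hour rule above is easy to check in a script before submitting. The following sketch converts a time in H:MM:SS form to seconds and tests it against the 2-hour limit. Both function names are hypothetical, and only the H:MM:SS form used on this page is handled, not the other time formats that Slurm accepts.

```shell
#!/bin/bash
# hms_to_seconds: convert H:MM:SS (or HH:MM:SS) to seconds.
# The 10# prefix forces base 10 so fields like "08" are not read as octal.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# flex_discount_ok (hypothetical helper): succeed if the given --time-min value
# is at most 2 hours, the limit stated above for the flex QOS discount.
flex_discount_ok() {
    (( $(hms_to_seconds "$1") <= 7200 ))
}
```

For example, flex_discount_ok 2:00:00 succeeds, while flex_discount_ok 6:00:00 fails.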

To run the job, just submit the C/R job scripts above,

sbatch run_launch.slurm
sbatch run_restart.slurm   #if the first job is pre-terminated
sbatch run_restart.slurm   #if the second job is pre-terminated
...

The first job will run with a time limit anywhere between the specified time-min and 48 hours. If it is pre-terminated, you need to submit the restart job, run_restart.slurm. You may need to submit it multiple times until the job completes or has run for the requested 48 hours. You can use job dependencies to submit all your C/R jobs at once (you may need to submit many more restart jobs than are actually needed). You can also combine the two C/R job scripts into one (see the next section) and submit it multiple times as dependent jobs all at once. However, this is still not as convenient as submitting the job script run.slurm only once. The good news is that you can automate the C/R jobs using features supported in Slurm and a trap function (see the next section). The job scripts in the next section are the recommended way to run C/R jobs.
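The job-dependency approach mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions rather than a NERSC-provided tool: submit_cr_chain and the SBATCH override variable are hypothetical, and afterany is one possible dependency type, chosen here so that each restart job starts once the previous job has ended, whatever its final state.

```shell
#!/bin/bash
# submit_cr_chain (hypothetical helper): submit run_launch.slurm followed by
# N restart jobs, each depending on the previous job in the chain.
# SBATCH can be overridden (e.g., with a stub) for a dry run.
submit_cr_chain() {
    local n_restarts=$1 submit=${SBATCH:-sbatch} jobid i
    # --parsable makes sbatch print only the job ID (optionally ";cluster").
    jobid=$($submit --parsable run_launch.slurm) || return 1
    jobid=${jobid%%;*}
    for (( i = 0; i < n_restarts; i++ )); do
        # afterany: start once the previous job has ended, whatever its state.
        jobid=$($submit --parsable --dependency=afterany:"$jobid" run_restart.slurm) || return 1
        jobid=${jobid%%;*}
    done
    echo "last job in chain: $jobid"
}
```

For example, submit_cr_chain 5 would queue the launch job plus five restart jobs. Any restart jobs still queued after the application completes can be cancelled with scancel.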

Automate C/R Jobs¶

C/R job submissions can be automated using the variable-time job script, so that you just need to submit a single job script once as you would with your original job script, run.slurm.

Here are the sample job scripts:

Cori Haswell¶

run_cr.slurm: a sample job script to checkpoint and restart your job with MANA automatically

#!/bin/bash 
#SBATCH -J test
#SBATCH -q flex 
#SBATCH -N 2             
#SBATCH -C haswell
#SBATCH -t 48:00:00 
#SBATCH -e %x-%j.err 
#SBATCH -o %x-%j.out
#SBATCH --time-min=2:00:00  
#
#SBATCH --comment=48:00:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue
#SBATCH --open-mode=append

module load mana nersc_cr

#checkpointing once every hour
mana_coordinator -i 3600

#checkpointing/restarting jobs
if [[ $(restart_count) == 0 ]]; then

    srun -n 64 mana_launch ./a.out &
elif [[ $(restart_count) -gt 0 ]] && [[ -e dmtcp_restart_script.sh ]]; then

    srun -n 64 mana_restart &
else

    echo "Failed to restart the job, exit"; exit
fi

# requeueing the job if remaining time >0
ckpt_command=ckpt_mana    #checkpointing additionally right before the job hits the wall limit
requeue_job func_trap USR1

wait

Cori KNL¶

run_cr.slurm: a sample job script to checkpoint and restart your job with MANA automatically

#!/bin/bash 
#SBATCH -J test
#SBATCH -q flex 
#SBATCH -N 2             
#SBATCH -C knl
#SBATCH -t 48:00:00 
#SBATCH -e %x-%j.err 
#SBATCH -o %x-%j.out
#SBATCH --time-min=02:00:00  
#
#SBATCH --comment=48:00:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue
#SBATCH --open-mode=append

module load mana nersc_cr

#checkpointing once every hour
mana_coordinator -i 3600

#c/r jobs
if [[ $(restart_count) == 0 ]]; then

    #user setting
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads
    export OMP_NUM_THREADS=8
    srun -n 16 -c 32 --cpu-bind=cores mana_launch ./a.out &
elif [[ $(restart_count) -gt 0 ]] && [[ -e dmtcp_restart_script.sh ]]; then

    srun -n 16 -c 32 --cpu-bind=cores mana_restart &
else

    echo "Failed to restart the job, exit"; exit
fi

# requeueing the job if remaining time >0
ckpt_command=ckpt_mana     #additional checkpointing right before the job hits the wall limit
requeue_job func_trap USR1

wait

This job script combines the two C/R job scripts from the previous section, run_launch.slurm and run_restart.slurm, by checking the restart count of the job (the if block). As before, --time-min is used to split a long-running job into multiple shorter ones to improve backfill opportunities. Each job will run with a time limit anywhere between the specified --time-min and the time limit (-t), checkpointing once every hour (-i 3600). In addition to the mana module, the C/R job scripts load the nersc_cr module, which provides a set of bash functions to manage C/R jobs (e.g., restart_count, requeue_job, func_trap, and ckpt_mana) that are used in the job script.
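The decision made by the if block can be isolated into a small function for testing outside a batch job. The sketch below is not the nersc_cr implementation. cr_action is a hypothetical name, and it assumes the restart count can be read from SLURM_RESTART_COUNT, a variable that Slurm sets for requeued jobs. The restart_count function in nersc_cr may obtain the count differently.

```shell
#!/bin/bash
# cr_action (hypothetical helper): print which action the job script's
# if block would take: "launch", "restart", or "fail".
# Assumption: the restart count is read from SLURM_RESTART_COUNT, which Slurm
# sets for requeued jobs; nersc_cr's restart_count may be implemented differently.
cr_action() {
    local count=${SLURM_RESTART_COUNT:-0} dir=${1:-.}
    if (( count == 0 )); then
        echo launch
    elif [[ -e "$dir/dmtcp_restart_script.sh" ]]; then
        echo restart
    else
        echo fail
        return 1
    fi
}
```

The optional argument is the directory in which to look for dmtcp_restart_script.sh. It defaults to the current directory, as in the job script.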

What's new in this script is that

It can automatically track the remaining walltime, and resubmit itself until the job completes or the accumulated run time reaches the desired walltime (48 hours in this example).
Optionally, each job checkpoints one more time 300 seconds before the job hits the allocated time limit.
There is only one job id, and one standard output/error file associated with multiple shorter jobs.

These features are enabled with the following additional sbatch flags and a bash function requeue_job, which traps the signal (USR1) sent from the batch system:

#SBATCH --comment=48:00:00         #comment for the job
#SBATCH --signal=B:USR1@<sig_time> 
#SBATCH --requeue                  #specify job is requeueable
#SBATCH --open-mode=append         #to append standard out/err of the requeued job  
                                   #to that of the previously terminated job

#requeueing the job if remaining time >0
ckpt_command=ckpt_mana 
requeue_job func_trap USR1

wait

where the --comment sbatch flag is used to specify the desired walltime and to track the remaining walltime for the job (after pre-termination). You can specify any length of time, e.g., a week or even longer. The --signal flag is used to request that the batch system sends user-defined signal USR1 to the batch shell (where the job is running) sig_time seconds (e.g., 300) before the job hits the wall limit. This time should match the checkpoint overhead of your job.
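The bookkeeping behind the remaining walltime amounts to subtracting the time already used from the desired total. The sketch below illustrates only that arithmetic. remaining_walltime is a hypothetical helper, and how nersc_cr actually stores and updates the value in the job comment is not shown here.

```shell
#!/bin/bash
# remaining_walltime (hypothetical helper): given the desired total walltime and
# the time already used, both in seconds, print what is left as HH:MM:SS.
remaining_walltime() {
    local total=$1 used=$2 left
    left=$(( total - used ))
    # Never report a negative remainder.
    if (( left < 0 )); then
        left=0
    fi
    printf '%02d:%02d:%02d\n' $(( left / 3600 )) $(( left % 3600 / 60 )) $(( left % 60 ))
}
```

For a 48-hour request (172800 seconds) with 7500 seconds already used, remaining_walltime 172800 7500 prints 45:55:00.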

Upon receiving the signal USR1 from the batch system (300 seconds before the job hits the wall limit), requeue_job executes the following commands (contained in the function func_trap, which is provided on the requeue_job command line in the job script):

mana_status --checkpoint     #checkpoint the job if ckpt_command=ckpt_mana  
scontrol requeue $SLURM_JOB_ID #requeue the job

If your job completes before the job hits the wall limit, then the batch system will not send the USR1 signal, and the two commands above will not be executed (no additional checkpointing and no more requeued job). The job will exit normally.

For more details about requeue_job and the other functions used in the C/R job scripts, refer to the script cr_functions.sh provided by the nersc_cr module (type module show nersc_cr to see where the script resides). You may consider making a local copy of this script and modifying it for your use case.

To run the job, simply submit the job script,

sbatch run_cr.slurm

Note

It is important to run mana_launch and mana_restart in the background (&) and to add a wait command at the end of the job script. When the batch system sends the USR1 signal to the batch shell, the wait command is then interrupted instead of the mana_launch or mana_restart command, which can continue to run and complete the last checkpoint right before the job hits the wall limit.
You need to make the sig_time in the --signal sbatch flag match the checkpoint overhead of your job.
You may want to change the checkpoint interval for your job, especially if your job's checkpoint overhead is high. You can also choose to checkpoint only once, right before your job hits the wall limit.
Note that the nersc_cr module does not support csh. Csh/tcsh users must invoke bash before loading the module.
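The background-plus-wait pattern from the first note can be tried without Slurm or MANA. In the sketch below, which is an illustration only, a sleep command stands in for the srun ... mana_launch step, a one-line handler stands in for the checkpoint-and-requeue function, and a helper subshell stands in for the batch system sending USR1. The name trap_wait_demo is hypothetical.

```shell
#!/bin/bash
# trap_wait_demo: stand-alone illustration of the "background + wait + trap"
# pattern. sleep stands in for "srun ... mana_launch ./a.out", and the
# handler stands in for the checkpoint-and-requeue function.
trap_wait_demo() {
    local self=$BASHPID worker status=0

    # Long-running work, started in the background.
    ( sleep 2; echo "worker finished" ) &
    worker=$!

    # The handler runs in this shell; the worker itself is not signaled.
    trap 'echo "USR1 trapped"' USR1

    # Stand-in for --signal=B:USR1@<sig_time>: signal this shell after 1 s.
    ( sleep 1; kill -USR1 "$self" ) &

    # wait returns early (status > 128) when the trapped signal arrives.
    wait "$worker" || status=$?
    if (( status > 128 )) && kill -0 "$worker" 2>/dev/null; then
        echo "wait interrupted; worker still running"
    fi

    # Wait again so the worker can finish, then restore the default handler.
    wait "$worker"
    trap - USR1
}
```

Running trap_wait_demo shows the handler firing while the worker is still alive, followed by the worker finishing on its own. If the worker were run in the foreground instead, bash would not run the handler until the worker had exited.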

C/R Serial/Threaded Applications¶

If you run serial/threaded applications, we recommend that you use DMTCP to checkpoint and restart your jobs. See the DMTCP page for detailed instructions. MANA is recommended for checkpointing MPI applications.

MANA Help Pages¶

mana_coordinator help page

cori$ mana_coordinator --help
Usage: dmtcp_coordinator [OPTIONS] [port]
Coordinates checkpoints between multiple processes.

Options:
  -p, --coord-port PORT_NUM (environment variable DMTCP_COORD_PORT)
      Port to listen on (default: 7779)
  --port-file filename
      File to write listener port number.
      (Useful with '--port 0', which is used to assign a random port)
  --ckptdir (environment variable DMTCP_CHECKPOINT_DIR):
      Directory to store dmtcp_restart_script.sh (default: ./)
  --tmpdir (environment variable DMTCP_TMPDIR):
      Directory to store temporary files (default: env var TMDPIR or /tmp)
  --exit-on-last
      Exit automatically when last client disconnects
  --exit-after-ckpt
      Kill peer processes of computation after first checkpoint is created
  --daemon
      Run silently in the background after detaching from the parent process.
  -i, --interval (environment variable DMTCP_CHECKPOINT_INTERVAL):
      Time in seconds between automatic checkpoints
      (default: 0, disabled)
  --coord-logfile PATH (environment variable DMTCP_COORD_LOG_FILENAME
              Coordinator will dump its logs to the given file
  -q, --quiet 
      Skip startup msg; Skip NOTE msgs; if given twice, also skip WARNINGs
  --help:
      Print this message and exit.
  --version:
      Print version information and exit.

COMMANDS:
      type '?<return>' at runtime for list

Report bugs to: dmtcp-forum@lists.sourceforge.net
DMTCP home page: <http://dmtcp.sourceforge.net>

mana_launch help page

cori$ dmtcp_launch --help
Usage: dmtcp_launch [OPTIONS] <command> [args...]
Start a process under DMTCP control.

  -h, --coord-host HOSTNAME (environment variable DMTCP_COORD_HOST)
              Hostname where dmtcp_coordinator is run (default: localhost)
  -p, --coord-port PORT_NUM (environment variable DMTCP_COORD_PORT)
              Port where dmtcp_coordinator is run (default: 7779)
  --port-file FILENAME
              File to write listener port number.  (Useful with
              '--coord-port 0', which is used to assign a random port)
  -j, --join-coordinator
              Join an existing coordinator, raise error if one doesn't
              already exist
  --new-coordinator
              Create a new coordinator at the given port. Fail if one
              already exists on the given port. The port can be specified
              with --coord-port, or with environment variable 
              DMTCP_COORD_PORT.
              If no port is specified, start coordinator at a random port
              (same as specifying port '0').
  --any-coordinator
              Use --join-coordinator if possible, but only if port was specified.
              Else use --new-coordinator with specified port (if avail.),
                and otherwise with the default port: --port 7779)
              (This is the default.)
  --no-coordinator
              Execute the process in standalone coordinator-less mode.
              Use dmtcp_command or --interval to request checkpoints.
              Note that this is incompatible with calls to fork(), since
              an embedded coordinator runs in the original process only.
  -i, --interval SECONDS (environment variable DMTCP_CHECKPOINT_INTERVAL)
              Time in seconds between automatic checkpoints.
              0 implies never (manual ckpt only);
              if not set and no env var, use default value set in
              dmtcp_coordinator or dmtcp_command.
              Not allowed if --join-coordinator is specified

Checkpoint image generation:
  --gzip, --no-gzip, (environment variable DMTCP_GZIP=[01])
              Enable/disable compression of checkpoint images (default: 1)
              WARNING: gzip adds seconds. Without gzip, ckpt is often < 1s
  --ckptdir PATH (environment variable DMTCP_CHECKPOINT_DIR)
              Directory to store checkpoint images
              (default: curr dir at launch)
  --ckpt-open-files
  --checkpoint-open-files
              Checkpoint open files and restore old working dir.
              (default: do neither)
  --allow-file-overwrite
              If used with --checkpoint-open-files, allows a saved file
              to overwrite its existing copy at original location
              (default: file overwrites are not allowed)
  --ckpt-signal signum
              Signal number used internally by DMTCP for checkpointing
              (default: SIGUSR2/12).

Enable/disable plugins:
  --with-plugin (environment variable DMTCP_PLUGIN)
              Colon-separated list of DMTCP plugins to be preloaded with
              DMTCP.
              (Absolute pathnames are required.)
  --batch-queue, --rm
              Enable support for resource managers (Torque PBS and SLURM).
              (default: disabled)
  --ptrace
              Enable support for PTRACE system call for gdb/strace etc.
              (default: disabled)
  --modify-env
              Update environment variables based on the environment on the
              restart host (e.g., DISPLAY=$DISPLAY).
              This can be set in a file dmtcp_env.txt.
              (default: disabled)
  --pathvirt
              Update file pathnames based on DMTCP_PATH_PREFIX
              (default: disabled)
  --ib, --infiniband
              Enable InfiniBand plugin. (default: disabled)
  --disable-alloc-plugin: (environment variable DMTCP_ALLOC_PLUGIN=[01])
              Disable alloc plugin (default: enabled).
  --disable-dl-plugin: (environment variable DMTCP_DL_PLUGIN=[01])
              Disable dl plugin (default: enabled).
  --disable-all-plugins (EXPERTS ONLY, FOR DEBUGGING)
              Disable all plugins.

Other options:
  --tmpdir PATH (environment variable DMTCP_TMPDIR)
              Directory to store temp files (default: $TMDPIR or /tmp)
              (Behavior is undefined if two launched processes specify
               different tmpdirs.)
  -q, --quiet (or set environment variable DMTCP_QUIET = 0, 1, or 2)
              Skip NOTE messages; if given twice, also skip WARNINGs
  --coord-logfile PATH (environment variable DMTCP_COORD_LOG_FILENAME
              Coordinator will dump its logs to the given file
  --help
              Print this message and exit.
  --version
              Print version information and exit.

Report bugs to: dmtcp-forum@lists.sourceforge.net
DMTCP home page: <http://dmtcp.sourceforge.net>

mana_restart help page

cori$ mana_restart --help
USAGE: mana_restart [--verbose] [DMTCP_OPTIONS ...] [--restartdir MANA_CKPT_DIR]
       Default for MANA_CKPT_DIR is current directory
       For DMTCP options, do: /usr/common/software/mana/2021-05-24/intel/bin/mana_restart --help --help
 MANA_CKPT_DIR should contain the checkpoint subdirs: ckpt_rank_*

cori$ mana_restart --help --ehlp
Invalid Argument
Usage: dmtcp_restart [OPTIONS] <ckpt1.dmtcp> [ckpt2.dmtcp...]

Restart processes from a checkpoint image.

  -h, --coord-host HOSTNAME (environment variable DMTCP_COORD_HOST)
              Hostname where dmtcp_coordinator is run (default: localhost)
  -p, --coord-port PORT_NUM (environment variable DMTCP_COORD_PORT)
              Port where dmtcp_coordinator is run (default: 7779)
  --port-file FILENAME
              File to write listener port number.
              (Useful with '--port 0', in order to assign a random port)
  -j, --join-coordinator
              Join an existing coordinator, raise error if one doesn't
              already exist
  --new-coordinator
              Create a new coordinator at the given port. Fail if one
              already exists on the given port. The port can be specified
              with --coord-port, or with environment variable
              DMTCP_COORD_PORT.
              If no port is specified, start coordinator at a random port
              (same as specifying port '0').
  --any-coordinator
              Use --join-coordinator if possible, but only if port was specified.
              Else use --new-coordinator with specified port (if avail.),
                and otherwise with the default port: --port 7779)
              (This is the default.)
  -i, --interval SECONDS (environment variable DMTCP_CHECKPOINT_INTERVAL)
              Time in seconds between automatic checkpoints.
              0 implies never (manual ckpt only); if not set and no env
              var, use default value set in dmtcp_coordinator or 
              dmtcp_command.
              Not allowed if --join-coordinator is specified

Other options:
  --no-strict-checking
              Disable uid checking for checkpoint image. Allow checkpoint
              image to be restarted by a different user than the one
              that created it. And suppress warning about running as root.
              (environment variable DMTCP_DISABLE_STRICT_CHECKING)
  --ckptdir (environment variable DMTCP_CHECKPOINT_DIR):
              Directory to store checkpoint images
              (default: use the same dir used in previous checkpoint)
  --restartdir Directory that contains checkpoint image directories
  --mpi       Use as MPI proxy
 (default: no MPI proxy)  --tmpdir PATH (environment variable DMTCP_TMPDIR)
              Directory to store temp files (default: $TMDPIR or /tmp)
  -q, --quiet (or set environment variable DMTCP_QUIET = 0, 1, or 2)
              Skip NOTE messages; if given twice, also skip WARNINGs
  --coord-logfile PATH (environment variable DMTCP_COORD_LOG_FILENAME
              Coordinator will dump its logs to the given file
  --debug-restart-pause (or set env. var. DMTCP_RESTART_PAUSE =1,2,3 or 4)
              dmtcp_restart will pause early to debug with:  GDB attach
  --help
              Print this message and exit.
  --version
              Print version information and exit.

Report bugs to: dmtcp-forum@lists.sourceforge.net
DMTCP home page: <http://dmtcp.sourceforge.net>

mana_status help page

Resources¶

DMTCP website
MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing
MANA github
User training on Checkpoint/Restart (May 2021):
- Checkpoint/Restart MPI Applications with MANA on Cori (Slides) (Recording)
- Checkpoint/Restart VASP Jobs Using MANA on Cori (Slides)