DMTCP
Blaz Jesenko edited this page 7 years ago

DMTCP (Distributed MultiThreaded CheckPointing) transparently checkpoints a single-host computation in user space, with no modifications to user code. Precompiled versions are available in the /opt/dmtcp folder. Multiple versions are available, and the folder will be updated as the software is updated.

Usage examples:

For continuity's sake, I will use the same executable as described in the checkpointing section of this wiki.

a) Command line checkpointing

The following example is useful for testing purposes.

1. Export the PATH and set up the environment

For your convenience, a couple of scripts have been provided in /opt/dmtcp. You can of course add the following line to your .profile, or just include it in a script:

source /opt/dmtcp/dmtcp_set-variables.sh
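The helper script itself is small; a minimal sketch of what dmtcp_set-variables.sh might do is shown below. The install prefix and version number are assumptions for illustration, so check the actual script on the cluster.

```shell
# Hypothetical sketch of dmtcp_set-variables.sh -- the real script may
# pick a different version or set additional variables.
DMTCP_ROOT=/opt/dmtcp/2.5.2                      # assumed install prefix
export PATH="$DMTCP_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$DMTCP_ROOT/lib:$LD_LIBRARY_PATH"
```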
2. Run your program

Run the command you would normally run, but wrap it with dmtcp_launch. Do this on a compute node.

dmtcp_launch ./a.out arg1 arg2 ...
dmtcp_command --checkpoint     [from another terminal window on same computer]
dmtcp_restart ckpt_a.out_*.dmtcp
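The wiki's test executable is not required; any long-running program can be checkpointed this way. As a stand-in, here is a trivial counter script (hypothetical name count.sh, not part of the cluster software):

```shell
#!/bin/bash
# count.sh -- a trivial long-running test target (hypothetical; any program
# you want to checkpoint works the same way). A real run would loop for hours;
# the small limit here just keeps the example quick.
limit=5
i=0
while [ "$i" -lt "$limit" ]; do
    echo "tick $i"
    i=$((i + 1))
    sleep 1
done
```

After raising the limit, launch it with dmtcp_launch ./count.sh, checkpoint it from another terminal with dmtcp_command --checkpoint, and after a restart the counter resumes from wherever the checkpoint was taken.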

b) DMTCP checkpointing in slurm

We need to source the new PATH as in the previous example, plus two more scripts. A batch script for the aforementioned program would look like this:

cat preemption.sh

#!/bin/bash

#SBATCH --partition=checkpoint     # change to the proper partition name; for now we need to specify both qos and partition
#SBATCH --time=00:50:00            # put proper time of reservation here
#SBATCH --qos=checkpoint           # for now we need to specify both qos and partition
#SBATCH --nodes=1                  # number of nodes
#SBATCH --nodelist=asgard06        # run only on following node
#SBATCH --ntasks-per-node=24       # processes per node
#SBATCH --mem=10000                # memory resource
#SBATCH --job-name="checkpoint"    # change to your job name
#SBATCH --output=dmtcp.out         # output to this file
#SBATCH --open-mode=append         # append to this file, do not start from scratch

source /opt/dmtcp/dmtcp_set-variables.sh
source /opt/dmtcp/dmtcp_slurm_setup.sh
source /opt/dmtcp/print_slurm_info.sh

GRACETIME=1    # seconds to let the checkpoint finish before the coordinator quits

term_handler(){
    echo "Caught signal: $DMTCP_COORD_HOST $DMTCP_COORD_PORT"
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -c
    sleep ${GRACETIME}
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -q  # quit coordinator
    sbatch preemption.sh
    exit
}

trap 'term_handler' TERM

start_coordinator # -i 120 ... <put dmtcp coordinator options here>
echo "signal: $DMTCP_COORD_HOST $DMTCP_COORD_PORT"

if [ -f ./dmtcp_restart_script.sh ]
then
    ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT 
else
    dmtcp_launch ./test_cpp 
fi &
while :
do
    sleep 1
done
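Two shell mechanics in the script above are easy to miss: the payload is backgrounded so the batch shell stays free to receive signals, and the foreground loop sleeps in short steps because bash only runs a trap handler between commands. Stripped of DMTCP and Slurm, the pattern can be demonstrated like this (self-signalling stands in for slurmctld; names are illustrative only):

```shell
#!/bin/bash
# Minimal demonstration of the trap-and-requeue pattern used above
# (no DMTCP, no Slurm -- just the shell mechanics).
handled=0
term_handler() {
    echo "caught SIGTERM, this is where we would checkpoint and requeue"
    handled=1
}
trap 'term_handler' TERM

sleep 60 &            # stand-in for the dmtcp_launch'ed payload
payload=$!

# send ourselves SIGTERM to simulate slurmctld preempting the job
kill -TERM $$

# bash delivers the trap between commands, so the loop wakes up quickly
while [ "$handled" -eq 0 ]; do
    sleep 1
done
kill "$payload" 2>/dev/null
echo "done"
```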

A word about signals:

The Slurm controller daemon (slurmctld) sends signals to nodes to control them. When preempting a job in favour of one with higher priority, slurmctld sends a couple of signals to the batch script running your executable. The example above catches SIGTERM and runs the checkpointing function "term_handler()". It then promptly requeues itself. If you wanted to cancel such a job, you could do it with:

scancel -b <JOBID>

Doing it without the -b flag will checkpoint and requeue your job instead.
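You can also trigger a checkpoint-and-requeue by hand by delivering the same SIGTERM that preemption would. Since the trap lives in the batch shell, the signal has to target the batch step (flags as documented in the scancel man page):

```shell
# send SIGTERM to the batch script only -- the trap above then
# checkpoints the job and requeues it
scancel --batch --signal=TERM <JOBID>
```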