Distributed MultiThreaded CheckPointing (DMTCP) transparently checkpoints a single-host computation in user space, with no modifications to user code. Precompiled versions of the user-space programs are available in the /opt/DMTCP folder. Multiple versions are available, and the folder will be updated as the software is updated.
For continuity, I will use the same executable as described in the checkpointing section of the wiki.
The following example is useful for testing purposes.
For your convenience, a couple of scripts have been provided in the "/opt/DMTCP" location. You can of course add these lines to your .profile or just include them in a script:
source /opt/dmtcp/dmtcp_set-variables.sh
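After sourcing the setup script it can be worth confirming that the DMTCP tools are actually on PATH before launching anything. The following is only a minimal sketch: require_commands is a hypothetical helper name, not part of DMTCP or of the scripts provided on the cluster.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): succeed only if every
# command named on the argument list can be found on PATH.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Intended use, after sourcing the setup script:
# require_commands dmtcp_launch dmtcp_command dmtcp_restart || exit 1
```

If the check fails, the most likely cause is that the setup script was not sourced in the current shell.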
Run the command as you normally would, but wrap it with dmtcp_launch. Do this on a computing node.
dmtcp_launch ./a.out arg1 arg2 ...
dmtcp_command --checkpoint [from another terminal window on same computer]
dmtcp_restart ckpt_a.out_*.dmtcp
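When several checkpoints have accumulated in a directory, the glob above can match more than one image. As a sketch for the single-process case only, the helper below prints the most recently modified image; latest_checkpoint is a hypothetical name, and the ckpt_*.dmtcp pattern is taken from the restart command above.

```shell
#!/bin/bash
# Hypothetical helper: print the newest ckpt_*.dmtcp file in a directory
# (default: current directory). Returns non-zero if there is none.
latest_checkpoint() {
    dir="${1:-.}"
    newest=""
    for f in "$dir"/ckpt_*.dmtcp; do
        [ -e "$f" ] || continue      # the glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# Intended use:
# img=$(latest_checkpoint .) && dmtcp_restart "$img"
```

A computation with more than one process writes one image per process, so this helper does not apply there.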
We need to set up PATH as in the previous example and source two more scripts. A batch script for the aforementioned program would look like this:
cat preemption.sh
#!/bin/bash
#SBATCH --time=00:50:00           # put proper time of reservation here
#SBATCH --partition=checkpoint    # change to proper partition name; for now we need to specify both qos and partition
#SBATCH --qos=checkpoint          # for now we need to specify both qos and partition
#SBATCH --nodes=1                 # number of nodes
#SBATCH --nodelist=asgard06       # run only on following node
#SBATCH --ntasks-per-node=24      # processes per node
#SBATCH --mem=10000               # memory resource
#SBATCH --job-name="checkpoint"   # change to your job name
#SBATCH --output=dmtcp.out        # output to this file
#SBATCH --open-mode=append        # append to this file, do not start from scratch

source /opt/dmtcp/dmtcp_set-variables.sh
source /opt/dmtcp/dmtcp_slurm_setup.sh
source /opt/dmtcp/print_slurm_info.sh

GRACETIME=1

# Checkpoint, stop the coordinator and resubmit this script.
term_handler() {
    echo "Caught signal: $DMTCP_COORD_HOST $DMTCP_COORD_PORT"
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -c
    sleep ${GRACETIME}
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -q   # quit coordinator
    sbatch preemption.sh
    exit
}
trap 'term_handler' TERM

start_coordinator   # -i 120 ... <put dmtcp coordinator options here>
echo "signal: $DMTCP_COORD_HOST $DMTCP_COORD_PORT"

# Restart from a checkpoint if one exists, otherwise start fresh.
# The whole if-block runs in the background so the trap stays responsive.
if [ -f ./dmtcp_restart_script.sh ]
then
    ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
else
    dmtcp_launch ./test_cpp
fi &

# Keep the batch shell alive in short steps while waiting for a signal.
while :
do
    sleep 1
done
The Slurm controller daemon (slurmctld) sends signals to nodes to control them. When preempting a job in favor of a higher-priority job, slurmctld sends a couple of signals to the batch script running your executable. The example above catches SIGTERM and runs the checkpointing function "term_handler()", which checkpoints the computation and then promptly resubmits the batch script. If you want to cancel such a job, you can do it with:
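The trap pattern used by the batch script can be tried out without Slurm or DMTCP. The sketch below is a simplified stand-in, not the cluster's actual setup: "sleep 30" plays the role of the checkpointed program, the echo replaces the dmtcp_command and sbatch calls, and preemptible_job is a hypothetical name. The work runs in the background and the shell waits in short sleeps because bash runs a trap handler only after the current foreground command returns.

```shell
#!/bin/bash
# Self-contained demo of the trap pattern; no Slurm or DMTCP required.

preemptible_job() {
    # Install the handler first; $work_pid is expanded when the trap fires.
    trap 'echo "caught SIGTERM"; kill "$work_pid"; exit 0' TERM
    sleep 30 &                   # stand-in for: dmtcp_launch ./test_cpp &
    work_pid=$!
    # Short sleeps keep the shell responsive to the trapped signal.
    while :; do sleep 1; done
}

log=$(mktemp)
preemptible_job > "$log" &       # run the "batch script" in the background
job_pid=$!
sleep 2                          # give it time to install the trap
kill -TERM "$job_pid"            # simplified stand-in for the preemption signal
wait "$job_pid"
job_status=$?
cat "$log"
```

The handler runs within about a second of the signal, which is why the batch script loops on "sleep 1" instead of running the program in the foreground.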
scancel -b <JOBID>
Running scancel without the -b flag will checkpoint and requeue your job.