Checkpointing
Blaz Jesenko edited this page 7 years ago

Some checkpointing examples:

C++ code example

Automatic requeueing

Checkpointing upon hitting walltime

Distributed MultiThreaded CheckPointing

C++ code example

A C++ program with built-in checkpointing capability. On receiving SIGINT (the interrupt signal) it finishes the current outer-loop iteration, writes a checkpoint file, and exits.

cat chk.cpp

#include <iostream>
#include <csignal>
#include <omp.h>
#include <string>
#include <fstream>
#include <iomanip>
#include <cstdlib>

//Kempner's series, omitting every term whose denominator contains the digit 9. The result should be around 22.920676619264150,
//but it should be impossible to get anywhere near that number with this approach.

//New name for long long, because the old one is "too long"
typedef long long ll;

//Count for the 1st (outer) loop; not important, this one just goes on and on...
const ll num1 = 100000;
//Size of the 2nd loop, which sets the time spent inside the OpenMP region; a bigger number means a longer time
const ll num2 = 100000000;

//Flag for signalling that a checkpoint was requested (sig_atomic_t is safe to write from a signal handler)
volatile std::sig_atomic_t sig = 0;

//This will happen when we get "the signal". Currently it just informs us that the signal was received.
//It also sets the flag which triggers checkpointing when the program leaves the OpenMP region.
void handler(int signum) {
    std::cout << "Received interrupt signal, scheduling a checkpoint." << std::endl;
    sig = 1;
}

//Main part of the program
int main(void) {

//Define the two variables we need: the running sum and the outer loop index
    double sum = 0;
    ll i = 0;

//Check if a checkpoint file is already present. If not, inform the user.
    std::cout << "Looking for a checkpoint file." << std::endl;
    std::ifstream input;
    input.open("checkpoint.dat", std::ifstream::in);
    if (input.fail()) std::cout << "No checkpoint file, starting from scratch." << std::endl;

//If a checkpoint file is present, continue from the point recorded in it
    else {
        std::cout << "Found a checkpoint file, reading parameters and continuing." << std::endl;
        input >> i >> sum;
        input.close();
    }
    std::cout << "Starting signal handler." << std::endl;

//Start signal handling
    signal(SIGINT, handler);

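//Read the number of available threads. omp_get_num_threads() returns 1 outside a parallel region,
//so ask for the maximum instead. The chunk boundaries depend on nth, so restart with the same thread count.
    int nth = omp_get_max_threads();

//Outer loop; it "chops up" the parallelisation, which enables us to do the checkpointing etc. Impact on performance is on the order of ~1/num^2, a.k.a. nonexistent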
    for (; i < num1; i++) {

//This will be the parallel loop; it does the sums. The series converges far too slowly to prove anything, but that's OK, since we want slow but steady convergence
#pragma omp parallel for schedule(dynamic, 200) reduction(+:sum)
        for (ll j = 1 + i * nth * num2; j <= (i + 1) * nth * num2; j++) {
            std::string a = std::to_string(j);
            bool nine = false;
            for (auto x : a) {
                if (x == '9') {
                    nine = true;
                    break;
                }
            }
            if (nine) continue;
            sum += 1.0 / ((double)j);
        }

//An output (so that we know the program is doing its magic)
        std::cout << "Intermediate output at i=" << i << std::fixed << std::setprecision(20) << ": sum=" << sum << std::endl;

//Check for the signal; if it was received, write the needed info to the checkpoint file and exit the program
        if (sig) {
            std::cout << "Checkpointing and exiting the program. " << std::endl;
            std::ofstream output;
            output.open("checkpoint.dat", std::ofstream::out);
            output << i + 1 << ' ' << std::fixed << std::setprecision(20) << sum;
            output.close();
            exit(SIGINT);
        }
    }
//You shouldn't get here with the default num1 and num2. Let's just state it here for safety's sake...
    return 0;
}

Compile with:

g++ -std=c++11 -O3 -fopenmp -o test chk.cpp
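
The handler in chk.cpp prints through std::cout, which is not an async-signal-safe operation. A stricter variant keeps the handler down to setting a flag and leaves all reporting to the main loop. The following is a minimal sketch of that pattern, not the code of the program above; the names checkpoint_requested, request_checkpoint, install_checkpoint_handler and checkpoint_due are hypothetical.

```cpp
#include <csignal>

// Flag shared between the handler and the main loop. volatile std::sig_atomic_t
// is the type that may safely be written from inside a signal handler.
volatile std::sig_atomic_t checkpoint_requested = 0;

// The handler does the minimum: record the request and return.
extern "C" void request_checkpoint(int) {
    checkpoint_requested = 1;
}

// Hypothetical helper: install the handler for the given signal (SIGINT by
// default). Returns true on success.
bool install_checkpoint_handler(int signum = SIGINT) {
    return std::signal(signum, request_checkpoint) != SIG_ERR;
}

// Hypothetical helper: poll this between work chunks. It reports whether a
// checkpoint was requested and clears the flag.
bool checkpoint_due() {
    if (!checkpoint_requested) return false;
    checkpoint_requested = 0;
    return true;
}
```

In a program shaped like chk.cpp, install_checkpoint_handler() would be called once before the outer loop, and checkpoint_due() would be tested at the point where the loop currently tests sig.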

On a PC, the chk.cpp program can be checkpointed interactively by pressing Ctrl+C. Under Slurm, the batch script in the section "Checkpointing upon hitting walltime" sends SIGINT (the same signal as Ctrl+C) to the compiled program roughly 70 s before the walltime limit, which triggers the checkpoint. The signal is hardcoded in the program, but can be changed. A running job can also be checkpointed manually with "scancel --signal=SIGINT <JOBID>".

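The --signal directive makes Slurm deliver the signal automatically, here 70 s before the end of the job. Slurm may send the signal up to 60 s earlier than requested. The lead time must be long enough for the program to finish its current outer-loop iteration and write the checkpoint file.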
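
The job is killed at walltime whether or not the write has finished, so an interrupted write can leave a truncated checkpoint.dat. One common guard, sketched here as an assumption and not as part of the original program, is to write to a temporary file and rename it over the old checkpoint only once the write has succeeded. On POSIX file systems rename() replaces the target in a single step, so a reader sees either the old or the new file. The names save_checkpoint and load_checkpoint and the ".tmp" suffix are hypothetical.

```cpp
#include <cstdio>
#include <fstream>
#include <iomanip>
#include <limits>
#include <string>

// Hypothetical helper: write the next loop index and the partial sum to
// "<path>.tmp", then rename that file over <path>. Returns true on success.
bool save_checkpoint(const std::string& path, long long next_i, double sum) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ofstream::out | std::ofstream::trunc);
        if (!out) return false;
        // max_digits10 significant digits are enough to restore a double exactly.
        out << next_i << ' '
            << std::setprecision(std::numeric_limits<double>::max_digits10)
            << sum << '\n';
        out.flush();
        if (!out) return false;
    }  // the stream is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Hypothetical helper: read a checkpoint back. The output arguments are only
// modified when both values were parsed successfully.
bool load_checkpoint(const std::string& path, long long& next_i, double& sum) {
    std::ifstream in(path);
    if (!in) return false;
    long long i_read = 0;
    double sum_read = 0.0;
    if (!(in >> i_read >> sum_read)) return false;
    next_i = i_read;
    sum = sum_read;
    return true;
}
```

With these helpers, the checkpoint branch of a program like chk.cpp would reduce to save_checkpoint("checkpoint.dat", i + 1, sum), and the startup code to load_checkpoint("checkpoint.dat", i, sum).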

Checkpointing upon hitting walltime

cat batch.sh

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=suspend
#SBATCH --mem=10M
#SBATCH --cpus-per-task=5
#SBATCH --qos=suspend
#SBATCH --job-name=checkpoint_test
#SBATCH --signal=SIGINT@70
#SBATCH --time=00:50:00

srun /home/blaz/checkpoint/test

Usually we want to automate things further. Wouldn't it be nice if the program would just magically requeue itself?

Automatic requeueing

This handy batch script automatically requeues a checkpointable program upon receiving the walltime --signal. It catches the signal requested with --signal (USR2) and sends SIGINT to the program. We can also trap multiple signals at once: swap "trap 'term_handler' USR2" for "trap 'term_handler' USR2 SIGCONT" and you can use this batch script on the checkpointing partition (change --partition and --qos accordingly).

cat signal.sh

#!/bin/bash

#SBATCH --time=00:03:00           # put proper time of reservation here
#SBATCH --partition=rude
#SBATCH --qos rude
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=4       # processes per node
#SBATCH --mem=10000               # memory resource
#SBATCH --job-name="rude"    # change to your job name
#SBATCH --output=signal.out        # change to proper file name or remove for defaults
#SBATCH --open-mode append
#SBATCH --signal=B:usr2@30        # send USR2 to the batch shell only, 30 s before the time limit


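# Runs when Slurm delivers USR2 shortly before the time limit
term_handler(){
    echo "Caught signal!"
    echo "$PID"
    # Ask the program to checkpoint itself
    kill -s SIGINT "$PID"
    # Give it time to finish the current iteration and write checkpoint.dat
    sleep 20
    # Resubmit this script; the new job continues from the checkpoint
    sbatch signal.sh
    exit
}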

trap 'term_handler' USR2

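# Start the program in the background so the shell can handle the signal
./test &
PID=$!

# Idle while the program runs; short sleeps let the trap fire promptly.
# The loop ends without requeueing if the program finishes on its own.
while kill -0 "$PID" 2>/dev/null
do
  sleep 1
done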

Unfortunately, sometimes you just don't have access to the source code. For such "black box" software, try checkpointing with DMTCP (Distributed MultiThreaded CheckPointing).