Checkpointing upon hitting walltime
Distributed MultiThreaded CheckPointing
A C++ program with built-in checkpointing capability. It checkpoints itself upon receiving SIGINT (the interrupt signal).
cat chk.cpp
#include <iostream>
#include <csignal>
#include <cstdlib>
#include <unistd.h>
#include <omp.h>
#include <string>
#include <fstream>
#include <iomanip>

// Kempner's series using a "9", result should be around 22.920676619264150,
// should be impossible to get anywhere near that number with this approach.

// New name for long long, because the old one is "too long"
typedef long long ll;

// 1st loop, non-important, this one just goes on and on...
const ll num1 = 100000;
// 2nd loop, which sets the time spent inside the OMP region; a bigger number means a longer time
const ll num2 = 100000000;
// Flag signalling that we have to do something
volatile std::sig_atomic_t sig = 0;

// This runs when we get "the signal". It informs us that the signal was received
// and sets the flag which triggers checkpointing when the program leaves the OMP region.
// write() is used instead of std::cout, because only async-signal-safe calls are allowed in a handler.
void handler(int)
{
    const char msg[] = "Received interrupt signal, scheduling a checkpoint.\n";
    ssize_t written = write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    (void)written;
    sig = 1;
}

// Main part of the program
int main(void)
{
    // We define 2 badly needed variables
    double sum = 0;
    ll i = 0;

    // Check if a checkpoint file is already present. If not, inform the user.
    std::cout << "Looking for a checkpoint file." << std::endl;
    std::ifstream input;
    input.open("checkpoint.dat", std::ifstream::in);
    if (input.fail())
        std::cout << "No checkpoint file, starting from scratch." << std::endl;
    // If a checkpoint file is present, continue from the point stored in it
    else
    {
        std::cout << "Found a checkpoint file, reading parameters and continuing." << std::endl;
        input >> i >> sum;
        input.close();
    }

    std::cout << "Starting signal handler." << std::endl;
    // Start signal handling
    signal(SIGINT, handler);

    // Read the number of available threads. omp_get_num_threads() would always return 1 here,
    // outside of a parallel region. A restarted run must use the same thread count,
    // because the range covered by each value of i depends on nth.
    int nth = omp_get_max_threads();

    // Outer loop, it "chops up" the parallelisation. It enables us to do the checkpointing etc.
    // Impact on performance is on the order of ~1/num^2, a.k.a. nonexistent
    for (; i < num1; i++)
    {
        // This will be the parallel loop, it does the sums.
        // It converges far too slowly to prove anything, but that's OK, since we want slow but steady convergence
        #pragma omp parallel for schedule(dynamic, 200) reduction(+:sum)
        for (ll j = 1 + i * nth * num2; j <= (i + 1) * nth * num2; j++)
        {
            std::string a = std::to_string(j);
            bool nine = false;
            for (auto x : a)
            {
                if (x == '9')
                {
                    nine = true;
                    break;
                }
            }
            if (nine) continue;
            sum += 1.0 / ((double)j);
        }

        // An output (so that we know the program is doing its magic)
        std::cout << "Intermediate output at i=" << i << std::fixed << std::setprecision(20) << ": sum=" << sum << std::endl;

        // Check for the signal; if it arrived, open the file, write the needed info and exit the program
        if (sig)
        {
            std::cout << "Checkpointing and exiting the program. " << std::endl;
            std::ofstream output;
            output.open("checkpoint.dat", std::ofstream::out);
            output << i + 1 << ' ' << std::fixed << std::setprecision(20) << sum;
            output.close();
            exit(SIGINT);
        }
    }

    // You shouldn't get here with the default num1 and num2. Let's just state it here for safety's sake...
    return 0;
}
Compile with:
g++ -std=c++11 -O3 -fopenmp -o test chk.cpp
On a PC, this example program can be checkpointed by pressing Ctrl+C. The following example batch script sends SIGINT (the signal behind Ctrl+C) to the compiled program roughly 70 s before the walltime runs out, at which point the program checkpoints itself. The signal is hardcoded in the program, but can be changed. A running job can also be checkpointed manually with "scancel --signal=SIGINT <JOBID>".
We can use the --signal directive to automatically checkpoint the above program 70 s before the end of the walltime. The program must be given enough time to finish writing the checkpoint file.
cat batch.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=suspend
#SBATCH --mem=10M
#SBATCH --cpus-per-task=5
#SBATCH --qos=suspend
#SBATCH --job-name=checkpoint_test
#SBATCH --signal=SIGINT@70
#SBATCH --time=00:50:00

srun /home/blaz/checkpoint/test
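The time margin before the walltime exists so that the checkpoint file can be written completely. If the job is killed in the middle of the write, a half-written checkpoint.dat could be left behind. One common way to guard against this is to write to a temporary file and rename it over the real checkpoint afterwards. The following is only a minimal sketch of that idea, not part of the program above: the function names write_checkpoint and read_checkpoint and the ".tmp" suffix are assumptions made for illustration, and the sketch does not force the data to disk with fsync.

```cpp
#include <cstdio>
#include <fstream>
#include <iomanip>
#include <string>

// Hypothetical helper: write the next loop index and the partial sum to a
// temporary file, then rename it over the real checkpoint file.
// Returns true on success.
bool write_checkpoint(const std::string& path, long long next_i, double sum)
{
    const std::string tmp = path + ".tmp"; // assumed temporary file name
    {
        std::ofstream out(tmp, std::ofstream::out | std::ofstream::trunc);
        if (!out) return false;
        out << next_i << ' ' << std::fixed << std::setprecision(20) << sum << '\n';
        out.flush();
        if (!out) return false;
    } // the temporary file is closed here

    // On POSIX file systems the rename replaces the old checkpoint in one step,
    // so a reader sees either the old file or the new one, never a mix.
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Hypothetical helper: read a checkpoint back.
// Returns false if the file is missing or cannot be parsed.
bool read_checkpoint(const std::string& path, long long& next_i, double& sum)
{
    std::ifstream in(path);
    return static_cast<bool>(in >> next_i >> sum);
}
```

In the example program, the block that opens checkpoint.dat for writing could call write_checkpoint("checkpoint.dat", i + 1, sum) instead, and the start-up code could call read_checkpoint to decide whether to start from scratch.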
Usually we want to automate things further. Wouldn't it be nice if the program would just magically requeue itself?
This handy batch script automatically requeues a checkpointable program upon receiving the walltime --signal. It catches the signal requested with --signal (USR2) and sends SIGINT to the program. We can also trap multiple signals at once: swap "trap 'term_handler' USR2" with "trap 'term_handler' USR2 SIGCONT" and you can use this batch script on the checkpointing partition (change --partition and --qos accordingly).
cat signal.sh
#!/bin/bash
#SBATCH --time=00:03:00 # put proper time of reservation here
#SBATCH --partition=rude
#SBATCH --qos rude
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=4 # processes per node
#SBATCH --mem=10000 # memory resource
#SBATCH --job-name="rude" # change to your job name
#SBATCH --output=signal.out # change to proper file name or remove for defaults
#SBATCH --open-mode append
#SBATCH --signal=B:usr2@30

term_handler(){
    echo "Caught signal!"
    echo $PID
    kill -s SIGINT $PID
    sleep 20
    sbatch signal.sh
    exit
}

trap 'term_handler' USR2

./test &
PID=$!

while :
do
    sleep 1
done
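The batch script translates USR2 into SIGINT because the example program listens for SIGINT only. When the source code is available, the program itself can listen for several signals instead. The following is a minimal sketch under that assumption, using POSIX sigaction; the names checkpoint_requested, request_checkpoint and install_checkpoint_handlers are hypothetical and do not appear in the program above.

```cpp
#include <signal.h>
#include <initializer_list>

// Hypothetical flag: set by the handler, polled by the main loop
// between two parallel regions.
volatile sig_atomic_t checkpoint_requested = 0;

// The handler only sets the flag, which is async-signal-safe.
extern "C" void request_checkpoint(int)
{
    checkpoint_requested = 1;
}

// Register the same handler for every signal in the list.
// Returns true only if all registrations succeeded.
bool install_checkpoint_handlers(std::initializer_list<int> signals)
{
    struct sigaction sa {};
    sa.sa_handler = request_checkpoint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; // restart system calls interrupted by the signal

    for (int s : signals)
    {
        if (sigaction(s, &sa, nullptr) != 0)
            return false; // e.g. SIGKILL and SIGSTOP cannot be caught
    }
    return true;
}
```

A call such as install_checkpoint_handlers({SIGINT, SIGUSR2}) would then replace the single signal(SIGINT, handler) call, and the main loop would test checkpoint_requested where it currently tests sig.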
Unfortunately, sometimes you just don't have access to the source code. For such "black box" software, try checkpointing with DMTCP (Distributed MultiThreaded CheckPointing).