(Paper #54)
The running times of many computational science programs are now significantly greater than the mean-time-between-failures (MTBF) of the hardware they run on. Therefore, fault-tolerance is becoming a critical issue on high-performance platforms. \emph{Checkpointing} is a technique for making programs fault tolerant by periodically saving their state and restoring this state after failure. In system-level checkpointing, the state of the entire machine is saved periodically on stable storage. This has too much overhead to be practical on high-performance platforms with thousands of processors. In practice, programmers do manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In an earlier paper, we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point constructs, and deals with the unique challenges of application-level checkpointing. This protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper we extend the protocol to handle MPI's collective communication constructs.
Keywords:
Scientific Computing
Compilers
Real time systems
Fault Tolerance and Reliability
Distributed Systems