Poster Title:  Fault-tolerant parallel-in-time integration
Poster Abstract: 

Resilience is one of the major topics in modern high-performance computing (HPC) research. On machines with millions of processors, the “mean time between failures” becomes a relevant issue for simulation scientists around the world. Incorporating countermeasures on the hardware side is expensive, difficult, time-consuming, or all of these at once. Much attention has therefore been paid to “algorithm-based fault tolerance” strategies, which exploit specific features of numerical methods to continue working even after a processor crashes or a bit flips. This research aims to use novel methods from the field of parallel-in-time integration to detect and correct such faults. “Parallel-across-the-steps” methods like Parareal or PFASST share features that make them natural candidates for algorithm-based fault tolerance: they hold copies of the (approximate) solution at different times on different processes, and they are iterative as well as hierarchical. First proofs of concept show that these properties make it possible to derive recovery strategies that continue integrating forward in time even when nodes fail or bits flip.


Poster ID:  C-18
Poster File:  PDF document C-18.pdf
Poster Image: 
Poster URL: