The HMDR Project:
Holistic, Measurement-Driven Resilience
Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact
A collaboration between the University of Illinois at Urbana-Champaign (UIUC); Sandia (SNL), Los Alamos (LANL), and Lawrence Berkeley (LBNL) National Laboratories; and Cray Inc.
In HPC systems to date, application resilience to failures has been achieved through the brute-force method of checkpoint/restart, which allows an application to make forward progress in the face of system faults, errors, and failures regardless of root cause or end result. It has remained the primary resilience mechanism because we lack a way to identify faults and predict their consequences early enough to take meaningful mitigating action. Because we have not yet operated at scales at which checkpoint/restart cannot help, vendors have had little motivation to provide the instrumentation necessary for early identification. However, as we move from petascale to exascale, shrinking mean time to failure (MTTF) will render the existing techniques ineffectual. Instrumentation that provides early indication of problems, together with tools that enable systems, OSes, and applications to act on that information, would offer an alternative, more scalable solution.
In this work, we build on experience and expertise accumulated over years of research on the design, monitoring, measurement, and assessment of resilient computing systems. Analysis of field data from current and past generations of extreme-scale systems has revealed several challenges that, if not addressed, may hinder the effectiveness of future exascale computing systems. Specifically, i) the file system and interconnect in current-generation large-scale systems already operate at the margins of resiliency and may not scale to larger deployments; ii) automated software-based failover mechanisms are frequently inadequate, such that failures during recovery may lead to system/application failures, including system-wide outages; iii) silent data corruption represents a critical fault mode and will require efficient detection mechanisms if next-generation applications are to take full advantage of exascale hardware; and iv) application-level resilience must be the final arbiter of system effectiveness.
To address the above challenges, we have assembled a team of world-renowned experts in resilient extreme-scale computing from the University of Illinois (Electrical and Computer Engineering, Computer Science, and NCSA), SNL, LANL, NERSC, and Cray. Our team includes representatives from centers that will house many of the largest HPC resources in the world, both today and over the coming years. The team has a unique track record of research in i) system and application failure characterization based on the analysis of field data, ii) data-driven design of fault/error detection mechanisms, and iii) experimental characterization of system/application resiliency. The team includes system owners/operators who can guarantee continuous data collection and access and ensure installation of appropriate analysis tools.
- William T. C. Kramer, Lead Principal Investigator (PI), NCSA
- James Brandt, Institutional PI, Sandia National Laboratories
- Ravi Iyer, Institutional PI, University of Illinois at Urbana-Champaign
- James Lujan, Institutional PI, Los Alamos National Laboratory
- Nicholas Wright, Institutional PI, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory