  1. “Measuring Resiliency of Extreme-Scale Computing Systems”, Z. Kalbarczyk, S. Jha, V. Formicola, C. Di Martino, R. Iyer, B. Kramer, Seminar at Advanced Digital Sciences Center (ADSC), 2016.

  2. “Achieving Resilience in Newer Application Domains”, R. Iyer, Z. Kalbarczyk, S. Jha, V. Formicola, C. Di Martino, B. Kramer, The 22nd IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2017), http://prdc.dependability.org/PRDC2017/keynote.html

  3. “Failure and Resiliency in the Shadow of Extreme Scale – Will our Current Assumptions Take Us in the Right Direction?”, William Kramer, Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA) @ IPDPS 2016.


  1. ECE 542 / CS 536 – Design of Fault-Tolerant Digital Systems
    Advanced concepts in hardware and software fault tolerance: fault models, coding in computer systems, module- and system-level fault detection mechanisms, reconfiguration techniques in multiprocessor systems and VLSI processor arrays, and software fault tolerance techniques such as recovery blocks, N-version programming, checkpointing, and recovery; survey of practical fault-tolerant systems.

  2. ECE/CS 498 – Data Science
    Many modern application domains require engineers and domain experts to work together on the design and analysis of heterogeneous datasets, often with the objective of automating decision making (sometimes referred to as actionable intelligence). Extracting the right level of knowledge from these datasets to generate actionable intelligence is a compelling problem. The course addresses this problem by giving students an opportunity to build analysis workflows that use data management, feature engineering, and supervised and unsupervised learning to derive real-world insights. Students work on real-world applications while interacting with domain experts. The course uses real-world examples (measurement logs from supercomputers, data on security incidents from NCSA) to teach data management, feature engineering, supervised and unsupervised learning, and testing and validation techniques.

In both courses, a mini project focuses on data-driven design and analysis of failure data.

