  1. “Measuring Resiliency of Extreme-Scale Computing Systems”, Z. Kalbarczyk, S. Jha, V. Formicola, C. Di Martino, R. Iyer, B. Kramer, Seminar at Advanced Digital Sciences Center (ADSC), 2016.

  2. “Achieving Resilience in Newer Application Domains”, R. Iyer, Z. Kalbarczyk, S. Jha, V. Formicola, C. Di Martino, B. Kramer, The 22nd IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2017), http://prdc.dependability.org/PRDC2017/keynote.html

  3. “Failure and Resiliency in the Shadow of Extreme Scale – Will our Current Assumptions Take Us in the Right Direction?”, William Kramer, Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA) @ IPDPS 2016.


  1. ECE 542 / CS 536 – Design of Fault-Tolerant Digital Systems
    Advanced concepts in hardware and software fault tolerance: fault models, coding in computer systems, module- and system-level fault detection mechanisms, reconfiguration techniques in multiprocessor systems and VLSI processor arrays, and software fault tolerance techniques such as recovery blocks, N-version programming, checkpointing, and recovery; survey of practical fault-tolerant systems.

  2. ECE/CS 498 Data Science
    Many modern application domains require engineers and domain experts to work together in the design and analysis of heterogeneous datasets, often with the objective of automating decision making (sometimes referred to as actionable intelligence). Extracting the right level of knowledge to generate actionable intelligence from these datasets is a compelling problem. The course addresses this problem by giving students an opportunity to build analysis workflows that use data management, feature engineering, and supervised and unsupervised learning to derive real-world insights. Students will have an opportunity to work on real-world applications while interacting with domain experts. The course uses real-world examples (measurement logs from supercomputers, data on security incidents from NCSA) to teach data management, feature engineering, supervised/unsupervised learning, and testing and validation techniques.

A mini-project in each of the two courses focuses on data-driven design and analysis of failure data.

Related Papers

  1. Di Martino, Catello, Saurabh Jha, William Kramer, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. "LogDiver: A tool for measuring resilience of extreme-scale systems and applications." In Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, pp. 11-18. ACM, 2015.

  2. Di Martino, Catello, William Kramer, Zbigniew Kalbarczyk, and Ravishankar Iyer. "Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs." In Dependable Systems and Networks (DSN), 2015 45th Annual IEEE/IFIP International Conference on, pp. 25-36. IEEE, 2015.

  3. C. Di Martino, Z. Kalbarczyk, R. Iyer, "Measuring the Resiliency of Extreme-Scale Computing Environments," in Principles of Performance and Reliability Modeling and Evaluation: Essays in Honor of Kishor Trivedi on His 70th Birthday, L. Fiondella, A. Puliafito, Eds., Springer International Publishing AG Switzerland, pp. 609–655, 2016.

  4. Baler: Deterministic, lossless log message clustering tool. N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun. In: Computer Science - Research and Development. Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3. Int'l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.

  5. New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup. J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER). Chicago, IL. Sept 2015.

  6. A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker. "Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications." IEEE/ACM Int'l. Conf. for High Performance Computing, Networking, Storage and Analysis (SC14). New Orleans, LA. Nov 2014.

  7. Michael Showerman, Jeremy Enos, Joseph Fullop, Paul Cassella, Nichamon Naksinehaboon, Narate Taerat, Thomas Tucker, James Brandt, Ann Gentile, and Benjamin Allan. "Large Scale System Monitoring and Analysis on Blue Waters using OVIS." Proc. Cray Users Group, 2014.

Community Interaction