- “Measuring Resiliency of Extreme-Scale Computing Systems”, Z. Kalbarczyk, S. Jha, V. Formicola, C. Di Martino, R. Iyer, B. Kramer, Seminar at Advanced Digital Sciences Center (ADSC), 2016.
- “Achieving Resilience in Newer Application Domains”, R. Iyer, Z. Kalbarczyk, S. Jha, V. Formicola, C. Di Martino, B. Kramer, The 22nd IEEE Pacific Rim International Symposium on Dependable Computing, http://prdc.dependability.org/PRDC2017/keynote.html
- “Failure and Resiliency in the Shadow of Extreme Scale – Will our Current Assumptions Take Us in the Right Direction?”, William Kramer, Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA) @ IPDPS 2016.
- ECE 542/CS 536 – Design of Fault-Tolerant Digital Systems
Advanced concepts in hardware and software fault tolerance: fault models, coding in computer systems, module- and system-level fault detection mechanisms, reconfiguration techniques in multiprocessor systems and VLSI processor arrays, and software fault tolerance techniques such as recovery blocks, N-version programming, checkpointing, and recovery; survey of practical fault-tolerant systems.
- ECE/CS 498 Data Science
Many modern application domains require engineers and domain experts to work together on the design and analysis of heterogeneous datasets, often with the objective of automating decision making (sometimes referred to as actionable intelligence). Extracting the right level of knowledge to generate actionable intelligence from these datasets is a compelling problem. The course addresses this problem by giving students an opportunity to build analysis workflows that use data management, feature engineering, and supervised and unsupervised learning to derive real-world insights. Students work on real-world applications while interacting with domain experts. The course uses real-world examples (measurement logs from supercomputers, data on security incidents from NCSA) to teach data management, feature engineering, supervised/unsupervised learning, and testing and validation techniques.
A mini-project in the two courses focuses on data-driven design and analysis of failure data.
- ECE 542/CS 536 focuses on quantifying the reliability of the Blue Waters memory subsystem, i.e., whether it fails, how frequently it fails, and how it fails. (https://courses.engr.illinois.edu/ece542/fa2017/)
- ECE/CS 498 DS extends the analysis further by introducing machine learning techniques (such as probabilistic graphical models) to identify root causes of the failures. (https://courses.engr.illinois.edu/ece498dsu/sp2018/)
- Dataset used in the mini-project: syslog data from the system (particularly Machine Check Exceptions) was split into 4-month periods, pre-processed (parsed and cleaned up), and tabulated for use in this project.
- Di Martino, Catello, Saurabh Jha, William Kramer, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. "Logdiver: A tool for measuring resilience of extreme-scale systems and applications." In Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, pp. 11-18. ACM, 2015.
- Di Martino, Catello, William Kramer, Zbigniew Kalbarczyk, and Ravishankar Iyer. "Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs." In Dependable Systems and Networks (DSN), 2015 45th Annual IEEE/IFIP International Conference on, pp. 25-36. IEEE, 2015.
- C. Di Martino, Z. Kalbarczyk, R. Iyer, "Measuring the Resiliency of Extreme-Scale Computing Environments," in Principles of Performance and Reliability Modeling and Evaluation: Essays in Honor of Kishor Trivedi on His 70th Birthday, L. Fiondella, A. Puliafito, Eds., Springer International Publishing AG Switzerland, pp. 609–655, 2016.
- Baler: Deterministic, lossless log message clustering tool. N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun. In: Computer Science - Research and Development. Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3. Int'l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.
- New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup. J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int'l. Conf. on Cluster Computing (CLUSTER). Chicago, IL. Sept 2015.
- A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker. "Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications." IEEE/ACM Int'l. Conf. for High Performance Storage, Networking, and Analysis (SC14). New Orleans, LA. Nov 2014.
- Michael Showerman, Jeremy Enos, Joseph Fullop, Paul Cassella, Nichamon Naksinehaboon, Narate Taerat, Thomas Tucker, James Brandt, Ann Gentile, and Benjamin Allan. "Large Scale System Monitoring and Analysis on Blue Waters using OVIS." Proc. Cray Users Group, 2014.
- HMDR participates in the Organizing and Program Committees of the Workshop on Monitoring and Analysis for HPC Systems Plus Applications Series at IEEE Cluster (2014-Current).
- HMDR hosts a Community Vendor-Neutral Website: Monitoring Large-Scale HPC Systems: https://sites.google.com/site/monitoringlargescalehpcsystems/.
- HMDR hosts a regular BoF series at Supercomputing and CUG (SC14-17 and CUG 16-18).
- HMDR provided leadership and participation in the Cray System Monitoring Working Group (SMWG) -- an international group of Cray sites seeking to improve monitoring and analysis on large-scale platforms.
- Birds-of-a-Feather Sessions, FRESCO: An Open Failure Data Repository for Dependability Research and Practice, SC 2015.