HMDR Publications

Papers and Presentations

S. Leak, A. Greiner, A. Gentile, and J. Brandt, "Supporting Failure Analysis with Discoverable, Annotated Log Datasets," accepted to Cray Users Group (CUG), Stockholm, Sweden. May 2018.

About: Framework for describing and attaching observations to a dataset, and tools to aid in discovery and publication of system-related insights.

"Network Congestion in Supercomputers," in submission (double-blind).

"Data-driven Application-oriented Reliability Model of a High-Performance Computing System," in submission (double-blind).

S. Jha et al., "Holistic Measurement-Driven System Assessment," 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, 2017, pp. 797-800. doi: 10.1109/CLUSTER.2017.124

About: integrated capabilities for holistic monitoring and analysis to understand and characterize propagation of performance-degrading events.

V. Formicola, S. Jha, F. Deng, D. Chen, A. Bonnie, M. Mason, A. Greiner, A. Gentile, J. Brandt, L. Kaplan, J. Repik, J. Enos, M. Showerman, Z. Kalbarczyk, W. Kramer, and R. Iyer. "Data-Driven Understanding of Fault Scenarios and Impacts Through Fault Injection: Experimental Campaign in Cielo." Cray Users Group (CUG), May 2017. Highlight slide
About: Results from fault injection study on ACES Cielo System. Includes elucidation of fault to failure paths; how those are manifested in the log files; timelines for onset and recovery; and failures of failover paths.
J. Brandt, E. Froese, A. Gentile, L. Kaplan, B. Allan, and E. Walsh, "Network Performance Counter Monitoring and Analysis on the Cray XC Platform," In Proc. Cray User’s Group (CUG), London, England, April 2016.
About: High Speed Network counter collection and analysis from ACES Trinity. Counters indicative of error, performance, and status issues in the Aries network.
A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, M. Mason, J. Brandt, A. Gentile, and B. Allan, "Design and Implementation of a Scalable Monitoring System for Trinity," In Proc. Cray User’s Group (CUG), London, England, April 2016.
About: Architecture to enable scalable data collection and analysis for resilience applied to ACES Trinity system.
S. Sanchez, A. Bonnie, G. Van Heule, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt, "Design and Implementation of a Scalable HPC Monitoring System," In Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) in conjunction with IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Chicago, IL, USA, May 2016.
About: Architecture to enable scalable data collection and analysis for resilience.
S. Jha, V. Formicola, C. Di Martino, Z. Kalbarczyk, W. Kramer, and R. Iyer, "Analysis of Gemini Interconnect Recovery Mechanisms: Methods and Observations," In Proc. Cray User’s Group (CUG), London, England. April 2016.
About: This paper presents methodology and tools to understand and characterize the recovery mechanisms of the Gemini interconnect system from raw system logs. The tools can assess the impact of these recovery mechanisms on the system and user workloads.
K. Chung, V. Formicola, Z. Kalbarczyk, R. Iyer, A. Withers, and A. J. Slagell. "Attacking supercomputers through targeted alteration of environmental control: A data driven case study." In Communications and Network Security (CNS), 2016 IEEE Conference on, pp. 406-410. IEEE, 2016.
About: The paper demonstrates that the control systems of chilled water in Blue Waters’ facilities can be used as entry points by an attacker to indirectly compromise the computing functionality.
S. Jha, V. Formicola, C. Di Martino, M. Dalton, W. Kramer, Z. Kalbarczyk, and R. Iyer. "Resiliency of HPC Interconnects: A case study of interconnect failures and recovery in Blue Waters." in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2737537. Highlight slide
About: This study characterizes the recovery procedures of the Gemini interconnect in Blue Waters, the largest Gemini network built by Cray.

Datasets

Note: All datasets are available via Globus for efficient file transfer. We recommend using Globus for datasets larger than a few GB. To use Globus, point your browser to www.globus.org, log in, and go to the File Manager page. (If you don't have a Globus account, you can sign up for free at globus.org.) In the Collection field on the left, select HMDR Datasets; on the right, select the endpoint to which you want to transfer the files. You do not need to authenticate to the NERSC endpoint, but you must meet the authentication requirements of the destination. Select the files you want, then click the right-pointing Start arrow to transfer them.

Mutrino Dataset 2/15-6/16 (12/16 Release), J. Brandt, A. Gentile, and J. Repik, SAND2016-12310 O, Dec 2016.
This is a unique resource for resilience studies consisting of 1 year of logs from the ACES Trinity testbed system Mutrino, including production, standup, and induced network, electrical, thermal, and functional failures.
Mutrino 2/15-6/16 Dataset (3.5 GB .tgz file) | About the dataset (80 K PDF)

Mutrino Dataset 2/15-5/15, J. Brandt, A. Gentile, and J. Repik, SAND2016-2449 O, Mar 2016.
This is a unique resource for resilience studies consisting of the first 3 months of logs from the ACES Trinity testbed system Mutrino, including production, standup, and induced network, electrical, thermal, and functional failures. A timeline of system events, including errors, tests, and software upgrades is included.
Mutrino 2/15-5/15 Dataset (648 MB .tgz file) | About the dataset (360 K PDF) | Annotations (includes Anno, a tool for querying log annotations)

Cielo Fault Injection Artifact 2016, S. Jha, V. Formicola, A. Bonnie, M. Mason, D. Chen, F. Deng, A. Gentile, J. Brandt, L. Kaplan, J. Repik, J. Enos, M. Showerman, A. Greiner, Z. Kalbarczyk, R. Iyer, and B. Kramer. LA-UR-19-22749, SAND2019-3531 O, Mar 2019.
This dataset consists of selected and edited logs from a set of fault injection experiments run by the multi-site Holistic Measurement Driven Resilience (HMDR) project. The runs were performed in 2016 on the ACES Cielo machine, a ~9000 node Cray XE system sited at Los Alamos National Laboratory. This is a unique dataset from the investigation of simple and complex faults on a large-scale system, and it includes numerical data demonstrating the resultant effects on the system and applications. Faults investigated include: node down, blade down, link down, multiple links forming a directional connection (e.g. X+), and multiple concurrent connections down. In addition, this dataset includes artifacts from the analysis of the log files using LogDiver; that analysis was published as "Understanding Fault Scenarios and Impacts Through Fault Injection Experiments in Cielo", V. Formicola et al., Proc. Cray Users Group, 2017. Thus, this dataset comprises an artifact for that work.
Cielo Fault Injection Artifact (152 MB .tgz file) | About the dataset (31 K text file) | Annotations (includes Anno, a tool for querying log annotations)
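
The datasets above are distributed as .tgz archives. The following is a minimal sketch, not part of the HMDR tooling, of how a downloaded archive might be inspected and unpacked with Python's standard tarfile module. The function names and the example file name are hypothetical; the internal layout of each archive is described in its "About the dataset" document, and nothing about it is assumed here.

```python
import tarfile
from pathlib import Path


def summarize_archive(archive_path):
    """Return (member_count, total_uncompressed_bytes) for a .tgz archive."""
    count = 0
    total = 0
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            count += 1
            if member.isfile():
                total += member.size
    return count, total


def safe_extract(archive_path, dest_dir):
    """Extract the archive, refusing members that would land outside dest_dir."""
    dest = Path(dest_dir).resolve()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)


# Hypothetical usage; "mutrino.tgz" stands in for whatever name the
# downloaded file has on your system:
#   count, size = summarize_archive("mutrino.tgz")
#   safe_extract("mutrino.tgz", "mutrino_logs")
```

Checking the uncompressed size first is worthwhile for the larger archives, since a 3.5 GB .tgz of text logs can expand considerably on disk.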