Documentation & Data

Summary of Tools
Summary of Fault Injection Campaigns
Summary of Log Analysis
Summary of Annotations
Representative Data Sources and Sizes


Summary of Tools


Develop/enhance tools to support high-fidelity, low-impact data collection, log file and numeric data processing, and publication of whole-system datasets for use by both HMDR and external researchers


Baler new features and functionalities

Log Diver new features and functionalities

LDMS new features and functionalities

HPCArrow features and functionalities (new for HMDR)

Annotation features and functionalities (new for HMDR)

Log Cleansers features and functionalities (new for HMDR)

Summary of Fault Injection Campaigns


Augment production datasets with controlled simple and complex fault scenarios through fault injection


Cielo (9000-node Cray XE), JYC (Blue Waters Test and Development Platform Cray XE6/XK7), Mutrino (Trinity Test and Development system Cray XC40), Voltrino (Test system Cray XC40)


Workload execution time

30 minutes per application


Workloads are executed without injected faults to measure baseline execution time and produce golden output.

Injection Types

Network-topology-aware single and multiple network link failures and errors; compute blade failures

Data Collected

Error logs, resource utilization metrics (e.g., network send/receive bytes, file system read/write bytes, CPU and memory utilization)
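
The role of the golden output and the measured execution time can be illustrated with a minimal sketch. It is not the campaign's actual tooling: the `RunResult` type, the `compare_to_golden` function, the three outcome labels, and the 10% slowdown limit are all hypothetical names and values chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RunResult:
    seconds: float  # wall-clock execution time of the workload
    output: bytes   # application output captured from the run


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest so outputs can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def compare_to_golden(golden: RunResult, injected: RunResult,
                      slowdown_limit: float = 1.10) -> str:
    """Classify a fault-injected run against the golden run.

    The labels and the default slowdown limit are assumptions made for
    this sketch, not values taken from the campaigns.
    """
    if digest(injected.output) != digest(golden.output):
        return "output mismatch"
    if injected.seconds > golden.seconds * slowdown_limit:
        return "slowdown"
    return "no observable effect"


# Golden run of about 30 minutes, followed by three injected runs.
golden = RunResult(seconds=1800.0, output=b"result")
print(compare_to_golden(golden, RunResult(1805.0, b"result")))
print(compare_to_golden(golden, RunResult(2400.0, b"result")))
print(compare_to_golden(golden, RunResult(1800.0, b"other")))
```

Comparing digests rather than raw output keeps the check inexpensive for large outputs; applications with nondeterministic output would need a tolerance-based comparison instead.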

Summary of Log Analysis


Characterize fault scenarios and fault-to-failure propagation using HMDR tools that discover, extract, and assess patterns and pattern relationships from log lines


Baler -- Discovers log and numeric patterns with no user input

Log Diver -- Determines statistics and relationships of event sequences from log patterns

Datasets Analyzed

Blue Waters (27000-node Cray XE/XK)

Trinity (20000-node Cray XC40)

Mutrino (Trinity Test and Development system Cray XC40)

Edison (Cray XC30)

Cielo (9000-node Cray XE)

Summary of Annotations


Enable increased understanding of production log datasets through expert annotations of (a) platform- and architecture-specific events and (b) non-log events (e.g., system operational events).

Datasets Annotated

Mutrino (Trinity Test and Development system Cray XC40)

Cielo (9000-node Cray XE) - Fault injection experiments

Representative Data Sources and Sizes

Blue Waters: About 10.6 trillion data points through July 2017

Amount of data by source as of July 2017

Data feed                    Average (Bytes/day)  Max (Bytes/day)  Class
apres                        30M                  148M             logs
apstat                       60K                  62K              metrics
backup                       40K                  74K              metrics
ddn                          43K                  326K             logs
esms                         1G                   3G               logs
hpss                         135M                 3.6G             logs
hpss_core                    112K                 192K             metrics
ibswitch                     790K                 801K             logs
moab                         2.5G                 3G               logs
qos-ping                     3.3M                 3.6M             metrics
quotas-hpss                  944K                 950K             metrics
scheduler                    76K                  78K              metrics
cabinet env/pwr/temp/status  45M                  45M              metrics
SEL                          1K                   6K               logs
sonexion                     250M                 3.5G             logs
sonexion perf home           4.5G                 4.5G             metrics
sonexion perf projects       4.5G                 4.5G             metrics
sonexion perf scratch        4.5G                 4.5G             metrics
spectra                      1.5K                 9K               logs
Mainframe LLM                4G                   120G             logs
Torque                       75M                  359M             logs
volkseti                     19M                  19M              metrics
OVIS                         135G                 135G             metrics
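
To work with these figures programmatically, the suffixed sizes first need converting to byte counts. The following is a minimal sketch, assuming K, M, and G are decimal multipliers (the source does not state whether they are decimal or binary). The `parse_size` and `average_by_class` helpers are hypothetical names, and only a few rows from the table are included.

```python
# Assumption: decimal multipliers; the table does not say decimal or binary.
UNITS = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_size(text: str) -> float:
    """Convert a string such as '4.5G' or '790K' to a number of bytes."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)


# A subset of rows copied from the table: (feed, average, max, class).
ROWS = [
    ("esms", "1G", "3G", "logs"),
    ("moab", "2.5G", "3G", "logs"),
    ("Mainframe LLM", "4G", "120G", "logs"),
    ("sonexion perf scratch", "4.5G", "4.5G", "metrics"),
    ("OVIS", "135G", "135G", "metrics"),
]


def average_by_class(rows):
    """Sum the average bytes/day of the given rows, grouped by class."""
    totals = {}
    for _feed, avg, _max, cls in rows:
        totals[cls] = totals.get(cls, 0.0) + parse_size(avg)
    return totals


for cls, total in average_by_class(ROWS).items():
    print(f"{cls}: {total / 10**9:.1f} GB/day on average (subset of feeds)")
```

Extending `ROWS` with the remaining feeds gives per-class totals for the whole table under the same decimal-unit assumption.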