Documentation & Data
Activities
Summary of Tools
Summary of Fault Injection Campaigns
Summary of Log Analysis
Summary of Annotations
Representative Data Sources and Sizes
Activities
Summary of Tools
About
Develop/enhance tools to support high-fidelity, low-impact data collection, log file and numeric data processing, and publication of whole-system datasets for use by both HMDR and external researchers
Tools
- Baler -- Discovers log and numeric patterns with no user input
- Log Diver -- Determines statistics and relationships of event sequences from log patterns
- Lightweight Distributed Metric Service (LDMS) -- Lightweight, whole-system data collection resulting in coherent system “snapshots”
- HPCArrow -- Fault injection and recovery
- Annotations -- Annotation representation and search
- Log Cleansers -- Anonymize logs for release
Baler new features and functionalities
- User-defined pattern annotations to associate metadata with patterns
- Format-specific filters to handle multiple log types
- Data-type recognition in patterns (e.g., host, router link, hex, num, char_dump) to facilitate search
- Domain-specific pattern weighting to identify patterns of particular interest
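As a rough illustration of what data-type recognition in patterns can look like, the sketch below collapses log lines into patterns by replacing variable tokens with typed placeholders. It is a toy example, not Baler's algorithm: the regexes in `TYPE_RULES`, the placeholder spellings, and the sample log lines are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical type rules, checked in order. The type names echo the list
# above, but the regexes are assumptions, not Baler's definitions.
TYPE_RULES = [
    ("<host>", re.compile(r"^nid\d+$")),
    ("<hex>", re.compile(r"^0x[0-9a-fA-F]+$")),
    ("<num>", re.compile(r"^\d+$")),
]

def to_pattern(line: str) -> str:
    """Replace variable tokens with type placeholders and keep the rest."""
    out = []
    for token in line.split():
        for placeholder, rule in TYPE_RULES:
            if rule.match(token):
                out.append(placeholder)
                break
        else:
            out.append(token)  # no rule matched: treat the token as constant text
    return " ".join(out)

def discover_patterns(lines):
    """Count how many log lines fall under each pattern."""
    return Counter(to_pattern(line) for line in lines)

# Made-up log lines for illustration only.
sample = [
    "nid00012 machine check error at 0x7ffe12 count 3",
    "nid00400 machine check error at 0x1a2b count 17",
    "nid00012 link failed",
]
patterns = discover_patterns(sample)
```

Here the two machine-check lines collapse into a single pattern, and the typed placeholders make it possible to search for, say, every pattern that mentions a host.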
Log Diver new features and functionalities
- New Log Diver plugin to Baler enables on-the-fly search of the Baler pattern database for regular expressions indicative of defined faults and/or failures.
- The previous version required reprocessing the log files whenever a new regular expression was added
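The benefit of searching the pattern database rather than the raw logs can be sketched as follows. This is a minimal illustration under assumed data structures: `PATTERNS`, `OCCURRENCES`, and `find_events` are hypothetical names, not the Log Diver or Baler API.

```python
import re

# Hypothetical pattern database: pattern id -> pattern text.
PATTERNS = {
    1: "<host> machine check error at <hex>",
    2: "<host> link failed",
    3: "<host> job <num> started",
}

# Hypothetical occurrence index: pattern id -> (timestamp, host) pairs.
OCCURRENCES = {
    1: [(100, "nid00012"), (160, "nid00400")],
    2: [(130, "nid00012")],
    3: [(90, "nid00007")],
}

def find_events(regex: str):
    """Match the regex against pattern texts only, then look up occurrences."""
    rx = re.compile(regex)
    hits = [pid for pid, text in PATTERNS.items() if rx.search(text)]
    events = [(ts, host, pid) for pid in hits for ts, host in OCCURRENCES[pid]]
    return sorted(events)  # time-ordered event sequence

# A newly defined fault expression needs no pass over the raw log files.
fault_events = find_events(r"machine check|link failed")
```

Because the regular expression is evaluated against a few patterns instead of billions of log lines, adding a new fault definition is cheap in this scheme.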
LDMS new features and functionalities
- High-fidelity (sub-second) data sampling with output to arrays, supporting lower-frequency transport without data loss
- Enables collection of power-related data at native sampling rates (e.g., 10 Hz on Cray XC)
- No statistically significant impact on application run times while collecting ~1500 metrics per second per node at scales of ~10,000 nodes
- Numerical data enhances association of fault events with application and system impact
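The idea of sampling at a high rate into arrays while transporting at a lower rate can be sketched as below. This is not LDMS code: the `ArrayMetric` class, the 1 Hz transport rate, and the simulated values are assumptions, and only the 10 Hz sampling rate comes from the text above.

```python
from collections import deque

class ArrayMetric:
    """Buffer high-rate samples and hand them off as one array per transport."""

    def __init__(self, capacity: int):
        # The capacity must cover one full transport interval; a bounded
        # deque silently drops the oldest sample if it is exceeded.
        self.buffer = deque(maxlen=capacity)

    def sample(self, timestamp: float, value: float) -> None:
        self.buffer.append((timestamp, value))

    def collect(self):
        """Return all buffered samples and clear the buffer."""
        batch = list(self.buffer)
        self.buffer.clear()
        return batch

SAMPLE_HZ = 10     # native power sampling rate quoted above for Cray XC
TRANSPORT_HZ = 1   # assumed lower-frequency transport rate
per_batch = SAMPLE_HZ // TRANSPORT_HZ

metric = ArrayMetric(capacity=per_batch)
batches = []
for tick in range(30):  # three simulated seconds, no real clock involved
    metric.sample(tick / SAMPLE_HZ, 100.0 + tick)
    if (tick + 1) % per_batch == 0:
        batches.append(metric.collect())
```

Each transport then carries ten timestamped samples, so no sub-second data is lost even though data moves only once per second. For scale, ~1500 metrics per second per node across ~10,000 nodes works out to roughly 15 million metric values per second system-wide.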
HPCArrow features and functionalities (new for HMDR)
- Interactive user interface for network-topology-aware injection of single and multiple faults and errors, and for fault recovery
- Automated logging of fault campaigns, system response events, and timestamps
Annotation features and functionalities (new for HMDR)
- Schema to support annotations and metadata and to integrate them with job and architecture data
- Interaction with Baler for event annotation and extraction
- Search interface for determining annotation types and possible job impact
Log Cleansers features and functionalities (new for HMDR)
- Tools and a configuration file specification for cleansing logs for release
- Tag substitution retains log lines while preserving anonymity; exclusions are also supported
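Tag substitution with exclusions might look roughly like the following. This is a sketch under assumptions, not the Log Cleansers implementation or its configuration format: the `Cleanser` class, the rule regexes, the tag spellings, and the sample lines are all hypothetical.

```python
import re

class Cleanser:
    """Replace sensitive tokens with stable tags; leave excluded tokens alone."""

    def __init__(self, rules, exclusions=()):
        # rules: list of (tag prefix, regex); both are illustrative assumptions.
        self.rules = [(prefix, re.compile(rx)) for prefix, rx in rules]
        self.exclusions = set(exclusions)
        self.tags = {}  # (prefix, original) -> tag, so the same value always maps to the same tag

    def _tag(self, prefix: str, original: str) -> str:
        if original in self.exclusions:
            return original
        key = (prefix, original)
        if key not in self.tags:
            count = sum(1 for p, _ in self.tags if p == prefix) + 1
            self.tags[key] = f"<{prefix}{count}>"
        return self.tags[key]

    def cleanse(self, line: str) -> str:
        for prefix, rx in self.rules:
            line = rx.sub(lambda m, p=prefix: self._tag(p, m.group(0)), line)
        return line

cleanser = Cleanser(
    rules=[
        ("USER", r"(?<=user=)\w+"),               # value after "user="
        ("IP", r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),   # dotted-quad address
    ],
    exclusions={"root"},  # kept in the clear
)

# Made-up log lines for illustration only.
cleaned = [cleanser.cleanse(line) for line in [
    "login user=alice from 10.0.0.5",
    "login user=root from 10.0.0.5",
    "logout user=alice from 10.0.0.9",
]]
```

Because each original value maps to one stable tag, every log line is retained and events involving the same user or address can still be correlated after cleansing.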
Summary of Fault Injection Campaigns
About
Augment production datasets with controlled simple and complex fault scenarios through fault injection
Platforms
Cielo (9,000-node Cray XE), JYC (Blue Waters Test and Development Platform, Cray XE6/XK7), Mutrino (Trinity Test and Development system, Cray XC40), Voltrino (test system, Cray XC40)
Applications
- MPI - PSDNS, AWP-ODC (Cray MPI), MILC (Intel MPI)
- Charm++ - Kripke, AMR, LeanMD (Hugepages + SMP over ugni), NAMD (SMP over ugni)
- PGAS - UPC-FT
Workload execution time
30 minutes per application
Baselines
Workload execution to measure execution time and produce golden output.
Injection Types
Network-topology-aware failures and errors of single and multiple network links; failures of compute blades
Data Collected
Error logs and resource utilization metrics (e.g., network send/receive bytes, filesystem read/write bytes, CPU and memory utilization)
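One way a baseline (execution time plus golden output) could be used to judge an injection run is sketched below. The outcome categories, the 10% slowdown threshold, and the byte-exact output comparison are assumptions for illustration, not the campaign's actual criteria; scientific outputs may well need a numerical tolerance instead of a hash.

```python
import hashlib

def digest(output: bytes) -> str:
    """Fingerprint an application's output for comparison with the golden run."""
    return hashlib.sha256(output).hexdigest()

def classify_run(baseline_seconds, golden_digest, run_seconds, run_output,
                 completed=True, slowdown_threshold=1.10):
    """Classify one injection run against its fault-free baseline.

    The categories and the threshold are hypothetical choices.
    """
    if not completed:
        return "failed"
    if digest(run_output) != golden_digest:
        return "wrong output"
    if run_seconds / baseline_seconds > slowdown_threshold:
        return "slowdown"
    return "no impact"

# Baseline: a fault-free 30-minute run (1800 s) with made-up output.
golden = digest(b"result 42\n")
outcome = classify_run(1800, golden, 2250, b"result 42\n")
```

In this example the run finishes with the correct output but takes 25% longer than the baseline, so it is classified as a slowdown.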
Summary of Log Analysis
About
Characterize fault scenarios and fault-to-failure propagation using HMDR tools that discover, extract, and assess patterns and pattern relationships from log lines
Tools
Baler -- Discovers log and numeric patterns with no user input
Log Diver -- Determines statistics and relationships of event sequences from log patterns
Datasets Analyzed
Blue Waters (27,000-node Cray XE/XK)
- 3 months, 3.4 billion log lines, resulting in 150,000 patterns
Trinity (20,000-node Cray XC40)
- Open Science 1 - HSW, ~5 months, 2.5 billion log lines, resulting in 52,000 patterns
- Open Science 2 - KNL, ~5 months, 4 billion log lines, resulting in 500,000 patterns
Mutrino (Trinity Test and Development system, Cray XC40)
- 1 year (including standup)
- 3 months (including standup), 150 million log lines, resulting in 12,000 patterns and 1,900 weighted patterns
Edison (Cray XC30)
- 5 months
Cielo (9,000-node Cray XE)
- Fault injection experiments, resulting in 600 patterns and 291 weighted patterns
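To get a feel for how much pattern discovery condenses the logs, the figures above can be turned into an average number of log lines per pattern. The sketch uses only the datasets for which both counts are listed; since the source counts are rounded, the ratios are order-of-magnitude indications, and the dictionary keys are just labels chosen here.

```python
# (log lines, patterns) as listed above; the counts are rounded in the source.
DATASETS = {
    "Blue Waters (3 months)": (3_400_000_000, 150_000),
    "Trinity Open Science 1 (HSW)": (2_500_000_000, 52_000),
    "Trinity Open Science 2 (KNL)": (4_000_000_000, 500_000),
    "Mutrino (3 months)": (150_000_000, 12_000),
}

def lines_per_pattern(lines: int, patterns: int) -> float:
    """Average number of log lines represented by one pattern."""
    return lines / patterns

ratios = {
    name: round(lines_per_pattern(*counts))
    for name, counts in DATASETS.items()
}
```

In each of these datasets a single pattern stands in for thousands to tens of thousands of log lines on average.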
Summary of Annotations
About
Enable increased understanding of production log datasets through expert annotations of (a) platform and architecture specific events and (b) non-log events (e.g., system operational events).
Datasets Annotated
Mutrino (Trinity Test and Development system, Cray XC40)
- 3 months (including standup), annotations for 130 log patterns and 90 non-log events, resulting in 860,000 annotated log lines and events.
Cielo (9,000-node Cray XE) - Fault injection experiments
- Annotations for 13 log patterns and 200 non-log events, resulting in 23,000 annotations
Representative Data Sources and Sizes
Blue Waters: about 10.6 trillion data points as of July 2017
Amount of data by source as of July 2017
Data feed | Average (bytes/day) | Max (bytes/day) | Class |
---|---|---|---|
apres | 30M | 148M | logs |
apstat | 60K | 62K | metrics |
backup | 40K | 74K | metrics |
ddn | 43K | 326K | logs |
esms | 1G | 3G | logs |
hpss | 135M | 3.6G | logs |
hpss_core | 112K | 192K | metrics |
ibswitch | 790K | 801K | logs |
moab | 2.5G | 3G | logs |
qos-ping | 3.3M | 3.6M | metrics |
quotas-hpss | 944K | 950K | metrics |
scheduler | 76K | 78K | metrics |
cabinet env/pwr/temp/status | 45M | 45M | metrics |
SEL | 1K | 6K | logs |
sonexion | 250M | 3.5G | logs |
sonexion perf home | 4.5G | 4.5G | metrics |
sonexion perf projects | 4.5G | 4.5G | metrics |
sonexion perf scratch | 4.5G | 4.5G | metrics |
spectra | 1.5K | 9K | logs |
Mainframe LLM | 4G | 120G | logs |
Torque | 75M | 359M | logs |
volkseti | 19M | 19M | metrics |
OVIS | 135G | 135G | metrics |
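The table can be totalled per class with a short script such as the one below. It assumes the K/M/G suffixes are decimal multiples (10^3, 10^6, 10^9), which the source does not state; with binary multiples the totals would be somewhat larger. The function names are illustrative.

```python
from collections import defaultdict
from decimal import Decimal

# Assumption: suffixes are decimal multiples (K = 10**3, M = 10**6, G = 10**9).
UNITS = {"K": 10**3, "M": 10**6, "G": 10**9}

# Rows copied from the table above.
TABLE = """\
apres | 30M | 148M | logs |
apstat | 60K | 62K | metrics |
backup | 40K | 74K | metrics |
ddn | 43K | 326K | logs |
esms | 1G | 3G | logs |
hpss | 135M | 3.6G | logs |
hpss_core | 112K | 192K | metrics |
ibswitch | 790K | 801K | logs |
moab | 2.5G | 3G | logs |
qos-ping | 3.3M | 3.6M | metrics |
quotas-hpss | 944K | 950K | metrics |
scheduler | 76K | 78K | metrics |
cabinet env/pwr/temp/status | 45M | 45M | metrics |
SEL | 1K | 6K | logs |
sonexion | 250M | 3.5G | logs |
sonexion perf home | 4.5G | 4.5G | metrics |
sonexion perf projects | 4.5G | 4.5G | metrics |
sonexion perf scratch | 4.5G | 4.5G | metrics |
spectra | 1.5K | 9K | logs |
Mainframe LLM | 4G | 120G | logs |
Torque | 75M | 359M | logs |
volkseti | 19M | 19M | metrics |
OVIS | 135G | 135G | metrics |
"""

def to_bytes(size: str) -> int:
    """Convert a value such as '3.3M' to bytes; Decimal avoids float rounding."""
    return int(Decimal(size[:-1]) * UNITS[size[-1]])

def average_by_class(table: str) -> dict:
    """Sum the average bytes/day column for each class."""
    totals = defaultdict(int)
    for row in table.strip().splitlines():
        feed, average, maximum, data_class = [c.strip() for c in row.split("|")][:4]
        totals[data_class] += to_bytes(average)
    return dict(totals)

totals = average_by_class(TABLE)
```

Under the decimal-suffix assumption, the average volume comes to roughly 8 GB/day of logs and roughly 148.6 GB/day of metrics, with the OVIS feed alone contributing 135 GB/day.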