Cielo Fault Injection Artifact 2016
Saurabh Jha, Valerio Formicola, Amanda Bonnie, Mike Mason, Daniel Chen,
Fei Deng, Ann Gentile, Jim Brandt, Larry Kaplan, Jason Repik, Jeremy Enos,
Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar Iyer,
and Bill Kramer.
===============================================================
===============================================================

This dataset consists of selected and edited logs from a set of fault
injection experiments run by the Holistic Measurement Driven Resilience
(HMDR) project, a multi-site collaboration. The runs were performed in
2016 on the ACES Cielo machine, a ~9000-node Cray XE system sited at
Los Alamos National Laboratory. This is a unique dataset from the
investigation of simple and complex faults on a large-scale system; it
includes numerical data demonstrating the resultant effects on the
system and applications.

Faults investigated include: node down, blade down, link down, multiple
links forming a directional connection (e.g., X+), and multiple
concurrent connections down.

In addition, this dataset includes artifacts from analysis of the log
files using LogDiver, which was published as "Understanding Fault
Scenarios and Impacts Through Fault Injection Experiments in Cielo",
V. Formicola et al., Proc. Cray User Group, 2017. Thus, this dataset
comprises an artifact for that work.

If you refer to this dataset, please cite it as:

Cielo Fault Injection Artifact 2016. S. Jha, V. Formicola, A. Bonnie,
M. Mason, D. Chen, F. Deng, A. Gentile, J. Brandt, L. Kaplan, J. Repik,
J. Enos, M. Showerman, A. Greiner, Z. Kalbarczyk, R. Iyer, and
B. Kramer. LA-UR-19-22749, SAND2019-3531 O, Mar 2019. [Online]:
https://portal.nersc.gov/project/m888/resilience/datasets/cielo/CieloFaultInjectionArtifact2016.tgz

The data consists of:

0) A README explaining the directory structure and providing citation
   information.

RAW DATA:

1) InjectionSummary - A summary document of the injections and affected
   runs.
2) InjectorLogs - Output of the injection tool, with timestamped
   injection and recovery commands. Details of injection codes have
   been redacted where directed by the vendor.

3) RunOutput - Output of the application runs (a publicly available
   open-source external benchmark (IMB)), which includes effects on the
   runs due to the injections, such as throttle events.

4) SystemLogs - A selection of edited system logs providing information
   about the injections and recoveries.

5) ArchitecturalInformation - Network architectural information, which
   is generic for a system of this make and size.

6) MonitoringData - Selected network counter information from which
   effects of the injections and recovery processes can be extracted.

ANALYSIS SUPPORT:

7) LogDiver_Output - Analysis README and artifact files for the fault
   and recovery events, error counts, and timelines of the recoveries.
   These are based entirely on the log data, using domain knowledge of
   log lines of interest.

Examples/Additional_Description:
===================================================

1) InjectionSummary

Compilation of JobId, Start Time, End Time, Nodelist, RunOutput File,
and a brief summary of the related injection, ordered by time.

2) InjectorLogs

This is output from the injection capability:

Fault Injection Script Options:
 0: Exit.
01: Link Down - manually bring down a minor link given its cname.
02: Link Up - manually bring up a minor link given its cname.
03: Node Down - manually bring down a node given its cname.
04: Mezzanine Up - manually bring up a mezzanine given its cname.
05: Mezzanine Down - manually bring down a mezzanine given its cname.
06: Restore components in the restore.txt
 1: FB1 - Random Single Link Down - Small App
 2: FB2 - Random Single Connection Down - Small App
 3: FB3 - Random Single Blade Down - Small App
 4: FB4 - Random Single Compute Node Down - Small App
 7: FB1 - Random Single Link Down - Medium App
 8: FB2 - Random Single Connection Down - Medium App
 9: FB3 - Random Single Blade Down - Medium App
10: FB4 - Random Single Compute Node Down - Medium App
13: FB1 - Random Single Link Down - Large App
14: FB2 - Random Single Connection Down - Large App
15: FB3 - Random Single Blade Down - Large App
16: FB4 - Random Single Compute Node Down - Large App
17: FM1 - Fail two connections on non-overlapping dimension - Small App
18: FM2 - Fail two blades on non-overlapping dimension - Small App
19: FM1 - Fail two connections on non-overlapping dimension - Medium App
20: FM2 - Fail two blades on non-overlapping dimension - Medium App
21: FM1 - Fail two connections on non-overlapping dimension - Large App
22: FM2 - Fail two blades on non-overlapping dimension - Large App
23: FC1 - Fail one link, then another link during recovery - Medium App
24: FC2 - Fail one connection, then another connection during recovery - Medium App
25: FC3 - Fail two blades on overlapping dimension - Medium App

Selection? 1
Run FB1 - Random Single link Down for small app now? [y/n] y
Tue Sep 6 11:42:10 MDT 2016 *** Running FB2 - Random Single link Down for small app ***
Proceed with execution of: <> c15-4c1s0g0 [y/n]y
Tue Sep 6 11:42:29 MDT 2016 Executed link_down c15-4c1s0g0l56: <> c15-4c1s0g0
Received all responses.
done with event loop
End Experiment and attempt to restore now? [y/n]y
Select Annotation: 1. App completion or dies, 2. fall-out complete, 3. others (type directly)? 2
Tue Sep 6 11:47:28 MDT 2016 End of Experiment - Attempt to restore system: Type 2.
Proceed with execution of: su - crayadm -c 'xtwarmswap -s c15-4c1s0g0l56,c15-4c1s0g1l51 -p p0'?
[y/n]y
Tue Sep 6 11:47:42 MDT 2016 Executed link_up() c15-4c1s0g0l56,c15-4c1s0g1l51 : su - crayadm -c 'xtwarmswap -s c15-4c1s0g0l56,c15-4c1s0g1l51 -p p0'
Adding blades:
Removing blades:
Sending command to xtnlrd
11:47:43(T+00:00) Warm swap beginning
11:47:43(T+00:00) Clearing alert flags on LCBs being brought back up
11:47:43(T+00:00) Testing for routeability
11:47:54(T+00:11) Test reroute proceeding...
11:47:56(T+00:13) Finished running test reroute: configuration is OK
11:47:56(T+00:13) Sending 'alive' message to blades now
11:47:58(T+00:14) Alive message complete
11:47:58(T+00:14) Computing new routes
11:47:59(T+00:16) Route computation proceeding...
11:48:09(T+00:25) Route computation proceeding...
11:48:12(T+00:28) Finished computing new routes
11:48:12(T+00:28) Calling xtbounce to initialize new links
11:48:13(T+00:30) Link initialization proceeding...
11:48:23(T+00:40) Link initialization proceeding...
11:48:33(T+00:50) Link initialization proceeding...
11:48:41(T+00:58) Finished initializing new links
11:48:41(T+00:58) Quiescing the high-speed network
11:48:42(T+00:59) Finished quiescing the high-speed network
11:48:42(T+00:59) Waiting for the high-speed network traffic to drain
11:48:45(T+01:02) High-speed network drained of traffic
11:48:45(T+01:02) Switching link monitoring to use new routes
11:48:47(T+01:04) Finished switching link monitoring to use new routes
11:48:47(T+01:04) Calling xtbounce to turn off links that will be unused
11:48:48(T+01:05) Link down proceeding...
11:48:58(T+01:15) Link down proceeding...
11:49:06(T+01:23) Links that will be unused are now turned off
11:49:06(T+01:23) Installing new routes
11:49:07(T+01:24) Route installation proceeding...
11:49:09(T+01:26) Finished installing new routes
11:49:09(T+01:26) Unquiescing the high-speed network
11:49:11(T+01:28) Finished unquiescing the high-speed network
11:49:11(T+01:28) Cleaning up
Successfully completed warm swap operation
SUCCESS: Warm swap command completed successfully.
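The (T+MM:SS) offsets in the warm-swap output above give per-phase recovery timing. A minimal sketch (not part of the dataset tooling) for extracting elapsed seconds and phase messages from such lines:

```python
import re

# Matches warm-swap timeline lines like
#   "11:48:41(T+00:58) Finished initializing new links"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\(T\+(\d{2}):(\d{2})\)\s+(.*)$")

def parse_timeline(lines):
    """Return (elapsed_seconds, message) for each matching timeline line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            elapsed = int(m.group(2)) * 60 + int(m.group(3))
            events.append((elapsed, m.group(4)))
    return events

sample = [
    "Sending command to xtnlrd",
    "11:47:43(T+00:00) Warm swap beginning",
    "11:48:41(T+00:58) Finished initializing new links",
    "11:49:11(T+01:28) Cleaning up",
]
print(parse_timeline(sample)[-1])  # (88, 'Cleaning up')
```

Differencing successive offsets then gives the duration of each recovery phase.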
*** Please restore services for the following components: c15-4c1s0g0l56,c15-4c1s0g1l51
Press enter to continue...

3) RunOutput

benchmarks to run Alltoall
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.0 Update 1, MPI-1 part
#------------------------------------------------------------
# Date                   : Fri Sep 2 12:09:52 2016
# Machine                : x86_64
# System                 : Linux
# Release                : 2.6.32.59-0.7.1_1.0402.7496-cray_gem_c
# Version                : #1 SMP Mon Dec 22 19:37:52 UTC 2014
# MPI Version            : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time

# Calling sequence was:
# /users/XXX/cieloTesting/IMB/imb/src/IMB-MPI1 -npmin 8192 -iter 30 -time 3600
# -iter_policy off -msglen /users/XXX/cieloTesting/IMB/512_large/mlfile Alltoall

# Message lengths were user defined
#
# MPI_Datatype                   : MPI_BYTE
# MPI_Datatype for reductions    : MPI_FLOAT
# MPI_Op                         : MPI_SUM
#
# List of Benchmarks to run:
# Alltoall

#----------------------------------------------------------------
# Benchmarking Alltoall
# #processes = 8192
#----------------------------------------------------------------
 #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
  65536           30  21783196.97  21784433.73  21783868.12

# All processes entering MPI_Finalize

Application 10767913 network throttled: 512 nodes throttled, 80:21:16 node-seconds
Application 10767913 network quiesced: 151 nodes quiesced, 01:08:55 node-seconds
Application 10767913 balanced injection 100, after throttle 63
Application 10767913 resources: utime ~13935119s, stime ~1528977s, Rss ~1078872, inblocks ~8824721, outblocks ~1011110154

4) System Logs

a) Consumer Logs:

Mon Sep 5 00:05:43 2016 - rs_event_t at 0x7fddf0000c60
ev_id = 0x080040ed (ec_l1_failed)
ev_src = ::c1-0
ev_gen = ::c0-0c0s0n0
ev_flag = 0x00000002
ev_priority = 0
ev_len = 32
ev_seqnum = 0x00000000
ev_stp = 57cd0a49.0003491d [Mon Sep 5 00:01:45 2016]
svcid 0: ::c1-0 = svid_inst=0x0/svid_type=0x0/svid_node=c1-0[rsn_node=0x2000/rsn_type=0x3/rsn_state=0x0], err code 65542 - Cabinet MicroController Communications Fault
ev_data...
00000000: 01 00 00 00 00 00 00 00 00 00 00 00 0c 00 00 20  *............... *
00000010: 04 00 08 00 00 00 00 00 01 00 00 00 06 00 01 00  *................*

Mon Sep 5 00:43:21 2016 - rs_event_t at 0x7fddf00026a0
ev_id = 0x080040ed (ec_l1_failed)
ev_src = ::c0-4
ev_gen = ::c0-0c0s0n0
ev_flag = 0x00000002
ev_priority = 0
ev_len = 543
ev_seqnum = 0x00000000
ev_stp = 13854959.000b67cc [Sun May 18 06:49:29 1980]
svcid 0: ::c0-4 = svid_inst=0x0/svid_type=0x0/svid_node=c0-4[rsn_node=0x200/rsn_type=0x3/rsn_state=0x0], err code 131400 - Cabinet Power Controller Communication fault
svcid 1: ::c0-4c0s0 = svid_inst=0x0/svid_type=0x0/svid_node=c0-4c0s0[rsn_node=0x200/rsn_type=0x1/rsn_state=0x7], err code 328 - Cabinet Power Controller Communication fault
svcid 2: ::c0-4c0s1 = svid_inst=0x0/svid_type=0x0/svid_node=c0-4c0s1[rsn_node=0x200/rsn_type=0x1/rsn_state=0x7], err code 328 - Cabinet Power Controller Communication fault
svcid 3: ::c0-4c0s2 = svid_inst=0x0/svid_type=0x0/svid_node=c0-4c0s2[rsn_node=0x200/rsn_type=0x1/rsn_state=0x7], err code 328 - Cabinet Power Controller Communication fault
svcid 4: ::c0-4c0s3 = svid_inst=0x0/svid_type=0x0/svid_node=c0-4c0s3[rsn_node=0x200/rsn_type=0x1/rsn_state=0x7], err code 328 - Cabinet Power Controller Communication fault
svcid 5: ::c0-4c0s4 = svid_inst=0x0/svid_type=0x0/svid_node=c0-4c0s4[rsn_node=0x200/rsn_type=0x1/rsn_state=0x7], err code 328 - Cabinet Power Controller Communication fault
svcid 6: ::c0-4c0s5 = svid_inst=0x0/svid_type=0x0/svid_node=c0-4c0s5[rsn_node=0x200/rsn_type=0x1/rsn_state=0x7], err code 328 - Cabinet Power Controller Communication fault

b) Netwatch Logs:

160905 00:00:32 c7-0c2s3g1l45 c7-2c0s3g0l50 1 Mode Exchanges
160905 00:00:41 c7-2c0s3g0l50 c7-0c2s3g1l45 1 Mode Exchanges
160905 03:40:36 c4-2c2s3g0l13 c2-2c2s3g0l47 1 Mode Exchanges
160905 03:40:53 c2-2c2s3g0l47 c4-2c2s3g0l13 1 Mode Exchanges
160905 04:51:34 c13-3c0s6g0l00 c13-3c0s7g0l32 1 Mode Exchanges
160905 04:51:34 c13-3c0s6g0l01 c13-3c0s7g0l21 1 Mode Exchanges
160905 04:51:34 c13-3c0s6g0l10 c13-3c0s7g0l20 1 Mode Exchanges
160905 04:51:34 c13-3c0s6g0l11 c13-3c0s7g0l22 1 Mode Exchanges
160905 04:51:40 c13-3c0s7g0l20 c13-3c0s6g0l10 1 Mode Exchanges
160905 04:51:40 c13-3c0s7g0l21 c13-3c0s6g0l01 1 Mode Exchanges
160905 04:51:40 c13-3c0s7g0l22 c13-3c0s6g0l11 1 Mode Exchanges
160905 04:51:40 c13-3c0s7g0l32 c13-3c0s6g0l00 1 Mode Exchanges
160905 07:36:38 c4-2c2s3g0l13 c2-2c2s3g0l47 1 Mode Exchanges
160905 07:36:57 c2-2c2s3g0l47 c4-2c2s3g0l13 1 Mode Exchanges
160905 10:42:41 c4-2c2s3g0l13 c2-2c2s3g0l47 1 Mode Exchanges
160905 10:43:01 c2-2c2s3g0l47 c4-2c2s3g0l13 1 Mode Exchanges
160905 11:01:38 c13-3c0s6g0l00 c13-3c0s7g0l32 1 Mode Exchanges
160905 11:01:38 c13-3c0s6g0l01 c13-3c0s7g0l21 1 Mode Exchanges
160905 11:01:38 c13-3c0s6g0l10 c13-3c0s7g0l20 1 Mode Exchanges
160905 11:01:45 c13-3c0s7g0l20 c13-3c0s6g0l10 1 Mode Exchanges

c) Event Logs:

2016-09-02T00:00:00.266274-06:00 c0-0c0s2 7575 2016-09-01 23:55:18|ec_hw_error|src:::c0-0c0s2|pri:0x0|seqnum:0x0|svc:::c0-0c0s2n2|Comp=c0-0c0s2n2,Code=0xb11,Cat=4,PTAG=0
2016-09-02T00:00:00.266322-06:00 c0-0c0s2 7575 2016-09-01 23:55:18|ec_hw_error|src:::c0-0c0s2|pri:0x0|seqnum:0x0|svc:::c0-0c0s2n2|Comp=c0-0c0s2n2,Code=0xb11,Cat=4,PTAG=0
2016-09-02T00:00:00.266344-06:00 c0-0c0s2 7575 2016-09-01 23:55:18|ec_hw_error|src:::c0-0c0s2|pri:0x0|seqnum:0x0|svc:::c0-0c0s2n2|Comp=c0-0c0s2n2,Code=0xb11,Cat=4,PTAG=0
2016-09-02T00:00:00.767087-06:00 s0 7974 2016-09-02 00:00:00|ec_smw_resiliency_hb|src::19:s0|pri:0x1|seqnum:0x58f0b3f0|svc:::c0-0c0s0n0
2016-09-02T00:00:01.025023-06:00 c4-5c1s3 7575 1981-12-04 11:24:57|ec_hw_error|src:::c4-5c1s3|pri:0x0|seqnum:0x0|svc:::c4-5c1s3|Comp=c4-5c1s3g1,Code=0xd09,Cat=32,PTAG=0
2016-09-02T00:00:01.370325-06:00 c1-5c0s4 7575 1982-06-24 01:03:20|ec_hw_error|src:::c1-5c0s4|pri:0x0|seqnum:0x0|svc:::c1-5c0s4|Comp=c1-5c0s4g1l45,Code=0x3138,Cat=32,PTAG=0|Comp=c1-5c0s4g1l45,Code=0x313a,Cat=32,PTAG=0|Comp=c1-5c0s4g1l45,Code=0x313b,Cat=32,PTAG=0|Comp=c1-5c0s4g1l45,Code=0x313d,Cat=32,PTAG=0
2016-09-02T00:00:03.767774-06:00 s0 7974 2016-09-02 00:00:03|ec_smw_resiliency_hb|src::19:s0|pri:0x1|seqnum:0x58f0b690|svc:::c0-0c0s0n0
2016-09-02T00:00:03.840802-06:00 c1-5c0s3 7575 1982-06-24 01:03:23|ec_hw_error|src:::c1-5c0s3|pri:0x0|seqnum:0x0|svc:::c1-5c0s3|Comp=c1-5c0s3g1,Code=0xd09,Cat=32,PTAG=0
2016-09-02T00:00:04.742046-06:00 c6-2c2s7 7575 2016-09-01 23:56:23|ec_hw_error|src:::c6-2c2s7|pri:0x0|seqnum:0x0|svc:::c6-2c2s7|Comp=c6-2c2s7g0l50,Code=0x3138,Cat=32,PTAG=0|Comp=c6-2c2s7g0l50,Code=0x313a,Cat=32,PTAG=0|Comp=c6-2c2s7g0l50,Code=0x313b,Cat=32,PTAG=0

d) nlrd Log:

2016-09-02T11:41:58.128156-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 cb_hw_error: failed_component c2-0c0s1g0l13, type 23, error_code 0x1207, error_category 0x0002
2016-09-02T11:41:58.128211-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 cb_hw_error: MMR[0]=0x000000000000000e
2016-09-02T11:41:58.128220-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 cb_hw_error: MMR[1]=0x3d88000000077249
2016-09-02T11:41:58.128227-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 cb_hw_error: MMR[2]=0x0de48001dddc004d
2016-09-02T11:41:58.128234-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 cb_hw_error: MMR[3]=0x0000000002900000
2016-09-02T11:41:58.128239-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 cb_hw_error: handling failed link c2-0c0s1g0l13
2016-09-02T11:41:58.128246-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 add_link_to_list: adding link c2-0c0s1g0l13 to linkfailed list
2016-09-02T11:41:58.128255-06:00 ci-smw 22766 2016-09-02 11:41:57 ci-smw 22768 ***** dispatch: current_state aggregate_failures *****
2016-09-02T11:41:59.424677-06:00 ci-smw 22766 2016-09-02 11:41:58 ci-smw 22768 cb_hw_error: failed_component c0-0c0s1g0l47, type 23, error_code 0x1207, error_category 0x0002
...
2016-09-02T15:17:01.961263-06:00 ci-smw 22766 2016-09-02 15:04:21 ci-smw 22768 WARNING: blade c0-0c2s3 was auto-throttled for 7 seconds
2016-09-02T15:17:01.961286-06:00 ci-smw 22766 2016-09-02 15:04:21 ci-smw 22768 WARNING: blade c0-0c2s6 was auto-throttled for 7 seconds
2016-09-02T15:17:01.961309-06:00 ci-smw 22766 2016-09-02 15:04:21 ci-smw 22768 WARNING: blade c0-0c1s0 was auto-throttled for 8 seconds
2016-09-02T15:17:01.961339-06:00 ci-smw 22766 2016-09-02 15:04:21 ci-smw 22768 WARNING: blade c0-0c2s1 was auto-throttled for 7 seconds
...
2016-09-02T12:06:05.845280-06:00 ci-smw 22766 2016-09-02 12:04:59 ci-smw 22768 send_warm_swap_response: sending response vers 258 cmd 2 len 102 uniq_id 11686 code 0 errno \
0 err_len 53 err_string 12:04:59(T+00:57) Finished initializing new links
2016-09-02T12:06:05.845289-06:00 ci-smw 22766 2016-09-02 12:04:59 ci-smw 22768 send_warm_swap_response: sending response vers 258 cmd 2 len 99 uniq_id 11686 code 0 errno 0\
err_len 50 err_string 12:04:59(T+00:57) Quiescing the high-speed network
2016-09-02T12:06:05.845297-06:00 ci-smw 22766 2016-09-02 12:04:59 ci-smw 22768 INFO: 2301 out of 2304 L0s are alive
2016-09-02T12:06:05.845309-06:00 ci-smw 22766 2016-09-02 12:04:59 ci-smw 22768 Stopping unthrottle timer: unthrottle_tag 0
2016-09-02T12:06:05.845318-06:00 ci-smw 22766 2016-09-02 12:04:59 ci-smw 22768 Beginning to wait for response(s)
2016-09-02T12:06:05.845328-06:00 ci-smw 22766 2016-09-02 12:04:59 ci-smw 22768 Received 31 of 2301 responses
2016-09-02T12:06:05.845337-06:00 ci-smw 22766 2016-09-02 12:05:00 ci-smw 22768 Received 846 of 2301 responses
2016-09-02T12:06:05.845350-06:00 ci-smw 22766 2016-09-02 12:05:00 ci-smw 22768 Received 1847 of 2301 responses
2016-09-02T12:06:05.845360-06:00 ci-smw 22766 2016-09-02 12:05:00 ci-smw 22768 Received 2301 of 2301 responses
2016-09-02T12:06:05.845369-06:00 ci-smw 22766 2016-09-02 12:05:00 ci-smw 22768 ***** dispatch: current_state check_quiesce *****
2016-09-02T12:06:05.845378-06:00 ci-smw 22766 2016-09-02 12:05:00 ci-smw 22768 ***** dispatch: current_state quiesce_drain *****
2016-09-02T12:06:05.845387-06:00 ci-smw 22766 2016-09-02 12:05:00 ci-smw 22768 send_warm_swap_response: sending response vers 258 cmd 2 len 112 uniq_id 11686 code 0 errno \
0 err_len 63 err_string 12:05:00(T+00:58) Finished quiescing the high-speed network
2016-09-02T12:06:05.845397-06:00 ci-smw 22766 2016-09-02 12:05:00 ci-smw 22768 send_warm_swap_response: sending response vers 258 cmd 2 len 118 uniq_id 11686 code 0 errno \
0 err_len 69 err_string 12:05:00(T+00:58) Waiting for the high-speed network traffic to drain

e) alps (scheduler) Logs:

2016-09-02T12:10:19.585617-06:00 ci-smw 22766 2016-09-02 12:10:18 ci-smw 22762 cb_alps_app_status: subtype 1 apid 10767913 userid 00000 apname IMB-MPI1 numnids 512

f) Hardware Error Log:

2016-09-02 00:11:29 | HWERR[c14-1c2s7g0l50][263]:0x3138:Sender Packet Timeout
2016-09-02 00:11:29 | HWERR[c14-1c2s7g0l50][264]:0x313a:Receiver EOP Bad
2016-09-02 00:11:29 | HWERR[c14-1c2s7g0l50][265]:0x313b:Receiver CC1 Bad
2016-09-02 00:11:30 | HWERR[c0-0c0s0n3][84311]:0x0b11:SSID Detected Misrouted Packet:Info1=0x801b261000014081:Info2=0x0:Info3=0x5a4e0
2016-09-02 00:11:30 | HWERR[c5-3c0s6n2][28580]:0x0d04:NIF Squashed Request Packet:Info=0xf0010

5) ArchitecturalInformation

Note that the architectural information is generic for any Cray XE of
this node count.
a) router interconnect information:

c0-0c0s0g0l00[(0,0,0)] Z+ -> c0-0c0s1g0l32[(0,0,1)] LinkType: backplane
c0-0c0s0g0l01[(0,0,0)] Z+ -> c0-0c0s1g0l21[(0,0,1)] LinkType: backplane
c0-0c0s0g0l02[(0,0,0)] X+ -> c1-0c0s0g0l02[(1,0,0)] LinkType: cable11x

b) Map of nodes to routers:

 NID NIC-Addr Node         Gemini        X  Y  Z
---- -------- ------------ ------------ -- -- --
   0        0 c0-0c0s0n0   c0-0c0s0g0    0  0  0
   1        1 c0-0c0s0n1   c0-0c0s0g0    0  0  0
   2        4 c0-0c0s1n0   c0-0c0s1g0    0  0  1
   3        5 c0-0c0s1n1   c0-0c0s1g0    0  0  1

c) List of service (as opposed to compute) nodes (1st column is the
   node id, 3rd column is the node name):

0 0x0 c0-0c0s0n0 service xt (service) 6 6 16384 4096 2200 0 1 1 1
1 0x1 c0-0c0s0n1 service xt (service) 6 6 16384 4096 2200 0 1 1 1
4 0x4 c0-0c0s2n0 service xt (service) 6 6 16384 4096 2200 0 1 1 1

6) MonitoringData

Time-stamped component data in CSV format of the values identified
below (network traffic and stall, I/O, and load information). The
collection rate is 1 Hz. component_id and ProducerName are the node
numbers. job_id is the job number.
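These files can be read with the Python standard library. A minimal sketch, assuming ordinary comma-separated values with the header listed below; the sample row here is synthetic, not taken from the dataset:

```python
import csv
import io

def read_counters(fileobj, columns):
    """Yield (timestamp, {column: value}) for the selected counter columns."""
    reader = csv.DictReader(fileobj)
    for row in reader:
        yield row["#Time"], {c: float(row[c]) for c in columns}

# Synthetic sample with a few of the header's columns; real files carry
# the full header shown below.
sample = io.StringIO(
    "#Time,Time_usec,component_id,Z+_traffic (B)\n"
    "1472844118,123,42,1048576\n"
)
for ts, vals in read_counters(sample, ["Z+_traffic (B)"]):
    print(ts, vals)  # 1472844118 {'Z+_traffic (B)': 1048576.0}
```

Note that the column names include spaces and unit suffixes (e.g., "Z+_traffic (B)"), so they must be quoted exactly as they appear in the header.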
HEADER: #Time, Time_usec, ProducerName,component_id,job_id,nettopo_mesh_coord_X,nettopo_mesh_coord_Y,nettopo_mesh_coord_Z,X+_traffic (B),X-_traffic (B),Y+_traffic (B),Y-_traffic (B),Z+_traffic (B),Z-_traffic (B),X+_packets (1),X-_packets (1),Y+_packets (1),Y-_packets (1),Z+_packets (1),Z-_packets (1),X+_inq_stall (ns),X-_inq_stall (ns),Y+_inq_stall (ns),Y-_inq_stall (ns),Z+_inq_stall (ns),Z-_inq_stall (ns),X+_credit_stall (ns),X-_credit_stall (ns),Y+_credit_stall (ns),Y-_credit_stall (ns),Z+_credit_stall (ns),Z-_credit_stall (ns),X+_sendlinkstatus (1),X-_sendlinkstatus (1),Y+_sendlinkstatus (1),Y-_sendlinkstatus (1),Z+_sendlinkstatus (1),Z-_sendlinkstatus (1),X+_recvlinkstatus (1),X-_recvlinkstatus (1),Y+_recvlinkstatus (1),Y-_recvlinkstatus (1),Z+_recvlinkstatus (1),Z-_recvlinkstatus (1),X+_SAMPLE_GEMINI_LINK_BW (B/s),X-_SAMPLE_GEMINI_LINK_BW (B/s),Y+_SAMPLE_GEMINI_LINK_BW (B/s),Y-_SAMPLE_GEMINI_LINK_BW (B/s),Z+_SAMPLE_GEMINI_LINK_BW (B/s),Z-_SAMPLE_GEMINI_LINK_BW (B/s),X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6),X-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6),Y+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6),Y-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6),Z+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6),Z-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6),X+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B),X-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B),Y+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B),Y-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B),Z+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B),Z-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B),X+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6),X-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6),Y+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6),Y-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6),Z+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6),Z-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6),X+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6),X-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6),Y+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6),Y-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6),Z+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6),Z-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% 
x1e6),totaloutput_optA,totalinput,fmaout,bteout_optA,bteout_optB,totaloutput_optB,SAMPLE_totaloutput_optA (B/s),SAMPLE_totalinput (B/s),SAMPLE_fmaout (B/s),SAMPLE_bteout_optA (B/s),SAMPLE_bteout_optB (B/s),SAMPLE_totaloutput_optB (B/s),client.lstats.dirty_pages_hits#llite.scratch4,client.lstats.dirty_pages_misses#llite.scratch4,client.lstats.writeback_from_writepage#llite.scratch4,client.lstats.writeback_from_pressure#llite.scratch4,client.lstats.writeback_ok_pages#llite.scratch4,client.lstats.writeback_failed_pages#llite.scratch4,client.lstats.read_bytes#llite.scratch4,client.lstats.write_bytes#llite.scratch4,client.lstats.brw_read#llite.scratch4,client.lstats.brw_write#llite.scratch4,client.lstats.ioctl#llite.scratch4,client.lstats.open#llite.scratch4,client.lstats.close#llite.scratch4,client.lstats.mmap#llite.scratch4,client.lstats.seek#llite.scratch4,client.lstats.fsync#llite.scratch4,client.lstats.setattr#llite.scratch4,client.lstats.truncate#llite.scratch4,client.lstats.lockless_truncate#llite.scratch4,client.lstats.flock#llite.scratch4,client.lstats.getattr#llite.scratch4,client.lstats.statfs#llite.scratch4,client.lstats.alloc_inode#llite.scratch4,client.lstats.setxattr#llite.scratch4,client.lstats.getxattr#llite.scratch4,client.lstats.listxattr#llite.scratch4,client.lstats.removexattr#llite.scratch4,client.lstats.inode_permission#llite.scratch4,client.lstats.direct_read#llite.scratch4,client.lstats.direct_write#llite.scratch4,client.lstats.lockless_read_bytes#llite.scratch4,client.lstats.lockless_write_bytes#llite.scratch4,client.lstats.dirty_pages_hits#llite.scratch3,client.lstats.dirty_pages_misses#llite.scratch3,client.lstats.writeback_from_writepage#llite.scratch3,client.lstats.writeback_from_pressure#llite.scratch3,client.lstats.writeback_ok_pages#llite.scratch3,client.lstats.writeback_failed_pages#llite.scratch3,client.lstats.read_bytes#llite.scratch3,client.lstats.write_bytes#llite.scratch3,client.lstats.brw_read#llite.scratch3,client.lstats.brw_write
#llite.scratch3,client.lstats.ioctl#llite.scratch3,client.lstats.open#llite.scratch3,client.lstats.close#llite.scratch3,client.lstats.mmap#llite.scratch3,client.lstats.seek#llite.scratch3,client.lstats.fsync#llite.scratch3,client.lstats.setattr#llite.scratch3,client.lstats.truncate#llite.scratch3,client.lstats.lockless_truncate#llite.scratch3,client.lstats.flock#llite.scratch3,client.lstats.getattr#llite.scratch3,client.lstats.statfs#llite.scratch3,client.lstats.alloc_inode#llite.scratch3,client.lstats.setxattr#llite.scratch3,client.lstats.getxattr#llite.scratch3,client.lstats.listxattr#llite.scratch3,client.lstats.removexattr#llite.scratch3,client.lstats.inode_permission#llite.scratch3,client.lstats.direct_read#llite.scratch3,client.lstats.direct_write#llite.scratch3,client.lstats.lockless_read_bytes#llite.scratch3,client.lstats.lockless_write_bytes#llite.scratch3,client.lstats.dirty_pages_hits#llite.scratch2,client.lstats.dirty_pages_misses#llite.scratch2,client.lstats.writeback_from_writepage#llite.scratch2,client.lstats.writeback_from_pressure#llite.scratch2,client.lstats.writeback_ok_pages#llite.scratch2,client.lstats.writeback_failed_pages#llite.scratch2,client.lstats.read_bytes#llite.scratch2,client.lstats.write_bytes#llite.scratch2,client.lstats.brw_read#llite.scratch2,client.lstats.brw_write#llite.scratch2,client.lstats.ioctl#llite.scratch2,client.lstats.open#llite.scratch2,client.lstats.close#llite.scratch2,client.lstats.mmap#llite.scratch2,client.lstats.seek#llite.scratch2,client.lstats.fsync#llite.scratch2,client.lstats.setattr#llite.scratch2,client.lstats.truncate#llite.scratch2,client.lstats.lockless_truncate#llite.scratch2,client.lstats.flock#llite.scratch2,client.lstats.getattr#llite.scratch2,client.lstats.statfs#llite.scratch2,client.lstats.alloc_inode#llite.scratch2,client.lstats.setxattr#llite.scratch2,client.lstats.getxattr#llite.scratch2,client.lstats.listxattr#llite.scratch2,client.lstats.removexattr#llite.scratch2,client.lstats.inode_permissio
n#llite.scratch2,client.lstats.direct_read#llite.scratch2,client.lstats.direct_write#llite.scratch2,client.lstats.lockless_read_bytes#llite.scratch2,client.lstats.lockless_write_bytes#llite.scratch2,nr_dirty,nr_writeback,loadavg_latest(x100),loadavg_5min(x100),loadavg_running_processes,loadavg_total_processes,current_freemem,SMSG_ntx,SMSG_tx_bytes,SMSG_nrx,SMSG_rx_bytes,RDMA_ntx,RDMA_tx_bytes,RDMA_nrx,RDMA_rx_bytes,ipogif0_rx_bytes,ipogif0_tx_bytes

7) LogDiver_Output

These are analysis-support artifact files output by the LogDiver tool:

a) filtered_logs -- a subset of the raw log data (item 4), consisting
of only the log lines for predetermined events of interest (e.g., steps
in fault injection and recovery), divided into time ranges
corresponding to each fault injection test. The format is the NumberId
of the event of interest, followed by the UTC timestamp and the raw log
line:

2093 1472815521 2016-09-02 11:25:21 | hwerr[c4-5c2s3n0][11341]:0x0d04:nif squashed request packet:info=0x9c0010
2093 1472815521 2016-09-02 11:25:21 | hwerr[c4-5c2s3n0][11342]:0x0d04:nif squashed request packet:info=0x660010
2093 1472815521 2016-09-02 11:25:21 | hwerr[c4-5c2s3n2][14070]:0x0d04:nif squashed request packet:info=0x430010
2093 1472815521 2016-09-02 11:25:21 | hwerr[c4-5c2s3n3][7672]:0x0d04:nif squashed request packet:info=0x640010
13 1472815522 2016-09-02 11:25:22 ci-smw 22768 cb_hw_error: handling failed link c2-0c0s1g0l31
164 1472815522 2016-09-02 11:25:22 ci-smw 22768 add_link_to_list: adding link c2-0c0s1g0l31 to linkfailed list

b) event_counts - tallies of each event of interest for each
experiment.
The format is the NumberId of the event of interest, followed by a
descriptive identifier and the number of occurrences:

2089,HWERR_NIF_BAD_REQ_PACKET,37
2090,HWERR_NIF_BAD_RESPONSE_PACKET_HSN_BUG_1,66
2091,HWERR_FLOW_CTL_ORB_TO_NL_HSN_BUG_2,0
2092,HWERR_SSID,0
2093,HWERR_NIF_SQUASHED_REQ,2306
2094,illegal_r_flag,0

c) network_recovery_reports - XML representation of the sequence of
events in an experiment, using LogDiver to cluster the events:

warm_swap_started warm_swap_success 2 FALSE

d) FaultInjectionTable.xlsx - All of the data above, summarized in a
spreadsheet of event occurrences.
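The event_counts files (item 7b) are plain CSV; a minimal sketch for loading them into a dict (the function name is illustrative, not part of LogDiver):

```python
import csv
import io

def load_event_counts(fileobj):
    """Map descriptive identifier -> (NumberId, occurrence count)."""
    out = {}
    for num_id, name, count in csv.reader(fileobj):
        out[name] = (int(num_id), int(count))
    return out

# Synthetic sample in the event_counts format shown above.
sample = io.StringIO(
    "2089,HWERR_NIF_BAD_REQ_PACKET,37\n"
    "2093,HWERR_NIF_SQUASHED_REQ,2306\n"
)
counts = load_event_counts(sample)
print(counts["HWERR_NIF_SQUASHED_REQ"])  # (2093, 2306)
```

The same counts can then be compared across experiments to see which injected faults drove particular error types.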