The Performance Monitoring Archive (PMA) for IO at NERSC

Franklin (historical values)

Hopper

Grace

This web portal provides information on monitoring of the Lustre filesystems at NERSC. Our goals are to provide:

Most of the data gathered here is from the Lustre Monitoring Tool (LMT) instrumentation of the Lustre servers. The audience for this portal includes NERSC staff, IO researchers, and users with large or challenging I/O requirements.

Background: LMT monitoring currently occurs on the Hopper HPC system and its testbed platform Grace. Hopper has two parallel file systems, /scratch and /scratch2. Each file system has 26 Object Storage Servers (OSSs), and each OSS has 6 Object Storage Targets (OSTs). An OST manages the data traffic to one disk resource (or LUN), which is generally a high-performance RAID-based device. In addition to the OSSs, which are responsible for bulk I/O transfers to storage, there is a Metadata Server (MDS) for each file system. The MDS manages the file system namespace and the mapping of files in the namespace to the locations of the data on the OSTs. Each server reports performance data at five-second intervals, sending the observations to a database on a DB server external to the HPC system. This web interface provides access to and analysis of that data. From NERSC systems there is also a mechanism for accessing and graphing the data from the command line.
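To give a feel for the scale implied by these figures, the arithmetic can be sketched in a few lines of Python. The server and target counts and the five-second interval come from the description above; the assumption that each server contributes exactly one report per interval is ours, made only to estimate volume, and says nothing about how LMT actually lays out its database.

```python
# Figures taken from the description above.
FILE_SYSTEMS = ("/scratch", "/scratch2")
OSS_PER_FS = 26          # Object Storage Servers per file system
OSTS_PER_OSS = 6         # Object Storage Targets per OSS
MDS_PER_FS = 1           # one Metadata Server per file system
REPORT_INTERVAL_S = 5    # seconds between reports

SECONDS_PER_DAY = 24 * 60 * 60

osts_per_fs = OSS_PER_FS * OSTS_PER_OSS
total_osts = osts_per_fs * len(FILE_SYSTEMS)

servers_per_fs = OSS_PER_FS + MDS_PER_FS
total_servers = servers_per_fs * len(FILE_SYSTEMS)

intervals_per_day = SECONDS_PER_DAY // REPORT_INTERVAL_S

# Assumption (ours): one report per server per interval.
reports_per_day = total_servers * intervals_per_day

print(f"OSTs per file system: {osts_per_fs}")       # 156
print(f"OSTs on Hopper:       {total_osts}")        # 312
print(f"Reporting servers:    {total_servers}")     # 54
print(f"Intervals per day:    {intervals_per_day}") # 17280
print(f"Reports per day:      {reports_per_day}")   # 933120
```

Under that assumption, Hopper alone would generate just under a million server reports per day, which is one reason the observations go to a database outside the HPC system.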

Benchmarking: IO is hard to measure. There are many layers from the application down to the disk spindle. Each layer may see the IO happen in a different way, with different resource constraints and performance impacts. It is therefore important that we have a clear understanding of our measurements and that we are able to corroborate performance measurements from multiple sources. IO measurements can also be hard to reproduce, given the shared nature of the filesystem resource.
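As an illustration of what corroborating two sources might look like, the sketch below compares a rate reported by a client-side benchmark with a rate derived from server-side samples. Everything here is hypothetical: the function names, the dictionary layout, the use of cumulative byte counters, and the 10% tolerance are assumptions chosen for the example, not the LMT schema or the method used in the calibration studies.

```python
def server_side_rate(counters, interval_s=5.0):
    """Mean aggregate rate in bytes/s from per-target cumulative byte counters.

    counters: dict mapping a target name to a list of cumulative byte
    counts, one entry per sample, all sampled every interval_s seconds.
    (Hypothetical layout, for illustration only.)
    """
    if not counters:
        raise ValueError("no targets given")
    lengths = {len(series) for series in counters.values()}
    if len(lengths) != 1:
        raise ValueError("all targets must have the same number of samples")
    n_samples = lengths.pop()
    if n_samples < 2:
        raise ValueError("need at least two samples per target")

    # Bytes moved by each target over the window, summed across targets.
    total_bytes = sum(series[-1] - series[0] for series in counters.values())
    elapsed_s = (n_samples - 1) * interval_s
    return total_bytes / elapsed_s


def corroborates(client_rate, server_rate, rel_tol=0.10):
    """True if the two rates agree to within rel_tol of the larger one."""
    return abs(client_rate - server_rate) <= rel_tol * max(client_rate, server_rate)


# Made-up numbers: two targets, three samples each, five seconds apart.
samples = {
    "OST0000": [0, 500_000_000, 1_000_000_000],
    "OST0001": [0, 500_000_000, 1_000_000_000],
}
server_rate = server_side_rate(samples)   # 2e9 bytes over 10 s
client_rate = 190_000_000                 # as a benchmark might report it

print(server_rate)                             # 200000000.0
print(corroborates(client_rate, server_rate))  # True
```

On a shared file system the server-side figure also includes traffic from other jobs, so a large gap between the two numbers does not by itself mean that either measurement is wrong.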

Here is a simple walk-through of performing an I/O test (on the Grace system) and validating the results. Additional links there point to a battery of calibration studies performed in August 2012.

A similar set of calibration studies on Hopper is now available. These tests were run in October and November 2012.

LMT Data Access: The links above lead to the per-machine LMT data. The Franklin data is historical, running from August 2008 to May 2012.