The Performance Monitoring Archive (PMA) for IO at NERSC
This web portal provides information on monitoring of the Lustre
filesystems at NERSC. Our goals are to provide:
- insight into resource utilization
- a characterization of IO workload
- a view of the current/recent state of IO activity
- a better window into how users actually use filesystems at NERSC
- help identifying better IO and data strategies
Most of the data gathered here comes from the Lustre Monitoring Tool (LMT)
instrumentation of the Lustre servers. The audience for this portal
includes NERSC staff, IO researchers, and users with large or
challenging I/O requirements.
Background: LMT monitoring currently occurs on the Hopper HPC system and
its testbed platform Grace. Hopper has two parallel file systems, /scratch
and /scratch2. Each file system has 26 Object Storage Servers (OSSs), and
each OSS has 6 Object Storage Targets (OSTs). An OST manages the data
traffic to one disk resource (or LUN), which is generally a
high-performance RAID-based device. In addition to the OSSs, which are
responsible for bulk I/O transfers to storage, there is a Metadata Server
(MDS) for each file system. The MDS manages the file system name space and
the mapping of files in the namespace to the locations of the data on
the OSTs. Each server reports performance data at five-second
intervals, sending the observations to a database on a DB server external
to the HPC system. This web interface provides access to and analysis of that
data. From NERSC systems there is also a mechanism for accessing and
graphing the data from the command line.
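
To give a sense of scale, the figures above can be turned into a quick
back-of-the-envelope calculation. The sketch below uses only the numbers
stated in this section (26 OSSs per file system, 6 OSTs per OSS, one MDS
per file system, one report per server every five seconds). The function
and constant names are illustrative and are not part of LMT or its
database schema.

```python
# Figures taken from the Background description (Hopper, per file system).
OSS_PER_FILESYSTEM = 26        # Object Storage Servers
OSTS_PER_OSS = 6               # Object Storage Targets behind each OSS
MDS_PER_FILESYSTEM = 1         # one Metadata Server per file system
SAMPLE_INTERVAL_SECONDS = 5    # each server reports every five seconds

SECONDS_PER_DAY = 24 * 60 * 60


def osts_per_filesystem():
    """Total number of OSTs in one file system."""
    return OSS_PER_FILESYSTEM * OSTS_PER_OSS


def servers_per_filesystem():
    """Servers that report performance data for one file system."""
    return OSS_PER_FILESYSTEM + MDS_PER_FILESYSTEM


def reports_per_server_per_day():
    """Number of five-second reports a single server sends in a day."""
    return SECONDS_PER_DAY // SAMPLE_INTERVAL_SECONDS


def server_reports_per_day():
    """Reports arriving at the database per day for one file system."""
    return servers_per_filesystem() * reports_per_server_per_day()


if __name__ == "__main__":
    print("OSTs per file system:", osts_per_filesystem())
    print("Reports per server per day:", reports_per_server_per_day())
    print("Server reports per day:", server_reports_per_day())
```

With two file systems on Hopper, the per-file-system totals double. How
many database rows each report produces depends on the LMT schema, which
this sketch does not model.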
Benchmarking:
IO is hard to measure. There are many layers from the application down
to the disk spindle. Each layer may see the IO happen in a different
way and with different resource constraints and performance
impacts. As such, it is important that we have a clear understanding of
our measurements and that we are able to corroborate performance
measurements from multiple sources. IO measurements can
also be hard to reproduce given the shared nature of the filesystem
resource.
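
As a minimal sketch of what corroborating two sources could look like,
the code below compares the number of bytes an application reports having
written with the total implied by a series of server-side rate samples.
Everything here is an assumption made for illustration: the function
names, the idea that samples arrive as bytes-per-second rates at a fixed
interval, and the 10% tolerance are hypothetical and do not describe the
LMT database or the calibration studies.

```python
def total_bytes_from_rates(rates_bytes_per_s, interval_s=5.0):
    """Integrate a series of rate samples taken at a fixed interval.

    Each sample is assumed to hold for one full interval.
    """
    return sum(rate * interval_s for rate in rates_bytes_per_s)


def relative_difference(a, b):
    """Relative difference between two non-negative measurements."""
    largest = max(a, b)
    if largest == 0:
        return 0.0
    return abs(a - b) / largest


def corroborates(app_bytes, server_bytes, tolerance=0.10):
    """True if the two measurements agree within the given tolerance.

    The default tolerance is an arbitrary illustrative value.
    """
    return relative_difference(app_bytes, server_bytes) <= tolerance


if __name__ == "__main__":
    # Hypothetical server-side write rates (bytes/s), one per interval.
    server_rates = [100.0, 200.0, 100.0]
    server_total = total_bytes_from_rates(server_rates)

    # Hypothetical byte count reported by the application itself.
    app_total = 1900.0

    print("server-side total:", server_total)
    print("agree within tolerance:", corroborates(app_total, server_total))
```

On a shared file system the server-side total can include traffic from
other jobs, so a mismatch does not by itself mean either measurement is
wrong.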
Here is a simple walk-through of performing an I/O test
(on the Grace system) and validating the results. Additional links there
point to a battery of calibration studies performed in August 2012.
A similar set of calibration studies on Hopper
is now available. These tests were run in October/November 2012.
LMT Data Access: The links above lead to the per-machine LMT data.
The Franklin data is historical, running from August 2008 to May 2012.