LMT Testing and Calibration

In I/O measurements, establishing the same bandwidth or timing from two different sources provides a first-order level of assurance that the reported numbers are not being distorted by the many complexities involved in I/O measurement. The example on this page is a simple exercise in such corroboration: LMT data is checked against the I/O performance reported by the IOR benchmark.

Mechanics of an IOR benchmark test

The IOR run is described here and is configured to use a single OST (see 'lfs setstripe'), to which it makes 1024 write() system calls of 4 MB each, for a total of 4096 MB. The output of the IOR run can be seen here.

Time series graph of the test results

Data from the LMT DB plotted against the values reported by IOR

The figure was generated on euclid.nersc.gov with this command line. In this graph the data from the LMT DB is superimposed on the results reported by IOR. Time is on the x-axis, and data rate, in MB/s, is on the y-axis. The solid lines show the write (blue) and read (red) observations from LMT. LMT has data rate observations for each OST at five-second intervals. The graph shows the sum, at each interval, of the data rate across the four OSTs on Grace.
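The per-interval sum described above can be sketched in a few lines of Python. This is only an illustration: the function name, the tuple layout, and the sample values are assumptions made for the example, not the LMT DB schema or the script behind the figure.

```python
from collections import defaultdict

def aggregate_rate(samples):
    """Sum the per-OST data rates at each timestamp.

    samples: iterable of (timestamp_s, ost_name, rate_mb_s) tuples
    (a hypothetical layout, not the actual LMT table structure).
    Returns a list of (timestamp_s, total_rate_mb_s) sorted by time.
    """
    totals = defaultdict(float)
    for timestamp_s, _ost_name, rate_mb_s in samples:
        totals[timestamp_s] += rate_mb_s
    return sorted(totals.items())

# Made-up observations at five-second intervals, for illustration only.
samples = [
    (0, "OST0000", 0.0), (0, "OST0002", 120.0),
    (5, "OST0000", 1.5), (5, "OST0002", 350.0),
]
print(aggregate_rate(samples))
```

Summing by timestamp assumes that all OSTs are sampled on the same five-second ticks; if they were not, the observations would first have to be bucketed into common intervals.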

The dashed lines represent the results reported by IOR. For each phase, write and read, IOR reports when the test started, the observed data rate, and the amount of data moved. From those values we can draw an idealized 'square-wave pulse' representing the test.
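One way to build such a pulse is sketched below. The function name and the start time are hypothetical; the rate and volume in the usage line are the write figures quoted elsewhere on this page (347.75 MiB/s, 4096 MB, treated here as MiB), and the duration is simply the volume divided by the rate.

```python
def square_pulse(start_s, rate_mib_s, total_mib):
    """Return the four corner points of an idealized square-wave pulse.

    The pulse rises from 0 to rate_mib_s at start_s and falls back to 0
    after total_mib / rate_mib_s seconds. Points are (time_s, rate) pairs
    suitable for drawing as a line plot.
    """
    duration_s = total_mib / rate_mib_s
    end_s = start_s + duration_s
    return [
        (start_s, 0.0),
        (start_s, rate_mib_s),
        (end_s, rate_mib_s),
        (end_s, 0.0),
    ]

# Hypothetical start time of 100 s; rate and volume as reported for the write test.
corners = square_pulse(100.0, 347.75, 4096.0)
print(corners)
```

With these inputs the pulse is a little under 12 seconds wide, which is in line with the roughly 11-second test discussed on this page.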

The LMT plot shows data being written at a peak of just under 400 MB/s for about 10 seconds. This matches the value reported by IOR reasonably well. The two measurements are not perfectly aligned in time. This is not surprising, since the IOR results rely on the compute node clock and the LMT results rely on the OSS clock. The clocks are a few seconds apart, and the graphs have been manually adjusted to line up a little better. More importantly, LMT reports (in its BYTES_WRITTEN observations) that exactly 4096 MB were transferred, all to the one OST (see below as well).

IOR reported a very high read rate, which is entirely possible in this scenario if the reads are served from cached data rather than from disk. The LMT BYTES_READ data shows only a very small amount of read activity, so there is agreement on this as well.

RPC traffic during the test

The LMT data collection libraries on Grace have been augmented to collect RPC statistics as well as observations of the BYTES_READ and BYTES_WRITTEN counters. The Lustre servers maintain a count of how many 4 KB pages were in each RPC transferred from the compute nodes to the servers. These RPC statistics are represented as a histogram with bins in powers of two from 1 to 256 pages, 256 being the largest possible RPC. The I/O presented by IOR in this test is a well-aligned streaming I/O load from one compute node to one OST, so we would expect it to be served entirely by RPCs of the maximum size: 256 pages = 1 MB each. All of those RPCs should show up in the 256-page bin on the target OST.
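The arithmetic behind that expectation can be checked with a short sketch. It assumes that a 4 KB page is 4096 bytes and that each 4 MB write is 4 × 2^20 bytes; the function names are made up for this example.

```python
PAGE_BYTES = 4096          # assumed size of one 4 KB page
MAX_PAGES_PER_RPC = 256    # largest RPC, per the histogram description

def rpc_bins(max_pages=MAX_PAGES_PER_RPC):
    """Histogram bin labels: powers of two from 1 page up to max_pages."""
    bins = []
    pages = 1
    while pages <= max_pages:
        bins.append(pages)
        pages *= 2
    return bins

def expected_full_rpcs(total_bytes, pages_per_rpc=MAX_PAGES_PER_RPC):
    """Number of RPCs needed if every RPC carries pages_per_rpc pages."""
    rpc_bytes = pages_per_rpc * PAGE_BYTES
    return total_bytes // rpc_bytes

# 1024 write() calls of 4 MB each, as configured for the IOR run.
total_bytes = 1024 * 4 * 1024 * 1024

print(rpc_bins())                       # nine bins, 1 through 256 pages
print(expected_full_rpcs(total_bytes))  # RPCs expected in the 256-page bin
```

Under these assumptions the 4096 MB written should arrive as 4096 RPCs of 1 MB each.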

RPC data from the LMT DB for each OST

The figure was generated on euclid.nersc.gov with this command line. There is a graph for each of the four OSTs. In each graph the x-axis shows the bins from 1 page up to 256 pages. The y-axis shows how many RPCs of that size arrived in the given interval. In this case there are 4096 write RPCs in the 256 page bin on OST scratch-OST0002, and (almost) no other RPCs in any other bin or on any other OST.

A very small number (5) of additional RPCs are lost in the background, but the LMT and IOR data agree in the volume and timing of the data in this highly controlled experiment. The command line mentioned above also gives detailed output where these few extra RPCs are visible. Note especially that there were few, if any, read RPCs during the test. Thus the caching took place on the compute nodes, not on the servers.

Conclusions and further work

Experiments on the observed server-side performance during an IOR run are relatively easy to conduct and interpret for the write phase, when the test is creating and writing to a new file. IOR measures the I/O time from before the beginning of the file open command until after the end of the file close command. In that way it avoids the spurious results that can come from write-back caching. The conduct and interpretation of read tests requires more care, as we saw above. The first additional efforts outlined below will focus on write performance observations. Later, we will explore some of IOR's mechanisms for making useful observations of read performance. By then we should have seen enough, from the IOR tests suggested below, that we may begin to make useful hypotheses about the file system's read behavior.

The IOR test presented here lasted only about 11 seconds, which is close to the five-second resolution of the LMT data. The alignment issue brings up a subtle assumption about how the file system (the OST) responds to I/O traffic. Our hypothesis, which still needs to be demonstrated, is that when I/O is pending it will proceed at the OST's best possible rate. The middle observation in the graph shows data being moved during those five seconds at 355.8 MiB/s. That is quite close to the reported IOR rate of 347.75 MiB/s. The fine adjustment of the starting point in the graph for the IOR test (the dashed line) was set on the assumption that the first five-second interval consisted of an initial 2.27 seconds at 0 MiB/s followed by 2.73 seconds at 355.8 MiB/s.
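The split of a partially active interval follows from simple proportions, sketched below. The function names are hypothetical, and the calculation is valid only under the hypothesis stated above, namely that I/O proceeds at the full rate whenever it is pending.

```python
INTERVAL_S = 5.0  # LMT sampling interval in seconds

def interval_average(active_s, full_rate, interval_s=INTERVAL_S):
    """Average rate LMT would report for an interval in which I/O ran
    at full_rate for active_s seconds and was idle for the rest."""
    return full_rate * active_s / interval_s

def active_seconds(observed_avg, full_rate, interval_s=INTERVAL_S):
    """Inverse of interval_average: how long I/O must have been running
    at full_rate to produce observed_avg over the interval."""
    return interval_s * observed_avg / full_rate

# 2.73 s of activity at 355.8 MiB/s, as assumed for the first interval.
avg = interval_average(2.73, 355.8)
print(avg)                          # implied interval average, in MiB/s
print(active_seconds(avg, 355.8))   # recovers the 2.73 s of activity
print(INTERVAL_S - active_seconds(avg, 355.8))  # the idle 2.27 s
```

The implied average of roughly 194 MiB/s is a consequence of the assumption, not a measured value; the second function shows how an observed first-interval average could be turned into a start-time offset for the IOR pulse.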

To test that hypothesis we will want to run a sequence of IOR tests with successively larger I/O requirements. If the interior observations (not the first, not the last) all show something very close to 355.8 MiB/s, then it is a good bet that that is the rate for the first and last observations as well, though only for a fraction of the interval.
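A check of that kind might look like the following sketch. The function name, the 2% tolerance, and the sample series are all assumptions chosen for illustration; no such longer run has been made yet.

```python
def interior_matches(rates, expected, rel_tol=0.02):
    """Return True if every interior observation (excluding the first and
    last) is within rel_tol of the expected rate.

    rates: per-interval data rates for one test, in time order.
    rel_tol: relative tolerance; the 2% default is an arbitrary choice.
    Returns False when there are no interior observations to check.
    """
    interior = rates[1:-1]
    if not interior:
        return False
    return all(abs(r - expected) <= rel_tol * expected for r in interior)

# Hypothetical series from a longer run: partial first and last intervals,
# full-rate intervals in between.
hypothetical = [194.3, 355.1, 356.4, 355.8, 120.0]
print(interior_matches(hypothetical, 355.8))
```

The first and last values are excluded because they cover intervals in which the test was only running part of the time.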

At this point, we do not know if the limit on the data rate is based on the maximum capacity of the disks, the server, the network, or the compute node. Having determined a total I/O amount sufficient to observe the maximum bandwidth, we will want to run a succession of tests that a) runs multiple tasks on the node, and b) runs tasks on multiple nodes. With those tests we will be able to identify the maximum write rate under ideal circumstances for one OST.

There may also be contention for resources on the OSSs and on the disk device(s), so we will want to duplicate the test using both OSTs on one OSS and using both OSSs. With that we should be able to identify the maximum available bandwidth for the file system as a whole.


Why do we think these tests show the maximum achievable I/O rate? (Link to discussion of RPC size, network bandwidth, and I/O latency.) Under what circumstances will the maximum achievable rate be less than what we've shown here? One possible mechanism is for the RPCs to be smaller than the ideal size. We will want to design and run tests that deliberately send I/O with 128 (instead of 256) pages per RPC. We will be able to see that happen in the reported RPC statistics, and we will be able to see the effect on performance. We can then repeat that experiment for each of the bins in the RPC histograms down to 1 page per RPC.

RPC size is not the only factor affecting performance, and RPC statistics are not the only ones gathered by the new LMT "BRW" statistics module. We'll want to repeat the above procedure for each of those statistics to see whether it has an effect and what that effect might be.
