
Burst Buffer

The 1.8 PB NERSC Burst Buffer is based on Cray DataWarp, which uses flash (SSD, solid-state drive) technology to significantly increase I/O performance on Cori for all file sizes and access patterns. It sits within the High Speed Network (HSN) on Cori and is accessible only from compute nodes, providing per-job (short-term) storage for I/O-intensive codes.

The peak bandwidth performance is over 1.7 TB/s, with each Burst Buffer node contributing up to 6.5 GB/s. The number of Burst Buffer nodes available to a job depends on the granularity and size of the Burst Buffer allocation. Performance also depends on access pattern, transfer size and access method (e.g. MPI I/O, shared files). There is a 50 TB (52428800 MB) maximum threshold on each BB reservation, but multiple reservations can be requested concurrently.
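
For reference, a per-job allocation is requested with a #DW jobdw directive in the batch script and is then reachable through the $DW_JOB_STRIPED environment variable. The sketch below is illustrative only: the queue, node count, capacity and the application name/flags are placeholders, not recommendations.

    #!/bin/bash
    #SBATCH -q regular
    #SBATCH -N 4
    #SBATCH -C haswell
    #SBATCH -t 00:30:00
    # Request a 200 GB per-job Burst Buffer allocation, striped across BB nodes
    #DW jobdw capacity=200GB access_mode=striped type=scratch

    # $DW_JOB_STRIPED points at the root of the per-job allocation
    srun -n 128 ./my_io_app --output-dir "$DW_JOB_STRIPED"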

Check the DataWarp limitations below

Please note that support for DataWarp has been reduced. The Burst Buffer is not persistent storage, and a reservation can become unavailable if the hardware is unstable. A user reported a data corruption event, detailed in the Known Issues below. We encourage users to consider the Cori SCRATCH file system whenever possible. DataWarp remains available for those who benefit from it and accept the possible risks.

Make sure you understand all of the Burst Buffer limitations reported in the General Issues section below, to avoid losing data or wasting compute hours.

Performance Tuning

Striping

Currently, the Burst Buffer granularity is 20 GB. If you request an allocation smaller than this amount, your files will sit on a single Burst Buffer node. If you request more than this amount, your files will be striped over multiple Burst Buffer nodes. For example, if you request 19 GB your files all sit on the same Burst Buffer server, whereas a request larger than 20 GB can be spread over two or more servers. This matters for performance because each Burst Buffer server has a maximum possible bandwidth of roughly 6.5 GB/s, so your aggregate bandwidth is summed over the number of Burst Buffer servers you use (see the sketch after the list below). If other users are accessing data on the same Burst Buffer server at the same time, you will share that bandwidth and are unlikely to reach the theoretical peak.

Therefore:

  • It is better to spread your data over many Burst Buffer servers, particularly if you have a large number of compute nodes trying to access the data.

  • The number of Burst Buffer nodes used by an application should be scaled up with the number of compute nodes, to keep the Burst Buffer nodes busy but not over-subscribed. The exact ratio of compute to Burst Buffer nodes will depend on the I/O load produced by the application.
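
For instance, with the current 20 GB granularity, rounding your request up to a multiple of the granularity makes the intended striping explicit. The capacities below are illustrative, and the final placement is decided by DataWarp.

    # 19 GB fits within a single 20 GB granule, so it lands on one BB server:
    #DW jobdw capacity=19GB access_mode=striped type=scratch

    # 80 GB (four granules) can be spread over up to four BB servers,
    # giving up to ~4 x 6.5 GB/s of aggregate bandwidth:
    #DW jobdw capacity=80GB access_mode=striped type=scratch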

Use Large Transfer Size

We have seen that using transfer sizes less than 512 KiB results in poor performance. In general, we recommend using as large a transfer size as possible.
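
As a rough illustration, when copying data by hand (e.g. with dd) choose a block size well above the 512 KiB threshold; the file names and sizes here are placeholders.

    # 64 KiB writes are below the 512 KiB threshold and will perform poorly:
    dd if=$SCRATCH/input.dat of=$DW_JOB_STRIPED/input.dat bs=64k

    # 8 MiB writes keep each transfer comfortably large:
    dd if=$SCRATCH/input.dat of=$DW_JOB_STRIPED/input.dat bs=8M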

Use More Processes per Burst Buffer Node

We have observed that a Burst Buffer node cannot be kept busy with fewer than four processes writing to it; with fewer writers, you will not reach the peak potential performance of roughly 6.5 GB/s per node.
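
A minimal sketch of matching writer processes to Burst Buffer nodes, assuming an 80 GB request ends up spread over roughly four BB nodes (the exact placement is decided by DataWarp; the application name and flag are placeholders):

    #DW jobdw capacity=80GB access_mode=striped type=scratch

    # With ~4 BB nodes backing the allocation, run at least 4 writers per node,
    # i.e. 16 or more ranks doing I/O, to approach ~6.5 GB/s per node:
    srun -n 16 ./my_io_app --output-dir "$DW_JOB_STRIPED"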

Known Issues

There are a number of known issues to be aware of when using the Burst Buffer on Cori. This page will be updated as problems are discovered, and as they are fixed.

General Issues

  • A user reported data corruption in files written to a Burst Buffer Persistent Reservation: the files were found to contain several sequences of null/zero characters and had an incorrect file size. This is believed to be associated with a known aspect of the underlying XFS filesystem that may be triggered by a Burst Buffer cabinet crash or when SSD write protection is activated: XFS appears to record abnormal write operations during particularly stressful I/O events, which then show up as blocks of nulls/zeroes in files. HPE engineers are investigating the bug, but no patch is available yet. Use of the Burst Buffer is discouraged until a patch is provided, unless you are aware of the possible risks to your data.
  • If you have multiple jobs writing to the same directory in a Persistent Reservation, you will run into race conditions due to DataWarp caching. The second job will likely fail with Permission denied or No such file or directory messages, because the metadata in the compute-node cache does not match the metadata on the BB. Use a different subdirectory for each job, e.g. by creating one per job with mkdir $DW_PERSISTENT_STRIPED_PRname/$SLURM_JOB_ID/ (make sure to use the correct PRname in the variable); see the sketch after this list.
  • A Burst Buffer allocation or Persistent Reservation may suddenly become read-only, causing your program to fail. This happens when too many writes have been performed in your Burst Buffer allocation, which triggers the write protection of the underlying SSDs and makes the file system read-only to prevent the drive blocks from wearing out too quickly. If you are using a Burst Buffer allocation and need to access the same files from multiple jobs, consider using a Persistent Reservation (PR). The solution to the read-only error is either to wait some hours for the SSDs to "cool down", or to request a new, larger PR (remember that resources are limited and other users may also need storage space).
  • Do not use a decimal point when you specify the Burst Buffer capacity - Slurm does not parse it correctly and will allocate you one grain of space instead of the full request. This is easy to work around - request e.g. 3500GB instead of 3.5TB.
  • Data is at risk in a Persistent Reservation if an SSD fails - there is no way to recover data from a failed SSD. Please back up your data and don't leave important data in a PR for longer than necessary!
  • If you request a too-small allocation on the Burst Buffer (e.g. you request 200 GB and actually write out 300 GB), your job will fail and the BB node will go into an undesirable state that must be recovered manually. Please be careful about how much space your job requires - if in doubt, over-request.
  • If you use dwstat in your batch job, you may occasionally run into [SSL: CERTIFICATE_VERIFY_FAILED] errors, which may cause your job to fail. This error is due to a modulefile issue - please use the full path to the dwstat command: /opt/cray/dws/default/bin/dwstat.
  • If the primary SLURM controller is down, the secondary (backup) controller will be scheduling jobs - and the secondary controller does not know about the Burst Buffer. If you happen to submit a job while the backup scheduler is running, your job will fail with the message sbatch: error: burst_buffer/cray: Slurm burst buffer configuration error / sbatch: error: Batch job submission failed: Burst Buffer request invalid. If you receive this error and your submission script looks correct, please check the MOTD for SLURM downtime/issues and try again later.
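
A sketch of the per-job subdirectory pattern and the full dwstat path mentioned above, assuming a Persistent Reservation named myPR already exists; the queue, node count and application name/flags are placeholders.

    #!/bin/bash
    #SBATCH -q regular
    #SBATCH -N 2
    #SBATCH -C haswell
    #SBATCH -t 00:30:00
    # Attach the existing Persistent Reservation to this job
    #DW persistentdw name=myPR

    # Give every job its own subdirectory to avoid metadata caching races
    JOBDIR=$DW_PERSISTENT_STRIPED_myPR/$SLURM_JOB_ID
    mkdir -p "$JOBDIR"

    srun -n 64 ./my_io_app --output-dir "$JOBDIR"

    # Use the full path to dwstat to avoid the CERTIFICATE_VERIFY_FAILED issue
    /opt/cray/dws/default/bin/dwstat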

Staging Issues

  • The Burst Buffer cannot access GPFS for staging data; only files in $SCRATCH are supported for stage_in/stage_out (see the sketch after this list). Data in your home or Community directories can still be accessed and transferred into a PR using e.g. cp or rsync, but this must be done manually within your compute job and will consume part of your compute time.
  • Staging out files to destinations other than a location on $SCRATCH will result in those files being lost.
  • The command squeue -l -u $USER will give you useful information on how your stage-in process is going. If you see an error message (e.g. burst_buffer/cray: dws_data_in: Error staging in session 20868 configuration 6953 path /global/cscratch1/sd/username/stagemein.txt -> /stagemein.txt: offline namespaces: [44223] - ask a system administrator to consult the dwmd log for more information), then you may have a permissions issue with the files you are trying to stage_in, or you may be trying to stage_in a non-existent file.
  • stage_in and stage_out using access_mode=private does not work (by design).
  • If you have multiple files to stage in, you will need to tar them up and use type=file, or keep them all in one directory and use type=directory.
  • type=directory fails with large directories (>~200,000 files) due to a timeout error. In this case, consider tar-ing your files and staging in the tarball.
  • Symbolic links are not preserved when staging in; the link will be lost.
  • Staging in/out with hard links does not work.
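
A minimal staging sketch, assuming the input data already lives under $SCRATCH; the cscratch1 paths, username, capacity and application name/flags are placeholders.

    #!/bin/bash
    #SBATCH -q regular
    #SBATCH -N 1
    #SBATCH -C haswell
    #SBATCH -t 00:30:00
    #DW jobdw capacity=200GB access_mode=striped type=scratch
    # Stage a whole directory in from $SCRATCH before the job starts
    #DW stage_in source=/global/cscratch1/sd/username/inputs destination=$DW_JOB_STRIPED/inputs type=directory
    # Stage results back to $SCRATCH after the job ends (any other destination is lost)
    #DW stage_out source=$DW_JOB_STRIPED/outputs destination=/global/cscratch1/sd/username/outputs type=directory

    mkdir -p "$DW_JOB_STRIPED/outputs"
    srun -n 32 ./my_io_app --in "$DW_JOB_STRIPED/inputs" --out "$DW_JOB_STRIPED/outputs"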