# Current known ML software issues

This page lists current known issues and workarounds, if any, for machine learning software at NERSC.

## Issues on Perlmutter
- Users sometimes encounter a `CUDA Unknown Error` during initialization. Nvidia is still investigating the issue, but has provided a workaround in the meantime: run a simple executable that creates a GPU context, then run your normal job steps. You can create the executable with the following line:

    ```bash
    srun -C gpu -N1 -n1 bash -c 'echo "int main() {cudaFree(0);}" > dummy.cu && nvcc -o dummy dummy.cu'
    ```

    The `dummy` executable can then be saved somewhere (e.g. in your `$HOME` directory) and reused for your jobs. To prevent the `CUDA Unknown Error`, run the `dummy` executable once on each GPU of your job before running your actual code. Note that the `dummy` executable does not need to be run from inside a Shifter container.
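    As a sketch of how this might fit into a batch job (the node/GPU counts, `--gpus-per-task` binding, and `train.py` script are assumptions for illustration, not part of the documented workaround), you could run `dummy` as a job step before your real one:

    ```bash
    #!/bin/bash
    #SBATCH -C gpu
    #SBATCH -N 2
    #SBATCH --ntasks-per-node=4
    #SBATCH --gpus-per-node=4

    # One task per GPU, so a CUDA context is created on every device
    srun --gpus-per-task=1 $HOME/dummy
    # Then launch the actual workload
    srun --gpus-per-task=1 python train.py
    ```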
- Some Nvidia NGC containers don't properly enter compatibility mode when running with Shifter. To ensure correct behavior in NGC deep learning containers, you must wrap your commands inside the container with `bash` so that the compatibility check is set up properly. For example, the line

    ```bash
    srun shifter --image=nvcr.io/nvidia/pytorch:21.05-py3 python train.py
    ```

    would change to

    ```bash
    srun shifter --image=nvcr.io/nvidia/pytorch:21.05-py3 bash -c 'python train.py'
    ```

    Alternatively, you can put your code inside a bash script and run that script with Shifter. In contrast to the deep learning NGC containers, the base Nvidia CUDA containers will not work with the above `bash` trick. In those, you will need to manually set `$LD_LIBRARY_PATH` to expose the proper compatibility libraries. To do so, in the container, set the environment variable:

    ```bash
    export LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:$LD_LIBRARY_PATH
    ```
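    As a hedged illustration (the image tag and `./my_app` are placeholders), one way to set this when launching a base CUDA container with Shifter is to export the variable inside the command you run in the container:

    ```bash
    # Export the compatibility library path inside the container,
    # then run the application
    srun shifter --image=nvcr.io/nvidia/cuda:11.3.0-base-ubuntu20.04 bash -c \
        'export LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:$LD_LIBRARY_PATH && ./my_app'
    ```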
- Conda-installed PyTorch comes with an older version of NCCL (< 2.8) that is incompatible with an InfiniBand setting on Perlmutter NICs, so multi-node distributed training with the `nccl` backend will hang. There are a number of possible workarounds (a sketch of the last one follows this list):
    - Use our `pytorch/1.9.0` module, which is built from source with NCCL 2.9.8.
    - Use a container with PyTorch and a version of NCCL >= 2.8. The Nvidia NGC deep learning containers have many versions available, and are optimized for Nvidia GPUs.
    - Set the environment variable `export NCCL_IB_DISABLE=1` before running your training. This disables collective communications over InfiniBand, so it will incur a slight performance hit.
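    For example, a minimal sketch of the `NCCL_IB_DISABLE` workaround (the node counts, task layout, and `train.py` script are placeholder assumptions):

    ```bash
    # Disable NCCL's InfiniBand transport; collectives fall back to a
    # slower transport, but multi-node training no longer hangs
    export NCCL_IB_DISABLE=1
    srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 python train.py
    ```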
- TensorFlow + Horovod built with Cray MPICH hangs when running multi-node training. Currently, we recommend using Shifter to run multi-node TensorFlow + Horovod workflows. Note that since the Nvidia NGC containers use an installation of OpenMPI, the `--mpi=pmi2` option for `srun` mentioned above is needed to get this working.
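    As a hedged sketch (the container tag, node counts, and `train.py` script are placeholders), such a multi-node launch might look like:

    ```bash
    # --mpi=pmi2 lets Slurm bootstrap the container's OpenMPI so that
    # Horovod can initialize across nodes
    srun --mpi=pmi2 -N 2 --ntasks-per-node=4 \
        shifter --image=nvcr.io/nvidia/tensorflow:21.05-tf2-py3 bash -c 'python train.py'
    ```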