SuperLU

(Supernodal LU)

Copyright and License

SuperLU is a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations. The library is written in C and is callable from either C or Fortran programs. It uses MPI, OpenMP and CUDA to support various forms of parallelism. It supports both real and complex datatypes, both single and double precision, and 64-bit integer indexing. The library routines perform an LU decomposition with partial pivoting and triangular system solves through forward and back substitution. The LU factorization routines can handle non-square matrices, but the triangular solves are performed only for square matrices. The matrix columns may be preordered (before factorization) either through library or user-supplied routines. This preordering for sparsity is completely separate from the factorization. Working-precision iterative refinement subroutines are provided for improved backward stability. Routines are also provided to equilibrate the system, estimate the condition number, calculate the relative backward error, and estimate error bounds for the refined solutions.
The serial SuperLU package also contains ILU routines that use numerical threshold-based dropping with partial pivoting (ILUTP).
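
The factor-then-solve sequence described above (LU decomposition with partial pivoting, then forward and back substitution) can be sketched on a small dense matrix. This is only an illustration of the numerical idea: it is not SuperLU code, it ignores sparsity, supernodes and column preordering, and the names lu_factor and lu_solve are hypothetical helpers invented for this sketch.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Factor the n-by-n row-major matrix `a` in place so that P*A = L*U.
 * On return U occupies the upper triangle and the unit-lower L (without
 * its diagonal) the strict lower triangle; perm[k] is the row that was
 * swapped with row k at step k.
 * Returns 0 on success, or k+1 if column k has no nonzero pivot.
 * Hypothetical helper, not part of the SuperLU API. */
int lu_factor(double *a, int *perm, int n)
{
    assert(a != NULL && perm != NULL && n > 0);
    for (int k = 0; k < n; ++k) {
        /* Partial pivoting: largest |entry| in column k, rows k..n-1. */
        int p = k;
        double big = fabs(a[k * n + k]);
        for (int i = k + 1; i < n; ++i) {
            double v = fabs(a[i * n + k]);
            if (v > big) { big = v; p = i; }
        }
        perm[k] = p;
        if (big == 0.0)
            return k + 1;
        if (p != k) {
            for (int j = 0; j < n; ++j) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
        }
        /* Eliminate below the pivot, storing the multipliers as L. */
        for (int i = k + 1; i < n; ++i) {
            double m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for (int j = k + 1; j < n; ++j)
                a[i * n + j] -= m * a[k * n + j];
        }
    }
    return 0;
}

/* Solve A*x = b using the factors from lu_factor; b is overwritten by x.
 * Hypothetical helper, not part of the SuperLU API. */
void lu_solve(const double *lu, const int *perm, int n, double *b)
{
    for (int k = 0; k < n; ++k) {        /* apply the row swaps to b */
        double t = b[k]; b[k] = b[perm[k]]; b[perm[k]] = t;
    }
    for (int i = 1; i < n; ++i)          /* forward substitution, L*y = P*b */
        for (int j = 0; j < i; ++j)
            b[i] -= lu[i * n + j] * b[j];
    for (int i = n - 1; i >= 0; --i) {   /* back substitution, U*x = y */
        for (int j = i + 1; j < n; ++j)
            b[i] -= lu[i * n + j] * b[j];
        b[i] /= lu[i * n + i];
    }
}
```

To use the sketch, call lu_factor once on a copy of the matrix and then lu_solve for each right-hand side. Reusing one factorization for many solves is the same pattern the library's separate factorization and triangular-solve routines allow.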

The SuperLU package comes in three different flavors:

SuperLU for sequential machines (GitHub. Code documentation)
SuperLU_MT for shared memory parallel machines (GitHub)
SuperLU_DIST for distributed memory (GitHub. Code documentation. Users' Guide for Version-8 release)
The target machines for SuperLU_DIST are the highly parallel distributed memory hybrid systems. The numerical factorization routines are already implemented for hybrid systems with multiple GPUs. Further work will be needed to implement the other phases of the algorithms on the hybrid systems and to enhance strong scaling to extreme scale.

FAQ (Frequently Asked Questions)

The Users' Guide (Tech report LBNL-44289) describes all three libraries. (Last update: June 2018)

How to Cite SuperLU in your publication.

Please send email if you have used any version of the library.

This is a survey article about sparse direct solvers of various flavours.

Usage of SuperLU (page is under construction)

This project has been funded by DOE, NSF and DARPA.

Developers:
     X. Sherry Li
     Wajih Boukaram
     Jim Demmel
     Nan Ding
     John Gilbert
     Laura Grigori
     Yang Liu
     Piyush Sao
     Meiyue Shao
     Ichitaro Yamazaki

Other Contributors:
     Pietro Cicotti, UCSD
     Daniel Schreiber
     Jinqchong Teo
     Yu Wang
     Eric Zhang, Albany High

SuperLU Version 7.0.0

Download software (v7.0.0) -- source code and documentation in a compressed tar file (~2.5 MB).
Supports both real and complex data types, in single or double precision.
The SIMAX paper describes the algorithms and performance on various machines.
SuperLU has achieved up to 40% of the theoretical floating-point rate on a number of workstations, such as MIPS R8000 and IBM RS/6000. The megaflop rate usually increases with the ratio of the floating-point operation count to the number of nonzeros in the L and U factors.
This Technical Report (published in ACM Trans. Math. Software, Vol. 37, Issue 4, Article No. 43, April 2011) describes the ILU algorithm implemented in SuperLU 4.0.
These slides are from an up-to-date talk.
Release notes:
- February 4, 1997 Version 1.0
- November 15, 1997 Version 1.1
- September 1, 1999 Version 2.0 (Last update: 06/03/03)
- October 15, 2003 Version 3.0 (Last update: 01/02/06)
  - Include "symmetric mode"
- August 1, 2008 Version 3.1
- June 30, 2009 Version 4.0 (Last update: 12/18/09)
  - Include threshold-based incomplete factorization (ILU)
    (See details in this paper published in ACM Trans. Math. Software, Vol. 37, No. 4, Article No. 43, April 2011.)
- November 25, 2010 Version 4.1
- August 25, 2011 Version 4.2
- October 27, 2011 Version 4.3 (Last update: 12/14/2011)
  - Renamed several enum constants.
- July 26, 2015 Version 5.0
  - thread-safe: remove static variables; replace xLAMCH by table lookup in float.h (C99 standard).
  - Interface changes to the following routines: xGSSVX (expert driver), xGSTRF (factorization). Consult the example dlinsolx.c in the EXAMPLE/ directory to see the calling sequence.
- December 3, 2015 Version 5.1 (Bug fix: Version 5.1.1 January 22, 2016)
  - Renames of internal routines. No interface change.
  - Real & complex being used together.
  - Bug fix for iterative refinement for complex conjugate.
  - Added CMake build option.
- April 8, 2016 Version 5.2.0 (Small patches: Version 5.2.1, May 22, 2016)
  - First xSDK release, add copyright notice in each file.
- October 17, 2020 Version 5.2.2
  - Applied a number of patches, merged a number of PRs.
- September 29, 2021 Version 5.3.0
  - Added CI with GitHub Actions.
  - Applied a number of patches.
  - Cleaned up warnings.
- April 5, 2023 Version 6.0.0
  - Add 64-bit indexing support and METIS ordering option.
- August 5, 2023 Version 6.0.1
  - Minor fixes, mostly documentation and clean up warnings
- August 17, 2024 Version 7.0.0
  - API change: "complex" to "singlecomplex"
  - Many other fixes
Change Log

SuperLU_MT Version 4.0.0

Download software (V4.0.0) -- source code and documentation in a compressed tar file (~1.7 MB).
Provide Pthreads and OpenMP interfaces. There are also parallel directives for several older SMPs.
Supports both real and complex data types, in single or double precision.
The SIMAX paper describes the algorithms and performance on various machines.
SuperLU_MT demonstrated 5--10 fold speedups on a range of commercially popular SMPs, and up to 2.5 Gigaflops factorization rate.
These slides are from an up-to-date talk.
Release notes:
- November 15, 1997 Version ALPHA
- September 1, 1999 Version 1.0 (Last update: 5/24/05)
- September 10, 2007 Version 2.0 (Last update: 12/12/12)
- March 20, 2013 Version 2.1
- August 18, 2014 Version 2.2
- December 20, 2014 Version 2.3
- February 7, 2015 Version 2.4
- May 1, 2015 Version 3.0
- March 29, 2016 Version 3.1
  - Add {c,z}matvec2() routines in {c,z}myblas2.c
- April 23, 2024 Version 4.0.0
Change Log

SuperLU_DIST Version 9.2.1

Download software (v9.2.1) -- source code and documentation in a compressed tar file (~2.9 MB).
Supports manycore heterogeneous node architectures: MPI is used for interprocess communication, OpenMP is used for on-node threading, CUDA (for Nvidia) and HIP (for AMD) are used for computing on GPUs.
Supports both real and complex data types, in double precision.
The SC98 paper describes our new GESP algorithm designed for large scale parallel machines.
GESP stands for Gaussian Elimination with "Static Pivoting". Static pivoting is a technique that combines the numerical stability of partial pivoting with the scalability of no pivoting, to run accurately and efficiently on large numbers of processors.
The SIAM PP99 paper improves the kernel performance -- 20% to 40% better than the SC98 paper.
SuperLU_DIST demonstrated up to 100 fold speedup on the 512-PE Cray T3E at NERSC, and 10.2 Gigaflops factorization rate, using MPI.
The SC01 paper describes the use of SuperLU_DIST to solve complex sparse linear systems up to order 2 million, for a quantum mechanics problem.
The scientific result was reported earlier in the cover article of Science, Dec 24, 1999.
The ACM TOMS paper (2003) discusses all aspects of the SuperLU_DIST library.
This table contains detailed numerical results of the GESP algorithm.
The SIAM SISC paper (2007) describes the parallel algorithm and performance for symbolic factorization.
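
The static pivoting idea behind GESP can be illustrated on a small dense matrix. The sketch below rests on assumptions not stated on this page: it factors with no row interchanges, replaces any pivot smaller in magnitude than sqrt(eps)*||A|| by that threshold, and then recovers accuracy with a few steps of iterative refinement against the original matrix. The threshold choice, the function names, and the dense storage are all illustrative and are not the SuperLU_DIST implementation. The numerical pre-pivoting and scaling that move large entries onto the diagonal before factorization are omitted.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* sqrt(DBL_EPSILON) for IEEE 754 double precision (2^-26). */
static const double SQRT_EPS = 1.4901161193847656e-08;

/* Infinity norm (largest absolute row sum) of a row-major n-by-n matrix. */
static double inf_norm(const double *a, int n)
{
    double best = 0.0;
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j)
            s += fabs(a[i * n + j]);
        if (s > best) best = s;
    }
    return best;
}

/* LU with NO row interchanges. A pivot with magnitude below
 * SQRT_EPS * anorm is replaced by that threshold (sign preserved).
 * Returns how many pivots were replaced. Assumed rule, see lead-in. */
static int lu_static_pivot(double *lu, int n, double anorm)
{
    const double tiny = SQRT_EPS * anorm;
    int replaced = 0;
    for (int k = 0; k < n; ++k) {
        double piv = lu[k * n + k];
        if (fabs(piv) < tiny) {
            lu[k * n + k] = (piv < 0.0) ? -tiny : tiny;
            ++replaced;
        }
        for (int i = k + 1; i < n; ++i) {
            double m = lu[i * n + k] / lu[k * n + k];
            lu[i * n + k] = m;
            for (int j = k + 1; j < n; ++j)
                lu[i * n + j] -= m * lu[k * n + j];
        }
    }
    return replaced;
}

/* Overwrite rhs with the solution of L*U*x = rhs (unit-lower L). */
static void lu_apply(const double *lu, int n, double *rhs)
{
    for (int i = 1; i < n; ++i)
        for (int j = 0; j < i; ++j)
            rhs[i] -= lu[i * n + j] * rhs[j];
    for (int i = n - 1; i >= 0; --i) {
        for (int j = i + 1; j < n; ++j)
            rhs[i] -= lu[i * n + j] * rhs[j];
        rhs[i] /= lu[i * n + i];
    }
}

/* Solve A*x = b: factor a (possibly perturbed) copy of A once, then run
 * `steps` rounds of iterative refinement using residuals of the ORIGINAL A.
 * Returns the number of replaced pivots, or -1 if allocation fails. */
int static_pivot_solve(const double *a, const double *b, double *x,
                       int n, int steps)
{
    assert(a != NULL && b != NULL && x != NULL && n > 0 && steps >= 0);
    double anorm = inf_norm(a, n);
    assert(anorm > 0.0);
    double *lu = malloc((size_t)n * (size_t)n * sizeof *lu);
    double *r = malloc((size_t)n * sizeof *r);
    if (lu == NULL || r == NULL) { free(lu); free(r); return -1; }

    memcpy(lu, a, (size_t)n * (size_t)n * sizeof *lu);
    int replaced = lu_static_pivot(lu, n, anorm);

    memcpy(x, b, (size_t)n * sizeof *x);
    lu_apply(lu, n, x);                       /* initial solve */
    for (int s = 0; s < steps; ++s) {
        for (int i = 0; i < n; ++i) {         /* r = b - A*x */
            double acc = b[i];
            for (int j = 0; j < n; ++j)
                acc -= a[i * n + j] * x[j];
            r[i] = acc;
        }
        lu_apply(lu, n, r);                   /* correction d = (LU)^-1 r */
        for (int i = 0; i < n; ++i)
            x[i] += r[i];
    }
    free(lu);
    free(r);
    return replaced;
}
```

Because the pivot order is fixed before numerical factorization begins, no row exchanges have to be decided at run time. That is what gives this approach the scalability of no pivoting, while the refinement loop compensates for the small perturbation introduced by a replaced pivot.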
Release Notes:
- September 1, 1999 Version 1.0 (Last update: 2/18/03)
- March 15, 2003 Version 2.0 (Last update: 01/02/06)
  - Included distributed input interface
- November 1, 2007 Version 2.1 (Last update: 12/19/07)
  - Included parallel symbolic factorization. (cite SISC paper, SIAM J. Sci. Comp., Vol. 29, Issue 3, 1289-1314, 2007.)
- February 20, 2008 Version 2.2
  - Fixed memory leaks and a few other bugs in parallel symbolic factorization
- October 15, 2008 Version 2.3 (Last update: 04/02/10)
  - Fixed a few bugs related to 64-bit long long int. (04/02/10)
  - Improved speed of factorization and triangular solve with multicore nodes.
- June 9, 2010 Version 2.4 (Last update: 06/25/10)
  - Updated several header files.
- November 25, 2010 Version 2.5 (Last update: 08/01/2011)
- October 18, 2011 Version 3.0 (Last update: 10/25/2011)
  - Factorization is 2-3x faster using hundreds of processors. (cite IPDPS 2012 paper, IPDPS 2012 Proceedings, pp. 619-630, doi:10.1109/IPDPS.2012.63.)
- May 20, 2012 Version 3.1 (Last update: 06/27/2012)
  - Update (par)metis interface to be compatible with new 64-bit ParMetis 4.0.2.
  - Fix a bug in the factorization routine when using 64-bit integer.
- Oct. 24, 2012 Version 3.2
  - bug fixes, and complex datatype for F90 wrapper.
- March 31, 2013 Version 3.3
  - bug fixes in the factorization routines.
- October 1, 2014 Version 4.0 (Last update: 10/11/2014, add conditional compilation for OpenMP)
  - add multithreading (OpenMP) and GPU (CUDA) support. (cite EuroPar2014 paper, LNCS Vol. 8632. Porto, Portugal, August 25-29, 2014.)
  - bug fixes.
- July 17, 2015 Version 4.1 (Last update: 08/05/2015, minor corrections to a few prototypes)
  - bug fixes.
- September 25, 2015 Version 4.2
  - replace xLAMCH by xMACH, using C99 standard.
- December 31, 2015 Version 4.3 (Last update: 01/07/2016)
  - Fix a bug in parallel symbolic factorization, related to dense separator.
  - Resolve name conflicts with serial SuperLU and LAPACK.
  - Driver routines return proper INFO error flag, instead of exiting.
  - Correctly flag zero pivot in a singular matrix.
- April 8, 2016 Version 5.0.0
- May 15, 2016 Version 5.1.0
  - Add p?GetDiagU() to extract the diagonal entries of U factor
- October 4, 2016 Version 5.1.1 --> October 21, 2016 Version 5.1.2
  - Fix a bug in calculating the size of U panel buffer.
  - (Release V5.1.2): bug fix related to using 64-bit integer.
- December 31, 2016 Version 5.1.3 (last update: 09/17/2017)
  - Bug fix for repeated factorization using SamePattern or DOFACT.
- September 30, 2017 Version 5.2.0
  - Include a number of optimizations for the factorization routines, targeted at the Intel KNL manycore node.
  - Should link with a multithreaded BLAS (GEMM) for best performance.
- October 13, 2017 Version 5.2.1
  - Change legacy makefile interface from -D_LONGINT to 'XSDK_INDEX_SIZE = 64'
  - Create superlu_dist_config.h from legacy make.inc makefiles
- October 29, 2017 Version 5.2.2
  - Include a query function superlu_dist_GetVersionNumber().
  - Fixed small issues with 64-bit index size.
- January 28, 2018 Version 5.3.0
  - Can disable linking with (Par)Metis.
  - Works with Windows.
- June 1, 2018 Version 5.4.0
  - Incorporated a parallel approximate max-weight matching as numerical pre-pivoting.
  - Fixed bugs associated with static scheduling and look-ahead, which caused deadlock for certain process grid.
- September 18, 2018 Version 6.0.0 (minor update 9/23/18)
  - Improved strong scaling of the triangular solve -- up to 4x faster than Version 5.x on 4000+ cores.
    (Requires C++ compiler)
  - Cite SIAM CSC2018 paper.
- December 6, 2018 Version 6.1.0 (minor update 12/9/18)
  - Improved threading performance in triangular solve.
  - Fixed several bugs in factorization, see Change Log page.
- February 8, 2019 Version 6.1.1
  - Several bug fixes, see Change Log page.
- Nov. 12, 2019 Version 6.2.0
  - Use cmake's FortranCInterface module to automatically handle Fortran-C name mangling. Create a new file superlu_FCnames.h.
  - Several small fixes, see Change Log page.
- Feb. 27, 2020 Version 6.3.0
  - A number of fixes for GPU code, Dec. 2019 - Jan. 2020
  - Redefine several structures as precision-dependent, so that both real and complex codes can be compiled at the same time.
  - Several small fixes, see Change Log page.
- April 2, 2020 Version 6.3.1
  - update interface to CombBLAS, include both real and complex.
    (Cite paper: A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs.)
  - update FORTRAN/ wrapper to include CMake file.
- October 23, 2020 Version 6.4.0
  - Improved CMakeLists.txt for CUDA and Fortran
  - Improved OpenMP performance for triangular solve
- May 12, 2021 Version 7.0.0
  - New communication-avoiding 3D sparse LU factorization algorithm
  - Cite paper: A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems
  - To use: start with EXAMPLE/pddrive3d.c
- October 5, 2021 Version 7.1.0
  - Much improved user interface for the 3D algorithm framework
  - For GPU code, allow OpenMP to be disabled
  - Bug fixes
  - Version 7.1.1 (Oct. 18, 2021) : small fix for Windows, dereference several unallocated arrays
- December 12, 2021 Version 7.2.0
  - Remove cub/ dependency, update xsuperlu_gpu.cu files
  - Update CombBLAS interface
- May 22, 2022 Version 8.0.0
  - Include support for AMD GPUs with HIP programming.
  - Include mixed-precision routines: 'psdrive' (single working precision), can take double-precision iterative refinement as an option.
- July 05, 2022 Version 8.1.0
  - Improved GPU U-solve performance
  - Added single precision interface to HWPM pivoting code in CombBLAS
  - Improved FORTRAN/CMakeLists.txt
- May 8, 2024 Version 9.0.0
  - extensive GPU support for 3D communication-avoiding LU and triangular solves
  - batched sparse direct solver on GPU
  - Julia interface
- November 10, 2024 Version 9.1.0
- October 22, 2025 Version 9.2.0
Change Log