• BLAS library
  • Windows users
  • Symmetric or diagonally dominant problems
  • Complex types
  • Compressed row storage
  • Convert supernodal L & U to CCS format
  • SuperLU_MT: exceeding storage error
  • SuperLU_DIST: 2D process grid
  • SuperLU_DIST: ReplaceTinyPivot
  • SuperLU_DIST: support for 64-bit integers
  • SuperLU_DIST: symbolic factorization
  • SuperLU_DIST: reproducibility
  • GPU support

  • BLAS library

    The BLAS library included in the SuperLU distribution (under CBLAS/) is provided for functionality only and is not fast. Many vendors provide a high-performance BLAS library in their math library distributions; you should link with those if they are available. Alternatively, you may obtain a public-domain fast BLAS library such as ATLAS or GotoBLAS.

    Windows users

    This was tested in MS Visual Studio. However, the configuration depends heavily on which compiler you are using. Normally there is an IDE (Integrated Development Environment) editor associated with your compiler, in which you set the include and library paths and add the SuperLU and BLAS libraries to the link step.

    If you are using a command-line-only compiler, you have to work with the makefile and the -I, -L, -O, -c options. SuperLU calls BLAS routines, but BLAS is not a native library of MS Visual Studio, so you have to build your own BLAS library in a similar way to the SuperLU library. The SuperLU distribution includes a C version of BLAS in the SuperLU/CBLAS directory. This version is functional but not fast. For speed, it is better to use a vendor-supplied BLAS (e.g., Intel MKL) or a public-domain version (e.g., ATLAS or GotoBLAS).

    Further remarks on SuperLU_MT Windows compilation:

    1. On Cygwin, both the Pthreads and OpenMP versions compile without trouble, just as on Linux.
    2. MinGW works with both Pthreads and OpenMP too, but MinGW users have to enter each directory (SRC, CBLAS, EXAMPLE, etc.) and run mingw32-make separately; it is not easy to get the top-level Makefile to work with MinGW.
    3. Visual Studio only works with the OpenMP version.
      The user needs to define the macro __OPENMP by editing the properties of the VS project.

    Symmetric or diagonally dominant problems

    SuperLU cannot take advantage of symmetry, but it can still solve the linear system as long as you input both the lower and upper triangular parts of the matrix A. If off-diagonal pivoting does not occur, the U matrix in A = L*U can be rewritten as D*L1', with L1 ~= L, modulo roundoff errors.
    In many applications, matrix A may be diagonally dominant or nearly so. In this case, pivoting on the diagonal is sufficient for stability and is preferable for sparsity to off-diagonal pivoting. To do this, the user can set a small (less-than-one) diagonal pivot threshold (e.g., 0.0, 0.01, ...) and choose an (A' + A)-based column permutation algorithm. We call this setting Symmetric Mode. To use this (in serial SuperLU), you need to set:
        options.SymmetricMode = YES;
        options.ColPerm = MMD_AT_PLUS_A;
        options.DiagPivotThresh = 0.0; /* or 0.001, 0.01, etc. */
    For serial SuperLU, an example of using symmetric mode can be found in EXAMPLE/dlinsol1.c
    (Note that when a diagonal entry is smaller than the threshold, the code will still choose an off-diagonal pivot; that is, the row permutation P_r may not be the identity.)

    For SuperLU_MT, an example can be found in EXAMPLE/pdlinsolx1.c

    The algorithm in SuperLU_DIST is already in the spirit of symmetric mode.

    Complex types

    SuperLU supports single complex and double complex data types. The double complex is defined as a pair of doubles, as follows:
      typedef struct { double r, i; } doublecomplex;
    For a variable "v" of doublecomplex type, you need to use "v.r" and "v.i" to refer to the real and imaginary parts, respectively. For an array "a[]" of doublecomplex type, you can refer to the k-th element by "a[k].r" or "a[k].i".

    Compressed row storage

    The factorization and solve routines in SuperLU are designed to handle column-wise storage only. If the input matrix A is in row-oriented storage, i.e., in SLU_NR format, then the driver routines (dgssv() and dgssvx() in serial SuperLU, and pdgssv() and pdgssvx() in SuperLU_MT) actually perform the LU decomposition on A^T, which is column-wise, and solve the system using the L^T and U^T factors. The data structures holding L and U on output are therefore different (swapped) from those you would get from column-wise input. For a more detailed description of this process, please refer to the leading comments of the routines dgssv/dgssvx and pdgssv/pdgssvx.

    Convert supernodal L & U to CCS format

    The L matrix is represented in a *supernodal* format, defined by the structure SCformat{...} in supermatrix.h. In serial SuperLU, the routine LUextract() in MATLAB/mexsuperlu.c converts L & U into their respective CCS formats (for MATLAB to use).
    In SuperLU_MT, the data structure for U is the same. The difference in L is that its columns are permuted, that is, the supernodes are stored out of order. This format is defined by the structure SCPformat{...} in supermatrix.h. Section 3.2 of the Users' Guide explains it.

    SuperLU_MT: exceeding storage error

    SuperLU_MT does not support dynamic memory expansion. It pre-allocates storage based on the number of nonzeros in the original matrix A and the estimated fill ratios given by the 6th, 7th, and 8th parameters in SRC/sp_ienv.c. If the estimate is too small, you may get the following error message:
        Storage for L subscripts exceeded; Current column xxxx; Need at least yyyyyy;
        You may set it by the 8-th parameter in routine sp_ienv().
    Then you need to set the corresponding parameter to a larger value. Section 3.5.2 of the Users' Guide explains this in more detail.

    SuperLU_DIST: 2D process grid

    For best performance, the process grid should be arranged to be as square as possible. When a square grid is not possible, it is better to make the row dimension (nprow) slightly smaller than the column dimension (npcol). For example, the following are good choices: 2x3, 2x4, 4x4, 4x8, etc.

    SuperLU_DIST: ReplaceTinyPivot

    SuperLU_DIST uses a "static pivoting" strategy, which pre-pivots large elements of A to the diagonal before numerical factorization. No dynamic pivoting is performed during factorization. This is less rigorous than classical partial pivoting, but makes the method much more scalable. During factorization, if we encounter a pivot smaller than sqrt(epsilon) * norm(A), we bump it up to sqrt(epsilon) * norm(A). This corresponds to a half-of-double-precision perturbation; that is, we compute the factorization of a nearby matrix whose distance from the original is about half of double precision. This trick avoids division by small pivots, and works well for many problems whose condition numbers are not too large.

    For very ill-conditioned problems this perturbation may be too large; you can turn it off by setting:

        options.ReplaceTinyPivot = NO;

    SuperLU_DIST: support for 64-bit integers

    Sequential SuperLU does not explicitly support 64-bit integers; all integers are declared as "int". The parallel SuperLU_DIST declares integers as "int_t", which can be defined as "int" or "long int" at compile time. This is explained in the top-level README file.

    If you use 64-bit integers and (Par)METIS for sparsity-preserving ordering, you need to install (Par)METIS with 64-bit integers as well.

    SuperLU_DIST: symbolic factorization

    The default setting uses the serial symbolic factorization routine, which can be used together with any fill-reducing ordering. The disadvantage is that it gathers the entire graph of A onto one process, which may become a memory bottleneck. To use the parallel symbolic factorization routine, you need to set the following:
        options.ParSymbFact    = YES;
        options.ColPerm        = PARMETIS;
    Note that parallel symbolic factorization must be used together with a parallel ordering algorithm, such as ParMETIS, with which the amount of fill may be larger than that given by a serial ordering.

    SuperLU_DIST: reproducibility

    SuperLU_DIST does not guarantee bit-wise reproducible results. The results may be reproducible with one MPI task and one OpenMP thread, but with multiple MPI tasks the communication order and computation order may differ from run to run: the algorithm uses asynchronous pipelined execution that overlaps communication and computation, so there is no guarantee of the same order of floating-point operations. For ill-conditioned problems, a different order of operations may cause a large change in the solution. For problems that are not so ill-conditioned, the difference may be only in the last few digits. You can check the returned error bound to help confirm whether the difference between runs is as large as should be expected.

    GPU support

    Only SuperLU_DIST has GPU support, currently on NVIDIA/CUDA. In the near future, we will support AMD and Intel GPUs. Serial SuperLU uses partial pivoting, with row interchanges during elimination and fully dynamic memory management, so it is not beneficial to parallelize it on GPUs. SuperLU_DIST uses static pivoting, which is potentially less stable than serial SuperLU but has separate pivoting and symbolic factorization phases, so we can tackle them one at a time.