(omega-dev-tridiagonal)=
# Tridiagonal Solvers

Omega provides batched solvers for tridiagonal systems of equations. Here, batched means that many systems of the same size need to be solved at the same time. Typically, this is because one system needs to be solved per ocean column. In addition to solvers for general tridiagonal systems, specialized solvers for diffusion-type systems with better stability properties are available. To obtain portable performance, different solution algorithms are used on CPUs and GPUs. The solvers implement the same user interface on both, so users should be able to ignore which algorithm is used. Two type aliases are provided that resolve to the optimal solver for a given architecture:
- `TriDiagSolver` for general systems
- `TriDiagDiffSolver` for diffusion-type systems

## Representations of tridiagonal systems

### General tridiagonal system

A general tridiagonal system is represented by three vectors `DL`, `D`, and `DU`, holding the lower, main, and upper diagonal, respectively, and a right-hand-side vector `X`
$$
(DL)_i Y_{i - 1} + D_i Y_i + (DU)_i Y_{i + 1} = X_i.
$$

### Diffusion-type tridiagonal system

A diffusion-type system is represented by two vectors `G` and `H`, and a right-hand-side vector `X`. It has the form
$$
-G_{i - 1} Y_{i - 1} + (G_{i - 1} + G_i + H_i) Y_i - G_{i} Y_{i + 1} = X_i.
$$

## Using tridiagonal solvers

There are two main ways to use Omega batched tridiagonal solvers:
- Create global arrays storing the coefficients of all systems and call the top-level `solve` method
- Use a team-level solver inside a parallel loop using a Kokkos team policy

While the first method is simpler, the second is preferable for computational performance, because it allows one to define a system, solve it, and post-process the solution in a single parallel loop. With the first method this would require multiple parallel loops, as well as allocation of storage for the coefficients of all systems, which are typically no longer needed once the solution is obtained.

### Top-level method

To solve `NBatch` systems of size `NRow` using the general solver, four arrays `DL`, `D`, `DU`, and `X` need to be defined. All arrays are of size (`NBatch`, `NRow`) and contain the coefficients of each system.
```c++
Array2DReal DL("DL", NBatch, NRow); // DL(n, :) = lower diagonal of system n
Array2DReal D("D", NBatch, NRow);   // D(n, :)  = main diagonal of system n
Array2DReal DU("DU", NBatch, NRow); // DU(n, :) = upper diagonal of system n
Array2DReal X("X", NBatch, NRow);   // X(n, :)  = rhs of system n
```
Once these arrays are filled with the coefficients, the systems are solved by calling the static `TriDiagSolver::solve` method
```c++
TriDiagSolver::solve(DL, D, DU, X);
```
After this call, the rhs array `X` contains the solution.

Solving diffusion-type systems is similar, except that the input arrays contain coefficients in the form presented [above](#diffusion-type-tridiagonal-system) and the static `TriDiagDiffSolver::solve` method is called
```c++
Array2DReal G("G", NBatch, NRow); // G(n, :) = coefficients G_i of system n
Array2DReal H("H", NBatch, NRow); // H(n, :) = coefficients H_i of system n
Array2DReal X("X", NBatch, NRow); // X(n, :) = rhs of system n

// fill G, H, X

TriDiagDiffSolver::solve(G, H, X);
```
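As an illustration of where a diffusion-type system arises, consider a backward-Euler step of a vertical diffusion equation on a uniform grid. The sketch below is only an example, not Omega's actual discretization: the time step `Dt`, layer thickness `Dz`, interface diffusivities `Kappa`, and previous-time-level field `Temp` are hypothetical names, a plain Kokkos `MDRangePolicy` is used to fill the arrays, and no-flux boundaries are assumed (the last-row coupling is set to zero, and the solver is assumed to treat the out-of-range coefficient $G_{-1}$ of the first row as zero).
```c++
// Sketch only: backward-Euler vertical diffusion with hypothetical inputs
// Dt (time step), Dz (uniform layer thickness), Kappa(n, k) (diffusivity at
// the interface between rows k and k + 1), and Temp(n, k) (field at the
// previous time level).
Array2DReal G("G", NBatch, NRow);
Array2DReal H("H", NBatch, NRow);
Array2DReal X("X", NBatch, NRow);

Kokkos::parallel_for(
    "fillDiffusionSystems",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {NBatch, NRow}),
    KOKKOS_LAMBDA(int N, int K) {
       // G(N, K) couples row K to row K + 1; the last row has no lower
       // neighbor, so its coupling is zero (no-flux bottom boundary)
       G(N, K) = (K < NRow - 1) ? Dt * Kappa(N, K) / (Dz * Dz) : 0;
       // H holds the remaining diagonal contribution, here from the time term
       H(N, K) = 1;
       // right-hand side is the field at the previous time level
       X(N, K) = Temp(N, K);
    });

// X is overwritten with the updated field
TriDiagDiffSolver::solve(G, H, X);
```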
### Team-level method

Note: since the steps for using the team-level general and specialized solvers are similar, this subsection first shows how to use the general solver. At the end, an example of using the specialized solver is presented.

The team-level solvers need to be used inside a parallel loop using a Kokkos team policy. To create a team policy for solving `NBatch` systems of size `NRow`, the static member function `makeTeamPolicy` is used
```c++
TeamPolicy Policy = TriDiagSolver::makeTeamPolicy(NBatch, NRow);
```
Every team of threads except possibly the last is responsible for solving `VecLength` systems. Since the total number of systems is typically not evenly divisible by `VecLength`, the last team may solve fewer than `VecLength` systems. Hence, it is necessary to handle this with if statements.

Typically, the parallel loop consists of three main parts:
- defining the system
- solving the system
- using the solution

At the beginning of the loop, temporary scratch storage for the coefficients of `VecLength` systems needs to be allocated. This is done by creating a `TriDiagScratch` struct, which contains four member arrays `DL`, `D`, `DU`, and `X` of size (`NRow`, `VecLength`). Note that, contrary to the global arrays approach, the system dimension comes first. These arrays need to be filled with the coefficients of the `VecLength` systems solved by the team. This is done in a parallel loop over the system size (using `TeamThreadRange`) and a serial loop over `VecLength`. Overall, the code that defines the system coefficients typically looks like this:
```c++
parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {

   // create scratch data
   TriDiagScratch Scratch(Member, NRow);
   int IStart = Member.league_rank() * VecLength;

   // define the systems coefficients
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            Scratch.DL(K, IVec) = ...;
            Scratch.D(K, IVec)  = ...;
            Scratch.DU(K, IVec) = ...;
            Scratch.X(K, IVec)  = ...;
         }
      }
   });
   ...
});
```
To perform a team-level solve, a different overload of the static member function `TriDiagSolver::solve` needs to be called. This overload takes the team member `Member` and the filled scratch struct `Scratch`. The call needs to be surrounded by barriers to ensure that the parallel loop setting the coefficients has finished and to make the computed solution available to all threads.
```c++
parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {

   // previous code

   // solve the system
   Member.team_barrier();
   TriDiagSolver::solve(Member, Scratch);
   Member.team_barrier();
   ...
});
```
After the call to `solve` the solution is available in `Scratch.X`. It can be manipulated further or simply stored.
```c++
parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {

   // previous code

   // multiply the solution by 2 (just an example)
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         Scratch.X(K, IVec) *= 2;
      }
   });
   Member.team_barrier();

   // store the solution
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            X(I, K) = Scratch.X(K, IVec);
         }
      }
   });
});
```
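For reference, the fragments above can be assembled into one complete loop for the general solver. This is only a sketch: the coefficient values (shown as `...`) are whatever the application requires, and `X` is assumed to be a global (`NBatch`, `NRow`) array that receives the solutions.
```c++
// create team policy
TeamPolicy Policy = TriDiagSolver::makeTeamPolicy(NBatch, NRow);

parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {

   // create scratch data
   TriDiagScratch Scratch(Member, NRow);
   int IStart = Member.league_rank() * VecLength;

   // define the systems coefficients
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            Scratch.DL(K, IVec) = ...;
            Scratch.D(K, IVec)  = ...;
            Scratch.DU(K, IVec) = ...;
            Scratch.X(K, IVec)  = ...;
         }
      }
   });

   // solve the system
   Member.team_barrier();
   TriDiagSolver::solve(Member, Scratch);
   Member.team_barrier();

   // store the solution
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            X(I, K) = Scratch.X(K, IVec);
         }
      }
   });
});
```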
The team-level specialized diffusion solver can be used similarly; the only substantial difference is in the form of the coefficients. A minimal complete example of using it is shown below.
```c++
// create team policy
TeamPolicy Policy = TriDiagDiffSolver::makeTeamPolicy(NBatch, NRow);

parallel_for(Policy, KOKKOS_LAMBDA (const TeamMember &Member) {

   // create scratch data
   TriDiagDiffScratch Scratch(Member, NRow);
   int IStart = Member.league_rank() * VecLength;

   // define the systems coefficients
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            Scratch.G(K, IVec) = ...;
            Scratch.H(K, IVec) = ...;
            Scratch.X(K, IVec) = ...;
         }
      }
   });

   // solve the system
   Member.team_barrier();
   TriDiagDiffSolver::solve(Member, Scratch);
   Member.team_barrier();

   // store the solution
   parallel_for(TeamThreadRange(Member, NRow), [=] (int K) {
      for (int IVec = 0; IVec < VecLength; ++IVec) {
         const int I = IStart + IVec;
         // handle last team having fewer systems
         if (I < NBatch) {
            X(I, K) = Scratch.X(K, IVec);
         }
      }
   });
});
```
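One way to sanity-check a team-level usage is to solve the same systems with the top-level solver and compare the results. The sketch below assumes hypothetical global arrays `GGlob`, `HGlob`, and `XGlob` holding the same coefficients and right-hand sides that were written into the scratch arrays above, with `X` holding the team-level solution.
```c++
// Cross-check sketch: solve the same systems with the top-level solver and
// compare against the team-level result stored in X. GGlob, HGlob, XGlob are
// hypothetical global copies of the coefficients and right-hand sides.
Array2DReal XRef("XRef", NBatch, NRow);
Kokkos::deep_copy(XRef, XGlob);
TriDiagDiffSolver::solve(GGlob, HGlob, XRef);

Real MaxDiff = 0;
Kokkos::parallel_reduce(
    "compareTriDiagSolutions",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {NBatch, NRow}),
    KOKKOS_LAMBDA(int N, int K, Real &LocalMax) {
       const Real Diff = Kokkos::fabs(X(N, K) - XRef(N, K));
       if (Diff > LocalMax)
          LocalMax = Diff;
    },
    Kokkos::Max<Real>(MaxDiff));

// MaxDiff should be at the level of round-off for a correct setup
```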