Technical Note : Improving the computational efficiency of sparse matrix multiplication in linear atmospheric inverse problems

Matrix multiplication of two sparse matrices is a fundamental operation in linear Bayesian inverse problems for computing covariance matrices of observations and a posteriori uncertainties. Applications of sparse-sparse matrix multiplication algorithms for specific use cases in such inverse problems remain unexplored. Here we present a hybrid-parallel sparse-sparse matrix multiplication approach that is more efficient by a third in terms of execution time and operation count relative to standard sparse matrix multiplication algorithms available in most libraries. Two modifications of this hybrid-parallel algorithm are also proposed for the types of operations typical of atmospheric inverse problems, which further reduce the cost of sparse matrix multiplication by yielding only upper triangular and/or dense matrices.


Introduction
Sparse-sparse (SS) matrix multiplication forms the computational backbone of scientific computation in many fields. In this study, applications of SS matrix multiplication to the solution of linear Bayesian inverse problems aimed at estimating greenhouse gas emissions are presented (for details on the methodological formulation of these inverse problems, see Enting 2005). However, the issues associated with SS matrix multiplication discussed here have wide implications for many fields of research.
Most of the SS matrix multiplication algorithms that can be used with any sparse matrix structure (e.g., diagonal, block-diagonal, Toeplitz, binary, and Vandermonde matrices) use modifications of the algorithm proposed by Gustavson (1978).
Parallel implementations of this algorithm for matrices stored in compressed sparse row (CSR) format for CPUs and GPUs appear in standard libraries (NVIDIA 2016; Intel 2016a). For CPUs, these general-purpose parallel implementations either (1) precompute the number of non-zero entries in the output matrix (e.g. a matrix B) and then compute the matrix-matrix product (for details see the Math Kernel Library routine "mkl_csrmultcsr", Intel 2016b), or (2) progressively adjust the space required to store the output matrix (for details see Davis 2006; Gilbert, Moler, and Schreiber 1992). Alternatively, on a GPU, a sufficiently large non-adjustable memory (i.e., random access memory) is assigned to store the output of SS matrix multiplication (variations of this approach are proposed in Liu and Vinter 2014). Other approaches for increasing the computational efficiency of SS matrix multiplication have focused on specific applications, such as SS matrix multiplication for hyper-sparse matrices (Buluç and Gilbert 2010), multigrid PDEs (McCourt et al., 2013), or multiplication of a binary matrix with a dense matrix (Blelloch et al., 2008), among others.
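To make the row-wise structure of Gustavson's (1978) algorithm concrete, the following is a minimal Python sketch over raw CSR arrays: each output row of C = A·B is formed by scaling and merging rows of B through a sparse accumulator. This is a simplified illustrative stand-in, not the compiled routines provided in the supplementary material.

```python
def spgemm_gustavson(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    """Row-wise SS product C = A * B, all matrices in CSR form (Gustavson-style)."""
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator: column index -> partial sum for row i of C
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j, a_ij = a_idx[jj], a_val[jj]
            # merge row j of B, scaled by A[i, j], into the accumulator
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                k = b_idx[kk]
                acc[k] = acc.get(k, 0.0) + a_ij * b_val[kk]
        for k in sorted(acc):  # emit row i of C with sorted column indices
            c_idx.append(k)
            c_val.append(acc[k])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

Note that no size of the output is known in advance here; the general-purpose library implementations discussed above differ mainly in how they pre-allocate or grow the storage behind `c_idx` and `c_val`.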
The development of SS matrix multiplication approaches tailored to improving the computational efficiency of atmospheric inverse problems has not been previously explored. Here, we address this need and propose (1) a single-pass hybrid-parallel algorithm for performing matrix-matrix multiplication on CPUs, and describe conditions under which this approach can be computationally more efficient than a standard dual-pass algorithm, (2) a modification of Gustavson's (1978) algorithm for obtaining the upper or lower triangular portion of a symmetric sparse matrix, and (3) a modification of Gustavson's (1978) algorithm for obtaining a dense matrix from the multiplication of two sparse matrices. In lieu of providing pseudocode, the source code for all the proposed algorithms is provided in the supplementary material. This source code can be compiled and directly incorporated into linear atmospheric inverse modeling procedures.

Application of sparse-sparse matrix multiplication in inverse problems
In a previous work, Yadav and Michalak (2013) presented an efficient method for multiplying two matrices in the context of inverse problems when one of the matrices could be expressed as a Kronecker product of two smaller matrices. This form of matrix multiplication occurs in inverse problems when the Jacobian H of the forward problem (i.e. the sensitivity matrix of the elements of the observation vector to the elements of the state vector) is multiplied by a covariance matrix Q that describes the decorrelation of prior uncertainties in space and in time. In this special case, the Yadav and Michalak (2013) algorithm accommodates all possible combinations of sparse and dense matrix multiplication (e.g. sparse-sparse, dense-dense, dense-sparse and sparse-dense).
However, this method does not improve the computational efficiency of matrix multiplication (a) when Q is sparse and cannot be expressed as a Kronecker product, or (b) when a matrix HQ that does not have any special structure needs to be multiplied by H^T, which is another common operation in inverse problems. In the first case, the matrix multiplication can be performed through a general sparse matrix routine, whereas in the second case computational efficiency can be improved by considering the properties of HQH^T, such as the fact that multiplication of H with a symmetric matrix Q followed by multiplication by H^T results in a symmetric matrix (HQH^T). Thus, computational efficiency can be gained by computing only the upper or lower triangular portion of HQH^T. Additional computational efficiency can be gained when it is known that the output of the matrix multiplication of two SS matrices will yield a dense matrix. Such a case is common in linear inverse problems because the multiplication of H by Q reduces sparsity, and therefore HQ, when multiplied by H^T, results in a dense symmetric matrix. In these situations, pre-allocation of memory for HQH^T is straightforward, and if only the upper or lower triangular matrix is required, then obtaining a dense matrix from the multiplication of two sparse matrices is even more efficient. Routines for obtaining a triangular matrix as an output from the matrix multiplication of two dense matrices already appear in the Math Kernel Library, and therefore development of an algorithm to accommodate this situation for inverse problems is not required (for details see "A Matrix Multiplication Routine That Updates Only the Upper or Lower Triangular Part of the Result Matrix", Intel 2016c).
Thus, algorithmically, SS multiplication in inverse problems manifests itself in three possible forms: (a) multiplication of a sparse matrix H with a sparse diagonal, sparse block-diagonal, or full dense covariance matrix Q; (b) multiplication of a sparse HQ with a sparse H^T that results in a sparse symmetric matrix; and (c) multiplication of a sparse HQ with a sparse H^T that results in a dense symmetric matrix. Each of these three forms of matrix multiplication requires a distinct algorithmic treatment for increasing computational efficiency. The names of the routines
that perform these three forms of SS matrix multiplication, as given in supplementary material, are shown in Figure 1 in terms of matrices HQ and H^T. Note that even though the three forms of matrix multiplication described above are given in terms of atmospheric inverse problems, they are also common in other applications (e.g., Cressie and Johannesson, 2008).
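For the case where the product of HQ and H^T is known to be symmetric, only entries on or above the diagonal need to be computed. The following is a minimal Python sketch of this idea over CSR inputs, writing directly into a pre-allocated dense array; it is an illustrative analogue of the supplied compiled routines, not the routines themselves.

```python
def spgemm_dense_upper(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n):
    """Upper triangle of a product C = A * B known to be symmetric
    (e.g. A = HQ and B = H^T with symmetric Q), stored in a dense (n x n) array."""
    c = [[0.0] * n for _ in range(n)]  # memory pre-allocated up front
    for i in range(n):
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j, a_ij = a_idx[jj], a_val[jj]
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                k = b_idx[kk]
                if k >= i:  # skip the strict lower triangle; symmetry supplies it
                    c[i][k] += a_ij * b_val[kk]
    return c
```

Roughly half the floating-point updates are skipped, and no symbolic pass is needed because the dense output size is fixed by n alone.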

Classical approach for multiplication of two sparse matrices
The classical approach for matrix multiplication of two SS matrices involves either a single-pass or a double-pass algorithm. This approach relies on keeping track of the indices of non-zero entries in the two matrices to be multiplied. The operations involved are shown in Appendix 1 in the function "dense_nosym." Note that this function does not depict automatic adjustment of the memory and results in a dense matrix. In the single-pass version of the classical algorithm (e.g. as adopted in Matlab; Gilbert, Moler, and Schreiber 1992), a set amount of memory is assigned for storing the output of the matrix multiplication, which is then adjusted if the size of the output exceeds the assigned memory. In a double-pass algorithm, on the other hand, the number of non-zero entries in the output matrix is first computed (this is also referred to as symbolic matrix multiplication), after which the SS matrix multiplication is performed.
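The symbolic first pass of the double-pass algorithm can be sketched as follows in Python: it traverses the same index structure as the numeric multiplication, but only counts the distinct output columns per row (using a marker array so that a column reached through several rows of B is counted once). The function name is illustrative.

```python
def symbolic_nnz(a_ptr, a_idx, b_ptr, b_idx, n_rows, n_cols_b):
    """Symbolic pass of a double-pass SpGEMM: non-zero count for each row of A * B."""
    row_nnz = []
    marker = [-1] * n_cols_b  # last output row that touched each column
    for i in range(n_rows):
        count = 0
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j = a_idx[jj]
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                k = b_idx[kk]
                if marker[k] != i:  # first time column k appears in row i
                    marker[k] = i
                    count += 1
        row_nnz.append(count)
    return row_nnz
```

Summing `row_nnz` gives the exact storage needed for the numeric pass, at the cost of traversing both index structures twice overall.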

Parallel approach for multiplying two sparse matrices
For the purpose of application to atmospheric inverse problems, we parallelize the classical single-pass algorithm and divide the work of multiplying two SS matrices across threads and nodes by rows. This can be defined as one-dimensional (1D) matrix multiplication, as partitioning of the matrix multiplication between two conforming matrices happens on the basis of the rows of the first matrix (see Figure 2). It is similar to the 1D row-wise SS matrix multiplication discussed in Ballard et al. (2016).
However, Ballard et al. (2016) do not provide a methodology for computing non-zero entries in the output matrix, which in our approach can be specifically tailored for inverse problems. Even though the 1D algorithm implemented here can be used for general SS matrix multiplication, it is most suited for multiplying H and Q, or HQ and H^T, in linear Bayesian inverse problems, as: (1) each row in H in these problems is guaranteed to have non-zero entries, which allows good load balancing across nodes and threads, and (2) the memory for storing non-zero entries in the output matrix can be computed efficiently if the structure of the covariance matrix Q is pre-specified (e.g., diagonal, block-diagonal or full), as allowed in the routines given in supplementary material.
The application of the 1D approach for SS matrix multiplication in a hybrid environment is exemplified in Figure 2, where an arbitrary (r × t) matrix A is multiplied by a (t × t) matrix B using two nodes. In this case, each of the two nodes gets r/2 rows of A to multiply with matrix B. This task is further divided across two threads on each node, and in the final step the results are collected to obtain the (r × t) matrix AB. As parallel SS matrix multiplication is communication-limited, performing this computation on a single node by using multiple threads is much faster relative to SS matrix multiplication accomplished by using a single thread on each of multiple nodes. This happens because matrices A and B, as shown in Figure 2, are shared by threads, and the only private variables are the partitioned outputs of the matrix multiplication.
The 1D approach is simple; however, knowing the number of non-zero entries in the output matrix can significantly reduce the computational cost of SS matrix multiplication, as it allows efficient pre-allocation of memory for storing the output of the matrix multiplication. In linear Bayesian inverse problems, even though the exact number of non-zero entries in the output matrix is unknown, an estimate can be obtained by looking at the structure of Q and the sensitivity matrix H, which is sparse and banded in the case of regional inversions, where the elements of the state space only influence observations over a finite period of time (see Figure 3; for details on regional inversions see Gourdji et al., 2010). For example, when Q is a diagonal matrix, the number of non-zero entries in HQ would be equal to the number of non-zero entries in H.
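The 1D partitioning described above can be sketched in Python using a thread pool: the CSR arrays of A and B are shared read-only, each worker computes the output rows for its private block, and a final step stitches the blocks back into one CSR matrix. The supplementary routines are compiled hybrid code; the helper names below are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def _row_block(rows, a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    # Worker-private output: CSR pieces for one contiguous block of rows of A.
    ptr, idx, val = [0], [], []
    for i in rows:
        acc = {}  # sparse accumulator for row i of A*B
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j, a_ij = a_idx[jj], a_val[jj]
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                acc[b_idx[kk]] = acc.get(b_idx[kk], 0.0) + a_ij * b_val[kk]
        for k in sorted(acc):
            idx.append(k)
            val.append(acc[k])
        ptr.append(len(idx))
    return ptr, idx, val

def spgemm_1d(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows, n_workers=2):
    """1D (row-wise) parallel SpGEMM: A and B shared, rows of A partitioned."""
    chunk = (n_rows + n_workers - 1) // n_workers
    blocks = [range(s, min(s + chunk, n_rows)) for s in range(0, n_rows, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(
            lambda r: _row_block(r, a_ptr, a_idx, a_val, b_ptr, b_idx, b_val),
            blocks))
    # Final step: concatenate the private block outputs into one CSR matrix.
    c_ptr, c_idx, c_val = [0], [], []
    for ptr, idx, val in parts:
        base = c_ptr[-1]
        c_ptr.extend(base + p for p in ptr[1:])
        c_idx.extend(idx)
        c_val.extend(val)
    return c_ptr, c_idx, c_val
```

Because rows of the first matrix are independent, no synchronization is needed until the concatenation step, which mirrors the collection step shown in Figure 2.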
Due to the sparsity structure of H in regional atmospheric inverse problems, approximate estimates of the non-zero entries in HQ can also be obtained for dense and block-diagonal Q, as shown in Table 1. Since HQH^T is symmetric and most likely to be dense or to have a significant proportion of non-zero entries, it should always be obtained as a dense upper or lower triangular matrix. For inversions where elements of the state space indefinitely impact later observations (e.g. global atmospheric inversions), H is denser than in the regional inversion case, and the default formulation discussed in the next paragraph should be used for multiplying H and Q.
In the supplementary material, we show the application of the 1D SS matrix multiplication algorithm in a multithreaded framework, which can be easily extended to a hybrid computing environment consisting of multiple nodes and multiple threads. If the covariance structure of Q is specified, then we use Table 1 to compute the non-zero entries in the output matrix. In case it is not specified, the algorithm pre-allocates the number of non-zero entries in the output matrix as a percentage. In the default case, it is assumed that the output matrix would have 10% non-zero entries (e.g. 0.10rt for storing the matrix AB obtained by multiplying A and B), divided across nodes and/or threads. This percentage can be overridden and lowered. As implemented in the source code, if it is instead specified as >10%, then an upper limit of 10% is set and adjusted after this threshold is crossed, i.e., when a larger amount of memory is required. In automated mode, the size of the memory is doubled after the threshold is reached; however, this can also be overridden. If a lower amount of memory is utilized than what is specified, then we adjust the memory at the end of the matrix multiplication.
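The default pre-allocation and doubling policy described above can be sketched as follows; the function names and the exact form of the cap are illustrative, not the supplementary code itself.

```python
def preallocate_nnz(r, t, fill_fraction=0.10):
    """Initial non-zero estimate for an (r x t) output when Q's structure is
    unspecified. User-specified fractions above 10% are capped at 10%."""
    fill_fraction = min(fill_fraction, 0.10)
    return max(1, int(fill_fraction * r * t))

def grow_capacity(capacity, needed):
    """Automated mode: double the allocation each time the estimate is exceeded."""
    while capacity < needed:
        capacity *= 2
    return capacity
```

If the multiplication finishes having used less storage than allocated, the buffer is shrunk once at the end, mirroring the final memory adjustment described above.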

Performance of the algorithm
The performance of the 1D SS matrix multiplication algorithm is evaluated by comparing it with the "mkl_csrmultcsr" routine provided in Intel's MKL library. A direct comparison of the computational efficiency of the proposed algorithms with algorithms designed for GPUs is not performed, as (1) GPUs are not universally available, (2) a considerable portion of the time required to multiply two SS matrices on a GPU is spent transferring data to the GPU, and (3) most large inverse problems require out-of-core implementations, which make SS matrix multiplication inefficient on GPUs.
A real application from an atmospheric inverse problem is used to assess the performance of the two algorithms. In this application, double-precision matrices HQ and H^T of dimensions (n × m) and (m × n), where n = 1070, m = 10^7, and NNZ(HQ) = NNZ(H^T) = 8.55 × 10^7, are multiplied in a multithreaded environment to yield a dense (n × n) symmetric matrix with 1070^2 ≈ 1.14 × 10^6 non-zero entries. Even though computing an upper or lower triangular matrix would suffice for this problem, a full dense symmetric matrix was obtained in CSR form, and the number of non-zero entries in the output matrix was not specified. The performance of the two algorithms was assessed on an Intel Xeon E5440 Harpertown 2.83 GHz computer with a 12 MB L2 cache. They were compared on three metrics, namely: (1) the floating-point operations (FLOPs) required, (2) the time taken for execution, and (3) peak and total memory usage. The results presented in Table 2 show that the proposed algorithm is approximately 33% faster in execution time and takes approximately 38% fewer FLOPs than the "mkl_csrmultcsr" routine included in Intel's MKL. However, as the 1D parallel algorithm does not know the exact number of non-zeros in the output matrix, its total and peak memory usage is higher than for "mkl_csrmultcsr", which requires an exact count of the number of non-zero entries in the output matrix. The difference in the performance of the 1D parallel algorithm and "mkl_csrmultcsr" is primarily due to the elimination of the time required to compute the non-zero entries in the output matrix. These differences become even more pronounced when only the upper or lower triangular sparse or dense matrix is required from the matrix multiplication of sparse HQ and H^T. Evaluation of the computational efficiency for obtaining these triangular matrices was not performed, as the FLOPs required for computing the matrix product in these cases are substantially lower than for "mkl_csrmultcsr". Moreover, if the dense triangular product is obtained, then the need to compute the number of non-zero entries is also eliminated.
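When the full dense symmetric matrix is needed, as in the comparison above, it can be recovered from an upper triangular result by mirroring, so computing only the triangle loses no information. A trivial Python sketch (illustrative, not the supplementary code):

```python
def mirror_upper(c):
    """Fill the strict lower triangle of a dense square matrix from its
    upper triangle, in place, assuming the intended matrix is symmetric."""
    n = len(c)
    for i in range(n):
        for k in range(i + 1, n):
            c[k][i] = c[i][k]
    return c
```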

Conclusion
Matrix multiplication remains a critical bottleneck in the solution of large-scale linear inverse problems and for many other scientific applications, and improving the computational efficiency of matrix multiplication is therefore an active area of research. A practical algorithm that can reduce the asymptotic bounds of matrix multiplication below O(n^2.81) remains elusive. However, novel algorithms are being proposed for eliminating this bottleneck for specific classes of matrices. The methods for SS matrix multiplication proposed here increase the computational efficiency of atmospheric inversions by accounting for the sparsity structure of the sensitivity matrix and the a priori covariance matrix. Three different formulations of SS matrix multiplication are proposed. The source code for each of these is provided for direct integration into software designed for the solution of inverse problems. Hardware-specific optimizations can further improve the performance of the proposed algorithms. Updates to the algorithms proposed in this work would be required for porting them to new computer architectures and will be explored as part of further research.
Geosci. Model Dev. Discuss., doi:10.5194/gmd-2016-204, 2016. Manuscript under review for journal Geosci. Model Dev. Published: 21 September 2016. © Author(s) 2016. CC-BY 3.0 License.

Figure 1 :
Figure 1: Software routines as given in supplementary material for sparse-sparse matrix multiplication in inverse problems.

Figure 2 :
Figure 2: Example of one-dimensional hybrid-parallel sparse-sparse matrix multiplication. Note that in this figure it is assumed that A has an even number of rows. If A has an odd number of rows, the 1D parallel algorithm divides them across nodes and threads as evenly as possible.

Figure 3 :
Figure 3: Non-zero entries (brown) in a typical sensitivity matrix (H) used in regional atmospheric inversions.