## Solving Linear Systems and Inverting a Matrix is equivalent to Matrix Multiplication

Matrix multiplication consists on given two matrices $A$ and $B$, say for now both of dimensions $n \times n$, compute $C = A \cdot B$ with $C_{ij} = \sum_k A_{ik} \cdot B_{kj}$. The definition implies a naive algorithm of running-time $O(n^3)$ for multiplying two matrices. When I first heard about that, I was somewhat surprised that one can multiply matrices faster then that. The simpler algorithm beating the naive $O(n^3)$ was given by Strassen (1969) and has running-time $O(n^{\log_2 7})$ and is based on a simple observation. Say $n = 2^k$ and break each matrix in 4 blocks of dimensions $2^{k-1} \times 2^{k-1}$. So we get something like that:

$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{bmatrix}$
which are 8 multiplications of smaller matrices. Strassen found a recursion that computes the $C_{ij}$ factors using 7 matrix multiplication, which leads to $O(n^{\log_2 7})$. This number has been improved various times by Coppersmith and Winograd, Stothers, Vassilevska-Williams and Le Gall. The current best algorithm has running-time $O(n^{2.3728})$. No lower bound is known for this problem other then $O(n^2)$ which is the time needed to read the matrices. The constant $\omega$ is defined as the minimum constant for which it is possible to multiply matrices in time $O(n^\omega)$. So far it is known that $2 \leq \omega \leq 2.3728...$.

From multiplying matrices to solving linear systems

What I learned this week is that solving linear systems, inverting a matrix and computing determinants can be done also in time $O(n^\omega)$ with an ingenious idea by Hopcroft and Bunch (1974). Next we show how to do it following the presentation in Mucha and Sankowski (2004). In one line, the idea consists in performing a lazy LU decomposition.

The LU decomposition is another way to refer to the Gaussian elimination procedure. We decompose a matrix $A$ in the form $A = L U$ where L is a lower-diagonal matrix with ones in the diagonal and U is an upper-diagonal matrix with ones in the diagonal. Here are some great lecture notes on the topic, but here is a concise description. The following code is in Octave:

function [L,A] = naive_lu(A)
n = length(A);
L = eye(n);
for i = 1 : n
L(i+1:n,i) = A(i+1:n,i) / A(i,i);
A(i+1:n,i:n) = A(i+1:n,i:n) - L(i+1:n,i)*A(i,i:n);
endfor
endfunction


This implementation is very naive in the sense that it doesn’t do pivoting, it doesn’t work for non-singular matrices and so on, but it should be fine for a random matrix, for example. This captures the essence of Gaussian elimination: one makes the matrix triangular by eliminating rows one-by-one. The complexity of the naive implementation is $O(n^3)$ since in each step we update a matrix of size $n-i \times n-i+1$.

The great thing of having an LU decomposition is that now solving $Ax = b$ is very easy, since solving $LUx = b$ involves solving $L y = b$ and then $U x = y$. Since L and U are both triangular, this can be solved in $O(n^2)$ time. Also, we can compute determinants easily, since $det(A) = det(L) \cdot \det(U) = \prod_{i=1}^n u_{ii}$. The inverse of A is given by $A^{-1} = U^{-1} \cdot L^{-1}$ which is twice the time to invert a triangular matrix plus the time to multiply to matrices. The following argument shows that inverting a triangular matrix can be done in the same time as multiplying two matrices: consider an upper triangular matrix U, for example, then:

$\begin{bmatrix} U_{11} & U_{12} \\ & U_{22} \end{bmatrix} \cdot \begin{bmatrix} Y_{11} & Y_{12} \\ & Y_{22} \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}$

So, we can solve this system by solving $Y_{11} = U_{11}^{-1}$, $Y_{22} = U_{22}^{-1}$ and since $U_{11} Y_{12} + U_{12} Y_{22} = 0$, by pre-multiplying this system by $Y_{11}$, we get $Y_{12} = - Y_{11} U_{12} Y_{22}$. Therefore, the time $I(n)$ to invert a triangular $n \times n$ corresponds to two inversions of $\frac{n}{2}$ matrices plus two matrix multiplications, i.e., $I(n) = 2 I(\frac{n}{2}) + O(n^\omega)$. Solving the recursion we get $I(n) = O(n^\omega)$.

The previous arguments reduce the problem of computing determinants, solving linear systems and computing inverses to the problem of computing an LU decomposition in $O(n^\omega)$ time.

Lazy LU decomposition

Note that the expensive step of the LU decomposition consists in updating the matrix $A$ by subtracting a rank one matrix. The idea behind the lazy LU decomposition is to postpone the updates such that they are done many row updates are done together in an exponential fashion, i.e., for each $j \geq 1$ a $2^{-j}$ fraction of the updates modifies only $2^{j-1}$ rows. The following picture show the traditional LU decomposition:

The k-th iteration, uses entry $A_{ii}$ in order to zero the entries in the blue block marked with $k$. The Hopcroft-Bunch version of the LU factorization zeros the entries in a different fashion, shown in the following picture:

The first few steps work like that: the first step will be almost like the traditional decomposition except that we will use $A_{11}$ to make entry $A_{21}$ zero (instead of the entire 1-column): this is done by setting $B \leftarrow B - A C^{-1} D$ and $A \leftarrow 0$. Notice that this is the same as doing $[A,B] \leftarrow [A,B] - A C^{-1} [C,D]$.

Next we do the same thing for block 2, we perform the same matrix operation: $B \leftarrow B - A C^{-1} D$. Notice that at this point $C$ will be an upper triangular non-singular matrix. We continue the same procedure until we the entire matrix become upper-diagonal. We store the modifications we performed (i.e., the matrices $A C^{-1}$ in the lower diagonal matrix $L$ in the same way we did or the usual LU decomposition. The few next steps are:

Generalizing this for an matrix of size $2^k \times 2^k$ we notice that there are $2^{k-j}$ operations with matrices of size $2^j$ for $j = 1$ to $k-1$. Each iteration consists of inverting a $O(2^{j\omega})$ and a multiplication of a $2^j \times 2^j$ matrix by a $2^j \times n$ matrix, which takes time $\frac{n}{2^j} O(2^{j\omega})$, since this can be decomposed in $\frac{n}{2^j}$ multiplications of matrices of size $2^j \times 2^j$. The overall time is then $\sum_{j=1}^{\log(n)} \frac{n}{2^j} \cdot \frac{n}{2^j} O(2^{j\omega}) = n^2 \sum_j O(2^{j(\omega - 2)}) = n^2 \cdot 2^{(\omega-2)\log n} = O(n^\omega)$.

The code for the lazy LU decomposition is surprisingly simple and similar to the code for the traditional LU decomposition. Essentially in each iteration $i$, we compute the size of the block $g$ as the maximum value of $2^j$ such that $2^j$ divides $i$ (in the code this is done using a bitwise operation).

function [L,A] = lazy_lu(A)
n = length(A);
L = eye(n);
for i = 1 : n-1
g = bitxor(i, bitand(i, i-1));
L(i+1:i+g,i-g+1:i) =
A(i+1:i+g,i-g+1:i) *  inv(A(i-g+1:i,i-g+1:i)) ;
A(i+1:i+g,i-g+1:n) =
A(i+1:i+g,i-g+1:n) - L(i+1:i+g,i-g+1:i) * A(i-g+1:i,i-g+1:n);
endfor
endfunction


The Octave code has running time $O(n^\omega)$ where $\omega$ is the exponent associated with the matrix multiplication algorithm used. Notice that above we are still using Octave’s native matrix multiplication and inversion, which are $O(n^3)$. In order to improve on that, one would need to substitute the inversion by a custom inversion (following the receipt given earlier in this post) and the multiplication operations  A(i+1:i+g,i-g+1:i) * inv(A(i-g+1:i,i-g+1:i))  and L(i+1:i+g,i-g+1:i) * A(i-g+1:i,i-g+1:n) by custom matrix multiplication.

Also, the code below should produce a correct decomposition for a random matrix, which is guaranteed to be well behaved. For specific matrices (like the matrix with all zeros and one in the secondary diagonal), the LU will fail since it will try to invert a singular matrix (the same way that a naive LU decomposition — our first algorithm in this page — will fail). The solution is to do pivoting and write an LUP decomposition, where L and U are as before and $P$ is a permutation matrix. Pivoting here can be done as usual, here is a code snippet that is guaranteed to work for any non-singular matrix:

function [L,A,P] = lazy_lu_with_pivoting(A)
n = length(A);
L = eye(n);
perm = [1:n];
for i = 1 : n-1
g = grow(i);
f = min(i+g, n);

# pivoting step
[_, piv] = max(A(i,i:end)); piv = piv + i - 1;
if (i != piv)
temp = A(:,i); A(:,i) = A(:,piv); A(:,piv) = temp;
temp = perm(i); perm(i) = perm(piv); perm(piv) = temp;
endif

L(i+1:f,i-g+1:i) =
A(i+1:f,i-g+1:i) *  inv(A(i-g+1:i,i-g+1:i)) ;
A(i+1:f,i-g+1:n) =
A(i+1:f,i-g+1:n) - L(i+1:f,i-g+1:i) * A(i-g+1:i, i-g+1:n);
endfor
P = eye(n)(perm, :);
endfunction


Lastly, one might wonder if solving a linear system or inverting a matrix is perhaps easier then multiplying two matrices. Hopcroft and Bunch point out the following fact that shows that inverting a matrix in $O(n^\omega)$ implies an algorithm for multiplying matrices in the same time:

$\begin{bmatrix} I & A & 0 \\ 0 & I & B \\ 0 & 0 & I \end{bmatrix}^{-1} = \begin{bmatrix} I & -A & AB \\ 0 & I & -B \\ 0 & 0 & I \end{bmatrix}$