Let $A \in \R^{n \times n}$ be invertible and $b \in \R^n$. A projection method for solving $Ax = b$ produces an approximation $\tilde{x}$ to the exact solution $x^*$ within an $m$-dimensional search subspace $\K$ translated by an initial guess $x^{(0)}$, such that the residual $b - A\tilde{x}$ is orthogonal to an $m$-dimensional constraint subspace $\L$. In other words, $\tilde{x} \in x^{(0)} + \K$ with $b - A\tilde{x} \in \L^\perp$. If $\L = \K$, the projection method is said to be orthogonal and its orthogonality constraints are known as the Galerkin conditions; otherwise, the method is said to be oblique and its constraints are known as the Petrov–Galerkin conditions.
Such a method is well-defined if and only if $A\K \cap \L^\perp = \set{0}$. Indeed, if $A\K \cap \L^\perp = \set{0}$ and $V, W \in \R^{n \times m}$ are matrices whose columns form bases of $\K$ and $\L$, respectively, then we must have $\tilde{x} = x^{(0)} + Vy$ for some $y \in \R^m$ such that $W^\tp (r^{(0)} - AVy) = 0$, where $r^{(0)} := b - Ax^{(0)}$. Hence $$ \tilde{x} = x^{(0)} + V(W^\tp AV)^{-1} W^\tp r^{(0)}, $$ where $W^\tp AV$ is invertible because $\im(AV) \cap \ker(W^\tp) = A\K \cap \L^\perp = \set{0}$. In addition, if $\tilde{x}' \in x^{(0)} + \K$ with $b - A\tilde{x}' \in \L^\perp$, then $A(\tilde{x} - \tilde{x}') \in A\K \cap \L^\perp$, so $\tilde{x} = \tilde{x}'$. Conversely, if the method is well-defined and $Av \in \L^\perp$ for some $v \in \K$, then $\tilde{x} + v \in x^{(0)} + \K$ and $b - A(\tilde{x} + v) \in \L^\perp$, so $\tilde{x} + v = \tilde{x}$ by uniqueness, giving $v = 0$ and hence $Av = 0$.
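To make the formula concrete, here is a minimal NumPy sketch of a single (possibly oblique) projection step; the function name `projection_step` and the random test data are illustrative choices, not part of the text above.

```python
import numpy as np

def projection_step(A, b, x0, V, W):
    """One projection step: find x in x0 + range(V) with W^T (b - A x) = 0,
    i.e. x = x0 + V (W^T A V)^{-1} W^T r0."""
    r0 = b - A @ x0
    y = np.linalg.solve(W.T @ A @ V, W.T @ r0)
    return x0 + V @ y

# Sanity check on random data (illustrative only).
rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.standard_normal((n, n)) + n * np.eye(n)  # generically invertible; shift improves conditioning
b = rng.standard_normal(n)
x0 = rng.standard_normal(n)
V = rng.standard_normal((n, m))                  # columns: a basis of K
W = rng.standard_normal((n, m))                  # columns: a basis of L

x_tilde = projection_step(A, b, x0, V, W)
# Petrov–Galerkin condition: the residual is orthogonal to L = range(W).
print(np.allclose(W.T @ (b - A @ x_tilde), 0.0))  # expected: True
```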
This projection process may be iterated by selecting new subspaces $\K$ and $\L$ and using $\tilde{x}$ as the initial guess for the next iteration, yielding a variety of iterative methods for linear systems, such as the well-known Krylov subspace methods. These iterative methods can sometimes experience a “lucky breakdown” when the projection produces the exact solution:
If $r^{(0)} \in \K$ and $\K$ is $A$-invariant, then $A\tilde{x} = b$ (or equivalently, $\tilde{x} = x^*$).
Proof. By definition, $\tilde{x} - x^{(0)} \in \K$ and $b - A\tilde{x} \in \L^\perp$. On the other hand, $A\K \subseteq \K$ and $\dim(A\K) = \dim(\K)$ since $A$ is invertible, so $A\K = \K$. Hence $b - A\tilde{x} = r^{(0)} - A(\tilde{x} - x^{(0)}) \in A\K \cap \L^\perp = \set{0}$. ∎
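A quick numerical illustration of this fact, sketched under the stated hypotheses with illustrative test data: build an $A$-invariant subspace from a few eigenvectors of a symmetric matrix, choose $b$ so that $r^{(0)}$ lies in it, and verify that the orthogonal projection returns the exact solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3

# Symmetric A with known eigenvectors; K = span of the first m eigenvectors
# is A-invariant.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(1.0, 10.0, n)) @ Q.T
V = Q[:, :m]                        # orthonormal basis of K

# Choose b so that r0 = b - A x0 lies in K.
x0 = rng.standard_normal(n)
b = A @ x0 + V @ rng.standard_normal(m)

# Orthogonal projection (L = K): x = x0 + V (V^T A V)^{-1} V^T r0.
r0 = b - A @ x0
x_tilde = x0 + V @ np.linalg.solve(V.T @ A @ V, V.T @ r0)

print(np.allclose(A @ x_tilde, b))  # expected: True -- the "lucky breakdown" yields x*
```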
Error projection methods
An error projection method is a projection method where $A$ is symmetric positive definite (SPD) and $\L = \K$. Such methods are well-defined because if $Av \in \K^\perp$ for some $v \in \K$, then $\norm{v}_A^2 = \inner{Av}{v} = 0$, so $v = 0$ and hence $Av = 0$.
If $A$ is SPD and $\L = \K$, then $\tilde{x}$ is the unique minimizer of the $A$-norm of the error, $\norm{x^* - x}_A$, over $x \in x^{(0)} + \K$.
Proof. For all $x \in x^{(0)} + \K$, we have $\norm{x^* - x}_A^2 = \norm{x^* - \tilde{x}}_A^2 + \norm{\tilde{x} - x}_A^2$ because $\tilde{x} - x \in \K$ and $x^* - \tilde{x} \perp_A \K$ according to the Galerkin conditions. ∎
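The optimality property can be checked numerically. The sketch below, with illustrative random data, compares the $A$-norm of the error at the Galerkin solution against the error at randomly sampled points of $x^{(0)} + \K$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 3

# Random SPD matrix and a random search space K = range(V).
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
x0 = rng.standard_normal(n)
V = rng.standard_normal((n, m))

# Galerkin projection (L = K).
r0 = b - A @ x0
x_tilde = x0 + V @ np.linalg.solve(V.T @ A @ V, V.T @ r0)

a_norm = lambda v: np.sqrt(v @ A @ v)

# The A-norm of the error at x_tilde is never beaten by any other point
# of the affine space x0 + K (sampled here at random).
others = [x0 + V @ rng.standard_normal(m) for _ in range(1000)]
print(all(a_norm(x_star - x_tilde) <= a_norm(x_star - x) for x in others))  # expected: True
```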
The gradient descent method
The gradient descent method for solving $Ax = b$ when $A$ is SPD is the iterative method with $\K = \L := \span \set{r^{(k)}}$, where $x^{(k)}$ denotes the $k$th iterate and $r^{(k)} := b - Ax^{(k)}$. Thus, $x^{(k+1)}$ minimizes the $A$-norm of the error over the line $x^{(k)} + \span \set{r^{(k)}}$; indeed, if $f(x) := \frac{1}{2} \norm{x^* - x}_A^2$, then $\nabla f(x^{(k)}) = -r^{(k)}$, so $r^{(k)}$ represents the direction of steepest descent of $f$. The projection formula above reduces to $$ x^{(k+1)} = x^{(k)} + \frac{\inner{r^{(k)}}{r^{(k)}}}{\inner{Ar^{(k)}}{r^{(k)}}} \, r^{(k)} =: x^{(k)} + \alpha_k r^{(k)}. $$ We also note that $r^{(k+1)} = r^{(k)} - \alpha_k Ar^{(k)}$, so this method can be implemented with only one multiplication by $A$ per iteration.
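As a concrete sketch (assuming the usual dense NumPy setting; the function name `gradient_descent` and the stopping rule are illustrative), the iteration can be written as follows.

```python
import numpy as np

def gradient_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Steepest descent for SPD A: one multiplication by A per iteration."""
    x = x0.copy()
    r = b - A @ x
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)
        x = x + alpha * r
        r = r - alpha * Ar          # update the residual without recomputing b - A x
    return x

# Illustrative usage on a random SPD system.
rng = np.random.default_rng(3)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x = gradient_descent(A, b, np.zeros(n))
print(np.linalg.norm(b - A @ x))    # expected: small residual norm
```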
To analyze the convergence of the gradient descent method, we consider the error $e^{(k)} := x^* - x^{(k)}$. Using the facts that $e^{(k+1)} = e^{(k)} - \alpha_k r^{(k)}$ and $e^{(k+1)} \perp_A r^{(k)}$ (the Galerkin condition), we compute that $$ \begin{align*} \norm{e^{(k+1)}}_A^2 &= \inner{e^{(k+1)}}{e^{(k)}}_A \\ &= \norm{e^{(k)}}_A^2 - \alpha_k \inner{r^{(k)}}{e^{(k)}}_A \\ &= \left(1 - \frac{\inner{r^{(k)}}{r^{(k)}}^2}{\inner{r^{(k)}}{r^{(k)}}_A \inner{r^{(k)}}{r^{(k)}}_{A^{-1}}}\right) \norm{e^{(k)}}_A^2. \end{align*} $$ Here the last equality uses $\inner{r^{(k)}}{e^{(k)}}_A = \inner{r^{(k)}}{r^{(k)}}$ and $\norm{e^{(k)}}_A^2 = \inner{r^{(k)}}{r^{(k)}}_{A^{-1}}$, both of which follow from the identity $Ae^{(k)} = r^{(k)}$.
Next, we establish a useful algebraic inequality:
Kantorovich’s inequality
If $\theta_i \geq 0$ and $0 < a \leq x_i \leq b$ for $1 \leq i \leq n$, then $$ \left(\sum_{i=1}^n \theta_i x_i\right) \left(\sum_{i=1}^n \frac{\theta_i}{x_i}\right) \leq \frac{(a+b)^2}{4ab} \left(\sum_{i=1}^n \theta_i\right)^2. $$
Proof. By homogeneity, we may assume that $\sum_i \theta_i = 1$ and $ab = 1$. Since $x \mapsto x + \frac{1}{x}$ is convex on $[a, b]$, it attains its maximum over $[a, b]$ at an endpoint, and $a + \frac{1}{a} = b + \frac{1}{b} = a + b$ because $ab = 1$; hence $x_i + \frac{1}{x_i} \leq a + b$ and $\sum_i \theta_i x_i + \sum_i \frac{\theta_i}{x_i} \leq \sum_i \theta_i (a+b) = a+b$. The result then follows from the AM–GM inequality: $\left(\sum_i \theta_i x_i\right) \left(\sum_i \frac{\theta_i}{x_i}\right) \leq \frac{1}{4} \left(\sum_i \theta_i x_i + \sum_i \frac{\theta_i}{x_i}\right)^2 \leq \frac{(a+b)^2}{4}$, which is the claimed bound under our normalization. ∎
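For readers who like to sanity-check such inequalities numerically, here is a small illustrative script with random weights $\theta_i$ and points $x_i \in [a, b]$; the specific values of $a$, $b$, and $n$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
a, b = 0.5, 7.0
theta = rng.uniform(0.0, 1.0, n)   # nonnegative weights
x = rng.uniform(a, b, n)           # points in [a, b]

lhs = np.sum(theta * x) * np.sum(theta / x)
rhs = (a + b) ** 2 / (4 * a * b) * np.sum(theta) ** 2
print(lhs <= rhs)                  # expected: True
```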
Now if the eigenvalues of $A$ are $\lambda_1 \geq \cdots \geq \lambda_n > 0$, then by Kantorovich's inequality (applied with $x_i = \lambda_i$ and $\theta_i$ the squared coefficients of $r^{(k)}$ in an orthonormal eigenbasis of $A$), $$ \frac{\inner{r^{(k)}}{r^{(k)}}^2}{\inner{r^{(k)}}{r^{(k)}}_A \inner{r^{(k)}}{r^{(k)}}_{A^{-1}}} \geq \frac{4\lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2} = \frac{4\kappa}{(\kappa + 1)^2}, $$ where $\kappa := \lambda_1 / \lambda_n$ is the (2-norm) condition number of $A$. Hence
$$ \norm{e^{(k)}}_A \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^k \norm{e^{(0)}}_A. $$
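The bound can be observed numerically. The following sketch builds an SPD matrix with a prescribed spectrum (so $\kappa$ is known exactly), runs gradient descent, and checks the inequality at every iteration; all test data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30

# SPD matrix with a prescribed spectrum, so kappa is known exactly.
eigs = np.linspace(1.0, 100.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(eigs) @ Q.T
kappa = eigs[-1] / eigs[0]
rate = (kappa - 1) / (kappa + 1)

b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
a_norm = lambda v: np.sqrt(v @ A @ v)

x = np.zeros(n)
e0 = a_norm(x_star - x)
checks = []
for k in range(1, 200):
    r = b - A @ x
    Ar = A @ r
    x = x + (r @ r) / (r @ Ar) * r               # one gradient descent step
    checks.append(a_norm(x_star - x) <= rate ** k * e0 + 1e-12)
print(all(checks))                               # expected: True -- the bound holds at every iteration
```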
Residual projection methods
A residual projection method is a projection method where $A$ is invertible and $\L = A\K$. Such methods are well-defined because $A\K \cap \L^\perp = A\K \cap (A\K)^\perp = \set{0}$.
If $A$ is invertible and $\L = A\K$, then $\tilde{x}$ is the unique minimizer of the residual norm $\norm{b - Ax}$ over $x \in x^{(0)} + \K$.
Proof. For all $x \in x^{(0)} + \K$, we have $\norm{b - Ax}^2 = \norm{b - A\tilde{x}}^2 + \norm{A(\tilde{x} - x)}^2$ because $A(\tilde{x} - x) \in A\K$ and $b - A\tilde{x} \perp A\K$ according to the Petrov–Galerkin conditions. ∎
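Since minimizing $\norm{b - Ax}$ over $x^{(0)} + \K$ is a linear least-squares problem in the coordinates of $\K$, the oblique projection with $\L = A\K$ can be checked against a direct least-squares solve. The sketch below does this on random data; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 8, 3

A = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible (not necessarily SPD)
b = rng.standard_normal(n)
x0 = rng.standard_normal(n)
V = rng.standard_normal((n, m))                   # basis of K
W = A @ V                                         # basis of L = A K

# Oblique projection with L = A K.
r0 = b - A @ x0
x_tilde = x0 + V @ np.linalg.solve(W.T @ A @ V, W.T @ r0)

# Direct least-squares minimization of ||b - A(x0 + V y)|| over y.
y_ls, *_ = np.linalg.lstsq(A @ V, r0, rcond=None)
x_ls = x0 + V @ y_ls

print(np.allclose(x_tilde, x_ls))                 # expected: True -- both minimize the residual norm
```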