Connection between the Normal Equations and Weak Solutions

I want to connect the normal equations, which come up in statistics, to weak solutions, which come up in PDEs. Let’s start with the normal equations. These arise when you want to solve a linear system A\vec{x}=\vec{b}, where A is m\times n with m>n. When you have more equations than unknowns, the system is called “overdetermined.” If your system has no exact solution (which is frequently the case for overdetermined systems), then the best you can do is minimize \|A\vec{x}-\vec{b}\|^2. This is done by formulating the normal equations A^TA\vec{x}=A^T\vec{b} and solving for \vec{x}. This approach has numerical drawbacks (forming A^TA squares the condition number), but I just want to use it to show a connection to another area of math.
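For concreteness, here’s a minimal NumPy sketch (the system is random, purely for illustration) that solves the normal equations directly and checks the answer against NumPy’s built-in least-squares solver:

```python
import numpy as np

# A made-up overdetermined system: 5 equations, 2 unknowns.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)

# Solve the normal equations A^T A x = A^T b directly...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# ...and compare against NumPy's least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```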

We can rearrange the normal equations and rewrite them as A^T(A\vec{x}-\vec{b})=0. Since A\vec{x}=\vec{b} was the system we wanted to solve originally, A\vec{x}-\vec{b} is the residual vector, which measures how far \vec{x} is from solving the system exactly. The equation A^T(A\vec{x}-\vec{b})=0 says that the residual A\vec{x}-\vec{b} is orthogonal to every column of A. If we write this in the language of bilinear forms (here, just the standard inner product), we get (A[:,i],A\vec{x}-\vec{b}) = 0 for each column A[:,i] of A. This bilinear form is symmetric, so we can also write this as (A\vec{x}-\vec{b},A[:,i]) = 0.
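Here’s the same made-up system as above, now checking the orthogonality claim directly:

```python
import numpy as np

# Check that the least-squares residual is orthogonal to every column of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)

x = np.linalg.solve(A.T @ A, A.T @ b)   # solve the normal equations
residual = A @ x - b

print(A.T @ residual)                   # entries are ~0 (round-off only)
print([col @ residual for col in A.T])  # same check, one column at a time
```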

Now, let’s talk about what a weak solution is. These usually come up when differential equations have solutions that might not be as differentiable as the equation itself requires. The idea behind weak solutions is to relax the strict smoothness requirement on the solution, as long as it still satisfies the rest of the PDE in an integrated sense. If you wanted to solve a PDE of the form \mathcal{L}u=f for some differential operator \mathcal{L}, the weak formulation instead seeks a u satisfying (\mathcal{L}u,v) = (f,v) for every sufficiently smooth test function v, where (\cdot,\cdot) is a bilinear form (typically an integral over the domain).
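To make this concrete, here’s the standard textbook example (not tied to anything specific above): the 1D Poisson problem. Multiplying by a test function and integrating by parts moves one derivative off of u and onto v:

```latex
% Standard example: -u''(x) = f(x) on (0,1), with u(0) = u(1) = 0.
% Multiply by a test function v with v(0) = v(1) = 0 and integrate by parts:
\int_0^1 -u''(x)\,v(x)\,dx = \int_0^1 f(x)\,v(x)\,dx
\quad\Longrightarrow\quad
\int_0^1 u'(x)\,v'(x)\,dx = \int_0^1 f(x)\,v(x)\,dx
\quad\text{for all test functions } v.
% The weak form only asks for one derivative of u, not two.
```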

Using linearity of the bilinear form in its first argument, we can rewrite this equation as (\mathcal{L}u-f,v)=0.

If we use the example of the linear system we started with, this says that (A\vec{x}-\vec{b},\vec{v})=0 for all vectors \vec{v}. This is sort of a weak formulation of the linear system A\vec{x}=\vec{b}. (Of course, if this held for literally every \vec{v}, the residual would have to be zero and \vec{x} would solve the system exactly; the interesting case is when we only test against certain vectors.)

This looks identical to the result from the normal equations that (A\vec{x}-\vec{b},A[:,i]) = 0, except that the normal equations only require particular vectors, the columns of A, to be orthogonal to the residual A\vec{x}-\vec{b}, rather than every vector \vec{v}.

This means that, in some ways, you can think of the least-squares formulation of an overdetermined linear system as a weak formulation. In PDEs, weak formulations where the test functions are the same as the elements you’re using to approximate your solution (for least squares, those are the columns of A) are called Galerkin methods, the Galerkin finite element method being the most common example.
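To make the analogy concrete, here’s a small NumPy sketch (again with a made-up system) showing that the least-squares approximation A\vec{x} is exactly the orthogonal (Galerkin) projection of \vec{b} onto the column space of A:

```python
import numpy as np

# Least squares as a Galerkin projection onto the column space of A:
# the trial space and the test space are both the span of A's columns.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

# Least-squares coefficients via the normal equations.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Orthogonal projector onto col(A), built from an orthonormal basis (QR).
Q, _ = np.linalg.qr(A)
proj_b = Q @ (Q.T @ b)

# The least-squares approximation A @ x is exactly that projection.
print(np.allclose(A @ x, proj_b))  # True
```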

So there’s a nice relationship between Galerkin methods and the least-squares formulation of an overdetermined linear system!

Rank-Nullity Theorem and First Isomorphism Theorem

In this post, I want to talk about the connection between the Rank-Nullity Theorem from Linear Algebra and the First Isomorphism Theorem, which comes up in Abstract Algebra. We talked about the Rank-Nullity Theorem in the last post, but here’s a quick overview.

The Rank-Nullity Theorem relates the dimension of the column space of an m\times n matrix A to the dimension of the null space of that matrix. Specifically, it says that the sum of these two values is equal to the number of columns of A. That is, rank(A)+nullity(A) = n.
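Here’s a quick NumPy check (with a matrix I’m making up so that it has some dependent columns) that the rank and the nullity really do add up to the number of columns:

```python
import numpy as np

# A made-up 4x6 matrix whose last three columns depend on the first three.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 3))
A = np.hstack([B, B @ rng.standard_normal((3, 3))])   # 4 x 6, rank 3

# rank = number of nonzero singular values; the remaining right singular
# vectors form a basis for the null space.
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))
null_basis = Vt[rank:].T
nullity = null_basis.shape[1]

print(rank, nullity)                    # 3 3
print(rank + nullity == A.shape[1])     # True: rank + nullity = n
print(np.allclose(A @ null_basis, 0))   # True: these vectors are in the null space
```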

Before we talk about the First Isomorphism Theorem and relate it to linear transformations, let’s just define a few things that we’ll need to know to understand it.

Definition 1: The kernel of a transformation f:A\to B is the set \{x\in A \vert f(x)=0 \}. Note that 0 here denotes the zero (identity) element of B.

Note that the kernel is a generalization of the null space of a matrix that is defined for both linear and nonlinear transformations.

Definition 2: An element b is in the image of a transformation f:A\to B if there exists some a\in A such that f(a)=b.

The image of a transformation is a generalization of the column space that holds for both linear and nonlinear transformations. That is, the image of a matrix transformation is the column space of that matrix.
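To tie these definitions back to matrices, here’s a short NumPy sketch (with a made-up matrix) that computes a basis for the kernel and for the image of a matrix transformation:

```python
import numpy as np

# A small made-up matrix: compute bases for its kernel (null space)
# and its image (column space) from the SVD.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])

U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))

image_basis = U[:, :rank]     # spans the image: the line through (1, 2, 3)
kernel_basis = Vt[rank:].T    # spans the kernel: the line through (1, -1)

print(image_basis.ravel())    # proportional to (1, 2, 3)
print(kernel_basis.ravel())   # proportional to (1, -1)
```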

Definition 3: A group homomorphism from (G,*) to (H,\cdot) is a function h:G\to H such that for all u and v in G, it holds that h(u*v) = h(u)\cdot h(v).

A good way to think about a group homomorphism is that it’s a transformation between groups that maintains the relationships among the group’s elements. You get the same result whether you combine two elements in G and then apply h, or apply h to each element separately and then combine the two outputs in H. This is interesting because the two group operations are not necessarily the same.
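A quick illustration (my example, chosen just for this): the determinant is a group homomorphism from the invertible n\times n matrices under matrix multiplication to the nonzero reals under ordinary multiplication, since det(AB) = det(A)det(B):

```python
import numpy as np

# The determinant as a group homomorphism:
#   (invertible matrices, matrix multiplication) -> (nonzero reals, multiplication)
# The homomorphism property is det(A @ B) == det(A) * det(B).
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # True
```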

Now, let’s state the specific case of the First Isomorphism Theorem we’ll talk about.

First Isomorphism Theorem: Let \phi:G\to H be a surjective homomorphism. Then H is isomorphic to G/ker(\phi). Since \phi is surjective, H=im(\phi), so equivalently im(\phi) is isomorphic to G/ker(\phi).

Hold on… What does G/ker(\phi) mean? That notation is a quotient group. There is a formal definition of it, but for now, let’s think about it as a way to separate the elements of G into classes (called cosets) depending on their relationship with ker(\phi). One coset is ker(\phi) itself, containing everything in G that \phi sends to the identity; the other cosets are its translates g\,ker(\phi), and two elements of G land in the same coset exactly when \phi sends them to the same element of H. So there is one coset for each element of im(\phi).
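Here’s a tiny illustration with a homomorphism chosen just for this purpose: \phi:\mathbb{Z}\to\mathbb{Z}_3 given by \phi(x) = x \bmod 3. Its kernel is the multiples of 3, and the cosets are exactly the residue classes:

```python
# Cosets of ker(phi) for phi : Z -> Z_3, phi(x) = x mod 3.
# Two integers land in the same coset exactly when phi sends them
# to the same element of Z_3.
phi = lambda x: x % 3

cosets = {}
for x in range(-6, 7):          # a finite window into Z, for display
    cosets.setdefault(phi(x), []).append(x)

for value, coset in sorted(cosets.items()):
    print(value, coset)
# 0 [-6, -3, 0, 3, 6]   <- this coset is ker(phi) itself
# 1 [-5, -2, 1, 4]
# 2 [-4, -1, 2, 5]
```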

Because of the way quotient groups work, the cosets of ker(\phi) all have the same size and together they tile G, with one coset for each element of im(\phi). For finite groups, this means |G| = |ker(\phi)| \cdot |im(\phi)|.

For a linear transformation between vector spaces, the same bookkeeping happens with dimensions instead of sizes: quotienting by ker(\phi) subtracts its dimension, so the isomorphism G/ker(\phi) \cong im(\phi) tells us that dim(G) - dim(ker(\phi)) = dim(im(\phi)), or dim(G) = dim(ker(\phi)) + dim(im(\phi)).
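To see the finite-group version with a concrete (made-up) example, take \phi:\mathbb{Z}_6\to\mathbb{Z}_3 with \phi(x) = x \bmod 3:

```python
# phi : Z_6 -> Z_3, phi(x) = x mod 3, is a surjective homomorphism
# (addition mod 6 upstairs, addition mod 3 downstairs).
G = range(6)                                  # elements of Z_6
phi = lambda x: x % 3

kernel = [x for x in G if phi(x) == 0]        # [0, 3]
image = sorted({phi(x) for x in G})           # [0, 1, 2]

print(len(kernel) * len(image) == len(G))     # True: |G| = |ker(phi)| * |im(phi)|
```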

What does this say about linear transformations represented by square matrices? For a linear transformation from \mathbb{R}^n to \mathbb{R}^n, the dimensions of the null space and the column space must add up to n. So the two spaces trade off against each other: if the null space contains anything beyond the zero vector, the column space can’t be all of \mathbb{R}^n.

Let’s look at a few examples and see why this makes sense.

Example: Let C be an n \times n invertible matrix. Its columns span all of \mathbb{R}^n and its null space contains only the zero vector, so the column space has dimension n, the null space has dimension 0, and the two dimensions add up to n as expected.

Example: Consider the non-invertible matrix \begin{pmatrix}1&2\\2&4\end{pmatrix}, which has the vector \begin{pmatrix} 2 \\ -1 \end{pmatrix} in its null space. Since the nullity is 1, the rank is only 1: the column space is just the line spanned by \begin{pmatrix}1\\2\end{pmatrix}, and \begin{pmatrix} 2 \\ -1 \end{pmatrix} doesn’t lie on that line. So if you try to solve \begin{pmatrix}1&2\\2&4\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix}=\begin{pmatrix}2\\-1\end{pmatrix}, you won’t be able to!
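We can confirm this numerically with a quick sketch of the example above:

```python
import numpy as np

# (2, -1) is in the null space of A but not in its column space,
# so A x = (2, -1) has no solution.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
target = np.array([2.0, -1.0])

print(np.allclose(A @ target, 0))             # True: (2, -1) is in the null space

x, _, rank, _ = np.linalg.lstsq(A, target, rcond=None)
print(rank)                                   # 1: the column space is only a line
print(np.allclose(A @ x, target))             # False: no exact solution exists
```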

So, what have we learned? The First Isomorphism Theorem generalizes the Rank-Nullity Theorem in a way that lets us handle transformations between groups that are not necessarily Euclidean spaces. There is a tradeoff between having elements in the kernel of a transformation and elements in the image of a transformation. The larger the kernel is, the smaller the image is, and vice versa.

Let’s talk about the progression of these three theorems. The Invertible Matrix Theorem draws connections between the algebraic properties of a square matrix and the geometry of the transformation it represents. The Rank-Nullity Theorem makes claims about the geometry of transformations between Euclidean spaces with potentially different dimensions. And the First Isomorphism Theorem makes claims about transformations between groups in general (note that \mathbb{R}^n and \mathbb{R}^m are groups under addition for any positive integers n and m).

In the next post, we’ll talk more about what the First Isomorphism Theorem is actually saying and a neat application of it to transformations between different dimensions.

Stay tuned!