Connection between the Normal Equations and Weak Solutions

I want to connect the normal equations, which come up in statistics, to weak solutions, which come up in PDEs. Let’s start with the normal equations. These arise when you want to solve a linear system A\vec{x}=\vec{b}, where A is m\times n with m>n. When you have more equations than unknowns, the system is called “overdetermined.” If your system has no exact solution (which is frequently the case for overdetermined systems), then the best you can do is minimize \|A\vec{x}-\vec{b}\|^2. This is done by formulating the normal equations A^TA\vec{x}=A^T\vec{b} and solving for \vec{x}. This approach has numerical drawbacks (forming A^TA squares the condition number of the problem), but I just want to use it to show a connection to another area of math.

We can rearrange the normal equations and rewrite them as A^T(A\vec{x}-\vec{b})=0. Since A\vec{x}=\vec{b} was the system we wanted to solve originally, A\vec{x}-\vec{b} is the residual vector, which measures how far \vec{x} is from solving the system exactly. The equation A^T(A\vec{x}-\vec{b})=0 says that the residual A\vec{x}-\vec{b} is orthogonal to every column of A. If we write this in the language of bilinear forms, we get that (A[:,i],A\vec{x}-\vec{b}) = 0 for each column A[:,i] of A. This bilinear form (the usual dot product) is symmetric, so we can also write this as (A\vec{x}-\vec{b},A[:,i]) = 0.
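
Here’s a quick numerical sketch of all of this in Python with NumPy. The matrix and right-hand side are made up for the example; the point is just to solve the normal equations and check that the residual really is orthogonal to the columns of A.

```python
import numpy as np

# A made-up overdetermined system: 5 equations, 2 unknowns
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
b = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Solve the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Compare against NumPy's built-in least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_lstsq))    # True

# The residual A x - b is orthogonal to every column of A,
# so A^T (A x - b) is numerically zero
r = A @ x - b
print(A.T @ r)                    # entries on the order of 1e-15
print(np.allclose(A.T @ r, 0.0))  # True
```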

Now, let’s talk about what a weak solution is. These usually come up when differential equations have solutions that are not as differentiable as the equation itself seems to require. The idea behind weak solutions is to relax the smoothness requirement on the solution, as long as it still satisfies the equation in an integrated (averaged) sense. If you wanted to solve a PDE of the form \mathcal{L}u=f for some differential operator \mathcal{L}, the weak formulation instead seeks a u satisfying (\mathcal{L}u,v) = (f,v) for every sufficiently smooth test function v, where (\cdot,\cdot) is a bilinear form (typically an integral inner product).
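
The standard textbook example is the one-dimensional Poisson problem -u''=f on (0,1) with u(0)=u(1)=0. Multiplying by a test function v that also vanishes at the endpoints and integrating by parts turns (\mathcal{L}u,v)=(f,v) into \int_0^1 u'v'\,dx=\int_0^1 fv\,dx. The second derivative has disappeared, so the weak form only needs u to have one (weak) derivative instead of two, which is exactly the sense in which the smoothness requirement gets relaxed.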

Using the linearity of the bilinear form in its first argument, we can rewrite this equation as (\mathcal{L}u-f,v)=0 for every test function v.

If we apply this to the linear system we started with, it says that (A\vec{x}-\vec{b},\vec{v})=0 for all vectors \vec{v}. This is sort of a weak formulation of the linear system A\vec{x}=\vec{b}. (In finite dimensions it isn’t actually any weaker: if the residual is orthogonal to every vector \vec{v}, you can take \vec{v}=A\vec{x}-\vec{b} to conclude that the residual is zero, so this is equivalent to solving the system exactly.)

This looks almost identical to the result from the normal equations that (A\vec{x}-\vec{b},A[:,i]) = 0, except that the normal equations only require the residual A\vec{x}-\vec{b} to be orthogonal to particular vectors, the columns of A, rather than to every vector \vec{v}.

This means that, in some ways, you can think of the least squares formulation of an overdetermined linear system as a weak formulation. In PDEs, weak formulations where the test functions are the same as the basis elements you’re using to approximate your solution (for least squares, those are the columns of A) are called Galerkin methods, and the finite element method is the best-known example.
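
To spell the analogy out, here is the standard Galerkin recipe rewritten in the least-squares notation: express the approximation A\vec{x}=\sum_j x_j\vec{a}_j as a combination of the columns \vec{a}_j of A, and require (\sum_j x_j\vec{a}_j-\vec{b},\vec{a}_i)=0 for each i. Expanding the bilinear form gives \sum_j(\vec{a}_j,\vec{a}_i)x_j=(\vec{b},\vec{a}_i), which entry by entry is exactly A^TA\vec{x}=A^T\vec{b}. In finite element language, A^TA plays the role of the stiffness matrix and A^T\vec{b} plays the role of the load vector.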

So there’s a nice relationship between Galerkin methods (including finite element methods) and the least squares formulation of an overdetermined linear system!

AM-GM Inequality

The Arithmetic Mean – Geometric Mean inequality says that if you have numbers a_1,\ldots,a_n\geq0, then (a_1\times a_2\times\ldots \times a_n)^{\frac{1}{n}}\leq\frac{1}{n}(a_1+\ldots+a_n). Another way to say this is that, given a set of nonnegative numbers, their geometric mean is always less than or equal to their arithmetic mean.
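
Here’s a quick numerical sanity check in Python (the numbers are arbitrary):

```python
import numpy as np

# Arbitrary nonnegative numbers for the sanity check
a = np.array([0.5, 2.0, 3.0, 8.0])
n = len(a)

geometric_mean = np.prod(a) ** (1.0 / n)   # (a_1 * ... * a_n)^(1/n)
arithmetic_mean = np.mean(a)               # (a_1 + ... + a_n) / n

print(geometric_mean, arithmetic_mean)     # ~2.213 <= 3.375
print(geometric_mean <= arithmetic_mean)   # True
```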

Before we look at some examples, let’s think about what this means geometrically. If you raise both sides to the nth power, you get (a_1\times a_2\times\ldots \times a_n) \leq \left(\frac{a_1+\ldots+a_n}{n}\right)^n. The left side is the volume of an n-dimensional box with side lengths a_1,\ldots,a_n, and the right side is the volume of an n-dimensional cube whose side length is the average \frac{a_1+\ldots+a_n}{n}. So the statement says that among all boxes whose side lengths have a fixed sum, the cube with all sides equal has the largest volume. For example, a box with sides 1, 2, and 3 has volume 6, while the cube with side \frac{1+2+3}{3}=2 has volume 8.

Now let’s look at some examples where we can apply this theorem.

If A is an n\times n positive semidefinite matrix (in particular, its eigenvalues \lambda_1,\ldots,\lambda_n are real and nonnegative), then (\lambda_1\times\ldots\times\lambda_n)^\frac{1}{n}\leq \frac{\lambda_1+\ldots+\lambda_n}{n}. The product of the eigenvalues of a matrix is det(A), and the sum of the eigenvalues is tr(A). That means we can write the inequality as det(A)^{\frac{1}{n}}\leq\frac{tr(A)}{n}. Raising both sides to the nth power gives a bound on the determinant: for a positive semidefinite matrix, det(A) \leq \left(\frac{tr(A)}{n}\right)^n.
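
A small Python sketch of this bound (the matrix below is random, built as B^TB so it is guaranteed to be positive semidefinite):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# B^T B is always positive semidefinite
B = rng.standard_normal((n, n))
A = B.T @ B

det_A = np.linalg.det(A)
bound = (np.trace(A) / n) ** n

print(det_A, bound)    # det(A) should not exceed the bound
print(det_A <= bound)  # True
```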

This is nice because the determinant is relatively expensive to calculate (roughly O(n^3) work via a matrix factorization), while the trace is just the sum of the n diagonal entries of A. So this gives an upper bound on det(A) that is very cheap to obtain.

Let’s consider some positive semidefinite matrices that arise in applications and see what this inequality can tell us.

Let \vec{f}:\mathbb{R}^n\to\mathbb{R}^n be a nonlinear function. Then the Jacobian of this nonlinear transformation is J=\begin{bmatrix}\frac{\partial f_1}{\partial x_1}&\ldots&\frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1}&\ldots&\frac{\partial f_n}{\partial x_n}\end{bmatrix}. If the Jacobian matrix is positive semidefinite, we can apply the inequality from earlier: det(J)\leq \left(\frac{tr(J)}{n}\right)^n. The trace in the right hand side is \frac{\partial f_1}{\partial x_1}+\ldots+\frac{\partial f_n}{\partial x_n}, which is exactly div \vec{f}, the divergence of the vector field. So we get that det(J) \leq \left(\frac{div \vec{f}}{n}\right)^n.

This is interesting because the determinant of a linear transformation tells you the factor by which it scales volumes, so det(J) measures how much the linearization of \vec{f} expands volume near a point. The divergence div \vec{f} tells you whether \vec{f} behaves like a source or a sink at a point. So this inequality relates the volume expansion of the linearization of \vec{f} to the extent to which \vec{f} behaves like a source or a sink at that point.
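
Here’s a made-up example to see this in action: \vec{f}(x,y)=(4x^3+2xy^2,\,4y^3+2x^2y) is the gradient of the convex function x^4+y^4+x^2y^2, so its Jacobian (the Hessian of that function) is symmetric positive semidefinite everywhere, and the analytic Jacobian is hard-coded below.

```python
import numpy as np

# Jacobian of the made-up field f = grad(x^4 + y^4 + x^2 y^2);
# it is symmetric positive semidefinite at every point
def jacobian(x, y):
    return np.array([[12 * x**2 + 2 * y**2, 4 * x * y],
                     [4 * x * y, 12 * y**2 + 2 * x**2]])

x, y = 1.0, 1.0
J = jacobian(x, y)
n = 2

det_J = np.linalg.det(J)       # 180 at (1, 1)
div_f = np.trace(J)            # divergence of f = trace of J = 28
bound = (div_f / n) ** n       # (28 / 2)^2 = 196

print(det_J, bound)            # 180 <= 196
print(det_J <= bound + 1e-12)  # True
```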

Now, suppose g:\mathbb{R}^n\to \mathbb{R} is a convex function. Then its Hessian matrix H=\begin{bmatrix}\frac{\partial^2g}{\partial x_1^2}&\frac{\partial^2 g}{\partial x_1\partial x_2}&\ldots&\frac{\partial^2 g}{\partial x_1\partial x_n}\\ \frac{\partial^2 g}{\partial x_2\partial x_1}&\frac{\partial^2g}{\partial x_2^2}&\ldots&\frac{\partial^2 g}{\partial x_2\partial x_n}\\ \vdots & \vdots & \ddots&\vdots\\ \frac{\partial^2 g}{\partial x_n\partial x_1}&\frac{\partial^2 g}{\partial x_n\partial x_2}&\ldots&\frac{\partial^2 g}{\partial x_n^2}\end{bmatrix} is positive semidefinite. Therefore det(H)\leq \left(\frac{tr(H)}{n}\right)^n. Like in the previous example, we can simplify the trace. The trace of H is \frac{\partial^2 g}{\partial x_1^2}+\ldots+\frac{\partial^2 g}{\partial x_n^2}, which is just \Delta g, the Laplacian of g. So this shows that for a convex function g (one whose Hessian is positive semidefinite), det(H)\leq\left(\frac{\Delta g}{n}\right)^n. If you compute det(H) at a critical point of g (a place where \nabla g=\vec{0}), you get the Gaussian curvature of the graph of g at that point (when n=2; in higher dimensions the analogous quantity is the Gauss–Kronecker curvature). That means this inequality gives an upper bound on the curvature at a critical point in terms of \Delta g.
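
And a matching sanity check with a made-up convex function, g(x,y)=2x^2+y^2, whose Hessian is constant and which has a critical point at the origin:

```python
import numpy as np

# Hessian of the made-up convex function g(x, y) = 2 x^2 + y^2;
# the gradient (4x, 2y) vanishes at the origin, so it is a critical point
H = np.array([[4.0, 0.0],
              [0.0, 2.0]])
n = 2

det_H = np.linalg.det(H)      # 8; also the Gaussian curvature of the
                              # graph of g at the origin
laplacian = np.trace(H)       # Delta g = 4 + 2 = 6
bound = (laplacian / n) ** n  # (6 / 2)^2 = 9

print(det_H, bound)           # 8 <= 9
print(det_H <= bound)         # True
```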

Hopefully this gives some insight into how the Arithmetic Mean – Geometric Mean Inequality comes up in different areas of math!