Regression IV – Does it really work (mathematically speaking)?

So, in an earlier post, we saw how to derive the formula for the coefficients of regression. We did this by defining the error function

\[ (X\beta - Y)^T(X\beta - Y) \]

differentiating with respect to \(\beta\) and setting to zero to get

\[ \beta^T X^T X - Y^T X = 0 \]

and then solving for \(\beta\).
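As a quick sanity check, here is a minimal NumPy sketch (with made-up random data and variable names of my own) confirming that this gradient condition really does hold at the usual closed-form solution:

```python
import numpy as np

# Made-up example: 100 observations, 3 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = rng.normal(size=(100, 1))

# Closed-form least-squares solution (assumes X^T X is invertible).
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# The condition beta^T X^T X - Y^T X should be numerically zero here.
gradient = beta.T @ X.T @ X - Y.T @ X
print(np.allclose(gradient, 0))  # True, up to floating-point error
```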

Although our method was correct, we were cheating a little. The function is twice continuously differentiable, and setting its derivative to zero certainly finds a stationary point, but we haven't shown that this is a minimum as opposed to a saddle point or a maximum.

To do this, let's have a look at the second derivative of our function (the Hessian). Up to a constant factor of 2, which makes no difference to the argument, this is the constant matrix

\[ X^TX \]

So, at any point on our surface and for any vector \(v\), the second directional derivative will be

\[ v^T X^T X v \]

We can rewrite this as,

\[ (X v)^T (X v) \]

which is just the dot product of the vector \(Xv\) with itself, so it is always greater than or equal to zero. More concisely, the Hessian is positive semidefinite everywhere.
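To make this concrete, here is a small numerical illustration (again with arbitrary random data): the eigenvalues of \(X^TX\) are all non-negative, and \(v^TX^TXv\) agrees with the squared length of \(Xv\) for any \(v\) we try.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
H = X.T @ X  # the Hessian (up to the constant factor of 2)

# A positive semidefinite matrix has no negative eigenvalues.
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))  # True

# The second directional derivative v^T H v is the squared length of Xv.
v = rng.normal(size=3)
print(np.isclose(v @ H @ v, np.sum((X @ v) ** 2)))  # True
print(v @ H @ v >= 0)                               # True
```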

This means that, in every direction, the slope of our function never decreases as we move along it. In particular, at a stationary point, moving away in any direction can never decrease the value of our function, so our stationary point is in fact a minimum. What is more, it is a global minimum! This property, a positive semidefinite Hessian at every point, is called convexity.
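To spell out why a stationary point must be a global minimum here, take any stationary point \(\beta^*\), so that \(X^TX\beta^* = X^TY\). Then for any other \(\beta\), expanding the error function gives

\[ (X\beta - Y)^T(X\beta - Y) - (X\beta^* - Y)^T(X\beta^* - Y) = (\beta - \beta^*)^T X^T X (\beta - \beta^*) \geq 0 \]

so no choice of \(\beta\) can do better than \(\beta^*\).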

Now, how do we know that this global minimum is unique? Well, when we solved for \(\beta\), we obtained the formula

\[ (X^TX)^{-1}X^TY \]

However, when we did this we sneakily assumed that the matrix \(X^TX\) was invertible; if it is not, there is no unique minimum. We know from linear algebra that \(X^TX\) is invertible exactly when the matrix \(X\) has full column rank. The columns of \(X\) are our predictors, so we have a unique minimum exactly when our predictors are linearly independent.

If we think about this it makes a lot of sense. Suppose one of our predictors is a multiple of another, say \(X_1 = a X_2\). Then from any stationary point \(\beta_1, \beta_2, \ldots\) of our error function, we can trivially create infinitely many new stationary points by replacing \(\beta_1\) with \(\beta_1 + t\) and \(\beta_2\) with \(\beta_2 - a t\), for any \(t\), since these changes cancel out and leave the fitted values \(X\beta\) unchanged. In other words, we have an extra degree of freedom.
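Here is that extra degree of freedom numerically (a toy example I made up, with \(a = 2\)): shifting the coefficients as above leaves the fitted values, and hence the error, completely unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 2.0
x2 = rng.normal(size=100)
x1 = a * x2                        # X_1 is an exact multiple of X_2
X = np.column_stack([x1, x2])
Y = rng.normal(size=100)

# One least-squares solution; lstsq copes with the rank deficiency.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Replace beta_1 with beta_1 + t and beta_2 with beta_2 - a*t.
t = 5.0
beta_shifted = beta + np.array([t, -a * t])
print(np.allclose(X @ beta, X @ beta_shifted))  # True: the fit is identical
```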

This can cause us problems if we are finding our minimum numerically. If one of our predictors is almost a linear combination of the others, the error surface has a long, nearly flat valley, and the optimisation that searches for the minimum will spend a lot of time swimming around in it before converging.
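As a rough illustration of this (a gradient-descent sketch on made-up data, with an arbitrary tolerance and step size), a predictor that is nearly a multiple of another makes \(X^TX\) badly conditioned, and plain gradient descent needs vastly more steps to get close to the minimum:

```python
import numpy as np

def steps_to_converge(X, Y, tol=1e-6, max_steps=1_000_000):
    """Plain gradient descent on the squared error; returns the step count."""
    H, target = X.T @ X, X.T @ Y
    step = 1.0 / np.linalg.eigvalsh(H).max()   # safe fixed step size
    beta = np.zeros(X.shape[1])
    for i in range(max_steps):
        grad = H @ beta - target               # gradient of the error, up to a factor of 2
        if np.linalg.norm(grad) < tol:
            return i
        beta -= step * grad
    return max_steps

rng = np.random.default_rng(0)
x2 = rng.normal(size=200)
Y = rng.normal(size=200)

X_good = np.column_stack([rng.normal(size=200), x2])                  # independent predictors
X_bad = np.column_stack([2 * x2 + 0.05 * rng.normal(size=200), x2])   # nearly collinear

print(steps_to_converge(X_good, Y))  # converges in a handful of steps
print(steps_to_converge(X_bad, Y))   # needs orders of magnitude more steps
```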
