Regression VIII – Looking to the Past for Inspiration

Sometimes, time series data will have a certain amount of inertia: the values tend to stay close to wherever they currently are. This is a slightly strange concept, so let's generate some toy data that illustrates it.

import numpy as np

X = np.arange(0, 1, 0.001)

# Build a random walk: each value is the previous value plus Gaussian noise
Y = [0]
for i in range(1, 1000):
    Y.append(Y[i - 1] + np.random.normal(0, 0.05))

We have created two arrays. The first, X, is just an evenly spaced range of 1000 points from 0 to 1. The second, Y, is defined iteratively: it starts at zero, and each consecutive value is the previous value plus a draw from a normal distribution with mean 0 and standard deviation 0.05.

When we plot Y against X it looks like this,
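You can recreate the plot yourself with a couple of lines of matplotlib:

from matplotlib import pyplot as plt

# Plot the random walk against the evenly spaced time points
plt.plot(X, Y)
plt.show()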

We can see that, rather than varying randomly around 0, this time series wanders to a certain level and then stays near there for a while.

So, how do we model this kind of behaviour? With a lag feature. We can define a lag feature based on Y by

lagging_predictor = [0] + Y[:-1]

For each index, this new feature will have the same value as the previous value of Y. We’ve also filled in the first value with 0.

Now let’s try some Linear Regression.

from sklearn.linear_model import LinearRegression
lagging_model = LinearRegression().fit(np.array(lagging_predictor).reshape(-1,1), Y)

When we look at the coefficients of this model, we get a slope of 0.99 and an intercept of 0.003.
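Those numbers come straight off the fitted model (your exact values will differ a little, since the toy data is random):

print(lagging_model.coef_[0])    # the slope, roughly 0.99
print(lagging_model.intercept_)  # the intercept, roughly 0.003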

So our lagging model always predicts 0.003 plus 0.99 times the last value of Y, which is a pretty good prediction! We can build more complex lag features, but remember that linear combinations of features are already taken care of by the fitting in ordinary least squares, so we don't need to construct those ourselves.
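As a sketch of what a more complex version might look like, here is a model that looks two steps back as well as one:

# Lag-1 and lag-2 versions of Y, padding the start with zeros
lag_1 = [0] + Y[:-1]
lag_2 = [0, 0] + Y[:-2]

two_lag_model = LinearRegression().fit(
    [[l1, l2] for l1, l2 in zip(lag_1, lag_2)], Y
)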

Regression VII – New Harmonies

It is common enough, when looking at time series, to see a cyclic pattern: the values in our series oscillate predictably. This may be driven by an underlying effect based on the time of day, the day of the week, or even the day of the year. We can model this with a little feature engineering.

What we are talking about is a wave, which we can model with a sine function. Suppose we have a time variable \(t\) that is normalised to have values between \(0\) and \(1\); then an arbitrary wave will be of the form

\[ A\sin(o + 2\pi f t) \]

where

  1. A is the amplitude of the wave, that is, how high the crests and how low the troughs are
  2. f is the frequency, that is, how many times the wave repeats in a single unit of time
  3. o is the offset, that is, how much the wave is shifted forwards
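As a quick illustration, here is how we might generate such a wave in numpy, with some arbitrary example values for A, f and o:

import numpy as np

A, f, o = 1.5, 3, 0.4          # arbitrary example amplitude, frequency and offset
t = np.arange(0, 1, 0.001)     # normalised time between 0 and 1
wave = A * np.sin(o + 2 * np.pi * f * t)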

Now, these terms are quite non-linear, so they will be hard to model directly. However, we can do a little trigonometry. Thanks to the identity

\[ \sin(\alpha + \beta) = \sin(\alpha)\cos(\beta) + \cos(\alpha)\sin(\beta) \]

we can rewrite the wave function as

\[ A\sin(o + 2\pi f t) = A\sin(o)\cos(2\pi f t) + A\cos(o)\sin(2\pi f t) \]

We can then define \(B = A\sin(o)\) and \(C = A\cos(o)\), and write any such wave in the form

\[ B\cos(2\pi f t) + C \sin(2\pi f t) \]

Now the only remaining non-linear parameter is the frequency \(f\) of the oscillations.

Let's look at an example. Suppose we have a time series that looks like this:

By eyeballing it, we can see a seasonal effect which repeats four times. So let's try a linear regression against the features \(\sin(2\pi \cdot 4x)\) and \(\cos(2\pi \cdot 4x)\).

import numpy as np
from sklearn.linear_model import LinearRegression

# X, Y = ...   wherever our data comes from

# Sine and cosine features at frequency 4
sin_feature = np.sin(X * 4 * 2 * np.pi)
cos_feature = np.cos(X * 4 * 2 * np.pi)

harmonic_model = LinearRegression().fit(
    [[s, c] for s, c in zip(sin_feature, cos_feature)], Y
)

Now, I ran this regression myself, and the coefficients I got were 1.69 and 0.38. So, we have

\[ A\cos(o) = 1.69, \quad A\sin(o) = 0.38 \]

Which means our model's offset is

\[ o = \tan^{-1}(0.38/1.69) = 0.22 \]

and our amplitude is

\[ A = 1.69 /\cos(0.22) = 1.73 \]
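If you'd rather not do that arithmetic by hand, something like this recovers the amplitude and offset from the fitted coefficients (the coefficients come back in the order the features went in, so the first is \(A\cos(o)\) and the second is \(A\sin(o)\)):

import numpy as np

A_cos_o, A_sin_o = harmonic_model.coef_  # coefficient on the sin feature, then on the cos feature

offset = np.arctan2(A_sin_o, A_cos_o)    # o = arctan(A sin(o) / A cos(o))
amplitude = np.hypot(A_sin_o, A_cos_o)   # A = sqrt((A sin(o))^2 + (A cos(o))^2)

print(offset, amplitude)                 # roughly 0.22 and 1.73 for the values above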

This is pretty good, as I generated the above data with an amplitude of 1.8 and an offset of 0.2 via

import numpy as np
X = np.arange(0, 1, 0.002)
Y = 1.8*np.sin(0.2 + X*4*2*np.pi)

We won't always have such an obvious cyclical pattern where we can spot the frequency by eye. We'll be looking at how to determine the appropriate frequency in these cases in a future post.

Regression VI – Polynomial Features

It’s time to talk about feature engineering. Using just the tools of linear regression, we can greatly increase the complexity of our models by creating features from our underlying data. A feature is a new predictor that we create by applying some function to our existing set of predictors. We can think of it as manipulating the data to dig out underlying patterns.

The best example is polynomial features. Suppose we have a single predictor X which is spaced evenly between 0 and 10, and a response Y. Say we plot Y against X and it looks like this

Let's try a simple linear regression of Y against X.

import numpy as np
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt

# Fit a straight line and plot it against the data
linear_model = LinearRegression().fit([[x] for x in X], Y)
linear_predictions = linear_model.predict([[x] for x in X])
plt.plot(X, Y)
plt.plot(X, linear_predictions)
plt.show()

This will create something like this

Not very good at all. So, let’s try adding a new feature. We will define this feature as the square of X, like so

# Add the square of X as a second feature
square_model = LinearRegression().fit([[x, x**2] for x in X], Y)
square_predictions = square_model.predict([[x, x**2] for x in X])
plt.plot(X, Y)
plt.plot(X, square_predictions)
plt.show()

This will give you something that looks a bit better

Although our regression is still linear, we are now fitting the closest quadratic rather than the closest straight line. This is also called polynomial regression, but I think it makes more sense to think of it as a form of feature engineering.
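As an aside, scikit-learn has a PolynomialFeatures transformer that will build these columns for us; something like this should be equivalent to writing the lists out by hand:

from sklearn.preprocessing import PolynomialFeatures

# Expand each [x] into [x, x**2], leaving out the constant column
poly = PolynomialFeatures(degree=2, include_bias=False)
squared_features = poly.fit_transform([[x] for x in X])

square_model = LinearRegression().fit(squared_features, Y)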

Now, let’s add another feature, the cube of X.

# Add the cube of X as a third feature
cubic_model = LinearRegression().fit([[x, x**2, x**3] for x in X], Y)
cubic_predictions = cubic_model.predict([[x, x**2, x**3] for x in X])
plt.plot(X, Y)
plt.plot(X, cubic_predictions)
plt.show()

Now we're getting there. If we add the fourth power of X as a feature, we will get an even better looking fit.
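Concretely, the quartic fit follows exactly the same pattern as the quadratic and cubic models above:

# Add the fourth power of X as well
quartic_model = LinearRegression().fit([[x, x**2, x**3, x**4] for x in X], Y)
quartic_predictions = quartic_model.predict([[x, x**2, x**3, x**4] for x in X])
plt.plot(X, Y)
plt.plot(X, quartic_predictions)
plt.show()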

We are not just limited to polynomial features, as we shall see. However, we shouldn't bother with linear features, as any linear term is already captured by the regression itself: when we fit, we already find the best possible linear combination of the predictors. So our features need to be non-linear in our existing predictors. A big part of building and fitting models is finding useful features like this in our data.

Regression V – Why does OLS make me so blue?

When we derived our formula for the coefficients of regression, we minimised the sum of the squares of the differences between our predicted values for \(Y\) and the true values of \(Y\). This technique is called ordinary least squares (OLS). It is sometimes characterised as the best linear unbiased estimator (BLUE); what exactly does this mean?

First, let's think about regression again. Suppose we have \(m\) samples of a set of \(n\) predictors \(X_i\) and a response \(Y\). We can say that the predictors and response are observable, because we have real measurements of their values. We hypothesise that there is an underlying linear relationship of the form

\[ Y_j = \sum_{i = 1}^n X_{i,j} \beta_i + \epsilon_j \]

where the \(\epsilon\) represent noise or error terms. We can write this in matrix form as

\[ y = X \beta + \epsilon \]

Since they are supposed to be noise, let's also assume that the \(\epsilon\) terms have expectation zero, that they all have the same finite standard deviation \(\sigma\), and that they are uncorrelated.

This is quite a strong set of assumptions: in particular, we are assuming that there is a real linear relationship between our predictors and response that is constant over all our samples, and that everything else is just uncorrelated random noise. (We're also going to quietly assume that our predictors are linearly independent, so that \(X\) has full rank.)

This can be a little unintuitive, because normally we think of regression as knowing our predictors and using those to estimate our response. Here, instead, we know a set of values for our predictors and response and use those to estimate the underlying linear relationship between them. More concretely, the vector \(\beta\) is unobservable, as we cannot measure it directly, and when we "do" a regression we are estimating its value. The ordinary least squares estimate of \(\beta\) is given by

\[ \hat{\beta} = (X^TX)^{-1}X^Ty \]

Usually we don't bother to distinguish between \(\hat{\beta}\) and \(\beta\), but here the difference matters.
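To see the estimator in action, here is a small numpy sketch, with a made-up design matrix and a made-up true \(\beta\), that computes \(\hat{\beta}\) straight from the formula and checks it against scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up example: 200 samples of 3 predictors and a known "true" beta
X = rng.normal(size=(200, 3))
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(0, 0.1, size=200)

# OLS estimate straight from the formula
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The same thing via scikit-learn (no intercept, to match the formula)
sk_beta = LinearRegression(fit_intercept=False).fit(X, y).coef_

print(beta_hat)
print(sk_beta)   # the two should agree to numerical precision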

OLS is Linear and Unbiased

First of all, what makes an estimator linear? Well, an estimator is linear if it is a linear combination of \(y\), or equivalently, if it can be written as a matrix multiplied by \(y\), which we can see is true of \(\hat{\beta}\).

Now, what is bias? An estimator is unbiased when its expected value is equal to the underlying value it is estimating. We can see that this is true for our ordinary least squares estimate by using our formula for \(y\) and the linearity of expectation,

\[ \mathbb{E}(\hat{\beta}) = \mathbb{E}((X^TX)^{-1}X^Ty) = \mathbb{E}((X^TX)^{-1}X^TX \beta) + \mathbb{E}((X^TX)^{-1}X^T \epsilon) \]

and then remembering that the expectation of the \(\epsilon\) is zero, which gives

\[ \mathbb{E}(\hat{\beta})= \mathbb{E}((X^TX)^{-1}X^TX \beta) = \mathbb{E}(\beta) = \beta \]

So, now we know that, \(\hat{\beta}\) is an unbiased linear estimator of \(\beta\).

OLS is Best

How do we compare different unbiased linear estimators of \(\beta\)? Well, all unbiased estimators have the same expectation, so the one with the lowest variance should, in some sense, be best. It is important conceptually to understand that we are talking about the variance of an estimator of \(\beta\), that is, how far the estimator typically strays from \(\beta\).

Now, our estimators are vectors, so there is not a single number representing their variance; we have to look at the whole covariance matrix, not just a single variance term. We say that the variance of the estimator \(\gamma\) is lower than that of \(\gamma^{\prime}\) if the matrix

\[ \text{Var}(\gamma^{\prime}) - \text{Var}(\gamma) \]

is positive semidefinite, or equivalently (by the definition of positive semidefiniteness), if for all vectors \(c\) we have

\[ \text{Var}(c^T\gamma) \leq \text{Var}(c^T\gamma^{\prime}) \]

First, let’s derive the covariance matrix for our estimator \(\hat{\beta}\). We have

\[ \text{Var}(\hat{\beta}) = \text{Var}((X^TX)^{-1}X^TY) = \text{Var}((X^TX)^{-1}X^T(X\beta + \epsilon)) \]

By the usual properties of variance, and because the \((X^TX)^{-1}X^TX\beta = \beta\) term is a constant and so contributes no variance or covariance, this is equal to

\[ \text{Var}((X^TX)^{-1}X^T\epsilon) = ((X^TX)^{-1}X^T) \text{Var}(\epsilon) ((X^TX)^{-1}X^T)^T \]

which equals,

\[ \sigma^2 (X^TX)^{-1}X^TX (X^TX)^{-1} = \sigma^2 (X^TX)^{-1} \]
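We can check this formula (and the unbiasedness we showed earlier) with a quick simulation. With a made-up design matrix, a known \(\beta\) and a known \(\sigma\), the average of \(\hat{\beta}\) over many noise draws should be close to \(\beta\), and the empirical covariance of the estimates should be close to \(\sigma^2(X^TX)^{-1}\):

import numpy as np

rng = np.random.default_rng(1)

# Fixed design matrix, true coefficients and noise level
X = rng.normal(size=(50, 2))
beta = np.array([1.0, -2.0])
sigma = 0.3

# Precompute the OLS "hat" matrix (X^T X)^{-1} X^T
XtX_inv_Xt = np.linalg.inv(X.T @ X) @ X.T

# Re-draw the noise many times and re-estimate beta each time
estimates = []
for _ in range(20000):
    y = X @ beta + rng.normal(0, sigma, size=50)
    estimates.append(XtX_inv_Xt @ y)
estimates = np.array(estimates)

print(estimates.mean(axis=0))              # should be close to beta
print(np.cov(estimates, rowvar=False))     # should be close to...
print(sigma**2 * np.linalg.inv(X.T @ X))   # ...this theoretical covariance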

Now, let's compare \(\hat{\beta}\) to an arbitrary unbiased linear estimator. That is, suppose we are estimating \(\beta\) with some other linear combination of \(y\), given by \(Cy\) for some matrix \(C\), with

\[ \mathbb{E}(Cy) = \beta \]

Now, let's look at the covariance matrix of \(Cy\). We are going to use a trick here: first we define the matrix \(D\) as \(C - (X^TX)^{-1}X^T\), then we have

\[ Cy = Dy + (X^TX)^{-1}X^Ty = Dy + \hat{\beta} \]

Now, if we take the expectation of this, we have

\[ \mathbb{E}(Cy) = \mathbb{E}(Dy + \hat{\beta}) = \mathbb{E}(Dy) + \beta \]

and expanding that expectation, we have

\[ \mathbb{E}(Dy) = \mathbb{E}(D(X\beta + \epsilon)) = DX\beta \]

putting this all together we have,

\[ DX\beta + \beta = \mathbb{E}(Cy) = \beta \]

So, that gives us \(DX\beta = 0\), and since this must hold whatever the true value of \(\beta\) is, we get \(DX = 0\). This might not seem very helpful right now, but let's look at the covariance matrix of \(Cy\).

\[ \text{Var}(Cy) = C \text{Var}(y) C^T = C \text{Var}(X\beta + \epsilon) C^T \]

Now, by our assumptions about \(\epsilon\), and because \(X\) and \(\beta\) are constants and so have no variance, this is equal to

\[ \sigma^2 C C^T = \sigma^2(D + (X^TX)^{-1}X^T)(D +(X^TX)^{-1}X^T)^T \]

distributing the transpose this is

\[ \sigma^2 (D + (X^TX)^{-1}X^T) (D^T + X(X^TX)^{-1}) \]

writing this out in full we have

\[ \sigma^2(DD^T + DX(X^TX)^{-1} + (X^TX)^{-1}X^TD^T + (X^TX)^{-1}X^TX(X^TX)^{-1}) \]

using our result that \(DX = 0\), which also makes the transposed cross term vanish, and simplifying \((X^TX)^{-1}X^TX(X^TX)^{-1}\) to \((X^TX)^{-1}\), we get

\[ \sigma^2 DD^T + \sigma^2(X^TX)^{-1} = \sigma^2 DD^T + \text{Var}(\hat{\beta}) \]

We can rearrange the above as

\[ \text{Var}(Cy) - \text{Var}(\hat{\beta}) = \sigma^2 DD^T \]

and a matrix of the form \(DD^T\) is always positive semidefinite, since for any vector \(c\) we have \(c^TDD^Tc = (D^Tc)^T(D^Tc) = \|D^Tc\|^2 \geq 0\). So we have shown that the variance of our arbitrary unbiased linear estimator is at least as great as that of \(\hat{\beta}\).

So, the result of all this is that we have a pretty good theoretical justification for using OLS in regression! However, it does not mean that OLS is always the right choice. In some ways an unbiased estimator is the correct estimator for \(\beta\), but sometimes there are other things to consider, and we are actually quite happy to accept a little bias. We will see such estimators when we look at feature selection.