Linear Regression Simulation Experiments

Let's generate some data using $y = \theta^T x + N(0, \sigma^2)$.

In [1]:
from numpy import *
from matplotlib.pyplot import *
import scipy.stats

def plot_line(theta, x_min, x_max, color, label):
    x_grid_raw = arange(x_min, x_max, 0.01)
    x_grid = vstack((    ones_like(x_grid_raw),
                         x_grid_raw,
                    ))
    y_grid = dot(theta, x_grid)
    plot(x_grid[1,:], y_grid, color, label=label)


def gen_lin_data_1d(theta, N, sigma):

    #####################################################
    # Actual data
    x_raw = 100*(random.random((N))-.5)
    
    x = vstack((    ones_like(x_raw),
                    x_raw,
                    ))
                
    y = dot(theta, x) + scipy.stats.norm.rvs(scale= sigma,size=N)
    
    
    plot(x[1,:], y, "ro", label = "Training set")
    #####################################################
    # Actual generating process
    #
    plot_line(theta, -70, 70, "b", "Actual generating process")
    
    
    #######################################################
    # Least squares solution
    #
    
    theta_hat = dot(linalg.inv(dot(x, x.T)), dot(x, y.T))
    plot_line(theta_hat, -70, 70, "g", "Maximum Likelihood Solution")
    

    legend(loc = 1)
    xlim([-70, 70])
    ylim([-100, 100])
/usr/lib/python2.7/dist-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
In [2]:
%matplotlib inline
theta = array([-3, 1.5])
N = 10
sigma = 20.2
In [3]:
gen_lin_data_1d(theta, N, sigma)
In [4]:
gen_lin_data_1d(theta, N, sigma)
In [5]:
gen_lin_data_1d(theta, N, sigma)

As you can see, the Maximum Likelihood solution (aka the Least Squares solution) will be different every time.

Now, let's see what happens when the $\sigma^2$ is very large:

In [6]:
theta = array([-3, 1.5])
N = 10
sigma = 120.2
In [7]:
gen_lin_data_1d(theta, N, sigma)
In [8]:
gen_lin_data_1d(theta, N, sigma)
In [9]:
gen_lin_data_1d(theta, N, sigma)

The solution can be very far away from the actual generating process. This will substantially decrease the test likelihood (increase the test error.)

We can fix this if we increase N

In [10]:
theta = array([-3, 1.5])
N = 100
sigma = 120.2
In [11]:
gen_lin_data_1d(theta, N, sigma)
In [12]:
gen_lin_data_1d(theta, N, sigma)

The magnitude of the coefficients as an indication of overfitting

Let's try an experiment: generate data points, and fit a $\theta$. If the true $\theta$ is small-ish, then in general we will overestimate its magnitude.

In [13]:
def compare_theta_theta_hat(N, ndim, sigma_theta, sigma):
    theta = scipy.stats.norm.rvs(scale=sigma_theta,size=ndim+1)
    
    x_raw = 100*(random.random((N, ndim))-.5)
    
    x = hstack((    ones((N, 1)),
                    x_raw,
                    ))
                
    y = dot(x, theta) + scipy.stats.norm.rvs(scale= sigma,size=N)
    
    
    theta_hat = dot(linalg.pinv(dot(x.T, x)), dot(x.T, y))
    
    return theta[1], theta_hat[1]
In [14]:
N = 10
ndim = 5
sigma_theta = 1
sigma = 100
In [15]:
compare_theta_theta_hat(N, ndim, sigma_theta, sigma) 
Out[15]:
(0.32968922348457841, -2.6007613728264438)
In [16]:
compare_theta_theta_hat(N, ndim, sigma_theta, sigma)  
Out[16]:
(0.19494420968828235, 1.263294254668331)
In [17]:
compare_theta_theta_hat(N, ndim, sigma_theta, sigma) 
Out[17]:
(1.3894750256536736, 6.4783304381666635)
In [18]:
compare_theta_theta_hat(N, ndim, sigma_theta, sigma) 
Out[18]:
(-0.44502106389239426, -1.4849140682912148)

There is overfitting: if there weren't overfitting, we would not be systematically getting larger magnitudes for $\hat{\theta}_1$

N = 1000 ndim = 5 sigma_theta = 1 sigma = 100

In [19]:
compare_theta_theta_hat(N, ndim, sigma_theta, sigma) 
Out[19]:
(-1.1833775601489724, 3.0628772648894271)
In [20]:
compare_theta_theta_hat(N, ndim, sigma_theta, sigma) 
Out[20]:
(0.18350521708576811, -4.7498116712053378)
In [21]:
compare_theta_theta_hat(N, ndim, sigma_theta, sigma) 
Out[21]:
(0.25622461004237274, -2.3578648254444459)

With a larger N, we are not overfitting as much.

For every dimension of $\theta$, there is a chance of a lot of overfitting. If we have lots of dimensions, there is an increased chance of overfitting really badly

In [22]:
def compare_theta_theta_hat_Linf(N, ndim, sigma_theta, sigma):
    theta = scipy.stats.norm.rvs(scale=sigma_theta,size=ndim+1)
    
    x_raw = 100*(random.random((N, ndim))-.5)
    
    x = hstack((    ones((N, 1)),
                    x_raw,
                    ))
                
    y = dot(x, theta) + scipy.stats.norm.rvs(scale= sigma,size=N)
    
    
    theta_hat = dot(linalg.pinv(dot(x.T, x)), dot(x.T, y))
    
    return max(theta), max(theta_hat)
In [23]:
N = 100
ndim = 50
sigma_theta = 1
sigma = 100
compare_theta_theta_hat_Linf(N, ndim, sigma_theta, sigma)
Out[23]:
(1.6484436575435313, 32.217429485235336)
In [24]:
compare_theta_theta_hat_Linf(N, ndim, sigma_theta, sigma)
Out[24]:
(2.5743079770873196, 13.274256131200596)
In [25]:
compare_theta_theta_hat_Linf(N, ndim, sigma_theta, sigma)
Out[25]:
(1.9004612299177268, 5.2233432356286835)
In [26]:
compare_theta_theta_hat_Linf(N, ndim, sigma_theta, sigma)
Out[26]:
(2.1954309632409434, 20.124153029193067)
In [27]:
compare_theta_theta_hat_Linf(N, ndim, sigma_theta, sigma)
Out[27]:
(2.3584164763495186, 2.6672059730981106)
In [28]:
compare_theta_theta_hat_Linf(N, ndim, sigma_theta, sigma)
Out[28]:
(4.0618106925075299, 3.7817703395917412)