What is the most parsimonious model that can capture the trends in a data set?

We’ll assume a linear relationship between two variables, \(x\) and \(y\). Imagine all the possible lines that could pass through the data set.

Starting with a linear relationship \(y = ax + b\), let’s generate a random set of candidate lines (250 random slopes \(a\) and intercepts \(b\)) and overlay them on the sim1 data set that ships with modelr.

library(tidyverse)
library(modelr)
# 250 random candidate lines: a is the slope, b is the intercept
models = tibble(
  a = runif(250, -5, 5),
  b = runif(250, -20, 40)
)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept=b, slope=a), data=models, alpha=0.25, color='red') +
  geom_point(size=3, color='blue', alpha=0.5) 

One way to pick the right one would be by eye. Just generate enough lines and pick the best one.

But that’s tedious. The better way is to get the computer to pick the best line for you. All we need to get started is to define what the “best” line means.
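For example, here is a minimal sketch of that brute-force approach, assuming the models tibble and sim1 data from above (the helper max_abs_residual is just an illustrative name): score every random line by its worst absolute residual and keep the lowest-scoring one.

# Score a candidate line by its largest absolute residual on sim1
max_abs_residual = function(a, b) {
  max(abs(sim1$y - (a * sim1$x + b)))
}

models$score = map2_dbl(models$a, models$b, max_abs_residual)
best = models[which.min(models$score), ]
print(best)

With enough random lines this gets close, but it only ever searches the lines we happened to generate; the optimization below finds the best line directly.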

We can define the quality of our fit in terms of “residuals”. A residual is the difference between the actual y-value and the y-value predicted by the line. In the figure below, the red line is a potential fit line, and the black lines are the residuals.

a = 1.5
b = 4.2

# Jitter x slightly so overlapping residual segments stay visible
sim1_tmp = sim1
sim1_tmp$x = sim1$x + runif(length(sim1$x), -1, 1)

p = ggplot(sim1_tmp, aes(x, y)) + 
  geom_abline(aes(intercept=b, slope=a), alpha=0.5, size=1, color='red') +
  geom_point(size=3, alpha=0.5, color='blue')

# Draw each residual as a vertical segment from the fit line to the point
for(i in 1:length(sim1$x)){
  p = p + geom_segment(x=sim1_tmp$x[i], y=(a*sim1_tmp$x[i] + b),
                       xend=sim1_tmp$x[i], yend=sim1_tmp$y[i],
                       color='black', size=1, alpha=0.05)
}

print(p)

The “best” fit line would somehow minimize the residuals, and this could be done in many ways. One option is to minimize the overall sum of the residuals, but that runs into the problem of positive and negative residuals canceling each other out. A line running nearly perpendicular to the “best” fit could still score well, because the positive residuals on one side would cancel the negative ones on the other. We need something different.
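To see the cancellation concretely, here is a small sketch using sim1: any line forced through the centroid of the data has a residual sum of exactly zero, no matter how wrong its slope is.

# A deliberately bad line: negative slope, forced through the data's centroid
bad_a = -2
bad_b = mean(sim1$y) - bad_a * mean(sim1$x)
bad_residuals = sim1$y - (bad_a * sim1$x + bad_b)

sum(bad_residuals)       # ~0: the raw sum hides how bad the fit is
sum(abs(bad_residuals))  # large: the fit is actually terrible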

To deal with this issue, another way to define “best” is to minimize the sum of the squared residuals (ordinary least squares), which makes every term positive. Another variation is to minimize the sum of the absolute values of the residuals, or, as we’ll do here, to minimize the “worst” residual.
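These definitions are easy to compare side by side. Here is a sketch computing all three criteria for one candidate line on sim1:

a = 1.5
b = 4.2
r = sim1$y - (a * sim1$x + b)  # residuals for this candidate line

sum(r^2)     # least squares: sum of squared residuals
sum(abs(r))  # least absolute deviations
max(abs(r))  # minimax: the "worst" residual, the one we'll minimize here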

The residual for each data point can be captured by \(r_i = y_i - \hat{y}_i\), where \(\hat{y}_i\) is the estimate for \(y_i\) given by the fit line, \(\hat{y}_i = ax_i + b\). The residual then becomes \(r_i = y_i - ax_i - b\).

Now, let’s define a variable \(z\) that represents the “worst”, or maximum, residual in absolute terms: \(z = \max_i |r_i|\). You can imagine this as two lines parallel to the fit line, \(z\) units away, as shown by the dashed lines in the plot below.

a = 1.5
b = 4.2

# z is the largest absolute residual for this candidate line
residuals = abs(sim1$y - (a*sim1$x + b))
z = max(residuals)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept=b, slope=a), alpha=0.5, size=1, color='red') +
  geom_abline(aes(intercept=b+z, slope=a), alpha=0.5, size=1, color='red', linetype='dashed') +
  geom_abline(aes(intercept=b-z, slope=a), alpha=0.5, size=1, color='red', linetype='dashed') +
  geom_point(size=3, alpha=0.5, color='blue')

You can think of the optimization problem as squeezing those dashed lines together until they are as close as possible. In other words:

\[
\begin{aligned}
\text{minimize} \quad & z \\
\text{s.t.} \quad & r_i \leq z, \quad \forall i \\
& r_i \geq -z, \quad \forall i
\end{aligned}
\]

Substituting our equation for \(r_i\) we get:

\[
\begin{aligned}
\text{minimize} \quad & z \\
\text{s.t.} \quad & y_i - ax_i - b \leq z, \quad \forall i \\
& y_i - ax_i - b \geq -z, \quad \forall i
\end{aligned}
\]

Finally, to get our optimization into standard form (which will make it easier to hand to a solver!), we move the constants to the right-hand side of the inequality constraints:

\[
\begin{aligned}
\text{minimize} \quad & z \\
\text{s.t.} \quad & -ax_i - b - z \leq -y_i, \quad \forall i \\
& -ax_i - b + z \geq -y_i, \quad \forall i
\end{aligned}
\]

Now, to do this in R, we first load lpSolveAPI and create an empty linear program (i.e., an LP, or linear optimization problem) with 3 variables (\(a\), \(b\), and \(z\)) and no constraints yet. We set the objective function to the third variable, \(z\), and tell the solver this is a minimization.

library(lpSolveAPI)

lprec = make.lp(0, 3)           # 0 constraints (for now), 3 variables: a, b, z
lp.control(lprec, sense="min")  # minimize the objective
set.objfn(lprec, c(0, 0, 1))    # objective = 0*a + 0*b + 1*z

Next we set our constraints, two for each data point as we described above. We also need to adjust the variable bounds: by default, lpSolveAPI restricts every variable to be non-negative, which is what we want for \(z\) but not for \(a\) and \(b\), so we set the lower bounds of the slope and intercept to -Inf:

for(i in 1:length(sim1$x)){
  add.constraint(lprec, c(-sim1$x[i], -1,  1), ">=", -sim1$y[i])  # -a*x_i - b + z >= -y_i
  add.constraint(lprec, c(-sim1$x[i], -1, -1), "<=", -sim1$y[i])  # -a*x_i - b - z <= -y_i
}

# Let the slope and intercept go negative; z keeps its default range [0, Inf)
set.bounds(lprec, lower = c(-Inf, -Inf), columns = c(1, 2))

Finally, we solve the LP (a return value of zero means the solver found an optimal solution) and plot the fit line.

solve(lprec)
## [1] 0
vars = get.variables(lprec)  # solution values: (a, b, z)
print(vars)
## [1] 2.205525 4.011185 4.091238
ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = vars[2], slope = vars[1]), alpha=0.5, size=1, color='red') +
  geom_point(size=3, alpha=0.5, color='blue')
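
As a quick sanity check (not part of the LP itself), you could compare the minimax line against the ordinary least-squares fit from base R’s lm(); on well-behaved data the two criteria usually land on similar, but not identical, lines.

fit_ls = lm(y ~ x, data = sim1)
coef(fit_ls)         # least-squares intercept and slope
c(vars[2], vars[1])  # minimax intercept and slope from the LP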