Tutorial 08: Floating point computations and least squares
Finn Lindgren
Source: vignettes/Tutorial08.Rmd
Introduction
In this lab session you will explore

- Speed of matrix computations
- Floating point accuracy in least squares estimation of statistical linear models
- Open your github repository clone project from Lab 2 or 4 and upgrade the `StatCompLab` package
- During this lab, you can work in a `.R` file, but working with code chunks in a `.Rmd` file is highly recommended.
For measuring relative running times of different methods, we’ll use the `bench` package. You won’t need to load the package with `library()`. Instead, check if `bench::mark()` runs without errors. If it doesn’t, and it’s because the package is not installed, then install it with `install.packages("bench")` in the R Console.
Suggested code for the setup code chunk:
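The setup chunk itself isn’t reproduced here. A minimal sketch, assuming the packages used later in the tutorial (`tidyverse` for `%>%`, `mutate()`, and `ggplot2`, plus `StatCompLab`), might look like this:

```r
# Suggested setup (a sketch; adjust to the packages you actually use)
library(StatCompLab)
library(tidyverse) # provides dplyr (%>%, mutate) and ggplot2 used below
knitr::opts_chunk$set(echo = TRUE)
```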
Solution:

The accompanying `Tutorial08Solutions` tutorial/vignette documents contain the solutions explicitly, to make it easier to review the material after the workshops.
Numerical speed
Define the following function for computing a vector norm:
vec_norm <- function(x) {
sum(x^2)^0.5
}
Try it on a simple vector where you can verify that the result is
correct, e.g.
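The original check isn’t shown here; one possible version, consistent with the printed output below, compares `vec_norm()` with the equivalent `sqrt(sum(x^2))` computation:

```r
x <- c(1, 2, 3)
# Both entries should agree if vec_norm() is correct
c(vec_norm(x), sqrt(sum(x^2)))
```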
## [1] 3.741657 3.741657
We will now take a look at the computational speed of matrix operations.
First construct a vector $b$ and a matrix $A$:
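The construction chunk isn’t shown here; a sketch, assuming a problem size of 1000 (the dimensions used in the original may differ):

```r
n <- 1000
A <- matrix(rnorm(n * n), n, n) # n-by-n matrix of standard normal values
b <- rnorm(n)                   # vector of length n
```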
Let’s check the speed difference between `t(A) %*% b` and `t(b %*% A)` (convince yourself that the two expressions would be equivalent in exact arithmetic, and possibly also in floating point arithmetic). The function `t()` constructs the transpose of a matrix. Note that when `b` is a vector (as opposed to a single-column matrix) it is interpreted as either a column vector or a row vector, depending on the operation. When `b` is a vector, `as.matrix(b)` is always a single-column matrix and `t(b)` is a single-row matrix, but `b %*% A` is interpreted as `t(b) %*% A`.
Which method for computing $A^\top b$ is faster? Use `bench::mark()` to compare the different ways of evaluating the expression. By extracting the variables `expression`, `median`, and `itr/sec` from the `bench::mark()` output using the `select()` function, we can summarise the results in a table. To save typing, define a function `bench_table` that does the `kable` printing:
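A minimal sketch of such a helper, assuming `dplyr` is loaded for `select()` (the definition used in the course materials may differ):

```r
bench_table <- function(bm) {
  bm %>%
    select(expression, median, `itr/sec`) %>%
    knitr::kable()
}
```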
For products of several matrix and vector factors, such as $A B b$, the order in which the multiplications are carried out is important for speed. Define these new variables:
m <- 1000
n <- 1000
p <- 1000
A <- matrix(rnorm(m * n), m, n)
B <- matrix(rnorm(n * p), n, p)
b <- rnorm(p)
Compare the running times of `A %*% B %*% b`, `(A %*% B) %*% b`, and `A %*% (B %*% b)`. The three versions should produce very similar numerical results (`bench::mark` will complain if they don’t). Which method is fastest? Why? Are the computed results the same?
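A sketch of the timing comparison, using the `bench_table()` helper assumed above:

```r
bench_table(bench::mark(
  A %*% B %*% b,
  (A %*% B) %*% b,
  A %*% (B %*% b)
))
```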
For a square matrix $A$ of size $n \times n$, compare the cost of `solve(A) %*% b` (compute the matrix inverse $A^{-1}$, then multiply it with $b$) against the cost of `solve(A, b)` (solve the linear system $Ax = b$).
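A sketch of this comparison; `check = FALSE` is needed because `solve(A) %*% b` returns a single-column matrix while `solve(A, b)` returns a plain vector, so the results differ in shape even when the values agree closely:

```r
bench_table(bench::mark(
  solve(A) %*% b, # compute the inverse, then multiply
  solve(A, b),    # solve the linear system directly
  check = FALSE   # results differ in shape (matrix vs vector)
))
```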
Numerical Least Squares estimation
You’ll now investigate some numerical issues that arise in numerical least squares estimation. Run the following code, which generates synthetic observations from a linear model where $y_i = (x_i - 100.5)/5 + (x_i - 100.5)^2 + e_i$, with $e_i \sim \mathsf{N}(0, 0.1^2)$ independent, for $i = 1, \dots, 100$.
## Set the "seed" for the random number sequences, so we can
## reproduce exactly the same random numbers every time we run the code
set.seed(1)
## Simulate the data
# x: 100 random numbers between 100 and 101
data <- data.frame(x = 100 + sort(runif(100)))
# y: random variation around a quadratic function:
data <- data %>%
mutate(y = (x - 100.5) / 5 + (x - 100.5)^2 + rnorm(n(), sd = 0.1))
Let’s plot the data we have stored in the `data.frame`:
ggplot(data) + geom_point(aes(x, y))
What are the true values of $\beta_1$, $\beta_2$, and $\beta_3$ in the standardised model formulation $y_i = \beta_1 + \beta_2 x_i + \beta_3 x_i^2 + e_i$? Hint: Identify the values by matching the terms when expanding the initial model definition above.
Create a `beta_true` vector containing the true values $(\beta_1, \beta_2, \beta_3)$:
Use these $\beta$-values to plot the true quadratic model predictor function as a function of $x$, together with the observations and the estimate provided by `lm()`. Since the observed $x$-values are relatively dense, we don’t necessarily need a special plot sequence of $x$-values, but can reuse the observed values; the model predictions for those values can be obtained with the `fitted()` method. In the `lm()` formula specification, use `+ I(x^2)` for the quadratic term. This tells `lm()` to first compute `x^2`, and then use the result as a new covariate. The `I()` method can be interpreted as “use the computed value of this instead of other possible special meanings of the expression”.
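A sketch of what this could look like, assuming the `beta_true` vector from the previous task and calling the fitted model `mod1`:

```r
mod1 <- lm(y ~ x + I(x^2), data = data)
ggplot(data, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = beta_true[1] + beta_true[2] * x + beta_true[3] * x^2,
                colour = "True")) +
  geom_line(aes(y = fitted(mod1), colour = "Estimated"))
```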
Use the `model.matrix(mod1)` function to extract the model matrix $X$ for the vector formulation of the model, $y = X \beta + e$, and store it in `X1` (see `?model.matrix` for more information). Then use the direct normal equations solve shown in Lecture 8, $\widehat{\beta} = (X^\top X)^{-1} X^\top y$, to estimate the parameters. Does it work?
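A sketch of the normal equations approach (assumed object names; whether the solve succeeds is part of the question):

```r
X1 <- model.matrix(mod1)
# Solve (X^T X) beta = X^T y directly; this may fail or be inaccurate
# when X^T X is very ill-conditioned
beta_hat1 <- solve(t(X1) %*% X1, t(X1) %*% data$y)
```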
Shifted covariate
What if the whole model was shifted to the right, so that we were given shifted covariate values instead of the original $x$-values? The model expression would still be able to represent the same quadratic function, but with different $\beta$ values. Create a new data set:
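The size of the shift isn’t shown here; as a purely illustrative sketch (the shift of 1000 is a hypothetical value, not taken from the original), the new data set could be created with:

```r
# Hypothetical shift amount, for illustration only
data2 <- data %>% mutate(x = x + 1000)
```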
Estimate the model using `lm()` for this new data set. Does it work?

The default for `lm()` is to allow singular models, and to “fix” the problem by removing co-linear columns of the model matrix. We can ask it to disallow singular models with the `singular.ok` argument.
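A sketch of the two calls, assuming the shifted data set is called `data2`:

```r
# Default behaviour: lm() silently drops a (near) co-linear column
mod2 <- lm(y ~ x + I(x^2), data = data2)
# Disallow singular model matrices; wrapped in try() since this should fail
mod2_strict <- try(lm(y ~ x + I(x^2), data = data2, singular.ok = FALSE))
```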
Define a function `cond_nr()` taking a matrix `X` as input and returning its condition number, computed with the help of `svd` (see Lecture 8 for a code example for the condition number).
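One possible implementation (a sketch; Lecture 8 may define it slightly differently):

```r
cond_nr <- function(X) {
  # The condition number is the ratio of the largest to the
  # smallest singular value of X
  d <- svd(X)$d
  max(d) / min(d)
}
```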
With the help of `model.matrix`, examine the condition numbers of the model matrices in the two cases from the previous tasks.
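A sketch of this check, assuming the two fitted models are called `mod1` and `mod2`:

```r
X1 <- model.matrix(mod1)
X2 <- model.matrix(mod2)
c(cond_nr(X1), cond_nr(X2))
```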
`lm()` computes the fit using the QR decomposition approach, not by direct solution of the normal equations. Why was the second `lm()` fit so bad?
Plot the second and third columns of the model matrix `X2` against each other, and use `cor` to examine their correlation (see `?cor`). How does this explain the large condition number?
cor(X2[, 2:3])
ggplot() +
geom_point(aes(X2[, 2], X2[, 3])) +
ggtitle(paste("Correlation =", cor(X2[,2], X2[,3])))
## Very close to co-linear
Since the linear model says simply that the expected value vector lies in the space spanned by the columns of $X$, one possibility is to attempt to arrive at a better conditioned $X$ by linear rescaling and/or recombination of its columns. This is always equivalent to a linear re-parameterization.
Try this on the model matrix of the model that causes `lm()` to fail. In particular, for each column (except the intercept column) subtract the column mean (e.g., try the `sweep()` and `colMeans()` functions; see the example code below). Then divide each column (except the intercept column) by its standard deviation. Find the condition number of the new model matrix. Fit the model with this model matrix using something like
mod3 <- lm(y ~ x1 + x2 + x3 - 1, data = data3)
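The centring and scaling leading up to that call might look like the following sketch (assumed names: `X2` is the problematic model matrix and `data2` the shifted data):

```r
X3 <- X2
# Subtract the column means of all columns except the intercept column
X3[, -1] <- sweep(X3[, -1], 2, colMeans(X3[, -1]), "-")
# Divide those columns by their standard deviations
X3[, -1] <- sweep(X3[, -1], 2, apply(X3[, -1], 2, sd), "/")
# Collect the transformed columns and the response in a new data frame
data3 <- data.frame(X3, y = data2$y)
colnames(data3) <- c("x1", "x2", "x3", "y")
```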
Produce a plot that confirms that the resulting fit is sensible now.
Note: If the data is split into estimation and test sets, the same transformation must be applied to the test data, using the scaling parameters derived from the estimation data. Therefore the `scale()` function isn’t very useful for this. Instead, explicitly computing and storing the transformation information is necessary.
Compute the condition numbers for `X2` and `X3`. Also compute the correlation between columns 2 and 3 of `X3` (subtract $1$ to see how close to $1$ it is).
Did subtracting and rescaling help?
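A sketch of these checks, using the assumed names from above:

```r
c(cond_nr(X2), cond_nr(X3))
# Correlation between the transformed covariate columns; subtracting 1
# shows how close to 1 it still is
cor(X3[, 2], X3[, 3]) - 1
```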
An alternative fix is to subtract the mean `x` value from the original `x` vector before defining the quadratic model. Try this and see what happens to the condition number and column correlations of the model matrix. First, construct the modified data:
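A sketch of the construction (here the mean is subtracted from the shifted covariate in `data2`; adapt if you want to start from the original `data` instead):

```r
data4 <- data2 %>% mutate(x = x - mean(x))
```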
Different methods for modifying the inputs and model definitions require different calculations to compensate in the model parameters, and these have to be figured out for each case separately. The most common effect is to change the interpretation of the intercept parameter.