Tutorial 08: Floating point computations and least squares
Finn Lindgren
Source: vignettes/Tutorial08.Rmd
Introduction
In this lab session you will explore

- Speed of matrix computations
- Floating point accuracy in least squares estimation of statistical linear models
- Open your github repository clone project from Lab 2 or 4 and upgrade the `StatCompLab` package
- During this lab, you can work in a `.R` file, but working with code chunks in a `.Rmd` file is highly recommended.
For measuring relative running times of different methods, we’ll use the `bench` package. You won’t need to load the package with `library()`. Instead, check if `bench::mark()` runs without errors. If it doesn’t, and it’s because the package is not installed, then install it with `install.packages("bench")` in the R Console.
Suggested code for the setup code chunk:
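The setup chunk itself isn’t reproduced here. A minimal sketch, assuming the packages used later in the tutorial (`tidyverse` for `%>%`, `mutate()`, and `ggplot2`, plus `StatCompLab`), might look like this:

```r
# Suggested setup (a sketch; adjust to the packages you actually use)
library(StatCompLab)
library(tidyverse) # provides dplyr (%>%, mutate) and ggplot2 used below
knitr::opts_chunk$set(echo = TRUE)
```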
Solution:

The accompanying `Tutorial08Solutions` tutorial/vignette documents contain the solutions explicitly, to make it easier to review the material after the workshops.
Numerical speed
Define the following function for computing a vector norm:
vec_norm <- function(x) {
sum(x^2)^0.5
}
Try it on a simple vector where you can verify that the result is
correct, e.g.
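The original check isn’t shown here; one possible version, consistent with the printed output below, compares `vec_norm()` with the equivalent `sqrt(sum(x^2))` computation:

```r
x <- c(1, 2, 3)
# Both entries should agree if vec_norm() is correct
c(vec_norm(x), sqrt(sum(x^2)))
```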
## [1] 3.741657 3.741657
We will now take a look at the computational speed of matrix operations.
First construct a vector $b$ and a matrix $A$:
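The construction chunk isn’t shown here; a sketch, assuming a problem size of 1000 (the dimensions used in the original may differ):

```r
n <- 1000
A <- matrix(rnorm(n * n), n, n) # n-by-n matrix of standard normal values
b <- rnorm(n)                   # vector of length n
```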
Let’s check the speed difference between `t(A) %*% b` and `t(b %*% A)` (convince yourself that the two expressions would be equivalent in exact arithmetic, and possibly also in floating point arithmetic). The function `t()` constructs the transpose of a matrix. Note that when `b` is a vector (as opposed to a single-column matrix) it is interpreted as either a column vector or a row vector, depending on the operation. When `b` is a vector, `as.matrix(b)` is always a single-column matrix and `t(b)` is a single-row matrix, but `b %*% A` is interpreted as `t(b) %*% A`.
Which method for computing $A^\top b$ is faster? Use `bench::mark()` to compare the different ways of evaluating the expression. By extracting the variables `expression`, `median`, and `itr/sec` from the `bench::mark()` output using the `select()` function, we can summarise the results in a table. To save typing, define a function `bench_table` that does the `kable` printing:
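A minimal sketch of such a helper, assuming `dplyr` is loaded for `select()` (the definition used in the course materials may differ):

```r
bench_table <- function(bm) {
  bm %>%
    select(expression, median, `itr/sec`) %>%
    knitr::kable()
}
```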
For products of several matrix and vector factors, such as $A B b$, the order in which the multiplications are carried out is important for speed. Define these new variables:
m <- 1000
n <- 1000
p <- 1000
A <- matrix(rnorm(m * n), m, n)
B <- matrix(rnorm(n * p), n, p)
b <- rnorm(p)
Compare the running times of `A %*% B %*% b`, `(A %*% B) %*% b`, and `A %*% (B %*% b)`. The three versions should produce very similar numerical results (`bench::mark` will complain if they don’t). Which method is fastest? Why? Are the computed results the same?
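A sketch of the timing comparison, using the `bench_table()` helper assumed above:

```r
bench_table(bench::mark(
  A %*% B %*% b,
  (A %*% B) %*% b,
  A %*% (B %*% b)
))
```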
For a square matrix $A$ of size $n \times n$, compare the cost of `solve(A) %*% b` (compute the matrix inverse $A^{-1}$, then multiply it with $b$) against the cost of `solve(A, b)` (solve the linear system $Ax = b$).
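A sketch of this comparison; `check = FALSE` is needed because `solve(A) %*% b` returns a single-column matrix while `solve(A, b)` returns a plain vector, so the results differ in shape even when the values agree closely:

```r
bench_table(bench::mark(
  solve(A) %*% b, # compute the inverse, then multiply
  solve(A, b),    # solve the linear system directly
  check = FALSE   # results differ in shape (matrix vs vector)
))
```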
Numerical Least Squares estimation
You’ll now investigate some numerical issues that arise in numerical least squares estimation. Run the following code, which generates synthetic observations from a linear model where $y_i = (x_i - 100.5)/5 + (x_i - 100.5)^2 + e_i$, with $e_i \sim \mathsf{N}(0, 0.1^2)$ independent, for $i = 1, \dots, 100$.
## Set the "seed" for the random number sequences, so we can
## reproduce exactly the same random numbers every time we run the code
set.seed(1)
## Simulate the data
# x: 100 random numbers between 100 and 101
data <- data.frame(x = 100 + sort(runif(100)))
# y: random variation around a quadratic function:
data <- data %>%
mutate(y = (x - 100.5) / 5 + (x - 100.5)^2 + rnorm(n(), sd = 0.1))
Let’s plot the data we have stored in the `data.frame`:
ggplot(data) + geom_point(aes(x, y))
What are the true values of $\beta_1$, $\beta_2$, and $\beta_3$ in the standardised model formulation $y_i = \beta_1 + \beta_2 x_i + \beta_3 x_i^2 + e_i$? Hint: Identify the values by matching the terms when expanding the initial model definition above.
Create a `beta_true` vector containing the true values $(\beta_1, \beta_2, \beta_3)$:
Use these $\beta$-values to plot the true quadratic model predictor function as a function of $x$, together with the observations and the estimate provided by `lm()`. Since the observed $x$-values are relatively dense, we don’t necessarily need a special plot sequence of $x$-values, but can reuse the observed values; the model predictions for those values can be obtained with the `fitted()` method. In the `lm()` formula specification, use `+ I(x^2)` for the quadratic term. This tells `lm()` to first compute `x^2`, and then use the result as a new covariate. The `I()` method can be interpreted as “use the computed value of this instead of other possible special meanings of the expression”.
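A sketch of what this could look like, assuming the `beta_true` vector from the previous task and calling the fitted model `mod1`:

```r
mod1 <- lm(y ~ x + I(x^2), data = data)
ggplot(data, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = beta_true[1] + beta_true[2] * x + beta_true[3] * x^2,
                colour = "True")) +
  geom_line(aes(y = fitted(mod1), colour = "Estimated"))
```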
Use the `model.matrix(mod1)` function to extract the model matrix $X$ for the vector formulation of the model, $y = X \beta + e$, and store it in `X1` (see `?model.matrix` for more information). Then use the direct normal equations solve shown in Lecture 8, $\widehat{\beta} = (X^\top X)^{-1} X^\top y$, to estimate the parameters. Does it work?
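A sketch of the normal equations approach (assumed object names; whether the solve succeeds is part of the question):

```r
X1 <- model.matrix(mod1)
# Solve (X^T X) beta = X^T y directly; this may fail or be inaccurate
# when X^T X is very ill-conditioned
beta_hat1 <- solve(t(X1) %*% X1, t(X1) %*% data$y)
```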
Shifted covariate
What if the whole model was shifted to the right, so that we were given shifted covariate values instead of the original $x$-values? The model expression would still be able to represent the same quadratic function, but with different $\beta$ values. Create a new data set:
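The size of the shift isn’t shown here; as a purely illustrative sketch (the shift of 1000 is a hypothetical value, not taken from the original), the new data set could be created with:

```r
# Hypothetical shift amount, for illustration only
data2 <- data %>% mutate(x = x + 1000)
```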
Estimate the model using `lm()` for this new data set. Does it work?

The default for `lm()` is to allow singular models, and to “fix” the problem by removing co-linear columns of the model matrix. We can ask it to disallow singular models with the `singular.ok` argument.
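A sketch of the two calls, assuming the shifted data set is called `data2`:

```r
# Default behaviour: lm() silently drops a (near) co-linear column
mod2 <- lm(y ~ x + I(x^2), data = data2)
# Disallow singular model matrices; wrapped in try() since this should fail
mod2_strict <- try(lm(y ~ x + I(x^2), data = data2, singular.ok = FALSE))
```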
Define a function `cond_nr()` taking a matrix `X` as input and returning its condition number, computed with the help of `svd` (see Lecture 8 for a code example for the condition number).
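One possible implementation (a sketch; Lecture 8 may define it slightly differently):

```r
cond_nr <- function(X) {
  # The condition number is the ratio of the largest to the
  # smallest singular value of X
  d <- svd(X)$d
  max(d) / min(d)
}
```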
With the help of `model.matrix`, examine the condition numbers of the model matrices in the two cases from the previous tasks.
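A sketch of this check, assuming the two fitted models are called `mod1` and `mod2`:

```r
X1 <- model.matrix(mod1)
X2 <- model.matrix(mod2)
c(cond_nr(X1), cond_nr(X2))
```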
`lm()` computes the fit using the QR decomposition approach, not by direct solution of the normal equations. Why was the second `lm()` fit so bad?
Plot the second and third columns of the model matrix `X2` against each other, and use `cor` to examine their correlation (see `?cor`). How does this explain the large condition number?
cor(X2[, 2:3])
ggplot() +
geom_point(aes(X2[, 2], X2[, 3])) +
ggtitle(paste("Correlation =", cor(X2[,2], X2[,3])))
## Very close to co-linear
Since the linear model says simply that the expected value vector lies in the space spanned by the columns of $X$, one possibility is to attempt to arrive at a better conditioned $X$ by linear rescaling and/or recombination of its columns. This is always equivalent to a linear re-parameterization.
Try this on the model matrix of the model that causes `lm()` to fail. In particular, for each column (except the intercept column) subtract the column mean (e.g., try the `sweep()` and `colMeans()` functions; see the example code below). Then divide each column (except the intercept column) by its standard deviation. Find the condition number of the new model matrix. Fit the model with this model matrix using something like
mod3 <- lm(y ~ x1 + x2 + x3 - 1, data = data3)
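The centring and scaling leading up to that call might look like the following sketch (assumed names: `X2` is the problematic model matrix and `data2` the shifted data):

```r
X3 <- X2
# Subtract the column means of all columns except the intercept column
X3[, -1] <- sweep(X3[, -1], 2, colMeans(X3[, -1]), "-")
# Divide those columns by their standard deviations
X3[, -1] <- sweep(X3[, -1], 2, apply(X3[, -1], 2, sd), "/")
# Collect the transformed columns and the response in a new data frame
data3 <- data.frame(X3, y = data2$y)
colnames(data3) <- c("x1", "x2", "x3", "y")
```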
Produce a plot that confirms that the resulting fit is sensible now.
Note: If the data is split into estimation and test sets, the same transformation must be applied to the test data, using the scaling parameters derived from the estimation data. Therefore the `scale()` function isn’t very useful for this. Instead, explicitly computing and storing the transformation information is necessary.
Compute the condition numbers for `X2` and `X3`. Also compute the correlation between columns 2 and 3 of `X3` (subtract $1$ to see how close to $1$ it is).
Did subtracting and rescaling help?
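A sketch of these checks, using the assumed names from above:

```r
c(cond_nr(X2), cond_nr(X3))
# Correlation between the transformed covariate columns; subtracting 1
# shows how close to 1 it still is
cor(X3[, 2], X3[, 3]) - 1
```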
An alternative fix is to subtract the mean `x` value from the original `x` vector before defining the quadratic model. Try this and see what happens to the condition number and column correlations of the model matrix. First, construct the modified data:
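A sketch of the construction (here the mean is subtracted from the shifted covariate in `data2`; adapt if you want to start from the original `data` instead):

```r
data4 <- data2 %>% mutate(x = x - mean(x))
```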
Different methods for modifying the inputs and model definitions require different calculations to compensate in the model parameters, and these have to be figured out for each case separately. The most common effect is to change the interpretation of the intercept parameter.