StatComp Project 1 (2022/23): Simulation and sampling
Finn Lindgren
Source: vignettes/Project01.Rmd
This document can be accessed in html form on https://finnlindgren.github.io/StatCompLab/, as a vignette in StatCompLab version 21.4.1 or later, with vignette("Project01", "StatCompLab"), or in pdf format on Learn.
Use the template material in the zip file project01.zip in Learn to write your report. Add all your function definitions to the code.R file and write your report using report.Rmd. You must upload the following three files as part of this assignment: code.R, report.html, report.Rmd. Specific instructions for these files are in the README.md file.
The main text in your report should be a coherent presentation of theory and discussion of methods and results, showing code for code chunks that perform computations and analysis but not code for code chunks that generate figures or tables.
Use the echo=TRUE and echo=FALSE chunk options to control what code is visible.
The styler package addin is useful for restyling code for better and consistent readability. It works for both .R and .Rmd files.
The Project01Hints file contains some useful tips and CWmarking contains marking guidelines. Both are attached in Learn as pdf files.
Submission should be done through Gradescope.
Confidence interval approximation assessment
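As in Lab 4, consider the Poisson model for observations $\mathbf{y} = \{y_1, \dots, y_n\}$:
$$y_i \sim \mathsf{Poisson}(\lambda), \quad \text{independent for } i = 1, \dots, n,$$
that has joint probability mass function
$$p(\mathbf{y} \mid \lambda) = \exp(-n\lambda) \prod_{i=1}^n \frac{\lambda^{y_i}}{y_i!}.$$
In Lab 4, one of the considered parameterisation alternatives was
- $\theta = \lambda$, and $\widehat{\theta}_{\text{ML}} = \frac{1}{n} \sum_{i=1}^n y_i = \overline{y}$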
Create a function called estimate_coverage (in code.R; also document the function) to perform interval coverage estimation taking arguments CI (a function object for confidence interval construction taking arguments y and alpha and returning a 2-element vector [see Lab 4]), N (the number of simulation replications to use for the coverage estimate), alpha (1-alpha is the intended coverage probability), n (the sample size) and lambda (the true lambda values for the Poisson model).
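The coverage estimation idea can be sketched as follows. This is only an illustration, not the required implementation: the names CI_example and estimate_coverage_sketch are hypothetical, and the interval construction is a simple Normal approximation chosen here for brevity, which may differ from the one used in Lab 4.

```r
# Hypothetical interval construction, for illustration only: a Normal
# approximation interval for lambda centred on the sample mean.
CI_example <- function(y, alpha) {
  n <- length(y)
  lambda_hat <- mean(y)
  lambda_hat + c(-1, 1) * qnorm(1 - alpha / 2) * sqrt(lambda_hat / n)
}

# Sketch of a coverage estimator: simulate N samples of size n, build an
# interval from each sample, and record how often the interval contains
# the true lambda.
estimate_coverage_sketch <- function(CI, N, alpha, n, lambda) {
  covered <- vapply(
    seq_len(N),
    function(k) {
      y <- rpois(n, lambda)
      ci <- CI(y, alpha)
      ci[1] <= lambda && lambda <= ci[2]
    },
    logical(1)
  )
  # The proportion of intervals that covered the true value
  mean(covered)
}

set.seed(1)
estimate_coverage_sketch(CI_example, N = 2000, alpha = 0.1, n = 50, lambda = 10)
```

The values N = 2000, n = 50 and lambda = 10 are arbitrary choices for the demonstration. The Monte Carlo error of the estimate shrinks as N grows, so small differences between settings should be read with that in mind.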
Use your function estimate_coverage to estimate the coverage of the confidence interval construction for samples from the Poisson model, and discuss the following:
- Since the model involves the discrete Poisson distribution, one might ask how sensitive these results are to the precise values of $\lambda$ and $n$. To investigate this, run the coverage estimation for different combinations of the model parameters $\lambda$ and $n$ (fix $N$ and $\alpha$).
- Present your estimated coverage results in two plots, (1) as a function of $\lambda$ for fixed $n$, and (2) as a function of $n$ for fixed $\lambda$.
- Discuss whether the plots show that the coverage of the intervals achieves the desired 90% confidence level. If it does not, identify the cases where it fails and suggest why.
3D printer materials prediction
The aim is to estimate the parameters of a Bayesian statistical model of material use in a 3D printer. The printer uses rolls of filament that get heated and squeezed through a moving nozzle, gradually building objects. The objects are first designed in a CAD program (Computer Aided Design) that also estimates how much material will be required to print the object.
The data can be loaded with data("filament1", package = "StatCompLab"), and contains information about one 3D-printed object per row. The columns are
- Index: an observation index
- Date: printing dates
- Material: the printing material, identified by its colour
- CAD_Weight: the object weight (in grams) that the CAD software calculated
- Actual_Weight: the actual weight of the object (in grams) after printing
If the CAD system and printer were both perfect, the CAD_Weight and Actual_Weight values would be equal for each object. In reality, there are random variations, for example, due to varying humidity and temperature, and systematic deviations due to the CAD system not having perfect information about the properties of the printing materials.
When looking at the data (see below) it's clear that the variability of the data is larger for larger values of CAD_Weight. The printer operator has made a simple physics analysis, and settled on a model where the connection between CAD_Weight and Actual_Weight follows a linear model, and the variance increases with the square of CAD_Weight. If we denote the CAD weight for observation $i$ by $x_i$, and the corresponding actual weight by $y_i$, the model can be defined by
$$y_i \sim \mathsf{Normal}[\beta_1 + \beta_2 x_i, \beta_3 + \beta_4 x_i^2].$$
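To ensure positivity of the variance, the parameterisation $\theta = [\theta_1, \theta_2, \theta_3, \theta_4] = [\beta_1, \beta_2, \log(\beta_3), \log(\beta_4)]$ is introduced, and the printer operator assigns independent prior distributions as follows:
$$\begin{aligned}
\theta_1 &\sim \mathsf{Normal}(0, \gamma_1), \\
\theta_2 &\sim \mathsf{Normal}(1, \gamma_2), \\
\theta_3 &\sim \mathsf{LogExp}(\gamma_3), \\
\theta_4 &\sim \mathsf{LogExp}(\gamma_4),
\end{aligned}$$
where $\mathsf{LogExp}(a)$ denotes the logarithm of an exponentially distributed random variable with rate parameter $a$, as seen in Tutorial 4. The $\gamma = (\gamma_1, \gamma_2, \gamma_3, \gamma_4)$ values are positive parameters.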
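For reference, the density of the logarithm of an exponentially distributed random variable follows from a change of variables. The symbols $Z$, $X$ and $a$ below are introduced only for this short derivation. If $Z$ is exponentially distributed with rate $a > 0$ and $X = \log(Z)$, then $\mathsf{P}(X \le x) = \mathsf{P}(Z \le e^x) = 1 - \exp(-a e^x)$, and differentiating with respect to $x$ gives

$$p_X(x) = a \, e^x \exp(-a e^x), \qquad \log p_X(x) = \log(a) + x - a e^x, \qquad x \in \mathbb{R}.$$

The support is the whole real line, which is what makes this distribution usable as a prior for a log-transformed positive parameter.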
The printer operator reasons that random fluctuations in the material properties (such as the density) and room temperature should lead to a relative error instead of an additive error, which leads them to the model as an approximation of that. The basic physics assumption is that the error in the CAD software calculation of the weight is proportional to the weight itself.
Start by loading the data and plotting it.
Prior density
With the help of dnorm and the dlogexp function (see the code.R file for documentation), define and document (in code.R) a function log_prior_density with arguments theta and params, where theta is the $\theta$ parameter vector, and params is the vector of $\gamma$ parameters. Your function should evaluate the logarithm of the joint prior density $p(\theta)$ for the four $\theta_i$ parameters.
Observation likelihood
With the help of dnorm, define and document a function log_like, taking arguments theta, x, and y, that evaluates the observation log-likelihood $\log p(\mathbf{y} \mid \theta)$ for the model defined above.
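As a rough sketch of how such a log-likelihood can be assembled with dnorm, the code below assumes that the mean is linear in x, that the variance is a constant plus a term proportional to x squared, and that the last two elements of theta are the logarithms of the two variance coefficients. The function name log_like_sketch and all numbers are hypothetical; check the assumed form against the model definition before relying on it.

```r
# Sketch of an observation log-likelihood under the ASSUMED model
#   y_i ~ Normal(beta1 + beta2 * x_i, variance = beta3 + beta4 * x_i^2),
# with theta = (beta1, beta2, log(beta3), log(beta4)).
log_like_sketch <- function(theta, x, y) {
  mean_y <- theta[1] + theta[2] * x
  var_y <- exp(theta[3]) + exp(theta[4]) * x^2
  # dnorm is parameterised by the standard deviation, not the variance
  sum(dnorm(y, mean = mean_y, sd = sqrt(var_y), log = TRUE))
}

# Made-up values, only to show the call
x <- c(10, 20, 40)
y <- c(11, 19, 43)
log_like_sketch(theta = c(0, 1, 0, -4), x = x, y = y)
```

Summing the log-densities, rather than taking the log of a product of densities, avoids numerical underflow when there are many observations.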
Posterior density
Define and document a function log_posterior_density with arguments theta, x, y, and params, which evaluates the logarithm of the posterior density $p(\theta \mid \mathbf{y})$, apart from some unevaluated normalisation constant.
Posterior mode
Define a function posterior_mode with arguments theta_start, x, y, and params, that uses optim together with the log_posterior_density and filament data to find the mode $\mu$ of the log-posterior-density and evaluates the Hessian at the mode as well as the inverse of the negated Hessian, $S$. This function should return a list with elements mode (the posterior mode location), hessian (the Hessian of the log-density at the mode), and S (the inverse of the negated Hessian at the mode). See the documentation for optim for how to do maximisation instead of minimisation.
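The optim mechanics can be illustrated on a toy problem. The sketch below maximises a made-up two-dimensional log-density, not the filament posterior; the names Q, m and toy_log_density are hypothetical and chosen so that the correct answer is known in advance.

```r
# Toy log-density: a bivariate Normal with known mean m and precision
# matrix Q, standing in for a log-posterior-density.
Q <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
m <- c(1, -2)
toy_log_density <- function(theta) {
  d <- theta - m
  -0.5 * sum(d * (Q %*% d))
}

# control = list(fnscale = -1) makes optim maximise instead of minimise;
# hessian = TRUE requests a numerical Hessian of fn at the optimum.
opt <- optim(
  par = c(0, 0),
  fn = toy_log_density,
  control = list(fnscale = -1),
  hessian = TRUE
)

result <- list(
  mode = opt$par,
  hessian = opt$hessian,
  S = solve(-opt$hessian)
)
result
```

For this toy target the mode should be close to m and S close to the inverse of Q, which gives a simple sanity check of the approach before applying it to a real log-posterior.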
Gaussian approximation
Let all $\gamma_i = 1$, $i = 1, 2, 3, 4$, and use posterior_mode to evaluate the inverse of the negated Hessian at the mode, in order to obtain a multivariate Normal approximation $\mathsf{Normal}(\mu, S)$ to the posterior distribution for $\theta$. Use start values $\theta = 0$.
Importance sampling function
The aim is to construct a 90% Bayesian credible interval for each $\beta_j$ using importance sampling, similarly to the method used in Lab 4. There, a one-dimensional Gaussian approximation of the posterior of a parameter was used. Here, we will instead use a multivariate Normal approximation as the importance sampling distribution. The functions rmvnorm and dmvnorm in the mvtnorm package can be used to sample and evaluate densities.
Define and document a function do_importance taking arguments N (the number of samples to generate), mu (the mean vector for the importance distribution), and S (the covariance matrix), and other additional parameters that are needed by the function code.
The function should output a data.frame with five columns, beta1, beta2, beta3, beta4, log_weights, containing the $\beta_i$ samples and normalised $\log$-importance-weights, so that sum(exp(log_weights)) is $1$. Use the log_sum_exp function (see the code.R file for documentation) to compute the needed normalisation information.
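The normalisation step can be sketched as below. The helper log_sum_exp_sketch is a hypothetical stand-in written for this example; the actual log_sum_exp in code.R may differ in its details. The log-weights are made-up numbers, chosen so that exponentiating them directly would underflow.

```r
# Stand-in for the course's log_sum_exp helper (an assumption; the real
# one is documented in code.R). Computes log(sum(exp(x))) by shifting
# with max(x), so the exponentials do not all underflow.
log_sum_exp_sketch <- function(x) {
  max_x <- max(x)
  max_x + log(sum(exp(x - max_x)))
}

# Made-up unnormalised log-weights, too negative to exponentiate directly
log_w <- c(-1000, -1001, -1003)
exp(log_w) # all three underflow to 0

# Normalise on the log scale, so that the weights sum to 1
log_weights <- log_w - log_sum_exp_sketch(log_w)
sum(exp(log_weights))
```

Working on the log scale throughout matters here because unnormalised posterior densities are often extremely small, and dividing by a sum that has underflowed to zero would give NaN weights.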
Importance sampling
Use your defined functions to compute an importance sample of size $N = 10000$.
Plot the empirical weighted CDFs together with the un-weighted CDFs for each parameter, with the help of stat_ewcdf, and discuss the results. To achieve simpler ggplot code, you may find pivot_longer(???, starts_with("beta")) and facet_wrap(vars(name)) useful.
Construct 90% credible intervals for each of the four model parameters, based on the importance sample. In addition to wquantile and pivot_longer, the methods group_by and summarise are helpful. You may wish to define a function make_CI taking arguments x, weights, and prob (to control the intended