StatComp Project Example: Numerical statistics
Finn Lindgren
Source: vignettes/articles/Project_Example.Rmd
- These were instructions for this example project; the projects this year will have different instructions that you should follow instead. The instructions here are only meant to be used to understand why the solution does technical things in the specific way it does. For your own project, make sure to follow the relevant instructions instead.
- Your project repository contains a template RMarkdown document, `report.Rmd`, that you should expand with your answers and solutions for this project assignment, and two code files: `code.R`, containing some predefined functions, and `my_code.R`, that should be used for some of your own code. Do not change these filenames.
- Some of the code solutions (as indicated for those items) should be placed in the `my_code.R` file, and included in the report as indicated in the template document (also see the `RMdemo.Rmd`, `my_code.R`, and `T4sol.Rmd` document examples from the Week 4 lecture and example materials on Learn).
- Code that is placed directly into code chunks in the `report.Rmd` file should be shown in a separate section at the end of the report, so that the main text is a coherent presentation of theory and discussion of methods and results. In the code presentation section, include a brief code comment for each chunk. Also include brief documentation as code comments before each function defined in `my_code.R`.
- The `styler` package Addin for restyling code for better and consistent readability works for both `.R` and `.Rmd` files.
The assignment is submitted by pushing your work to the github repository before your submission deadline. Make sure you include your name, student number, and github username in both `report.Rmd` and `my_code.R`. When submitting, also include a generated html version of the report, with filename `report.html`. Normally, generated images are automatically included in the html file itself; otherwise they need to be added to the repository as well.
It is strongly recommended to commit and push to github frequently. If you’re uncertain about whether you’re pushing all the needed submission files to github, ask Finn to check the repository contents at least a week before the submission deadline.
The 10 sub-tasks contribute equally to the total mark, but the project will be marked as a whole. A marking guide will be posted on Learn.
Archaeology in the Baltic Sea
In Lab 3, you investigated data from an archaeological excavation on Gotland in the Baltic Sea. In the 1920s, a battle gravesite from 1361 was subject to several archaeological excavations, in which a number of left and right femurs (thigh bones) were found. We wanted to figure out how many persons were likely buried at the gravesite. It must reasonably have been at least as many as the larger of the left and right femur counts, but how many more? In this assignment, you will use importance sampling to estimate credible intervals for this problem.
In the lab, it seemed difficult to extract much information from the sample, since there were only two observations and two parameters. You will here investigate if including data from other excavations of the same type might decrease the uncertainty about the number of persons buried, through improved information about the average detection probability, $\phi$.
The function `arch_data` in `code.R` returns a data frame with data for up to four different excavations (note: only one of them is real; the other three are synthetic data generated for this assignment).
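For example, assuming `code.R` has been sourced into the R session, the data could be loaded with something like the following (the number of excavations requested here is arbitrary):

```r
# Load the provided helper functions, including arch_data()
source("code.R")
# Data frame with left/right femur counts for the first two excavations
Y <- arch_data(2)
```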
We assume the same model as in the lab for each excavation, with an unknown $N_j$, $j = 1, \dots, J$, for each of the $J$ excavations, and a common detection probability $\phi$. We assume the model that, conditionally on $N = (N_1, \dots, N_J)$ and $\phi$,
$$
Y_{j,i} \mid N_j, \phi \sim \mathrm{Binomial}(N_j, \phi), \quad i = 1, 2, \quad j = 1, \dots, J,
$$
all independent, and prior distributions
$$
N_j \sim \mathrm{Geom}(\xi), \quad p_N(n) = \mathsf{P}(N_j = n) = \xi (1 - \xi)^n, \quad n = 0, 1, 2, \dots,
$$
and
$$
\phi \sim \mathrm{Beta}(a, b), \quad p_\phi(\phi) = \frac{\phi^{a-1} (1 - \phi)^{b-1}}{B(a, b)}, \quad \phi \in (0, 1),
$$
where $B(a, b)$ is the Beta function, see `?beta` in R.
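To make the model concrete, here is a minimal simulation sketch of synthetic data from this model; the values of `J`, `xi`, `a`, and `b` below are arbitrary illustration values, not values from the assignment:

```r
set.seed(1)
J <- 4
xi <- 1 / 100  # arbitrary illustration values
a <- 1
b <- 1
phi <- rbeta(1, a, b)     # common detection probability
N <- rgeom(J, prob = xi)  # number of buried persons at each excavation
Y <- data.frame(
  y1 = rbinom(J, size = N, prob = phi),  # "left" femur counts
  y2 = rbinom(J, size = N, prob = phi)   # "right" femur counts
)
Y
```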
Joint probability function for $(N, Y)$
Conditionally on $N$ and the observations $Y = \{y_{j,i}\}$, the conditional posterior distribution for $\phi$ is
$$
(\phi \mid N, Y) \sim \mathrm{Beta}(\tilde{a}, \tilde{b}),
$$
where $\tilde{a} = a + \sum_{j,i} y_{j,i}$ and $\tilde{b} = b + 2 \sum_j N_j - \sum_{j,i} y_{j,i}$, and the sums go over $i = 1, 2$ and $j = 1, \dots, J$.
Show that the joint probability function for $(N, Y)$, with $N = (N_1, \dots, N_J)$ and $Y = \{y_{j,i}\}$, is given by
$$
p(N, Y) = \frac{B(\tilde{a}, \tilde{b})}{B(a, b)} \prod_{j=1}^{J} \left[ p_N(N_j) \prod_{i=1}^{2} \binom{N_j}{y_{j,i}} \right],
$$
where $p_N$ is the probability mass function for the $\mathrm{Geom}(\xi)$ prior for each $N_j$.
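One possible route for the derivation (a sketch only, under the model stated above) is to multiply the binomial likelihood by the two priors and integrate out $\phi$ using the Beta integral $\int_0^1 \phi^{\tilde{a}-1}(1-\phi)^{\tilde{b}-1}\,\mathrm{d}\phi = B(\tilde{a},\tilde{b})$:
$$
p(N, Y) = \prod_{j=1}^{J} p_N(N_j) \int_0^1 \left[ \prod_{j=1}^{J} \prod_{i=1}^{2} \binom{N_j}{y_{j,i}} \phi^{y_{j,i}} (1-\phi)^{N_j - y_{j,i}} \right] \frac{\phi^{a-1}(1-\phi)^{b-1}}{B(a, b)} \,\mathrm{d}\phi
= \frac{B(\tilde{a}, \tilde{b})}{B(a, b)} \prod_{j=1}^{J} \left[ p_N(N_j) \prod_{i=1}^{2} \binom{N_j}{y_{j,i}} \right].
$$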
Probability function implementation
With the help of the functions `lbeta`, `lchoose`, and `dgeom`, define (in `my_code.R`) a function `log_prob_NY(N, Y, xi, a, b)` with arguments `N` (a vector of length $J$) and `Y` (a data frame of size $J \times 2$, formatted like the output from `arch_data(J)`). The arguments `xi`, `a`, `b` are the model hyperparameters $\xi$, $a$, and $b$. The function should return the logarithm of $p(N, Y)$ when the combination of `N` and `Y` is valid, and should otherwise return `-Inf`.
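A rough sketch of such a function is shown below (a sketch under the assumptions above, not the official solution):

```r
log_prob_NY <- function(N, Y, xi, a, b) {
  Y <- as.matrix(Y)
  # Invalid combinations: negative N_j, or N_j smaller than an observed count
  if (any(N < 0) || any(N < Y[, 1]) || any(N < Y[, 2])) {
    return(-Inf)
  }
  a_tilde <- a + sum(Y)
  b_tilde <- b + 2 * sum(N) - sum(Y)
  sum(dgeom(N, prob = xi, log = TRUE)) +            # log Geometric priors for N_j
    sum(lchoose(N, Y[, 1]) + lchoose(N, Y[, 2])) +  # log binomial coefficients
    lbeta(a_tilde, b_tilde) - lbeta(a, b)           # Beta integral over phi
}
```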
Importance sampling function
The aim is to construct a Bayesian credible interval for each $N_j$ using importance sampling, similarly to the method used in lab 4. There, a Gaussian approximation of the posterior distribution of a parameter was used. Here, we will instead use the prior distribution for $N$ as the sampling distribution, with samples $N^{[k]}$, $k = 1, \dots, K$.
Since the observations $Y$ are fixed, the posterior probability function for $N$ is proportional to $p(N, Y)$.
We define the logarithm of the unnormalised importance weights as $\log w_k = \log p(N^{[k]}, Y) - \log p_N(N^{[k]})$, where $p_N(N^{[k]}) = \prod_{j=1}^{J} \xi (1 - \xi)^{N_j^{[k]}}$ is the product of the geometric distribution probabilities.
Define (in `my_code.R`) a function `arch_importance(K, Y, xi, a, b)` with the same arguments `Y`, `xi`, `a`, and `b` as before, but with the first argument `K` defining the number of samples to generate. The function should return a data frame with $K$ rows and $J + 1$ columns, named `N1`, `N2`, etc, for each $N_j$, containing samples from the prior distributions for the $N_j$, and a final column called `Log_Weights`, containing re-normalised log-importance-weights, constructed by shifting $\log w_k$ by subtracting the largest $\log w_k$ value, as in lab 4.
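A possible sketch of such a function (again, a sketch under the assumptions above rather than the official solution; a fully vectorised version would avoid the per-sample loop):

```r
arch_importance <- function(K, Y, xi, a, b) {
  J <- nrow(Y)
  # Prior samples: one column of Geometric(xi) draws per excavation
  N <- matrix(rgeom(K * J, prob = xi), K, J)
  # Unnormalised log-weights: log p(N, Y) minus the sampling log-probability
  log_w <- vapply(
    seq_len(K),
    function(k) {
      log_prob_NY(N[k, ], Y, xi, a, b) -
        sum(dgeom(N[k, ], prob = xi, log = TRUE))
    },
    numeric(1)
  )
  out <- as.data.frame(N)
  colnames(out) <- paste0("N", seq_len(J))
  out$Log_Weights <- log_w - max(log_w)  # shift by the largest log-weight
  out
}
```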
More efficient alternative
Using the prior distribution for each $N_j$ leads to a rather inefficient importance sampler, especially when data from several excavations are included; most of the weights will become zero (when some sampled $N_j$ is smaller than $\max(y_{j,1}, y_{j,2})$) or near zero, and only a few samples will influence the practical estimates and credible intervals.
As an alternative, we can use sampling distributions that are better adapted to the posterior distributions. Let $N_j^{\min} = \max(y_{j,1}, y_{j,2})$ be the smallest allowed value to sample, for each $j = 1, \dots, J$, and only sample values $N_j \geq N_j^{\min}$. This would eliminate all the impossible combinations, so that all weights would be (in theory if not numerically) positive.
To further improve the method, we can choose different values $\xi_j$ of the probability parameter for the $N_j$ samples, and also adapt it differently for each $j$. Without motivation, you can use the values $\xi_j = 1 / (1 + 4 N_j^{\min})$ when sampling, so that the expected value of each sampled $N_j$ is $N_j^{\min} + 4 N_j^{\min} = 5 N_j^{\min}$.
Note: Due to the memorylessness of the geometric distribution, this is also equivalent to an ordinary $\mathrm{Geom}(\xi_j)$ variable conditioned on being at least $N_j^{\min}$.
To take this different sampling method into account, the log-weights need to be defined as $\log w_k = \log p(N^{[k]}, Y) - \log \widetilde{p}_N(N^{[k]})$, where $\widetilde{p}_N(N^{[k]}) = \prod_{j=1}^{J} \xi_j (1 - \xi_j)^{N_j^{[k]} - N_j^{\min}}$, the product of the different geometric distribution probabilities.
You may adapt this example code (the details will depend on which vectorisation method you use):
```r
# matrix of minimum allowed N for each j, repeated K times
N_min <- matrix(rep(pmax(Y[, 1], Y[, 2]), each = K), K, J)
xi_sample <- 1 / (1 + 4 * N_min)
# Sample values and call log_prob_NY for each sample,
# store N-values in a K-by-J matrix, and log(p_NY)-values in a vector log_PY
N <- matrix(rgeom(..., prob = xi_sample), K, J) + N_min
log_PY <- ...
# Subtract the sampling log-probabilities
log_PY - rowSums(dgeom(N - N_min, prob = xi_sample, log = TRUE))
```
Using this method, $K$ does not need to be very large; even a fairly small number of samples starts to give reasonable values for code testing purposes, and the full run should only take at most a few seconds if the code is fully vectorised (the `vapply` function can be used, but a basic `for`-loop is also fine in this case).
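For illustration only, one hypothetical way to complete the example code above into a full sampler (the function name `arch_importance_adapted` is made up here; in the assignment this behaviour would live in your own `arch_importance`):

```r
arch_importance_adapted <- function(K, Y, xi, a, b) {
  J <- nrow(Y)
  # Minimum allowed N_j for each excavation, repeated K times
  N_min <- matrix(rep(pmax(Y[, 1], Y[, 2]), each = K), K, J)
  xi_sample <- 1 / (1 + 4 * N_min)
  # Shifted Geometric samples, so that every N_j is at least N_min_j
  N <- matrix(rgeom(K * J, prob = xi_sample), K, J) + N_min
  log_PY <- vapply(
    seq_len(K),
    function(k) log_prob_NY(N[k, ], Y, xi, a, b),
    numeric(1)
  )
  # Subtract the sampling log-probabilities to get unnormalised log-weights
  log_w <- log_PY - rowSums(dgeom(N - N_min, prob = xi_sample, log = TRUE))
  out <- as.data.frame(N)
  colnames(out) <- paste0("N", seq_len(J))
  out$Log_Weights <- log_w - max(log_w)
  out
}
```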
Credible intervals for $N$
As in the lab, let $\xi$, $a$, and $b$ take the values used there. Use `arch_importance` and the `wquantile` function with `type = 1` (see the help text for information) to compute credible intervals for the $N_j$. Start by including only the data from a single excavation, and observe how the interval for $N_1$ changes when data from the other excavations are added. Present and discuss the results.
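A hypothetical usage sketch (the hyperparameter values and the interval probabilities below are placeholders rather than the values from the lab, and the `weights` argument of `wquantile` is an assumption to be checked against its help text):

```r
source("code.R")     # provides arch_data() (and, assumed here, wquantile())
source("my_code.R")  # provides arch_importance()
xi <- 1 / 100        # placeholder values; use the ones from the lab
a <- 1
b <- 1
smp <- arch_importance(10000, arch_data(1), xi, a, b)
# Weighted credible interval for N_1 (placeholder credible level)
wquantile(smp$N1,
  probs = c(0.025, 0.975),
  weights = exp(smp$Log_Weights),
  type = 1
)
```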
Credible intervals for $\phi$
Update the `arch_importance` function to add a column `Phi` with samples from the conditional distribution $(\phi \mid N, Y) \sim \mathrm{Beta}(\tilde{a}, \tilde{b})$ for each row of the importance sample. Use the updated function to construct credible intervals for $\phi$: one using only the data from a single excavation, and one using the data from all the excavations. Present and discuss the results.
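One hypothetical way to organise this (the helper name `add_phi_samples` is made up; the sampling could equally be done inside `arch_importance` itself), assuming the conditional $\mathrm{Beta}(\tilde{a}, \tilde{b})$ distribution stated earlier:

```r
add_phi_samples <- function(samples, Y, a, b) {
  # Extract the N1, ..., NJ columns from the importance sample
  N <- as.matrix(samples[, grep("^N", names(samples)), drop = FALSE])
  a_tilde <- a + sum(Y)                   # the same for every row
  b_tilde <- b + 2 * rowSums(N) - sum(Y)  # depends on each row's N values
  samples$Phi <- rbeta(nrow(samples), shape1 = a_tilde, shape2 = b_tilde)
  samples
}
```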