  1. These were the instructions for this example project; the projects this year have different instructions that you should follow instead. The instructions here are only meant to help you understand why the solution does technical things in the specific way it does. For your own project, make sure to follow the relevant instructions instead.
  2. Your project repository contains a template RMarkdown document, report.Rmd, that you should expand with your answers and solutions for this project assignment, and two code files, code.R containing some predefined functions, and my_code.R that should be used for some of your own code. Do not change these filenames.
  3. Some of the code solutions (as indicated for those items) should be placed in the my_code.R file, and included in the report as indicated in the template document (also see the RMdemo.Rmd, my_code.R, and T4sol.Rmd document examples from the Week 4 lecture and example materials on Learn).
  4. Code that is placed directly into code chunks in the report.Rmd file should be shown in a separate section at the end of the report, so that the main text is a coherent presentation of theory and discussion of methods and results. In the code presentation section, include a brief code comment for each chunk. Also include brief documentation as code comments before each function defined in my_code.R.
  5. The styler package Addin for restyling code for better and consistent readability works for both .R and .Rmd files.

The assignment is submitted by pushing your work to the github repository before your submission deadline. Make sure you include your name, student number, and github username in both report.Rmd and my_code.R. When submitting, also include a generated html version of the report, with filename report.html. Normally, generated images are automatically included in the html file itself; otherwise they need to be added to the repository as well.

It is strongly recommended to commit and push to github frequently. If you’re uncertain whether you’re pushing all the needed submission files to github, ask Finn to check the repository contents at least a week before the submission deadline.

The 10 sub-tasks contribute equally to the total mark, but the project will be marked as a whole, between \(0\) and \(40\). A marking guide will be posted on Learn.

Archaeology in the Baltic Sea

In Lab 3, you investigated data from an archaeological excavation on Gotland in the Baltic Sea. In the 1920s, a battle gravesite from 1361 was subject to several archaeological excavations. A total of \(493\) femurs (thigh bones) were found: \(256\) left and \(237\) right. We wanted to figure out how many persons were likely buried at the gravesite. It must reasonably have been at least \(256\), but how many more? In this assignment, you will use importance sampling to estimate credible intervals for this problem.

In the lab, it seemed difficult to extract much information from the sample, since there were only two observations and two parameters. Here you will investigate whether including data from other excavations of the same type might decrease the uncertainty about the number of persons buried, through improved information about the average detection probability, \(\phi\). The function arch_data in code.R returns a data frame with data for up to four different excavations (note: only one of them is real; the other three are synthetic data generated for this assignment).

We assume the same model as in the lab for each excavation, with an unknown \(N_j\), \(j=1,\dots,J\), for each of the \(J\) excavations, and a common detection probability \(\phi\): conditionally on \(\{N_j\}\) and \(\phi\), \((Y_{j,i}|N_j,\phi)\sim\mathsf{Bin}(N_j,\phi)\), \(i=1,2\), all independent, with prior distributions \(N_j\sim\mathsf{Geom}(\xi)\), \(0<\xi<1\), and \(\phi\sim\mathsf{Beta}(a, b)\), \(a,b>0\), \[ \begin{aligned} p_{Y_{j,i}|N_j=n,\phi}(y) = \mathsf{P}(Y_{j,i}=y|N_j=n,\phi) &= {n \choose y} \phi^y(1-\phi)^{n-y}, \quad y=0,\dots,n, \\ p_{N_j}(n) = \mathsf{P}(N_j=n) &= \xi\,(1-\xi)^n,\quad n=0,1,2,3,\dots, \\ p_\phi(\phi) &= \frac{\phi^{a-1}(1-\phi)^{b-1}}{B(a,b)}, \quad \phi\in(0,1) , \end{aligned} \] where \(B(a,b)\) is the Beta function, see ?beta in R.

Joint probability function for \((\boldsymbol{N},\boldsymbol{Y})\)

Conditionally on \(\boldsymbol{N}=\{N_1,\dots,N_J\}\), and the observations \(\boldsymbol{Y}\), the conditional posterior distribution for \((\phi|\boldsymbol{N},\boldsymbol{Y})\) is \(\mathsf{Beta}(\widetilde{a},\widetilde{b})\) where \(\widetilde{a}=a+\sum_{j,i} Y_{j,i}\) and \(\widetilde{b}=b+2\sum_j N_j -\sum_{j,i} Y_{j,i}\), and the sums go over \(j=1,\dots,J\) and \(i=1,2\).
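
For reference, the shape of this conditional distribution can be seen by collecting the \(\phi\)-factors of the joint density (a brief sketch using the model definitions above): \[ p(\phi|\boldsymbol{N},\boldsymbol{Y}) \propto \phi^{a-1}(1-\phi)^{b-1} \prod_{j=1}^J\prod_{i=1}^2 \phi^{Y_{j,i}}(1-\phi)^{N_j-Y_{j,i}} = \phi^{\widetilde{a}-1}(1-\phi)^{\widetilde{b}-1}, \] which is an unnormalised \(\mathsf{Beta}(\widetilde{a},\widetilde{b})\) density.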

Show that the joint probability function for \((\boldsymbol{N},\boldsymbol{Y})\), with \(N_j\geq\max(Y_{j,1},Y_{j,2})\) and \(Y_{j,i}=0,1,2,\dots\), is given by \[ p(\boldsymbol{N},\boldsymbol{Y}) = \frac{B(\widetilde{a},\widetilde{b})}{B(a,b)} \prod_{j=1}^J \left[ p(N_j) \prod_{i=1}^2 {N_j \choose Y_{j,i}} \right], \] where \(p(N_j)\) is the probability mass function for the Geometric distribution prior for \(N_j\).
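
One possible route (a sketch only, using the prior and observation model stated above) is to integrate \(\phi\) out of the joint density and recognise the Beta function as the resulting normalising constant: \[ p(\boldsymbol{N},\boldsymbol{Y}) = \prod_{j=1}^J \left[ p(N_j) \prod_{i=1}^2 {N_j \choose Y_{j,i}} \right] \int_0^1 \frac{\phi^{a-1}(1-\phi)^{b-1}}{B(a,b)}\, \phi^{\sum_{j,i} Y_{j,i}} (1-\phi)^{2\sum_j N_j - \sum_{j,i} Y_{j,i}} \,\mathrm{d}\phi = \prod_{j=1}^J \left[ p(N_j) \prod_{i=1}^2 {N_j \choose Y_{j,i}} \right] \frac{B(\widetilde{a},\widetilde{b})}{B(a,b)}, \] since the integrand is \(B(\widetilde{a},\widetilde{b})/B(a,b)\) times a normalised \(\mathsf{Beta}(\widetilde{a},\widetilde{b})\) density.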

Probability function implementation

With the help of the functions lbeta, lchoose, and dgeom, define (in my_code.R) a function log_prob_NY(N,Y,xi,a,b) with arguments N (a vector of length \(J\)) and \(Y\) (a data frame of size \(J\times 2\), formatted like the output from arch_data(J)). The arguments xi, a, b are the model hyperparameters \(\xi\), \(a\), and \(b\). The function should return the logarithm of \(p(\boldsymbol{N},\boldsymbol{Y})\) when the combination of \(\boldsymbol{N}\) and \(\boldsymbol{Y}\) is valid, and should otherwise return -Inf.
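
A minimal sketch of such a function is shown below (hedged: it assumes the two femur counts are the first two columns of Y, as in the arch_data output, and does not check that N contains integers):

# Log joint probability log p(N, Y); returns -Inf for invalid (N, Y) combinations
log_prob_NY <- function(N, Y, xi, a, b) {
  Y1 <- Y[[1]]
  Y2 <- Y[[2]]
  # Invalid if any N_j is smaller than the largest observed count for excavation j
  if (any(N < pmax(Y1, Y2))) {
    return(-Inf)
  }
  a_tilde <- a + sum(Y1 + Y2)
  b_tilde <- b + 2 * sum(N) - sum(Y1 + Y2)
  lbeta(a_tilde, b_tilde) - lbeta(a, b) +
    sum(dgeom(N, prob = xi, log = TRUE)) +
    sum(lchoose(N, Y1) + lchoose(N, Y2))
}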

Importance sampling function

The aim is to construct a Bayesian credible interval for each \(N_j\) using importance sampling, similarly to the method used in lab 4. There, a Gaussian approximation of the posterior of a parameter was used. Here, we will instead use the prior distribution for \(\boldsymbol{N}\) as sampling distribution, with samples \(N_j^{[k]}\sim\mathsf{Geom}(\xi)\).

Since the observations \(\boldsymbol{Y}\) are fixed, the posterior probability function \(p(\boldsymbol{N}|\boldsymbol{Y})\) for \(\boldsymbol{N}\) is proportional to \(p(\boldsymbol{N},\boldsymbol{Y})\).

We define the logarithm of unnormalised importance weights, \(\log[w^{[k]}]=\log[p(\boldsymbol{N}^{[k]},\boldsymbol{Y})]-\log[p(\boldsymbol{N}^{[k]})]\), where \(p(\boldsymbol{N}^{[k]})\) is the product of \(J\) geometric distribution probabilities.

Define (in my_code.R) a function arch_importance(K,Y,xi,a,b) with the same arguments Y, xi, a, and b as before, but with the first argument K defining the number of samples to generate. The function should return a data frame with \(K\) rows and \(J+1\) columns: columns named N1, N2, etc., one for each \(j=1,\dots,J\), containing samples from the prior distribution for \(N_j\), and a final column called Log_Weights, containing re-normalised log-importance-weights, constructed by shifting each \(\log[w^{[k]}]\) by subtracting the largest \(\log[w^{[k]}]\) value, as in lab 4.
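
A minimal sketch of this prior-based sampler (assuming the log_prob_NY sketch above; a plain loop is used here for clarity rather than speed):

arch_importance <- function(K, Y, xi, a, b) {
  J <- nrow(Y)
  # Prior samples: K draws of N_j ~ Geom(xi) for each j, stored as a K-by-J matrix
  N <- matrix(rgeom(K * J, prob = xi), K, J)
  log_w <- numeric(K)
  for (k in seq_len(K)) {
    # log w = log p(N, Y) - log p(N), as defined above
    log_w[k] <- log_prob_NY(N[k, ], Y, xi, a, b) -
      sum(dgeom(N[k, ], prob = xi, log = TRUE))
  }
  out <- as.data.frame(N)
  names(out) <- paste0("N", seq_len(J))
  # Shift so that the largest log-weight is zero
  out$Log_Weights <- log_w - max(log_w)
  out
}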

More efficient alternative

Using the \(\mathsf{Geom}(\xi)\) prior distribution for each \(N_j\) leads to a rather inefficient importance sampler, especially for \(J=4\): most of the weights will be zero (when some \(N_j^{[k]}<Y_{j,i}\)) or nearly zero, and only a few samples will influence the practical estimates and credible intervals.

As an alternative, we can use sampling distributions that are better adapted to the posterior distributions. Let \(N^\text{min}_j=\max(Y_{j,1}, Y_{j,2})\) be the smallest allowed value to sample, for each \(j=1,\dots,J\). This would eliminate all the impossible combinations, so that all weights would be (in theory if not numerically) positive.

To further improve the method, we can choose different values of the probability parameter for the samples, and also adapt it separately for each \(j\). Without further motivation, you can use the values \(\xi_j=1/(1 + 4 N^\text{min}_j)\) when sampling, so that \(N_j^{[k]}=N^\text{min}_j + \mathsf{Geom}(\xi_j)\).

Note: Due to the nature of the geometric distribution, this is also equivalent to an ordinary \(\mathsf{Geom}(\xi_j)\) conditioned on being at least \(N^\text{min}_j\).
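
To see this (a brief check using the Geometric probability function from the model section): for \(X\sim\mathsf{Geom}(\xi_j)\) and \(n\geq N^\text{min}_j\), \[ \mathsf{P}(X=n \mid X\geq N^\text{min}_j) = \frac{\xi_j(1-\xi_j)^n}{(1-\xi_j)^{N^\text{min}_j}} = \xi_j(1-\xi_j)^{n-N^\text{min}_j}, \] which is the probability function of \(N^\text{min}_j+\mathsf{Geom}(\xi_j)\) evaluated at \(n\).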

To take this different sampling method into account, the log-weights need to be defined as \(\log[w^{[k]}]=\log[p(\boldsymbol{N}^{[k]},\boldsymbol{Y})]-\log[\widetilde{p}(\boldsymbol{N}^{[k]})]\), where \(\widetilde{p}(\boldsymbol{N}^{[k]})=\prod_{j=1}^J p_{\mathsf{Geom}(\xi_j)}(N_j^{[k]}-N^\text{min}_j)\), the product of \(J\) different geometric distribution probabilities.

You may adapt this example code (the details will depend on which vectorisation method you use):

# matrix of minimum allowed N for each j, repeated K times
N_min <- matrix(rep(pmax(Y[, 1], Y[, 2]), each = K), K, J)
xi_sample <- 1 / (1 + 4 * N_min)
# Sample values and call log_prob_NY for each sample,
# store N-values in a K-by-J matrix, and log(p_NY)-values in a vector log_PY
N <- matrix(rgeom(..., prob = xi_sample), K, J) + N_min
log_PY <- ...
# Subtract the sampling log-probabilities
log_PY - rowSums(dgeom(N - N_min, prob = xi_sample, log = TRUE))

Using this method, \(K\) does not need to be larger than \(100000\); values as small as \(K=1000\) start to give reasonable results for code-testing purposes, and \(K=10000\) should only take at most a few seconds to run if the code is fully vectorised (the vapply function can be used, but a basic for-loop is also fine in this case).
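
One way to fill in the elided parts of the template above is sketched here (hedged: it assumes the log_prob_NY sketch from earlier, and the name arch_importance2 is only a placeholder to distinguish it from the prior-based sampler):

arch_importance2 <- function(K, Y, xi, a, b) {
  J <- nrow(Y)
  # Matrix of minimum allowed N for each j, repeated K times
  N_min <- matrix(rep(pmax(Y[, 1], Y[, 2]), each = K), K, J)
  xi_sample <- 1 / (1 + 4 * N_min)
  # Shifted geometric samples, one row per importance sample
  N <- matrix(rgeom(K * J, prob = xi_sample), K, J) + N_min
  # log p(N, Y) for each sample row
  log_PY <- vapply(
    seq_len(K),
    function(k) log_prob_NY(N[k, ], Y, xi, a, b),
    numeric(1)
  )
  # Subtract the sampling log-probabilities to get unnormalised log-weights
  log_w <- log_PY - rowSums(dgeom(N - N_min, prob = xi_sample, log = TRUE))
  out <- as.data.frame(N)
  names(out) <- paste0("N", seq_len(J))
  out$Log_Weights <- log_w - max(log_w)
  out
}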

Credible intervals for \(N_j\)

As in the lab, let \(\xi=1/1001\), and \(a=b=1/2\). Use arch_importance and the wquantile function with type=1 (see the help text for information) to compute credible intervals for \(N_j\). Start by including only the data from a single excavation, and observe how the interval for \(N_1\) changes when data from the other excavations are added. Present and discuss the results.
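
A possible usage sketch (hedged: the call pattern for arch_data and the weights argument of wquantile are assumed from the descriptions above, and an equal-tailed 95% interval is used only as an example):

xi <- 1 / 1001
a <- 1 / 2
b <- 1 / 2
# Interval for N_1 using only the first excavation (J = 1)
Y1 <- arch_data(1)
sample1 <- arch_importance(10000, Y1, xi, a, b)
wquantile(sample1$N1,
  probs = c(0.025, 0.975),
  weights = exp(sample1$Log_Weights),
  type = 1
)
# Interval for N_1 when all four excavations are included (J = 4)
Y4 <- arch_data(4)
sample4 <- arch_importance(10000, Y4, xi, a, b)
wquantile(sample4$N1,
  probs = c(0.025, 0.975),
  weights = exp(sample4$Log_Weights),
  type = 1
)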

Credible intervals for \(\phi\)

Update the arch_importance function to add a column Phi with samples from the conditional distribution \(\mathsf{Beta}(\widetilde{a},\widetilde{b})\) for each row of the importance sample. Use the updated function to construct credible intervals for \(\phi\), one for \(J=1\), and one for \(J=4\). Present and discuss the results.
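
A minimal sketch of the extension (hedged: shown here as a separate helper, hypothetically called add_phi, applied to the arch_importance output; for the actual solution, fold the same steps into arch_importance itself):

# Add a Phi column: one draw from the conditional Beta(a_tilde, b_tilde) per sample row
add_phi <- function(samples, Y, a, b) {
  J <- nrow(Y)
  sum_Y <- sum(Y[, 1] + Y[, 2])
  N <- as.matrix(samples[, paste0("N", seq_len(J)), drop = FALSE])
  a_tilde <- a + sum_Y
  b_tilde <- b + 2 * rowSums(N) - sum_Y
  samples$Phi <- rbeta(nrow(samples), shape1 = a_tilde, shape2 = b_tilde)
  samples
}

Credible intervals for \(\phi\) can then be computed from the Phi column with wquantile, using the same exponentiated Log_Weights as for \(N_j\).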