Deriving the Hessian

Define the individual terms as \(f_k=y_k\log(1+e^{-\eta_k}) + (N-y_k)\log(1+e^{\eta_k})\), so that \(f=\sum_{k=1}^n f_k\). We know that \(\partial\eta_k/\partial\theta_1\equiv 1\) and \(\partial\eta_k/\partial\theta_2=k\). Then the derivatives with respect to \(\theta_1\) and \(\theta_2\) can be obtained with the chain rule: \[\begin{align*} \frac{df_k}{d\theta_i} &= \frac{\partial\eta_k}{\partial\theta_i} \frac{\partial f_k}{\partial\eta_k} \\ &= \frac{\partial \eta_k}{\partial \theta_i} \left[ y_k \frac{-e^{-\eta_k}}{1+e^{-\eta_k}} + (N-y_k) \frac{e^{\eta_k}}{1+e^{\eta_k}} \right] \\ &= \frac{\partial\eta_k}{\partial\theta_i} \left[ y_k \frac{-e^{-\eta_k/2}}{e^{\eta_k/2}+e^{-\eta_k/2}} + (N-y_k) \frac{e^{\eta_k/2}}{e^{-\eta_k/2}+e^{\eta_k/2}} \right] \\ &= \frac{\partial\eta_k}{\partial\theta_i} \left[ - y_k \frac{e^{-\eta_k/2}+e^{\eta_k/2}}{e^{\eta_k/2}+e^{-\eta_k/2}} + N \frac{1}{e^{-\eta_k}+1} \right] \\ &= \frac{\partial\eta_k}{\partial\theta_i} \left[ \frac{N}{e^{-\eta_k}+1} - y_k \right] \\ \end{align*}\] Since the derivatives of \(\eta_k\) with respect to \(\theta_1\) and \(\theta_2\) do not depend on the values of \(\theta_1\) and \(\theta_2\), the second-order derivatives are \[\begin{align*} \frac{d^2f_k}{d\theta_i d\theta_j} &= \frac{\partial\eta_k}{\partial\theta_i} \frac{\partial\eta_k}{\partial\theta_j} \frac{\partial}{\partial\eta_k}\left[ \frac{N}{e^{-\eta_k}+1} - y_k \right] \\ &= \frac{\partial\eta_k}{\partial\theta_i} \frac{\partial\eta_k}{\partial\theta_j} \frac{N e^{-\eta_k}}{(e^{-\eta_k}+1)^2} \\ &= \frac{\partial\eta_k}{\partial\theta_i} \frac{\partial\eta_k}{\partial\theta_j} \frac{N}{(e^{-\eta_k/2}+e^{\eta_k/2})^2} \\ &= \frac{\partial\eta_k}{\partial\theta_i} \frac{\partial\eta_k}{\partial\theta_j} \frac{N}{4\cosh(\eta_k/2)^2} \\ \end{align*}\] Plugging in the \(\theta\)-derivatives gives the Hessian contribution for term \(f_k\) as \[ \frac{N}{4\cosh(\eta_k/2)^2} \begin{bmatrix}1 & k \\ k & k^2\end{bmatrix} . \]
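The closed-form Hessian can be sanity-checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the linear predictor is \(\eta_k=\theta_1+k\theta_2\) (the simplest form consistent with the stated derivatives), and the function names, data values `y`, `N`, and the point `theta` are all hypothetical. It compares the analytic sum \(\sum_k \frac{N}{4\cosh(\eta_k/2)^2}\begin{bmatrix}1 & k \\ k & k^2\end{bmatrix}\) with central finite differences of \(f\).

```python
import math


def eta(theta, k):
    # Assumed linear predictor: eta_k = theta_1 + k * theta_2.
    return theta[0] + k * theta[1]


def f(theta, y, N):
    # f = sum_k [ y_k log(1 + e^{-eta_k}) + (N - y_k) log(1 + e^{eta_k}) ], k = 1..n
    total = 0.0
    for k, yk in enumerate(y, start=1):
        e = eta(theta, k)
        total += yk * math.log1p(math.exp(-e)) + (N - yk) * math.log1p(math.exp(e))
    return total


def hessian_analytic(theta, y, N):
    # Sum of d_k * u_k u_k^T with d_k = N / (4 cosh(eta_k / 2)^2) and u_k = (1, k).
    h = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(1, len(y) + 1):
        d = N / (4.0 * math.cosh(eta(theta, k) / 2.0) ** 2)
        u = (1.0, float(k))
        for i in range(2):
            for j in range(2):
                h[i][j] += d * u[i] * u[j]
    return h


def hessian_numeric(theta, y, N, step=1e-4):
    # Central second differences of f; an approximation used only as a check.
    h = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            def shifted(si, sj):
                t = list(theta)
                t[i] += si * step
                t[j] += sj * step
                return f(t, y, N)
            h[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * step * step)
    return h


if __name__ == "__main__":
    theta = (0.2, -0.1)   # hypothetical evaluation point
    y = [3, 5, 2]         # hypothetical counts, n = 3
    N = 10                # hypothetical number of trials
    ha = hessian_analytic(theta, y, N)
    hn = hessian_numeric(theta, y, N)
    print(max(abs(ha[i][j] - hn[i][j]) for i in range(2) for j in range(2)))
```

Note that the analytic Hessian does not depend on the observations \(y_k\), which matches the formula: \(y_k\) drops out when the gradient is differentiated a second time.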

Positive definite Hessian

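To show that the total Hessian for \(f\) is positive definite for \(n \geq 2\), we define the vectors \(\boldsymbol{u}_k=\begin{bmatrix}1 \\ k\end{bmatrix}\) and the scalar weights \(d_k=\frac{N}{4\cosh(\eta_k/2)^2}\). Define the \(2\times n\) matrix \(\boldsymbol{U}=\begin{bmatrix}\boldsymbol{u}_1 & \boldsymbol{u}_2 & \cdots & \boldsymbol{u}_n\end{bmatrix}\) and the \(n\times n\) diagonal matrix \(\boldsymbol{D}\) with \(D_{kk}=d_k\). The product \(\boldsymbol{U}\boldsymbol{D}\boldsymbol{U}^\top=\sum_{k=1}^n d_k\boldsymbol{u}_k\boldsymbol{u}_k^\top\) is then another way of writing the Hessian for \(f\). The vectors \(\boldsymbol{u}_k\) are pairwise non-parallel, so \(\boldsymbol{U}\) has full rank (rank 2) for \(n \geq 2\), and the \(d_k\) are strictly positive for all combinations of \(\theta_1\) and \(\theta_2\). For any non-zero vector \(\boldsymbol{v}\), \(\boldsymbol{v}^\top\boldsymbol{U}\boldsymbol{D}\boldsymbol{U}^\top\boldsymbol{v}=\sum_{k=1}^n d_k(\boldsymbol{u}_k^\top\boldsymbol{v})^2\geq 0\), with equality only if \(\boldsymbol{U}^\top\boldsymbol{v}=\boldsymbol{0}\), which the full rank of \(\boldsymbol{U}\) rules out. The Hessian therefore has rank 2 and is positive definite.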
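The role of the condition \(n \geq 2\) can be illustrated numerically. This is a small sketch under the assumption that \(\eta_k=\theta_1+k\theta_2\); the function names and the values of `theta` and `N` are hypothetical. It forms \(\boldsymbol{U}\boldsymbol{D}\boldsymbol{U}^\top\) as a sum of weighted outer products and computes the two eigenvalues of the resulting symmetric \(2\times 2\) matrix in closed form.

```python
import math


def hessian_udut(theta, n, N):
    # H = U D U^T = sum_k d_k u_k u_k^T, with u_k = (1, k) and
    # d_k = N / (4 cosh(eta_k / 2)^2); eta_k = theta_1 + k * theta_2 is assumed.
    h = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(1, n + 1):
        eta_k = theta[0] + k * theta[1]
        d_k = N / (4.0 * math.cosh(eta_k / 2.0) ** 2)
        u_k = (1.0, float(k))
        for i in range(2):
            for j in range(2):
                h[i][j] += d_k * u_k[i] * u_k[j]
    return h


def eigenvalues_sym2(h):
    # Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]], smallest first.
    a, b, c = h[0][0], h[0][1], h[1][1]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


if __name__ == "__main__":
    theta = (0.2, -0.1)  # hypothetical evaluation point
    N = 10               # hypothetical number of trials
    for n in (1, 2, 5):
        smallest, largest = eigenvalues_sym2(hessian_udut(theta, n, N))
        print(n, smallest, largest)
```

With a single term the matrix is a rank-1 outer product, so its smallest eigenvalue is zero; from two terms onwards both eigenvalues are strictly positive.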

Alternative reasoning

The Hessian for each \(f_k\) is positive semi-definite, since it can be written \(d_k\boldsymbol{u}_k\boldsymbol{u}_k^\top\) for some \(d_k > 0\) and vector \(\boldsymbol{u}_k\). The sum of a positive definite matrix and a positive semi-definite matrix is positive definite, so it’s sufficient to prove that the sum of the first two terms is positive definite. That sum equals \(d_1\) times a matrix of the form below, with \(w=d_2/d_1 > 0\). For any positive scaling constant \(w\), the determinant of \[ \begin{bmatrix}1 & 1\\1 & 1\end{bmatrix}+w\begin{bmatrix}1 & 2\\2 & 4\end{bmatrix} \] is \((1+w)(1+4w)-(1+2w)^2=1+5w+4w^2-1-4w-4w^2=w > 0\). This means that any (positively) weighted sum of those two matrices is positive definite (since each is positive semi-definite, and a positive determinant rules out a zero eigenvalue, so the sum cannot be merely positive semi-definite). This proves that the total Hessian is positive definite for \(n\geq 2\).
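The determinant identity is easy to confirm with exact arithmetic. A short sketch, with a hypothetical helper name and arbitrary test weights, using rational numbers so that no rounding is involved:

```python
from fractions import Fraction


def det_weighted_sum(w):
    # Determinant of [[1, 1], [1, 1]] + w * [[1, 2], [2, 4]].
    a = 1 + w        # top-left entry
    b = 1 + 2 * w    # off-diagonal entry
    c = 1 + 4 * w    # bottom-right entry
    return a * c - b * b


if __name__ == "__main__":
    # Arbitrary positive weights; the determinant should equal w in each case.
    for w in (Fraction(1, 1000), Fraction(3, 7), Fraction(5), Fraction(123456, 7)):
        print(w, det_weighted_sum(w))
```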


Note that in the first proof of positive definiteness, we didn’t actually need to know the specific values of the \(\boldsymbol{u}_k\) vectors, which are the gradients of \(\eta_k\) with respect to \(\boldsymbol{\theta}=(\theta_1,\theta_2)\); it was sufficient that they were non-parallel, ensuring that \(\boldsymbol{U}\) had full rank. This means that any linear model for \(\eta_k\) in a set of parameters \(\boldsymbol{\theta}=(\theta_1,\dots,\theta_p)\) leads to a positive definite Hessian for this model, provided that the collection of gradient vectors of \((\eta_1,\dots,\eta_n)\) with respect to \(\boldsymbol{\theta}\) has collective rank \(p\), i.e. the \(p\times n\) matrix of gradients has full row rank.
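The general statement can be illustrated with \(p=3\). The sketch below is only an illustration under stated assumptions: the two example designs (a quadratic predictor \(\eta_k=\theta_1+k\theta_2+k^2\theta_3\) and a deliberately redundant one whose third gradient component is twice the second), the function names, and the values of `theta` and `N` are all hypothetical. Positive definiteness is tested by attempting a Cholesky factorisation with a small relative tolerance on the pivots.

```python
import math


def hessian_linear_model(gradients, theta, N):
    # gradients[k] is the gradient of eta_k with respect to theta, so that
    # eta_k = gradients[k] . theta for a linear model.
    # H = sum_k d_k g_k g_k^T with d_k = N / (4 cosh(eta_k / 2)^2).
    p = len(theta)
    h = [[0.0] * p for _ in range(p)]
    for g in gradients:
        eta_k = sum(gi * ti for gi, ti in zip(g, theta))
        d_k = N / (4.0 * math.cosh(eta_k / 2.0) ** 2)
        for i in range(p):
            for j in range(p):
                h[i][j] += d_k * g[i] * g[j]
    return h


def is_positive_definite(h, rel_tol=1e-10):
    # Attempt a Cholesky factorisation; report failure as soon as a pivot
    # is not clearly positive relative to the largest diagonal entry.
    p = len(h)
    scale = max(h[i][i] for i in range(p))
    low = [[0.0] * p for _ in range(p)]
    for j in range(p):
        pivot = h[j][j] - sum(low[j][m] ** 2 for m in range(j))
        if pivot <= rel_tol * scale:
            return False
        low[j][j] = math.sqrt(pivot)
        for i in range(j + 1, p):
            partial = sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = (h[i][j] - partial) / low[j][j]
    return True


if __name__ == "__main__":
    theta = (0.1, -0.2, 0.05)  # hypothetical parameter values
    N = 10                     # hypothetical number of trials
    # Gradients (1, k, k^2): rank 3 for n = 4 distinct k.
    full_rank = [(1.0, float(k), float(k * k)) for k in range(1, 5)]
    # Gradients (1, k, 2k): third component duplicates the second, rank 2.
    deficient = [(1.0, float(k), 2.0 * k) for k in range(1, 5)]
    print(is_positive_definite(hessian_linear_model(full_rank, theta, N)))
    print(is_positive_definite(hessian_linear_model(deficient, theta, N)))
```

The tolerance-based pivot test is a pragmatic choice for a sketch; in floating-point arithmetic a rank-deficient Hessian rarely produces an exactly zero pivot, so a strict `pivot <= 0` test would be unreliable.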