Deriving the Hessian
Define the individual terms as , so that .
We know that and . Then the derivatives with respect to and can be obtained with the chain rule:
Since the derivatives of with respect to and do not depend on the values of and , the second-order derivatives are
Plugging in the -derivatives gives the Hessian contribution for term as
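As a concrete instance of this chain-rule structure (the particular per-term function below, an exponential of a linear combination, is an assumption for illustration; only the weighted outer-product form of the contribution matters), the Hessian contribution of a term can be checked numerically against finite differences:

```python
import numpy as np

# Hypothetical single term: f(theta) = exp(v . theta) for a fixed vector v.
# The chain rule gives grad = exp(v.theta) * v, and since the derivatives of
# the inner linear expression do not depend on the parameters, the Hessian
# contribution is the weighted outer product  w * v v^T  with w = exp(v.theta).
v = np.array([1.0, -2.0])
theta = np.array([0.3, 0.1])

def f(t):
    return np.exp(v @ t)

# Closed-form contribution: w * v v^T, with w > 0
H_closed = f(theta) * np.outer(v, v)

# Finite-difference second derivatives for comparison
eps = 1e-4
H_fd = np.zeros((2, 2))
for j in range(2):
    for k in range(2):
        e_j, e_k = np.eye(2)[j], np.eye(2)[k]
        H_fd[j, k] = (f(theta + eps * e_j + eps * e_k) - f(theta + eps * e_j)
                      - f(theta + eps * e_k) + f(theta)) / eps**2

print(np.allclose(H_closed, H_fd, atol=1e-2))  # True
```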
Positive definite Hessian
To show that the total Hessian for is positive definite for , we define vectors and .
Define the two-column matrix and a diagonal matrix with .
The product is then another way of writing the Hessian for . Since all the vectors are non-parallel, and the are strictly positive for all combinations of and , this matrix has full rank (rank 2) for , and is positive definite.
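The factorisation argument can be sketched numerically. The vectors and weights below are made-up placeholders; the only properties used are that the rows of the two-column matrix are non-parallel and that the diagonal weights are strictly positive:

```python
import numpy as np

# Stack the (assumed) gradient vectors as rows of a two-column matrix X,
# and put the strictly positive weights on the diagonal of D.
X = np.array([[1.0,  0.5],
              [0.2,  1.0],
              [1.0, -1.0]])        # rows pairwise non-parallel -> rank(X) = 2
D = np.diag([0.7, 1.3, 0.4])       # strictly positive weights

# The Hessian written as the product X^T D X
H = X.T @ D @ X

# Full column rank of X plus a positive diagonal D gives positive definiteness
print(np.linalg.matrix_rank(X), np.linalg.eigvalsh(H).min() > 0)  # 2 True
```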
Alternative reasoning
The Hessian for each is positive semi-definite, since it can be written for some and vector .
The sum of a positive definite matrix and a positive semi-definite matrix is positive definite, so it’s sufficient to prove that the sum of the first two terms is positive definite. For any positive scaling constant , the determinant of is .
This means that any (positively) weighted sum of those two matrices is positive definite (since each is positive semi-definite, and a positive determinant rules out the sum being only positive semi-definite). This proves that the total Hessian is positive definite for .
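The determinant step can be illustrated in two dimensions, where the determinant of a weighted sum of two rank-one terms has a closed form. The vectors and weights below are illustrative assumptions; only non-parallelism and positivity are used:

```python
import numpy as np

# Two rank-one terms w_i * v_i v_i^T: each is positive semi-definite with
# determinant zero, but for non-parallel v_1, v_2 and positive weights the
# determinant of the sum is strictly positive, which (together with
# semi-definiteness) forces the sum to be positive definite.
v1 = np.array([1.0, 0.5])
v2 = np.array([0.2, 1.0])
w1, w2 = 0.7, 1.3

S = w1 * np.outer(v1, v1) + w2 * np.outer(v2, v2)

# In 2D:  det(S) = w1 * w2 * (v1 x v2)^2
det_formula = w1 * w2 * (v1[0] * v2[1] - v1[1] * v2[0])**2
print(np.isclose(np.linalg.det(S), det_formula),
      np.all(np.linalg.eigvalsh(S) > 0))  # True True
```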
Note that in the first proof of positive definiteness, we didn’t actually need to know the specific values of the vectors, which are proportional to the gradients of with respect to ; it was sufficient that they were non-parallel, ensuring that had full rank. This means that any linear model for in a set of parameters leads to a positive definite Hessian for this model, if the collection of gradient vectors of with respect to has collective rank at least .
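The closing observation generalises to any number of parameters, and can be sketched with arbitrary data (the random gradient vectors and weights below are purely illustrative; only their collective rank and positivity matter):

```python
import numpy as np

# For a linear model the per-term Hessians are w_i * v_i v_i^T, so the total
# Hessian is X^T D X with the gradient vectors v_i as rows of X.  Whenever
# those rows have collective rank equal to the number of parameters, the
# total is positive definite, regardless of the particular v_i.
rng = np.random.default_rng(0)
n_terms, n_params = 20, 4
X = rng.standard_normal((n_terms, n_params))   # gradient vectors as rows
w = rng.uniform(0.1, 1.0, size=n_terms)        # strictly positive weights

H = X.T @ (w[:, None] * X)                     # sum_i w_i * v_i v_i^T

full_rank = np.linalg.matrix_rank(X) == n_params
print(full_rank, np.linalg.eigvalsh(H).min() > 0)  # True True
```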