This vignette covers the derivatives of the surface created by a Gaussian process model with respect to the spatial dimensions. The other vignette has derivatives of the deviance (likelihood) with respect to the parameters.
The mean function is \[ \hat{y}(x) = \hat{\mu} + r^T R^{-1}(Y - 1\hat{\mu}). \] \(r\) is the only part that depends on \(x\), and is defined below, where \(K\) is the correlation function.
\[ r = (r_1(x), \ldots, r_n(x))^T\] where \(n\) is the number of design points \(u_1, \ldots, u_n\), \[ r_i(x) = K(z(x), z(u_i))\] \[\frac{\partial \hat{y}(x)}{\partial x_i} = \frac{\partial r}{\partial x_i}^T R^{-1}(Y - 1\hat{\mu})\]
\[\frac{\partial r_j}{\partial x_i} = \frac{\partial K(z(x), z(u_j))}{\partial x_i} \] This value depends on the correlation function \(K\). Stacking these over \(j\) gives the vector \(\frac{\partial r}{\partial x_i}\), which yields a single partial derivative of \(\hat{y}\); collecting these partial derivatives over \(i\) gives the gradient \(\nabla_x \hat{y}(x)\).
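Before moving to second derivatives, here is a minimal R sketch of the mean prediction itself. The design `U`, responses `Y`, and parameters `theta` are made up for illustration (this is not the API of any particular package), and `corr` is the Gaussian correlation function introduced further below.

```r
set.seed(1)
n <- 10; d <- 2
U <- matrix(runif(n * d), ncol = d)      # design points u_1, ..., u_n
Y <- sin(2 * pi * U[, 1]) + U[, 2]^2     # made-up responses
theta <- c(5, 5)                         # assumed correlation parameters

# Gaussian correlation between two points (defined later in this vignette)
corr <- function(a, b) exp(-sum(theta * (a - b)^2))

# Correlation matrix of the design, with a tiny nugget for numerical stability
R <- outer(1:n, 1:n, Vectorize(function(i, j) corr(U[i, ], U[j, ]))) +
  diag(1e-8, n)
mu_hat <- sum(solve(R, Y)) / sum(solve(R, rep(1, n)))  # GLS estimate of mu
Rinv_resid <- solve(R, Y - mu_hat)                     # R^{-1}(Y - 1 mu_hat)

# Mean prediction at a new point x: mu_hat + r^T R^{-1}(Y - 1 mu_hat)
yhat <- function(x) {
  r <- apply(U, 1, function(u) corr(x, u))
  mu_hat + sum(r * Rinv_resid)
}
yhat(c(0.5, 0.5))
```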
The second derivatives can be calculated similarly \[\frac{\partial^2 \hat{y}(x)}{\partial x_i \partial x_k} = \frac{\partial^2 r}{\partial x_i \partial x_k}^T R^{-1}(Y - 1\hat{\mu})\]
Each element of this matrix is \[\frac{\partial^2 r_j}{\partial x_i \partial x_k} = \frac{\partial^2 K(z(x), z(u_j))}{\partial x_i \partial x_k} \]
The equations above work for any correlation function, but we still need the derivatives of the correlation function with respect to the spatial variables. For the Gaussian correlation function,
\[K(z(x), z(u_j)) = \exp(-\sum_{\ell =1}^d \theta_\ell (x_\ell - u_{j\ell})^2) \]
\[\frac{\partial K(z(x), z(u_j))}{\partial x_i} = -2\theta_i (x_i - u_{ji}) \exp(-\sum_{\ell =1}^d \theta_\ell (x_\ell - u_{j\ell})^2) \] The second derivative with respect to the same dimension is
\[\frac{\partial^2 K(z(x), z(u_j))}{\partial x_i^2} = (-2\theta_i + 4\theta_i^2 (x_i - u_{ji})^2) \exp(-\sum_{\ell =1}^d \theta_\ell (x_\ell - u_{j\ell})^2) \]
The cross derivative is \[\frac{\partial^2 K(z(x), z(u_j))}{\partial x_i \partial x_k} = 4\theta_i \theta_k (x_i - u_{ji}) (x_k - u_{jk}) \exp(-\sum_{\ell =1}^d \theta_\ell (x_\ell - u_{j\ell})^2) \]
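Continuing the R sketch from above (reusing `U`, `theta`, `corr`, `Rinv_resid`, and `yhat` from the earlier chunk), these kernel derivatives give the gradient and Hessian of the mean surface, which we can check against finite differences:

```r
# Gradient of yhat: column i stacks dr_j/dx_i = -2 theta_i (x_i - u_ji) K_j
grad_yhat <- function(x) {
  K <- apply(U, 1, function(u) corr(x, u))
  D <- -sweep(U, 2, x, "-")                  # D[j, i] = x_i - u_{ji}
  dr <- -2 * sweep(D, 2, theta, "*") * K     # dr[j, i] = dr_j / dx_i
  as.vector(t(dr) %*% Rinv_resid)
}

# Hessian of yhat from the second kernel derivatives above
hess_yhat <- function(x) {
  K <- apply(U, 1, function(u) corr(x, u))
  D <- -sweep(U, 2, x, "-")
  H <- matrix(0, d, d)
  for (i in 1:d) for (k in 1:d) {
    d2r <- if (i == k) {
      (-2 * theta[i] + 4 * theta[i]^2 * D[, i]^2) * K
    } else {
      4 * theta[i] * theta[k] * D[, i] * D[, k] * K
    }
    H[i, k] <- sum(d2r * Rinv_resid)
  }
  H
}

# Check the gradient against central finite differences
x0 <- c(0.5, 0.5); h <- 1e-6
fd <- sapply(1:d, function(i) (yhat(x0 + h * (1:d == i)) -
                               yhat(x0 - h * (1:d == i))) / (2 * h))
cbind(analytic = grad_yhat(x0), finite_diff = fd)
```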
A big problem with using the gradient of the mean function of a GP is that it gives no sense of its distribution/randomness. The gradient could be zero in a region where the surface is not flat simply because we don't have any information in that region yet.
Can we find the distribution of the gradient by taking a limit?
In \(d\) dimensions, start with the directional derivative in the direction of the \(i\)th coordinate: \[ \frac{\partial \hat{y}(x)}{\partial x_i} = \lim_{\delta \to 0} \frac{ \hat{y}(x + \delta e_i) - \hat{y}(x)}{\delta} \]
Now we find the joint distribution of \(\hat{y}(x + \delta e_i)\) and \(\hat{y}(x)\), and condition on the observed data.
If \((y,z) \sim N((\mu_1, \mu_2), \Sigma)\) then \[ z|y \sim N(\mu_{z|y}, \Sigma_{z|y} )\] \[ \mu_{z|y}= \mu_2 + \Sigma_{zy}\Sigma_{yy}^{-1} (y - \mu_1)\] \[ \Sigma_{z|y} = \Sigma_{zz} - \Sigma_{zy}\Sigma_{yy}^{-1}\Sigma_{yz} \]
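A quick numeric illustration of these conditioning formulas, using a made-up three-dimensional normal in which the first two components play the role of \(y\) and the third plays the role of \(z\):

```r
mu <- c(1, 2, 0.5)
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 1.0, 0.4,
                  0.2, 0.4, 1.0), nrow = 3)
yi <- 1:2; zi <- 3                 # index sets for y and z
y_obs <- c(0.9, 2.3)               # made-up observed value of y

# mu_{z|y} = mu_2 + Sigma_zy Sigma_yy^{-1} (y - mu_1)
mu_cond <- mu[zi] + Sigma[zi, yi] %*% solve(Sigma[yi, yi], y_obs - mu[yi])
# Sigma_{z|y} = Sigma_zz - Sigma_zy Sigma_yy^{-1} Sigma_yz
Sig_cond <- Sigma[zi, zi] - Sigma[zi, yi] %*% solve(Sigma[yi, yi], Sigma[yi, zi])
mu_cond; Sig_cond
```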
Applying this with \(z_1 = \hat{y}(x + \delta e_i)\), \(z_2 = \hat{y}(x)\), and the observed data as \(y\), and writing \(r_1\) and \(r_2\) for the correlation vectors of \(x + \delta e_i\) and \(x\) with the design points, \[ E[\hat{y}(x + \delta e_i) - \hat{y}(x) \mid y] = (\Sigma_{z_1y} - \Sigma_{z_2y})\Sigma_{yy}^{-1} (y - \mu_1) = (r_1-r_2)^T R^{-1}(y-\mu_1)\] since the \(\hat{\sigma}^2\) factors in \(\Sigma_{z_1y} - \Sigma_{z_2y} = \hat{\sigma}^2(r_1 - r_2)^T\) and \(\Sigma_{yy} = \hat{\sigma}^2 R\) cancel. Dividing by \(\delta\), \[ E\Big[\frac{\hat{y}(x + \delta e_i) - \hat{y}(x)}{\delta} \mid y\Big] = \Big(\frac{\Sigma_{z_1y} - \Sigma_{z_2y}}{\delta}\Big)\Sigma_{yy}^{-1} (y - \mu_1) = \Big(\frac{r_1 - r_2}{\delta}\Big)^T R^{-1} (y - \mu_1)\]
\[ \frac{\Sigma_{z_1y} - \Sigma_{z_2y}}{\delta} = \frac{\hat{\sigma}^2 (r_1 - r_2)^T}{\delta}\] The \(j\)th element of \((r_1 - r_2)/\delta\) converges to \[ \lim_{\delta \to 0} \frac{R(x+\delta e_i, X_j) - R(x, X_j)}{\delta} = \frac{\partial R(x, X_j)}{\partial x_i} = \frac{\partial r_{(j)}}{\partial x_i}\] The mean of the gradient distribution is therefore \[ \frac{\partial r}{\partial x_i}^TR^{-1} (y - \mu_1)\] This is the same as the derivative of the mean function.
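Continuing the running R example, we can watch the mean of the difference quotient converge to the analytic derivative of the mean function as \(\delta\) shrinks (all objects come from the earlier chunks):

```r
x0 <- c(0.5, 0.5); i <- 1
e_i <- c(1, 0)                              # unit vector in dimension i
r_at <- function(x) apply(U, 1, function(u) corr(x, u))
for (delta in 10^-(1:4)) {
  fd_mean <- sum((r_at(x0 + delta * e_i) - r_at(x0)) / delta * Rinv_resid)
  cat(sprintf("delta = %.0e   mean of difference quotient = %.6f\n",
              delta, fd_mean))
}
grad_yhat(x0)[i]   # analytic derivative: dr/dx_i^T R^{-1}(Y - 1 mu_hat)
```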
The variance is the more difficult part.
\[ Var\big[ \frac{\hat{y}(x + \delta e_i) - \hat{y}(x)}{\delta} \big] = \frac{1}{\delta^2}Var\big[ \hat{y}(x + \delta e_i) - \hat{y}(x) \big]\] \[=\frac{1}{\delta^2}\big( Var\big[ \hat{y}(x + \delta e_i) \big] + Var\big[ \hat{y}(x) \big] - 2 Cov\big[ \hat{y}(x + \delta e_i), \hat{y}(x) \big] \big)\] Conditioning on the observed data \(Y\), \[ Var\big[ \hat{y}(x + \delta e_i) \mid Y \big] = \Sigma_{11} - \Sigma_{1Y} \Sigma_{YY}^{-1} \Sigma_{Y1} \] \[ Var\big[ \hat{y}(x) \mid Y \big] = \Sigma_{22} - \Sigma_{2Y} \Sigma_{YY}^{-1} \Sigma_{Y2} \] \[ Cov\big[ \hat{y}(x + \delta e_i), \hat{y}(x) \mid Y \big] = Cov\big[ \hat{y}(x + \delta e_i), \hat{y}(x) \big] - Cov\big[ \hat{y}(x + \delta e_i), Y \big] \Sigma_{YY}^{-1} Cov\big[ Y, \hat{y}(x) \big] \] \[ = \Sigma_{12} - \Sigma_{1Y} \Sigma_{YY}^{-1} \Sigma_{Y2}\] \[ = \hat{\sigma}^2 \big(R(x + \delta e_i, x) - r_1^T R^{-1} r_2 \big) \]
Putting these together, \[=\frac{1}{\delta^2}\big( \Sigma_{11} - \Sigma_{1Y} \Sigma_{YY}^{-1} \Sigma_{Y1} + \Sigma_{22} - \Sigma_{2Y} \Sigma_{YY}^{-1} \Sigma_{Y2} - 2 (\Sigma_{12} - \Sigma_{1Y} \Sigma_{YY}^{-1} \Sigma_{Y2}) \big)\] \[=\frac{1}{\delta^2}\big( \Sigma_{11} + \Sigma_{22} - 2 \Sigma_{12} - (\Sigma_{1Y} - \Sigma_{2Y}) \Sigma_{YY}^{-1} (\Sigma_{Y1} - \Sigma_{Y2}) \big)\] Taking the limit as \(\delta \to 0\): the first three terms converge to \(\hat{\sigma}^2 \frac{\partial^2 R(x, x')}{\partial x_i \partial x_i'}\big|_{x'=x}\), which equals \(2\theta_i \hat{\sigma}^2\) for the Gaussian correlation, and the last term converges to \(\hat{\sigma}^2 \frac{\partial r}{\partial x_i}^T R^{-1} \frac{\partial r}{\partial x_i}\). So \[ Var\Big[\frac{\partial \hat{y}(x)}{\partial x_i} \Big| Y\Big] = \hat{\sigma}^2 \Big( \frac{\partial^2 R(x, x')}{\partial x_i \partial x_i'}\Big|_{x'=x} - \frac{\partial r}{\partial x_i}^T R^{-1} \frac{\partial r}{\partial x_i} \Big) \]
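As a numeric check of this limit, continuing the running example (and setting \(\hat{\sigma}^2 = 1\) as a placeholder, since the sketch never estimated it):

```r
sigma2_hat <- 1                       # placeholder; a real fit would estimate this
x0 <- c(0.5, 0.5); i <- 1; e_i <- c(1, 0)
for (delta in 10^-(1:4)) {
  r1 <- r_at(x0 + delta * e_i)
  r2 <- r_at(x0)
  R12 <- corr(x0 + delta * e_i, x0)   # correlation between the two points
  v <- sigma2_hat * (2 - 2 * R12 -
         sum((r1 - r2) * solve(R, r1 - r2))) / delta^2
  cat(sprintf("delta = %.0e   variance of difference quotient = %.6f\n",
              delta, v))
}
# Limiting value: sigma2 * (2 theta_i - dr/dx_i^T R^{-1} dr/dx_i)
K0 <- r_at(x0)
dr_i <- -2 * theta[i] * (x0[i] - U[, i]) * K0
sigma2_hat * (2 * theta[i] - sum(dr_i * solve(R, dr_i)))
```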
For a full treatment of differentiating Gaussian processes, see http://mlg.eng.cam.ac.uk/mchutchon/DifferentiatingGPs.pdf.
I could have just used the variance of a linear combination: with \(a = (1, -1)^T\) and \(z = (z_1, z_2)^T\), \[ Var[a^T z] = a^T Cov(z) \, a\]
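For instance, with a made-up \(2 \times 2\) covariance:

```r
a <- c(1, -1)
S <- matrix(c(1.0, 0.8,
              0.8, 1.5), nrow = 2)     # made-up Cov((z_1, z_2)^T)
drop(t(a) %*% S %*% a)                 # Var[z_1 - z_2] via the identity
S[1, 1] + S[2, 2] - 2 * S[1, 2]        # same thing, expanded termwise
```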
For the Gaussian correlation, the correlation between \(z_1 = \hat{y}(x + \delta e_i)\) and \(z_2 = \hat{y}(x)\) is \(R(z_1, z_2) = \exp(-\theta_i \delta^2)\), so \[ \frac{\partial R(z_1, z_2)}{\partial \delta} = -2 \theta_i \delta \exp(-\theta_i \delta^2) \] which is \(0\) at \(\delta = 0\); the variance of the difference quotient comes from the second-order term in \(\delta\).