This vignette covers the derivatives of the surface created by a Gaussian process model with respect to the spatial dimensions. The other vignette has derivatives of the deviance (likelihood) with respect to the parameters.
The mean function is
$$\hat{y}(x) = \hat{\mu} + r^T R^{-1} (Y - 1\hat{\mu})$$
$r$ is the only part that depends on $x$, and is defined below, where $K$ is the correlation function.
$$r = (r_1(x), \ldots, r_n(x))^T, \qquad r_i(x) = K(z(x), z(u_i)),$$
where the $u_i$ are the $n$ design points. Then
$$\frac{\partial \hat{y}(x)}{\partial x_i} = \frac{\partial r}{\partial x_i}^T R^{-1} (Y - 1\hat{\mu})$$
$$\frac{\partial r_j}{\partial x_i} = \frac{\partial K(z(x), z(u_j))}{\partial x_i}$$
This value depends on the correlation function $K$. Collecting these over $j$ gives the vector $\frac{\partial r}{\partial x_i}$, which yields a single partial derivative; the vector of these partial derivatives over $i$ gives the gradient $\nabla_x \hat{y}(x)$.
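As a numerical sanity check, the gradient formula can be assembled and compared against finite differences of the predicted mean. This is a minimal numpy sketch using the Gaussian correlation function worked out later in this vignette; the design `U`, responses `Y`, `theta`, and `mu_hat` are made-up illustrative values, not part of any particular package API.

```python
import numpy as np

# Gaussian correlation and its analytic x-derivative (formulas derived
# later in this vignette). All data values below are made up.
def corr(x, u, theta):
    """K(x, u) = exp(-sum_l theta_l (x_l - u_l)^2)."""
    return np.exp(-np.sum(theta * (x - u) ** 2))

def dcorr(x, u, theta, i):
    """dK/dx_i = -2 theta_i (x_i - u_i) K(x, u)."""
    return -2.0 * theta[i] * (x[i] - u[i]) * corr(x, u, theta)

U = np.array([[0.1, 0.1], [0.5, 0.2], [0.9, 0.3],
              [0.2, 0.6], [0.6, 0.7], [0.9, 0.9]])
Y = np.sin(3.0 * U[:, 0]) + U[:, 1]
theta = np.array([1.5, 0.7])
mu_hat = Y.mean()
x = np.array([0.3, 0.6])

n, d = U.shape
R = np.array([[corr(U[a], U[b], theta) for b in range(n)] for a in range(n)])
alpha = np.linalg.solve(R, Y - mu_hat)   # R^{-1}(Y - 1 mu_hat), reused below

def mean_fn(x):
    """GP predicted mean: mu_hat + r^T R^{-1} (Y - 1 mu_hat)."""
    r = np.array([corr(x, U[j], theta) for j in range(n)])
    return mu_hat + r @ alpha

# Gradient: i-th component is dr/dx_i^T R^{-1} (Y - 1 mu_hat)
grad = np.array([sum(dcorr(x, U[j], theta, i) * alpha[j] for j in range(n))
                 for i in range(d)])

# Check against central finite differences of the mean function itself
h = 1e-5
fd = np.array([(mean_fn(x + h * np.eye(d)[i]) - mean_fn(x - h * np.eye(d)[i]))
               / (2 * h) for i in range(d)])
assert np.allclose(grad, fd, atol=1e-6)
```

Note that $R^{-1}(Y - 1\hat{\mu})$ does not depend on $x$, so it can be computed once and reused for gradients at many points.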
The second derivatives can be calculated similarly:
$$\frac{\partial^2 \hat{y}(x)}{\partial x_i \partial x_k} = \frac{\partial^2 r}{\partial x_i \partial x_k}^T R^{-1} (Y - 1\hat{\mu})$$
Each element of this matrix is
$$\frac{\partial^2 r_j}{\partial x_i \partial x_k} = \frac{\partial^2 K(z(x), z(u_j))}{\partial x_i \partial x_k}$$
The equations above work for any correlation function, but we still need the derivatives of the chosen correlation function with respect to the spatial variables. Below we work these out for the Gaussian correlation function:
$$K(z(x), z(u_j)) = \exp\left(-\sum_{\ell=1}^{d} \theta_\ell (x_\ell - u_{j\ell})^2\right)$$
$$\frac{\partial K(z(x), z(u_j))}{\partial x_i} = -2\theta_i (x_i - u_{ji}) \exp\left(-\sum_{\ell=1}^{d} \theta_\ell (x_\ell - u_{j\ell})^2\right)$$
The second derivative with respect to the same dimension is
$$\frac{\partial^2 K(z(x), z(u_j))}{\partial x_i^2} = \left(-2\theta_i + 4\theta_i^2 (x_i - u_{ji})^2\right) \exp\left(-\sum_{\ell=1}^{d} \theta_\ell (x_\ell - u_{j\ell})^2\right)$$
The cross derivative, for $i \neq k$, is
$$\frac{\partial^2 K(z(x), z(u_j))}{\partial x_i \partial x_k} = 4\theta_i \theta_k (x_i - u_{ji})(x_k - u_{jk}) \exp\left(-\sum_{\ell=1}^{d} \theta_\ell (x_\ell - u_{j\ell})^2\right)$$
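The three derivative formulas above are easy to verify numerically against finite differences. In this sketch, `theta`, `x`, and the design point `u` are arbitrary illustrative values.

```python
import numpy as np

# Finite-difference verification of the Gaussian-correlation derivatives.
theta = np.array([1.2, 0.8, 2.0])
x = np.array([0.4, -0.1, 0.7])
u = np.array([0.9, 0.3, -0.2])

def K(x):
    return np.exp(-np.sum(theta * (x - u) ** 2))

def dK(i):
    """dK/dx_i"""
    return -2.0 * theta[i] * (x[i] - u[i]) * K(x)

def d2K(i, k):
    """d^2 K / (dx_i dx_k); same-dimension case when i == k."""
    if i == k:
        return (-2.0 * theta[i] + 4.0 * theta[i] ** 2 * (x[i] - u[i]) ** 2) * K(x)
    return 4.0 * theta[i] * theta[k] * (x[i] - u[i]) * (x[k] - u[k]) * K(x)

e = np.eye(3)
h = 1e-6
fd1 = (K(x + h * e[0]) - K(x - h * e[0])) / (2 * h)         # central difference

h2 = 1e-4
fd2 = (K(x + h2 * e[0]) - 2 * K(x) + K(x - h2 * e[0])) / h2 ** 2
fd_ik = (K(x + h2 * (e[0] + e[1])) - K(x + h2 * (e[0] - e[1]))
         - K(x + h2 * (e[1] - e[0])) + K(x - h2 * (e[0] + e[1]))) / (4 * h2 ** 2)

assert abs(dK(0) - fd1) < 1e-7
assert abs(d2K(0, 0) - fd2) < 1e-5
assert abs(d2K(0, 1) - fd_ik) < 1e-5
```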
A big problem with using the gradient of the mean function of a GP is that it gives no idea of its distribution/randomness. The gradient could be zero in a region where the surface is not flat simply because we don't have any information in that region yet.
Can we find the distribution of the gradient by taking a limit?
In $d$ dimensions we start with the directional derivative in the direction of the $i$th component:
$$\frac{\partial \hat{y}(x)}{\partial x_i} = \lim_{\delta \to 0} \frac{\hat{y}(x + \delta e_i) - \hat{y}(x)}{\delta}$$
Now we find the joint distribution of $\hat{y}(x + \delta e_i)$ and $\hat{y}(x)$.
If $(y, z) \sim N((\mu_1, \mu_2), \Sigma)$, then
$$z \mid y \sim N(\mu_{z|y}, \Sigma_{z|y})$$
$$\mu_{z|y} = \mu_2 + \Sigma_{zy}\Sigma_{yy}^{-1}(y - \mu_1)$$
$$\Sigma_{z|y} = \Sigma_{zz} - \Sigma_{zy}\Sigma_{yy}^{-1}\Sigma_{yz}$$
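These conditioning formulas can be demonstrated on a small concrete example. The check below verifies both against the equivalent precision-matrix (Schur complement) identities; all numbers are illustrative.

```python
import numpy as np

# Joint covariance of (y, z) with y 2-dimensional and z 1-dimensional.
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.5, 0.4],
                  [0.5, 0.4, 1.0]])
mu = np.array([1.0, -0.5, 2.0])
y_obs = np.array([1.2, 0.0])

Syy = Sigma[:2, :2]
Szy = Sigma[2:, :2]
Szz = Sigma[2:, 2:]

# Conditional mean and covariance from the formulas above
mu_cond = mu[2:] + Szy @ np.linalg.solve(Syy, y_obs - mu[:2])
Sigma_cond = Szz - Szy @ np.linalg.solve(Syy, Szy.T)

# Independent check via the joint precision matrix Lam = Sigma^{-1}:
# Sigma_{z|y} = (Lam_zz)^{-1} and
# mu_{z|y} = mu_z - (Lam_zz)^{-1} Lam_zy (y - mu_y)
Lam = np.linalg.inv(Sigma)
assert np.allclose(Sigma_cond, np.linalg.inv(Lam[2:, 2:]))
assert np.allclose(mu_cond,
                   mu[2:] - np.linalg.solve(Lam[2:, 2:],
                                            Lam[2:, :2] @ (y_obs - mu[:2])))
```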
$$E[\hat{y}(x + \delta e_i) - \hat{y}(x)] = (\Sigma_{z_1 y} - \Sigma_{z_2 y})\Sigma_{yy}^{-1}(y - \mu_1) = (r_1 - r_2)^T R^{-1}(y - \mu_1)$$
$$E\left[\frac{\hat{y}(x + \delta e_i) - \hat{y}(x)}{\delta}\right] = \left(\frac{\Sigma_{z_1 y} - \Sigma_{z_2 y}}{\delta}\right)\Sigma_{yy}^{-1}(y - \mu_1) = \left(\frac{r_1 - r_2}{\delta}\right)^T R^{-1}(y - \mu_1)$$
$$\frac{\Sigma_{z_1 y} - \Sigma_{z_2 y}}{\delta} = \frac{\hat{\sigma}^2 (r_1 - r_2)}{\delta}$$
The $j$th element of this is
$$\lim_{\delta \to 0} \frac{R(x + \delta e_i, X_j) - R(x, X_j)}{\delta} = \frac{\partial R(x, X_j)}{\partial x_i} = \frac{\partial r_{(j)}}{\partial x_i}$$
The mean of the gradient distribution is therefore
$$\frac{\partial r}{\partial x_i}^T R^{-1}(y - \mu_1)$$
This is the same as the derivative of the mean function.
The variance is the more difficult part.
$$Var\left[\frac{\hat{y}(x + \delta e_i) - \hat{y}(x)}{\delta}\right] = \frac{1}{\delta^2} Var\left[\hat{y}(x + \delta e_i) - \hat{y}(x)\right] = \frac{1}{\delta^2}\left(Var\left[\hat{y}(x + \delta e_i)\right] + Var\left[\hat{y}(x)\right] - 2\, Cov\left[\hat{y}(x + \delta e_i), \hat{y}(x)\right]\right)$$
$$Var\left[\hat{y}(x + \delta e_i) \mid Y\right] = \Sigma_{11} - \Sigma_{1Y}\Sigma_{YY}^{-1}\Sigma_{Y1}$$
$$Var\left[\hat{y}(x) \mid Y\right] = \Sigma_{22} - \Sigma_{2Y}\Sigma_{YY}^{-1}\Sigma_{Y2}$$
$$Cov\left[\hat{y}(x + \delta e_i), \hat{y}(x) \mid Y\right] = Cov\left[\hat{y}(x + \delta e_i), \hat{y}(x)\right] - Cov\left[\hat{y}(x + \delta e_i), Y\right]\Sigma_{YY}^{-1} Cov\left[Y, \hat{y}(x)\right] = \Sigma_{12} - \Sigma_{1Y}\Sigma_{YY}^{-1}\Sigma_{Y2} = \hat{\sigma}^2\left(R\left[\hat{y}(x + \delta e_i), \hat{y}(x)\right] - r_1^T R^{-1} r_2\right)$$
$$= \frac{1}{\delta^2}\left(\Sigma_{11} - \Sigma_{1Y}\Sigma_{YY}^{-1}\Sigma_{Y1} + \Sigma_{22} - \Sigma_{2Y}\Sigma_{YY}^{-1}\Sigma_{Y2} - 2\left(\Sigma_{12} - \Sigma_{1Y}\Sigma_{YY}^{-1}\Sigma_{Y2}\right)\right) = \frac{1}{\delta^2}\left(\Sigma_{11} + \Sigma_{22} - 2\Sigma_{12} - (\Sigma_{1Y} - \Sigma_{2Y})\Sigma_{YY}^{-1}(\Sigma_{Y1} - \Sigma_{Y2})\right)$$
Taking the limit $\delta \to 0$, the first group of terms converges to $\hat{\sigma}^2 \frac{\partial^2 R(x, x')}{\partial x_i \partial x_i'}\big|_{x'=x}$ and the second to $\hat{\sigma}^2 \frac{\partial r}{\partial x_i}^T R^{-1} \frac{\partial r}{\partial x_i}$, giving
$$Var\left[\frac{\partial \hat{y}(x)}{\partial x_i}\right] = \hat{\sigma}^2\left(\frac{\partial^2 R(x, x')}{\partial x_i \partial x_i'}\bigg|_{x'=x} - \frac{\partial r}{\partial x_i}^T R^{-1} \frac{\partial r}{\partial x_i}\right)$$
For the Gaussian correlation function, $\frac{\partial^2 R(x, x')}{\partial x_i \partial x_i'}\big|_{x'=x} = 2\theta_i$.
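As a check on this derivation, the variance of the difference quotient can be evaluated for a small $\delta$ and compared against the standard $\delta \to 0$ limit for differentiated GP posteriors, $\hat{\sigma}^2(2\theta_i - \frac{\partial r}{\partial x_i}^T R^{-1} \frac{\partial r}{\partial x_i})$ for the Gaussian correlation. This is a minimal one-dimensional numpy sketch; the design `U`, `theta`, and `sigma2` are made-up illustrative values.

```python
import numpy as np

# Illustrative 1-D setup: Gaussian correlation with parameter theta,
# process variance sigma2, design points U (all made up).
theta = 1.3
sigma2 = 1.0
U = np.linspace(0.0, 1.0, 6)
x = 0.37

K = lambda a, b: np.exp(-theta * (a - b) ** 2)
R = K(U[:, None], U[None, :])
Rinv = np.linalg.inv(R)
r_at = lambda a: K(a, U)

def var_quotient(delta):
    """(1/delta^2)(Sigma11 + Sigma22 - 2 Sigma12
                   - (Sigma1Y - Sigma2Y) SigmaYY^{-1}(SigmaY1 - SigmaY2))."""
    dr_fd = r_at(x + delta) - r_at(x)
    # Sigma11 + Sigma22 - 2 Sigma12 = 2 sigma2 (1 - exp(-theta delta^2));
    # expm1 keeps precision for tiny delta.
    top = -2.0 * sigma2 * np.expm1(-theta * delta ** 2)
    quad = sigma2 * dr_fd @ Rinv @ dr_fd
    return (top - quad) / delta ** 2

# Limit as delta -> 0: sigma2 * (2 theta - dr/dx^T R^{-1} dr/dx)
dr = -2.0 * theta * (x - U) * K(x, U)
limit = sigma2 * (2.0 * theta - dr @ Rinv @ dr)

assert abs(var_quotient(1e-5) - limit) < 1e-3 * abs(limit) + 1e-3
```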
For a fuller treatment of differentiating Gaussian processes, see http://mlg.eng.cam.ac.uk/mchutchon/DifferentiatingGPs.pdf.
I could have just used the formula for the variance of a linear combination: with $w = (1, -1)^T$,
$$Var\left[w^T (z_1, z_2)^T\right] = w^T\, Cov(z_1, z_2)\, w$$
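For concreteness, here is that formula on a tiny example; the covariance entries are illustrative.

```python
import numpy as np

# Var[z1 - z2] = w^T Cov(z) w with w = (1, -1)
Cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
w = np.array([1.0, -1.0])
var_diff = w @ Cov @ w   # = Var[z1] + Var[z2] - 2 Cov[z1, z2]
assert np.isclose(var_diff, 2.0 + 1.0 - 2 * 0.5)
```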
For the Gaussian correlation, $R(z_1, z_2) = R(x + \delta e_i, x) = \exp(-\theta_i \delta^2)$, so
$$\frac{\partial R(z_1, z_2)}{\partial \delta} = -2\theta_i \delta \exp(-\theta_i \delta^2)$$