Consider the case where we observe \(n\) independent, identically distributed copies of the random variable (\(X_{i}\)) where \(X_{i} \in \mathbb{R}^{p}\) is normally distributed with some mean, \(\mu\), and some variance, \(\Sigma\). That is, \(X_{i} \sim N_{p}\left( \mu, \Sigma \right)\).
Because we assume independence, we know that the probability of observing these specific observations \(X_{1}, …, X_{n}\) is equal to
\begin{align} f(X{1}, …, X{n}; \mu, \Sigma) &= \prod{i = 1}{n}(2\pi){-p/2}\left| \Sigma \right|{-½}\exp\left[ -\frac{1}{2}\left( X{i} - \mu \right){T}\Sigma{-1}\left( X{i} - \mu \right) \right] \ &= (2\pi){-nr/2}\left| \Sigma \right|{-n/2}\mbox{etr}\left[ -\frac{1}{2}\sum{i = 1}{n}\left( X{i} - \mu \right)\left( X{i} - \mu \right){T}\Sigma{-1} \right] \end{align}
where \(\mbox{etr}\left( \cdot \right)\) denotes the exponential trace operator. It follows that the log-likelihood for \(\mu\) and \(\Sigma\) is equal to the following:
\[ l(\mu, \Sigma | X) = const. - \frac{n}{2}\log\left| \Sigma \right| - tr\left[ \frac{1}{2}\sum_{i = 1}^{n}\left(X_{i} - \mu \right)\left(X_{i} - \mu \right)^{T}\Sigma^{-1} \right] \]
If we are interested in estimating \(\mu\), it is relatively straight forward to show that the maximum likelihood estimator (MLE) for \(\mu\) is \(\hat{\mu}_{MLE} = \sum_{i = 1}^{n}X_{i}/n\) which we typically denote as \(\bar{X}\). However, in addition to \(\mu\), many applications require the estimation of \(\Sigma\) as well. We can also find a maximum likelihood estimator:
\begin{align} &\hat{\Sigma}{MLE} = \arg\max{\Sigma \in \mathbb{S}{+}{p}}\left{ const. - \frac{n}{2}\log\left| \Sigma \right| - tr\left[ \frac{1}{2}\sum{i = 1}{n}\left(X_{i} - \mu \right)\left(X{i} - \mu \right){T}\Sigma{-1} \right] \right} \ &\nabla{\Sigma}l(\mu, \Sigma | X) = -\frac{n}{2}\Sigma{-1} + \frac{1}{2}\sum{i = 1}{n}\left(X{i} - \mu \right)\left(X{i} - \mu \right){T}\Sigma{-2} \ \Rightarrow &\hat{\Sigma}{MLE} = \left[ \frac{1}{n}\sum{i = 1}{n}\left(X{i} - \bar{X} \right)\left(X_{i} - \bar{X} \right){T} \right] \end{align}
By setting the gradient equal to zero and plugging in the MLE for \(\mu\), we find that the MLE for \(\Sigma\) is our usual sample estimator often denoted as \(S\). It turns out that we could have just as easily computed the maximum likelihood estimator for the precision matrix \(\Omega \equiv \Sigma^{-1}\) and taken its inverse:
\[ \hat{\Omega}_{MLE} = \arg\min_{\Omega \in S_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega\right| \right\} \]
so that \(\hat{\Omega}_{MLE} = S^{-1}\). Beyond the formatting convenience, computing estimates for \(\Omega\) as opposed to \(\Sigma\) often poses less computational challenges – and accordingly, the literature has placed more emphasis on efficiently solving for \(\Omega\) instead of \(\Sigma\).
Notice that here we are minimizing the negative log-likelihood as opposed to maximizing the log-likelihood. Both procedures will result in the same estimate.
As in regression settings, we can construct a penalized log-likelihood estimator by adding a penalty term, \(P\left(\Omega\right)\), to the log-likelihood so that
\[ \hat{\Omega} = \arg\min_{\Omega \in S_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega \right| + P\left( \Omega \right) \right\} \]
\(P\left( \Omega \right)\) is often of the form \(P\left(\Omega \right) = \lambda\|\Omega \|_{F}^{2}/2\) or \(P\left(\Omega \right) = \|\Omega\|_{1}\) where \(\lambda > 0\), \(\left\|\cdot \right\|_{F}^{2}\) is the Frobenius norm and we define \(\left\|A \right\|_{1} = \sum_{i, j} \left| A_{ij} \right|\). These penalties are the ridge and lasso, respectively. In the ADMMsigma
package, we instead take
\[ P\left( \Omega \right) = \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] \]
so that solving the full penalized log-likelihood for \(\Omega\) results in solving
\[ \hat{\Omega} = \arg\min_{\Omega \in S_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega \right| + \lambda\left[\frac{1 - \alpha}{2}\left\| \Omega \right|_{F}^{2} + \alpha\left\| \Omega \right\|_{1} \right] \right\} \]
where \(0 \leq \alpha \leq 1\). This penalty, know as the elastic-net penalty, was explored by Hui Zou and Trevor Hastie [@zou2005regularization] and is identical to the penalty used in the popular penalized regression package glmnet
. Clearly, when \(\alpha = 0\) the elastic-net reduces to a ridge-type penalty and when \(\alpha = 1\) it reduces to a lasso-type penalty. Having this flexibility and generalization allows us to perform cross validation across proposed \(\alpha\) values in addition to proposed \(\lambda\) values.
We will explore how to solve for \(\hat{\Omega}\) in the next section.
\vspace{1cm}
Many efficient methods have been proposed to solve for \(\hat{\Omega}\) when \(\alpha = 1\). The most popular method is the graphical lasso algorithm (glasso) introduced by @friedman2008sparse. However, no methods (to the best of my knowledge) have estimated \(\Omega\) when \(\alpha \in (0, 1)\). We will use the alternating direction method of multipliers (ADMM) algorithm to do so.
As the authors state in @boyd2011distributed, the “ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers.” For our purposes, we will only focus on the ADMM algorithm but it is encouraged to read the original text from Boyd and others for a complete introduction to the other two methods.
In general, suppose we want to solve an optimization problem of the following form:
\begin{align} \mbox{minimize } f(x) + g(z) \ \mbox{subject to } Ax + Bz = c \end{align}
where \(x \in \mathbb{R}^{n}, z \in \mathbb{R}^{m}, A \in \mathbb{R}^{p \times n}, B \in \mathbb{R}^{p \times m}\), \(c \in \mathbb{R}^{p}\), and \(f\) and \(g\) are assumed to be convex functions (following @boyd2011distributed, the estimation procedure will be introduced in vector form though we could just as easily take \(x\) and \(z\) to be matrices). In addition to penalized precision matrix estimation, optimization problems like this arise naturally in several statistics and machine learning applications – particularly regularization methods. For instance, we could take \(f\) to be the squared error loss, \(g\) to be the \(l_{2}\)-norm, \(c\) to be equal to zero and \(A\) and \(B\) to be identity matrices to solve the ridge regression optimization problem. In all cases, our goal is to find \(x^{*} \in \mathbb{R}^{n}\) and \(z^{*} \in \mathbb{R}^{m}\) that achieves the infimum \(p^{*}\):
\[ p^{*} = inf\left\{ f(x) + g(z) | Ax + Bz = c \right\} \]
To do so, the ADMM algorithm uses the augmented lagrangian
\[ L_{\rho}(x, z, y) = f(x) + g(z) + y^{T}(Ax + Bz - c) + \frac{\rho}{2}\left\| Ax + Bz - c \right\|_{2}^{2} \]
where \(y \in \mathbb{R}^{p}\) is the lagrange multiplier and \(\rho > 0\) is a scalar. Clearly any minimizer, \(p^{*}\), under the augmented lagrangian is equivalent to that of the lagrangian since any feasible point \((x, z)\) satisfies the constraint \(\rho\left\| Ax + Bz - c \right\|_{2}^{2}/2 = 0\).
The algorithm consists of the following repeated iterations:
\begin{align} x{k + 1} &= \arg\min{x \in \mathbb{R}{n}}L{\rho}(x, z{k}, y{k}) \ z{k + 1} &= \arg\min{z \in \mathbb{R}{m}}L{\rho}(x{k + 1}, z, y{k}) \ y{k + 1} &= y{k} + \rho(Ax{k + 1} + Bz{k + 1} - c) \end{align}
In the context of precision matrix estimation, we can let \(f\) be equal to the non-penalized likelihood, \(g\) be equal to \(P\left( \Omega \right)\), and use the constraint \(\Omega\) equal to some \(Z\) so that the lagrangian is
\[ L_{\rho}(\Omega, Z, \Lambda) = f\left(\Omega\right) + g\left(Z\right) + tr\left[\Lambda\left(\Omega - Z\right)\right] \]
and the augmented lagrangian is
\[ L_{\rho}(\Omega, Z, \Lambda) = f\left(\Omega\right) + g\left(Z\right) + tr\left[\Lambda\left(\Omega - Z\right)\right] + \frac{\rho}{2}\left\|\Omega - Z\right\|_{F}^{2} \]
The ADMM algorithm is now the following:
\begin{align} \Omega{k + 1} &= \arg\min{\Omega \in \mathbb{S}{+}{p}}\left{ tr\left(S\Omega\right) - \log\left|\Omega\right| + tr\left[\Lambda{k}\left(\Omega - Z{k}\right)\right] + \frac{\rho}{2}\left| \Omega - Z{k} \right|{F}{2} \right} \ Z{k + 1} &= \arg\min{Z \in \mathbb{S}{p}}\left{ \lambda\left[ \frac{1 - \alpha}{2}\left| Z \right|{F}{2} + \alpha\left| Z \right|{1} \right] + tr\left[\Lambda{k}\left(\Omega{k + 1} - Z\right)\right] + \frac{\rho}{2}\left| \Omega{k + 1} - Z \right|_{F}{2} \right} \ \Lambda{k + 1} &= \Lambda{k} + \rho\left( \Omega{k + 1} - Z{k + 1} \right) \end{align}
\vspace{1cm}
Set \(k = 0\) and initialize \(Z^{0}, \Lambda^{0}\), and \(\rho\). Repeat steps 1-3 until convergence:
\[ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + \left( Q^{2} + 4\rho I_{p} \right)^{½} \right]V^{T} \]
\begin{align} Z{ij}{k + 1} &= \frac{1}{\lambda(1 - \alpha) + \rho}\mbox{sign}\left(\rho\Omega{ij}{k + 1} + \Lambda{ij}{k}\right)\left( \left| \rho\Omega{ij}{k + 1} + \Lambda{ij}{k} \right| - \lambda\alpha \right){+} \ &= \frac{1}{\lambda(1 - \alpha) + \rho}\mbox{soft}\left(\rho\Omega{ij}{k + 1} + \Lambda{ij}{k}, \lambda\alpha\right) \end{align}
\[ \Lambda^{k + 1} = \Lambda^{k} + \rho\left( \Omega^{k + 1} - Z^{k + 1} \right) \]
where \(\mbox{soft}(a, b) = \mbox{sign}(a)(\left| a \right| - b)_{+}\).
\vspace{1cm}
\[ \Omega^{k + 1} = \arg\min_{\Omega \in \mathbb{S}_{+}^{p}}\left\{ tr\left(S\Omega\right) - \log\left|\Omega\right| + tr\left[\Lambda^{k}\left(\Omega - Z^{k}\right)\right] + \frac{\rho}{2}\left\| \Omega - Z^{k} \right\|_{F}^{2} \right\} \]
\vspace{1cm}
\begin{align} \nabla{\Omega}&\left{ tr\left(S\Omega\right) - \log\left|\Omega\right| + tr\left[\Lambda{k}\left(\Omega - Z{k}\right)\right] + \frac{\rho}{2}\left| \Omega - Z{k} \right|{F}{2} \right} \ &= S - \Omega{-1} + \Lambda{k} + \rho\left( \Omega - Z{k} \right) \end{align}
Note that because all of the variables are symmetric, we can ignore the symmetric constraint when deriving the gradient. First set the gradient equal to zero and decompose \(\Omega^{k + 1} = VDV^{T}\) where \(D\) is a diagonal matrix with diagonal elements equal to the eigen values of \(\Omega^{k + 1}\) and \(V\) is the matrix with corresponding eigen vectors as columns:
\[ S + \Lambda^{k} - \rho Z^{k} = (\Omega^{k + 1})^{-1} - \rho \Omega^{k + 1} = VD^{-1}V^{T} - \rho VDV^{T} = V\left(D^{-1} - \rho D\right)V^{T} \]
This equivalence implies that
\[ \phi_{j}\left( S + \Lambda^{k} - \rho Z^{k} \right) = \frac{1}{\phi_{j}(\Omega^{k + 1})} - \rho\phi_{j}(\Omega^{k + 1}) \]
where \(\phi_{j}(\cdot)\) is the $j$th eigen value.
\begin{align} &\Rightarrow \rho\phi{j}{2}(\Omega{k + 1}) + \phi{j}\left( S + \Lambda{k} - \rho Z{k} \right)\phi{j}(\Omega{k + 1}) - 1 = 0 \ &\Rightarrow \phi{j}(\Omega{k + 1}) = \frac{-\phi{j}(S + \Lambda{k} - \rho Z{k}) \pm \sqrt{\phi{j}{2}(S + \Lambda{k} - \rho Z{k}) + 4\rho}}{2\rho} \end{align}
In summary, if we decompose \(S + \Lambda^{k} - \rho Z^{k} = VQV^{T}\) then
\[ \Omega^{k + 1} = \frac{1}{2\rho}V\left[ -Q + (Q^{2} + 4\rho I_{p})^{½}\right] V^{T} \]
\vspace{1cm}
\[ Z^{k + 1} = \arg\min_{Z \in \mathbb{S}^{p}}\left\{ \lambda\left[ \frac{1 - \alpha}{2}\left\| Z \right\|_{F}^{2} + \alpha\left\| Z \right\|_{1} \right] + tr\left[\Lambda^{k}\left(\Omega^{k + 1} - Z\right)\right] + \frac{\rho}{2}\left\| \Omega^{k + 1} - Z \right\|_{F}^{2} \right\} \]
\vspace{1cm}
\begin{align} \partial&\left{ \lambda\left[ \frac{1 - \alpha}{2}\left| Z \right|{F}{2} + \alpha\left| Z \right|{1} \right] + tr\left[\Lambda{k}\left(\Omega{k + 1} - Z\right)\right] + \frac{\rho}{2}\left| \Omega{k + 1} - Z \right|{F}{2} \right} \ &= \partial\left{ \lambda\left[ \frac{1 - \alpha}{2}\left| Z \right|{F}{2} + \alpha\left| Z \right|{1} \right] \right} + \nabla{\Omega}\left{ tr\left[\Lambda{k}\left(\Omega{k + 1} - Z\right)\right] + \frac{\rho}{2}\left| \Omega{k + 1} - Z \right|_{F}{2} \right} \ &= \lambda(1 - \alpha)Z + \mbox{sign}(Z)\lambda\alpha - \Lambda{k} - \rho\left( \Omega{k + 1} - Z \right) \end{align}
where sign is the elementwise sign operator. By setting the gradient/sub-differential equal to zero, we arrive at the following equivalence:
\[ Z_{ij}^{k + 1} = \frac{1}{\lambda(1 - \alpha) + \rho}\left( \rho \Omega_{ij}^{k + 1} + \Lambda_{ij}^{k} - \mbox{sign}(Z_{ij}^{k + 1})\lambda\alpha \right) \]
for all \(i = 1,…, p\) and \(j = 1,…, p\). We observe two scenarios:
\[ \rho\Omega_{ij}^{k + 1} + \Lambda_{ij}^{k} > \lambda\alpha \]
\[ \rho\Omega_{ij}^{k + 1} + \Lambda_{ij}^{k} < -\lambda\alpha \]
This implies that \(\mbox{sign}(Z_{ij}) = \mbox{sign}(\rho\Omega_{ij}^{k + 1} + \Lambda_{ij}^{k})\). Putting all the pieces together, we arrive at
\begin{align} Z{ij}{k + 1} &= \frac{1}{\lambda(1 - \alpha) + \rho}\mbox{sign}\left(\rho\Omega{ij}{k + 1} + \Lambda{ij}{k}\right)\left( \left| \rho\Omega{ij}{k + 1} + \Lambda{ij}{k} \right| - \lambda\alpha \right){+} \ &= \frac{1}{\lambda(1 - \alpha) + \rho}\mbox{soft}\left(\rho\Omega{ij}{k + 1} + \Lambda{ij}{k}, \lambda\alpha\right) \end{align}
where soft is the soft-thresholding function.
\vspace{1cm}
There is another popular, alternate form of the ADMM algorithm used by scaling the dual variable (\(\Lambda^{k}\)). Let us define \(R^{k} = \Omega - Z^{k}\) and \(U^{k} = \Lambda^{k}/\rho\).
\begin{align} tr\left[ \Lambda{k}\left( \Omega - Z{k} \right) \right] + \frac{\rho}{2}\left| \Omega - Z{k} \right|{F}{2} &= tr\left[ \Lambda{k}R{k} \right] + \frac{\rho}{2}\left| R{k} \right|{F}{2} \ &= \frac{\rho}{2}\left| R{k} + \Lambda{k}/\rho \right|{F}{2} - \frac{\rho}{2}\left| \Lambda{k}/\rho \right|{F}{2} \ &= \frac{\rho}{2}\left| R{k} + U{k} \right|{F}{2} - \frac{\rho}{2}\left| U{k} \right|{F}{2} \end{align}
Therefore, a scaled-form can now be written as
\begin{align} \Omega{k + 1} &= \arg\min{\Omega \in \mathbb{R}{+}{p}}\left{ tr\left(S\Omega\right) - \log\left|\Omega\right| + \frac{\rho}{2}\left| \Omega - Z{k} + U{k} \right|{F}{2} \right} \ Z{k + 1} &= \arg\min{Z \in \mathbb{S}{p}}\left{ \lambda\left[ \frac{1 - \alpha}{2}\left| Z \right|{F}{2} + \alpha\left| Z \right|{1} \right] + \frac{\rho}{2}\left| \Omega{k + 1} - Z + U{k} \right|_{F}{2} \right} \ U{k + 1} &= U{k} + \Omega{k + 1} - Z{k + 1} \end{align}
And more generally (in vector form),
\begin{align} x{k + 1} &= \arg\min{x}\left{ f(x) + \frac{\rho}{2}\left| Ax + Bz{k} - c + u{k} \right|{2}{2} \right} \ z{k + 1} &= \arg\min{z}\left{ g(z) + \frac{\rho}{2}\left| Ax{k + 1} + Bz - c + u{k} \right|{2}{2} \right} \ u{k + 1} &= u{k} + Ax{k + 1} + Bz{k + 1} - c \end{align}
Note that there are limitations to using this method. Because the dual variable is scaled by \(\rho\) (the step size), this form limits one to using a constant step size without making further adjustments to \(U^{k}\). It has been shown in the literature that a dynamic step size (like the one used in ADMMsigma
) can significantly reduce the number of iterations required for convergence.
\vspace{1cm}
In discussing the optimality conditions and stopping criterion, we will follow the steps outlined in @boyd2011distributed and cater them to precision matrix estimation.
Below we have three optimality conditions:
\[ \Omega^{k + 1} - Z^{k + 1} = 0 \]
\[ 0 \in \partial f\left(\Omega^{k + 1}\right) + \Lambda^{k + 1} \]
\[ 0 \in \partial g\left(Z^{k + 1}\right) - \Lambda^{k + 1} \]
The first dual optimality condition is a result of taking the sub-differential of the lagrangian (non-augmented) with respect to \(\Omega^{k + 1}\) (note that we must honor the symmetric constraint of \(\Omega^{k + 1}\)) and the second is a result of taking the sub-differential of the lagrangian with respect to \(Z^{k + 1}\) (no symmetric constraint).
We will define the left-hand side of the first condition as the primal residual \(r^{k + 1} = \Omega^{k + 1} - Z^{k + 1}\). At convergence, optimality conditions require that \(r^{k + 1} \approx 0\). The second residual we will define is the dual residual \(s^{k + 1} = \rho\left( Z^{k + 1} - Z^{k} \right)\). This residual is derived from the following:
Because \(\Omega^{k + 1}\) is the argument that minimizes \(L_{p}\left( \Omega, Z^{k}, \Lambda^{k} \right)\),
\begin{align} 0 &\in \partial \left{ f\left(\Omega{k + 1}\right) + tr\left[ \Lambda{k}\left( \Omega{k + 1} - Z{k} \right) \right] + \frac{\rho}{2}\left| \Omega{k + 1} - Z{k} \right|_{F}{2} \right} \ &= \partial f\left(\Omega{k + 1}\right) + \Lambda{k} + \rho\left(\Omega{k + 1} - Z{k}\right) \ &= \partial f\left(\Omega{k + 1}\right) + \Lambda{k} + \rho\left(\Omega{k + 1} + Z{k + 1} - Z{k + 1} - Z{k}\right) \ &= \partial f\left(\Omega{k + 1}\right) + \Lambda{k} + \rho\left(\Omega{k + 1} - Z{k + 1}\right) + \rho\left(Z{k + 1} - Z{k}\right) \ &= \partial f\left(\Omega{k + 1}\right) + \Lambda{k + 1} + \rho\left(Z{k + 1} - Z{k}\right) \ \Rightarrow 0 &\in \rho\left( Z{k + 1} - Z{k} \right) \end{align}
Like the primal residual, at convergence optimality conditions require that \(s^{k + 1} \approx 0\). Note that the second dual optimality condition is always satisfied:
\begin{align} 0 &\in \partial \left{ g\left(Z{k + 1}\right) + tr\left[ \Lambda{k}\left( \Omega{k + 1} - Z{k + 1} \right) \right] + \rho\left| \Omega{k + 1} - Z{k + 1} \right|_{F}{2} \right} \ &= \partial g\left(Z{k + 1}\right) - \Lambda{k} - \rho\left(\Omega{k + 1} - Z{k + 1}\right) \ &= \partial g\left(Z{k + 1}\right) - \Lambda{k + 1} \ \end{align}
One possible stopping criterion is to set \(\epsilon^{rel} = \epsilon^{abs} = 10^{-3}\) and stop the algorithm when \(\epsilon^{pri} \leq \left\| r^{k + 1} \right\|_{F}\) and \(\epsilon^{dual} \leq \left\| s^{k + 1} \right\|_{F}\) where
\begin{align} \epsilon{pri} &= p\epsilon{abs} + \epsilon{rel}\max\left{ \left| \Omega{k + 1} \right|{F}, \left| Z{k + 1} \right|{F} \right} \ \epsilon{dual} &= p\epsilon{abs} + \epsilon{rel}\left| \Lambda{k + 1} \right|_{F} \end{align}
\newpage