title: "softImpute Vignette"
author: "Trevor Hastie"
date: "September 5, 2014"
output: html_document

softImpute is a package for matrix completion using nuclear norm regularization. It offers two algorithms: an iterative soft-thresholded SVD (type="svd"), and a faster alternating ridge-regression method, softImpute-ALS (type="als", the default). Both are described below.

What softImpute solves

Here we briefly describe the problem solved. Suppose \(X\) is a large \(m\times n\) matrix, with many missing entries. Let \(\Omega\) contain the pairs of indices \((i,j)\) where \(X\) is observed, and let \(P_\Omega(X)\) denote a matrix with the entries in \(\Omega\) left alone, and all other entries set to zero. So when \(X\) has missing entries in \(\Omega^\perp\), \(P_\Omega(X)\) would set the missing values to zero.
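
In code, \(P_\Omega\) simply zeroes out the unobserved entries; a one-line sketch (the name p_omega is ours, for illustration only):

p_omega=function(X){X[is.na(X)]=0; X}   # keep observed entries, set the missing ones to zero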

Consider the criterion \[\min_M\frac12\|P_\Omega(X)-P_\Omega(M)\|^2_F+\lambda\|M\|_*,\] where \(\|M\|_*\) is the nuclear norm of \(M\) (sum of singular values).

If \(\widehat M\) solves this convex-cone problem, then it satisfies the following stationarity condition: \[ {\widehat M}=S_\lambda(Z)\] where \[Z=P_\Omega(X)+P_{\Omega^\perp}(\widehat M).\] Hence \(Z\) is the "filled in" version of \(X\). The operator \(S_\lambda(Z)\) applied to a matrix \(Z\) does the following:

  1. Compute the SVD of \(Z=UDV^T\), and let \(d_i\) be the singular values in \(D\).
  2. Soft-threshold the singular values: \(d_i^*= (d_i-\lambda)_+\).
  3. Reconstruct: \(S_\lambda(Z)=UD^*V^T\). We call this operation the "soft-thresholded SVD". Notice that for sufficiently large \(\lambda\), \(D^*\) will be rank-reduced, and hence so will be \(UD^*V^T\).

This suggests the obvious iterative algorithm: using the current estimate for \(M\), create \(Z\), and update \(M\) by the soft-thresholded SVD of \(Z\).

This is exactly what softImpute does on (small) matrices with missing values stored as NAs. By small we mean small enough that the SVD can be computed by R in a small amount of time.
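
To make the recipe concrete, here is a bare-bones sketch of the iteration using base R's svd(). It is illustrative only, not the package's implementation; the helper name soft_impute_naive is ours, and it ignores convergence checks and centering.

soft_impute_naive=function(X,lambda,maxit=100){
  miss=is.na(X)
  M=matrix(0,nrow(X),ncol(X))   # current estimate of M
  for(it in seq_len(maxit)){
    Z=X
    Z[miss]=M[miss]             # Z = P_Omega(X) + P_Omega-perp(M)
    s=svd(Z)
    dstar=pmax(s$d-lambda,0)    # soft-threshold the singular values
    M=s$u %*% (dstar*t(s$v))    # reconstruct U D* V^T
  }
  M
}

For a small matrix X with NAs, soft_impute_naive(X, lambda) mimics this recipe, without the rank restriction or the careful convergence checks used by the package.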

This is not tenable for very large matrices, like those stored as class "Incomplete". Here we make two very important changes to the recipe:

  1. Rather than a full SVD, we compute only a reduced-rank soft-thresholded SVD of \(Z\) at each step, using alternating subspace methods with warm starts.
  2. We never form the filled-in matrix \(Z\) explicitly; since \(Z=\left[P_\Omega(X)-P_\Omega(M)\right]+M\), it can be kept in "sparse plus low-rank" form, so the matrix multiplies needed by the subspace iterations remain cheap.

Indeed, softImpute has a rank argument that allows one to limit the rank of the solution; if the algorithm returns a solution with rank lower than the specified rank \(r\), you know it has solved the unrestricted problem.

Consider the alternative criterion \[\min_{A,B}\frac12\|P_\Omega(X)-P_\Omega(AB^T)\|^2_F+\frac{\lambda}2(\|A\|_F^2 +\|B\|_F^2),\] where \(A\) and \(B\) each have \(r\) columns, and suppose that \(r\) is at least as large as the rank of the solution of the earlier criterion. This problem is not convex, but remarkably, it has a solution that satisfies \({\widehat A}{\widehat B}^T=\widehat M\)!

We can once again characterize the stationarity conditions, now in terms of \(\widehat A\) and \(\widehat B\). Let \[Z=P_\Omega(X)+P_{\Omega^\perp}({\widehat A}{\widehat B}^T),\] the filled-in version of \(X\). Then \[{\widehat B}^T= ({\widehat A}^T{\widehat A}+\lambda I)^{-1}{\widehat A}^T Z;\] that is, we get \(\widehat B\) by ridge regressions of the columns of \(Z\) on \(\widehat A\). For \(\widehat A\) it's the same, with the roles of \(A\) and \(B\) reversed. This again suggests an obvious alternating algorithm, now via ridge regressions. After each regression, we update the corresponding component \(A\) or \(B\), and the filled-in \(Z\). If \(r\) is sufficiently large, this again solves the same problem as before.
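
As a rough sketch (not the package's internal code), one ALS half-step, updating \(B\) given \(A\), is a single ridge solve; the helper name als_update_B is hypothetical:

als_update_B=function(Z,A,lambda){
  r=ncol(A)
  # B^T = (A^T A + lambda I)^{-1} A^T Z : ridge regressions of Z's columns on A
  BT=solve(crossprod(A)+lambda*diag(r), crossprod(A,Z))
  t(BT)                         # return B (n x r)
}

The update for \(A\) is the same with the roles reversed, i.e. A = als_update_B(t(Z), B, lambda).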

This last algorithm (softImpute ALS) can be seen as combining the alternating subspace SVD algorithm for computing the SVD with the iterative filling in and SVD calculation. It turns out that this interweaving leads to computational savings, and allows for a very efficient distributed implementation (not covered here).

A simple example

We will start with a small, simple example. Let's generate a small matrix and make some values missing:

require(softImpute)
## Loading required package: softImpute
## Loading required package: Matrix
## Loaded softImpute 1.4
set.seed(1011)
x=matrix(rnorm(30),6,5)
x[sample(1:30,10,replace=FALSE)]=NA
x
##            [,1]        [,2]       [,3]        [,4]        [,5]
## [1,]  0.8654889  0.01565179  0.1747903          NA          NA
## [2,] -0.6004172          NA -0.2119090          NA          NA
## [3,] -0.7169292          NA         NA  0.06437356 -0.09754133
## [4,]  0.6965558 -0.50331812  0.5584839  1.54375663          NA
## [5,]  1.2311610 -0.34232368 -0.8102688 -0.82006429 -0.13256942
## [6,]  0.2664415  0.14486388         NA          NA -2.24087863
fits=softImpute(x,trace=TRUE,type="svd")
## 1 : obj 0.09323 ratio 0.03954925 
## 2 : obj 0.08281 ratio 0.0304397 
## 3 : obj 0.07372 ratio 0.05560424 
## 4 : obj 0.05961 ratio 0.0692083 
## 5 : obj 0.03804 ratio 0.03375309 
## 6 : obj 0.02178 ratio 0.01186641 
## 7 : obj 0.01407 ratio 0.004025632 
## 8 : obj 0.01094 ratio 0.001469776 
## 9 : obj 0.00965 ratio 0.0006177431 
## 10 : obj 0.00906 ratio 0.0003083128 
## 11 : obj 0.00874 ratio 0.0001828115 
## 12 : obj 0.00854 ratio 0.000124992 
## 13 : obj 0.0084 ratio 9.42978e-05 
## 14 : obj 0.00829 ratio 7.552612e-05 
## 15 : obj 0.0082 ratio 6.258545e-05 
## 16 : obj 0.00812 ratio 5.286399e-05 
## 17 : obj 0.00805 ratio 4.515155e-05 
## 18 : obj 0.008 ratio 3.88322e-05 
## 19 : obj 0.00795 ratio 3.355663e-05 
## 20 : obj 0.00791 ratio 2.910375e-05 
## 21 : obj 0.00787 ratio 2.53197e-05 
## 22 : obj 0.00784 ratio 2.208953e-05 
## 23 : obj 0.00781 ratio 1.932322e-05 
## 24 : obj 0.00779 ratio 1.694815e-05 
## 25 : obj 0.00777 ratio 1.490467e-05 
## 26 : obj 0.00775 ratio 1.31432e-05 
## 27 : obj 0.00773 ratio 1.162228e-05 
## 28 : obj 0.00771 ratio 1.030704e-05 
## 29 : obj 0.0077 ratio 9.168035e-06
fits
## $u
##              [,1]       [,2]
## [1,] -0.356510556 -0.2679832
## [2,]  0.308921735  0.1715671
## [3,] -0.006010048  0.3681593
## [4,] -0.605838251 -0.1107379
## [5,]  0.056717156 -0.8263298
## [6,] -0.638102376  0.2610070
## 
## $d
## [1] 4.752691 2.056353
## 
## $v
##             [,1]       [,2]
## [1,] -0.21291481 -0.7896647
## [2,]  0.04905133  0.2236334
## [3,] -0.23498906  0.4350556
## [4,] -0.56011295  0.3393672
## [5,]  0.76375053  0.1482374
## 
## attr(,"lambda")
## [1] 0
## attr(,"call")
## softImpute(x = x, type = "svd", trace.it = TRUE)

Since this is a small matrix, softImpute has solved it using repeated SVDs. There is no penalization here (\(\lambda=0\)), and by default the rank was taken to be 2. With no penalization, if the rank were allowed to be \(\min(m,n)\) there would be no restriction at all, and any values for the missing data would give the same minimum loss of 0. In other words, either penalization or a rank restriction (or both) is needed for sensible imputation.

We could instead use ALS here (the default for the type= argument):

fita=softImpute(x,trace=TRUE)
## 1 : obj 0.24087 ratio 5.681397 
## 2 : obj 0.07848 ratio 0.1433758 
## 3 : obj 0.05745 ratio 0.06018666 
## 4 : obj 0.035 ratio 0.04942818 
## 5 : obj 0.01739 ratio 0.02011121 
## 6 : obj 0.01077 ratio 0.004749478 
## 7 : obj 0.0091 ratio 0.001136182 
## 8 : obj 0.00862 ratio 0.0003569266 
## 9 : obj 0.00841 ratio 0.0001601899 
## 10 : obj 0.00829 ratio 9.816138e-05 
## 11 : obj 0.00819 ratio 7.211271e-05 
## 12 : obj 0.00812 ratio 5.752175e-05 
## 13 : obj 0.00805 ratio 4.751506e-05 
## 14 : obj 0.008 ratio 3.986899e-05 
## 15 : obj 0.00795 ratio 3.372181e-05 
## 16 : obj 0.00791 ratio 2.866179e-05 
## 17 : obj 0.00788 ratio 2.444886e-05 
## 18 : obj 0.00785 ratio 2.092036e-05 
## 19 : obj 0.00782 ratio 1.795519e-05 
## 20 : obj 0.0078 ratio 1.545825e-05 
## 21 : obj 0.00778 ratio 1.33526e-05 
## 22 : obj 0.00777 ratio 1.157505e-05 
## 23 : obj 0.00776 ratio 1.007324e-05 
## 24 : obj 0.00774 ratio 8.803524e-06

The objectives are different! At this point we are playing with non-convex optimization, so the solutions can be local minima. Let's use some regularization now, choosing a value of lambda that gives a rank-2 solution (finding it required some trial and error).

fits2=softImpute(x,rank.max=3,lambda=1.9,trace=TRUE,type="svd")
## 1 : obj 0.32634 ratio 0.004981461 
## 2 : obj 0.32629 ratio 0.0005114748 
## 3 : obj 0.32629 ratio 0.0002365886 
## 4 : obj 0.32629 ratio 0.0001715126 
## 5 : obj 0.32628 ratio 0.0001295297 
## 6 : obj 0.32628 ratio 9.811628e-05 
## 7 : obj 0.32628 ratio 7.430162e-05 
## 8 : obj 0.32628 ratio 5.62241e-05 
## 9 : obj 0.32628 ratio 4.250505e-05 
## 10 : obj 0.32628 ratio 3.210126e-05 
## 11 : obj 0.32628 ratio 2.421928e-05 
## 12 : obj 0.32628 ratio 1.825438e-05 
## 13 : obj 0.32628 ratio 1.374544e-05 
## 14 : obj 0.32628 ratio 1.034099e-05 
## 15 : obj 0.32628 ratio 7.773308e-06
fita2=softImpute(x,rank.max=3,lambda=1.9,trace=TRUE)
## 1 : obj 0.37079 ratio 1.161079 
## 2 : obj 0.3342 ratio 0.09961669 
## 3 : obj 0.32942 ratio 0.04511951 
## 4 : obj 0.328 ratio 0.02392935 
## 5 : obj 0.3273 ratio 0.01261684 
## 6 : obj 0.32694 ratio 0.006532963 
## 7 : obj 0.32674 ratio 0.003422696 
## 8 : obj 0.32662 ratio 0.001866468 
## 9 : obj 0.32655 ratio 0.001082436 
## 10 : obj 0.3265 ratio 0.0006754999 
## 11 : obj 0.32647 ratio 0.0004539165 
## 12 : obj 0.32644 ratio 0.0003257315 
## 13 : obj 0.32642 ratio 0.0002465175 
## 14 : obj 0.32641 ratio 0.0001943276 
## 15 : obj 0.32639 ratio 0.0001579382 
## 16 : obj 0.32638 ratio 0.0001313475 
## 17 : obj 0.32637 ratio 0.0001111766 
## 18 : obj 0.32636 ratio 9.541774e-05 
## 19 : obj 0.32636 ratio 8.281382e-05 
## 20 : obj 0.32635 ratio 7.25401e-05 
## 21 : obj 0.32635 ratio 6.403321e-05 
## 22 : obj 0.32634 ratio 5.689528e-05 
## 23 : obj 0.32634 ratio 5.083749e-05 
## 24 : obj 0.32633 ratio 4.564542e-05 
## 25 : obj 0.32633 ratio 4.115677e-05 
## 26 : obj 0.32633 ratio 3.724667e-05 
## 27 : obj 0.32632 ratio 3.381765e-05 
## 28 : obj 0.32632 ratio 3.079265e-05 
## 29 : obj 0.32632 ratio 2.810997e-05 
## 30 : obj 0.32632 ratio 2.571972e-05 
## 31 : obj 0.32631 ratio 2.358112e-05 
## 32 : obj 0.32631 ratio 2.166052e-05 
## 33 : obj 0.32631 ratio 1.992993e-05 
## 34 : obj 0.32631 ratio 1.836585e-05 
## 35 : obj 0.32631 ratio 1.694842e-05 
## 36 : obj 0.32631 ratio 1.566075e-05 
## 37 : obj 0.3263 ratio 1.448833e-05 
## 38 : obj 0.3263 ratio 1.341867e-05 
## 39 : obj 0.3263 ratio 1.244096e-05 
## 40 : obj 0.3263 ratio 1.154575e-05 
## 41 : obj 0.3263 ratio 1.072478e-05 
## 42 : obj 0.3263 ratio 9.970789e-06 
## final SVD: obj 0.32628
fits2$d
## [1] 0.44477822 0.06991327 0.00000000

These two objective values are the same (up to the convergence criterion). Because the smallest singular value is zero, the rank constraint is not active, so we are in the convex regime and both algorithms give the same solution. We can impute the missing values using complete(), which returns the full matrix:

complete(x,fits2)
##            [,1]          [,2]         [,3]          [,4]        [,5]
## [1,]  0.8654889  0.0156517882  0.174790310 -0.0031476922 -0.06440892
## [2,] -0.6004172  0.0013211313 -0.211909000  0.0004572512  0.04221075
## [3,] -0.7169292  0.0006880645  0.003009928  0.0643735628 -0.09754133
## [4,]  0.6965558 -0.5033181179  0.558483917  1.5437566264  0.01520347
## [5,]  1.2311610 -0.3423236829 -0.810268751 -0.8200642912 -0.13256942
## [6,]  0.2664415  0.1448638842 -0.053927175 -0.0732346519 -2.24087863

We can also double-center our matrix before completing it, using biScale():

xc=biScale(x,col.scale=FALSE,row.scale=FALSE,trace=TRUE)
## Iter 1 Total Changes 1.662759 
## Iter 2 Total Changes 0.07348511 
## Iter 3 Total Changes 0.002380561 
## Iter 4 Total Changes 0.0001529569 
## Iter 5 Total Changes 1.082877e-05 
## Iter 6 Total Changes 7.748851e-07 
## Iter 7 Total Changes 5.552276e-08 
## Iter 8 Total Changes 3.979003e-09 
## Iter 9 Total Changes 2.851535e-10
xc
##            [,1]        [,2]        [,3]       [,4]       [,5]
## [1,]  0.1390011 -0.09270246 -0.04629866         NA         NA
## [2,] -0.4469535          NA  0.44695354         NA         NA
## [3,] -0.8252855          NA          NA  0.1160078  0.7092777
## [4,] -0.1981945 -0.77993484  0.16913247  0.8089969         NA
## [5,]  0.9662344  0.01088327 -0.56979652 -0.9250004  0.5176792
## [6,]  0.3651963  0.86175224          NA         NA -1.2269486
## attr(,"biScale:row")
## attr(,"biScale:row")$center
## [1]  0.38623908 -0.49371243 -0.23189248  0.55450155 -0.07532212 -0.43900352
## 
## attr(,"biScale:row")$scale
## [1] 1 1 1 1 1 1
## 
## attr(,"biScale:column")
## attr(,"biScale:column")$center
## [1]  0.3402487 -0.2778848 -0.1651501  0.1802582 -0.5749265
## 
## attr(,"biScale:column")$scale
## [1] 1 1 1 1 1
## 
## attr(,"critmat")
##       iter         crit
##  [1,]    1 1.662759e+00
##  [2,]    2 7.348511e-02
##  [3,]    3 2.380561e-03
##  [4,]    4 1.529569e-04
##  [5,]    5 1.082877e-05
##  [6,]    6 7.748851e-07
##  [7,]    7 5.552276e-08
##  [8,]    8 3.979003e-09
##  [9,]    9 2.851535e-10
fits3=softImpute(xc,rank.max=3,lambda=1,type="svd")
fits3$d
## [1] 1.0937636 0.6483598 0.0000000
complete(x,fits3,unscale=TRUE)
##            [,1]        [,2]       [,3]        [,4]        [,5]
## [1,]  0.8654889  0.01565179  0.1747903  0.53515048 -0.16774406
## [2,] -0.6004172 -0.85835047 -0.2119090 -0.14764048 -1.04932194
## [3,] -0.7169292 -0.78692158 -0.3100817  0.06437356 -0.09754133
## [4,]  0.6965558 -0.50331812  0.5584839  1.54375663  0.16449974
## [5,]  1.2311610 -0.34232368 -0.8102688 -0.82006429 -0.13256942
## [6,]  0.2664415  0.14486388 -0.6315001 -0.37088469 -2.24087863

Notice that we completed x with fits3, which was run on the centered version xc. The scaling info is stored on the SVD object as an attribute, and with unscale=TRUE (actually the default), the centering is reversed.

Debiasing the fit

We have recently added a function deBias() (since version 1.4) that rescales upward the elements d of the SVD object returned by softImpute, to undo some of the shrinkage. This is achieved by linear regression using the observed values of x.

fits2$d
## [1] 0.44477822 0.06991327 0.00000000
deBias(x,fits2)$d
## [1] 2.343979 1.999733 1.887577
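
Conceptually (a sketch, not the package's deBias code), the new d values come from regressing the observed entries of x on the corresponding entries of the rank-one components:

obs=!is.na(x)
features=sapply(seq_along(fits2$d), function(k) (fits2$u[,k] %o% fits2$v[,k])[obs])
coef(lm(x[obs] ~ features - 1))   # unpenalized re-estimates of d, one per component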

Using the sparse matrix version

So far we have not worried about matrix size, because x is small. We can convert it to a sparse-matrix object:

xs=as(x,"Incomplete")
xs
## 6 x 5 sparse Matrix of class "Incomplete"
##                                                               
## [1,]  0.8654889  0.01565179  0.1747903  .           .         
## [2,] -0.6004172  .          -0.2119090  .           .         
## [3,] -0.7169292  .           .          0.06437356 -0.09754133
## [4,]  0.6965558 -0.50331812  0.5584839  1.54375663  .         
## [5,]  1.2311610 -0.34232368 -0.8102688 -0.82006429 -0.13256942
## [6,]  0.2664415  0.14486388  .          .          -2.24087863

Notice that it stores the missing entries as "zeros", but because it has class "Incomplete", the object "knows" that these are not really zeros. In practice we would not run as() on a huge matrix with many missing values, because we probably could not fit it in memory. So we need a way of getting such a matrix into R. It is typically stored on disk in what is known as "market matrix" format, as the triples (i,j,value). We can reverse-engineer this here just for a demo:

i=row(x)[!is.na(x)]
j=col(x)[!is.na(x)]
value=x[!is.na(x)]
cbind(i,j,value)
##       i j       value
##  [1,] 1 1  0.86548894
##  [2,] 2 1 -0.60041722
##  [3,] 3 1 -0.71692924
##  [4,] 4 1  0.69655580
##  [5,] 5 1  1.23116102
##  [6,] 6 1  0.26644155
##  [7,] 1 2  0.01565179
##  [8,] 4 2 -0.50331812
##  [9,] 5 2 -0.34232368
## [10,] 6 2  0.14486388
## [11,] 1 3  0.17479031
## [12,] 2 3 -0.21190900
## [13,] 4 3  0.55848392
## [14,] 5 3 -0.81026875
## [15,] 3 4  0.06437356
## [16,] 4 4  1.54375663
## [17,] 5 4 -0.82006429
## [18,] 3 5 -0.09754133
## [19,] 5 5 -0.13256942
## [20,] 6 5 -2.24087863

For the Netflix data dimensions, there are 99 million non-missing entries, far fewer than the roughly 8.5 billion entries in the full matrix. We would input the data via the (i,j,value) representation, and then construct xs:

Incomplete(i=i,j=j,x=value)
## 6 x 5 sparse Matrix of class "Incomplete"
##                                                               
## [1,]  0.8654889  0.01565179  0.1747903  .           .         
## [2,] -0.6004172  .          -0.2119090  .           .         
## [3,] -0.7169292  .           .          0.06437356 -0.09754133
## [4,]  0.6965558 -0.50331812  0.5584839  1.54375663  .         
## [5,]  1.2311610 -0.34232368 -0.8102688 -0.82006429 -0.13256942
## [6,]  0.2664415  0.14486388  .          .          -2.24087863

Let's pretend this object is huge, for the purposes of our demonstration. We can double-center it just as before, and run softImpute():

xsc=biScale(xs,col.scale=FALSE,row.scale=FALSE)
fitss=softImpute(xsc,rank.max=3,lambda=1,trace=TRUE,type="svd")
## Ssvd.als: 1 : obj 0.12477 ratio 1.399773 
## Ssvd.als: 2 : obj 0.10698 ratio 0.03346523 
## Ssvd.als: 3 : obj 0.10654 ratio 0.001959896 
## Ssvd.als: 4 : obj 0.10647 ratio 0.0002632419 
## Ssvd.als: 5 : obj 0.10645 ratio 7.161577e-05 
## Ssvd.als: 6 : obj 0.10643 ratio 3.543294e-05 
## Ssvd.als: 7 : obj 0.10642 ratio 2.193428e-05 
## Ssvd.als: 8 : obj 0.10641 ratio 1.465807e-05 
## Ssvd.als: 9 : obj 0.1064 ratio 1.020856e-05 
## Ssvd.als: 10 : obj 0.1064 ratio 7.331556e-06 
## Ssvd.als: 1 : obj 0.10493 ratio 0.007422874 
## Ssvd.als: 2 : obj 0.10476 ratio 0.0004016194 
## Ssvd.als: 3 : obj 0.10475 ratio 2.523417e-05 
## Ssvd.als: 4 : obj 0.10475 ratio 1.678042e-06 
## 1 : obj 0.15764 ratio 0.0112135 
## Ssvd.als: 1 : obj 0.10461 ratio 0.0009216742 
## Ssvd.als: 2 : obj 0.10459 ratio 6.319806e-05 
## Ssvd.als: 3 : obj 0.10459 ratio 4.108082e-06 
## 2 : obj 0.15695 ratio 0.001509593 
## Ssvd.als: 1 : obj 0.10456 ratio 0.00022082 
## Ssvd.als: 2 : obj 0.10455 ratio 1.726476e-05 
## Ssvd.als: 3 : obj 0.10455 ratio 1.192326e-06 
## 3 : obj 0.15685 ratio 0.0003756824 
## Ssvd.als: 1 : obj 0.10454 ratio 7.009554e-05 
## Ssvd.als: 2 : obj 0.10454 ratio 5.494532e-06 
## 4 : obj 0.15682 ratio 0.0001152908 
## Ssvd.als: 1 : obj 0.10454 ratio 2.460723e-05 
## Ssvd.als: 2 : obj 0.10454 ratio 1.919478e-06 
## 5 : obj 0.15681 ratio 4.042747e-05 
## Ssvd.als: 1 : obj 0.10454 ratio 8.880054e-06 
## 6 : obj 0.15681 ratio 1.305343e-05 
## Ssvd.als: 1 : obj 0.10454 ratio 3.578849e-06 
## 7 : obj 0.15681 ratio 4.999183e-06
fitss$d
## [1] 1.0935910 0.6484576 0.0000000
fits3$d
## [1] 1.0937636 0.6483598 0.0000000

Notice that we get additional trace information with trace=TRUE. Since xs is "huge", the SVD is computed using alternating subspace methods (with warm starts), and those inner-loop iterations appear in the trace as well.

With huge matrices we would not use the complete() function, but rather request individual predictions. For example, entries (2,2) and (3,2) can be imputed via

impute(fitss,i=c(2,3),j=c(2,2))
## [1] -0.8584843 -0.7868812

Again, the unscale=TRUE default for impute() means that the centering (stored on the object fitss) was reversed.

Reduced rank SVD of large sparse matrices

This is almost an aside, but the tools we developed for large matrix-completion problems are also useful for working with large (but sparse) matrices. Suppose we had such a beast and wanted to compute its principal components. We would need to column-center the matrix, which would render it no longer sparse. We would also want to compute just a few of the largest singular vectors. Let's see how we do that.

First we will read in our large matrix, again in market-matrix format. For simplicity we use our same matrix x, except now the missing values are zeros.

x0=sparseMatrix(i=i,j=j,x=value)
x0
## 6 x 5 sparse Matrix of class "dgCMatrix"
##                                                               
## [1,]  0.8654889  0.01565179  0.1747903  .           .         
## [2,] -0.6004172  .          -0.2119090  .           .         
## [3,] -0.7169292  .           .          0.06437356 -0.09754133
## [4,]  0.6965558 -0.50331812  0.5584839  1.54375663  .         
## [5,]  1.2311610 -0.34232368 -0.8102688 -0.82006429 -0.13256942
## [6,]  0.2664415  0.14486388  .          .          -2.24087863
x0c=biScale(x0,col.scale=FALSE,row.scale=FALSE,row.center=FALSE)
x0c
## An object of class "SparseplusLowRank"
## Slot "x":
## 6 x 5 sparse Matrix of class "dgCMatrix"
##                                                               
## [1,]  0.8654889  0.01565179  0.1747903  .           .         
## [2,] -0.6004172  .          -0.2119090  .           .         
## [3,] -0.7169292  .           .          0.06437356 -0.09754133
## [4,]  0.6965558 -0.50331812  0.5584839  1.54375663  .         
## [5,]  1.2311610 -0.34232368 -0.8102688 -0.82006429 -0.13256942
## [6,]  0.2664415  0.14486388  .          .          -2.24087863
## 
## Slot "a":
##      [,1] [,2]
## [1,]    0    1
## [2,]    0    1
## [3,]    0    1
## [4,]    0    1
## [5,]    0    1
## [6,]    0    1
## 
## Slot "b":
##      [,1]        [,2]
## [1,]   -1 -0.29038347
## [2,]   -1  0.11418769
## [3,]   -1  0.04815059
## [4,]   -1 -0.13134432
## [5,]   -1  0.41183156

So the column-centered matrix is not stored as a dense matrix; it now has class "SparseplusLowRank", with slots a and b, and slot x holding the original sparse matrix (x0). The centered matrix is \(x_0+ab^T\).
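
As a quick sanity check (an illustrative snippet, assuming the slot names and layout shown in the printout above), the sparse-plus-low-rank pieces reproduce the dense column-centered matrix:

xc_dense=as.matrix(x0c@x + x0c@a %*% t(x0c@b))   # x0 + a b^T, made dense
all.equal(xc_dense, scale(as.matrix(x0), center=TRUE, scale=FALSE), check.attributes=FALSE)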

Now we compute the SVD of this matrix, using our function svd.als() (a workhorse for softImpute()):

svdx0c=svd.als(x0c,rank=3,trace=TRUE)
## Ssvd.als: 1 : obj 0.07908 ratio 4.007993 
## Ssvd.als: 2 : obj 0.01221 ratio 0.2447519 
## Ssvd.als: 3 : obj 0.00414 ratio 0.00376972 
## Ssvd.als: 4 : obj 0.00409 ratio 2.135618e-05 
## Ssvd.als: 5 : obj 0.00409 ratio 1.215228e-07
svdx0c$d
## [1] 2.105814 1.928454 1.766865

We can compare this to the SVD of the dense version of the same matrix:

x02=as.matrix(x0)
svd(scale(x02,TRUE,FALSE))$d
## [1] 2.10581400 1.92845422 1.76686517 0.48909346 0.07725295

One can actually use regularization here as well. For large problems, this can speed up convergence, and biases the convergence criterion toward the larger singular values. Note that if \(Z=S_\lambda(X)\) has rank \(r\), then the rank-\(r\) SVD of \(X\) will have the same left and right singular vectors as \(Z\), and singular values \(\lambda\) units higher.
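
For example (an illustrative call we have not run here; it assumes the d values returned by svd.als() are on the soft-thresholded scale, as the note above suggests), one could use a modest \(\lambda\) and then add it back:

svdreg=svd.als(x0c,rank=3,lambda=0.5)
svdreg$d + 0.5    # approximately the three leading singular values of x0c (all exceed 0.5)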

Warm starts and regularization paths

Typically we don't have a clue about what values of \(\lambda\) are reasonable. One useful function is lambda0(), which identifies the smallest value of \(\lambda\) for which the rank of the solution is zero.

lam0=lambda0(xs)
lam0
## [1] 2.316587
fit0=softImpute(xs,lambda=lam0+.2)
fit0$d
## [1] 0

(If we had used lam0 itself, we would have had to increase the number of iterations and decrease the convergence threshold for softImpute to achieve an exact zero with ALS.)

This value is actually the largest singular value of the version of xs with the missing values replaced by zeros. Let's check:

xs0=as(xs,"sparseMatrix")
fit0=svd.als(xs0)
fit0$d
## [1] 2.316587 1.978608

Now, armed with lam0, we can make a decreasing sequence of lambda values (here spaced on the log scale, from lam0 down to 1):

lamseq=exp(seq(from=log(lam0),to=log(1),length=10))
lamseq
##  [1] 2.316587 2.110134 1.922079 1.750784 1.594754 1.452630 1.323172
##  [8] 1.205251 1.097839 1.000000

Now the idea is to fit a sequence of models using these values of lambda, with warm starts. For large matrices we also need to be careful with the rank, because we cannot afford to fit models with very large rank. Here is an example of what we could do:

fits=as.list(lamseq)         # list to hold the sequence of fits
ranks=as.integer(lamseq)     # placeholder vector to record the solution ranks
rank.max=2                   # start with a small operating rank
warm=NULL                    # no warm start for the first (largest) lambda
for( i in seq(along=lamseq)){
  fiti=softImpute(xs,lambda=lamseq[i],rank=rank.max,warm=warm)
  ranks[i]=sum(round(fiti$d,4)>0)   # effective rank of this solution
  rank.max=min(ranks[i]+2,4)        # let the operating rank grow slowly (capped at 4 here)
  warm=fiti                         # warm start for the next, smaller lambda
  fits[[i]]=fiti
  cat(i,"lambda=",lamseq[i],"rank.max",rank.max,"rank",ranks[i],"\n")
  }
## Warning in Ssimpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start,
## : Convergence not achieved by 100 iterations
## 1 lambda= 2.316587 rank.max 3 rank 1 
## 2 lambda= 2.110134 rank.max 3 rank 1 
## 3 lambda= 1.922079 rank.max 4 rank 2 
## 4 lambda= 1.750784 rank.max 4 rank 3 
## 5 lambda= 1.594754 rank.max 4 rank 3 
## 6 lambda= 1.45263 rank.max 4 rank 3 
## 7 lambda= 1.323172 rank.max 4 rank 3 
## 8 lambda= 1.205251 rank.max 4 rank 3 
## 9 lambda= 1.097839 rank.max 4 rank 3 
## 10 lambda= 1 rank.max 4 rank 3

Notes: