To see a demonstration of the capabilities of liquidSVM from an R viewpoint, please look at the demo.
Disclaimer: liquidSVM and the R-bindings are in general quite stable and well tested by several people. However, use in production is at your own risk.
If you run into problems, please first check the documentation for more details, or report the bug to the maintainer.
There are several options to install the package.
The most convenient way is to use the standard install to get it from CRAN:
install.packages("liquidSVM")
You can also use our repository:
install.packages("liquidSVM", repos="http://www.isa.uni-stuttgart.de/software/R")
Note that in R a package can be installed either from source or as a binary:

Source (default on Linux systems): allows for optimizations for your system, and liquidSVM can benefit a lot from these. The drawback is that this needs a C++ compiler. This is usually not a problem on Linux systems, but on Windows you have to install Rtools, and on MacOS X Xcode from the Mac App Store (Xcode is also needed for older MacOS versions).

Binary (default on Windows and MacOS X): compiled versions are provided, so you do not need a compiler. However, these are optimized for generic processors (e.g. they do not use AVX), and hence you might do much better on your machine if you compile it yourself.
You can change the default behaviour of install.packages(...) under Windows/MacOS by using the parameter type="source".
The binaries in our repository are compiled using R version 3.*. If you use another version, they might not work and you have to try a source installation (type="source").

Note: on MacOS X there can be an issue with binary package installation. If you get the error tar: Failed to set default locale, consult https://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Internationalization-of-the-R_002eapp
Download the source or binary package from http://www.isa.uni-stuttgart.de/software/. On the command line use:
R CMD INSTALL path-to-package/liquidSVM_1.0.1.tar.gz
# Windows
Rcmd INSTALL path-to-package/liquidSVM_1.0.1.zip
# MacOS X using Terminal
R CMD INSTALL path-to-package/liquidSVM_1.0.1.tgz
or in a running R session:
install.packages("path-to-package/liquidSVM_1.0.1.tar.gz",repos=NULL)
# Windows binary
install.packages("path-to-package/liquidSVM_1.0.1.zip",repos=NULL)
# MacOS X binary
install.packages("path-to-package/liquidSVM_1.0.1.tgz",repos=NULL)
You can also use the means of any R IDE. E.g. in RStudio go to the menu Tools > Install Packages..., set "Install from:" to "Package Archive File (.tar.gz or .tgz)", choose your package file, and install it.
liquidSVM can be configured for different uses of available hardware. We provide the following configurations:
native: compiles for the current system, e.g. uses AVX or even AVX2 if available. This uses g++/clang++ -march=native -O3.

generic: compiles for a wide range of currently deployed CPUs (uses SSE). This uses g++/clang++ -mtune=generic -O3. Our binary packages are compiled with this configuration.

default: compiles using the default values provided by R.

debug: compiles with debugging enabled.

empty: gives no default compile arguments.
Additional compiler flags can be provided as well. On the command line, here are some examples:
R CMD INSTALL --configure-args=native path-to-package/liquidSVM_1.0.1.tar.gz
R CMD INSTALL --configure-args=generic path-to-package/liquidSVM_1.0.1.tar.gz
R CMD INSTALL --configure-args="empty -march=core2 -O3" path-to-package/liquidSVM_1.0.1.tar.gz
or in a running R session:
install.packages("liquidSVM",configure.args="native")
install.packages("liquidSVM",configure.args="generic")
install.packages("liquidSVM",configure.args="empty -march=core2 -O3")
Under MacOS you have to add the parameter type="source" in order to trigger compilation.
Hint: to see whether liquidSVM got compiled with SSE and/or AVX use:
compilationInfo()
#> [1] "Compiled without vectorization"
On Windows, unfortunately, neither --configure-args nor configure.args has any effect.
We enable compilation configuration by reading the environment variable LIQUIDSVM_CONFIGURE_ARGS and using it in the same way as the configure args on the other platforms (see above).
So on the Windows command line use
set LIQUIDSVM_CONFIGURE_ARGS=native
R CMD INSTALL path-to-package/liquidSVM_1.0.1.tar.gz
set LIQUIDSVM_CONFIGURE_ARGS=empty -march=core2 -O3
R CMD INSTALL path-to-package/liquidSVM_1.0.1.tar.gz
Note that no quotation marks have to be used. It is not tested whether paths with spaces work in this setting.
If you wish to install from within R you can specify the environment variable as well:
Sys.setenv(LIQUIDSVM_CONFIGURE_ARGS="native")
install.packages("liquidSVM")
Sys.setenv(LIQUIDSVM_CONFIGURE_ARGS="empty -march=core2 -O3")
install.packages("liquidSVM")
If you have Rtools installed then you should definitely try to use native, because on Windows we use generic as the default configuration even for source installs.
clang++ -march=native does not activate AVX even if it is available. Hence if you know it is available, use configure.args="native -mavx" or even configure.args="native -mavx2".

We also had a case where set LIQUIDSVM_CONFIGURE_ARGS=native compiled but crashed on execution: the compiler thought that fused multiply-add (FMA) was available, but it was not. The solution was to set LIQUIDSVM_CONFIGURE_ARGS=native -mno-fma.
For GCC it can help to use g++ -Q --help=target -march=native ... to figure out which options trigger which optimizations. For both GCC and clang you can also print the predefined compiler macros with g++ -march=native ... -dM -E - < /dev/null | egrep "SSE|AVX".
liquidSVM is also able to calculate the kernel on a GPU if it is compiled with CUDA support. Since there is a big overhead in moving the kernel matrix from the GPU memory, this is most useful for problems with many feature dimensions (see the demo). To activate CUDA support you have to specify its location (usually /usr/local/cuda) as a parameter to the configure arguments:
R CMD INSTALL --configure-args="native /my/path/to/cuda" path-to-package/liquidSVM_1.0.1.tar.gz
or again in R
install.packages('liquidSVM',configure.args="native /my/path/to/cuda")
Note that due to lack of testing machines this is known to work only on some Linux machines. The above instructions will probably not work on Windows!
If you have compiled with CUDA support, you can activate it for a computation by using svm(..., GPUs=1):
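For illustration, here is a minimal sketch (assuming the package was compiled with CUDA support; the bundled example data set 'reg-1d' is used only to make the call concrete):

library(liquidSVM)
sml <- liquidData('reg-1d')
## compute the kernel matrices on the first GPU
model <- svm(Y ~ ., sml$train, GPUs = 1)
errors(test(model, sml$test))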
The use of svm(...), lsSVM(...), mcSVM(...), etc. can be configured using the following parameters.
display: This parameter determines the amount of output you see on the screen: the larger its value is, the more you see. This can help as a progress indication.
scale: If set to a true value, then for every feature in the training data a scaling is calculated so that its values lie in the interval \([0,1]\). The training is then performed using these scaled values, and any testing data is scaled transparently as well. Because SVMs are not scale-invariant, data should be scaled for two main reasons: first, so that all features have the same weight, and second, to ensure that the default gamma parameters that liquidSVM provides remain meaningful. If you have not scaled the data beforehand, this is an easy option.
threads: This parameter determines the number of cores used for computing the kernel matrices, the validation error, and the test error.
* `threads=0` (default) means that all physical cores of your CPU run one thread.
* `threads=-1` means that all but one physical core of your CPU runs one thread.
partition_choice: This parameter determines the way the input space is partitioned. This allows larger data sets for which the kernel matrix does not fit into memory.
* `partition_choice=0` (default) disables partitioning.
* `partition_choice=6` usually gives the highest speed.
* `partition_choice=5` usually gives the best test error.
grid_choice: This parameter determines the size of the hyperparameter grid used during the training phase. Larger values correspond to larger grids. By default, a 10x10 grid is used. Exact descriptions are given in the next section.
adaptivity_control: This parameter determines whether an adaptive grid search heuristic is employed. Larger values lead to more aggressive strategies. The default adaptivity_control=0 disables the heuristic.
random_seed: This parameter determines the seed for the random number generator. random_seed=-1 uses the internal timer to create the seed. All other values lead to repeatable behavior of the SVM.
folds: How many folds should be used.
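As a small sketch, several of these parameters can be passed together in one call (the concrete values below are arbitrary and only illustrate the interface):

library(liquidSVM)
sml <- liquidData('reg-1d')
model <- lsSVM(Y ~ ., sml$train,
               display = 1,       ## show some progress output
               scale = TRUE,      ## scale features to [0,1]
               threads = 2,       ## use two cores
               folds = 5,         ## 5-fold cross-validation
               random_seed = 42)  ## repeatable behaviour
errors(test(model, sml$test))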
Parameters for regression (least-squares, quantile, and expectile)
clipping: This parameter determines whether the decision functions should be clipped at the specified value. The value clipping=-1.0 leads to an adaptive clipping value, whereas clipping=0 disables clipping.
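For example, a short sketch of least-squares regression with adaptive clipping (reusing the bundled 'reg-1d' example data):

sml <- liquidData('reg-1d')
## clipping = -1.0 lets liquidSVM pick an adaptive clipping value
model <- lsSVM(Y ~ ., sml$train, clipping = -1.0)
errors(test(model, sml$test))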
Parameters for multiclass classification determine the multiclass strategy:
mc-type=0: AvA with hinge loss.

mc-type=1: OvA with least squares loss.

mc-type=2: OvA with hinge loss.

mc-type=3: AvA with least squares loss.
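A sketch of a multiclass call follows; the argument name mc_type and its string value, as well as the 'banana-mc' example data set, are assumptions about the R binding (check ?mcSVM if they differ):

banana <- liquidData('banana-mc')
## mc_type = "OvA_ls" corresponds to mc-type=1 above (OvA with least squares loss)
model <- mcSVM(Y ~ ., banana$train, mc_type = "OvA_ls")
errors(test(model, banana$test))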
Parameters for Neyman-Pearson Learning:

class: The class on which the constraint is enforced.

constraint: The constraint on the false alarm rate. The script actually considers a couple of values around the value of constraint to give the user an informed choice.
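A hedged sketch of Neyman-Pearson learning; the function nplSVM and the 'banana-bc' example data set are assumptions here (see ?nplSVM), while class and constraint are the parameters described above:

banana <- liquidData('banana-bc')
## enforce a false alarm rate of roughly 0.05 on class 1
model <- nplSVM(Y ~ ., banana$train, class = 1, constraint = 0.05)
errors(test(model, banana$test))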
For Support Vector Machines two hyperparameters need to be determined:

gamma: the bandwidth of the kernel.

lambda: the classical regularization parameter in front of the norm term. It has to be chosen such that neither over- nor underfitting happens.

liquidSVM has a built-in cross-validation scheme to calculate validation errors for many values of these hyperparameters and then to choose the best pair. Since there are two parameters, this means we consider a two-dimensional grid.
For both parameters either specific values can be given or a geometrically spaced grid can be specified.
gamma_steps, min_gamma, max_gamma: specifies that in the interval between min_gamma and max_gamma there should be gamma_steps many values.

gammas: e.g. gammas=c(0.1,1,10,100) will use exactly these four gamma values.

lambda_steps, min_lambda, max_lambda: specifies that in the interval between min_lambda and max_lambda there should be lambda_steps many values.

lambdas: e.g. lambdas=c(0.1,1,10,100) will use exactly these four lambda values.

c_values: the classical cost term in front of the empirical error term, e.g. c_values=c(0.1,1,10,100) will use these four cost values (basically the inverse of lambdas).
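For instance, a sketch that trains on an explicit 4x4 grid of the values mentioned above:

sml <- liquidData('reg-1d')
model <- lsSVM(Y ~ ., sml$train,
               gammas  = c(0.1, 1, 10, 100),
               lambdas = c(0.001, 0.01, 0.1, 1))
errors(test(model, sml$test))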
Note that the min and max values are scaled according to the number of samples, the dimensionality of the data set, the number of folds used, and the estimated diameter of the data set.

Using grid_choice allows for some general choices of these parameters:
| grid_choice | 0 | 1 | 2 |
|---|---|---|---|
| gamma_steps | 10 | 15 | 20 |
| lambda_steps | 10 | 15 | 20 |
| min_gamma | 0.2 | 0.1 | 0.05 |
| max_gamma | 5.0 | 10.0 | 20.0 |
| min_lambda | 0.001 | 0.0001 | 0.00001 |
| max_lambda | 0.01 | 0.01 | 0.01 |
Using negative values of grid_choice we create a grid with the listed gamma and lambda values:

| grid_choice | -1 |
|---|---|
| gammas | c(10.0, 5.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05) |
| lambdas | c(1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001) |

| grid_choice | -2 |
|---|---|
| gammas | c(10.0, 5.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05) |
| c_values | c(0.01, 0.1, 1, 10, 100, 1000, 10000) |
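As a sketch, such a predefined grid is selected simply by passing the corresponding grid_choice value:

sml <- liquidData('reg-1d')
## use the predefined gammas/c_values grid shown above
model <- lsSVM(Y ~ ., sml$train, grid_choice = -2)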
An adaptive grid search can be activated. The higher the values of MAX_LAMBDA_INCREASES and MAX_NUMBER_OF_WORSE_GAMMAS are set, the more conservative the search strategy is. The values can be freely modified.
| ADAPTIVITY_CONTROL | 1 | 2 |
|---|---|---|
| MAX_LAMBDA_INCREASES | 4 | 3 |
| MAX_NUMBER_OF_WORSE_GAMMAS | 4 | 3 |
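A sketch combining a larger grid with the adaptive search (the values only illustrate the interface):

sml <- liquidData('reg-1d')
## grid_choice = 2 selects the 20x20 grid, adaptivity_control = 2 the more aggressive heuristic
model <- lsSVM(Y ~ ., sml$train, grid_choice = 2, adaptivity_control = 2)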
A major issue with SVMs is that for larger sample sizes the kernel matrix does not fit into memory any more. Classically this gives an upper limit on the class of problems that traditional SVMs can handle without a significant runtime increase. Furthermore, the time complexity is at least \(O(n^2)\).

liquidSVM implements two major concepts to circumvent these issues. One is random chunks, which is well known in the literature. However, we prefer the new alternative of splitting the space into spatial cells and using local SVMs on every cell.
If you specify useCells=TRUE then the sample space \(X\) gets partitioned into a number of cells. The training is done first for cell 1, then for cell 2, and so on. Now, to predict the label for a value \(x\in X\), liquidSVM first finds out to which cell this \(x\) belongs and then uses the SVM of that cell to predict a label for it.

If you run into memory issues, turn cells on: useCells=TRUE
This is quite performant, since the complexity in both time and memory is \(O(\mbox{CELLSIZE} \times n)\), and this holds for both training and testing! It also can be shown that the quality of the solution is comparable, at least for moderate dimensions.
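A minimal sketch of turning cells on (on the small bundled 'reg-1d' data this mainly demonstrates the call; the benefit shows on large data sets):

sml <- liquidData('reg-1d')
model <- lsSVM(Y ~ ., sml$train, useCells = TRUE)
errors(test(model, sml$test))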
The cells can be configured using the partition_choice:
1) This gives a partition into random chunks of size 2000

`VORONOI=c(1, 2000)`

2) This gives a partition into 10 random chunks

`VORONOI=c(2, 10)`

3) This gives a Voronoi partition into cells with radius not larger than 1.0. For its creation a subsample containing at most 50,000 samples is used.

`VORONOI=c(3, 1.0, 50000)`

4) This gives a Voronoi partition into cells with at most 2000 samples (approximately). For its creation a subsample containing at most 50,000 samples is used. A shrinking heuristic is used to reduce the number of cells.

`VORONOI=c(4, 2000, 1, 50000)`

5) This gives overlapping regions with at most 2000 samples (approximately). For its creation a subsample containing at most 50,000 samples is used. A stopping heuristic is used to stop the creation of regions if 0.5 * 2000 samples have not been assigned to a region yet.

`VORONOI=c(5, 2000, 0.5, 50000, 1)`

6) This splits the working sets into Voronoi cells as with PARTITION_TYPE=4. Unlike that case, the centers for the Voronoi partition are found by a recursive tree approach, which in many cases may be faster.

`VORONOI=c(6, 2000, 1, 50000, 2.0, 20, 4)`
The first parameter values correspond to NO_PARTITION, RANDOM_CHUNK_BY_SIZE, RANDOM_CHUNK_BY_NUMBER, VORONOI_BY_RADIUS, VORONOI_BY_SIZE, and OVERLAP_BY_SIZE.
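A sketch of requesting a specific partition explicitly; whether useCells=TRUE has to be given in addition to VORONOI is an assumption here:

sml <- liquidData('reg-1d')
## scheme 4 above: Voronoi cells with at most 2000 samples each
model <- lsSVM(Y ~ ., sml$train, useCells = TRUE, VORONOI = c(4, 2000, 1, 50000))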
qt, ex: Here the number of considered tau-quantiles/expectiles as well as the considered tau-values are defined. You can freely change these values, but notice that the list of tau-values is space-separated!

npl, roc: Here you define which weighted classification problems will be considered. The choice is usually a bit tricky. Good luck …
NPL:
WEIGHT_STEPS=10
MIN_WEIGHT=0.001
MAX_WEIGHT=0.5
GEO_WEIGHTS=1
ROC:
WEIGHT_STEPS=9
MAX_WEIGHT=0.9
MIN_WEIGHT=0.1
GEO_WEIGHTS=0
By specifying groupIds when initializing an SVM, samples obtain group ids. By default this also sets FOLDS_KIND to GROUPED. If the latter is the case, then samples with the same group id will be put into the same fold at cross-validation. This is important if, e.g., there are several patients with several measurements each.
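A sketch of passing group ids; the exact form in which the constructor accepts groupIds (one id per training sample) is an assumption here, so check the package documentation:

sml <- liquidData('reg-1d')
## hypothetical grouping: cycle the samples through 100 groups ("patients")
ids <- rep(1:100, length.out = nrow(sml$train))
model <- lsSVM(Y ~ ., sml$train, groupIds = ids)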
The following parameters should only be employed by experienced users and are self-explanatory for them:
KERNEL: specifies the kernel to use; at the moment either GAUSS_RBF or POISSON.
RETRAIN_METHOD: After training on grids and folds there are only solutions on folds. In order to construct a global solution one can either retrain on the whole training data (SELECT_ON_ENTIRE_TRAIN_SET) or keep the (partial) solutions from training and combine them using voting (SELECT_ON_EACH_FOLD, the default).
store_solutions_internally: If this is true (default in all applicable cases) then the solutions of the train phase are stored and can simply be reused in the select phase. If you slowly run out of memory during the train phase, maybe disable this. However, then in the select phase the best models have to be trained again.
For completeness, here are some values that usually get set by the learning scenario:

SVM_TYPE: KERNEL_RULE, SVM_LS_2D, SVM_HINGE_2D, SVM_QUANTILE, SVM_EXPECTILE_2D, SVM_TEMPLATE

LOSS_TYPE: CLASSIFICATION_LOSS, MULTI_CLASS_LOSS, LEAST_SQUARES_LOSS, WEIGHTED_LEAST_SQUARES_LOSS, PINBALL_LOSS, TEMPLATE_LOSS

VOTE_SCENARIO: VOTE_CLASSIFICATION, VOTE_REGRESSION, VOTE_NPL

KERNEL_MEMORY_MODEL: LINE_BY_LINE, BLOCK, CACHE, EMPTY

FOLDS_KIND: BLOCKS, ALTERNATING, RANDOM, STRATIFIED, GROUPED, RANDOM_SUBSET

WS_TYPE: FULL_SET, MULTI_CLASS_ALL_VS_ALL, MULTI_CLASS_ONE_VS_ALL, BOOT_STRAP
Ctrl-C / interrupt is tricky: it works most of the time, but it can fail. If you get weird results or errors, save your models and restart the R session.

CUDA has been tested neither on Windows nor on macOS.

32-bit has been seen to work but is not supported.
liquidSVM does its own threading, hence do not parallelize on top of that unless you know what you are doing. Just give the parameter threads=n or let the default use all of your physical cores.

If you really want to do it yourself, you have to serialize the solutions. Furthermore, you have to be careful to assign disjoint cores, otherwise the workers will fight for the same cores:
library(parallel)
## how big should the cluster be
workers <- 2
cl <- makeCluster(workers)
## how many threads should each worker use
threads <- 2
sml <- liquidData('reg-1d')
clusterExport(cl, c("sml","threads","workers"))
obj <- parLapply(cl, 1:workers, function(i) {
library(liquidSVM)
## to make it interesting use disjoint parts of sml$train
data <- sml$train[ seq(i,nrow(sml$train),workers) , ]
## the second argument to threads sets the offset of cores
model <- lsSVM(Y~., data, threads=c(threads,threads*(i-1)) )
## finally return the serialized solution
serialize.liquidSVM(model)
})
for(i in 1:workers){
## get the solution in the master session
model <- unserialize.liquidSVM(obj[[i]])
print(errors(test(model,sml$test)))
}
#> val_error
#> 0.00542
#> val_error
#> 0.00583