In general, dann struggles when unrelated variables are intermingled with related variables. To deal with this, sub_dann projects the data onto a lower-dimensional subspace and then calls dann. The number of features in the subspace is controlled by the numDim argument. This projection lets sub_dann mitigate the influence of noise variables. See section 3 of Discriminant Adaptive Nearest Neighbor Classification for details. Section 4 of the same paper compares dann and sub_dann to a number of other approaches.
In the example below, there are 2 related variables and 5 unrelated ones. Let's see how dann, sub_dann, and dann with only the correct features perform.
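To make the idea of projecting onto a subspace concrete, here is a minimal base R sketch. It is not the package's implementation: it uses a single global between-class scatter matrix as a simplified stand-in, and the simulated data and every name in it (`centers`, `B`, `num_dim`, `x_sub`) are assumptions made for illustration.

```r
# Illustrative sketch only: a global between-class matrix, not sub_dann itself.
set.seed(42)
n <- 100                                        # observations per class
centers <- rbind(c(3, 0), c(-3, 0), c(0, 3))    # class means in 2 informative dims
y <- rep(1:3, each = n)

informative <- centers[y, ] + matrix(rnorm(3 * n * 2), ncol = 2)
noise <- matrix(rnorm(3 * n * 3), ncol = 3)     # 3 variables unrelated to y
x <- cbind(informative, noise)

# Between-class scatter: weighted outer products of (class mean - overall mean).
overall_mean <- colMeans(x)
B <- matrix(0, ncol(x), ncol(x))
for (k in unique(y)) {
  d <- colMeans(x[y == k, , drop = FALSE]) - overall_mean
  B <- B + mean(y == k) * outer(d, d)
}

# Keep the eigenvectors with the largest eigenvalues and project onto them.
eig <- eigen(B, symmetric = TRUE)
num_dim <- 2
V <- eig$vectors[, seq_len(num_dim), drop = FALSE]
x_sub <- x %*% V

round(eig$values, 3)   # only the first two eigenvalues are far from zero
dim(x_sub)             # 300 rows, 2 columns
```

A nearest neighbor classifier fit on `x_sub` would then measure distances mostly along the informative directions, which is the intuition behind projecting before classifying.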
library(dann)
library(mlbench)
library(magrittr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
######################
# Circle data with unrelated variables
######################
set.seed(1)
train <- mlbench.circle(500, 2) %>%
  tibble::as_tibble()
colnames(train)[1:3] <- c("X1", "X2", "Y")
# Add 5 unrelated variables
train <- train %>%
  mutate(
    U1 = runif(500, -1, 1),
    U2 = runif(500, -1, 1),
    U3 = runif(500, -1, 1),
    U4 = runif(500, -1, 1),
    U5 = runif(500, -1, 1)
  )
xTrain <- train %>%
  select(X1, X2, U1, U2, U3, U4, U5) %>%
  as.matrix()
yTrain <- train %>%
  pull(Y) %>%
  as.numeric() %>%
  as.vector()
test <- mlbench.circle(500, 2) %>%
  tibble::as_tibble()
colnames(test)[1:3] <- c("X1", "X2", "Y")
# Add 5 unrelated variables
test <- test %>%
  mutate(
    U1 = runif(500, -1, 1),
    U2 = runif(500, -1, 1),
    U3 = runif(500, -1, 1),
    U4 = runif(500, -1, 1),
    U5 = runif(500, -1, 1)
  )
xTest <- test %>%
  select(X1, X2, U1, U2, U3, U4, U5) %>%
  as.matrix()
yTest <- test %>%
  pull(Y) %>%
  as.numeric() %>%
  as.vector()
dannPreds <- dann(xTrain = xTrain, yTrain = yTrain, xTest = xTest,
                  k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE)
mean(dannPreds == yTest) # Not a good model
## [1] 0.668
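The accuracy above reflects a general weakness of distance-based classifiers: noise variables contribute to every distance and can drown out the informative ones. The base R sketch below illustrates this with plain 1-nearest-neighbor and leave-one-out evaluation. It does not use dann, the simulated data only loosely resembles the circle problem, and the helper `loo_1nn_accuracy` is a hypothetical name introduced here.

```r
# Illustrative sketch: plain 1-NN, not dann. All names are made up for this example.
set.seed(1)
n <- 300
x_rel <- matrix(runif(n * 2, -1, 1), ncol = 2)     # 2 related variables
y <- as.integer(rowSums(x_rel^2) < 0.5)            # class 1 inside a circle
x_noise <- matrix(runif(n * 5, -1, 1), ncol = 5)   # 5 unrelated variables

# Leave-one-out accuracy of 1-nearest-neighbor under Euclidean distance.
loo_1nn_accuracy <- function(x, y) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                    # a point may not be its own neighbor
  nn <- apply(d, 1, which.min)      # index of each point's nearest neighbor
  mean(y[nn] == y)
}

acc_clean <- loo_1nn_accuracy(x_rel, y)
acc_noisy <- loo_1nn_accuracy(cbind(x_rel, x_noise), y)
c(related_only = acc_clean, with_noise = acc_noisy)
```

The second accuracy should come out lower than the first, because five of the seven coordinates entering each distance carry no information about the class.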
As expected, dann performed poorly. For sub_dann, the dimension of the subspace should be chosen based on the number of large eigenvalues. The graph produced by graph_eigenvalues suggests 2 dimensions, which matches the true number of related variables.
graph_eigenvalues(xTrain = xTrain, yTrain = yTrain,
                  neighborhood_size = 50, weighted = FALSE, sphere = "mcd")
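One simple way to turn "the number of large eigenvalues" into a rule is to look for the largest drop between consecutive eigenvalues. The sketch below uses made-up eigenvalues, not values returned by graph_eigenvalues, and the largest-gap heuristic is an assumption for illustration rather than something the package prescribes.

```r
# Hypothetical eigenvalues, sorted in decreasing order (made-up numbers).
eigenvalues <- c(4.1, 3.6, 0.4, 0.3, 0.2, 0.15, 0.1)

# Size of each drop from one eigenvalue to the next.
drops <- -diff(eigenvalues)

# Number of eigenvalues that sit before the largest drop.
num_dim <- which.max(drops)
num_dim
## [1] 2
```

In practice the graph should still be inspected by eye, since a largest-gap rule can mislead when the eigenvalues decay gradually.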
subDannPreds <- sub_dann(xTrain = xTrain, yTrain = yTrain, xTest = xTest,
                         k = 3, neighborhood_size = 50, epsilon = 1,
                         probability = FALSE,
                         weighted = FALSE, sphere = "mcd", numDim = 2)
mean(subDannPreds == yTest) # sub_dann does much better when unrelated variables are present.
## [1] 0.882
sub_dann did much better than dann. Let's see how dann does if only the related variables are used.
variableSelectionDann <- dann(xTrain = xTrain[, 1:2], yTrain = yTrain,
                              xTest = xTest[, 1:2],
                              k = 3, neighborhood_size = 50, epsilon = 1,
                              probability = FALSE)
mean(variableSelectionDann == yTest) # Best model found when only true predictors are used.
## [1] 0.944
Overall, dann with the correct variables did better than sub_dann (0.944 versus 0.882 accuracy). In simulations, one can simply pick the correct variables; in real applications, the correct variables are usually unknown. Here the eigenvalue graph pointed to the correct number of dimensions, and sub_dann came reasonably close to dann fit on only the related variables, without needing to know which variables are truly predictive.
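To put "reasonably close" into numbers, the arithmetic below uses only the three accuracies reported above. The "share of the gap closed" measure and the names `acc` and `gap_closed` are introduced here for illustration and are not part of the package.

```r
# Accuracies copied from the output above.
acc <- c(dann_all = 0.668, sub_dann = 0.882, dann_related_only = 0.944)

# Share of the gap between the worst and best model that sub_dann closes.
gap_closed <- (acc[["sub_dann"]] - acc[["dann_all"]]) /
  (acc[["dann_related_only"]] - acc[["dann_all"]])
round(gap_closed, 3)
## [1] 0.775
```

Under this framing, sub_dann closes roughly three quarters of the gap on this one simulated data set; the figure would change with a different seed or data set.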