Example: Circle Data With Random Variables

In the below example there are 2 related variables and 5 that are unrelated. Lets see how dann, sub_dann, and dann with only the correct features perform.

 library(dann)
 library(mlbench)
 library(magrittr)
 library(dplyr, warn.conflicts = FALSE)
 library(ggplot2)

 ######################
 # Circle data with unrelated variables
 ######################
 set.seed(1)
 train <- mlbench.circle(500, 2) %>%
   tibble::as_tibble()
 colnames(train)[1:3] <- c("X1", "X2", "Y")

 # Add 5 unrelated variables
 train <- train %>%
   mutate(
     U1 = runif(500, -1, 1),
     U2 = runif(500, -1, 1),
     U3 = runif(500, -1, 1),
     U4 = runif(500, -1, 1),
     U5 = runif(500, -1, 1)
   )

 xTrain <- train %>%
   select(X1, X2, U1, U2, U3, U4, U5) %>%
   as.matrix()

 yTrain <- train %>%
   pull(Y) %>%
   as.numeric() %>%
   as.vector()

 test <- mlbench.circle(500, 2) %>%
   tibble::as_tibble()
 colnames(test)[1:3] <- c("X1", "X2", "Y")

 # Add 5 unrelated variables
 test <- test %>%
   mutate(
     U1 = runif(500, -1, 1),
     U2 = runif(500, -1, 1),
     U3 = runif(500, -1, 1),
     U4 = runif(500, -1, 1),
     U5 = runif(500, -1, 1)
   )

 xTest <- test %>%
   select(X1, X2, U1, U2, U3, U4, U5) %>%
   as.matrix()

 yTest <- test %>%
   pull(Y) %>%
   as.numeric() %>%
   as.vector()
 
 dannPreds <- dann(xTrain = xTrain, yTrain = yTrain, xTest = xTest, 
                   k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE)
 mean(dannPreds == yTest) # Not a good model

## [1] 0.668

As expected, dann was not performant. Moving on to sub_dann, the dimension of the subspace should be chosen based on the number of large eigenvalues. The graph suggests 2 is good (the correct answer).

 graph_eigenvalues(xTrain = xTrain, yTrain = yTrain, 
                   neighborhood_size = 50, weighted = FALSE, sphere = "mcd")

 subDannPreds <- sub_dann(xTrain = xTrain, yTrain = yTrain, xTest = xTest, 
                          k = 3, neighborhood_size = 50, epsilon = 1, 
                          probability = FALSE, 
                          weighted = FALSE, sphere = "mcd", numDim = 2)
 mean(subDannPreds == yTest) # sub_dan does much better when unrelated variables are present.

## [1] 0.882

sub_dann did much better than dann. Lets see how dann does if only related variables are used.

 variableSelectionDann <- dann(xTrain = xTrain[, 1:2], yTrain = yTrain, xTest = xTest[, 1:2],
                               k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE)
 
 mean(variableSelectionDann == yTest) # Best model found when only true predictors are used.

## [1] 0.944

Overall, dann with the correct variables did better than sub_dann. In simulations one can simply pick the correct variables. In real applications the correct variables are usually unknown. sub_dann was able to estimate the correct number of features and get reasonably close to dann that only used related variables without having to know which variables are truly predictive.

 rm(train, test)
 rm(xTrain, yTrain)
 rm(xTest, yTest)
 rm(dannPreds, subDannPreds, variableSelectionDann)

SUB_DANN

Introduction

Arguments

Example: Circle Data With Random Variables