The V2.7 release of the DataRobot API added three model insights: lift charts, ROC curves, and word clouds.
Lift charts and ROC curves help assess the performance of machine learning models. Word clouds help identify the useful words and phrases produced by applying different NLP techniques to unstructured data. We will explore each of these in detail.
To access the DataRobot modeling engine, you must establish an authenticated connection, which can be done in one of two ways. Both require two pieces of information: an endpoint, the URL of the specific DataRobot server being used, and a token, a previously validated access token.
The token is unique to each DataRobot modeling engine account and can be found in the account profile section of the DataRobot webapp.
The endpoint depends on which DataRobot modeling engine installation (cloud-based or on-premise) you are using. If you do not know which endpoint to use, contact your DataRobot admin. The endpoint for DataRobot cloud accounts is https://app.datarobot.com/api/v2.
The first access method uses a YAML configuration file containing these two elements, labeled token and endpoint, located at $HOME/.config/datarobot/drconfig.yaml. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is established automatically during library(datarobot). It is also possible to establish a connection using this YAML file via the ConnectToDataRobot() function, by specifying the configPath parameter.
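As an illustration, the sketch below writes a two-line configuration file of this form to a temporary directory and prints it back. The token value is a placeholder, and the temporary path is an assumption made so that the example does not touch a real $HOME/.config/datarobot/drconfig.yaml. The commented-out call at the end shows how such a file might then be passed to ConnectToDataRobot() through configPath.

```r
# Write a minimal drconfig.yaml to a temporary location.
# The token below is a placeholder, not a real credential.
configPath <- file.path(tempdir(), "drconfig.yaml")
writeLines(c("endpoint: https://app.datarobot.com/api/v2",
             "token: YOUR-API-TOKEN-HERE"),
           configPath)

# Show what was written.
cat(readLines(configPath), sep = "\n")

# With a real token in the file, a connection could then be opened with:
# library(datarobot)
# ConnectToDataRobot(configPath = configPath)
```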
The second method of establishing a connection to the DataRobot modeling engine is to call the function ConnectToDataRobot with the endpoint and token parameters.
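A minimal sketch of this second method follows. It assumes the credentials are stored in two environment variables, MY_DATAROBOT_ENDPOINT and MY_DATAROBOT_TOKEN; these names are hypothetical and chosen only for this example. The connection is attempted only when a token is present and the datarobot package is installed.

```r
# Hypothetical environment variable names, used only in this sketch.
endpoint <- Sys.getenv("MY_DATAROBOT_ENDPOINT",
                       unset = "https://app.datarobot.com/api/v2")
token <- Sys.getenv("MY_DATAROBOT_TOKEN", unset = "")

if (nzchar(token) && requireNamespace("datarobot", quietly = TRUE)) {
  # Pass the endpoint and token explicitly instead of relying on a YAML file.
  datarobot::ConnectToDataRobot(endpoint = endpoint, token = token)
  connected <- TRUE
} else {
  message("No token found (or datarobot is not installed); skipping connection.")
  connected <- FALSE
}
```

Reading the token from the environment keeps it out of scripts that may be shared or committed to version control.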
We will be using the Lending Club dataset, a sample dataset related to credit scoring open-sourced by LendingClub (https://www.lendingclub.com/). We can create a project with this dataset like this:
lendingClubURL <- "https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv"
project <- StartProject(dataSource = lendingClubURL,
                        projectName = "AdvancedModelInsightsVignette",
                        mode = "auto",
                        target = "is_bad",
                        workerCount = "max",
                        wait = TRUE)
Once the modeling process has completed, the ListModels function returns an S3 object of class listOfModels that characterizes all of the models in a specified DataRobot project. It is important to call WaitForAutopilot before calling ListModels, because ListModels returns only a partial list (and a warning) if Autopilot has not yet completed.
results <- as.data.frame(ListModels(project))
saveRDS(results, "resultsModelInsights.rds")
library(knitr)
kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE)
| | modelType | expandedModel | modelId | blueprintId | featurelistName | featurelistId | samplePct | validationMetric |
---|---|---|---|---|---|---|---|---|
1 | Gradient Boosted Trees Classifier with Early Stopping | Gradient Boosted Trees Classifier with Early Stopping::Ordinal encoding of categorical variables::Converter for Text Mining::Auto-Tuned Word N-Gram Text Modeler using token occurrences::Missing Values Imputed | 5efa1dcfe157256402b66684 | 76406c9c52dc3f6a3a0ba8442fa17601 | Informative Features | 5efa1bd3f0f49455b0ccd765 | 64 | 0.36472 |
2 | eXtreme Gradient Boosted Trees Classifier with Early Stopping | eXtreme Gradient Boosted Trees Classifier with Early Stopping::Ordinal encoding of categorical variables::Converter for Text Mining::Auto-Tuned Word N-Gram Text Modeler using token occurrences::Missing Values Imputed | 5efa1dd0e157256402b66694 | 5964b39390e51b69a82d9a8dab7b2675 | Informative Features | 5efa1bd3f0f49455b0ccd765 | 64 | 0.36562 |
3 | ENET Blender | ENET Blender::Elastic-Net Classifier (L2 / Binomial Deviance) | 5efa23c020433938c72a0153 | 81092c05cb849904f6b737b767799660 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36564 |
4 | AVG Blender | AVG Blender::Average Blender | 5efa23be20433938c72a014f | c294bee1a436f6f034fd680aa752b9d5 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36566 |
5 | ENET Blender | ENET Blender::Elastic-Net Classifier (L2 / Binomial Deviance) | 5efa23c020433938c72a0155 | 83d1a0ca93741bd8ef06bfc47c75ac33 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36567 |
6 | Advanced AVG Blender | Advanced AVG Blender::Average Blender | 5efa23c020433938c72a0151 | c40db7cd1b9d3ee12d17c0369639cb3a | Multiple featurelists | Multiple featurelist ids | 64 | 0.36639 |
Lift chart data can be retrieved for a specific data partition (validation, cross-validation, or holdout) using GetLiftChart, or for all data partitions using ListLiftCharts. The holdout partition must be unlocked before its data can be retrieved.
Let's retrieve the validation partition data for the top model using GetLiftChart. By default, GetLiftChart returns data for the validation partition. To retrieve data for a different partition, pass a value to its source parameter.
project <- GetProject("5eed0d790ef80408ae212f09")
allModels <- ListModels(project)
saveRDS(allModels, "modelsModelInsights.rds")
modelFrame <- as.data.frame(allModels)
metric <- modelFrame$validationMetric
if (project$metric %in% c('AUC', 'Gini Norm')) {
  bestIndex <- which.max(metric)
} else {
  bestIndex <- which.min(metric)
}
bestModel <- allModels[[bestIndex]]
bestModel$modelType
[1] "Gradient Boosted Greedy Trees Classifier with Early Stopping"
This selects a Gradient Boosted Greedy Trees Classifier with Early Stopping model.
The lift chart data we retrieve from the server includes the mean of the model prediction and the mean of the actual target values, sorted by the prediction values in ascending order and split into up to 60 bins.
ValidationLiftChart <- GetLiftChart(bestModel, source = "validation")
head(ValidationLiftChart)
| | actual | predicted | binWeight |
---|---|---|---|
1 | 0.00000000 | 0.01877918 | 27 |
2 | 0.03703704 | 0.02476968 | 27 |
3 | 0.00000000 | 0.02867826 | 26 |
4 | 0.00000000 | 0.03207965 | 27 |
5 | 0.07407407 | 0.03540244 | 27 |
6 | 0.03846154 | 0.03865136 | 26 |
dr_dark_blue <- "#08233F"
dr_blue <- "#1F77B4"
dr_orange <- "#FF7F0E"
# Function to aggregate lift chart data into fewer bins for plotting
library(data.table)
LiftChartPlot <- function(ValidationLiftChart, bins = 10) {
  # This helper assumes the lift chart has 60 rows, so `bins` must divide 60
  if (60 %% bins != 0) {
    stop("`bins` must be a divisor of 60")
  }
  ValidationLiftChart$bins <- rep(seq(bins), each = 60 / bins)
  ValidationLiftChart <- data.table(ValidationLiftChart)
  # Average the actual and predicted values within each aggregated bin
  ValidationLiftChart[, actual := mean(actual), by = bins]
  ValidationLiftChart[, predicted := mean(predicted), by = bins]
  # Keep one row per aggregated bin
  unique(ValidationLiftChart[, -"binWeight"])
}
LiftChartData <- LiftChartPlot(ValidationLiftChart)
saveRDS(LiftChartData, "LiftChartDataVal.rds")
par(bg = dr_dark_blue)
# Column names are lowercase in the data returned by LiftChartPlot
plot(LiftChartData$actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$predicted, col = dr_blue, pch = 20, type = "b")
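The helper above averages bins without regard to binWeight. As an alternative, the following sketch aggregates with a weighted mean in base R, so that bins holding more rows count for more. RebinLiftChart is a hypothetical helper name, and the input is synthetic data shaped like the output of GetLiftChart (columns actual, predicted, and binWeight), not values from the Lending Club project.

```r
# Synthetic stand-in for a lift chart: 60 bins with per-bin mean actual,
# per-bin mean predicted, and the weight (row count) of each bin.
set.seed(42)
liftChart <- data.frame(
  actual    = sort(runif(60, 0, 0.5)),
  predicted = seq(0.02, 0.45, length.out = 60),
  binWeight = rep(c(27, 27, 26), times = 20)
)

# Collapse the rows into `bins` groups, weighting each original bin by binWeight.
RebinLiftChart <- function(liftChart, bins = 10) {
  n <- nrow(liftChart)
  if (n %% bins != 0) {
    stop("`bins` must evenly divide the number of rows (", n, ")")
  }
  group <- rep(seq_len(bins), each = n / bins)
  pieces <- split(liftChart, group)
  do.call(rbind, lapply(pieces, function(piece) {
    data.frame(
      actual    = weighted.mean(piece$actual, piece$binWeight),
      predicted = weighted.mean(piece$predicted, piece$binWeight),
      binWeight = sum(piece$binWeight)
    )
  }))
}

rebinned <- RebinLiftChart(liftChart, bins = 10)
print(rebinned)
```

Because each aggregated bin carries the summed weight, the overall weighted mean of actual and predicted is preserved by the aggregation.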
All the available lift chart data can be retrieved using ListLiftCharts. Here is an example that retrieves data for all the available partitions and then plots the cross-validation partition:
AllLiftChart <- ListLiftCharts(bestModel)
LiftChartData <- LiftChartPlot(AllLiftChart[["crossValidation"]])
saveRDS(LiftChartData, "LiftChartDataCV.rds")
par(bg = dr_dark_blue)
# Column names are lowercase in the data returned by LiftChartPlot
plot(LiftChartData$actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$predicted, col = dr_blue, pch = 20, type = "b")
We can also plot the lift chart using ggplot2:
library(ggplot2)
# Start from the validation lift chart data retrieved earlier
lc <- ValidationLiftChart
# actual and predicted are already per-bin means (see the lift chart table),
# so they are not divided by binWeight again
lc <- lc[order(lc$predicted), ]
lc$binWeight <- NULL
# Reshape to long format: one row per bin and series
lc <- data.frame(value = c(lc$actual, lc$predicted),
                 variable = c(rep("Actual", length(lc$actual)),
                              rep("Predicted", length(lc$predicted))),
                 id = rep(seq_along(lc$actual), 2))
ggplot(lc) + geom_line(aes(x = id, y = value, color = variable))
The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
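To make the definition concrete, the sketch below builds ROC points by hand from simulated labels and scores. Everything here is synthetic and independent of DataRobot: the data, the threshold grid, and the variable names are assumptions made for illustration. The output columns are named truePositiveRate and falsePositiveRate only to mirror the names used in the plots later in this section.

```r
# Simulate binary outcomes and a score that tends to be higher for positives.
set.seed(1)
actual <- rbinom(200, 1, 0.3)
score <- plogis(rnorm(200, mean = ifelse(actual == 1, 1, -1)))

# For each threshold, classify score >= threshold as positive and
# compute the true positive rate and false positive rate.
thresholds <- seq(0, 1, by = 0.05)
rocPoints <- do.call(rbind, lapply(thresholds, function(t) {
  predictedPositive <- score >= t
  data.frame(
    threshold         = t,
    truePositiveRate  = sum(predictedPositive & actual == 1) / sum(actual == 1),
    falsePositiveRate = sum(predictedPositive & actual == 0) / sum(actual == 0)
  )
}))

print(head(rocPoints))
```

Raising the threshold can only turn predicted positives into predicted negatives, so both rates fall (or stay flat) as the threshold increases, tracing the curve from the top-right corner to the bottom-left.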
ROC curve data can be retrieved for a specific data partition (validation, cross-validation, or holdout) using GetRocCurve, or for all data partitions using ListRocCurves.
To retrieve ROC curve information, use GetRocCurve. You can then plot the results:
dr_dark_blue <- "#08233F"
dr_roc_green <- "#03c75f"
ValidationRocCurve <- GetRocCurve(bestModel)
ValidationRocPoints <- ValidationRocCurve[["rocPoints"]]
saveRDS(ValidationRocPoints, "ValidationRocPoints.rds")
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(ValidationRocPoints$falsePositiveRate, ValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")
All the available ROC curve data can be retrieved using ListRocCurves. Here again is an example that retrieves data for all the available partitions and then plots the cross-validation partition:
AllRocCurve <- ListRocCurves(bestModel)
CrossValidationRocPoints <- AllRocCurve[['crossValidation']][['rocPoints']]
saveRDS(CrossValidationRocPoints, 'CrossValidationRocPoints.rds')
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(CrossValidationRocPoints$falsePositiveRate, CrossValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")
You can also plot the ROC curve using ggplot2:
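A minimal ggplot2 sketch is shown here on a small hand-written data frame. The values are illustrative only, and the column names falsePositiveRate and truePositiveRate are taken from the base-graphics examples above. With a live connection, ValidationRocPoints could presumably be used in place of the synthetic rocPoints.

```r
library(ggplot2)

# Hand-written ROC points for illustration (not real model output).
rocPoints <- data.frame(
  falsePositiveRate = c(0, 0.05, 0.15, 0.35, 0.60, 1),
  truePositiveRate  = c(0, 0.40, 0.65, 0.85, 0.95, 1)
)

rocPlot <- ggplot(rocPoints, aes(x = falsePositiveRate, y = truePositiveRate)) +
  geom_line() +
  # Dashed diagonal: the expected curve of a random classifier
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "ROC Curve",
       x = "False Positive Rate (Fallout)",
       y = "True Positive Rate (Sensitivity)")

print(rocPlot)
```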
You can get the recommended threshold value with maximal F1 score. That is the same threshold that is preselected in DataRobot when you open the “ROC curve” tab.
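One way to find that threshold is to pick the ROC point with the highest F1 score. The sketch below assumes the ROC points carry columns named threshold and f1Score. Those column names are an assumption here, and the values are made up for illustration.

```r
# Illustrative ROC points; column names threshold and f1Score are assumed.
rocPoints <- data.frame(
  threshold = c(0.1, 0.2, 0.3, 0.4, 0.5),
  f1Score   = c(0.30, 0.41, 0.45, 0.38, 0.22)
)

# Select the row whose F1 score is largest and read off its threshold.
bestRow <- rocPoints[which.max(rocPoints$f1Score), ]
bestThreshold <- bestRow$threshold
print(bestThreshold)
```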
You can also estimate metrics for different threshold values. This will produce the same results as updating the threshold on the DataRobot “ROC curve” tab.
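Since ROC points are only available at a finite set of thresholds, a simple way to estimate metrics at an arbitrary threshold is to look up the nearest available point. The sketch below does this on synthetic values. MetricsAtThreshold is a hypothetical helper, and the threshold column name is an assumption.

```r
# Illustrative ROC points; values are synthetic.
rocPoints <- data.frame(
  threshold         = c(0.1, 0.3, 0.5, 0.7, 0.9),
  truePositiveRate  = c(0.98, 0.85, 0.60, 0.30, 0.05),
  falsePositiveRate = c(0.80, 0.40, 0.15, 0.04, 0.00)
)

# Return the row whose threshold is closest to the requested value.
MetricsAtThreshold <- function(rocPoints, threshold) {
  nearest <- which.min(abs(rocPoints$threshold - threshold))
  rocPoints[nearest, ]
}

picked <- MetricsAtThreshold(rocPoints, 0.45)
print(picked)
```

Nearest-point lookup is a design choice for this sketch. Interpolating between neighboring points would be another reasonable option.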
The word cloud is a type of insight available for some text-processing models for datasets containing text columns. You can get information about how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value.
This example will show you how to obtain word cloud data and visualize it, similar to how DataRobot visualizes the word cloud in the “Model Insights” tab interface.
The visualization example here uses the modelwordcloud package.
Now let’s find our word cloud:
# Find word-based models by looking for "word" modelType
wordModels <- allModels[grep("Word", lapply(allModels, `[[`, "modelType"))]
wordModel <- wordModels[[1]]
# Get word cloud
wordCloud <- GetWordCloud(project, wordModel$modelId)
saveRDS(wordCloud, "wordCloudModelInsights.rds")
Now we plot it!
library(modelwordcloud)
# Remove stop words
wordCloud <- wordCloud[!wordCloud$isStopword, ]
# Specify colors similar to what DataRobot produces for
# a wordcloud in Insights
colors <- readRDS("colors.rds")
# Make word cloud
suppressWarnings(
  wordcloud(words = wordCloud$ngram,
            freq = wordCloud$frequency,
            coefficients = wordCloud$coefficient,
            colors = colors,
            scale = c(3, 0.3))
)
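The same data can also be inspected as a table. The sketch below uses a small synthetic data frame with the columns used above (ngram, coefficient, frequency, and isStopword). It drops stop words and ranks the remaining ngrams by the absolute size of their coefficient. The rows are invented for illustration and do not come from the Lending Club project.

```r
# Synthetic word cloud data with the columns used in the plotting example.
wordCloud <- data.frame(
  ngram       = c("the", "debt", "credit card", "consolidation", "and", "wedding"),
  coefficient = c(0.01, 0.62, -0.35, 0.48, -0.02, -0.71),
  frequency   = c(1.00, 0.40, 0.55, 0.30, 0.90, 0.05),
  isStopword  = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Remove stop words, then sort by the magnitude of the coefficient.
wordCloud <- wordCloud[!wordCloud$isStopword, ]
ranked <- wordCloud[order(-abs(wordCloud$coefficient)), ]
print(ranked)
```

A positive coefficient indicates an ngram associated with higher predicted target values, and a negative coefficient indicates the opposite.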