MetaClean is a package for building classifiers to identify low-quality integrations in untargeted metabolomics data. It uses a combination of 12 peak quality metrics and 9 candidate machine learning algorithms to build predictive models from user-provided chromatographic data and associated labels. Once a predictive model has been built, it can be used to assign predicted labels and class probabilities to untargeted metabolomics datasets. The package is designed for use with the preprocessing package XCMS and can be easily integrated into existing untargeted metabolomics pipelines.
MetaClean has two main use cases: (1) Training a Classifier Using User-Provided Data and (2) Using Existing Models to Make Predictions. This tutorial will walk the user through the steps for each.
!!IMPORTANT!! While any version of XCMS's peak-picking, retention time correction, and grouping functions may be used, this package requires the user to provide two objects produced by the getEIC() and fillPeaks() functions. These functions require objects of the "xcmsSet" class, which has since been replaced by the "XCMSnExp" class. If using the newest functions provided by XCMS, please convert the "XCMSnExp" object to an "xcmsSet" object using as(XCMSnExp_object, "xcmsSet").
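A minimal sketch of this conversion, assuming an "XCMSnExp" object named XCMSnExp_object produced by the newer XCMS workflow:
## EXAMPLE: CONVERT AN "XCMSnExp" OBJECT TO AN "xcmsSet" OBJECT
# # XCMSnExp_object is a hypothetical result of the newer XCMS functions
# # (e.g., findChromPeaks() followed by groupChromPeaks())
# xset <- as(XCMSnExp_object, "xcmsSet")
# class(xset) # should print "xcmsSet"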
This section explains how to build a classifier using data provided by the user. It is recommended that users create classifiers specific to the mode, matrix, and instrumentation of the dataset for those methodologies that are frequently utilized by the lab. For example, if a lab often runs plasma samples in Reverse Phase Negative Mode using the same column and the same instrument, a classifier can be built once using a dataset generated with this method, then saved and reused to make predictions for every additional run that uses this methodology.
Before the classifier can be trained, the user must invest some time in visually assessing and labeling at least two datasets: a development dataset to be used for training and at least one test dataset to be used for evaluating the performance of the classifiers and selecting the best model. It is recommended that the user perform the following steps to prepare the development and test datasets:
The following sections provide detailed explanations of the steps required to train a classifier using user-provided data:
NOTE: The * denotes sections that require additional example data not included in the MetaClean package. Users can either provide their own fill and xcmsEIC objects or download and install the data package MetaCleanData using the following code:
## UNCOMMENT THIS SECTION IF YOU WISH TO USE THE MetaCleanData DATA PACKAGE
# # install devtools if not already installed
# install.packages("devtools")
# install the data package MetaCleanData from github
# devtools::install_github("KelseyChetnik/MetaCleanData")
# load MetaCleanData library
# library(MetaCleanData)
MetaCleanData provides fill and xcmsEIC objects for example development and test sets.
To train a new classifier using MetaClean, the user must provide three data files for each dataset: an xcmsEIC object returned by getEIC(), an xcmsSet object returned by fillPeaks(), and a data frame of class labels for each EIC.
NOTE: The same group object must be used to produce both the xcmsEIC and fill objects.
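As a rough sketch, both objects might be produced from the same grouped xcmsSet as follows (xset_grouped is a hypothetical grouped, retention-time-corrected xcmsSet; the getEIC() arguments may need adjusting for your data):
## EXAMPLE: GENERATE THE fill AND xcmsEIC OBJECTS FROM ONE GROUP OBJECT
# # fill in missing peaks using the grouped object
# fill <- fillPeaks(xset_grouped)
# # extract an EIC for every peak group from the SAME grouped object
# xs <- getEIC(xset_grouped, groupidx = 1:nrow(groups(xset_grouped)), rt = "corrected")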
The following are examples of the required xcmsEIC and fill objects:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData
# # load the example xcms
# data("eicLabels_development")
# data("eicLabels_test")
# data("fill_development")
# data("fill_test")
# data("xs_development")
# data("xs_test")
These examples will be utilized throughout the remainder of the tutorial.
The function getEvalObj is called to extract the relevant data from the objects provided by the user and store it in an object of class evalObj. This function takes the following arguments: xs, the xcmsEIC object returned by getEIC(), and fill, the xcmsSet object returned by fillPeaks().
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData
# call getEvalObj on development data
# eicEval_development <- getEvalObj(xs = xs_development, fill = fill_development)
# call getEvalObj on test data
# eicEval_test <- getEvalObj(xs = xs_test, fill = fill_test)
The evalObj object has three slots:
The function getPeakQualityMetrics uses the evalObj objects to calculate each of the 12 peak quality metrics. These metrics are: Apex Max-Boundary Ratio, Elution Shift, FWHM2Base, Jaggedness, Modality, Retention-Time Consistency, Symmetry, Gaussian Similarity, Peak Significance Level, Sharpness, Triangle Peak Area Similarity Ratio (TPASR), and Zig-Zag Index. See our paper for a description of each metric.
This function takes the following arguments: eicEvalData, the evalObj object returned by getEvalObj(), and eicLabels_df, a data frame of class labels for each EIC.
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData
# # calculate peak quality metrics for development dataset
# # For 500 peaks and 89 samples, takes ~2.3 mins
# pqMetrics_development <- getPeakQualityMetrics(eicEvalData = eicEval_development, eicLabels_df = eicLabels_development)
#
# # calculate peak quality metrics for test dataset
# # For 500 peaks and 100 samples, takes ~2.6 mins
# pqMetrics_test <- getPeakQualityMetrics(eicEvalData = eicEval_test, eicLabels_df = eicLabels_test)
The getPeakQualityMetrics function returns an Mx14 matrix, where M is equal to the number of peaks. There are 14 columns in total: one for each of the twelve metrics, one for the EIC number, and one for the class label. This matrix serves as the input for training the classifiers and making predictions.
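As a quick sanity check, the dimensions and first rows of the table can be inspected (shown here for the development table):
## EXAMPLE: INSPECT THE PEAK QUALITY METRICS TABLE
# dim(pqMetrics_development) # expect M rows x 14 columns
# head(pqMetrics_development) # twelve metrics, EIC number, class label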
MetaClean provides 9 classification algorithms (implemented with the R package caret) for building a predictive model. These are: Decision Tree, Naive Bayes, Logistic Regression, RandomForest, SVM_Linear, SVM_Radial, AdaBoost, Neural Network, and Model-Averaged Neural Networks. The trainClassifiers function is a wrapper function that uses cross-validation to train a user-selected subset of the nine available algorithms (see the sketch after the next code block). It takes the following arguments:
## IF YOU HAVE INSTALLED MetaCleanData YOU CAN COMMENT OUT THIS CODE AND PROCEED WITH THE PEAK QUALITY METRIC TABLES GENERATED IN THE PREVIOUS SECTIONS
data("pqMetrics_development")
data("pqMetrics_test")
trainClassifiers returns a list of lists. The outer list has one entry for every model trained. The inner list has two entries: the trained model and the name of the model trained.
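For example, the entries for the first trained model could be accessed by position (a sketch based on the structure described above):
# models[[1]][[1]] # the trained model object
# models[[1]][[2]] # the name of the model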
Once the potential models have been trained, the next step is to evaluate the performance of each to determine which performs best and should be selected as the classifier. To do this, we first generate the seven available evaluation measures: PosClass.FScore, PosClass.Precision, PosClass.Recall, NegClass.FScore, NegClass.Precision, NegClass.Recall, and Accuracy. We use the getEvaluationMeasures function to do this. This function takes the following arguments: models, the list of trained models returned by trainClassifiers; k, the number of cross-validation folds; and repNum, the number of cross-validation repetitions.
# calculate all seven evaluation measures for each model and each round of cross-validation
#evalMeasuresDF <- getEvaluationMeasures(models=models, k=5, repNum=10)
getEvaluationMeasures returns a dataframe with the following columns: Model, RepNum, PosClass.FScore, PosClass.Recall, PosClass.Precision, NegClass.FScore, NegClass.Recall, NegClass.Precision, and Accuracy. Each row of the dataframe corresponds to the results of a particular model and a particular round of cross-validation.
The evaluation measures dataframe can be used to assess the performance of each of the algorithms. The most convenient way to compare them is with visualizations. MetaClean provides a simple wrapper function, makeBarPlots, that generates bar plots comparing each model across each of the evaluation measures. This function takes the following arguments: evalMeasuresDF, the dataframe returned by getEvaluationMeasures, and emNames, the names of the evaluation measures to plot ("All" plots all seven).
NOTE: When “All” is selected for emNames, the bar plots are returned in the same order as the names listed in the description.
# generate bar plots for every evaluation measure
#barPlots <- makeBarPlots(evalMeasuresDF, emNames="All")
#plot(barPlots[[1]]) # PASS.FScore
#plot(barPlots[[4]]) # FAIL.FScore
#plot(barPlots[[7]]) # Accuracy
These plots can help the user select the best performing classifier for the data. Of course, the user can also employ their own statistical tests on evalMeasuresDF itself to determine which classifier performs best.
Once the best performing model has been selected, the user can train the algorithm on all of the available training data, using the optimized hyperparameters determined during training, to create the final classifier.
# best performing model for the example development set (rand.seed = 453, k = 5, repNum = 10) is AdaBoost
# library(caret)
# # drop the EIC number column, keeping the twelve metrics and the class label
#trainData <- pqMetrics_development[,-c(1)]
#trControl <- trainControl(method = "none", savePredictions = 'final', classProbs = TRUE)
#seed <- 453
#set.seed(seed)
#best_model <- train(Class ~ ., data = trainData,
#                    method = "adaboost",
#                    trControl = trControl,
#                    tuneGrid = data.frame(nIter = 150, method = "Adaboost.M1") # hyperparameters optimized during training
#                    )
The user can make predictions on new data using the final classifier with the predict() function from caret, as demonstrated below:
# return prediction probabilities for test dataset
#test_predictions_prob <- predict(best_model, pqMetrics_test[,-c(1)], type="prob")
# return class predictions for test dataset
#test_predictions_class <- predict(best_model, pqMetrics_test[,-c(1)])
#test_predictions <- cbind("Probabilities"=test_predictions_prob, "Class"=test_predictions_class, "EICNO"=pqMetrics_test$EICNo)
#test_evalMeasures <- calculateEvaluationMeasures(pred=test_predictions_class, pqMetrics_test$Class)
This classifier can then be saved to a directory specified by the user so it can be used as many times as desired.
# uncomment the lines below and add path where you want to save trained model
# model_path <- ""
# model_file <- paste0(model_path, "MyModel.rds")
# saveRDS(best_model, file=model_file)
The user can load any previously trained model (including the trained model from our publication, available with the package) and use it to make predictions on new data.
To load a prediction model, simply provide the path and use the base function readRDS().
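For example, a classifier saved in the previous section could be reloaded as follows (model_file is the hypothetical path constructed above):
# # load a previously saved classifier
# best_model <- readRDS(model_file)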
As an example, users who have downloaded the data package MetaCleanData can use the pre-trained model included with that package:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData
# # load model from MetaCleanData
# data(myModel)
The user can then make predictions using this model with the predict() function from caret, as seen below:
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData
# myModel_predictions_prob <- predict(myModel, pqMetrics_test[,-c(1)], type="prob")
# myModel_predictions_class <- predict(myModel, pqMetrics_test[,-c(1)])
#
# myModel_predictions <- cbind("Probabilities"=myModel_predictions_prob, "Class"=myModel_predictions_class, "EICNO"=pqMetrics_test$EICNo)
#