By default ampir
predicts the probability of a protein to be an antimicrobial peptide (AMP) or not based on a trained SVM model with as input known AMP sequences corresponding to a wide diversity of organisms. However, within the predict_amps
function there is a model
argument that allows users to pass their own trained model object. Using a different trained model might be useful when users wish to e.g. use a taxonomic specific model to predict AMPs in a restricted group of taxa.
This vignette will go through a mock example of how you can train your own model using the caret
package. For more information on how to use caret
and the functions used within this example, please see the extensive documentation made by the author, Max Kuhn.
First, a positive and negative dataset have to be obtained. In this example, we want to predict AMPs in bats and decide to train a model using protein sequences found in bats. The positive dataset are AMPs and the negative dataset are random sequences. Both datasets were obtained from UniProt:
For the positive dataset:
bat_pos <- read_faa(system.file("extdata/bat_positive.fasta.gz", package = "ampir"))
bat_pos$Label <- "Positive"
bat_pos <- remove_nonstandard_aa(bat_pos)
For the negative dataset:
bat_neg <- read_faa(system.file("extdata/bat_negative.fasta.gz", package = "ampir"))
bat_neg$Label <- "Negative"
bat_neg <- remove_nonstandard_aa(bat_neg)
bat_neg <- bat_neg[!bat_neg$seq_aa %in% bat_pos$seq_aa,]
bat_neg <- bat_neg[sample(nrow(bat_neg),78),]
Combine the positive and negative dataset
Calculate features on the combined positive and negative dataset and add the label column
bats_features <- calculate_features(bats)
bats_features$Label <- as.factor(bats$Label)
rownames(bats_features) <- NULL
Split feature set data and create train and test set with caret
trainIndex <-createDataPartition(y=bats_features$Label, p=.7, list = FALSE)
bats_featuresTrain <-bats_features[trainIndex,]
bats_featuresTest <-bats_features[-trainIndex,]
Resample method using repeated cross validation and adding in a probability calculation with caret
Train model using a support vector machine with radial kernel with caret
. Note: Other classification models are supported too. For example, to use a random forest model in caret
, method
could be changed from “svmRadial” to “ranger”.
my_bat_svm_model <- train(Label~.,
data = bats_featuresTrain[,-1], # excluding seq_name column
method="svmRadial",
trControl = trctrl_prob,
preProcess = c("center", "scale"))
Test model to get an indication of how well the model performs on test data with caret
my_bat_pred <- predict(my_bat_svm_model, bats_featuresTest)
cm <- confusionMatrix(my_bat_pred, bats_featuresTest$Label, positive = "Positive")
Subset from cm$byClass
Balanced Accuracy | Precision | Recall | F1 |
---|---|---|---|
0.98 | 1 | 0.96 | 0.98 |
Convert the bat feature test data to the original FASTA type format containing just the sequence name and sequence as this is the required input data for ampir
bat_test_set <- bats[bats$seq_name %in% bats_featuresTest$seq_name,][,-3]
rownames(bat_test_set) <- NULL
Use the trained bat model in ampir
’s predict_amps
function on the bat test set
my_bat_AMPs
sample
seq_name | seq_aa | prob_AMP | |
---|---|---|---|
1 | G1P1T7… | MKALLTLGLL… | 0.999 |
2 | G1NZQ6… | MKLFLILIIL… | 0.986 |
3 | U3KZW3… | MKSLLILGLL… | 0.998 |
44 | P14392… | VHLSGEEKAA… | 0.010 |
45 | Q7I5M1… | MALIYTNTLL… | 0.014 |
46 | B2KI57… | MEPFSSKSLA… | 0.050 |