immunarch
comes with its own data format, including tab-delimited columns that can be specified as follows:
“Clones” - count or number of barcodes (events, UMIs) or reads;
“Proportion” - proportion of barcodes (events, UMIs) or reads;
“CDR3.nt” - CDR3 nucleotide sequence;
“CDR3.aa” - CDR3 amino acid sequence;
“V.name” - names of aligned Variable gene segments;
“D.name” - names of aligned Diversity gene segments or NA;
“J.name” - names of aligned Joining gene segments;
“V.end” - last positions of aligned V gene segments (1-based);
“D.start” - positions of D’5 end of aligned D gene segments (1-based);
“D.end” - positions of D’3 end of aligned D gene segments (1-based);
“J.start” - first positions of aligned J gene segments (1-based);
“VJ.ins” - number of inserted nucleotides (N-nucleotides) at V-J junction (-1 for receptors with VDJ recombination);
“VD.ins” - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
“DJ.ins” - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
“Sequence” - full nucleotide sequence.
The package provides several IO functions:
repLoad
- to load the repertoires, having compatible format.
repSave
- to save changes and to write the repertoire data to a file in a specific format (immunarch
, VDJtools).
repLoad
detects the input file format automatically. immunarch
currently support the following immune repertoire data formats:
"immunarch"
- current software tool, in case you forgot it :)
"immunoseq"
- http://www.adaptivebiotech.com/immunoseq
"mitcr"
- https://github.com/milaboratory/mitcr
"mixcr"
- https://github.com/milaboratory/mixcr
"migec"
- http://migec.readthedocs.io/en/latest/
"migmap"
- https://github.com/mikessh/migmap
"tcr"
- https://imminfo.github.io/tcr/
"vdjtools"
- https://vdjtools-doc.readthedocs.io/en/master/
"imgt"
- http://www.imgt.org/HighV-QUEST/
"airr"
- http://docs.airr-community.org/en/latest/datarep/overview.html
"10x"
- https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation
"archer"
- ArcherDX clonotype tables. https://archerdx.com/immunology/
These parsers will be available soon.
"imseq"
- http://www.imtools.org/
"rtcr"
- https://github.com/uubram/RTCR
"vidjil"
- http://www.vidjil.org/
Please contact us if there are more file formats you want to be supported.
For parsing IgBLAST results process the data with MigMap first.
You can load the data either from a single file, a list of repertoire file paths or from a folder with repertoire files. A single file can be loaded as follows:
# To load the data from a single file (note that you don't need to specify the data format):
file_path = paste0(system.file(package="immunarch"), "/extdata/io/Sample1.tsv.gz")
immdata <- repLoad(file_path)
In other cases you may want to provide a metadata file and locate it in the folder. It is necessary to name it exactly “metadata.txt”.
# For instance you have a following structure in your folder:
# >_ ls
# immunoseq1.txt
# immunoseq2.txt
# immunoseq3.txt
# metadata.txt
With the metadata repLoad
will create a list in the environment with 2 elements, namely data
and meta
. All the data will be accessible simply from immdata$data
.
Otherwise repLoad
will create a dummy metadata file with only sample names.
# To load the whole folder with every file in it type:
file_path = paste0(system.file(package="immunarch"), "/extdata/io/")
immdata <- repLoad(file_path)
print(names(immdata))
# In order to do that your folder must contain metadata file named
# exactly "metadata.txt".
# In R, when you load your data:
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"
# Suppose you do not have "metadata.txt":
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"
Dummy metadata data frame look like this:
as_tibble(data.frame(Sample = c("immunoseq1", "immunoseq2", "immunoseq3"), stringsAsFactors = F))
## # A tibble: 3 x 1
## Sample
## <chr>
## 1 immunoseq1
## 2 immunoseq2
## 3 immunoseq3
The metadata file “metadata.txt” has to be tab delimited file with first column named “Sample” and any number of additional columns with arbitrary names. The first column should contain base names of files without extensions in your folder.
Sample | Sex | Age | Status |
---|---|---|---|
immunoseq_1 | M | 1 | C |
immunoseq_2 | M | 2 | C |
immunoseq_3 | F | 3 | A |
In order to import data from the external databases you have to create a connection to this database and then load the data. Make sure that the table format in your database matches the immunarch
’s format.
To illustrate the use of external database, here is an example demonstrating data loading to the local MonetDB database:
# Your list of repertoires in immunarch's format
DATA
# Metadata data frame
META
# Create a temporary directory
dbdir = tempdir()
# Create a DBI connection to MonetDB in the temporary directory.
con = DBI::dbConnect(MonetDBLite::MonetDBLite(), embedded = dbdir)
# Write each repertoire to MonetDB. Each table has corresponding name from the DATA
for (i in 1:length(DATA)) {
DBI::dbWriteTable(con, names(DATA)[i], DATA[[i]], overwrite=TRUE)
}
# Create a source in the temporary directory with MonetDB
ms = MonetDBLite::src_monetdblite(dbdir = dbdir)
res_db = list()
# Load the data from MonetDB to dplyr tables
for (i in 1:length(DATA)) {
res_db[[names(DATA)[i]]] = dplyr::tbl(ms, names(DATA)[i])
}
# Your data is ready to use
list(data = res_db, meta = META)
immunarch
is compatible with following sources:
R data frames (for common applications)
R data tables (for speed, although requires lots of RAM)
MonetDB-like databases that support both DBI and dplyr (best choice, although you have to be a bit familiar with dplyr)
Apache Spark (if you are expierienced and comfortable with it)
You can find the introduction to dplyr
here: https://CRAN.R-project.org/package=dplyr/vignettes/dplyr.html
The function returns the most abundant clonotypes for the given repertoire:
top(immdata$data[[1]])
## # A tibble: 10 x 15
## Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int>
## 1 173 0.0204 TGCGCC… CASSQE… TRBV4… TRBD1 TRBJ2… 16 18 26
## 2 163 0.0192 TGCGCC… CASSYR… TRBV4… TRBD1 TRBJ2… 11 13 18
## 3 66 0.00776 TGTGCC… CATSTN… TRBV15 TRBD1 TRBJ2… 11 16 22
## 4 54 0.00635 TGTGCC… CATSIG… TRBV15 TRBD2 TRBJ2… 11 19 25
## 5 48 0.00565 TGTGCC… CASSPW… TRBV27 TRBD1 TRBJ1… 11 16 23
## 6 48 0.00565 TGCGCC… CASQGD… TRBV4… TRBD1 TRBJ1… 8 13 19
## 7 40 0.00471 TGCGCC… CASSQD… TRBV4… TRBD1 TRBJ2… 16 21 26
## 8 31 0.00365 TGTGCC… CASSEE… TRBV2 TRBD1 TRBJ1… 15 17 20
## 9 30 0.00353 TGCGCC… CASSQP… TRBV4… TRBD1 TRBJ2… 14 23 28
## 10 28 0.00329 TGTGCC… CASSWV… TRBV6… TRBD1 TRBJ2… 12 20 25
## # … with 5 more variables: J.start <int>, VJ.ins <dbl>, VD.ins <dbl>,
## # DJ.ins <dbl>, Sequence <lgl>
Conveniently, functions are vectorised over the list of data frames; and coding(immdata$data)
in the example below returns a list of data frames with coding sequences:
coding(immdata$data[[1]])
The next one operates in a similar fashion:
noncoding(immdata$data[[1]])
Now, the computation of the number of filtered sequences is straightforward:
nrow(inframes(immdata$data[[1]]))
And for the out-of-frame clonotypes:
nrow(outofframes(immdata$data[[1]]))
It is simple to subset data frame according to labels in the specified index. In the example the resulting data frame contains only records with ‘TRBV10-1’ V gene:
filter(immdata$data[[1]], V.name == 'TRBV10-1')
## # A tibble: 24 x 15
## Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int>
## 1 2 0.000235 TGCGCC… CASSES… TRBV1… TRBD2 TRBJ2… 16 20 25
## 2 2 0.000235 TGCGCC… CASSDG… TRBV1… TRBD1 TRBJ2… 13 15 22
## 3 1 0.000118 TGCGCC… CASSGD… TRBV1… TRBD2 TRBJ2… 8 10 15
## 4 1 0.000118 TGCGCC… CATLRS… TRBV1… TRBD1 TRBJ2… 6 7 9
## 5 1 0.000118 TGCGCC… CASSES… TRBV1… TRBD2 TRBJ2… 16 20 22
## 6 1 0.000118 TGCGCC… CASSES… TRBV1… TRBD2 TRBJ2… 16 17 21
## 7 1 0.000118 TGCGCC… CASRAS… TRBV1… TRBD2 TRBJ2… 10 13 21
## 8 1 0.000118 TGCGCC… CASRRD… TRBV1… TRBD1 TRBJ2… 8 13 19
## 9 1 0.000118 TGCGCC… CASSEV… TRBV1… TRBD1 TRBJ2… 14 19 24
## 10 1 0.000118 TGCGCC… CASSEG… TRBV1… TRBD2 TRBJ2… 13 19 27
## # … with 14 more rows, and 5 more variables: J.start <int>, VJ.ins <dbl>,
## # VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>
ds = repSample(immdata$data, "downsample", 100)
sapply(ds, nrow)
## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192 MS1 MS2 MS3 MS4
## 97 98 96 100 84 94 89 98 85 99
## MS5 MS6
## 81 100
ds = repSample(immdata$data, "sample", .n = 10)
sapply(ds, nrow)
## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192 MS1 MS2 MS3 MS4
## 10 10 10 10 10 10 10 10 10 10
## MS5 MS6
## 10 10