This Vignette shows an example usecase for generating a linkage map for polyploid data in R.
First, we need to install the package pergola
. Second, we set a seed, to ensure, that this vignette is reproducable.
library(pergola)
set.seed(31415)
Next, we load the example dataset from the package.
data("simTetra")
simTetra
is a simulated tetraploid backcross dataset, consisting of seven chromosomes. A small subset reveals the structure of the dataset. Four columns represent a sample (e.g. P1, P2, F1…) and each row represents a marker. The values are 0 and 1, which stands for A and B allele in that case.
simTetra[1:5, 1:12]
## F2_001_1 F2_001_2 F2_001_3 F2_001_4 F2_002_1 F2_002_2 F2_002_3
## A001 0 1 0 0 1 0 0
## A002 0 1 0 0 1 0 0
## A003 0 1 0 0 1 0 0
## A004 0 1 0 0 0 0 0
## A005 0 1 0 0 0 0 0
## F2_002_4 F2_003_1 F2_003_2 F2_003_3 F2_003_4
## A001 1 1 1 0 1
## A002 1 1 0 0 1
## A003 1 1 0 0 1
## A004 1 1 0 0 1
## A005 1 1 0 0 1
As the data is simulated, the markers are grouped according to their chromosomes and ordered according to linkage. Our goal is to predict these. Thus, we want to randomize the order of the markers. The dataset contains the order of the alleles within the samples (haplotypes). This should also be randomized, because it provides information about the exact number of recombinations, which is approximated later.
simTetraGen <- shuffleInput(simTetra, ploidy = 4)
simTetraGen[1:5, 1:12]
## F2_001_1 F2_001_2 F2_001_3 F2_001_4 F2_002_1 F2_002_2 F2_002_3
## G010 1 0 1 1 1 1 1
## C007 1 0 0 0 0 0 0
## B003 1 0 1 0 1 1 0
## D009 1 0 0 0 0 0 1
## A008 1 0 0 0 0 1 1
## F2_002_4 F2_003_1 F2_003_2 F2_003_3 F2_003_4
## G010 0 0 1 0 0
## C007 1 1 1 1 1
## B003 0 0 0 0 1
## D009 0 1 1 0 0
## A008 0 1 0 1 0
Now that the data is randomized, we can treat it like real data. In the next step we collapse four columns into one. That way we have one column per sample, without loosing information.
simTetraGen <- bases2genotypes(simTetraGen, ploidy = 4)
simTetraGen[1:5, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## G010 3 3 1 1 2 1
## C007 1 1 4 3 2 2
## B003 2 2 1 1 1 1
## D009 1 1 2 2 2 1
## A008 1 2 2 2 2 1
Finally, the data is ready to be analyzed.
Now, we want to compare all markers with each other. This allows us to group similar markers together and define linkage groups. The first step is the calculation of pairwise recombination frequencies. Unlike other tools we do not use the expectation–maximization (EM) algorithm or maximum-likelihood (ML) methods. Instead, we approximate the recombination events and always expect the minimum number.
rf <- calcRec(simTetraGen, ploidy = 4)
rf
is the matrix containing the recombination frequencies for all pairs of markers.
image(rf, xaxt = 'n', yaxt = 'n')
axis(side = 1, at = seq(0, 1, length.out = nrow(rf)),
labels = rownames(rf), las = 2, cex.axis = 0.8)
axis(side = 2, at = seq(0, 1, length.out = nrow(rf)),
labels = rownames(rf), las = 2, cex.axis = 0.8)
The colors represent the different frequencies (yellow = high, red = low). We do not see any structure in the data set, except the diagonal. It is red, because each there is no recombination between a marker and itself.
Next, we want to find out which marker belongs into which linkage group. Usually the number of linkage groups is equal to the number of chromosomes. In the beginning of this vignette, we mentioned that our dataset consist of seven chromosomes. Now it is time to see, if that can be observed from the data. First, we plot a dendogram, the default type of plot in the funtion splitChr
.
plotRf(rf)
The plot shows seven easily distinguishable clusters, as expected. The other implemented visualization is an image of the recombination frequency values:
plotRf(rf, plottype = "image")
Here we see seven rectangles and assume seven linkage groups. Hence, we split the data with splitChr
.
split <- splitChr(rf, nchr = 7)
table(split$split)
##
## 1 2 3 4 5 6 7
## 16 17 22 20 15 19 22
head(split)
## names split dup
## G010 G010 1 0
## C007 C007 2 0
## B003 B003 3 0
## D009 D009 4 0
## A008 A008 5 0
## C009 C009 2 0
table()
shows us, how many markers are on each chromosome.
In case that single markers end up in own linkage groups, one might want to filter them out with the parameter filter = TRUE
. If many markers are highly similar, we removing those duplicates is adviced. They have no added value for the linkage map and end up at the same position.
Now, that we grouped our markers, we want to find the correct order of markers, within the linkage groups. Filtered markers (chromosome 0) would be ignored.
split <- sortLeafs(rf, split)
head(split)
split$order
is the global order of markers. We can visualize it with image()
.
image(rf[split$order, split$order], xaxt = 'n', yaxt = 'n')
axis(side = 1, at = seq(0, 1, length.out = nrow(rf)),
labels = rownames(rf)[split$order], las = 2, cex.axis = 0.8)
axis(side = 2, at = seq(0, 1, length.out = nrow(rf)),
labels = rownames(rf)[split$order], las = 2, cex.axis = 0.8)
We see, that the order within the chromosomes has improved, because the red values moved closer to the diagonal.
Here we show an example, where two orders are equally good when only one neighbor is included.
set.seed(3)
ambRF <- cbind(c(0, 2, 4, 6, 8, 12),
c(2, 0, 4, 4, 7, 10),
c(4, 4, 0, 2, 4, 7),
c(6, 4, 2, 0, 4, 5),
c(8, 7, 4, 4, 0, 3),
c(12, 10, 7, 5, 3, 0)) / 100
ambsplit <- data.frame(names = LETTERS[1:6],
split = rep(1, 6),
order = 1:6)
amb1 <- sortLeafs(ambRF, ambsplit)
amb1$order
## [1] 1 2 3 4 5 6
amb2 <- c(1,2,4,3,5,6)
amb3 <- c(2,1,3,4,5,6)
amb4 <- 6:1 #reverse of amb1
calcSarf(ambRF, amb1$order, n = 1)
## [1] 0.15
calcSarf(ambRF, amb2, n = 1)
## [1] 0.15
calcSarf(ambRF, amb3, n = 1)
## [1] 0.15
calcSarf(ambRF, amb4, n = 1)
## [1] 0.15
calcSarf(ambRF, amb1$order, n = 2)
## [1] 0.32
calcSarf(ambRF, amb2, n = 2)
## [1] 0.36
calcSarf(ambRF, amb3, n = 2)
## [1] 0.34
calcSarf(ambRF, amb4, n = 2)
## [1] 0.32
Finally, we create the map:
maps <- pullMap(rf, split)
We get a list, with one object per chromosome. We visualize it with plotChr()
.
plotChr(maps[[1]], cex = 0.6)
We can also compare two chromosome maps:
maps2 <- pullMap(rf, split, fun = "kosambi")
plotChr(maps[[1]], maps2[[1]], cex = 0.6)
The map, based on the Kosambi function is smaller, than the default Haldane function.
To compare two maps on global level we use the package dendextend
.
library(dendextend)
library(gclus)
We create a new map from the existing map maps2
, because otherwise the two dendograms would be equal.
maps3 <- maps2
maps3[1] <- maps2[2]
maps3[2] <- maps2[1]
maps3[3] <- maps2[7]
maps3[[4]] <- rev(max(maps3[[4]]) - maps3[[4]])
maps3[6] <- maps2[3]
maps3[7] <- maps2[6]
maps3[[4]]
## D001 D002 D003 D004 D005 D006
## 0.000000 9.641562 19.941057 28.076617 37.563857 45.440883
## D007 D008 D009 D010 D011 D012
## 53.445072 63.411050 69.693911 78.887030 87.143073 95.414429
## D013 D014 D015 D016 D017 D018
## 103.454019 111.493610 114.777010 120.637028 127.458702 137.955048
## D019 D020
## 147.387404 156.503482
maps2[[4]]
## D020 D019 D018 D017 D016 D015
## 0.000000 9.116078 18.548434 29.044780 35.866454 41.726472
## D014 D013 D012 D011 D010 D009
## 45.009872 53.049463 61.089053 69.360409 77.616452 86.809571
## D008 D007 D006 D005 D004 D003
## 93.092432 103.058410 111.062599 118.939625 128.426865 136.562425
## D002 D001
## 146.861920 156.503482
We create dendogram objects from our previously created maps.
dend1 <- map2dend(maps)
plot(dend1, cex = 0.6)
One is plotted, to show what it looks like.
dend2 <- map2dend(maps3)
maketangle
creates a tanglegram.
maketangle(dend1, dend2, cutheight = 500, k = 7)
We can easily observe the rearrangements that we did before. In a real data example we would not want these. We use switchChrs
and swapChrs
to reverse them. One map serves as model while the other one is changed.
maps <- switchChrs(map = maps, comp = maps3)
maps <- swapChrs(map = maps, comp = maps3)
dend3 <- map2dend(maps)
dend4 <- map2dend(maps3)
maketangle(dend3, dend4, cutheight = 500, k = 7)
We see, that both maps are now perfectly aligned. However, the interchromosomal order is not changed.
I thank Jeroen Lodewijk for testing the package and providing valuable feedback about its usablity and this vignette.
This work was supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No. 289974. INTERCROSSING
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 15.10
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dendextendRcpp_0.6.1 Rcpp_0.12.2 gclus_1.3.1
## [4] cluster_2.0.3 dendextend_1.1.2 pergola_1.0
##
## loaded via a namespace (and not attached):
## [1] whisker_0.3-2 knitr_1.12.3 magrittr_1.5
## [4] roxygen2_5.0.1 devtools_1.9.1 MASS_7.3-43
## [7] colorspace_1.2-6 foreach_1.4.3 stringr_1.0.0
## [10] caTools_1.17.1 tools_3.2.2 grid_3.2.2
## [13] seriation_1.1-3 KernSmooth_2.23-15 registry_0.3
## [16] htmltools_0.3 iterators_1.0.8 gtools_3.5.0
## [19] yaml_2.1.13 digest_0.6.9 formatR_1.2.1
## [22] bitops_1.0-6 codetools_0.2-14 evaluate_0.8
## [25] memoise_0.2.1 rmarkdown_0.9.5 TSP_1.1-3
## [28] gdata_2.17.0 stringi_1.0-1 gplots_2.17.0