Genotype and sequence summaries

Eric Archer

2020-02-23

There are several by-locus summary functions available for gtypes objects. Given some sample microsatellite data:

data(msats.g)
msats <- stratify(msats.g, "broad")
msats <- msats[, getLociNames(msats)[1:4], ]

One can calculate the following summaries:

The number of alleles at each locus:

numAlleles(msats)
##   locus num.alleles
## 1  D11t          12
## 2  EV37          22
## 3  EV94          15
## 4 Ttr11           9

The number of samples with missing data at each locus:

numMissing(msats)
##   locus num.missing
## 1  D11t           1
## 2  EV37           7
## 3  EV94           1
## 4 Ttr11           1

which can also be expressed as a proportion of samples with missing data:

numMissing(msats, prop = TRUE)
##   locus num.missing
## 1  D11t       0.004
## 2  EV37       0.028
## 3  EV94       0.004
## 4 Ttr11       0.004

The allelic richness, or the average number of alleles per sample:

allelicRichness(msats)
##   locus allelic.richness
## 1  D11t            0.096
## 2  EV37            0.185
## 3  EV94            0.120
## 4 Ttr11            0.072
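
As a rough check on the definition, the values above can be reproduced in base R by dividing the number of alleles by the number of genotyped samples at each locus. This is only a sketch: the counts are copied from the numAlleles output above and from the num.genotyped column that summarizeLoci reports later in this document, and the division is an assumption about how the measure is derived rather than the package's own code.

```r
# Values copied from this document's output (not recomputed from msats).
loci <- c("D11t", "EV37", "EV94", "Ttr11")
num.alleles <- c(12, 22, 15, 9)          # from numAlleles(msats)
num.genotyped <- c(125, 119, 125, 125)   # from summarizeLoci(msats)

# Assumed definition: alleles per genotyped sample.
allelic.richness <- setNames(num.alleles / num.genotyped, loci)
round(allelic.richness, 3)
```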

The observed and expected heterozygosity:

# observed
heterozygosity(msats, type = "observed")
##   locus obsvd.het
## 1  D11t      0.70
## 2  EV37      0.66
## 3  EV94      0.77
## 4 Ttr11      0.70
# expected
heterozygosity(msats, type = "expected")
##   locus exptd.het
## 1  D11t      0.75
## 2  EV37      0.83
## 3  EV94      0.83
## 4 Ttr11      0.80
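
For intuition, expected heterozygosity can be sketched in base R from allele counts alone. The example below uses the D11t counts that alleleFreqs reports later in this document and shows both the plain estimate, one minus the sum of squared allele frequencies, and a version with the common small-sample correction. Whether the package applies that correction is an assumption here. The corrected value rounds to the 0.75 shown above, but that agreement alone does not confirm the package's method.

```r
# D11t allele counts copied from the alleleFreqs(msats) output in this document.
counts <- c("117" = 1, "119" = 1, "121" = 4, "127" = 1, "129" = 3, "131" = 16,
            "133" = 75, "135" = 96, "137" = 20, "139" = 20, "141" = 7, "143" = 6)

n.alleles <- sum(counts)   # total allele copies (two per genotyped diploid sample)
p <- counts / n.alleles    # allele frequencies

# Plain estimate: probability that two randomly drawn alleles differ.
he.raw <- 1 - sum(p ^ 2)

# Assumed small-sample correction (hypothetical choice, not confirmed by the source).
he.unbiased <- he.raw * n.alleles / (n.alleles - 1)

round(c(raw = he.raw, unbiased = he.unbiased), 2)
```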

The proportion of alleles that are unique (present in only one sample):

propUniqueAlleles(msats)
##   locus prop.unique.alleles
## 1  D11t               0.032
## 2  EV37               0.025
## 3  EV94               0.016
## 4 Ttr11               0.024

The value of theta based on heterozygosity:

theta(msats)
##   locus theta
## 1  D11t  0.56
## 2  EV37  0.62
## 3  EV94  0.62
## 4 Ttr11  0.59

Most of these measures are calculated together by the summarizeLoci function, which, in the output below, returns one row per locus. The function can also calculate the measures for each stratum separately with by.strata = TRUE, in which case the output below has one row per locus and stratum:

summarizeLoci(msats)
##   locus num.genotyped num.missing prop.genotyped num.alleles allelic.richness
## 1  D11t           125           1           0.99          12            0.096
## 2  EV37           119           7           0.94          22            0.185
## 3  EV94           125           1           0.99          15            0.120
## 4 Ttr11           125           1           0.99           9            0.072
##   prop.unique.alleles exptd.het obsvd.het
## 1               0.032      0.75      0.70
## 2               0.025      0.83      0.66
## 3               0.016      0.83      0.77
## 4               0.024      0.80      0.70
summarizeLoci(msats, by.strata = TRUE)
##   locus  stratum num.genotyped.x num.missing prop.genotyped num.alleles
## 1  D11t Offshore              58           0           1.00          12
## 2  EV37 Offshore              56           2           0.97          22
## 3  EV94 Offshore              57           1           0.98          15
## 4 Ttr11 Offshore              57           1           0.98           9
## 5  D11t  Coastal              67           1           0.99           3
## 6  EV37  Coastal              63           5           0.93           7
## 7  EV94  Coastal              68           0           1.00           5
## 8 Ttr11  Coastal              68           0           1.00           4
##   allelic.richness num.unique num.genotyped.y exptd.het obsvd.het
## 1            0.207          3              58      0.86      0.91
## 2            0.393          3              56      0.94      0.76
## 3            0.263          2              57      0.86      0.81
## 4            0.158          3              57      0.82      0.78
## 5            0.045          1              67      0.49      0.51
## 6            0.111          3              63      0.61      0.57
## 7            0.074          0              68      0.77      0.74
## 8            0.059          0              68      0.66      0.63

One can also obtain the allele frequencies for each locus, overall and by stratum. In the output below they are reported as allele counts:

alleleFreqs(msats)
## $D11t
## 
## 117 119 121 127 129 131 133 135 137 139 141 143 
##   1   1   4   1   3  16  75  96  20  20   7   6 
## 
## $EV37
## 
## 190 200 202 204 206 208 210 212 214 216 218 220 222 224 226 228 230 232 234 236 
##   3   4   5   2   7   8   3  13  86  39   8   6  20  11   3   8   5   2   1   2 
## 240 254 
##   1   1 
## 
## $EV94
## 
## 229 239 243 245 247 249 251 253 255 259 261 263 265 269 271 
##   1   2  15  18   3  83  41   7   6  27  27   7   8   3   2 
## 
## $Ttr11
## 
## 193 197 207 209 211 213 215 217 219 
##   1  10  53  17  35  80  46   7   1
alleleFreqs(msats, by.strata = TRUE)
## $D11t
##      
##       Coastal Offshore
##   117       0        1
##   119       0        1
##   121       0        4
##   127       0        1
##   129       0        3
##   131       0       16
##   133      48       27
##   135      83       13
##   137       3       17
##   139       0       20
##   141       0        7
##   143       0        6
## 
## $EV37
##      
##       Coastal Offshore
##   190       0        3
##   200       0        4
##   202       0        5
##   204       0        2
##   206       0        7
##   208       0        8
##   210       0        3
##   212       1       12
##   214      71       15
##   216      33        6
##   218       2        6
##   220       0        6
##   222      11        9
##   224       7        4
##   226       1        2
##   228       0        8
##   230       0        5
##   232       0        2
##   234       0        1
##   236       0        2
##   240       0        1
##   254       0        1
## 
## $EV94
##      
##       Coastal Offshore
##   229       0        1
##   239       0        2
##   243       0       15
##   245      12        6
##   247       0        3
##   249      47       36
##   251      30       11
##   253       0        7
##   255       0        6
##   259      25        2
##   261      22        5
##   263       0        7
##   265       0        8
##   269       0        3
##   271       0        2
## 
## $Ttr11
##      
##       Coastal Offshore
##   193       0        1
##   197       0       10
##   207      42       11
##   209       0       17
##   211       0       35
##   213      59       21
##   215      33       13
##   217       2        5
##   219       0        1
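
Because the tables above hold counts, proportions within each stratum have to be derived separately. A minimal base R sketch, using the Ttr11 by-strata counts copied from the output above (the object names are illustrative only):

```r
# Ttr11 allele counts by stratum, copied from the output above.
counts <- cbind(
  Coastal  = c(0, 0, 42, 0, 0, 59, 33, 2, 0),
  Offshore = c(1, 10, 11, 17, 35, 21, 13, 5, 1)
)
rownames(counts) <- c("193", "197", "207", "209", "211", "213", "215", "217", "219")

# Divide each column by its total so that every stratum sums to 1.
freqs <- prop.table(counts, margin = 2)
round(freqs, 3)
```

Each column of freqs sums to 1, so the values can be compared between strata that have different sample sizes.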

The dupGenotypes function identifies pairs of samples that have the same or nearly the same genotypes. The number (or proportion) of loci that a pair must share to be considered a duplicate is set with the num.shared argument. The returned data.frame lists the loci at which the two samples mismatch so that they can be reviewed.

# Find pairs of samples that match at about two-thirds or more of the loci
dupGenotypes(msats, num.shared = 0.66)
##    ids.1 ids.2 strata.1 strata.2 mismatch.loci num.loci.genotyped
## 1  78045 78058  Coastal  Coastal          <NA>                  4
## 2  25509 41822  Coastal  Coastal          <NA>                  4
## 3  42193 78035  Coastal  Coastal          <NA>                  3
## 4  41579 45237  Coastal  Coastal          <NA>                  3
## 5  40916 78038  Coastal  Coastal          <NA>                  3
## 6  78063 78069  Coastal  Coastal          EV94                  4
## 7  78053 78061  Coastal  Coastal          EV94                  4
## 8  78051 78065  Coastal  Coastal          EV94                  4
## 9  78049 78057  Coastal  Coastal          EV94                  4
## 10 78048 78054  Coastal  Coastal          D11t                  4
## 11 78044 78067  Coastal  Coastal          D11t                  4
## 12 78044 78063  Coastal  Coastal          EV37                  4
## 13 78043 78046  Coastal  Coastal          EV94                  4
## 14 78041 78066  Coastal  Coastal          EV37                  4
## 15 78038 78051  Coastal  Coastal          D11t                  4
## 16 78038 78046  Coastal  Coastal          EV94                  4
## 17 78038 78043  Coastal  Coastal          EV94                  4
## 18 78036 78067  Coastal  Coastal          EV37                  4
## 19 78035 78053  Coastal  Coastal          D11t                  4
## 20 78034 78043  Coastal  Coastal          EV37                  4
## 21 78034 78040  Coastal  Coastal          EV94                  4
## 22 45236 78065  Coastal  Coastal         Ttr11                  4
## 23 45231 78041  Coastal  Coastal          EV94                  4
## 24 45230 78040  Coastal  Coastal         Ttr11                  4
## 25 45230 78035  Coastal  Coastal          EV94                  4
## 26 44721 78059  Coastal  Coastal          EV37                  4
## 27 44720 78058  Coastal  Coastal          EV94                  4
## 28 44720 78045  Coastal  Coastal          EV94                  4
## 29 44719 78044  Coastal  Coastal         Ttr11                  4
## 30 44719 45233  Coastal  Coastal          EV94                  4
## 31 44718 78037  Coastal  Coastal          EV94                  4
## 32 41822 78065  Coastal  Coastal          EV94                  4
## 33 41822 78051  Coastal  Coastal          EV94                  4
## 34 41821 78060  Coastal  Coastal          EV94                  4
## 35 41820 78035  Coastal  Coastal         Ttr11                  4
## 36 41820 45229  Coastal  Coastal          EV94                  4
## 37 41819 78040  Coastal  Coastal         Ttr11                  4
## 38 41819 45230  Coastal  Coastal         Ttr11                  4
## 39 41578 45233  Coastal  Coastal          EV94                  4
## 40 41578 44719  Coastal  Coastal          EV94                  4
## 41 41540 78040  Coastal  Coastal          D11t                  4
## 42 41538 45231  Coastal  Coastal          D11t                  4
## 43 40915 78047  Coastal  Coastal         Ttr11                  4
## 44 25509 78065  Coastal  Coastal          EV94                  4
## 45 25509 78051  Coastal  Coastal          EV94                  4
## 46 25503 78053  Coastal  Coastal         Ttr11                  4
## 47 25503 41539  Coastal  Coastal          EV94                  4
## 48 23945 78065  Coastal  Coastal          EV37                  4
## 49 23945 78050  Coastal  Coastal         Ttr11                  4
## 50 51981 78069  Coastal  Coastal          EV37                  3
## 51 45237 78069  Coastal  Coastal         Ttr11                  3
## 52 45237 78068  Coastal  Coastal          EV94                  3
## 53 45237 78048  Coastal  Coastal          EV94                  3
## 54 45237 78038  Coastal  Coastal         Ttr11                  3
## 55 45237 78033  Coastal  Coastal          EV94                  3
## 56 42193 78058  Coastal  Coastal          EV94                  3
## 57 42193 78053  Coastal  Coastal          D11t                  3
## 58 42193 78045  Coastal  Coastal          EV94                  3
## 59 42193 51982  Coastal  Coastal         Ttr11                  3
## 60 42193 45233  Coastal  Coastal          EV94                  3
## 61 42193 45230  Coastal  Coastal          EV94                  3
## 62 42193 44720  Coastal  Coastal          EV94                  3
## 63 42193 44719  Coastal  Coastal          EV94                  3
## 64 42192 78066  Coastal  Coastal          EV94                  3
## 65 42192 78051  Coastal  Coastal          D11t                  3
## 66 42192 78041  Coastal  Coastal          EV94                  3
## 67 42192 78038  Coastal  Coastal          D11t                  3
## 68 42192 45231  Coastal  Coastal          EV94                  3
## 69 41820 42193  Coastal  Coastal         Ttr11                  3
## 70 41819 45237  Coastal  Coastal          EV94                  3
## 71 41579 78069  Coastal  Coastal         Ttr11                  3
## 72 41579 78068  Coastal  Coastal          EV94                  3
## 73 41579 78048  Coastal  Coastal          EV94                  3
## 74 41579 78038  Coastal  Coastal         Ttr11                  3
## 75 41579 78033  Coastal  Coastal          EV94                  3
## 76 41579 41819  Coastal  Coastal          EV94                  3
## 77 41578 45237  Coastal  Coastal         Ttr11                  3
## 78 41578 42193  Coastal  Coastal          EV94                  3
## 79 41578 41579  Coastal  Coastal         Ttr11                  3
## 80 40916 78069  Coastal  Coastal         Ttr11                  3
## 81 40916 78051  Coastal  Coastal          D11t                  3
## 82 40916 78046  Coastal  Coastal          EV94                  3
## 83 40916 78043  Coastal  Coastal          EV94                  3
## 84 40916 78040  Coastal  Coastal          EV94                  3
## 85 40916 78034  Coastal  Coastal          EV94                  3
## 86 40916 45237  Coastal  Coastal         Ttr11                  3
## 87 40916 42192  Coastal  Coastal          D11t                  3
## 88 40916 41579  Coastal  Coastal         Ttr11                  3
## 89 40916 41578  Coastal  Coastal         Ttr11                  3
##    num.loci.shared prop.loci.shared
## 1                4             1.00
## 2                4             1.00
## 3                3             1.00
## 4                3             1.00
## 5                3             1.00
## 6                3             0.75
## 7                3             0.75
## 8                3             0.75
## 9                3             0.75
## 10               3             0.75
## 11               3             0.75
## 12               3             0.75
## 13               3             0.75
## 14               3             0.75
## 15               3             0.75
## 16               3             0.75
## 17               3             0.75
## 18               3             0.75
## 19               3             0.75
## 20               3             0.75
## 21               3             0.75
## 22               3             0.75
## 23               3             0.75
## 24               3             0.75
## 25               3             0.75
## 26               3             0.75
## 27               3             0.75
## 28               3             0.75
## 29               3             0.75
## 30               3             0.75
## 31               3             0.75
## 32               3             0.75
## 33               3             0.75
## 34               3             0.75
## 35               3             0.75
## 36               3             0.75
## 37               3             0.75
## 38               3             0.75
## 39               3             0.75
## 40               3             0.75
## 41               3             0.75
## 42               3             0.75
## 43               3             0.75
## 44               3             0.75
## 45               3             0.75
## 46               3             0.75
## 47               3             0.75
## 48               3             0.75
## 49               3             0.75
## 50               2             0.67
## 51               2             0.67
## 52               2             0.67
## 53               2             0.67
## 54               2             0.67
## 55               2             0.67
## 56               2             0.67
## 57               2             0.67
## 58               2             0.67
## 59               2             0.67
## 60               2             0.67
## 61               2             0.67
## 62               2             0.67
## 63               2             0.67
## 64               2             0.67
## 65               2             0.67
## 66               2             0.67
## 67               2             0.67
## 68               2             0.67
## 69               2             0.67
## 70               2             0.67
## 71               2             0.67
## 72               2             0.67
## 73               2             0.67
## 74               2             0.67
## 75               2             0.67
## 76               2             0.67
## 77               2             0.67
## 78               2             0.67
## 79               2             0.67
## 80               2             0.67
## 81               2             0.67
## 82               2             0.67
## 83               2             0.67
## 84               2             0.67
## 85               2             0.67
## 86               2             0.67
## 87               2             0.67
## 88               2             0.67
## 89               2             0.67
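
The comparison behind a table like this can be sketched in base R. The example below is not the dupGenotypes implementation. It uses hypothetical toy genotypes and assumes that the proportion shared is the number of matching loci divided by the number of loci genotyped in both samples, which is consistent with the num.loci.genotyped, num.loci.shared, and prop.loci.shared columns above.

```r
# Hypothetical toy data: one row per sample, one column per locus.
# Genotypes are "allele/allele" strings; NA marks a missing genotype.
geno <- rbind(
  s1 = c(L1 = "133/135", L2 = "214/216", L3 = "249/251", L4 = "207/213"),
  s2 = c(L1 = "133/135", L2 = "214/216", L3 = "249/259", L4 = "207/213"),
  s3 = c(L1 = "131/139", L2 = "208/222", L3 = "243/263", L4 = "209/211"),
  s4 = c(L1 = "133/135", L2 = NA,        L3 = "249/251", L4 = "207/213")
)

# Compare every pair of samples locus by locus.
pairs <- t(combn(rownames(geno), 2))
res <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(i) {
  a <- geno[pairs[i, 1], ]
  b <- geno[pairs[i, 2], ]
  both <- !is.na(a) & !is.na(b)   # loci genotyped in both samples
  same <- both & a == b           # loci with identical genotypes
  data.frame(
    ids.1 = pairs[i, 1], ids.2 = pairs[i, 2],
    mismatch.loci = paste(names(a)[both & !same], collapse = ", "),
    num.loci.genotyped = sum(both),
    num.loci.shared = sum(same),
    stringsAsFactors = FALSE
  )
}))
res$prop.loci.shared <- res$num.loci.shared / res$num.loci.genotyped

# Keep pairs at or above the threshold (which() drops undefined proportions).
dups <- res[which(res$prop.loci.shared >= 0.66), ]
dups
```

In this toy set, s1 and s4 match at every locus typed in both, while s1 and s2 differ only at L3, so both pairs are flagged for review.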

The start and end positions, length, and number of N’s and indels for each sequence can be generated with the summarizeSeqs function:

library(ape)
data(dolph.seqs)
seq.smry <- summarizeSeqs(as.DNAbin(dolph.seqs))
head(seq.smry)
##      start end length num.ns num.indels
## 4495     1 402    402      0          2
## 4496     1 402    402      0          2
## 4498     1 402    402      0          1
## 5814     1 402    402      0          2
## 5815     1 402    402      0          2
## 5816     1 402    402      0          2

Base frequencies can be generated with baseFreqs:

bf <- baseFreqs(as.DNAbin(dolph.seqs))

# nucleotide frequencies by site
bf$site.freq[, 1:15]
##     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
## a   0 126 126 126 126 126   5   0   0   0   0 126   0   0   0
## c   0   0   0   0   0   0   0   0 126   0   0   0   0   0   0
## g 126   0   0   0   0   0   0 126   0   0   0   0   0   0 126
## t   0   0   0   0   0   0   0   0   0 126 126   0 126 126   0
## u   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## r   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## y   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## m   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## k   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## w   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## s   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## b   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## d   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## h   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## v   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## n   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## x   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## -   0   0   0   0   0   0 121   0   0   0   0   0   0   0   0
## .   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
# overall nucleotide frequencies
bf$base.freqs
## 
##     a     c     g     t     u     r     y     m     k     w     s     b     d 
## 15179 11561  6501 17166     0     0     0     0     0     0     0     0     0 
##     h     v     n     x     -     . 
##     0     0     0     0   245     0
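
The two components shown above, counts by site and overall counts, can be illustrated in base R on a small alignment. This is a sketch on hypothetical toy sequences with a reduced set of symbols. It is not the baseFreqs code and does not use dolph.seqs.

```r
# Hypothetical toy alignment: four aligned sequences of length 8.
seqs <- c(s1 = "gaaac-tg", s2 = "gaaacatg", s3 = "gaatc-tg", s4 = "gaaac-ng")
aln <- do.call(rbind, strsplit(seqs, ""))   # samples in rows, sites in columns

# Reduced symbol set for the example (the output above lists more codes).
bases <- c("a", "c", "g", "t", "n", "-")

# Count each symbol at each site; factor() keeps zero counts for absent symbols.
site.freq <- sapply(seq_len(ncol(aln)), function(j) {
  as.vector(table(factor(aln[, j], levels = bases)))
})
dimnames(site.freq) <- list(bases, as.character(seq_len(ncol(aln))))

# Overall counts are the row sums of the by-site table.
base.freqs <- rowSums(site.freq)

site.freq
base.freqs
```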

Sequences can be scanned for low-frequency substitutions with lowFreqSubs:

lowFreqSubs(as.DNAbin(dolph.seqs), min.freq = 2)
##      id site base freq       motif
## 1 23792  274    t    1 cctattgatcc
## 2 23794  287    g    1 cctccgttata
## 3 26304  274    a    1 cctataaatcc
## 4 26304  394    t    1 taccttgtggg
## 5 74962   57    a    1 taaaaataatt
## 6 74962  104    g    1 catacgcatgt
## 7 74962  392    t    1 catgctccgtg
## 8 74962  393    c    1 atgctccgtgg
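
The idea of the scan can be sketched in base R: at each site, tabulate the bases and report the samples that carry a base seen fewer than min.freq times. The alignment below is hypothetical, and the strict "fewer than" comparison is an assumption based on the output above, where min.freq = 2 returns only bases with a frequency of 1.

```r
# Hypothetical toy alignment: four aligned sequences of length 8.
seqs <- c(s1 = "acgtacgt", s2 = "acgtacgt", s3 = "acgaacgt", s4 = "acgtacct")
aln <- do.call(rbind, strsplit(seqs, ""))
min.freq <- 2

# For each site, find samples carrying a base that occurs fewer than min.freq times.
per.site <- lapply(seq_len(ncol(aln)), function(j) {
  counts <- table(aln[, j])
  rare <- names(counts)[counts < min.freq]
  i <- which(aln[, j] %in% rare)
  if (length(i) == 0) return(NULL)
  data.frame(
    id = rownames(aln)[i], site = j, base = unname(aln[i, j]),
    freq = as.vector(counts[aln[i, j]]),
    row.names = NULL, stringsAsFactors = FALSE
  )
})

# Drop sites without rare bases and stack the rest into one table.
hits <- do.call(rbind, Filter(Negate(is.null), per.site))
hits
```

Here the scan flags the a at site 4 in s3 and the c at site 7 in s4, the kind of singleton substitutions that are worth checking against the raw sequence data.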

Unusual sequences can be identified by plotting likelihoods based on pairwise distances:

data(dolph.haps)
sequenceLikelihoods(as.DNAbin(dolph.haps))

##        id mean.dist neg.log.lik delta.log.lik
## 1  Hap.32      13.7         110          26.3
## 2  Hap.22      12.7         104          20.7
## 3  Hap.06      13.1         102          19.2
## 4  Hap.02       6.3          99          16.0
## 5  Hap.15       7.4          98          14.8
## 6  Hap.29      11.8          97          13.4
## 7  Hap.10       7.8          94          11.1
## 8  Hap.30       8.9          93           9.4
## 9  Hap.23       9.7          93           9.2
## 10 Hap.03       7.1          92           8.9
## 11 Hap.04       8.1          92           8.8
## 12 Hap.33       7.9          92           8.8
## 13 Hap.31       7.2          91           8.1
## 14 Hap.14       8.2          91           7.6
## 15 Hap.09       7.3          91           7.5
## 16 Hap.12      11.6          91           7.4
## 17 Hap.18      11.7          90           7.2
## 18 Hap.19      11.3          90           7.0
## 19 Hap.07       7.2          90           6.8
## 20 Hap.21       8.5          88           4.3
## 21 Hap.13       8.6          88           4.3
## 22 Hap.20       8.1          87           3.3
## 23 Hap.26       8.7          87           3.3
## 24 Hap.27       7.2          86           3.2
## 25 Hap.16       8.3          86           2.9
## 26 Hap.05       7.9          86           2.9
## 27 Hap.24       8.8          86           2.7
## 28 Hap.17       7.2          86           2.2
## 29 Hap.25      11.2          85           1.8
## 30 Hap.01       8.0          85           1.7
## 31 Hap.08       8.9          85           1.4
## 32 Hap.28      11.0          84           1.2
## 33 Hap.11       7.7          83           0.0

All of the above functions can be run at once with the qaqc function. Only those functions appropriate to the data type contained (haploid or diploid) will be run. A file is written for each output, labelled either by the @description slot of the gtypes object or by the optional label argument of the function.