This document provides brief examples of how jackalope
can be used to generate sequencing data that can inform some common sampling decisions for HTS studies.
Here, I’ll show how to generate a reference genome based on an existing assembly or via simulated DNA sequences.
To show how an existing assembly is read in jackalope
, the code below processes the assembly for Drosophila melanogaster (version 6.27) downloaded from https://flybase.org. It reads the compressed FASTA file, filters out scaffolds by using a size threshold, removes the Y chromosome (to avoid having both an X and Y in the same haplotype), and merges the left and right arms of chromosomes 2 and 3. It lastly sets the names of chromosomes 2 and 3 to be "2"
and "3"
, respectively.
ref <- read_fasta("dmel-6.27.fasta.gz", cut_names = TRUE)
ref$filter_chroms(1e6, method = "size")
ref$rm_chroms("Y")
ref$merge_chroms(c("2L", "2R"))
ref$merge_chroms(c("3L", "3R"))
names <- ref$chrom_names()
names[grepl("^2", names)] <- "2"
names[grepl("^3", names)] <- "3"
ref$set_names(names)
For the rest of the document, I will use a simulated genome for simplicity (and so that I don’t have to store the D. melanogaster genome inside this package). Here is how I simulated a genome of size \(\sim 1\) kb split among 4 chromosomes, with the same names as the chromosomes in the D. melanogaster genome:
This resulted in the following ref_genome
object:
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#> name chromosome length
#> 2 TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT 232
#> 3 GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG 245
#> 4 GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG 247
#> X CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA 237
For the examples below, we’ll pretend this is our D. melanogaster genome.
To generate variant haplotypes based on our reference genome, we need molecular evolution information (unless passing a Variant Call Format, or VCF, file). For molecular-evolution information, I used the JC69 model for simplicity. Mutation rates were chosen from the evo_rates
object present inside jackalope
, which stores Table 1 from Sung et al. (2016). The substitution and indel rates were the values from evo_rates
specific to D. melanogaster. The mu
parameter to sub_JC69
function was set to NULL
because the default behavior of all substitution-model functions in jackalope
is to scale rate matrices such that branch lengths are in units of substitutions per site. In this case, I don’t want to scale rate matrices because our branch lengths will be in generations. Zhang and Gerstein (2003) found a 1 to 2.90 ratio of insertions to deletions, and described relative rates of indels of various sizes using a power-law relationship. To approximate the distributions they described, relative rates were derived from a Lavalette distribution with \(L = 60\) and \(a = 1.60\) for insertions and \(a = 1.51\) for deletions. The \(\theta\) parameter is the population-scaled mutation rate, which we can get directly from evo_rates
for D. melanogaster. Lastly, \(N_0\) is the effective population size for D. melanogaster, which is used in a few instances below because scrm
outputs branch lengths in units of \(4 N_0\) generations.
sub_rate <- evo_rates$subs[evo_rates$species == "Drosophila melanogaster"]
indel_rate <- evo_rates$indels[evo_rates$species == "Drosophila melanogaster"]
# Because both are in units of 10^-10 events per site per generation:
sub_rate <- sub_rate * 1e-10
indel_rate <- indel_rate * 1e-10
sub <- sub_JC69(lambda = sub_rate, mu = NULL)
ins <- indels(rate = indel_rate, max_length = 60,a = 1.60)
del <- indels(rate = indel_rate, max_length = 60, a = 1.51)
theta <- evo_rates[evo_rates$species == "Drosophila melanogaster","theta_s"]
# Originally in units of 1e6 individuals
N0 <- evo_rates[evo_rates$species == "Drosophila melanogaster", "Ne"] * 1e6
When generating any haplotypes below, these objects will be used inside create_haplotypes
to specify molecular evolution information.
The examples here produce FASTQ files from the known reference assembly that could test strategies for how to assemble a similar genome using HTS data.
The first strategy is to use only PacBio sequencing. The PacBio Sequel system produces up to 500,000 reads per Single Molecule, Real-Time (SMRT) cell, so you could run the following for two cells (with the file pacbio_R1.fq
as output):
An alternative, hybrid strategy uses 1 SMRT cell of PacBio sequencing and 1 lane (\(\sim 500\) million reads) of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system (the default Illumina system in jackalope
):
pacbio(ref, out_prefix = "pacbio", n_reads = 500e3)
illumina(ref, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
read_length = 100)
The last strategy combines 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing with 1 flow cell of \(2 \times 250\)bp mate-pair sequencing on an Illumina MiSeq v3. The mate-pair sequencing uses longer fragments (defaults are mean of 400 and standard deviation of 100) to better cover highly repetitive regions.
illumina(ref, out_prefix = "ill_pe", n_reads = 500e6, paired = TRUE,
read_length = 100)
illumina(ref, out_prefix = "ill_mp", seq_sys = "MSv3",
read_length = 250, n_reads = 50e6, matepair = TRUE,
frag_mean = 3000, frag_sd = 500)
These data could then be used to compare genome assembly performance between the strategies above, or between programs within a given strategy. Extensions of these tests include adjusting sequencing depth, sequencing platform, and error rates.
For diploid species, scientists won’t be sampling a haploid reference, and heterozygosity can be a real problem for the assembly process. So below I’ll show you how to simulate a diploid individual and create reads based on that. I’ll just show the hybrid assembly strategy, but the others simply differ in the sequencing step as shown above.
First, we’ll simulate two haplotypes based on the \(\theta\) (population-scaled mutation rate) and the other molecular evolution information for D. melanogaster we got from the evo_rates
object earlier. I’m also renaming the haplotypes.
haps <- create_haplotypes(ref, haps_theta(theta = theta, n_haps = 2),
sub, ins, del)
haps$set_names(c("A", "B"))
This results in the following haplotypes
object:
#> << haplotypes object >>
#> # Haplotypes: 2
#> # Mutations: 21
#>
#> << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#> name chromosome length
#> 2 TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT 232
#> 3 GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG 245
#> 4 GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG 247
#> X CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA 237
We generate 1 SMRT cell of PacBio sequencing and 1 lane of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system as before, except using the new haps
object instead of ref
:
pacbio(haps, out_prefix = "pacbio", n_reads = 500e3)
illumina(haps, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
read_length = 100)
If we want to save the information for these haplotypes, we can output FASTA files (one per haplotype) or a VCF file:
This would result in the following files being generated: haps__A.fa
, haps__B.fa
, and haps.vcf
. The sample_matrix
argument to write_vcf
states that the first two (i.e., all) haplotypes are from the first (and only) sample. These data could test different assembly strategies, similar to the above section, except that, for diploid organisms, it includes heterozygosity. Increasing the \(\theta\) parameter will result in more heterozygosity, which is an obvious extension of these tests.
Here, I will demonstrate how to generate population-genomic data of a type that might be used to estimate the divergence between two populations. I first use the scrm
package to conduct coalescent simulations that will generate segregating sites for 10 variant haplotypes from the reference genome. Five of the haplotypes are from one population, five from another. The symmetrical migration rate is 100 individuals per generation. I used a recombination rate of 1 (the expected number of recombinations on the locus per \(4 N_0\) generations). I specify the recombination rate and chromosome size for scrm
using the command -r 1 C
for chromosome size C
. I used the population-scaled mutation rate for D. melanogaster, scaled such that the mutation rate (\(\mu\)) is in units of mutations per \(4 N_0\) generations, where \(N_0\) is the population size. For \(N_0\), I’m using the effective population size for D. melanogaster, per evo_rates
. The other code in the chunk below is to convert the output to look like that from the scrm
R package, which can be easily processed by jackalope
.
library(scrm)
# Function to run scrm for one chromosome and format output
one_chrom <- function(.size) {
ssites <- scrm(sprintf("10 1 -t %.4f -r 1 %i -I 2 5 5 100", theta * 4 * N0, .size))
return(ssites$seg_sites[[1]])
}
ssites <- list(seg_sites = lapply(ref$sizes(), one_chrom))
Using the previously created objects for molecular evolution information and the haps_ssites
function, I create haplotypes from the reference genome:
This results in the following set of haplotypes:
#> << haplotypes object >>
#> # Haplotypes: 10
#> # Mutations: 2,553
#>
#> << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#> name chromosome length
#> 2 TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT 232
#> 3 GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG 245
#> 4 GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG 247
#> X CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA 237
For a file of true divergences from the reference genome, the write_vcf
function writes the haplotypes
object to a VCF file:
Lastly, I simulate 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing. In this case, individuals within a population are pooled, and the population sequences are derived from are identified by barcodes.
illumina(haps, out_prefix = "haps_illumina", n_reads = 500e6, paired = TRUE,
read_length = 100, barcodes = c(rep("AACCGCGG", 5),
rep("GGTTATAA", 5)))
The below example instead has each individual haplotype’s reads output to separate FASTQ files:
illumina(haps, out_prefix = "haps_illumina", n_reads = 500e6, paired = TRUE,
read_length = 100, sep_files = TRUE)
The \(F_{ST}\) calculated from the resulting VCF file could then be compared to output from various programs to inform which works best in a particular case. For uncertain population parameters (e.g., migration rates), output from multiple calls to scrm
varying the parameter of interest could be input to the jackalope
pipeline above to identify the conditions under which one program might have an advantage over another.
This section shows how jackalope
can generate haplotypes from a phylogeny, then simulate sequencing data from those haplotypes to test phylogeny reconstruction methods. First, I simulated a random species tree of 10 species, then scaled it to have a maximum tree depth of \(4 N_0\) generations:
Function haps_phylo
organizes and checks the tree
object, and including it with the mutation-type information allowed me to create haplotypes based on this phylogeny:
This results in the following haplotypes
object:
#> << haplotypes object >>
#> # Haplotypes: 10
#> # Mutations: 547
#>
#> << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#> name chromosome length
#> 2 TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT 232
#> 3 GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG 245
#> 4 GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG 247
#> X CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA 237
Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3, where haplotype_barcodes
is a character string that specifies the barcodes for each haplotype. I also wrote the true phylogenetic tree to a NEWICK file.
haplotype_barcodes <- c("CTAGCTTG", "TCGATCCA", "ATACCAAG", "GCGTTGGA",
"CTTCACGG", "TCCTGTAA", "CCTCGGTA", "TTCTAACG",
"CGCTCGTG", "TATCTACA")
illumina(haps, out_prefix = "phylo_tree", seq_sys = "MSv3",
paired = TRUE, read_length = 250, n_reads = 50e6,
barcodes = haplotype_barcodes)
ape::write.tree(tree, "true.tree")
The true phylogenetic tree would then be compared to the final tree output from the program(s) the user chooses to test.
Similar to the section above, the ultimate goal here is to test phylogeny reconstruction methods. The difference in this section is that instead of using a single, straightforward phylogeny, I use multiple gene trees per chromosome. In the species used in these simulations, species 1 diverged from 2 and 3 at \(t = 1.0\), where \(t\) indicates time into the past and is in units of \(4 N_0\) generations. Species 2 and 3 diverged at \(t = 0.5\). I assume a recombination rate of \(1 / (4 N_0)\) recombination events per chromosome per generation. There are 4 diploid individuals sampled per species. I used the following code to call scrm
to simulate the gene trees and create the object gtrees
. And like the section using segregating sites, we have to wrap lapply
inside list(trees = ...)
to replicate the data structure from scrm
for jackalope
to use later.
# Run scrm for one chromosome size:
one_chrom <- function(.size) {
sims <- scrm(
paste("24 1",
# Output gene trees:
"-T",
# Recombination:
"-r 1", .size,
# 3 species with no ongoing migration:
"-I 3 8 8 8 0",
# Species 2 derived from 1 at time 1.0:
"-ej 1.0 2 1",
# Species 3 derived from 2 at time 0.5:
"-ej 0.5 3 2"
))
trees <- sims$trees[[1]]
# scrm outputs branch lengths in units of 4*N0 generations, but we want just
# generations:
adjust_tree <- function(.p) {
# Read to phylo object and adjust branch lengths:
.tr <- read.tree(text = .p)
.tr$edge.length <- .tr$edge.length * 4 * N0
# "prefix" from `.p` showing how large the region this gene tree refers to is
prefix <- paste0(strsplit(.p, "\\]")[[1]][1], "]")
# Put back together into NEWICK text
return(paste0(prefix, write.tree(.tr)))
}
trees <- sapply(trees, adjust_tree)
names(trees) <- NULL
return(trees)
}
# For all chromosomes:
gtrees <- list(trees = lapply(ref$sizes(), one_chrom))
We can write the true gene trees using write_gtrees
to an ms
-style file:
The create_haplotypes
function uses these gene trees to create variant haplotypes. As for the other haplotype-creation methods, function haps_gtrees
checks and organizes information from the gtrees
object.
This results in the following haplotypes
object:
#> << haplotypes object >>
#> # Haplotypes: 24
#> # Mutations: 2,782
#>
#> << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#> name chromosome length
#> 2 TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT 232
#> 3 GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG 245
#> 4 GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG 247
#> X CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA 237
To store mutation information by diploid sample, the write_vcf
function writes the haplotypes
object to a VCF file. It assigns every other haplotype to a new diploid sample using a matrix for the sample_matrix
argument:
write_vcf(haps, out_prefix = "hap_gtrees",
sample_matrix = matrix(1:haps$n_haps(), ncol = 2, byrow = TRUE))
Next I generated data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3, with barcodes in the object haplotype_barcodes
.
# 2 of each barcode bc it's diploid
haplotype_barcodes <- rep(c("TCGCCTTA", "CTAGTACG", "TTCTGCCT", "GCTCAGGA", "AGGAGTCC",
"CATGCCTA", "GTAGAGAG", "CCTCTCTG", "AGCGTAGC", "CAGCCTCG",
"TGCCTCTT", "TCCTCTAC"), each = 2)
illumina(haps, out_prefix = "phylo_gtrees", seq_sys = "MSv3",
read_length = 250, n_reads = 50e6, paired = TRUE,
barcodes = haplotype_barcodes)
Topologies of the gene trees would then be compared to final phylogenies output from software the user is interested in testing. Varying recombination rates or adding gene flow after separation of species would be natural extensions of these simulations.