Intro to jackalope

This document provides brief examples of how jackalope can be used to generate sequencing data that can inform some common sampling decisions for HTS studies.

Generating reference genome

Here, I’ll show how to generate a reference genome based on an existing assembly or via simulated DNA sequences.

To show how an existing assembly is read in jackalope, the code below processes the assembly for Drosophila melanogaster (version 6.27) downloaded from https://flybase.org. It reads the compressed FASTA file, filters out scaffolds by using a size threshold, removes the Y chromosome (to avoid having both an X and Y in the same haplotype), and merges the left and right arms of chromosomes 2 and 3. It lastly sets the names of chromosomes 2 and 3 to be "2" and "3", respectively.

ref <- read_fasta("dmel-6.27.fasta.gz", cut_names = TRUE)
ref$filter_chroms(1e6, method = "size")
ref$rm_chroms("Y")
ref$merge_chroms(c("2L", "2R"))
ref$merge_chroms(c("3L", "3R"))
names <- ref$chrom_names()
names[grepl("^2", names)] <- "2"
names[grepl("^3", names)] <- "3"
ref$set_names(names)

For the rest of the document, I will use a simulated genome for simplicity (and so that I don’t have to store the D. melanogaster genome inside this package). Here is how I simulated a genome of size \(\sim 1\) kb split among 4 chromosomes, with the same names as the chromosomes in the D. melanogaster genome:

ref <- create_genome(n_chroms = 4,
                     len_mean = 1e3 / 4,
                     len_sd = 10)
ref$set_names(c(2:4, "X"))

This resulted in the following ref_genome object:

#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#>   name                             chromosome                             length
#> 2          TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT       232
#> 3          GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG       245
#> 4          GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG       247
#> X          CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA       237

For the examples below, we’ll pretend this is our D. melanogaster genome.

Molecular evolution information

To generate variant haplotypes based on our reference genome, we need molecular evolution information (unless passing a Variant Call Format, or VCF, file). For molecular-evolution information, I used the JC69 model for simplicity. Mutation rates were chosen from the evo_rates object present inside jackalope, which stores Table 1 from Sung et al. (2016). The substitution and indel rates were the values from evo_rates specific to D. melanogaster. The mu parameter to sub_JC69 function was set to NULL because the default behavior of all substitution-model functions in jackalope is to scale rate matrices such that branch lengths are in units of substitutions per site. In this case, I don’t want to scale rate matrices because our branch lengths will be in generations. Zhang and Gerstein (2003) found a 1 to 2.90 ratio of insertions to deletions, and described relative rates of indels of various sizes using a power-law relationship. To approximate the distributions they described, relative rates were derived from a Lavalette distribution with \(L = 60\) and \(a = 1.60\) for insertions and \(a = 1.51\) for deletions. The \(\theta\) parameter is the population-scaled mutation rate, which we can get directly from evo_rates for D. melanogaster. Lastly, \(N_0\) is the effective population size for D. melanogaster, which is used in a few instances below because scrm outputs branch lengths in units of \(4 N_0\) generations.

sub_rate <- evo_rates$subs[evo_rates$species == "Drosophila melanogaster"]
indel_rate <- evo_rates$indels[evo_rates$species == "Drosophila melanogaster"]
# Because both are in units of 10^-10 events per site per generation:
sub_rate <- sub_rate * 1e-10
indel_rate <- indel_rate * 1e-10

sub <- sub_JC69(lambda = sub_rate, mu = NULL)
ins <- indels(rate = indel_rate, max_length = 60,a = 1.60)
del <- indels(rate = indel_rate, max_length = 60, a = 1.51)

theta <- evo_rates[evo_rates$species == "Drosophila melanogaster","theta_s"]

# Originally in units of 1e6 individuals
N0 <- evo_rates[evo_rates$species == "Drosophila melanogaster", "Ne"] * 1e6

When generating any haplotypes below, these objects will be used inside create_haplotypes to specify molecular evolution information.

Assembling a genome

Based on a reference

The examples here produce FASTQ files from the known reference assembly that could test strategies for how to assemble a similar genome using HTS data.

The first strategy is to use only PacBio sequencing. The PacBio Sequel system produces up to 500,000 reads per Single Molecule, Real-Time (SMRT) cell, so you could run the following for two cells (with the file pacbio_R1.fq as output):

pacbio(ref, out_prefix = "pacbio", n_reads = 2 * 500e3)

An alternative, hybrid strategy uses 1 SMRT cell of PacBio sequencing and 1 lane (\(\sim 500\) million reads) of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system (the default Illumina system in jackalope):

pacbio(ref, out_prefix = "pacbio", n_reads = 500e3)
illumina(ref, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100)

The last strategy combines 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing with 1 flow cell of \(2 \times 250\)bp mate-pair sequencing on an Illumina MiSeq v3. The mate-pair sequencing uses longer fragments (defaults are mean of 400 and standard deviation of 100) to better cover highly repetitive regions.

illumina(ref, out_prefix = "ill_pe", n_reads = 500e6, paired = TRUE,
         read_length = 100)
illumina(ref, out_prefix = "ill_mp", seq_sys = "MSv3",
         read_length = 250, n_reads = 50e6, matepair = TRUE, 
         frag_mean = 3000, frag_sd = 500)

These data could then be used to compare genome assembly performance between the strategies above, or between programs within a given strategy. Extensions of these tests include adjusting sequencing depth, sequencing platform, and error rates.

Based on a diploid individual

For diploid species, scientists won’t be sampling a haploid reference, and heterozygosity can be a real problem for the assembly process. So below I’ll show you how to simulate a diploid individual and create reads based on that. I’ll just show the hybrid assembly strategy, but the others simply differ in the sequencing step as shown above.

First, we’ll simulate two haplotypes based on the \(\theta\) (population-scaled mutation rate) and the other molecular evolution information for D. melanogaster we got from the evo_rates object earlier. I’m also renaming the haplotypes.

haps <- create_haplotypes(ref, haps_theta(theta = theta, n_haps = 2), 
                          sub, ins, del)
haps$set_names(c("A", "B"))

This results in the following haplotypes object:

#>                               << haplotypes object >>
#> # Haplotypes: 2
#> # Mutations: 21
#> 
#>                           << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#>   name                             chromosome                             length
#> 2          TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT       232
#> 3          GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG       245
#> 4          GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG       247
#> X          CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA       237

We generate 1 SMRT cell of PacBio sequencing and 1 lane of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system as before, except using the new haps object instead of ref:

pacbio(haps, out_prefix = "pacbio", n_reads = 500e3)
illumina(haps, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100)

If we want to save the information for these haplotypes, we can output FASTA files (one per haplotype) or a VCF file:

write_fasta(haps, "haps")
write_vcf(haps, "haps", sample_matrix = cbind(1, 2))

This would result in the following files being generated: haps__A.fa, haps__B.fa, and haps.vcf. The sample_matrix argument to write_vcf states that the first two (i.e., all) haplotypes are from the first (and only) sample. These data could test different assembly strategies, similar to the above section, except that, for diploid organisms, it includes heterozygosity. Increasing the \(\theta\) parameter will result in more heterozygosity, which is an obvious extension of these tests.

Estimating divergence between populations

Here, I will demonstrate how to generate population-genomic data of a type that might be used to estimate the divergence between two populations. I first use the scrm package to conduct coalescent simulations that will generate segregating sites for 10 variant haplotypes from the reference genome. Five of the haplotypes are from one population, five from another. The symmetrical migration rate is 100 individuals per generation. I used a recombination rate of 1 (the expected number of recombinations on the locus per \(4 N_0\) generations). I specify the recombination rate and chromosome size for scrm using the command -r 1 C for chromosome size C. I used the population-scaled mutation rate for D. melanogaster, scaled such that the mutation rate (\(\mu\)) is in units of mutations per \(4 N_0\) generations, where \(N_0\) is the population size. For \(N_0\), I’m using the effective population size for D. melanogaster, per evo_rates. The other code in the chunk below is to convert the output to look like that from the scrm R package, which can be easily processed by jackalope.

library(scrm)
# Function to run scrm for one chromosome and format output
one_chrom <- function(.size) {
    ssites <- scrm(sprintf("10 1 -t %.4f -r 1 %i -I 2 5 5 100", theta * 4 * N0, .size))
    return(ssites$seg_sites[[1]])
}
ssites <- list(seg_sites = lapply(ref$sizes(), one_chrom))

Using the previously created objects for molecular evolution information and the haps_ssites function, I create haplotypes from the reference genome:

haps <- create_haplotypes(ref, haps_ssites(ssites), sub, ins, del)

This results in the following set of haplotypes:

#>                               << haplotypes object >>
#> # Haplotypes: 10
#> # Mutations: 2,553
#> 
#>                           << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#>   name                             chromosome                             length
#> 2          TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT       232
#> 3          GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG       245
#> 4          GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG       247
#> X          CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA       237

For a file of true divergences from the reference genome, the write_vcf function writes the haplotypes object to a VCF file:

write_vcf(haps, "haplotypes")

Lastly, I simulate 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing. In this case, individuals within a population are pooled, and the population sequences are derived from are identified by barcodes.

illumina(haps, out_prefix = "haps_illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100, barcodes = c(rep("AACCGCGG", 5), 
                                         rep("GGTTATAA", 5)))

The below example instead has each individual haplotype’s reads output to separate FASTQ files:

illumina(haps, out_prefix = "haps_illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100, sep_files = TRUE)

The \(F_{ST}\) calculated from the resulting VCF file could then be compared to output from various programs to inform which works best in a particular case. For uncertain population parameters (e.g., migration rates), output from multiple calls to scrm varying the parameter of interest could be input to the jackalope pipeline above to identify the conditions under which one program might have an advantage over another.

Constructing a phylogeny

From one phylogenetic tree

This section shows how jackalope can generate haplotypes from a phylogeny, then simulate sequencing data from those haplotypes to test phylogeny reconstruction methods. First, I simulated a random species tree of 10 species, then scaled it to have a maximum tree depth of \(4 N_0\) generations:

tree <- rcoal(10)
tree$edge.length <- 4 * N0 * tree$edge.length / max(node.depth.edgelength(tree))

Function haps_phylo organizes and checks the tree object, and including it with the mutation-type information allowed me to create haplotypes based on this phylogeny:

haps <- create_haplotypes(ref, haps_phylo(tree), sub, ins, del)

This results in the following haplotypes object:

#>                               << haplotypes object >>
#> # Haplotypes: 10
#> # Mutations: 547
#> 
#>                           << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#>   name                             chromosome                             length
#> 2          TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT       232
#> 3          GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG       245
#> 4          GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG       247
#> X          CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA       237

Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3, where haplotype_barcodes is a character string that specifies the barcodes for each haplotype. I also wrote the true phylogenetic tree to a NEWICK file.

haplotype_barcodes <- c("CTAGCTTG", "TCGATCCA", "ATACCAAG", "GCGTTGGA",
                        "CTTCACGG", "TCCTGTAA", "CCTCGGTA", "TTCTAACG", 
                        "CGCTCGTG", "TATCTACA")
illumina(haps, out_prefix = "phylo_tree", seq_sys = "MSv3",
         paired = TRUE, read_length = 250, n_reads = 50e6,
         barcodes = haplotype_barcodes)
ape::write.tree(tree, "true.tree")

The true phylogenetic tree would then be compared to the final tree output from the program(s) the user chooses to test.

From gene trees

Similar to the section above, the ultimate goal here is to test phylogeny reconstruction methods. The difference in this section is that instead of using a single, straightforward phylogeny, I use multiple gene trees per chromosome. In the species used in these simulations, species 1 diverged from 2 and 3 at \(t = 1.0\), where \(t\) indicates time into the past and is in units of \(4 N_0\) generations. Species 2 and 3 diverged at \(t = 0.5\). I assume a recombination rate of \(1 / (4 N_0)\) recombination events per chromosome per generation. There are 4 diploid individuals sampled per species. I used the following code to call scrm to simulate the gene trees and create the object gtrees. And like the section using segregating sites, we have to wrap lapply inside list(trees = ...) to replicate the data structure from scrm for jackalope to use later.

# Run scrm for one chromosome size:
one_chrom <- function(.size) {
    sims <- scrm(
        paste("24 1",
              # Output gene trees:
              "-T",
              # Recombination:
              "-r 1", .size,
              # 3 species with no ongoing migration:
              "-I 3 8 8 8 0",
              # Species 2 derived from 1 at time 1.0:
              "-ej 1.0 2 1",  
              # Species 3 derived from 2 at time 0.5:
              "-ej 0.5 3 2"
        ))
    trees <- sims$trees[[1]]
    # scrm outputs branch lengths in units of 4*N0 generations, but we want just
    # generations:
    adjust_tree <- function(.p) {
        # Read to phylo object and adjust branch lengths:
        .tr <- read.tree(text = .p)
        .tr$edge.length <- .tr$edge.length * 4 * N0
        # "prefix" from `.p` showing how large the region this gene tree refers to is
        prefix <- paste0(strsplit(.p, "\\]")[[1]][1], "]")
        # Put back together into NEWICK text
        return(paste0(prefix, write.tree(.tr)))
    }
    trees <- sapply(trees, adjust_tree)
    names(trees) <- NULL
    return(trees)
}
# For all chromosomes:
gtrees <- list(trees = lapply(ref$sizes(), one_chrom))

We can write the true gene trees using write_gtrees to an ms-style file:

write_gtrees(haps_gtrees(gtrees), "gtrees")

The create_haplotypes function uses these gene trees to create variant haplotypes. As for the other haplotype-creation methods, function haps_gtrees checks and organizes information from the gtrees object.

haps <- create_haplotypes(ref, haps_gtrees(gtrees),
                          sub, ins, del)

This results in the following haplotypes object:

#>                               << haplotypes object >>
#> # Haplotypes: 24
#> # Mutations: 2,782
#> 
#>                           << Reference genome info: >>
#> < Set of 4 chromosomes >
#> # Total size: 961 bp
#>   name                             chromosome                             length
#> 2          TCTCACGGAAAACGGGAAGTTTTTGGCC...CAAGGCTTCGACGTAGAGCCGGTGGATT       232
#> 3          GCTCCGGCAGCATCGGCTTCCGTGATCA...GGACTGCGGGGGTTCTAACGGGCGTCGG       245
#> 4          GAATGCTCGCATGCGAACACTGAACGTT...GAATGCCAGTAAGCTCGAGCGTATAAGG       247
#> X          CCCCATTTCAGCAACGAAGTCGATCGCA...GTTCAATCACGCGAAAAAAAACACGTCA       237

To store mutation information by diploid sample, the write_vcf function writes the haplotypes object to a VCF file. It assigns every other haplotype to a new diploid sample using a matrix for the sample_matrix argument:

write_vcf(haps, out_prefix = "hap_gtrees",
          sample_matrix = matrix(1:haps$n_haps(), ncol = 2, byrow = TRUE))

Next I generated data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3, with barcodes in the object haplotype_barcodes.

# 2 of each barcode bc it's diploid
haplotype_barcodes <- rep(c("TCGCCTTA", "CTAGTACG", "TTCTGCCT", "GCTCAGGA", "AGGAGTCC", 
                            "CATGCCTA", "GTAGAGAG", "CCTCTCTG", "AGCGTAGC", "CAGCCTCG", 
                            "TGCCTCTT", "TCCTCTAC"), each = 2)
illumina(haps, out_prefix = "phylo_gtrees", seq_sys = "MSv3",
         read_length = 250, n_reads = 50e6, paired = TRUE, 
         barcodes = haplotype_barcodes)

Topologies of the gene trees would then be compared to final phylogenies output from software the user is interested in testing. Varying recombination rates or adding gene flow after separation of species would be natural extensions of these simulations.