Subsample table in pepr

Michal Stolarczyk & Nathan Sheffield

2020-06-03

Learn sample subannotations in pepr

This vignette will show you how and why to use the subsample table functionality of the pepr package.

Problem/Goal

The examples below show how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.

Solutions

Example 1: basic sample subannotation table

This example demonstrates how the sample subannotation functionality is used. Two samples (frog_1 and frog_2) have multiple input files that need merging, while one sample (frog_3) does not. Therefore, frog_3 takes its file value from the sample_table.csv file, while the others get several files from the subsample_table.csv file. In the tables below, the file column of sample_table.csv holds the placeholder value multi, which the subsample table entries replace for frog_1 and frog_2.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   output_dir: $HOME/example_results
  • Sample table:
    sample_name protocol file
    frog_1 anySampleType multi
    frog_2 anySampleType multi
    frog_3 anySampleType multi
  • Subsample table:
    sample_name subsample_name file
    frog_1 sub_a data/frog1a_data.txt
    frog_1 sub_b data/frog1b_data.txt
    frog_1 sub_c data/frog1c_data.txt
    frog_2 sub_a data/frog2a_data.txt
    frog_2 sub_b data/frog2b_data.txt
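
Conceptually, the subsample table is collapsed so that each sample ends up with a vector of files. The base R sketch below illustrates that idea with the rows listed above; it is only an illustration of the expected result, not pepr's internal implementation, and the variable names are made up for this example.

```r
# Rows copied from the subsample table above.
subsamples <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c(
    "data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt",
    "data/frog2a_data.txt", "data/frog2b_data.txt"
  ),
  stringsAsFactors = FALSE
)

# Group the file values by sample: one character vector per sample.
merged_files <- split(subsamples$file, subsamples$sample_name)

print(merged_files$frog_1)
print(merged_files$frog_2)
```

Samples that do not appear in the subsample table (frog_3 here) are simply absent from the grouped result and keep whatever the sample table says.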

Let’s create the Project object and check whether multiple files are present.
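
A minimal sketch of that step, assuming the example PEP is bundled with the installed package under extdata/example_peps-master/example_subtable1 (the directory name example_subtable1 is an assumption; adjust the path if your installation differs):

```r
library(pepr)

# Assumption: the example ships with pepr at this location.
projectConfig1 <- system.file(
  "extdata", "example_peps-master", "example_subtable1", "project_config.yaml",
  package = "pepr"
)

# Build the Project object from the config file.
p1 <- Project(file = projectConfig1)

# For samples with subsamples, each entry of `file` is a vector of paths.
sampleTable(p1)$file
```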

Then inspect the whole table in the p1@samples slot:

sample_name  protocol       file                                                                       subsample_name
frog_1       anySampleType  c("data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt")  c("sub_a", "sub_b", "sub_c")
frog_2       anySampleType  c("data/frog2a_data.txt", "data/frog2b_data.txt")                          c("sub_a", "sub_b")
frog_3       anySampleType  multi                                                                      NULL

You can also access a single subsample by calling the getSubsample method with the appropriate combination of sample_name and subsample_name attributes. Note that this is only possible if the subsample_name column is defined in the subsample_table.csv file.
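
A short sketch of such a call, again assuming the example PEP is bundled under extdata/example_peps-master/example_subtable1 (an assumed path). The sample and subsample names come from the tables above.

```r
library(pepr)

# Assumption: same bundled example location as for the Project created earlier.
p1 <- Project(file = system.file(
  "extdata", "example_peps-master", "example_subtable1", "project_config.yaml",
  package = "pepr"
))

# Select subsample "sub_a" of sample "frog_1".
sub_a <- getSubsample(p1, "frog_1", "sub_a")
sub_a
```

If the sample_name and subsample_name combination does not exist, expect the call to fail rather than return an empty result.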

Example 2: subannotations and derived attributes

This example uses a subsample_table.csv file and a derived attribute to point to files. It is a rather complex example. Notice that the file_id column must be included in the sample_table.csv file and left blank; it is then populated from the subsample_table.csv file for just some of the samples (frog_1 and frog_2) and stays empty for the samples that are not merged.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   output_dir: $HOME/hello_looper_results
   pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}_data.txt
  • Sample annotation table:
    sample_name protocol identifier file
    frog_1 anySampleType frog1 local_files
    frog_2 anySampleType frog2 local_files
    frog_3 anySampleType frog3 local_files_unmerged
    frog_4 anySampleType frog4 local_files_unmerged
  • Sample subannotation table:
    sample_name file_id subsample_name
    frog_1 a a
    frog_1 b b
    frog_1 c c
    frog_2 a a
    frog_2 b b

Let’s load the project config, create the Project object and check whether multiple files are present.
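
A minimal sketch of the loading step. The helper exampleConfig is hypothetical (it is not part of pepr) and assumes the examples are bundled under extdata/example_peps-master/ in the installed package:

```r
library(pepr)

# Hypothetical helper (not part of pepr): locate a bundled example config.
# Assumption: examples live under extdata/example_peps-master/<name>/.
exampleConfig <- function(name) {
  system.file(
    "extdata", "example_peps-master", name, "project_config.yaml",
    package = "pepr"
  )
}

p2 <- Project(file = exampleConfig("example_subtable2"))

# Merged attributes are stored per sample; inspect the populated file_id values.
sampleTable(p2)$file_id
```

The same helper should work for the later examples if they are bundled under analogous names such as example_subtable3 and example_subtable4, which is an assumption.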

Then inspect the whole table in the p2@samples slot:

sample_name  protocol       identifier  file                     file_id          subsample_name
frog_1       anySampleType  frog1       ../data/frog1a_data.txt  c("a", "b", "c")  c("a", "b", "c")
frog_2       anySampleType  frog2       ../data/frog2a_data.txt  c("a", "b")       c("a", "b")
frog_3       anySampleType  frog3       ../data/frog3_data.txt   NULL             NULL
frog_4       anySampleType  frog4       ../data/frog4_data.txt   NULL             NULL

Example 3: subannotations and expansion characters

This example gives exactly the same results as Example 2 but uses a wildcard for frog_2 instead of listing it in the subsample_table.csv file. Since a wildcard and a subannotation can’t be used for the same sample, this requires a second data source class (local_files_unmerged) whose path contains an asterisk (*). The outcome is the same.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   output_dir: $HOME/hello_looper_results
   pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}*_data.txt
  • Sample annotation table:
    sample_name protocol identifier file file_id
    frog_1 anySampleType frog1 local_files NA
    frog_2 anySampleType frog2 local_files_unmerged NA
    frog_3 anySampleType frog3 local_files_unmerged NA
    frog_4 anySampleType frog4 local_files_unmerged NA
  • Sample subannotation table:
    sample_name file_id
    frog_1 a
    frog_1 b
    frog_1 c

Let’s load the project config, create the Project object and check whether multiple files are present.

Then inspect the whole table in the p3@samples slot:

sample_name  protocol       identifier  file                     file_id
frog_1       anySampleType  frog1       ../data/frog1a_data.txt  c("a", "b", "c")
frog_2       anySampleType  frog2       ../data/frog2*_data.txt
frog_3       anySampleType  frog3       ../data/frog3*_data.txt
frog_4       anySampleType  frog4       ../data/frog4*_data.txt
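
In the table above the file values still contain the asterisk, so resolving it to concrete files is left to whatever consumes the path. The base R sketch below shows one way a consumer might do that with Sys.glob; the temporary directory and the files created in it are invented for the demonstration and are not part of the example PEP.

```r
# Hypothetical demo directory holding two files for frog2 and one for frog3.
data_dir <- file.path(tempdir(), "pep_glob_demo")
dir.create(data_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  data_dir, c("frog2a_data.txt", "frog2b_data.txt", "frog3_data.txt")
)))

# A derived path with a wildcard, shaped like the frog_2 value above.
pattern <- file.path(data_dir, "frog2*_data.txt")

# Expand the wildcard to the files that actually exist.
matched <- basename(Sys.glob(pattern))
print(matched)
```

Only the frog2 files match the pattern; the frog3 file is ignored.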

Example 4: subannotations and multiple (separate-class) inputs

Merging applies to inputs of the same class (for example, multiple files for read1). Inputs of different classes (for example, read1 vs. read2) are handled by different attributes (columns). This example shows how to handle paired-end data while also merging within each class.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   output_dir: $HOME/hello_looper_results
   pipeline_interfaces: ../pipeline/pipeline_interface.yaml
  • Sample annotation table:
    sample_name protocol
    frog_1 anySampleType
    frog_2 anySampleType
    frog_3 anySampleType
    frog_4 anySampleType
  • Sample subannotation table:
    sample_name read1 read2
    frog_1 frog1a_data.txt frog1a_data2.txt
    frog_1 frog1b_data.txt frog1b_data2.txt
    frog_1 frog1c_data.txt frog1b_data2.txt

Let’s load the project config, create the Project object and check whether multiple files are present.

Then inspect the whole table in the p4@samples slot:

sample_name  protocol       read1                                                       read2
frog_1       anySampleType  c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt")  c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt")
frog_2       anySampleType  NULL                                                        NULL
frog_3       anySampleType  NULL                                                        NULL
frog_4       anySampleType  NULL                                                        NULL
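
To use the merged paired-end attributes downstream, the two vectors can be paired element by element. The base R sketch below does this for frog_1 with the values shown in the table above. It assumes that the order of elements follows the row order of the subannotation table, and the variable names are illustrative only.

```r
# Values copied from the frog_1 row of the table above.
read1 <- c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt")
read2 <- c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt")

# Each subannotation row contributes one element to each attribute,
# so the two vectors are expected to have the same length.
stopifnot(length(read1) == length(read2))

# One row per read1/read2 pair.
read_pairs <- data.frame(read1 = read1, read2 = read2, stringsAsFactors = FALSE)
print(read_pairs)
```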