pepr
This vignette will show you how and why to use the subsample table functionality of the pepr
package.
For basic information about the PEP concept, visit the project website.
For a broader theoretical description, see the subsample table documentation section.
The examples below demonstrate how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.
This example demonstrates how the sample subannotation functionality is used. In this example, 2 samples have multiple input files that need merging (frog_1 and frog_2), while 1 sample (frog_3) does not. Therefore, frog_3 specifies its file in the sample_table.csv file, while the others leave that field blank and instead specify several files in the subsample_table.csv file.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/example_results
sample_name | protocol | file |
---|---|---|
frog_1 | anySampleType | multi |
frog_2 | anySampleType | multi |
frog_3 | anySampleType | multi |
sample_name | subsample_name | file |
---|---|---|
frog_1 | sub_a | data/frog1a_data.txt |
frog_1 | sub_b | data/frog1b_data.txt |
frog_1 | sub_c | data/frog1c_data.txt |
frog_2 | sub_a | data/frog2a_data.txt |
frog_2 | sub_b | data/frog2b_data.txt |
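Let’s create the Project object and check whether multiple files are present:
library(pepr)
# `branch` selects the bundled example set; the paths printed below indicate "master"
branch = "master"
projectConfig1 = system.file(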
"extdata",
paste0("example_peps-", branch),
"example_subtable1",
"project_config.yaml",
package = "pepr"
)
p1 = Project(projectConfig1)
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpF0yVmb/Rinstbd5d643c109c/pepr/extdata/example_peps-master/example_subtable1/project_config.yaml
# Check the files
p1Samples = sampleTable(p1)
p1Samples$file
#> [[1]]
#> [1] "data/frog1a_data.txt" "data/frog1b_data.txt" "data/frog1c_data.txt"
#>
#> [[2]]
#> [1] "data/frog2a_data.txt" "data/frog2b_data.txt"
#>
#> [[3]]
#> [1] "multi"
# Check the subsample names
p1Samples$subsample_name
#> [[1]]
#> [1] "sub_a" "sub_b" "sub_c"
#>
#> [[2]]
#> [1] "sub_a" "sub_b"
#>
#> [[3]]
#> NULL
And inspect the whole table in the p1@samples slot:
sample_name | protocol | file | subsample_name |
---|---|---|---|
frog_1 | anySampleType | c("data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt") | c("sub_a", "sub_b", "sub_c") |
frog_2 | anySampleType | c("data/frog2a_data.txt", "data/frog2b_data.txt") | c("sub_a", "sub_b") |
frog_3 | anySampleType | multi | NULL |
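The merged table above can be reproduced in spirit with base R alone. The following is a minimal sketch, not pepr's implementation: `merge_column` is a hypothetical helper, and the two data frames are hand-typed copies of the tables shown in this example.

```r
# Toy copies of the two tables from this example (values taken from above).
sample_table <- data.frame(
  sample_name = c("frog_1", "frog_2", "frog_3"),
  protocol = "anySampleType",
  file = "multi",
  stringsAsFactors = FALSE
)
subsample_table <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c("data/frog1a_data.txt", "data/frog1b_data.txt",
           "data/frog1c_data.txt", "data/frog2a_data.txt",
           "data/frog2b_data.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse one subsample column into a per-sample list,
# keeping the sample table's own value when a sample has no subsample rows.
merge_column <- function(samples, subsamples, column) {
  lapply(seq_len(nrow(samples)), function(i) {
    rows <- subsamples$sample_name == samples$sample_name[i]
    if (any(rows)) subsamples[[column]][rows] else samples[[column]][i]
  })
}

merged_files <- merge_column(sample_table, subsample_table, "file")
lengths(merged_files)
```

Each element of `merged_files` holds every file for one sample, which is why the `file` column prints as a list rather than as a plain character vector.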
You can also access a single subsample by calling the getSubsample method with the appropriate sample_name - subsample_name attribute combination. Note that this is only possible if the subsample_name column is defined in the subsample_table.csv file.
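To illustrate why the subsample_name column is required for this kind of lookup, here is a base R sketch of selecting one row by its sample_name and subsample_name pair. `find_subsample` is a hypothetical helper written for illustration only; it is not the getSubsample method and makes no claim about that method's arguments or return value.

```r
# Hand-typed copy of the subsample table from this example.
subsample_table <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c("data/frog1a_data.txt", "data/frog1b_data.txt",
           "data/frog1c_data.txt", "data/frog2a_data.txt",
           "data/frog2b_data.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the single row matching both names.
find_subsample <- function(subsamples, sample_name, subsample_name) {
  # Without a subsample_name column there is nothing to match on.
  if (!"subsample_name" %in% colnames(subsamples)) {
    stop("the subsample table has no 'subsample_name' column")
  }
  hit <- subsamples$sample_name == sample_name &
    subsamples$subsample_name == subsample_name
  if (!any(hit)) {
    stop("no such sample_name / subsample_name combination")
  }
  subsamples[hit, , drop = FALSE]
}

one <- find_subsample(subsample_table, "frog_1", "sub_b")
one$file
```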
This example uses a subsample_table.csv file and a derived attribute to point to files. This is a rather complex example. Notice that we must include the file_id column in the sample_table.csv file and leave it blank; it is then populated for just some of the samples (frog_1 and frog_2) in the subsample_table.csv file, but is left empty for the samples that are not merged.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/hello_looper_results
pipeline_interfaces: ../pipeline/pipeline_interface.yaml
sample_modifiers:
  derive:
    attributes: file
    sources:
      local_files: ../data/{identifier}{file_id}_data.txt
      local_files_unmerged: ../data/{identifier}_data.txt
sample_name | protocol | identifier | file |
---|---|---|---|
frog_1 | anySampleType | frog1 | local_files |
frog_2 | anySampleType | frog2 | local_files |
frog_3 | anySampleType | frog3 | local_files_unmerged |
frog_4 | anySampleType | frog4 | local_files_unmerged |
sample_name | file_id | subsample_name |
---|---|---|
frog_1 | a | a |
frog_1 | b | b |
frog_1 | c | c |
frog_2 | a | a |
frog_2 | b | b |
projectConfig2 = system.file(
"extdata",
paste0("example_peps-", branch),
"example_subtable2",
"project_config.yaml",
package = "pepr"
)
p2 = Project(projectConfig2)
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpF0yVmb/Rinstbd5d643c109c/pepr/extdata/example_peps-master/example_subtable2/project_config.yaml
#> Warning in readLines(con): incomplete final line found on '/private/var/folders/
#> 3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpF0yVmb/Rinstbd5d643c109c/pepr/extdata/
#> example_peps-master/example_subtable2/project_config.yaml'
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows to
#> replace 1 rows
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 2 rows to
#> replace 1 rows
# Check the files
p2Samples = sampleTable(p2)
p2Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#>
#> [[2]]
#> [1] "../data/frog2a_data.txt"
#>
#> [[3]]
#> [1] "../data/frog3_data.txt"
#>
#> [[4]]
#> [1] "../data/frog4_data.txt"
And inspect the whole table in the p2@samples slot:
sample_name | protocol | identifier | file | file_id | subsample_name |
---|---|---|---|---|---|
frog_1 | anySampleType | frog1 | ../data/frog1a_data.txt | c("a", "b", "c") | c("a", "b", "c") |
frog_2 | anySampleType | frog2 | ../data/frog2a_data.txt | c("a", "b") | c("a", "b") |
frog_3 | anySampleType | frog3 | ../data/frog3_data.txt | NULL | NULL |
frog_4 | anySampleType | frog4 | ../data/frog4_data.txt | NULL | NULL |
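The derived source templates in this example contain {identifier} and {file_id} placeholders. As a rough illustration of how such a template could be filled in, the base R sketch below expands one path per file_id value. `expand_source` is a hypothetical helper and is not how pepr resolves templates; in particular, it returns every expanded path, which is not necessarily what the pepr output printed above shows.

```r
# Hypothetical helper: substitute {identifier} and {file_id} in a template,
# producing one path for each supplied file_id.
expand_source <- function(template, identifier, file_ids) {
  vapply(file_ids, function(id) {
    # fixed = TRUE treats the braces literally instead of as regex syntax.
    path <- gsub("{identifier}", identifier, template, fixed = TRUE)
    gsub("{file_id}", id, path, fixed = TRUE)
  }, character(1), USE.NAMES = FALSE)
}

template <- "../data/{identifier}{file_id}_data.txt"
frog1_paths <- expand_source(template, "frog1", c("a", "b", "c"))
frog1_paths
```

A sample without subsample rows would use the local_files_unmerged template instead, which has no {file_id} placeholder to fill.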
This example gives the exact same results as the previous example, but in this case uses a wildcard for frog_2 instead of including it in the subsample_table.csv file. Since we can’t use a wildcard and a subannotation for the same sample, this necessitates specifying a second data source class (local_files_unmerged) that uses an asterisk (*). The outcome is the same.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/hello_looper_results
pipeline_interfaces: ../pipeline/pipeline_interface.yaml
sample_modifiers:
  derive:
    attributes: file
    sources:
      local_files: ../data/{identifier}{file_id}_data.txt
      local_files_unmerged: ../data/{identifier}*_data.txt
sample_name | protocol | identifier | file | file_id |
---|---|---|---|---|
frog_1 | anySampleType | frog1 | local_files | NA |
frog_2 | anySampleType | frog2 | local_files_unmerged | NA |
frog_3 | anySampleType | frog3 | local_files_unmerged | NA |
frog_4 | anySampleType | frog4 | local_files_unmerged | NA |
sample_name | file_id |
---|---|
frog_1 | a |
frog_1 | b |
frog_1 | c |
projectConfig3 = system.file(
"extdata",
paste0("example_peps-", branch),
"example_subtable3",
"project_config.yaml",
package = "pepr"
)
p3 = Project(projectConfig3)
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpF0yVmb/Rinstbd5d643c109c/pepr/extdata/example_peps-master/example_subtable3/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows to
#> replace 1 rows
# Check the files
p3Samples = sampleTable(p3)
p3Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#>
#> [[2]]
#> [1] "../data/frog2*_data.txt"
#>
#> [[3]]
#> [1] "../data/frog3*_data.txt"
#>
#> [[4]]
#> [1] "../data/frog4*_data.txt"
And inspect the whole table in the p3@samples slot:
sample_name | protocol | identifier | file | file_id |
---|---|---|---|---|
frog_1 | anySampleType | frog1 | ../data/frog1a_data.txt | c("a", "b", "c") |
frog_2 | anySampleType | frog2 | ../data/frog2*_data.txt | |
frog_3 | anySampleType | frog3 | ../data/frog3*_data.txt | |
frog_4 | anySampleType | frog4 | ../data/frog4*_data.txt | |
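The file values above still contain the literal asterisk, so the pattern has to be resolved against the file system by whatever consumes the path. A minimal sketch of that step using base R's `Sys.glob` follows; the scratch directory and the two empty files are created only for this demonstration and are assumptions, not part of the example PEP.

```r
# Create a scratch directory holding two files that match the frog2 pattern.
scratch <- file.path(tempdir(), "wildcard_demo")
dir.create(scratch, showWarnings = FALSE)
invisible(file.create(file.path(scratch, c("frog2a_data.txt", "frog2b_data.txt"))))

# Expand the wildcard pattern into the files that actually exist.
pattern <- file.path(scratch, "frog2*_data.txt")
matches <- Sys.glob(pattern)
basename(matches)
```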
Merging is for inputs of the same class (for example, multiple files for read1). Inputs of different classes (for example, read1 vs. read2) are handled by different attributes (or columns). This example shows how to handle paired-end data while also merging multiple files within each read attribute.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/hello_looper_results
pipeline_interfaces: ../pipeline/pipeline_interface.yaml
sample_name | protocol |
---|---|
frog_1 | anySampleType |
frog_2 | anySampleType |
frog_3 | anySampleType |
frog_4 | anySampleType |
sample_name | read1 | read2 |
---|---|---|
frog_1 | frog1a_data.txt | frog1a_data2.txt |
frog_1 | frog1b_data.txt | frog1b_data2.txt |
frog_1 | frog1c_data.txt | frog1b_data2.txt |
projectConfig4 = system.file(
"extdata",
paste0("example_peps-", branch),
"example_subtable4",
"project_config.yaml",
package = "pepr"
)
p4 = Project(projectConfig4)
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpF0yVmb/Rinstbd5d643c109c/pepr/extdata/example_peps-master/example_subtable4/project_config.yaml
# Check the read1 and read2 columns
p4Samples = sampleTable(p4)
p4Samples$read1
#> [[1]]
#> [1] "frog1a_data.txt" "frog1b_data.txt" "frog1c_data.txt"
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> NULL
p4Samples$read2
#> [[1]]
#> [1] "frog1a_data2.txt" "frog1b_data2.txt" "frog1b_data2.txt"
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> NULL
And inspect the whole table in the p4@samples slot:
sample_name | protocol | read1 | read2 |
---|---|---|---|
frog_1 | anySampleType | c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt") | c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt") |
frog_2 | anySampleType | NULL | NULL |
frog_3 | anySampleType | NULL | NULL |
frog_4 | anySampleType | NULL | NULL |
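Because read1 and read2 are merged independently, the two vectors for a sample line up by position. The base R sketch below pairs them row by row for frog_1; the vectors are hand-typed from the output above, and the `pairs` data frame is only an illustration of how a downstream step might consume them.

```r
# Merged values for frog_1, copied from the output above.
read1 <- c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt")
read2 <- c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt")

# Pairing by position only makes sense when both vectors have equal length.
stopifnot(length(read1) == length(read2))

# One row per subsample: the i-th read1 file goes with the i-th read2 file.
pairs <- data.frame(read1 = read1, read2 = read2, stringsAsFactors = FALSE)
pairs
```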