Sample modifiers in pepr: derive

Michal Stolarczyk

2020-06-03

Learn derived attributes in pepr

This vignette will show you how and why to use the derived attributes functionality of the pepr package.

Problem/Goal

The example below demonstrates how to use derived attributes to flexibly define sample attributes, here the file_path column of the sample_table.csv file, so that they match the file names in your project. Consider the following sample table for reference:

sample_name  protocol  organism  time  file_path
pig_0h       RRBS      pig       0     data/lab/project/pig_0h.fastq
pig_1h       RRBS      pig       1     data/lab/project/pig_1h.fastq
frog_0h      RRBS      frog      0     data/lab/project/frog_0h.fastq
frog_1h      RRBS      frog      1     data/lab/project/frog_1h.fastq

Solution

As the name suggests, the values of the specified attributes (here: file_path) can be derived from other attributes. How this is done is stated explicitly in the project_config.yaml file (presented below). The name of the derived column is set in the sample_modifiers.derive.attributes key, and the patterns used to construct its values are set in sample_modifiers.derive.sources. Note that each key under sources (here: source1 and source2) has to exactly match the value used in the file_path column of the modified sample_table.csv (presented below).

   pep_version: 2.0.0
   sample_table: sample_table.csv
   output_dir: $HOME/hello_looper_results
   sample_modifiers:
      derive:
          attributes: file_path
          sources:
              source1: $HOME/data/lab/project/{organism}_{time}h.fastq
              source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq

Let’s modify the original sample_table.csv file so that each value in the derived column (file_path) names the appropriate data source from project_config.yaml:

sample_name  protocol  organism  time  file_path
pig_0h       RRBS      pig       0     source1
pig_1h       RRBS      pig       1     source1
frog_0h      RRBS      frog      0     source1
frog_1h      RRBS      frog      1     source1
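
To make the substitution concrete, here is a minimal base R sketch of what deriving file_path amounts to. It does not use pepr and is not pepr's implementation. The helper fill_template is hypothetical, and the $HOME prefix of source1 is dropped so that the result matches the original table.

```r
# Sample table whose derived column holds source keys (as in the table above).
samples <- data.frame(
  sample_name = c("pig_0h", "pig_1h", "frog_0h", "frog_1h"),
  protocol = "RRBS",
  organism = c("pig", "pig", "frog", "frog"),
  time = c(0, 1, 0, 1),
  file_path = "source1",
  stringsAsFactors = FALSE
)

# Source templates mirroring sample_modifiers.derive.sources
# (assumption: $HOME prefix omitted for brevity).
sources <- list(source1 = "data/lab/project/{organism}_{time}h.fastq")

# Hypothetical helper: replace every {attribute} placeholder with the
# value of that attribute in the given one-row data frame.
fill_template <- function(template, row) {
  for (col in names(row)) {
    placeholder <- paste0("{", col, "}")
    template <- gsub(placeholder, as.character(row[[col]]), template, fixed = TRUE)
  }
  template
}

# Look up each sample's template by its source key and fill it in.
samples$file_path <- vapply(seq_len(nrow(samples)), function(i) {
  row <- samples[i, , drop = FALSE]
  fill_template(sources[[row$file_path]], row)
}, character(1))

print(samples$file_path)
```

Each sample ends up with a path built from its own organism and time values, so the sample table only needs to name a source rather than spell out every path.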

Code

Load pepr and read in the project metadata by specifying the path to the project_config.yaml:
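
A minimal sketch of this step, assuming project_config.yaml and sample_table.csv sit in the current working directory (the path is an assumption, so adjust it to your own project):

```r
library(pepr)

# Assumption: the config file is in the working directory;
# replace this with the path to your own project_config.yaml.
config_path <- "project_config.yaml"

# Project() reads the config and the sample table it points to,
# applying the sample modifiers declared in the config.
p <- Project(file = config_path)
```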

And inspect it:
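
One way to inspect the processed samples is pepr's sampleTable() accessor. The sketch below recreates the project object so that it runs on its own. As before, the config path is an assumption.

```r
library(pepr)

# Recreate the project object (assumed path to the config file).
p <- Project(file = "project_config.yaml")

# sampleTable() returns the sample table after the derive step,
# so file_path should hold full paths rather than source keys.
sampleTable(p)
```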

As you can see, the resulting samples are annotated the same way as if they were read from the original, unwieldy annotation file.

What is more, the p object contains all the information from the project config file (project_config.yaml). Run the following line to explore it:
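
A minimal sketch using pepr's config() accessor. The project object is recreated so the snippet is self-contained, and the config path is again an assumption.

```r
library(pepr)

# Recreate the project object (assumed path to the config file).
p <- Project(file = "project_config.yaml")

# config() returns the parsed project configuration,
# including the sample_modifiers section.
config(p)
```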