P3BSseq / Wiki / Home

Home

Authors:

P3BSseq: Parallel processing pipeline software for automatic analysis
of bisulfite sequencing data

Phuc-Loi Luu1, Daniela Gerovska2, Mikel Arrospide-Elgarresta2, Sugoi Retegi-Carrión2,
Hans R. Schöler3,4 and Marcos J. Araúzo-Bravo1,2,5

Computational Biology and Bioinformatics Group,
Max Planck Institute for Molecular Biomedicine, 48149 Münster, Germany;
Group of Computational Biology and Systems Biomedicine,
Biodonostia Health Research Institute, 20014 San Sebastián, Spain
Department of Cell and Developmental Biology,
Max Planck Institute for Molecular Biomedicine, 48149 Münster, Germany
Medical Faculty, University of Münster, 48149 Münster, Germany
IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain

mararabra@yahoo.co.uk

===============================================================================
EXPERIMENTAL DESIGN
===============================================================================
There are two main executable parts in the P3BSseq software:
One to download and index the reference genome (genome_indexing),
and another to perform the methylome mapping.
Depending on the combination of type of methylome (WGBS or RRBS)
and the type of reads ending (single-end or paired-end) there are four
different mappers (single_end_WGBS, paired_end_WGBS, single_end_RRBS,
paired_end_RRBS).

In a typical application, the user has to download and install the software,
perform a preprocessing stage of automatic downloading and indexing of the
reference genome using genome_indexing and process the data using one of the
four ad hoc developed mappers
(single_end_WGBS, paired_end_WGBS, single_end_RRBS, paired_end_RRBS).

The run of P3BSseq can be summed up as follows:
Decide where to place the folders for the P3BSseq software,
the reference genome, the input data and the results,
and create the corresponding folders.
Download and index the reference genome of a species before processing
methylomics data from such species. Once the reference genome of a species
has been processed, it is not necessary do it again.
Set the corresponding parameter file before running any executable files.

For each application the user only needs to fill in with a text editor certain
parameters in two text files:
A parameter file to define the reference genome (genome_indexing.txt).
A parameter file with default optimized processing parameters associated with
each type of methylation data
(single_end_WGBS.txt, paired_end_WGBS.txt, single_end_RRBS.txt or paired_end_RRBS.txt).

All parameter files are under the PARAMETERS folder of the distribution.
Providing an interface through parameter files facilitates the tracking of the changes
of all the parameters in a self documented way.
In a project involving the precessing of several datasets, the user can write a
script that first creates the parameter files under the
PARAMETERS folder, and then run the corresponding application of P3BSseq.

In a preprocessing stage, the reference genome and its annotation
(including CpG islands)
are automatically downloaded from the UCSC Genome Browser
(genome.ucsc.edu) and indexed.
Thereafter, all processing stages until the final report generation are fully automated.

===============================================================================
LIMITATIONS
===============================================================================
The software has been developed for Linux/Unix based operating system with at
least 6.5GB of RAM and ~450GB of hard disk to process a sample with 500 million reads of length 85bp.
P3BSseq is not designed to work for data in the color space and the input data has to be in fastq format.
If the user has the data in another format, she/he has to convert it to fastq prior use of P3BSseq.
All data files associated to a project have to be of the same reference genome,
and the same combination of methylome type (WGBS or RRBS),
reads ending type (single-end or paired-end) and directional library (directional or non-directional).

===============================================================================
MATERIALS
===============================================================================

Equipment

A computer running a Linux/Unix based operating system with at
least 6.5GB of RAM memory and ~450GB of available hard disk space is necessary.
The software itself requires 80MB.
Additionally, each mammalian reference genome needs ~13GBs.
A typical mammalian methylome dataset will need ~28GB and its processed output ~400GB.
After processing the user can delete some results folders designed for
advanced applications and keep ~200GB.

Equipment setup

To setup the system to process DNA methylation data with P3BSseq,
the user has to create the folder path for the P3BSseq software and install the software.
Additionally, to run P3BSseq the user must create other
three folders with convenient names and locations for the reference genome,
the input data and the results:

REFERENCE_GENOME: hosts the different reference genomes,
their annotation and indexing.
Once a reference genome has been downloaded and indexed,
it is not necessary to process it anymore. Any time the user requires a new reference genome
(for a different species of for a new version of a genome of an already used species)
a new subfolder will be automatically created in the REFERENCE_GENOME folder to host the new reference genome.

INPUT_FILES: hosts the raw reads in fastq format of each methylomics sample to be processed.

OUTPUTS: hosts the output subfolders and files associated with each processed methylomics sample.

After installing P3BSseq, the main binary files of P3BSseq are in the folder P3BSseq.
That folder there has also a subfolder named PARAMETERS, that hosts the five text parameter files,
one to configure the indexing of the genome (genome_indexing.txt),
and other four to configure the four mappers
(single_end_WGBS.txt, paired_end_WGBS.txt, single_end_RRBS.txt, and paired_end_RRBS.txt)
that the user has to edit to adapt P3BSseq to her/his necessities.
Thus, a running installation of P3BSseq has a five folder structure.
The names of the five main folders can be changed by the user.
To allow the executable files of P3BSseq to find them,
the names of the five main folders have to be provided in the parameter
files of the PARAMETERS folder.

The Java Runtime Environment (JRE) version at least 1.6 has to be installed,
since it is required by the FastQC core package.

Finally, the matplotlibrc file has to be installed in the /etc folder.
The matplotlibrc file is the configuration file of the python 2D plotting library.
It is required by P3BSseq to draw the figures in the final report.

===============================================================================
PROCEDURE
===============================================================================

Software installation

P3BSseq is designed for Linux/Unix operating systems,
and here we explain how to install it on an Ubuntu system.

Download P3BSseq from
https://sourceforge.net/projects/p3bsseq/P3BSseq.zip
Copy the P3BSseq_v*.zip file in a convenient for the user folder.
From here on, we will consider that the user has chosen to install the software in a path path_to_P3BSseq/
Unzip the P3BSseq.zip file
After unzip, two subfolders will be created in the folder path_to_P3BSseq/P3BSseq,
PARAMETERS and P3BSseq.

The PARAMETERS folder contains the optimized parameters as default values in the text files:
genome_indexing.txt,
parameter file to provide the species and the version of the reference genome,
and the local place in which such genome will be downloaded and indexed.
single_end_WGBS.txt, default parameters for processing single-end reads of WGBS data.
paired_end_WGBS.txt, default parameters for processing paired-end reads of WGBS data.
single_end_RRBS.txt, default parameters for processing single-end reads of RRBS data.
paired_end_RRBS.txt, default parameters for processing paired-end reads of RRBS data.

The user interacts with the P3BSseq software by editing these five parameter
files with any text editor such as gedit, vim or vi.

P3BSseq contains, beside the internal core tools,
the main binary executable files through which the user interacts with the software:
genome_indexing: To download and index the reference genome.
single_end_WGBS: To map WGBS with single-end reads.
single_end_RRBS: To map RRBS with single-end reads.
paired_end_WGBS: To map WGBS with paired-end reads.
paired_end_RRBS: To map for RRBS with paired-end reads.

Give executable rights to the executable files:
Open a terminal and type:

chmod +x path_to_P3BSseq/P3BSseq/*

NOTE
5. Ensure that the matplotlibrc file is in the /etc folder.
If the matplotlibrc is not already installed in the /etc folder, the user has two options:
A) To install the whole matplotlib library (see instructions in http://matplotlib.org)
B) Or copy the matplotlib file provided in our software package into the /etc folder.

To copy file matplotlibrc to the /etc folder,
the user should have need administration rights.
Thus, open a terminal and type:

sudo cp path_to_P3BSseq/P3BSseq/matplotlibrc /etc

Because of the use of administration rights (sudo command in case of Ubuntu operating system),
the system will ask for the user password before performing this task.

NOTE
6. Ensure that the Java Runtime Environment (JRE) version at least 1.6 is installed.
JRE is available for a number of different platforms.
(i) Test whether Java is correctly installed as follows.
Open a terminal, type in the command line:

java -version

If the terminal displays a message like this:

java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) Client VM (build 14.3-b01-101, mixed mode)

the Java 1.6 is correctly installed. Otherwise, an error message appears
that means that Java is not instaled or incorrectly installed.

(ii) To install Java
sudo apt-get update
java -version
sudo apt-get install default-jre

Input data file installation

Make a folder to store the raw reads
The user can choose any path name to store the raw reads. For example:
/home/DNAmethylation/INPUT_FILES/

Open a terminal and type

mkdir /home/DNAmethylation/INPUT_FILES/

Download the raw reads files
In case that the user is not processing her/his own data,
she/he can download the data (in fastq format) from a database,
and save it in the INPUT_FILES folder, for example:
/home/DNAmethylation/INPUT_FILES

NOTE
The names of the raw fastq read files have to follow a naming structure:
For single-end files: The file name has the structure basename.fastq.extension.
The basename part can be any string without white space,
dash (_) or dot(.) characters.
The .fastq string has to appear after the basename.
The .extension part can be .gz or .zip for compressed files,
or an empty string for non compressed ones.
An example of such naming format is: SRR094461.fastq.gz

For paired-end files: The paired-end datasets come in paired files.
The file names have the structure
basename_1.fastq.extension and basename_2.fastq.extension.
The paired files have to have the same basename.
The naming rules for the basename and for the extension part are as in the single-end files.
An example of such naming format is the pair SRR097408_1.fastq.gz and SRR097408_2.fastq.gz

Often the methylation data of a project are split into several fastq files.
P3BSseq can process all these files together.
The only requirement for that is to put all fastq files of
the same sample in the same INPUT_FILES folder.
As an example of analysis of a dataset split in several fastq files we chose one of the
datasets we have already processed in Luu et al.(2013).
P3BSseq does not require to uncompress the data files,
the mappers perform the uncompression automatically of .gz and .zip fastq compressed files.

NOTE
P3BSseq tries to process all the fastq data files that exist in the
INPUT_FILES folder associated to a project.
However, since each project can be processed only with one of the four mappers
(single_end_WGBS, paired_end_WGBS, single_end_RRBS and paired_end_RRBS)
all data files in the INPUT_FILES folder have to be of the same reference genome,
and the same combination of methylome type (WGBS or RRBS),
reads ending type (single-end or paired-end) and directional library
(directional or non-directional).

To process data of different nature,
the user should split the data in several projects, each with uniform data type.
It is possible to perform such task by scripting.
First store all the data in another folder,
for example named DATA, and write a script that successively selects uniform
data types and put them in the corresponding INPUT_FILES folder to process them
in the same batch.

Reference genome installation

Create a folder to store the reference genome.
It is possible to use any name, for example:
/home/DNAmethylation/REFERENCE_GENOME

Thus, open a terminal and type

mkdir /home/DNAmethylation/REFERENCE_GENOME

The space required for such folder is ~10GB for the mouse and ~13GB for the human case.
It is not necessary to download the reference genome.
It is only necessary to provide a folder and the path to that folder,
and the program will download automatically the reference genome into that folder.

Set the parameters of the genome downloading and indexing parameter file:

(i) Open with a text editor the file:
path_to_BSseq3P/PARAMETERS/genome_indexing.txt

(ii) Set the variable genome_name to the ID of the genome.
This ID is formed by an abbreviation of the species name,
followed by the genome version number.
For example,
ERR192345 is mouse data, thus, choose genome_name mm9 or mm10,
SRR201781 is human data, thus, choose genome_name hg18, hg19 or hg38.

(iii) Set the variable path_genome to the path in which the user wishes to
store permanently the reference genome.

NOTE
The path written in the variable path_genome will point to the place in which
the reference genome will be downloaded and indexed.
Any mistake writing that path here will affect the placement of the reference genome.

(iv) Safe the genome_indexing.txt file before closing the editor.

See Table S1 for the definition of the indexing parameters and
their default values.
3. Download and index the reference genome
(i) Open a terminal.

(ii) Run the indexer by typing in the command line of the terminal:

path_to_P3BSseq/P3BSseq/genome_indexing

NOTE
The download will take ~10 minutes. The indexing will take ~12 hours
(depending on the hardware) for a human genome.
It is only necessary to perform the indexing one time per reference genome.
The genome indexing program will create a structure of subfolders and files in the
path_to_REFERENCE_GENOME folder associated to the chosen genome.

For example if the user decide to download the hg19 reference genome in the folder
/home/DNAmethylation/REFERENCE_GENOME

genome_indexing will create automatically the folder
/home/DNAmethylation/REFERENCE_GENOME/hg19

with the following subfolders:
annotation, to keep the genome annotation.
CpGislands, to keep the information about the CpG islands and CpA coordinates
within 200bp promoter CpG islands.
Fasta, to keep the fasta files of all the chromosomes sequences of the
reference genome.
Indexing, to keep the indexing information of the reference genome.

NOTE
To save the user time performing the indexing, we have uploaded at

https://sourceforge.net/projects/p3bsseq/files/IndexedGenomes/

the indexed genomes of several genome versions:
Human (hg18, hg19, hg38), mouse (mm9, mm10) and rat (rn4).

To use these indexed genomes, the user should dowloaded the files from the respective
remotes directories at

https://sourceforge.net/projects/p3bsseq/files/IndexedGenomes/

and save them in the local folder

/home/DNAmethylation/REFERENCE_GENOME/

then she/he has to join the chunks.

For example, for version 19 of the indexed human genome:

Open a terminal, go to the local directory

/home/DNAmethylation/REFERENCE_GENOME/

and type

cut hg19* > hg19.zip

Unzip hg19.zip

unzip hg19.zip

This will create a directory hg19 with subfolders

annotation
CpGislands
fasta
indexing
path

of the indexed genome.

Run the mapper

Choose the type of mapper.
Depending on the combination of methylome type (WGBS or RRBS)
and reads ending type (single-end or paired-end),
there are four different mappers
(named single_end_WGBS, paired_end_WGBS, single_end_RRBS, paired_end_RRBS).

NOTE
Since for each different dataset the user might wish to utilize different processing parameters,
including different folders to save the results,
it is advisable that prior to any editing of the parameter files,
the user backups a copy of the default parameter files.
The simpler way is before any change in that files
to copy the PARAMETERS folder in another folder,
for example named PARAMETERS-DEFAULT.
Thus, the user can always recover the software default parameters.

Configure the mapper parameter file.
(i) Before running the mapper, open with a text editor such as
(gedit, vim or vi) the mapper parameter file
(single_end_WGBS.txt, paired_end_WGBS.txt, single_end_RRBS.txt, or paired_end_RRBS.txt)
corresponding to the chosen mapper.

The mapper parameter files are in the PARAMETERS folder of the P3BSseq installation.
For example in:
path_to_P3BSseq/PARAMETERS/process_alignment_WGBS.txt

Edit the needed information.
The minimal required information is the following
The sample_name variable is the name of the methylomics sample to be processed
(it can be any string denoting a file name).
This name will be used to create the BED file name with the methylation calls results.

The genome_name variable has to store exactly the same reference ID already selected
in the genome_name variable of the genome_indexing.txt file.
For example mm10.

NOTE
The value of the genome_name variable has to match exactly the species abbreviation
and the genome version of the reference ID already selected in the
genome_name variable of the genome_indexing.txt file.

The path_genome variable is the path to the folder that has the indexed genome.
It has to be exactly the same path as to the reference genome already selected
in the path_genome variable of the genome_indexing.txt file.

NOTE
The path written in the path_genome variable has to match exactly the path to
the reference genome
already selected in the path_genome variable of the genome_indexing.txt file.

The path_raw variable is the path to the folder that has the fastq format
raw read files (path to INPUT_FILES).

NOTE
The path written in path_raw variable has to match exactly the path in which
the fastq files are placed.

The path_out is the path to the folder that will store all the output results.

NOTE
P3BSseq erase the content of the path_out folder prior any calculation.
If the user wishes to run several experiments with different parameters for
the same dataset,
she/he has to choose a different name for the path_out folder for each experiment
in order not to lose for each running the results.

The rest of the variables in the configuration file can be kept in the
optimized default values.
Advance users can change such default values
(see Tables S2 and S3 for the definition of all mapper parameters,
and their default values).
The comments in the parameter file contain the hint or a link for
further parameter explanations.

(iv) Save the mapper parameter file before closing the editor.

Run the mapper
(i) Open a terminal.
(ii) Type the path to the binary file of the chosen mapper and press Return.
The possible mappers are:
For Genome Wide Bisulfite Sequencing methylomes
For single end

path_to_P3BSseq/P3BSseq/single_end_WGBS

For paired end
path_to_P3BSseq/P3BSseq/paired_end_WGBS

For Reduced Representation Bisulfite Sequencing (RRBS) methylomes

For single end
path_to_P3BSseq/P3BSseq/single_end_RRBS
For paired end
path_to_P3BSseq/P3BSseq/paired_end_RRBS

===============================================================================
EXAMPLES
===============================================================================

Example of using P3BSseq for a methylome data fragmented into three files

The example assumes that the user has already installed P3BSseq in a folder:
/home/DNAmethylation/P3BSseq
and wishes to store the reference genome in a folder:
/home/DNAmethylation/REFERENCE_GENOME
and wishes to process human iPS cell derived from fibroblast ffipsc1911 downloaded from the
ftp://ftp.sra.ebi.ac.uk server and to save the reads data in a folder:
/home/DNAmethylation/INPUT_FILES/ffipsc1911
and the results in a folder
/home/DNAmethylation/OUTPUTS/ffipsc1911
These data are WGBS directional single-end reads,
fragmented into three fastq files.

Download and index the reference genome
Set the parameters
Open the
/home/DNAmethylation/P3BSseq/PARAMETERS/genome_indexing.txt
parameter file with any text editor
(gedit, vim or vi) to set the parameters.
Write in genome_name the genome ID. Since we wish to align to human genome,
then set it to hg19:
genome_name = hg19
Write in path_genome the path of the output folder from the indexing of the
reference genome,
path_genome = /home/DNAmethylation/REFERENCE_GENOME.
This location should have a free space of at least 13GB for hg19.
Save the file and close it.
Run the executable genome of the genome indexing file.
Open a terminal and type:
/home/DNAmethylation/P3BSseq/genome_indexing

The download will take ~10 minutes and the indexing will take
(depending on the hardware) ~12 hours.
It is only necessary to perform the indexing one time per reference genome.
The genome downloading and indexing program will generate a structure of
subfolders and files in the
/home/DNAmethylation/REFERENCE_GENOME/hg19 folder associated with the selected genome.

Set the input data
P3BSseq will run and generate a report with one sample methylome at a time.
Input data preparation:
All the raw reads, in three fastq files of the human iPS cells derived from
fibroblast (ffipsc1911)
directional single-end read WGBS with ~406 million reads can be downloaded via the links:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR094/SRR094461/SRR094461.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR094/SRR094462/SRR094462.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR094/SRR094463/SRR094463.fastq.gz

Save these files in the folder
/home/DNAmethylation/INPUT_FILES/ffipsc1911
For this example at least 28GB are required to keep the fastq files.
It is not necessary to unzip the fastq files.

Run the mapper
Set the mapper parameters
Since the data is human WGBS directional single-end,
open the /home/DNAmethylation/P3BSseq/PARAMETERS/single_end_WGBS.txt
parameter file with any text editor (gedit, vim or vi) to set the parameters.
Write in sample_name the string recalling the sample.
We have already decided to use the name ffipsc1911
Write in genome_name the human genome ID.
It has to be the same as the one used for indexing the reference genome,
as in the parameter file genome_indexing.txt
genome_name = hg19
Write in path_genome the path of the output folder from indexing the reference genome
path_genome = /home/DNAmethylation/REFERENCE_GENOME
Write in path_raw the path to folder that stored the fastq format raw read files,
as we set it in the input data preparation step:
path_raw = /home/DNAmethylation/INPUTS/ffipsc1911
Write in path_out the path to the folder that will store all the output results
path_out = /home/DNAmethylation/OUTPUTS/ffipsc1911
Write in number_of_strands = 2
since this data corresponds with a directional library.
Save the file and close it.

Run the mapper
Open a terminal and type:
home/DNAmethylation/P3BSseq/single_end_WGBS
After running, it will create in the OUTPUT folder
/home/DNAmethylation/OUTPUTS/ffipsc1911
a structure of subfolders and files with the results will be created.

Example results

After running one of the mappers,
it will create in the OUTPUTS folder a structure of subfolders and files with the results
with the following information:

01_Raw_Sequence_Quality:
Contains the quality check results of the raw sequences.

02_Quality_and_Adapter_Trimmed_Sequences:
Contains the files with the results
of quality and adapter trimming of the raw sequences.
These files are short sequences in fastq format.

03_Input_Sequences:
Contains the trimmed sequences that are converted to BRAT-BW format before alignment.

04_Input_Sequence_Quality:
Contains the quality check results of the trimmed sequences before alignment

05_Unmapped_Sequences:
Contains the unmapped sequences after the separation from the mapped sequences.

06_Unmapped_Sequence_Quality:
Contains the quality check results of the unmapped sequences.

07_Quality_and_Kmer_Trimmed_Sequences:
Contains the quality and kmer trimming
files with the results from trimmed unmapped sequences.
These files store short sequences in fastq format.

08_Mapped_Sequences:
Contains the results of alignment including mapped sequences
and statistics before and after deduplication.
These files store short sequences in BRAT-BW format.

10_Methylation_Call: Contains 3 types of files:
The first type of files have all methylation ratios of CG on each strand of
each chromosome. The files are named chr*.fa.txt_strand.txt_CG, where * is
the chromosome, and strand is forw (for the forward strand) and rev (for the
reverse strand). These files have 8 columns:
1. Chromosome ID.
2. Locus in the chromosome of the C of the CG pair.
3. Strand (+ for forward, - for reverse).
4. CG
5. Total sum of read counts of Cytosines (Cs), thus means, methylated Cs,
mapped to this locus.
6. Total sum of read counts of Cytosines and Thymines (C+T)
mapped to this locus.
7. Methylation ratio: C/(C+T) of the C locus.
8. Total sum of read counts of Thymines (Ts), thus means,
unmethylated Cs, mapped to this locus.

The second file type has only one representative: sample_name_CG.txt
is the merge of the CG
methylation calls (forward and reverse) of all the chromosomes. It has the format:
1. Chromosome ID.
2. Locus in the chromosome of the C of the CG pair.
3. Locus in the chromosome of the G of the CG pair.
4. CG
5. Total sum of read counts of Cytosines (Cs), thus means, methylated Cs,
mapped to this locus.
6. Total sum of read counts of Cytosines and Thymines (C+T) mapped to this locus.
7. Methylation ratio: C/(C+T) of the C locus.
8. Total sum of read counts of Thymines (Ts), thus means, unmethylated Cs,
mapped to this locus.

The third file type is represented by 2 files: sample_name_strand.txt are where strand is forw
(for the forward strand) and rev (for the reverse strand) are the files are the
methylation calling of CG, CHG and CHH of each strand with the following format:
1. Chromosome ID.
2. Locus in the chromosome of the C of the CG, CHG or CHH pattern.
3. Locus in the chromosome of last base of the CG, CHG or CHH pattern.
4. CG, CHG or CHH pattern: Total sum of read counts of Cytosines and Thymines (C+T)
mapped to this locus.
5. Methylation ratio: C/(C+T) of the C locus.
6. Strand (+ for forward, - for reverse).

Note
Bisulfite conversion deaminate unmethylated Cytosines to produce Uracil in DNA.
Methylated Cytosines are protected from the conversion to Uracil.
Uracils are displayed as Thymines. with the methylation calling results in BED format.
This file stores the methylation ratio of each of the three categories of methylation ratios:
CG, CHH and CHG for all the chromosomes merged together.
The BED format provides a flexible way to define genomic features.

11_BAM_File_and_Mapping_Visualization:
Contains the SAM and BAM files converted from results of alignment.

12_Bisulfite_Sequencing_Report:
Contains intermediate steps results needed to create the final HTML report file.

All the results are summed up in the report html file,
in compliance with the Standards and Guidelines for
Whole Genome Shotgun Bisulfite Sequencing issued by the
NIH Roadmap Epigenomics Mapping Consortium.
The report is located at:
path_to_output/12_Bisulfite_Sequencing_Report/BSseq_report.html

The report contains information on the sequencing depth;
read information and parameters before and after trimming and alignment;
and the percentages of genomic,
Cytosine and CpG coverage distributed in four sections about

I. Raw sequences
Provides among other results information about the coverage X.

II. Processing raw sequences.
Provides among other results the parameters used to trim the data,
the statistics and the links to the fastq results files of the trimming step.

III. Alignment.
Provides among other results
the parameters used to map the data and the statistics of there the alignment results.

IV. Processing alignment output.
Provides among other results,
the statistics of the final alignment after deduplication,
the links to the BED files with the methylation ratios,
the bisulfate conversion rates, and the sequencing depths after mapping.

Some of the subfolders results are created to keep intermediate step results,
that can be helpful for further quality check or for more advance research.
But for a normal application they can be removed without any effect on
the methylome and on the final report
(since the final report has no links to such subfolders).
Thus, in case the user wishes to save hard disk space,
she/he can delete the following subfolders:
02_Quality_and_Adapter_Trimmed_Sequences
03_Input_Sequences
05_Unmapped_Sequences
07_Quality_and_Kmer_Trimmed_Sequences
08_Mapped_Sequences

Deleting these subfolders will free approximately half of the result space.

===============================================================================
REFERENCES
===============================================================================
Luu PL, Schöler HR, Araúzo-Bravo MJ.
Disclosing the crosstalk among DNA methylation, transcription factors,
and histone marks in human pluripotent cells through discovery of DNA methylation motifs.
Genome Res. 2013 Dec;23(12):2013-29. doi: 10.1101/gr.155960.113. Epub 2013 Oct 22.
PMID: 24149073

===============================================================================
APPENDIX
===============================================================================

Table S1. Parameter definition of the genome_indexing.txt parameter file to
configure the download and indexing of the reference genome.
These parameters are used by the genome_indexing executable file.

No Parameters Explanation

1 genome_name = mm9 Id of the reference genome such as hg18,
hg19, hg38, mm9, mm10; hg = human, mm = mouse
2 path_genome = /home/ref_genome Reference genome that will be downloaded to path_genome.
The indexing outputs are also stored in path_genome folder.
The required space for the path_genome
folder is approximately 13GB for hg38, 11.5GB for mm10
3 url_chromosomes = url Url from which the software will download the fasta files of the sequences of all the chromosomes.
Optional variable.
If it is not defined, the software will use as default the url of the UCSC:
ftp://hgdownload.cse.ucsc.edu/goldenPath/genome_name/chromosomes/
4 url_annotation = url Url from which the software will download the annotation information of the genome.
Optional variable.
If it is not defined, the software will use as default the url of the UCSC:
ftp://hgdownload.cse.ucsc.edu/goldenPath/genome_name/database/
5 url_cpg_island = url Url from which the software will download the CpG island information.
Optional variable.
If it is not defined, the software will use as default the url of the UCSC:
ftp://hgdownload.cse.ucsc.edu/goldenPath/genome_name/database/cpgIslandExt.txt.gz

Table S2. Parameter definition of the single-end read parameter files
single_end_WGBS.txt and single_end_RRBS.txt to configure
the processing of WGBS and RRBS with the single_end_WGBS and
single_end_RRBS mappers, respectively.

No Parameters Explanation

1 sample_name = ffipsc69 Name of methylomics sample to be processed.
It is any string that recalls the sample.
This name will be used to create the
subfolder structure that will host the results
2 genome_name = hg19 ID of the reference genome,
such as hg18, hg19, hg38, mm9, mm10, etc. (hg = human, mm = mouse).
The ID is the same as genome_name in the genome_indexing.txt
3 path_genome = /home/ref_genome Path to the indexed genome;
it is the same as path_genome in the genome_indexing.txt
4 path_raw = /home/data/ffipsc69 Path to the folder with the raw read files in fastq format
5 path_out = /home/data/results Path to folder that outputs all the results
6 bcq0 = 28 Base call quality threshold,
any bases with base call quality < bcq0 will be removed.
Recommended values = 20, 28, 30;
(see http://en.wikipedia.org/wiki/Phred_quality_score)
7 bcq_code = 33 ASCII quality score encoding used in the raw fastq file.
Recommended values = 33 or 64
(see http://en.wikipedia.org/wiki/FASTQ_format)
8 GC_threshold = 40 GC content threshold, best value = 40, 30
(see http://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf)
9 C_threshold = 20 C composition threshold,.
Recommended values = 20, 15
(see http://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf)
10 adapter = AGATCGGAAGAGC Adapter sequence contaminated in read that needs to be removed.
The default value is the first 13 bp of the Illumina adapter,
AGATCGGAAGAGC
11 error_rate = 0.1 Maximum allowed error rate
(number of errors divided by the length of the matching region).
Default value = 0.1
12 read_adapter_overlap = 1 Number of nucleotides that overlap between a read and the adapter to be trimmed.
Recommended values = 1, 2
13 length = 24 Length of a read kept for alignment if its length is at
least 24 bp after being trimmed by quality and adapter
14 Ns = 1 Number of internal undefined nucleotides of a read is allowed.
Recommended values = 0, 1, 2
15 number_of_strands = 2 Number of strands needed to align.
2 is for directional library and
4 is for non-directional library
16 nmismatch = 1 Allowed number of mismatches between a
read and the reference genome sequence.
Recommended values = 1, 2
17 bcq1 = 30 Base call quality for realign reads.
It makes sense when bcq1 ≥ bcq0
18 kmer_percent = 0.01 The percentage of kmer list used for kmer trimming.
Recommended values = 0.01, 0.02
19 make_report = 1 If 1, a BSseq report is generated, and if 0 not
20 nX = 5 Minimum coverage of each Cytosine on each strand
required to take this Cytosine into account.
The recommended values are from 2 to 10
21 maximum_number_process = 20 Maximum number of processes the user wishes to allocate for running P3BSseq.

Table S3. Parameter definition of the paired-end read parameter files
paired_end_WGBS.txt and paired_end_RRBS.txt to configure the processing
of WGBS and RRBS with the paired_end_WGBS and paired_end_RRBS mappers, respectively.

No Parameters Explanation

1 sample_name = ffipsc69 Name of methylomics sample to be processed.
It is any string that recalls the sample.
This name will be used to create the
subfolder structure that will host the
results
2 genome_name = hg19 ID of the reference genome,
such as hg18, hg19, hg38, mm9, mm10, etc.
(hg = human, mm = mouse).
The ID is the same as genome_name in
the genome_indexing.txt
3 path_genome = /home/ref_genome Path to the indexed genome;
it is same as path_genome in the
genome_indexing.txt
4 path_raw = /home/data/ffipsc69 Path to the folder containing the
raw read files in fastq format
5 path_out = /home/data/results Path to the folder that outputs all
the results
6 bcq0 = 28 The base call quality threshold.
Any bases with base call
quality < bcq0 will be removed.
Recommended values = 20, 28, 30;
(see http://en.wikipedia.org/wiki/Phred_quality_score)
7 bcq_code = 33 ASCII quality score encoding used in the raw fastq file.
Recommended values = 33 or 64
(see http://en.wikipedia.org/wiki/FASTQ_format)
8 GC_threshold = 40 GC content threshold.
Recommended values = 40, 30
(see http://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf)
9 C_threshold = 20 C composition threshold.
Recommended values = 20, 15
(see http://www.epigenesys.eu/images/stories/protocols/pdf/20120720103700_p57.pdf)
10 adapter = AGATCGGAAGAGC Adapter sequence contaminated in read that needs to be removed.
The default value is the first 13 bp of the Illumina adapter,
AGATCGGAAGAGC
11 adapter2 = '' Adapter sequences for second mate of the paired-end read.
Default value: '' (empty string)
12 error_rate = 0.1 Maximum allowed error rate
(number of errors divided by the length of the matching region).
Default value = 0.1
13 read_adapter_overlap = 1 Number of nucleotides that overlap between a read and the adapter to be trimmed.
Recommended values = 1, 2
14 length = 24 Length of a read kept for alignment
if its length is at least 24 bp after
trimming by quality and adapter
15 length1 = 35 Length of a mate 1 kept for alignment
if its length is at least 35 bp after
trimming by quality and adapter when
mate 2 is removed
16 length2 = 35 Length of a mate 2 kept for alignment
if its length is at least 35 bp after
trimming by quality and adapter when
mate 1 is removed
17 Ns = 1 Allowed number of internal undefined
nucleotides of a read.
Recommended values = 0, 1, 2
18 number_of_strands = 2 Number of strands needed to align.
2 is for directional library and
4 is for non-directional library
19 nmismatch = 1 Allowed number of mismatches between
a read and the reference genome sequence.
Recommended values = 1, 2
20 minimum_insertion_size = 0 Minimum insertion size between a paired read.
Recommended value = 0
21 maximum_insertion_size = 1000 Maximum insertion size between a paired read.
Recommended value = 1000
22 version = '' If not realign of unmapped reads is wished,
then version = '' (empty string),
otherwise, specify any string
23 bcq1 = 30 Base call quality for realign reads.
It makes sense when bcq1 ≥ bcq0
24 kmer_percent = 0.01 Percentage of kmer list used for kmer trimming.
Recommended values = 0.01, 0.02
25 make_report = 1 If 1, a BSseq report is generated,
and if 0 not
26 nX = 5 Minimum coverage of each Cytosine on each
strand required to take this Cytosine into account.
The recommended values are between 2 to 10
27 maximum_number_process = 20 Maximum number of processes the user wishes to allocate for running P3BSseq.

P3BSseq Wiki

Parallel processing pipeline for analysis of bisulfite sequencing data

Home

Equipment

Equipment setup

Software installation

Input data file installation

Reference genome installation

Run the mapper

Example of using P3BSseq for a methylome data fragmented into three files

Example results