rseqflow Wiki

RNA-Seq analysis pipeline for QC, Expression, DifferentialExpression

Brought to you by: jherstein

PipelineDescription

Authors: Anonymous

Attachments

RseqFlow-Extension-ExpressionLevel-2013-07-17.png (28294 bytes)

RseqFlow-Extension-QC.png (62296 bytes)

RseqFlow-Extension-framework.png (52768 bytes)

RseqFlow Pipeline Description

RseqFlow is an RNA-Seq analysis pipeline which offers an express implementation of analysis steps for RNA sequencing datasets. It can perform pre and post mapping quality control (QC) for sequencing data, calculate expression levels for uniquely mapped reads, identify differentially expressed genes, and convert file formats for ease of visualization. A detailed description of the pipeline is given below.

Frame Work
Four Run Branches
Software Environment and Run modes
Aligning RNA-Seq Datasets
Description of Each Branch
Input Files and Formats
References

Frame Work

The framework is shown as follows:

Four Run Branches

The main framework has four run branches which can be run individually or in workflow mode.

Branch 1: Quality Control and SNP calling based on the merging of alignments to the transcriptome and genome.
Branch 2: Expression level quantification for Gene/Exon/Splice Junctions based on alignment to the transcriptome.
Branch 3: Some file format conversions for easy storage, backup and visualization.
Branch 4: Differentially expressed gene identification based on the output of the Expression level quantification from Branch 2.

Software Environment and Run modes

The following software must be pre-installed:
* Python 2.7 or higher
* R 2.11 or higher
* GCC
* ImageMagick (optional) Required if you want a combined pdf QC report

RseqFlow is implemented with two run mode options: Pegasus workflow management run mode and Simple Unix Shell run mode.

Aligning RNA-Seq Datasets

Alignment tools: Bowtie2¹
Alignment targets: Genome and/or Transcriptome
QC, SNP calling, and Alignment File Format Conversion modules are implemented based on the merging of alignments to the transcriptome and genome.
Expression Quantification and Differentially Expressed Gene analysis modules are implemented based on only the alignment to the transcriptome.

Description of Each Branch

Quality Control

This module outputs the pre-alignment metrics for fastq files and post-alignment QC analysis for sam files. It also performs SNPs calling. The detailed QC processing is as follows:

Some of the following figures and text are taken from RSeQC².

1. Pre-alignment->Read Quality: Outputs a boxplot and heatmap based on the Phred Quality Score. This analysis is available only when the RNA-Seq input file is in FASTQ format. The heatmap uses different colors to represent nucleotide density ("blue"=low density, "orange"=median denstiy, "red"=high density). The following are some example output files:

Output.read_qual.boxplot.pdf:

Output.read_qual.heatmap.pdf:

2. Pre-alignment->Nucleotide distribution: Outputs nucleotide composition for each position of the read. This module is used to check the nucleotide composition bias. The output files are as follows:

Output.ACTG_Percentage.txt:

PositionInRead      #Reads          A%     C%     G%     T%    N%

     1              17,734,524    31.08% 19.95% 21.10% 27.87% 0.00%
    ...

Output.ACTG_Percentage_plot.pdf:

3. Pre-alignment->GC content detection: Outputs two types of GC analysis, one counts GC content across the read length; the other is a reads number histogram of GC content. The output files include two image files and their corresponding numerical value .txt files as follows:

Output.GC_Percentage.txt: (Numerical values for Output.GC_Percentage_plot.pdf)

PositionInRead             #Reads            GC content

      1                   80,999,006            70.63%
     ...

Output.GC_Percentage_plot.pdf:

Output.GC_hist.xls: (Numerical values for Output.GC_hist_plot.pdf)

  GC content      read_count

     0.00%           3,896
     1.00%             768
      ...

Output.GC_hist_plot.pdf:

4. Post-alignment->Statistics of Reads Alignments: Outputs the number of sequenced reads, mapped reads, multi-mapped reads and unique-mapped reads. The output file is as follows:

Output.mapping_report.txt:

Number of Total Reads: 41,102,255
Number of Mapped Reads: 30,314,711   73.75%
Number of Multi-mapped reads: 3,265,968   7.94%
Number of Unique-mapped reads: 27,048,743   65.81%

5. Post-alignment->Statistics of Strand specificity: Outputs how reads were stranded for RNA-seq data. It is based on annotation of the transcriptome by comparing read mapping information to the underlying gene model. The output file is as follows:

Output.strand_stat.txt:

This is SingleEnd Data

Fraction of reads explained by "++": .2314
Fraction of reads explained by "--": .2637
Fraction of reads explained by "+-": .2667
Fraction of reads explained by "-+": .2382

Note:

++:read mapped to '+' strand indicates parental gene on '+' strand
--:read mapped to '-' strand indicates parental gene on '-' strand
+-:read mapped to '+' strand indicates parental gene on '-' strand
-+:read mapped to '-' strand indicates parental gene on '+' strand

6. Post-alignment->Gene Body Coverage Distribution: Outputs the average read distribution over the gene body. This module scales all transcripts to 100bp length. For ① all annotated genes and ② the genes with single annotated transcript, the average number of reads for each scaled point is calculated. The output files include two images of coverage profile along the gene body for ① and ② genes and their corresponding numerical value .txt files.

Output_whole_genes.geneBodyCoverage.txt: (Numerical values for Output_whole_genes.geneBody_coverage.pdf)

Total reads: 16970177
Fragment number: 16994719

percentile              count

     0                  16712
     1                  17294
    ...

Output_whole_genes.geneBody_coverage.pdf:

Output_single_transcriptGene.geneBodyCoverage.txt: (Numerical values for Output_single_transcriptGene.geneBody_coverage.pdf)

Total reads: 16970177
Fragment number: 16994719

percentile              count

     0                   5518
     1                   5685
    ...

Output_single_transcriptGene.geneBody_coverage.pdf:

7. Post-alignment->Reads distribution along genome: Outputs how mapped reads were distributed over the genome, such as exon, intron, and intergenic regions. When genome features overlap (e.g. a region could be annotated as both exon and intron by two different transcripts), they are prioritized in the following order: Exons then Introns then Intergenic regions. For example, if a read was mapped to both an exon and intron, it will be assigned to CDS exons. The output file is as follows:

Output.read_distribution.txt:

Percentage of Assigned Tags 0.91

Region      Percentage_of_Tags

Exons           74.27 %
Introns         18.93 %
TSS_up_1kb       0.52 %
TSS_up_5kb       3.62 %
TSS_up_10kb      5.07 %
TES_down_1kb     0.37 %
TES_down_5kb     1.11 %
TES_down_10kb    1.73 %

Note:
If reads spliced once, it will be counted as 2 tags

TSS=Transcription Start Site, TES=Transcription End Site, down=downstream, up=upstream

8. Post-alignment-optional->Alignments to ribosomal RNA and Mitochondrial genome: Outputs the mapping rate of reads to ribosomal RNA and Mitochondrial genome. The output file is as follows:

Mitochondrial genome mapping report
Output.chrM.simple_mapping_report.txt:

Number of Reads: 41,021,821
Number of Mapped Reads: 10,331,743
Mapping Rate: 25.19%

ribosomal RNA mapping report
Output.rRNA.simple_mapping_report.txt:

Number of Reads: 41,021,821
Number of Mapped Reads: 7,870,589
Mapping Rate: 19.19%

9. Combined QC Report

If you have ImageMagick installed, RseqFlow will create a combined pdf report containing all of the individual QC reports. The combined file is named Output.combined_QC_report.pdf

SNPs Calling

This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch.
SNP calling is analyzed by Samtools³. It outputs .bcf files which can be viewed using the included bcftools package.

Expression Level Quantification

Multi-mapped reads are removed.
All analysis is based on the alignments from RNA-Seq to the reference transcriptome.
Output Information:

Output_whole_GeneExpressionLevel_unique.txt:

GeneID   chromosome   Strand  reads_number    RPKM

IFT27      chr22        -        24        8.48459367622
CRKL       chr22        +       166       17.47879409
SBF1       chr22        -       268       23.1426123831
...

Output_whole_ExonExpressionLevel_unique.txt:

GeneID:Exon Start:Exon End   Chromosome Strand reads_number   RPKM

APOBEC3B:39388435:39388789      chr22     +       6.78      10.7118340257
GGA1:38019333:38019474          chr22     +       6.34      25.0203422498
NCAPH2:50960169:50960277        chr22     +       7.63      39.2671604098
...

Output_whole_JunctionExpressionLevel_unique.txt: RPM (Reads Per Million Mapped Reads)

GeneID:Junc Start:Junc End Chromosome Strand reads_number    RPM
DRG1:31795711:31796605       chr22      +       1         14.9479065457
MYH9:36678831:36680138       chr22      -       6         89.6874392741
RPL3:39709315:39709638       chr22      -      11        164.426972003
...

Differentially Expressed Gene Identification

Two calculations for conditions with and without replicates:
- For conditions with replicates: We compute p-values for differentially expressed genes using DESeq⁴.
- For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁵.
All the analysis is based on the outputs from Expression Level Quantification.
Output Information
- DE_all_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the statistical information for all genes.
- DE_Significant_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the significant differentially expressed genes.

Output will be in the following format:

Id           Base_mean         Base_meanA       Base_meanB          Fold_change       log2_fold_change        pval              padj 
"UBE2Q1"  1127.52604407842  1140.74641615722  1066.381823213     0.934810583763415  -0.0972540267309428  0.525741421393871  0.947892997381397
"RNF14"   1899.48755163846  1901.42965526746  1890.50532235435   0.994254674169595  -0.00831265547272295 0.942974278996524  1

Alignment File Format Conversion module

1. This module implements format conversion for backup and visualization:

sam to bam, mrf and wig/bed format
bam to wig/bed
mrf to wig/bed format

2. The conversion is implemented with Samtools³, RseqTools⁶, and Bedtools⁷ etc.

Input Files and Formats

Possible input files for QC_SNP.sh (Depends on the selected options):

Genome annotation GTF file
Transcriptome reference sequences
Genome reference seqeuences
RNA-Seq fastq or fastq.gz file
Alignment files in SAM format
Reference sequences of Mitochondria
Reference sequences of Ribosomal RNA

Possible input files for ExpressionEstimation.sh (Depends on the selected options):

Genome annotation GTF file
Transcriptome reference sequences
RNA-Seq fastq or fastq.gz file
Alignment files in SAM format

Possible input files for DE.sh:

Output files from ExpressionEstimation.sh:
- whole_GeneExpressionLevel_unique.txt
- whole_ExonExpressionLevel_unique.txt files.

Format specification of input files

There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline.

Transcriptome Reference Sequences

The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info.

GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End [extra info]

For example:

>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none

Genome Reference Sequences

The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info.

For example:

>chr1 dna:chromosome

>chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF

>chrM

Genome Annotation

The Genome Annotation GTF file must be in format GTF3.0.

References

1. Langmead B, Trapnell C, Pop M, Salzberg SL.(2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10.

2. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments Bioinformatics 28 (16): 2184-2185.

3. Li, H., Handsaker, etc. (2009) The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 25, 2078-2079.

4. Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010),11

5. Wan L and Sun:CEDER: Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq,(2012), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5): 1281-1292.

6. Lukas Habegger, Andrea Sboner, etc.(2010). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics.

7. Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.