RseqFlow is an RNA-Seq analysis pipeline which offers an express implementation of analysis steps for RNA sequencing datasets. It can perform pre and post mapping quality control (QC) for sequencing data, calculate expression levels for uniquely mapped reads, identify differentially expressed genes, and convert file formats for ease of visualization. A detailed description of the pipeline is given below.
The framework is shown as follows:
The main framework has four run branches which can be run individually or in workflow mode.
The following software must be pre-installed:
* Python 2.7 or higher
* R 2.11 or higher
* GCC
* ImageMagick (optional) Required if you want a combined pdf QC report
RseqFlow is implemented with two run mode options: Pegasus workflow management run mode and Simple Unix Shell run mode.
This module outputs the pre-alignment metrics for fastq files and post-alignment QC analysis for sam files. It also performs SNPs calling. The detailed QC processing is as follows:
Some of the following figures and text are taken from RSeQC2.
1. Pre-alignment->Read Quality: Outputs a boxplot and heatmap based on the Phred Quality Score. This analysis is available only when the RNA-Seq input file is in FASTQ format. The heatmap uses different colors to represent nucleotide density ("blue"=low density, "orange"=median denstiy, "red"=high density). The following are some example output files:
Output.read_qual.boxplot.pdf:
Output.read_qual.heatmap.pdf:
2. Pre-alignment->Nucleotide distribution: Outputs nucleotide composition for each position of the read. This module is used to check the nucleotide composition bias. The output files are as follows:
Output.ACTG_Percentage.txt:
PositionInRead #Reads A% C% G% T% N%
1 17,734,524 31.08% 19.95% 21.10% 27.87% 0.00%
...
Output.ACTG_Percentage_plot.pdf:
3. Pre-alignment->GC content detection: Outputs two types of GC analysis, one counts GC content across the read length; the other is a reads number histogram of GC content. The output files include two image files and their corresponding numerical value .txt files as follows:
Output.GC_Percentage.txt: (Numerical values for Output.GC_Percentage_plot.pdf)
PositionInRead #Reads GC content
1 80,999,006 70.63%
...
Output.GC_Percentage_plot.pdf:
Output.GC_hist.xls: (Numerical values for Output.GC_hist_plot.pdf)
GC content read_count
0.00% 3,896
1.00% 768
...
Output.GC_hist_plot.pdf:
4. Post-alignment->Statistics of Reads Alignments: Outputs the number of sequenced reads, mapped reads, multi-mapped reads and unique-mapped reads. The output file is as follows:
Output.mapping_report.txt:
Number of Total Reads: 41,102,255
Number of Mapped Reads: 30,314,711 73.75%
Number of Multi-mapped reads: 3,265,968 7.94%
Number of Unique-mapped reads: 27,048,743 65.81%
5. Post-alignment->Statistics of Strand specificity: Outputs how reads were stranded for RNA-seq data. It is based on annotation of the transcriptome by comparing read mapping information to the underlying gene model. The output file is as follows:
Output.strand_stat.txt:
This is SingleEnd Data
Fraction of reads explained by "++": .2314
Fraction of reads explained by "--": .2637
Fraction of reads explained by "+-": .2667
Fraction of reads explained by "-+": .2382
Note:
++:read mapped to '+' strand indicates parental gene on '+' strand
--:read mapped to '-' strand indicates parental gene on '-' strand
+-:read mapped to '+' strand indicates parental gene on '-' strand
-+:read mapped to '-' strand indicates parental gene on '+' strand
6. Post-alignment->Gene Body Coverage Distribution: Outputs the average read distribution over the gene body. This module scales all transcripts to 100bp length. For ① all annotated genes and ② the genes with single annotated transcript, the average number of reads for each scaled point is calculated. The output files include two images of coverage profile along the gene body for ① and ② genes and their corresponding numerical value .txt files.
Output_whole_genes.geneBodyCoverage.txt: (Numerical values for Output_whole_genes.geneBody_coverage.pdf)
Total reads: 16970177
Fragment number: 16994719
percentile count
0 16712
1 17294
...
Output_whole_genes.geneBody_coverage.pdf:
Output_single_transcriptGene.geneBodyCoverage.txt: (Numerical values for Output_single_transcriptGene.geneBody_coverage.pdf)
Total reads: 16970177
Fragment number: 16994719
percentile count
0 5518
1 5685
...
Output_single_transcriptGene.geneBody_coverage.pdf:
7. Post-alignment->Reads distribution along genome: Outputs how mapped reads were distributed over the genome, such as exon, intron, and intergenic regions. When genome features overlap (e.g. a region could be annotated as both exon and intron by two different transcripts), they are prioritized in the following order: Exons then Introns then Intergenic regions. For example, if a read was mapped to both an exon and intron, it will be assigned to CDS exons. The output file is as follows:
Output.read_distribution.txt:
Percentage of Assigned Tags 0.91
Region Percentage_of_Tags
Exons 74.27 %
Introns 18.93 %
TSS_up_1kb 0.52 %
TSS_up_5kb 3.62 %
TSS_up_10kb 5.07 %
TES_down_1kb 0.37 %
TES_down_5kb 1.11 %
TES_down_10kb 1.73 %
Note:
If reads spliced once, it will be counted as 2 tags
TSS=Transcription Start Site, TES=Transcription End Site, down=downstream, up=upstream
8. Post-alignment-optional->Alignments to ribosomal RNA and Mitochondrial genome: Outputs the mapping rate of reads to ribosomal RNA and Mitochondrial genome. The output file is as follows:
Mitochondrial genome mapping report
Output.chrM.simple_mapping_report.txt:
Number of Reads: 41,021,821
Number of Mapped Reads: 10,331,743
Mapping Rate: 25.19%
ribosomal RNA mapping report
Output.rRNA.simple_mapping_report.txt:
Number of Reads: 41,021,821
Number of Mapped Reads: 7,870,589
Mapping Rate: 19.19%
9. Combined QC Report
If you have ImageMagick installed, RseqFlow will create a combined pdf report containing all of the individual QC reports. The combined file is named Output.combined_QC_report.pdf
This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch.
SNP calling is analyzed by Samtools3. It outputs .bcf files which can be viewed using the included bcftools package.
Multi-mapped reads are removed.
All analysis is based on the alignments from RNA-Seq to the reference transcriptome.
Output Information:
Output_whole_GeneExpressionLevel_unique.txt:
GeneID chromosome Strand reads_number RPKM
IFT27 chr22 - 24 8.48459367622
CRKL chr22 + 166 17.47879409
SBF1 chr22 - 268 23.1426123831
...
Output_whole_ExonExpressionLevel_unique.txt:
GeneID:Exon Start:Exon End Chromosome Strand reads_number RPKM
APOBEC3B:39388435:39388789 chr22 + 6.78 10.7118340257
GGA1:38019333:38019474 chr22 + 6.34 25.0203422498
NCAPH2:50960169:50960277 chr22 + 7.63 39.2671604098
...
Output_whole_JunctionExpressionLevel_unique.txt: RPM (Reads Per Million Mapped Reads)
GeneID:Junc Start:Junc End Chromosome Strand reads_number RPM
DRG1:31795711:31796605 chr22 + 1 14.9479065457
MYH9:36678831:36680138 chr22 - 6 89.6874392741
RPL3:39709315:39709638 chr22 - 11 164.426972003
...
Two calculations for conditions with and without replicates:
All the analysis is based on the outputs from Expression Level Quantification.
Output Information
Output will be in the following format:
Id Base_mean Base_meanA Base_meanB Fold_change log2_fold_change pval padj
"UBE2Q1" 1127.52604407842 1140.74641615722 1066.381823213 0.934810583763415 -0.0972540267309428 0.525741421393871 0.947892997381397
"RNF14" 1899.48755163846 1901.42965526746 1890.50532235435 0.994254674169595 -0.00831265547272295 0.942974278996524 1
1. This module implements format conversion for backup and visualization:
2. The conversion is implemented with Samtools3, RseqTools6, and Bedtools7 etc.
Possible input files for QC_SNP.sh (Depends on the selected options):
Possible input files for ExpressionEstimation.sh (Depends on the selected options):
Possible input files for DE.sh:
Format specification of input files
There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline.
The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info.
GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End [extra info]
For example:
>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none
The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info.
For example:
>chr1 dna:chromosome
>chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF
>chrM
The Genome Annotation GTF file must be in format GTF3.0.
1. Langmead B, Trapnell C, Pop M, Salzberg SL.(2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10.
2. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments Bioinformatics 28 (16): 2184-2185.
3. Li, H., Handsaker, etc. (2009) The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 25, 2078-2079.
4. Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010),11
5. Wan L and Sun:CEDER: Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq,(2012), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5): 1281-1292.
6. Lukas Habegger, Andrea Sboner, etc.(2010). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics.
7. Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.