Recent changes to PipelineDescription

PipelineDescription modified by J.Herstein

J.Herstein — Sun, 21 Sep 2014 23:11:23 -0000

--- v10
+++ v11
@@ -20,7 +20,8 @@
 The following software must be pre-installed: 
   * Python 2.7 or higher
   * R 2.11 or higher
-  * GCC.
+  * GCC
+  * ImageMagick (optional): required only if you want a combined PDF QC report

 RseqFlow is implemented with two [run mode options](http://sourceforge.net/p/rseqflow/wiki/TwoRunningModes): Pegasus workflow management run mode and Simple Unix Shell run mode.

@@ -59,7 +60,7 @@

          1              17,734,524    31.08% 19.95% 21.10% 27.87% 0.00%
         ...
-<BR>
+
 **Output.ACTG_Percentage_plot.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure3.png)
@@ -74,7 +75,7 @@
           1                   80,999,006            70.63%
          ...

-<BR>
+
 **Output.GC_Percentage_plot.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure4.png)
@@ -87,7 +88,7 @@
          0.00%           3,896
          1.00%             768
           ...
-<BR>
+
 **Output.GC_hist_plot.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure5.png)
@@ -102,7 +103,7 @@
     Number of Multi-mapped reads: 3,265,968   7.94%
     Number of Unique-mapped reads: 27,048,743   65.81%

-<BR>
+
 5\. Post-alignment->Statistics of Strand specificity: Outputs how reads were stranded for RNA-seq data. It is based on annotation of the transcriptome by comparing read mapping information to the underlying gene model. The output file is as follows: 

 **Output.strand_stat.txt:**
@@ -122,7 +123,7 @@
     +-:read mapped to '+' strand indicates parental gene on '-' strand
     -+:read mapped to '-' strand indicates parental gene on '+' strand

-<BR>
+
 6\. Post-alignment->Gene Body Coverage Distribution: Outputs the average read distribution over the gene body. This module scales all transcripts to 100bp length. For ① all annotated genes and ② the genes with single annotated transcript, the average number of reads for each scaled point is calculated. The output files include two images of coverage profile along the gene body for ① and ② genes and their corresponding numerical value .txt files. 

 **Output_whole_genes.geneBodyCoverage.txt:** (Numerical values for Output_whole_genes.geneBody_coverage.pdf) 
@@ -137,7 +138,7 @@
          1                  17294
         ...

-<BR>
+
 **Output_whole_genes.geneBody_coverage.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure6.png)
@@ -154,7 +155,7 @@
          1                   5685
         ...                    

-<BR>
+
 **Output_single_transcriptGene.geneBody_coverage.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure7.png)
@@ -181,7 +182,7 @@

     TSS=Transcription Start Site, TES=Transcription End Site, down=downstream, up=upstream

-<BR>
+
 8\. Post-alignment-optional->Alignments to ribosomal RNA and Mitochondrial genome: Outputs the mapping rate of reads to ribosomal RNA and Mitochondrial genome. The output file is as follows: 

 **Mitochondrial genome mapping report**
@@ -191,7 +192,7 @@
     Number of Reads: 41,021,821
     Number of Mapped Reads: 10,331,743
     Mapping Rate: 25.19%
-<BR>
+
 **ribosomal RNA mapping report**
 **Output.rRNA.simple_mapping_report.txt:**

@@ -199,14 +200,18 @@
     Number of Reads: 41,021,821
     Number of Mapped Reads: 7,870,589
     Mapping Rate: 19.19%
-<BR>
+
+
+9\.  Combined QC Report
+
+If ImageMagick is installed, RseqFlow creates a combined PDF report containing all of the individual QC reports. The combined file is named Output.combined_QC_report.pdf.
+

 ## SNPs Calling
 * This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch. 

 * SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)³. It outputs .bcf files which can be viewed using the included bcftools package. 

-<BR>

 ## Expression Level Quantification
 [[img src=RseqFlow-Extension-ExpressionLevel-2013-07-17.png height=75% width=75%]]
@@ -227,7 +232,7 @@
     SBF1       chr22        -       268       23.1426123831
     ... 

-<BR>
+
 **Output_whole_ExonExpressionLevel_unique.txt:**

@@ -238,7 +243,7 @@
     NCAPH2:50960169:50960277        chr22     +       7.63      39.2671604098
     ... 

-<BR>
+
 **Output_whole_JunctionExpressionLevel_unique.txt:** RPM (Reads Per Million Mapped Reads) 

@@ -248,7 +253,7 @@
     RPL3:39709315:39709638       chr22      -      11        164.426972003
     ...

-<BR>
+
 ## Differentially Expressed Gene Identification
 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/RseqFlow-Extension-DE.png)

@@ -272,7 +277,7 @@
     "UBE2Q1"  1127.52604407842  1140.74641615722  1066.381823213     0.934810583763415  -0.0972540267309428  0.525741421393871  0.947892997381397
     "RNF14"   1899.48755163846  1901.42965526746  1890.50532235435   0.994254674169595  -0.00831265547272295 0.942974278996524  1

-<BR>
+

 ## Alignment File Format Conversion module
 1\. This module implements format conversion for backup and visualization: 
@@ -283,7 +288,7 @@

 2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)³, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)⁶, and [Bedtools](http://code.google.com/p/bedtools/)⁷ etc. 

-<BR>
+
 # Input Files and Formats
 Possible input files for QC_SNP.sh (Depends on the selected options):

@@ -322,7 +327,7 @@
 For example:

 >hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none
-<BR>
+

   * **Genome Reference Sequences**

@@ -335,12 +340,12 @@
 >chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF

 >chrM
-<BR>
+

   * **Genome Annotation**

 The Genome Annotation GTF file must be in format GTF3.0.
-<BR>
+

 # References
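
As an illustration of the software requirements listed in this revision (R, GCC, and optionally ImageMagick), the following minimal sketch checks whether the corresponding executables can be found on the PATH. It is not part of RseqFlow: the function name `missing_prerequisites` is hypothetical, the executable names (`R`, `gcc`, `convert` or `magick`) are assumptions about a typical installation, and version numbers are not verified.

```python
import shutil

# Executable names are assumptions about a typical install; adjust as needed.
REQUIRED = {"R": ["R"], "GCC": ["gcc"]}
# ImageMagick is only needed for the combined PDF QC report.
OPTIONAL = {"ImageMagick": ["convert", "magick"]}


def missing_prerequisites(tools, which=shutil.which):
    """Return the names of tools for which no candidate executable is found.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name, candidates in tools.items()
            if not any(which(exe) for exe in candidates)]


if __name__ == "__main__":
    for label, tools in (("required", REQUIRED), ("optional", OPTIONAL)):
        for name in missing_prerequisites(tools):
            print(f"Missing {label} tool: {name}")
```

Treating ImageMagick as a separate, optional group mirrors the wording above: its absence would only mean that no combined report is produced.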

PipelineDescription modified by J.Herstein

J.Herstein — Sat, 08 Feb 2014 00:57:50 -0000

--- v9
+++ v10
@@ -1,5 +1,8 @@
 RseqFlow Pipeline Description
+RseqFlow is an RNA-Seq analysis pipeline that offers an express implementation of analysis steps for RNA sequencing datasets. It can perform pre- and post-mapping quality control (QC) on sequencing data, calculate expression levels for uniquely mapped reads, identify differentially expressed genes, and convert file formats for ease of visualization. A detailed description of the pipeline is given below.
+
 [TOC]
+
 # Frame Work
 The framework is shown as follows:

PipelineDescription modified by J.Herstein

J.Herstein — Sat, 08 Feb 2014 00:49:32 -0000

--- v8
+++ v9
@@ -345,11 +345,11 @@

 2\. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments Bioinformatics 28 (16): 2184-2185. 

-3\. Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010),11 
-
-4\. Wan L and Sun:CEDER: Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq,(2012), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5): 1281-1292. 
-
-5\. Li, H., Handsaker, etc. (2009) The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 25, 2078-2079. 
+3\. Li, H., Handsaker, etc. (2009) The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 25, 2078-2079. 
+
+4\. Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010),11 
+
+5\. Wan L and Sun:CEDER: Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq,(2012), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5): 1281-1292. 

 6\. Lukas Habegger, Andrea Sboner, etc.(2010). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics.

PipelineDescription modified by J.Herstein

J.Herstein — Sat, 08 Feb 2014 00:48:36 -0000

--- v7
+++ v8
@@ -201,7 +201,7 @@
 ## SNPs Calling
 * This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch.

-* SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)⁵. It outputs .bcf files which can be viewed using the included bcftools package. 
+* SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)³. It outputs .bcf files which can be viewed using the included bcftools package. 

 


@@ -251,8 +251,8 @@

 * Two calculations for conditions with and without replicates: 

-    * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)³.
-    * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁴.
+    * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)⁴.
+    * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁵.

 * All the analysis is based on the outputs from _Expression Level Quantification_.
@@ -278,7 +278,7 @@
   * bam to wig/bed 
   * mrf to wig/bed format 

-2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)⁵, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)⁶, and [Bedtools](http://code.google.com/p/bedtools/)⁷ etc. 
+2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)³, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)⁶, and [Bedtools](http://code.google.com/p/bedtools/)⁷ etc. 

 

 # Input Files and Formats
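
The revision above mentions combining exon-level p-values into a single value with a Fisher probability test. A minimal sketch of the textbook Fisher's combined probability test follows. It assumes independent p-values and is not taken from the RseqFlow or CEDER source, so the pipeline's actual computation may differ. The function name `fisher_combined_pvalue` is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total
```

The closed form avoids a dependency on a statistics library. With a single p-value the function returns that p-value unchanged.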

PipelineDescription modified by J.Herstein

J.Herstein — Sat, 08 Feb 2014 00:39:35 -0000

--- v6
+++ v7
@@ -31,7 +31,7 @@
 # Description of Each Branch

 ## Quality Control
-This module outputs the pre-alignment metrics for fastq files and post-alignment QC analysis for sam files. The detailed processing is as follows: 
+This module outputs the pre-alignment metrics for FASTQ files and post-alignment QC analysis for SAM files. It also performs SNP calling. The detailed QC processing is as follows: 

 [[img src=RseqFlow-Extension-QC.png height=95% width=95%]]

@@ -198,6 +198,13 @@
     Mapping Rate: 19.19%
 


+## SNPs Calling
+* This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch. 
+
+* SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)⁵. It outputs .bcf files which can be viewed using the included bcftools package. 
+
+

+
 ## Expression Level Quantification
 [[img src=RseqFlow-Extension-ExpressionLevel-2013-07-17.png height=75% width=75%]]

@@ -263,12 +270,7 @@
     "RNF14"   1899.48755163846  1901.42965526746  1890.50532235435   0.994254674169595  -0.00831265547272295 0.942974278996524  1

 

-## SNPs Calling
-* This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch. 
-
-* SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)⁵. It outputs .bcf files which can be viewed using the included bcftools package. 
-
-

+
 ## Alignment File Format Conversion module
 1\. This module implements format conversion for backup and visualization: 

@@ -279,7 +281,7 @@
 2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)⁵, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)⁶, and [Bedtools](http://code.google.com/p/bedtools/)⁷ etc. 

 

-## Input Files and Formats
+# Input Files and Formats
 Possible input files for QC_SNP.sh (Depends on the selected options):

   * Genome annotation GTF file

PipelineDescription modified by J.Herstein

J.Herstein — Fri, 07 Feb 2014 22:30:19 -0000

--- v5
+++ v6
@@ -63,7 +63,7 @@

 3\. Pre-alignment->GC content detection: Outputs two types of GC analysis, one counts GC content across the read length; the other is a reads number histogram of GC content. The output files include two image files and their corresponding numerical value .txt files as follows: 

-**Output.GC_Percentage.txt:** (Numberical values for Output.GC_Percentage_plot.pdf) 
+**Output.GC_Percentage.txt:** (Numerical values for Output.GC_Percentage_plot.pdf) 

     PositionInRead             #Reads            GC content
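
The GC content step described in this revision produces two views: GC content at each position along the read, and a histogram of the number of reads per GC content value. The sketch below shows one way both could be computed from a list of read sequences. It is an illustration only, not the RseqFlow implementation: the function names are hypothetical, and counting every base (including N) in the denominator and binning reads by whole percent are assumptions.

```python
from collections import Counter


def gc_by_position(reads):
    """Return the GC percentage at each position across all reads."""
    max_len = max((len(r) for r in reads), default=0)
    result = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        bases = [r[pos].upper() for r in reads if len(r) > pos]
        gc = sum(b in "GC" for b in bases)
        result.append(100.0 * gc / len(bases))
    return result


def gc_histogram(reads):
    """Count reads per GC content bin, rounded to a whole percent."""
    hist = Counter()
    for r in reads:
        if r:  # skip empty reads to avoid division by zero
            gc = sum(b in "GC" for b in r.upper())
            hist[round(100.0 * gc / len(r))] += 1
    return dict(hist)
```

For example, `gc_by_position(["GCAT", "GGAT", "ATAT"])` gives one percentage per read position, while `gc_histogram` on the same input groups the three reads by their overall GC content.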

PipelineDescription modified by J.Herstein

J.Herstein — Fri, 31 Jan 2014 21:56:42 -0000

--- v4
+++ v5
@@ -244,14 +244,17 @@

 * Two calculations for conditions with and without replicates: 

-   * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)⁴.
-   * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁵.
-* All the analysis is based on the outputs from _Expression Level Quantification_ . 
+    * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)³.
+    * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁴.
+
+
+* All the analysis is based on the outputs from _Expression Level Quantification_.
+

 * Output Information 

-  * DE_all_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the statistical information for all genes.
-  * DE_Significant_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the significant differentially expressed genes.
+    * DE_all_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the statistical information for all genes.
+    * DE_Significant_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the significant differentially expressed genes.

 Output will be in the following format: 

@@ -263,7 +266,7 @@
 ## SNPs Calling
 * This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch. 

-* SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)⁶. It outputs .bcf files which can be viewed using the included bcftools package. 
+* SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)⁵. It outputs .bcf files which can be viewed using the included bcftools package. 

 

 ## Alignment File Format Conversion module
@@ -273,7 +276,7 @@
   * bam to wig/bed 
   * mrf to wig/bed format 

-2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)⁶, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)⁷, and [Bedtools](http://code.google.com/p/bedtools/)⁸ etc. 
+2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)⁵, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)⁶, and [Bedtools](http://code.google.com/p/bedtools/)⁷ etc. 

 

 ## Input Files and Formats
@@ -340,14 +343,12 @@

 2\. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments Bioinformatics 28 (16): 2184-2185. 

-3\. Ying Wang, Gaurang Mehta, Rajiv Mayani, Tade Souaiaia, Yangho Chen, Andrew Clark, Lin Wan, Oleg V. Evgrafov, James A. Knowles, Ewa Deelman and Ting Chen, RseqFlow: workflows for RNA-Seq data analysis. Bioinformatics, 2011, 27 (18): 2598–2600. 
-
-4\. Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010),11 
-
-5\. Wan L and Sun:CEDER: Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq,(2012), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5): 1281-1292. 
-
-6\. Li, H., Handsaker, etc. (2009) The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 25, 2078-2079. 
-
-7\. Lukas Habegger, Andrea Sboner, etc.(2010). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics. 
-
-8\. Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842. 
+3\. Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010),11 
+
+4\. Wan L and Sun:CEDER: Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq,(2012), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5): 1281-1292. 
+
+5\. Li, H., Handsaker, etc. (2009) The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics, 25, 2078-2079. 
+
+6\. Lukas Habegger, Andrea Sboner, etc.(2010). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics. 
+
+7\. Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.

PipelineDescription modified by J.Herstein

J.Herstein — Fri, 31 Jan 2014 21:48:58 -0000

--- v3
+++ v4
@@ -1,7 +1,7 @@
-RseqFlow Pipeline Desription
+RseqFlow Pipeline Description
 [TOC]
 # Frame Work
-The whole framework is shown as follows: 
+The framework is shown as follows:

 [[img src=RseqFlow-Extension-framework.png height=100% width=100%]]

@@ -39,11 +39,11 @@

 1\. Pre-alignment->Read Quality: Outputs a boxplot and heatmap based on the Phred Quality Score. This analysis is available only when the RNA-Seq input file is in FASTQ format. The heatmap uses different colors to represent nucleotide density ("blue"=low density, "orange"=median density, "red"=high density). The following are some example output files: 

-Output.read_qual.boxplot.pdf 
+**Output.read_qual.boxplot.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure1.png)

-Output.read_qual.heatmap.pdf 
+**Output.read_qual.heatmap.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure2.png)

@@ -52,11 +52,11 @@
 **Output.ACTG_Percentage.txt:**

-    PositionInRead      #Reads        A%     C%     G%     T%    N%
+    PositionInRead      #Reads          A%     C%     G%     T%    N%

          1              17,734,524    31.08% 19.95% 21.10% 27.87% 0.00%
         ...
-
+

 **Output.ACTG_Percentage_plot.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure3.png)
@@ -66,12 +66,12 @@
 **Output.GC_Percentage.txt:** (Numberical values for Output.GC_Percentage_plot.pdf) 

-    PositionInRead           #Reads             GC content
+    PositionInRead             #Reads            GC content

           1                   80,999,006            70.63%
          ...

-
+

 **Output.GC_Percentage_plot.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure4.png)
@@ -84,7 +84,7 @@
          0.00%           3,896
          1.00%             768
           ...
-
+

 **Output.GC_hist_plot.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure5.png)
@@ -99,7 +99,7 @@
     Number of Multi-mapped reads: 3,265,968   7.94%
     Number of Unique-mapped reads: 27,048,743   65.81%

-
+

 5\. Post-alignment->Statistics of Strand specificity: Outputs how reads were stranded for RNA-seq data. It is based on annotation of the transcriptome by comparing read mapping information to the underlying gene model. The output file is as follows: 

 **Output.strand_stat.txt:**
@@ -119,6 +119,7 @@
     +-:read mapped to '+' strand indicates parental gene on '-' strand
     -+:read mapped to '-' strand indicates parental gene on '+' strand

+

 6\. Post-alignment->Gene Body Coverage Distribution: Outputs the average read distribution over the gene body. This module scales all transcripts to 100bp length. For ① all annotated genes and ② the genes with single annotated transcript, the average number of reads for each scaled point is calculated. The output files include two images of coverage profile along the gene body for ① and ② genes and their corresponding numerical value .txt files. 

 **Output_whole_genes.geneBodyCoverage.txt:** (Numerical values for Output_whole_genes.geneBody_coverage.pdf) 
@@ -133,6 +134,7 @@
          1                  17294
         ...

+

 **Output_whole_genes.geneBody_coverage.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure6.png)
@@ -149,6 +151,7 @@
          1                   5685
         ...                    

+

 **Output_single_transcriptGene.geneBody_coverage.pdf:**

 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/figure7.png)
@@ -175,19 +178,18 @@

     TSS=Transcription Start Site, TES=Transcription End Site, down=downstream, up=upstream

+

 8\. Post-alignment-optional->Alignments to ribosomal RNA and Mitochondrial genome: Outputs the mapping rate of reads to ribosomal RNA and Mitochondrial genome. The output file is as follows: 

-Mitochondrial genome mapping report 
-
+**Mitochondrial genome mapping report**
 **Output.chrM.simple_mapping_report.txt:**

     Number of Reads: 41,021,821
     Number of Mapped Reads: 10,331,743
     Mapping Rate: 25.19%
-
-ribosomal RNA mapping report 
-
+

+**ribosomal RNA mapping report**
 **Output.rRNA.simple_mapping_report.txt:**

@@ -197,13 +199,13 @@
 


 ## Expression Level Quantification
-![](http://rseqflow.googlecode.com/svn/wiki/images/RseqFlow-Extension-ExpressionLevel.png)
-
-1\. Multi-mapped reads are removed. 
-
-2\. All analysis is based on the alignments from RNA-Seq to the reference transcriptome. 
-
-3\. Output Information 
+[[img src=RseqFlow-Extension-ExpressionLevel-2013-07-17.png height=75% width=75%]]
+
+  * Multi-mapped reads are removed.
+
+  * All analysis is based on the alignments from RNA-Seq to the reference transcriptome. 
+
+  * Output Information:

 **Output_whole_GeneExpressionLevel_unique.txt:**

@@ -215,6 +217,7 @@
     SBF1       chr22        -       268       23.1426123831
     ... 

+

 **Output_whole_ExonExpressionLevel_unique.txt:**

@@ -225,7 +228,7 @@
     NCAPH2:50960169:50960277        chr22     +       7.63      39.2671604098
     ... 

-
+

 **Output_whole_JunctionExpressionLevel_unique.txt:** RPM (Reads Per Million Mapped Reads) 

@@ -239,18 +242,18 @@
 ## Differentially Expressed Gene Identification
 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/RseqFlow-Extension-DE.png)

-1\. Two calculations for conditions with and without replicates: 
-
->   * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)⁴.
-  * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁵.
-2\. All the analysis is based on the outputs from _Expression Level Quantification_ . 
-
-3\. Output Information 
-
->   * DE_all_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the statistical information for all genes.
+* Two calculations for conditions with and without replicates: 
+
+   * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)⁴.
+   * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁵.
+* All the analysis is based on the outputs from _Expression Level Quantification_ . 
+
+* Output Information 
+
+  * DE_all_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the statistical information for all genes.
   * DE_Significant_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the significant differentially expressed genes.

-> Output will be in the following format: 
+Output will be in the following format: 

     Id           Base_mean         Base_meanA       Base_meanB          Fold_change       log2_fold_change        pval              padj 
     "UBE2Q1"  1127.52604407842  1140.74641615722  1066.381823213     0.934810583763415  -0.0972540267309428  0.525741421393871  0.947892997381397
@@ -258,9 +261,9 @@

 

 ## SNPs Calling
-1\. This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch. 
-
-2\. SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)⁶. It outputs .bcf files which can be viewed using the included bcftools package. 
+* This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch. 
+
+* SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)⁶. It outputs .bcf files which can be viewed using the included bcftools package. 

 

 ## Alignment File Format Conversion module
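
The example differential expression table in this revision lists `Base_meanA`, `Base_meanB`, `Fold_change`, and `log2_fold_change`. The example rows are consistent with the fold change being `Base_meanB / Base_meanA`, with the log2 column being its base-2 logarithm. The sketch below reproduces that relationship for the "UBE2Q1" row. The function name `fold_change` is hypothetical, and the code is not taken from RseqFlow or DESeq.

```python
import math


def fold_change(base_mean_a, base_mean_b):
    """Return (fold_change, log2_fold_change) for condition B relative to A."""
    if base_mean_a <= 0 or base_mean_b <= 0:
        raise ValueError("both base means must be positive")
    fc = base_mean_b / base_mean_a
    return fc, math.log2(fc)


# Values copied from the "UBE2Q1" row of the example table.
fc, log2_fc = fold_change(1140.74641615722, 1066.381823213)
print(f"{fc:.6f} {log2_fc:.6f}")
```

A negative `log2_fold_change` therefore indicates lower mean expression in condition B than in condition A.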

PipelineDescription modified by J.Herstein

J.Herstein — Fri, 31 Jan 2014 21:02:09 -0000

--- v2
+++ v3
@@ -3,7 +3,7 @@
 # Frame Work
 The whole framework is shown as follows:

-![](http://rseqflow.googlecode.com/svn/wiki/images/RseqFlow-Extension-framework.png)
+[[img src=RseqFlow-Extension-framework.png height=100% width=100%]]

 # Four Run Branches
 The main framework has four run branches which can be run individually or in workflow mode. 
@@ -33,7 +33,7 @@
 ## Quality Control
 This module outputs the pre-alignment metrics for fastq files and post-alignment QC analysis for sam files. The detailed processing is as follows: 

-![](http://rseqflow.googlecode.com/svn/wiki/images/RseqFlow-Extension-QC.png)
+[[img src=RseqFlow-Extension-QC.png height=95% width=95%]]

 Some of the following figures and text are taken from [RSeQC](http://code.google.com/p/rseqc/)².

PipelineDescription modified by J.Herstein

J.Herstein — Wed, 29 Jan 2014 23:58:53 -0000

--- v1
+++ v2
@@ -1,24 +1,11 @@
-  * Frame Work
-  * Four Run Branches
-  * Software Environment and Run modes
-  * Description of Each Branch
-    * Aligning RNA-Seq Datasets
-    * Quality Control
-    * Expression Level Quantification
-    * Differentially Expressed Gene Identification
-    * SNPs Calling
-    * Alignment File Format Conversion module
-    * Input Files and Formats
-  * Reference
-
+RseqFlow Pipeline Desription
+[TOC]
 # Frame Work
-
 The whole framework is shown as follows:

 ![](http://rseqflow.googlecode.com/svn/wiki/images/RseqFlow-Extension-framework.png)

 # Four Run Branches
-
 The main framework has four run branches which can be run individually or in workflow mode. 

   * Branch 1: Quality Control and SNP calling based on the merging of alignments to the transcriptome and genome. 
@@ -27,26 +14,28 @@
   * Branch 4: Differentially expressed gene identification based on the output of the Expression level quantification from Branch 2. 

 # Software Environment and Run modes
-
-> * The following software must be pre-installed: Python 2.7 or higher, R 2.11 or higher, and GCC.
-> * `RseqFlow` is implemented with two [run mode options](http://code.google.com/p/rseqflow/wiki/TwoRunningModes): Pegasus workflow management run mode and Simple Unix Shell run mode.
+The following software must be pre-installed: 
+  * Python 2.7 or higher
+  * R 2.11 or higher
+  * GCC.
+
+RseqFlow is implemented with two [run mode options](http://sourceforge.net/p/rseqflow/wiki/TwoRunningModes): Pegasus workflow management run mode and Simple Unix Shell run mode.
+
+# Aligning RNA-Seq Datasets
+
+  * Alignment tools: [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)¹
+  * Alignment targets: Genome and/or Transcriptome
+  * QC, SNP calling, and Alignment File Format Conversion modules are implemented based on the merging of alignments to the transcriptome and genome.
+  * Expression Quantification and Differentially Expressed Gene analysis modules are implemented based on only the alignment to the transcriptome.

 # Description of Each Branch

-## **Aligning RNA-Seq Datasets**
-
-> * Alignment tools: [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)1
-> * Alignment targets: Genome and/or Transcriptome
-> * QC, SNP calling, and Alignment File Format Conversion modules are implemented based on the merging of alignments to the transcriptome and genome.
-> * Expression Quantification and Differentially Expressed Gene analysis modules are implemented based on only the alignment to the transcriptome.
-
-## **Quality Control**
-
+## Quality Control
 This module outputs the pre-alignment metrics for fastq files and post-alignment QC analysis for sam files. The detailed processing is as follows: 

 ![](http://rseqflow.googlecode.com/svn/wiki/images/RseqFlow-Extension-QC.png)

-Some of the following figures and text are taken from [RSeQC](http://code.google.com/p/rseqc/)2. 
+Some of the following figures and text are taken from [RSeQC](http://code.google.com/p/rseqc/)². 

 1\. Pre-alignment->Read Quality: Outputs a boxplot and heatmap based on the Phred Quality Score. This analysis is available only when the RNA-Seq input file is in FASTQ format. The heatmap uses different colors to represent nucleotide density ("blue"=low density, "orange"=median density, "red"=high density). The following are some example output files: 

@@ -116,7 +105,7 @@
 **Output.strand_stat.txt:**

-    This is `SingleEnd` Data
+    This is SingleEnd Data

     Fraction of reads explained by "++": .2314
     Fraction of reads explained by "--": .2637
@@ -132,7 +121,7 @@

 6\. Post-alignment->Gene Body Coverage Distribution: Outputs the average read distribution over the gene body. This module scales all transcripts to 100bp length. For ① all annotated genes and ② the genes with single annotated transcript, the average number of reads for each scaled point is calculated. The output files include two images of coverage profile along the gene body for ① and ② genes and their corresponding numerical value .txt files. 

-**Output_whole_genes.geneBodyCoverage.txt:** (Numerical values for Output_whole_genes.geneBody_coverage.pdf`) 
+**Output_whole_genes.geneBodyCoverage.txt:** (Numerical values for Output_whole_genes.geneBody_coverage.pdf) 

     Total reads: 16970177
@@ -205,10 +194,9 @@
     Number of Reads: 41,021,821
     Number of Mapped Reads: 7,870,589
     Mapping Rate: 19.19%
-    
-
-## **Expression Level Quantification**
-
+

+
+## Expression Level Quantification
 ![](http://rseqflow.googlecode.com/svn/wiki/images/RseqFlow-Extension-ExpressionLevel.png)

 1\. Multi-mapped reads are removed. 
@@ -217,7 +205,7 @@

 3\. Output Information 

-**`Output_whole_GeneExpressionLevel_unique.txt`:**
+**Output_whole_GeneExpressionLevel_unique.txt:**

     GeneID   chromosome   Strand  reads_number    RPKM
@@ -227,7 +215,7 @@
     SBF1       chr22        -       268       23.1426123831
     ... 

-**`Output_whole_ExonExpressionLevel_unique.txt`:**
+**Output_whole_ExonExpressionLevel_unique.txt:**

     GeneID:Exon Start:Exon End   Chromosome Strand reads_number   RPKM
@@ -238,7 +226,7 @@
     ... 

-**`Output_whole_JunctionExpressionLevel_unique.txt`:** RPM (Reads Per Million Mapped Reads) 
+**Output_whole_JunctionExpressionLevel_unique.txt:** RPM (Reads Per Million Mapped Reads) 

     GeneID:Junc Start:Junc End Chromosome Strand reads_number    RPM
@@ -246,22 +234,21 @@
     MYH9:36678831:36680138       chr22      -       6         89.6874392741
     RPL3:39709315:39709638       chr22      -      11        164.426972003
     ...
-    
-
-## **Differentially Expressed Gene Identification**
-
+
+

+## Differentially Expressed Gene Identification
 ![](http://genomics.isi.edu/wp-content/uploads/2012/09/RseqFlow-Extension-DE.png)

 1\. Two calculations for conditions with and without replicates: 

->   * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)4.
-  * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test5.
+>   * For conditions with replicates: We compute p-values for differentially expressed genes using [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)⁴.
+  * For conditions without replicates: p-values for exons are computed with DESeq and then combined into a single value using Fisher probability test⁵.
 2\. All the analysis is based on the outputs from _Expression Level Quantification_ . 

 3\. Output Information 

->   * `DE_all_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt`: This file outputs the statistical information for all genes.
-  * `DE_Significant_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt`: This file outputs the significant differentially expressed genes.
+>   * DE_all_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the statistical information for all genes.
+  * DE_Significant_With(out)Replicate_+Condition1Sample-Condition2Sample+_Table.txt: This file outputs the significant differentially expressed genes.

 > Output will be in the following format: 

@@ -269,82 +256,88 @@
     "UBE2Q1"  1127.52604407842  1140.74641615722  1066.381823213     0.934810583763415  -0.0972540267309428  0.525741421393871  0.947892997381397
     "RNF14"   1899.48755163846  1901.42965526746  1890.50532235435   0.994254674169595  -0.00831265547272295 0.942974278996524  1

-## **SNPs Calling**
-
+

+## SNPs Calling
 1\. This module uses the uniquely mapped alignments based on merged alignments to the genome and transcriptome. This module is implemented within the QC branch. 

-2\. SNP calling is analyzed by [Samtools](http://samtools.sourceforge.net/)6. It outputs .bcf files which can be viewed using the included bcftools package. 
-
-## **Alignment File Format Conversion module**
-
+2\. SNP calling is performed with [Samtools](http://samtools.sourceforge.net/)⁶. It outputs .bcf files which can be viewed using the included bcftools package. 
+
+

+## Alignment File Format Conversion module
 1\. This module implements format conversion for backup and visualization: 

   * sam to bam, mrf and wig/bed format 
   * bam to wig/bed 
   * mrf to wig/bed format 

-2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)6, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)7, and [Bedtools](http://code.google.com/p/bedtools/)8 etc. 
-
-## **Input Files and Formats**
-
-Some annotation and reference sequence files with required formats for several model species can be [downloaded](http://code.google.com/p/rseqflow/wiki/ReferenceAnnotationDownload) here. 
-
-**1\. Possible input Files for QC_SNP.sh(Depends on the options selected)**
+2\. The conversion is implemented with [Samtools](http://samtools.sourceforge.net/)⁶, [RseqTools](http://archive.gersteinlab.org/proj/rnaseq/rseqtools/)⁷, and [Bedtools](http://code.google.com/p/bedtools/)⁸ etc. 
+
+

+## Input Files and Formats
+Possible input files for QC_SNP.sh (Depends on the selected options):

   * Genome annotation GTF file 
-  * Transcriptome reference sequences fasta file 
-  * Genome reference sequences fasta file 
+  * Transcriptome reference sequences 
+  * Genome reference sequences 
   * RNA-Seq fastq or fastq.gz file 
+  * Alignment files in SAM format 
   * Reference sequences of Mitochondria 
   * Reference sequences of Ribosomal RNA 
-  * Alignment files in SAM format 
-
-**2\. Possile input Files for Expression Level Quantification (Depends on the options selected)**
+
+Possible input files for ExpressionEstimation.sh (Depends on the selected options):

   * Genome annotation GTF file 
-  * Transcriptome reference sequences fasta file 
+  * Transcriptome reference sequences 
   * RNA-Seq fastq or fastq.gz file 
   * Alignment files in SAM format 

-**3\. Possible input Files for DE Gene Identification (Depends on the options selected)**
-
-  * Output files from the Expression Level Quantification module (`ExpressionEstimation.sh`): 
-
-> `whole_GeneExpressionLevel_unique.txt` or `whole_ExonExpressionLevel_unique.txt`
-
-**4\. Format Specification of Input Files **
-
-  * **Genome Annotation GTF file**
-
-> Format from GTF 2.0 to GTF3.0 is required. 
-
-  * **Genome Annotation Reference Sequences (Reference transcriptome)**
-
-> The transcript names must begin with “>” and should meet the format requirements as shown below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end location from the extra info. 
-
-> `“>$GenomeName_$AnnotationSource_$TranscriptsID=$Chromosome:$Start-$End [extra info]”`
-
-> For example, 
-
-> “>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none” 
+Possible input files for DE.sh:
+
+  * Output files from ExpressionEstimation.sh: 
+       * whole_GeneExpressionLevel_unique.txt
+       * whole_ExonExpressionLevel_unique.txt
+
+Format specification of input files
+
+There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline. 
+
+
+  * **Transcriptome Reference Sequences**
+
+The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info. 
+
+GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End [extra info]
+
+For example:
+
+>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none
+


   * **Genome Reference Sequences**

-> The chromosome name must begin with “>” and should meet the format requirements as shown below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info. 
-
-> “>$chromsome” 
-
-> For example, 
-
-> “>chr1” “>chrM” etc 
-
-# Reference
+The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info. 
+
+For example: 
+
+>chr1 dna:chromosome 
+
+>chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF
+
+>chrM
+

+
+  * **Genome Annotation**
+
+The Genome Annotation GTF file must be in GTF 3.0 format.
+

+
+# References

 1\. Langmead B, Trapnell C, Pop M, Salzberg SL.(2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. _Genome Biol_ 10\. 

 2\. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments Bioinformatics 28 (16): 2184-2185. 

-3\. Ying Wang, Gaurang Mehta, Rajiv Mayani, Tade Souaiaia, Yangho Chen, Andrew Clark, Lin Wan, Oleg V. Evgrafov, James A. Knowles, Ewa Deelman and Ting Chen, `RseqFlow`: workflows for RNA-Seq data analysis. Bioinformatics, 2011, 27 (18): 2598–2600. 
+3\. Ying Wang, Gaurang Mehta, Rajiv Mayani, Tade Souaiaia, Yangho Chen, Andrew Clark, Lin Wan, Oleg V. Evgrafov, James A. Knowles, Ewa Deelman and Ting Chen, RseqFlow: workflows for RNA-Seq data analysis. Bioinformatics, 2011, 27 (18): 2598–2600. 

 4\. Simon Anders and Wolfgang Huber: Differential expression analysis for sequence count data Genome Biology (2010),11