Recent changes to UnixModeManual

UnixModeManual modified by J.Herstein

J.Herstein — Sun, 21 Sep 2014 23:16:07 -0000

--- v7
+++ v8
@@ -6,7 +6,8 @@
   * UNIX; 
   * Python 2.7 and higher; 
   * R 2.11 and higher; 
-  * GCC 
+  * GCC;
+  * ImageMagick (optional) Required if you want a single combined QC PDF report

 RseqFlow Installation and Configuration

@@ -15,40 +16,40 @@
 Step 2: Enter your specified directory: 

        bash-3.2$ cd /home/user/rseqflow
-<BR>
+

 Step 3: Extract the tar file: 

        bash-3.2$ tar -xvf RseqFlow_source.tar.gz 
-<BR>
+

 Step 4: Enter the directory: 

        bash-3.2$ cd /home/user/reseqflow/RseqFlow_source
-<BR>
+

 Step 5: Set up some tools: 

        bash-3.2$ ./make.sh 
-<BR>
+

 Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run RseqFlow from a new terminal window. 

        bash-3.2$ source ./configure.sh 
-<BR>
+

 Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run RseqFlow. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. **Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file! **

        bash-3.2$ cat configure.sh >> /home/user/.bashrc
-<BR>
+

 Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'. 

        bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python
-<BR>
+

 Step 9: Now you can run the corresponding shell scripts for each branch. 
-<BR>
+

 # 2\. Branch usage 

@@ -159,28 +160,28 @@
     QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

 _This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files._
-<BR>  
+

     QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa

 _This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report._
-<BR>   
+  

     QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam

 _This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam)._
-<BR>   
+   

     QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s

 _This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling._
-<BR>    
+   

     QC_SNP.sh -1 Read1.fastq.gz -2 Read2.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –s --mito chrM.fa

 _This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments._

-<BR>
+
 ## Branch 2: Expression Level Estimation (ExpressionEstimation.sh)
 ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique. 

@@ -250,12 +251,12 @@
     ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

 _This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted._
-<BR>   
+  

     ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam

 _This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read._
-<BR>
+

 ## Branch 3: Differentially Expressed Gene Identification (DE.sh)
 The "de" command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input. 
@@ -304,12 +305,12 @@
     DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

 _In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input._
-<BR>   
+  

     DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare

 _In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input._
-<BR>
+

 ## Branch 4: Alignment File Format Conversion (FileFormatConversion.sh)
 FileFormatConversion.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG. 
@@ -366,17 +367,17 @@
     FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

 _This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required._
-<BR>   
+   

     FileFormatConversion.sh -i in.sam -o out –m

 _This command will convert the input in.sam file to a mrf file._
-<BR> 
+

     FileFormatConversion.sh -i in.bam -o out –d

 _This command will convert the input in.bam file to a BED file for visualization._
-<BR>
+

 # 3\. Input files and formats
 ## Possible input files for QC_SNP.sh (Depends on the selected options):
@@ -388,19 +389,19 @@
   * Alignment files in SAM format 
   * Reference sequences of Mitochondria 
   * Reference sequences of Ribosomal RNA 
-<BR>
+
 ## Possible input files for ExpressionEstimation.sh (Depends on the selected options):

   * Genome annotation GTF file 
   * Transcriptome reference sequences 
   * RNA-Seq fastq or fastq.gz file 
   * Alignment files in SAM format 
-<BR>
+
 ## Possible input files for DE.sh:
   * Output files from ExpressionEstimation.sh: 
        * whole_GeneExpressionLevel_unique.txt
        * whole_ExonExpressionLevel_unique.txt files
-<BR>
+
 ## Format specification of input files

 There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline. 
@@ -415,7 +416,7 @@
 For example:

 >hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none
-<BR>
+

   * **Genome Reference Sequences**

@@ -428,13 +429,13 @@
 >chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF

 >chrM
-<BR>
+

   * **Genome Annotation**

 The Genome Annotation GTF file must be in format GTF3.0.
-<BR>
-<BR>
+
+

 # 4\. Sample datasets
 ## c. elegans
@@ -444,7 +445,7 @@
 [Reference genome sequences file](http://sourceforge.net/projects/rseqflow/files/genome_c.elegans.fa.gz/download)
 [Reference annotation gtf file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_anno.gtf/download)
 [Reference transcriptome sequences file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_seq.fa/download)
-<BR>
+

 Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

@@ -452,17 +453,17 @@

     mkdir c.elegans_Output

-<BR>
+
 i) Branch 1: QC_SNP.sh 

      QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s

-<BR>
+
 ii) Branch 2: ExpressionEstimation.sh

      ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans

-<BR>
+
 iii) Branch 4: FileFormatConversion.sh

      FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
@@ -473,7 +474,7 @@

      FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

-<BR>
+
 ## Human

 Download the following datasets:
@@ -482,14 +483,14 @@
 [Mitochondria sequence file](http://sourceforge.net/projects/rseqflow/files/Human_chrM.fa/download)
 [Reference annotation gtf file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_anno.gtf.gz/download)
 [Reference transcriptome sequences file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_transcripts.fa.gz/download)
-<BR>
+

 Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

 Before running the pipeline, create a new directory for output results: 

      mkdir Human_Output
-<BR>
+

 Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs. 

@@ -497,17 +498,17 @@
      gunzip Human_gencodeV14_transcripts.fa.gz
      gunzip Human_genome_GRCh37.fa.gz

-<BR>
+
  i) Branch 1: QC_SNP.sh

      QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s

-<BR>
+
  ii) Branch 2: ExpressionEstimation.sh

      ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human

-<BR>
+
  iii) Branch 4: FileFormatConversion.sh

      FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r  Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome

UnixModeManual modified by J.Herstein

J.Herstein — Fri, 31 Jan 2014 20:22:29 -0000

--- v6
+++ v7
@@ -410,7 +410,7 @@

 The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info. 

-GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End [extra info]
+GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End Extra Info

 For example:

UnixModeManual modified by J.Herstein

J.Herstein — Thu, 30 Jan 2014 00:32:04 -0000

--- v5
+++ v6
@@ -1,31 +1,14 @@
-# Manual of RseqFlow Unix Run Mode
-
-  * Manual of RseqFlow Unix Run Mode
-    * 1\. Package installation and configuration
-

-    * 2\. Branch usage
-        * Branch 1: QC_SNP.sh
-        * Branch 2: ExpressionEstimation.sh
-        * Branch 3: DE.sh
-        * Branch 4: FileFormatConversion.sh
-    * 3\. Input files and formats
-        * Possible input files for QC_SNP.sh (Depends on the selected options)
-        * Possible input files for ExpressionEstimation.sh (Depends on the selected options)
-        * Possible input files for DE.sh
-        * Format specification of input files
-    * 4\. Sample datasets
-        * c. elegans
-        * Human
-
-## 1\. Package installation and configuration
-
+Manual of RseqFlow Unix Run Mode
+[TOC]
+# 1\. Package installation and configuration
 The following packages must be pre-installed:
-    1. UNIX; 
-    2. Python 2.7 and higher; 
-    3. R 2.11 and higher; 
-    4. GCC 
-
-### RseqFlow Installation and Configuration
+
+  * UNIX; 
+  * Python 2.7 and higher; 
+  * R 2.11 and higher; 
+  * GCC 
+
+RseqFlow Installation and Configuration

 Step 1: Download the [source code](http://sourceforge.net/projects/rseqflow/files/rseqflow2-v2.1.tar.gz/download) to your directory, e.g '/home/user/rseqflow'. 

@@ -66,16 +49,10 @@

 Step 9: Now you can run the corresponding shell scripts for each branch. 
 

-## 2\. Branch usage
-
-The pipeline consists of four branches: 
-
-  * Branch 1: Quality Control and SNP calling 
-  * Branch 2: Expression level estimation 
-  * Branch 3: Differentially expressed gene identification 
-  * Branch 4: Alignment file format conversion 
-
-### Branch 1: QC_SNP.sh
+
+# 2\. Branch usage 
+
+## Branch 1: Quality Control and SNP calling (QC_SNP.sh)

 QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets. 

@@ -174,7 +151,7 @@
 
 

-#### Examples:
+Examples:

 Examples for running QC_SNP.sh:

@@ -182,18 +159,18 @@
     QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

 _This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files._
-
    
-
+
  
+  
     QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa

 _This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report._
-
    
-
+
   
+ 
     QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam

 _This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam)._
-
    
-
+
   
+ 
     QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s

 _This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling._
@@ -204,8 +181,7 @@
 _This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments._

 

-### Branch 2: ExpressionEstimation.sh
-
+## Branch 2: Expression Level Estimation (ExpressionEstimation.sh)
 ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique. 

 If bowtie2 indexes already exist in the same directory as the transcriptome reference file, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference file for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the transcriptome reference. This will bypass index creation for future runs and reduce run times. 
@@ -267,22 +243,21 @@
 

-#### Examples:
+Examples:

 Examples for running ExpressionEstimation.sh:

     ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

 _This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted._
-
    
-
+
   
+ 
     ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam

 _This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read._
 


-### Branch 3: DE.sh de
-
+## Branch 3: Differentially Expressed Gene Identification (DE.sh)
 The "de" command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input. 

 Options for DE.sh de:
@@ -305,12 +280,12 @@
 
   -1/--f1
    
-  Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1*
+  Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1^*
 
 
   -2/--f2
    
-  Comma separated file name list for condition2. The files should be the Gene/Exon expression file(s) for condition2*
+  Comma separated file name list for condition2. The files should be the Gene/Exon expression file(s) for condition2^*
 
 
   -o/--output-prefix
@@ -320,24 +295,23 @@
 

-\* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input. 
-
-#### Examples:
+^* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input. 
+
+Examples:

 Examples for running DE.sh de:

     DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

 _In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input._
-
  
-  
+
   
+
     DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare

 _In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input._
 


-### Branch 4: FileFormatConversion.sh
-
+## Branch 4: Alignment File Format Conversion (FileFormatConversion.sh)
 FileFormatConversion.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG. 

 Options:
@@ -385,28 +359,27 @@
 

-#### Examples:
+Examples:

 Examples for running FileFormatConversion.sh:

     FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

 _This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required._
-
    
-
+
   
+ 
     FileFormatConversion.sh -i in.sam -o out –m

 _This command will convert the input in.sam file to a mrf file._
-

-    
+
 
+   
     FileFormatConversion.sh -i in.bam -o out –d

 _This command will convert the input in.bam file to a BED file for visualization._
 


-## 3\. Input files and formats
-
-#### Possible input files for QC_SNP.sh (Depends on the selected options)
+# 3\. Input files and formats
+## Possible input files for QC_SNP.sh (Depends on the selected options):

   * Genome annotation GTF file 
   * Transcriptome reference sequences 
@@ -415,21 +388,20 @@
   * Alignment files in SAM format 
   * Reference sequences of Mitochondria 
   * Reference sequences of Ribosomal RNA 
-
-#### Possible input files for ExpressionEstimation.sh (Depends on the selected options)
+

+## Possible input files for ExpressionEstimation.sh (Depends on the selected options):

   * Genome annotation GTF file 
   * Transcriptome reference sequences 
   * RNA-Seq fastq or fastq.gz file 
   * Alignment files in SAM format 
-
-#### Possible input files for DE.sh
-
+

+## Possible input files for DE.sh:
   * Output files from ExpressionEstimation.sh: 
        * whole_GeneExpressionLevel_unique.txt
-       * whole_ExonExpressionLevel_unique.txt files. 
-
-#### Format specification of input files
+       * whole_ExonExpressionLevel_unique.txt files
+

+## Format specification of input files

 There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline. 

@@ -462,20 +434,19 @@

 The Genome Annotation GTF file must be in format GTF3.0.
 

-
-## 4\. Sample datasets
-
-### 1. c. elegans
-
-##### Download the following datasets:
-
+

+
+# 4\. Sample datasets
+## c. elegans
+
+Download the following datasets:
 [RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/SRR514378.fastq.gz/download)
 [Reference genome sequences file](http://sourceforge.net/projects/rseqflow/files/genome_c.elegans.fa.gz/download)
 [Reference annotation gtf file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_anno.gtf/download)
 [Reference transcriptome sequences file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_seq.fa/download)
 


-##### Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:
+Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

 Before running the pipeline, create a new directory for output results: 

@@ -503,10 +474,9 @@
      FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

 

-### 2. Human
-
-##### Download the following datasets:
-
+## Human
+
+Download the following datasets:
 [RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/ERR030893_2M.fastq.gz/download)
 [Reference genome sequences file](http://genomics.isi.edu/downloads/Human_genome_GRCh37_69.fa.gz)
 [Mitochondria sequence file](http://sourceforge.net/projects/rseqflow/files/Human_chrM.fa/download)
@@ -514,7 +484,7 @@
 [Reference transcriptome sequences file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_transcripts.fa.gz/download)
 


-##### Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:
+Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

 Before running the pipeline, create a new directory for output results:

UnixModeManual modified by J.Herstein

J.Herstein — Wed, 29 Jan 2014 22:34:44 -0000

--- v4
+++ v5
@@ -322,19 +322,19 @@

 \* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input. 

-  * **Examples**
+#### Examples:

 Examples for running DE.sh de:

     DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

 _In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input._
-  
+
  

     DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare

 _In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input._
-
+


 ### Branch 4: FileFormatConversion.sh

@@ -385,23 +385,24 @@
 

-  * **Examples**
-    
+#### Examples:
+ 
 Examples for running FileFormatConversion.sh:

     FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

 _This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required._
-    
+
    

     FileFormatConversion.sh -i in.sam -o out –m

 _This command will convert the input in.sam file to a mrf file._
-
+


     FileFormatConversion.sh -i in.bam -o out –d

 _This command will convert the input in.bam file to a BED file for visualization._
+


 ## 3\. Input files and formats

@@ -424,7 +425,9 @@

 #### Possible input files for DE.sh

-  * Output files from ExpressionEstimation.sh: whole_GeneExpressionLevel_unique.txt or whole_ExonExpressionLevel_unique.txt files. 
+  * Output files from ExpressionEstimation.sh: 
+       * whole_GeneExpressionLevel_unique.txt
+       * whole_ExonExpressionLevel_unique.txt files. 

 #### Format specification of input files

@@ -440,7 +443,7 @@
 For example:

 >hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none
-
+


   * **Genome Reference Sequences**

@@ -453,39 +456,42 @@
 >chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF

 >chrM
-
+


   * **Genome Annotation**

 The Genome Annotation GTF file must be in format GTF3.0.
-
+


 ## 4\. Sample datasets

 ### 1. c. elegans

-* **Download the following datasets:**
+##### Download the following datasets:

 [RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/SRR514378.fastq.gz/download)
 [Reference genome sequences file](http://sourceforge.net/projects/rseqflow/files/genome_c.elegans.fa.gz/download)
 [Reference annotation gtf file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_anno.gtf/download)
 [Reference transcriptome sequences file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_seq.fa/download)
-
-

-* **Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:** 
+

+
+##### Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

 Before running the pipeline, create a new directory for output results: 

     mkdir c.elegans_Output

+

 i) Branch 1: QC_SNP.sh 

      QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s

+

 ii) Branch 2: ExpressionEstimation.sh

      ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans

+

 iii) Branch 4: FileFormatConversion.sh

      FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
@@ -496,23 +502,24 @@

      FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

-
+

 ### 2. Human

-* **Download the following datasets:**
+##### Download the following datasets:

 [RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/ERR030893_2M.fastq.gz/download)
 [Reference genome sequences file](http://genomics.isi.edu/downloads/Human_genome_GRCh37_69.fa.gz)
 [Mitochondria sequence file](http://sourceforge.net/projects/rseqflow/files/Human_chrM.fa/download)
 [Reference annotation gtf file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_anno.gtf.gz/download)
 [Reference transcriptome sequences file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_transcripts.fa.gz/download)
-
-

-* **Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:** 
+

+
+##### Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

 Before running the pipeline, create a new directory for output results: 

      mkdir Human_Output
+


 Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs. 

@@ -520,14 +527,17 @@
      gunzip Human_gencodeV14_transcripts.fa.gz
      gunzip Human_genome_GRCh37.fa.gz

+

  i) Branch 1: QC_SNP.sh

      QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s

+

  ii) Branch 2: ExpressionEstimation.sh

      ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human

+

  iii) Branch 4: FileFormatConversion.sh

      FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r  Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome

UnixModeManual modified by J.Herstein

J.Herstein — Wed, 29 Jan 2014 22:22:01 -0000

--- v3
+++ v4
@@ -2,6 +2,7 @@

   * Manual of RseqFlow Unix Run Mode
     * 1\. Package installation and configuration
+

     * 2\. Branch usage
         * Branch 1: QC_SNP.sh
         * Branch 2: ExpressionEstimation.sh
@@ -18,51 +19,53 @@

 ## 1\. Package installation and configuration

-  * **Pre-install run environment**
+The following packages must be pre-installed:
     1. UNIX; 
     2. Python 2.7 and higher; 
     3. R 2.11 and higher; 
     4. GCC 

+### RseqFlow Installation and Configuration
+
 Step 1: Download the [source code](http://sourceforge.net/projects/rseqflow/files/rseqflow2-v2.1.tar.gz/download) to your directory, e.g '/home/user/rseqflow'. 

 Step 2: Enter your specified directory: 

        bash-3.2$ cd /home/user/rseqflow
-
+


 Step 3: Extract the tar file: 

        bash-3.2$ tar -xvf RseqFlow_source.tar.gz 
-
+


 Step 4: Enter the directory: 

        bash-3.2$ cd /home/user/reseqflow/RseqFlow_source
-
+


 Step 5: Set up some tools: 

        bash-3.2$ ./make.sh 
-
+


 Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run RseqFlow from a new terminal window. 

        bash-3.2$ source ./configure.sh 
-
+


 Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run RseqFlow. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. **Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file! **

        bash-3.2$ cat configure.sh >> /home/user/.bashrc
-
+


 Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'. 

        bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python
-
+


 Step 9: Now you can run the corresponding shell scripts for each branch. 
-
+

 ## 2\. Branch usage

 The pipeline consists of four branches: 
@@ -171,7 +174,7 @@
 
 

-  * **Examples**
+#### Examples:

 Examples for running QC_SNP.sh:

@@ -179,28 +182,28 @@
     QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

 _This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files._
-    
+
    

     QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa

 _This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report._
-    
+
    

     QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam

 _This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam)._
-    
+
    

     QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s

 _This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling._
-    
+
    

     QC_SNP.sh -1 Read1.fastq.gz -2 Read2.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –s --mito chrM.fa

 _This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments._

-
+

 ### Branch 2: ExpressionEstimation.sh

 ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique. 
@@ -264,19 +267,19 @@
 

-  * **Examples**
+#### Examples:

 Examples for running ExpressionEstimation.sh:

     ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

 _This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted._
-    
+
    

     ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam

 _This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read._
-
+


 ### Branch 3: DE.sh de

UnixModeManual modified by J.Herstein

J.Herstein — Wed, 29 Jan 2014 21:50:24 -0000

--- v2
+++ v3
@@ -1,4 +1,4 @@
-# Manual of `RseqFlow` Unix Run Mode
+# Manual of RseqFlow Unix Run Mode

   * Manual of RseqFlow Unix Run Mode
     * 1\. Package installation and configuration
@@ -8,13 +8,13 @@
         * Branch 3: DE.sh
         * Branch 4: FileFormatConversion.sh
     * 3\. Input files and formats
-        * (1) Possible input files for QC_SNP.sh (Depends on the selected options)
-        * (2) Possible input files for ExpressionEstimation.sh (Depends on the selected options)
-        * (3) Possible input files for DE.sh
-        * (4) Format specification of input files
+        * Possible input files for QC_SNP.sh (Depends on the selected options)
+        * Possible input files for ExpressionEstimation.sh (Depends on the selected options)
+        * Possible input files for DE.sh
+        * Format specification of input files
     * 4\. Sample datasets
-        * (1) c. elegans
-        * (2) Human
+        * c. elegans
+        * Human

 ## 1\. Package installation and configuration

@@ -46,12 +46,12 @@
        bash-3.2$ ./make.sh 

-Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run `RseqFlow` from a new terminal window. 
+Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run RseqFlow from a new terminal window. 

        bash-3.2$ source ./configure.sh 

-Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run `RseqFlow`. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. **Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file! **
+Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run RseqFlow. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. **Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file! **

        bash-3.2$ cat configure.sh >> /home/user/.bashrc

@@ -72,7 +72,7 @@
   * Branch 3: Differentially expressed gene identification 
   * Branch 4: Alignment file format conversion 

-### (1) Branch 1: QC_SNP.sh
+### Branch 1: QC_SNP.sh

 QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets. 

@@ -86,70 +86,96 @@

 If you already have RseqFlow alignment file(s) in SAM format for the genome and/or transcriptome, you may input the .sam files and the alignment step will be skipped. The post-alignment and/or SNP calling will be analyzed based on the merged result only. 

-  * **Options**
-
-Options 
-arguments 
-Meaning 
-
--f/--fastq 
-.fastq or .fastq.gz 
-RNA-Seq dataset with single end reads in FASTQ format (fastq or fastq.gz) 
-
--1/--read1 
-.fastq or .fastq.gz 
-The first reads file for paired end data in FASTQ format (fastq or fastq.gz) 
-
--2/--read2 
-.fastq or .fastq.gz 
-The second reads file for paired end data in FASTQ format (fastq or fastq.gz) 
-
--g/--genome 
-.fa 
-Genome reference sequences in FASTA format 
-
--c/-transcriptome 
-.fa 
-Transcriptome reference sequences in FASTA format 
-
--a/--annotation 
-.gtf 
-Reference annotation in GTF format 
-
--o/--output-prefix 
-prefix 
-Prefix of output files (default setting is “QC_SNP_output”) 
-
--p/--pre-QC 
-Generate pre-alignment QC reports based only on the RNA-Seq iput fastq file 
-
--q/--QC 
-Generate post-QC reports based on alignment 
-
--s/--SNP 
-Implement SNPs calling analysis 
-
-\--ribo 
-.fa 
-Reference sequences of ribosomal RNA in FASTA format 
-
-\--mito 
-.fa 
-Reference sequences of Mitochondrial chromosome in FASTA format 
-
-\--gSam 
-.sam 
-Alignment to genome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to genome) 
-
-\--tSam 
-.sam 
-Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome) 
-
-\--cleanup 
-Delete temporary files (default setting is to save all files) 
+
+Options:
+
+
+
+
+
+
+
+  
+  
+   
+
+
+  
+  
+   
+
+  
+  
+  
+   
+
+
+  
+  
+   
+
+
+  
+  
+  
+
+
+  
+  
+   
+
+
+  
+  
+  
+
+ 
+  
+  
+  
+
+ 
+  
+  
+  
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+ 
+  
+  
+  
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+
+  
+  
+   
+
+Option Argument Description
-f/--fastq .fastq or .fastq.gz RNA-Seq dataset with single end reads in FASTQ format (fastq or fastq.gz)
-1/--read1 .fastq or .fastq.gz The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2 .fastq or .fastq.gz The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-g/--genome .fa Genome reference sequences in FASTA format
-c/-transcriptome .fa Transcriptome reference sequences in FASTA format
-a/--annotation .gtf Reference annotation in GTF format
-o/--output-prefix  Prefix of output files (default setting is “QC_SNP_output”)
-p/--pre-QC  Generate pre-alignment QC reports based only on the RNA-Seq iput fastq file
-q/--QC  Generate post-QC reports based on alignment
-s/--SNP  Implement SNPs calling analysis
--ribo .fa Reference sequences of ribosomal RNA in FASTA format
--mito .fa Reference sequences of Mitochondrial chromosome in FASTA format
--gSam .sam Alignment to genome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to genome)
--tSam .sam Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup  Delete temporary files (default setting is to save all files)

   * **Examples**
-    
+
+Examples for running QC_SNP.sh:
+
+
     QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

 _This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files._
@@ -175,7 +201,7 @@
 _This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments._

-### (2) Branch 2: ExpressionEstimation.sh
+### Branch 2: ExpressionEstimation.sh

 ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique. 

@@ -185,47 +211,63 @@

 If you already have an alignment file to the transcriptome, you may input the .sam file and the alignment step will be skipped. ExpressionEstimation.sh will implement expression level estimation. 

-Please note that ExpressionEstimation.sh is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run `ExpressionEstimation.sh` without a samfile to let RseqFlow create its own, otherwise there is no guarantee that `ExpressionEstimation` will use only unique alignments. 
-
-  * **Options**
-
-Options 
-arguments 
-Meaning 
-
--f/--fastq 
-.fastq or .fastq.gz 
-The reads file for single end data in FASTQ format (fastq or fastq.gz) 
-
--1/--read1 
-.fastq or .fastq.gz 
-The first reads file for paired end data in FASTQ format (fastq or fastq.gz) 
-
--2/--read2 
-.fastq or .fastq.gz 
-The second reads file for paired end data in FASTQ format (fastq or fastq.gz) 
-
--c/-transcriptome 
-.fa 
-Transcriptome reference sequences in FASTA format 
-
--a/--annotation 
-.gtf 
-Reference annotation in GTF format 
-
--o/--output-prefix 
-prefix 
-Prefix of output files (default is Expression_output) 
-
-\--tSam 
-.sam 
-Alignments to the transcriptome in SAM format 
-
-\--cleanup 
-Delete temporary files (default setting is to save all files) 
+Please note that ExpressionEstimation.sh is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run ExpressionEstimation.sh without a samfile to let RseqFlow create its own, otherwise there is no guarantee that ExpressionEstimation will use only unique alignments. 
+
+
+Options:
+
+
+
+
+
+
+
+  
+  
+   
+
+
+  
+  
+   
+
+  
+  
+  
+   
+
+
+  
+  
+  
+
+
+  
+  
+   
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+
+  
+  
+   
+
+Option Argument Description
-f/--fastq .fastq or .fastq.gz The reads file for single end data in FASTQ format (fastq or fastq.gz)
-1/--read1 .fastq or .fastq.gz The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2 .fastq or .fastq.gz The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-c/-transcriptome .fa Transcriptome reference sequences in FASTA format
-a/--annotation .gtf Reference annotation in GTF format
-o/--output-prefix  Prefix of output files (default setting is “Expression_output”)
--tSam .sam Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup  Delete temporary files (default setting is to save all files)
+

   * **Examples**

+Examples for running ExpressionEstimation.sh:
+
     ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

 _This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted._
@@ -236,40 +278,51 @@
 _This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read._

-### (3) Branch 3: DE.sh
-
-The de command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input. 
-
-  * **Options**
-
-1) DE.sh de [options]
-
-Options 
-arguments 
-Meaning 
-
-\--c1 
-Condition1_ID 
-ID for condition1 (e.g. Control) 
-
-\--c2 
-Condition2_ID 
-ID for condition2 (e.g. Case) 
-
--1/--f1 
-Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1`*`
-
--2/--f2 
-Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition2`*`
-
--o/--output-prefix 
-prefix 
-Prefix of output files 
-
-* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input. 
+### Branch 3: DE.sh de
+
+The "de" command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input. 
+
+Options for DE.sh de:
+
+
+
+
+
+
+
+  
+  
+   
+
+  
+  
+  
+   
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+Option Argument Description
--c1 Condition1_ID ID for condition1 (e.g. Control)
--c2 Condition2_ID ID for condition2 (e.g. Case)
-1/--f1  Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1*
-2/--f2  Comma separated file name list for condition2. The files should be the Gene/Exon expression file(s) for condition2*
-o/--output-prefix  Prefix of output files
+
+
+\* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input. 

   * **Examples**

+Examples for running DE.sh de:
+
     DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

 _In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input._
@@ -280,42 +333,59 @@
 _In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input._

-### (4) Branch 4: FileFormatConversion.sh
+### Branch 4: FileFormatConversion.sh

 FileFormatConversion.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG. 

-  * **Options**
-
-Options 
-arguments 
-Meaning 
-
--i/--input 
-File to convert 
-Input file to convert, with the correct file suffix (.sam, .bam, or .mrf) 
-
--o/--output 
-prefix 
-Prefix of output files 
-
--r/--reference 
-Reference sequence file that was used to align the input file 
-If a file is being converted from SAM to BAM format, the .sam file should contain either the appropriate @SQ headers or you need to use the --reference option to supply the reference file that was used to create the .sam file. 
-
--b/--toBAM 
-Convert SAM to BAM 
-
--m/--toMRF 
-Convert SAM to MRF 
-
--d/--toBED 
-Convert BAM to BED 
-
--w/--toWIG 
-Convert SAM to WIG or MRF to WIG 
+Options:
+
+
+
+
+
+
+
+  
+  
+   
+
+
+  
+  
+  
+
+  
+  
+  
+   
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+
+  
+  
+  
+
+Option Argument Description
-i/--input .sam, .bam, .mrf Input file to convert, with the correct file suffix (.sam, .bam, or .mrf)
-o/--output-prefix  Prefix of output files
-r/--reference .fa Reference sequence file in FASTA format that was used to align the input file. If a file is being converted from SAM to BAM format, the .sam file should contain either the appropriate @SQ headers or you need to use the --reference option to supply the reference file that was used to create the .sam file.
-b/--toBAM  Convert SAM to BAM
-m/--toMRF  Convert SAM to MRF
-d/--toBED  Convert BAM to BED
-w/--toWIG  Convert SAM to WIG or MRF to WIG
+

   * **Examples**

+Examples for running FileFormatConversion.sh:
+
     FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

 _This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required._
@@ -332,7 +402,7 @@

 ## 3\. Input files and formats

-### (1) Possible input files for QC_SNP.sh (Depends on the selected options)
+#### Possible input files for QC_SNP.sh (Depends on the selected options)

   * Genome annotation GTF file 
   * Transcriptome reference sequences 
@@ -342,64 +412,64 @@
   * Reference sequences of Mitochondria 
   * Reference sequences of Ribosomal RNA 

-### (2) Possible input files for ExpressionEstimation.sh (Depends on the selected options)
+#### Possible input files for ExpressionEstimation.sh (Depends on the selected options)

   * Genome annotation GTF file 
   * Transcriptome reference sequences 
   * RNA-Seq fastq or fastq.gz file 
   * Alignment files in SAM format 

-### (3) Possible input files for DE.sh
-
-  * Output files from `ExpressionEstimation.sh`: whole_GeneExpressionLevel_unique.txt or whole_ExonExpressionLevel_unique.txt files. 
-
-### (4) Format specification of input files
-
-The following annotations and references should be in separate files. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline. 
-
-  * **Genome Annotation GTF file**
-
-Format from GTF 2.0 to GTF3.0 is required; 
+#### Possible input files for DE.sh
+
+  * Output files from ExpressionEstimation.sh: whole_GeneExpressionLevel_unique.txt or whole_ExonExpressionLevel_unique.txt files. 
+
+#### Format specification of input files
+
+There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline. 
+

   * **Transcriptome Reference Sequences**

 The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info. 

-“>$GenomeName_$AnnotationSource_$TranscriptsID=$Chromosome:$Start-$End [extra info]”
-
-For example, 
-
-“>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none” 
+GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End [extra info]
+
+For example:
+
+>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none
+

   * **Genome Reference Sequences**

 The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info. 

-“>$chromsome” 
-
-For example, 
-
-“>chr1 dna:chromosome” 
-
-">chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF"
-
-“>chrM” 
+For example: 
+
+>chr1 dna:chromosome 
+
+>chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF
+
+>chrM
+
+
+  * **Genome Annotation**
+
+The Genome Annotation GTF file must be in format GTF3.0.
+

 ## 4\. Sample datasets

-### (1) c. elegans
-
-We offer the sample datasets for c. elegans: 
-
-[(1) RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/SRR514378.fastq.gz/download)
-
-[(2) Reference genome sequences file](http://sourceforge.net/projects/rseqflow/files/genome_c.elegans.fa.gz/download)
-
-[(3) Reference annotation gtf file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_anno.gtf/download)
-
-[(4) Reference transcriptome sequences file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_seq.fa/download)
-
-(5) Commands for running three of the branches: 
+### 1. c. elegans
+
+* **Download the following datasets:**
+
+[RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/SRR514378.fastq.gz/download)
+[Reference genome sequences file](http://sourceforge.net/projects/rseqflow/files/genome_c.elegans.fa.gz/download)
+[Reference annotation gtf file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_anno.gtf/download)
+[Reference transcriptome sequences file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_seq.fa/download)
+
+

+* **Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:** 

 Before running the pipeline, create a new directory for output results: 

@@ -409,11 +479,11 @@

      QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s

-ii) Branch 2: `ExpressionEstimation.sh`
+ii) Branch 2: ExpressionEstimation.sh

      ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans

-iii) Branch 4: `FileFormatConversion.sh`
+iii) Branch 4: FileFormatConversion.sh

      FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

@@ -423,25 +493,25 @@

      FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

-### (2) Human
-
-[(1) RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/ERR030893_2M.fastq.gz/download)
-
-[(2) Reference genome sequences file](http://genomics.isi.edu/downloads/Human_genome_GRCh37_69.fa.gz)
-
-[(3) Mitochondria sequence file](http://sourceforge.net/projects/rseqflow/files/Human_chrM.fa/download)
-
-[(4) Reference annotation gtf file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_anno.gtf.gz/download)
-
-[(5) Reference transcriptome sequences file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_transcripts.fa.gz/download)
-
-(6) Commands for running branches #1,#2, and #4: 
-
- Before running the pipeline, create a new directory for output results: 
+
+### 2. Human
+
+* **Download the following datasets:**
+
+[RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/ERR030893_2M.fastq.gz/download)
+[Reference genome sequences file](http://genomics.isi.edu/downloads/Human_genome_GRCh37_69.fa.gz)
+[Mitochondria sequence file](http://sourceforge.net/projects/rseqflow/files/Human_chrM.fa/download)
+[Reference annotation gtf file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_anno.gtf.gz/download)
+[Reference transcriptome sequences file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_transcripts.fa.gz/download)
+
+

+* **Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:** 
+
+Before running the pipeline, create a new directory for output results: 

      mkdir Human_Output

- Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs. 
+Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs. 

      gunzip Human_gencodeV14_anno.gtf.gz
      gunzip Human_gencodeV14_transcripts.fa.gz

Option	Argument	Description
-f/--fastq	.fastq or .fastq.gz	RNA-Seq dataset with single end reads in FASTQ format (fastq or fastq.gz)
-1/--read1	.fastq or .fastq.gz	The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2	.fastq or .fastq.gz	The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-g/--genome	.fa	Genome reference sequences in FASTA format
-c/-transcriptome	.fa	Transcriptome reference sequences in FASTA format
-a/--annotation	.gtf	Reference annotation in GTF format
-o/--output-prefix		Prefix of output files (default setting is “QC_SNP_output”)
-p/--pre-QC		Generate pre-alignment QC reports based only on the RNA-Seq iput fastq file
-q/--QC		Generate post-QC reports based on alignment
-s/--SNP		Implement SNPs calling analysis
--ribo	.fa	Reference sequences of ribosomal RNA in FASTA format
--mito	.fa	Reference sequences of Mitochondrial chromosome in FASTA format
--gSam	.sam	Alignment to genome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to genome)
--tSam	.sam	Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup		Delete temporary files (default setting is to save all files)

Option	Argument	Description
-f/--fastq	.fastq or .fastq.gz	The reads file for single end data in FASTQ format (fastq or fastq.gz)
-1/--read1	.fastq or .fastq.gz	The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2	.fastq or .fastq.gz	The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-c/-transcriptome	.fa	Transcriptome reference sequences in FASTA format
-a/--annotation	.gtf	Reference annotation in GTF format
-o/--output-prefix		Prefix of output files (default setting is “Expression_output”)
--tSam	.sam	Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup		Delete temporary files (default setting is to save all files)

Option	Argument	Description
--c1	Condition1_ID	ID for condition1 (e.g. Control)
--c2	Condition2_ID	ID for condition2 (e.g. Case)
-1/--f1		Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1*
-2/--f2		Comma separated file name list for condition2. The files should be the Gene/Exon expression file(s) for condition2*
-o/--output-prefix		Prefix of output files

Option	Argument	Description
-i/--input	.sam, .bam, .mrf	Input file to convert, with the correct file suffix (.sam, .bam, or .mrf)
-o/--output-prefix		Prefix of output files
-r/--reference	.fa	Reference sequence file in FASTA format that was used to align the input file. If a file is being converted from SAM to BAM format, the .sam file should contain either the appropriate @SQ headers or you need to use the --reference option to supply the reference file that was used to create the .sam file.
-b/--toBAM		Convert SAM to BAM
-m/--toMRF		Convert SAM to MRF
-d/--toBED		Convert BAM to BED
-w/--toWIG		Convert SAM to WIG or MRF to WIG

UnixModeManual modified by J.Herstein

J.Herstein — Wed, 29 Jan 2014 20:12:21 -0000

--- v1
+++ v2
@@ -2,19 +2,19 @@

   * Manual of RseqFlow Unix Run Mode
     * 1\. Package installation and configuration
-    * 2\. Usage of each branch
-      * (1) Branch 1: QC_SNP.sh
-      * (2) Branch 2: ExpressionEstimation.sh
-      * (3) Branch 3: DE.sh
-      * (4) Branch 4: FileFormatConversion.sh
+    * 2\. Branch usage
+        * Branch 1: QC_SNP.sh
+        * Branch 2: ExpressionEstimation.sh
+        * Branch 3: DE.sh
+        * Branch 4: FileFormatConversion.sh
     * 3\. Input files and formats
-      * (1) Possible input files for QC_SNP.sh (Depends on the selected options)
-      * (2) Possible input files for ExpressionEstimation.sh (Depends on the selected options)
-      * (3) Possible input files for DE.sh
-      * (4) Format specification of input files
+        * (1) Possible input files for QC_SNP.sh (Depends on the selected options)
+        * (2) Possible input files for ExpressionEstimation.sh (Depends on the selected options)
+        * (3) Possible input files for DE.sh
+        * (4) Format specification of input files
     * 4\. Sample datasets
-      * (1) c. elegans
-      * (2) Human
+        * (1) c. elegans
+        * (2) Human

 ## 1\. Package installation and configuration

@@ -24,39 +24,46 @@
     3. R 2.11 and higher; 
     4. GCC 

-> Step 1: Download the [source code](http://code.google.com/p/rseqflow/downloads/detail?name=RseqFlow_source.tar.gz&can=2&q=) to your directory, e.g '/home/user/rseqflow'. 
-
-> Step 2: Enter your specified directory: 
->     
->       bash-3.2$ cd /home/user/rseqflow
-
-> Step 3: Extract the tar file: 
->     
->       bash-3.2$ tar -xvf RseqFlow_source.tar.gz 
-
-> Step 4: Enter the directory: 
->     
->       bash-3.2$ cd /home/user/reseqflow/RseqFlow_source
-
-> Step 5: Set up some tools: 
->     
->       bash-3.2$ ./make.sh 
-
-> Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run `RseqFlow` from a new terminal window. 
->     
->       bash-3.2$ source ./configure.sh 
-
-> Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run `RseqFlow`. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. **Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file! **
->     
->       bash-3.2$ cat configure.sh >> /home/user/.bashrc
-
-> Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'. 
->     
->       bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python
-
-> Step 9: Now you can run the corresponding shell scripts for each branch. 
-
-## 2\. Usage of each branch
+Step 1: Download the [source code](http://sourceforge.net/projects/rseqflow/files/rseqflow2-v2.1.tar.gz/download) to your directory, e.g '/home/user/rseqflow'. 
+
+Step 2: Enter your specified directory: 
+     
+       bash-3.2$ cd /home/user/rseqflow
+
+
+Step 3: Extract the tar file: 
+     
+       bash-3.2$ tar -xvf RseqFlow_source.tar.gz 
+
+
+Step 4: Enter the directory: 
+     
+       bash-3.2$ cd /home/user/reseqflow/RseqFlow_source
+
+
+Step 5: Set up some tools: 
+     
+       bash-3.2$ ./make.sh 
+
+
+Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run `RseqFlow` from a new terminal window. 
+     
+       bash-3.2$ source ./configure.sh 
+
+
+Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run `RseqFlow`. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. **Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file! **
+     
+       bash-3.2$ cat configure.sh >> /home/user/.bashrc
+
+
+Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'. 
+     
+       bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python
+
+
+Step 9: Now you can run the corresponding shell scripts for each branch. 
+
+## 2\. Branch usage

 The pipeline consists of four branches: 

@@ -65,19 +72,19 @@
   * Branch 3: Differentially expressed gene identification 
   * Branch 4: Alignment file format conversion 

-### (1) Branch 1: `QC_SNP.sh`
-
-> QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets. 
-
-> For pre-alignment quality control analysis only, a fastq(.gz) file is required as input, either a single fastq for single end data or Read1.fastq and Read2.fastq for paired end data. 
-
-> For post-alignment quality control analysis, the fastq files(s), genome reference sequences, transcriptome reference sequences, and annotation file must be input. QC_SNP.sh will align the RNA-Seq dataset to the genome and transcriptome, then merge the alignment results for quality control and SNP calling. 
-
-> You may input ribosomal RNA reference sequences and/or mitochondrial reference sequences if desired. 
-
-> If bowtie2 indexes already exist in the same directory as the genome and/or transcriptome reference files, `RseqFlow` will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference files for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the corresponding genome/transcriptome reference. This will bypass index creation for future runs and reduce run times. 
-
-> If you already have alignment file(s) in SAM format for the genome and/or transcriptome, you may input the .sam files and the alignment step will be skipped. The post-alignment and/or SNP calling will be analyzed based on the merged result only. 
+### (1) Branch 1: QC_SNP.sh
+
+QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets. 
+
+For pre-alignment quality control analysis only, a fastq(.gz) file is required as input, either a single fastq for single end data or Read1.fastq and Read2.fastq for paired end data. 
+
+For post-alignment quality control analysis, the fastq files(s), genome reference sequences, transcriptome reference sequences, and annotation file must be input. QC_SNP.sh will align the RNA-Seq dataset to the genome and transcriptome, then merge the alignment results for quality control and SNP calling. 
+
+You may input ribosomal RNA reference sequences and/or mitochondrial reference sequences if desired. 
+
+If bowtie2 indexes already exist in the same directory as the genome and/or transcriptome reference files, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference files for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the corresponding genome/transcriptome reference. This will bypass index creation for future runs and reduce run times. 
+
+If you already have RseqFlow alignment file(s) in SAM format for the genome and/or transcriptome, you may input the .sam files and the alignment step will be skipped. The post-alignment and/or SNP calling will be analyzed based on the merged result only. 

   * **Options**

@@ -145,35 +152,40 @@

     QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

-> _This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files._
-    
+_This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files._
+    
+
     QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa

-> _This command will align the Reads.fastq.gz to the genome (`RefGene.fa`), transcriptome (`RefTran.fa`) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report._
-    
+_This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report._
+    
+
     QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam

-> _This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (`AlignedToGenome.sam`) and transcriptome(`AlignedToTran.sam`)._
-    
+_This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam)._
+    
+
     QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s

-> _This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (`RefGene.fa`) and transcriptome (`RefTran.fa`) and implement the pre and post alignment quality control analyses and SNP calling._
-    
+_This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling._
+    
+
     QC_SNP.sh -1 Read1.fastq.gz -2 Read2.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –s --mito chrM.fa

-> _This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (`RefGene.fa`), transcriptome (`RefTran.fa`) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments._
-
-### (2) Branch 2: `ExpressionEstimation.sh`
-
-> `ExpressionEstimation.sh` will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique. 
-
-> If bowtie2 indexes already exist in the same directory as the transcriptome reference file, `RseqFlow` will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference file for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the transcriptome reference. This will bypass index creation for future runs and reduce run times. 
-
-> If a fastq file and the reference sequences of the transcriptome are supplied as input, `ExpressionEstimation.sh` will align the RNA-Seq dataset to the transcriptome and then implement expression level estimation. 
-
-> If you already have an alignment file to the transcriptome, you may input the .sam file and the alignment step will be skipped. `ExpressionEstimation.sh` will implement expression level estimation. 
-
-> Please note that `ExpressionEstimation.sh` is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run `ExpressionEstimation.sh` without a samfile to let `RseqFlow` create its own, otherwise there is no guarantee that `ExpressionEstimation` will use only unique alignments. 
+_This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments._
+
+
+### (2) Branch 2: ExpressionEstimation.sh
+
+ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique. 
+
+If bowtie2 indexes already exist in the same directory as the transcriptome reference file, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference file for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the transcriptome reference. This will bypass index creation for future runs and reduce run times. 
+
+If a fastq file and the reference sequences of the transcriptome are supplied as input, ExpressionEstimation.sh will align the RNA-Seq dataset to the transcriptome and then implement expression level estimation. 
+
+If you already have an alignment file to the transcriptome, you may input the .sam file and the alignment step will be skipped. ExpressionEstimation.sh will implement expression level estimation. 
+
+Please note that ExpressionEstimation.sh is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run `ExpressionEstimation.sh` without a samfile to let RseqFlow create its own, otherwise there is no guarantee that `ExpressionEstimation` will use only unique alignments. 

   * **Options**

@@ -216,19 +228,21 @@

     ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

-> _This command will align the single end Reads.fastq to the transcriptome reference sequence (`RefTran.fa`) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted._
-    
+_This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted._
+    
+
     ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam

-> _This command uses the supplied .sam file that has already been aligned to the transcriptome. `RseqFlow` will implement the expression level estimation from the distinct best alignment for each read._
-
-### (3) Branch 3: `DE.sh`
-
-> The de command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from `ExpressionEstimation.sh`. If both conditions have only one sample, the `ExonExpressionLevel_unique` files from `ExpressionEstimation.sh` are used as input. Otherwise, if either condition has more than one sample, the `GeneExpressionLevel_unique` files from `ExpressionEstimation.sh` are used as input. 
+_This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read._
+
+
+### (3) Branch 3: DE.sh
+
+The de command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input. 

   * **Options**

-1) DE.sh de `[options]`
+1) DE.sh de [options]

 Options 
 arguments 
@@ -252,21 +266,23 @@
 prefix 
 Prefix of output files 

-`*` If both conditions have only one sample each, use the `ExonExpressionLevel_unique` files from `ExpressionEstimation.sh` as input. Otherwise, if both conditions have more than one sample each, use the `GeneExpressionLevel_unique` files from `ExpressionEstimation.sh` as input. 
+* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input. 

   * **Examples**

     DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

-> _In this case, there are two samples with the disease condition, so the `GeneExpressionLevel_unique` files from `ExpressionEstimation.sh` are used as input._
-    
+_In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input._
+  
+  
     DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare

-> _In this case, both conditions have only a single sample each, so the `ExonExpressionLevel` files from `ExpressionEstimation.sh` are used as input._
-
-### (4) Branch 4: `FileFormatConversion.sh`
-
-> `FileFormatConversion_version_1.sh` will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG. 
+_In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input._
+
+
+### (4) Branch 4: FileFormatConversion.sh
+
+FileFormatConversion.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG. 

   * **Options**

@@ -302,15 +318,17 @@

     FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

-> _This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required._
-    
+_This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required._
+    
+
     FileFormatConversion.sh -i in.sam -o out –m

-> _This command will convert the input in.sam file to a mrf file._
+_This command will convert the input in.sam file to a mrf file._
+

     FileFormatConversion.sh -i in.bam -o out –d

-> _This command will convert the input in.bam file to a BED file for visualization._
+_This command will convert the input in.bam file to a BED file for visualization._

 ## 3\. Input files and formats

@@ -324,7 +342,7 @@
   * Reference sequences of Mitochondria 
   * Reference sequences of Ribosomal RNA 

-### (2) Possible input files for `ExpressionEstimation.sh` (Depends on the selected options)
+### (2) Possible input files for ExpressionEstimation.sh (Depends on the selected options)

   * Genome annotation GTF file 
   * Transcriptome reference sequences 
@@ -333,39 +351,39 @@

 ### (3) Possible input files for DE.sh

-  * Output files from `ExpressionEstimation.sh`: `whole_GeneExpressionLevel_unique.txt` or `whole_ExonExpressionLevel_unique.txt` files. 
+  * Output files from `ExpressionEstimation.sh`: whole_GeneExpressionLevel_unique.txt or whole_ExonExpressionLevel_unique.txt files. 

 ### (4) Format specification of input files

-> The following annotations and references should be in separate files. `RseqFlow` will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the `RseqFlow` pipeline. 
+The following annotations and references should be in separate files. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline. 

   * **Genome Annotation GTF file**

-> Format from GTF 2.0 to GTF3.0 is required; 
+Format from GTF 2.0 to GTF3.0 is required; 

   * **Transcriptome Reference Sequences**

-> The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info. 
-
-> `“>$GenomeName_$AnnotationSource_$TranscriptsID=$Chromosome:$Start-$End [extra info]”`
-
-> For example, 
-
-> “>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none” 
+The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info. 
+
+“>$GenomeName_$AnnotationSource_$TranscriptsID=$Chromosome:$Start-$End [extra info]”
+
+For example, 
+
+“>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none” 

   * **Genome Reference Sequences**

-> The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info. 
-
-> “>$chromsome” 
-
-> For example, 
-
-> “>chr1 dna:chromosome” 
-
-> ">chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF"
-
-> “>chrM” 
+The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info. 
+
+“>$chromsome” 
+
+For example, 
+
+“>chr1 dna:chromosome” 
+
+">chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF"
+
+“>chrM” 

 ## 4\. Sample datasets

@@ -373,76 +391,76 @@

 We offer the sample datasets for c. elegans: 

-[(1) RNA-Seq dataset](http://code.google.com/p/rseqflow/downloads/detail?name=SRR514378.fastq.gz&can=2&q=)
-
-[(2) Reference genome sequences file](https://code.google.com/p/rseqflow/downloads/detail?name=genome_c.elegans.fa.gz&can=2&q=)
-
-[(3) Reference annotation gtf file](https://code.google.com/p/rseqflow/downloads/detail?name=c.elegans_refseq_anno.gtf&can=2&q=)
-
-[(4) Reference transcriptome sequences file](https://code.google.com/p/rseqflow/downloads/detail?name=c.elegans_refseq_seq.fa&can=2&q=)
+[(1) RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/SRR514378.fastq.gz/download)
+
+[(2) Reference genome sequences file](http://sourceforge.net/projects/rseqflow/files/genome_c.elegans.fa.gz/download)
+
+[(3) Reference annotation gtf file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_anno.gtf/download)
+
+[(4) Reference transcriptome sequences file](http://sourceforge.net/projects/rseqflow/files/c.elegans_refseq_seq.fa/download)

 (5) Commands for running three of the branches: 

-> Before running the pipeline, create a new directory for output results: 
->     
->     mkdir c.elegans_Output
-
-> i) Branch 1: `QC_SNP.sh`
->     
->     QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s
-
-> ii) Branch 2: `ExpressionEstimation.sh`
->     
->     ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans
-
-> iii) Branch 4: `FileFormatConversion.sh`
->     
->     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
->     
->     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome.sam -m -o c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome
->     
->     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -w -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
->     
->     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
+Before running the pipeline, create a new directory for output results: 
+    
+    mkdir c.elegans_Output
+
+i) Branch 1: QC_SNP.sh 
+    
+     QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s
+
+ii) Branch 2: `ExpressionEstimation.sh`
+    
+     ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans
+
+iii) Branch 4: `FileFormatConversion.sh`
+     
+     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
+    
+     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome.sam -m -o c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome
+     
+     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -w -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
+     
+     FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

 ### (2) Human

-[(1) RNA-Seq dataset](http://code.google.com/p/rseqflow/downloads/detail?name=ERR030893_2M.fastq.gz&can=2&q=)
+[(1) RNA-Seq dataset](http://sourceforge.net/projects/rseqflow/files/ERR030893_2M.fastq.gz/download)

 [(2) Reference genome sequences file](http://genomics.isi.edu/downloads/Human_genome_GRCh37_69.fa.gz)

-[(3) Mitochondria sequence file](http://code.google.com/p/rseqflow/downloads/detail?name=Human_chrM.fa&can=2&q=)
-
-[(4) Reference annotation gtf file (gencodeV14)](http://code.google.com/p/rseqflow/downloads/detail?name=Human_gencodeV14_anno.gtf.gz&can=2&q=)
-
-[(5) Reference transcriptome sequences file (gencodeV14)](http://code.google.com/p/rseqflow/downloads/detail?name=Human_gencodeV14_transcripts.fa.gz&can=2&q=)
-
-(6) Commands for running three of the branches: 
-
-> Before running the pipeline, create a new directory for output results: 
->     
->     mkdir Human_Output
-> 
-> Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs. 
->     
->     gunzip Human_gencodeV14_anno.gtf.gz
->     gunzip Human_gencodeV14_transcripts.fa.gz
->     gunzip Human_genome_GRCh37.fa.gz
-
-> i) Branch 1: `QC_SNP.sh`
->     
->     QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s
-
-> ii) Branch 2: `ExpressionEstimation.sh`
->     
->     ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human
-
-> iii) Branch 4: `FileFormatConversion.sh`
->     
->     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r  Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome
->     
->     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_transcriptome.sam -m -o Human_Output/QC_human_Bowtie2_transcriptome
->     
->     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -w -o Human_Output/QC_human_Bowtie2_genome
->     
->     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.bam -d -o Human_Output/QC_human_Bowtie2_genome
+[(3) Mitochondria sequence file](http://sourceforge.net/projects/rseqflow/files/Human_chrM.fa/download)
+
+[(4) Reference annotation gtf file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_anno.gtf.gz/download)
+
+[(5) Reference transcriptome sequences file (gencodeV14)](http://sourceforge.net/projects/rseqflow/files/Human_gencodeV14_transcripts.fa.gz/download)
+
+(6) Commands for running branches #1,#2, and #4: 
+
+ Before running the pipeline, create a new directory for output results: 
+     
+     mkdir Human_Output
+ 
+ Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs. 
+     
+     gunzip Human_gencodeV14_anno.gtf.gz
+     gunzip Human_gencodeV14_transcripts.fa.gz
+     gunzip Human_genome_GRCh37.fa.gz
+
+ i) Branch 1: QC_SNP.sh
+     
+     QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s
+
+ ii) Branch 2: ExpressionEstimation.sh
+     
+     ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human
+
+ iii) Branch 4: FileFormatConversion.sh
+     
+     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r  Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome
+     
+     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_transcriptome.sam -m -o Human_Output/QC_human_Bowtie2_transcriptome
+     
+     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -w -o Human_Output/QC_human_Bowtie2_genome
+     
+     FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.bam -d -o Human_Output/QC_human_Bowtie2_genome

UnixModeManual modified by Anonymous

Anonymous — Wed, 29 Jan 2014 18:26:42 -0000

Manual of `RseqFlow` Unix Run Mode

Manual of RseqFlow Unix Run Mode
- 1. Package installation and configuration
- 2. Usage of each branch
- (1) Branch 1: QC_SNP.sh
- (2) Branch 2: ExpressionEstimation.sh
- (3) Branch 3: DE.sh
- (4) Branch 4: FileFormatConversion.sh
- 3. Input files and formats
- (1) Possible input files for QC_SNP.sh (Depends on the selected options)
- (2) Possible input files for ExpressionEstimation.sh (Depends on the selected options)
- (3) Possible input files for DE.sh
- (4) Format specification of input files
- 4. Sample datasets
- (1) c. elegans
- (2) Human

1. Package installation and configuration

Pre-install run environment
1. UNIX;
2. Python 2.7 and higher;
3. R 2.11 and higher;
4. GCC

Step 1: Download the source code to your directory, e.g '/home/user/rseqflow'.

Step 2: Enter your specified directory:
  bash-3.2$ cd /home/user/rseqflow
Step 3: Extract the tar file:
  bash-3.2$ tar -xvf RseqFlow_source.tar.gz
Step 4: Enter the directory:
  bash-3.2$ cd /home/user/reseqflow/RseqFlow_source
Step 5: Set up some tools:
  bash-3.2$ ./make.sh
Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run RseqFlow from a new terminal window.
  bash-3.2$ source ./configure.sh
Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run RseqFlow. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file!
  bash-3.2$ cat configure.sh >> /home/user/.bashrc
Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'.
  bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python
Step 9: Now you can run the corresponding shell scripts for each branch.

2. Usage of each branch

The pipeline consists of four branches:

Branch 1: Quality Control and SNP calling
Branch 2: Expression level estimation
Branch 3: Differentially expressed gene identification
Branch 4: Alignment file format conversion

(1) Branch 1: `QC_SNP.sh`

QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets.

For pre-alignment quality control analysis only, a fastq(.gz) file is required as input, either a single fastq for single end data or Read1.fastq and Read2.fastq for paired end data.

For post-alignment quality control analysis, the fastq files(s), genome reference sequences, transcriptome reference sequences, and annotation file must be input. QC_SNP.sh will align the RNA-Seq dataset to the genome and transcriptome, then merge the alignment results for quality control and SNP calling.

You may input ribosomal RNA reference sequences and/or mitochondrial reference sequences if desired.

If bowtie2 indexes already exist in the same directory as the genome and/or transcriptome reference files, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference files for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the corresponding genome/transcriptome reference. This will bypass index creation for future runs and reduce run times.

If you already have alignment file(s) in SAM format for the genome and/or transcriptome, you may input the .sam files and the alignment step will be skipped. The post-alignment and/or SNP calling will be analyzed based on the merged result only.

Options

Options
arguments
Meaning

-f/--fastq
.fastq or .fastq.gz
RNA-Seq dataset with single end reads in FASTQ format (fastq or fastq.gz)

-1/--read1
.fastq or .fastq.gz
The first reads file for paired end data in FASTQ format (fastq or fastq.gz)

-2/--read2
.fastq or .fastq.gz
The second reads file for paired end data in FASTQ format (fastq or fastq.gz)

-g/--genome
.fa
Genome reference sequences in FASTA format

-c/-transcriptome
.fa
Transcriptome reference sequences in FASTA format

-a/--annotation
.gtf
Reference annotation in GTF format

-o/--output-prefix
prefix
Prefix of output files (default setting is “QC_SNP_output”)

-p/--pre-QC
Generate pre-alignment QC reports based only on the RNA-Seq iput fastq file

-q/--QC
Generate post-QC reports based on alignment

-s/--SNP
Implement SNPs calling analysis

--ribo
.fa
Reference sequences of ribosomal RNA in FASTA format

--mito
.fa
Reference sequences of Mitochondrial chromosome in FASTA format

--gSam
.sam
Alignment to genome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to genome)

--tSam
.sam
Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)

--cleanup
Delete temporary files (default setting is to save all files)

Examples

QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files.

QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa

This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report.

QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam

This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam).

QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s

This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling.

QC_SNP.sh -1 Read1.fastq.gz -2 Read2.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –s --mito chrM.fa

This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments.

(2) Branch 2: `ExpressionEstimation.sh`

ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique.

If bowtie2 indexes already exist in the same directory as the transcriptome reference file, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference file for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the transcriptome reference. This will bypass index creation for future runs and reduce run times.

If a fastq file and the reference sequences of the transcriptome are supplied as input, ExpressionEstimation.sh will align the RNA-Seq dataset to the transcriptome and then implement expression level estimation.

If you already have an alignment file to the transcriptome, you may input the .sam file and the alignment step will be skipped. ExpressionEstimation.sh will implement expression level estimation.

Please note that ExpressionEstimation.sh is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run ExpressionEstimation.sh without a samfile to let RseqFlow create its own, otherwise there is no guarantee that ExpressionEstimation will use only unique alignments.

Options

Options
arguments
Meaning

-f/--fastq
.fastq or .fastq.gz
The reads file for single end data in FASTQ format (fastq or fastq.gz)

-1/--read1
.fastq or .fastq.gz
The first reads file for paired end data in FASTQ format (fastq or fastq.gz)

-2/--read2
.fastq or .fastq.gz
The second reads file for paired end data in FASTQ format (fastq or fastq.gz)

-c/-transcriptome
.fa
Transcriptome reference sequences in FASTA format

-a/--annotation
.gtf
Reference annotation in GTF format

-o/--output-prefix
prefix
Prefix of output files (default is Expression_output)

--tSam
.sam
Alignments to the transcriptome in SAM format

--cleanup
Delete temporary files (default setting is to save all files)

Examples

ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted.

ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam

This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read.

(3) Branch 3: `DE.sh`

The de command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.

Options

1) DE.sh de [options]

Options
arguments
Meaning

--c1
Condition1_ID
ID for condition1 (e.g. Control)

--c2
Condition2_ID
ID for condition2 (e.g. Case)

-1/--f1
Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1*

-2/--f2
Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition2*

-o/--output-prefix
prefix
Prefix of output files

* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input.

Examples

DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.

DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare

In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input.

(4) Branch 4: `FileFormatConversion.sh`

FileFormatConversion_version_1.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG.

Options

Options
arguments
Meaning

-i/--input
File to convert
Input file to convert, with the correct file suffix (.sam, .bam, or .mrf)

-o/--output
prefix
Prefix of output files

-r/--reference
Reference sequence file that was used to align the input file
If a file is being converted from SAM to BAM format, the .sam file should contain either the appropriate @SQ headers or you need to use the --reference option to supply the reference file that was used to create the .sam file.

-b/--toBAM
Convert SAM to BAM

-m/--toMRF
Convert SAM to MRF

-d/--toBED
Convert BAM to BED

-w/--toWIG
Convert SAM to WIG or MRF to WIG

Examples

FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required.

FileFormatConversion.sh -i in.sam -o out –m

This command will convert the input in.sam file to a mrf file.

FileFormatConversion.sh -i in.bam -o out –d

This command will convert the input in.bam file to a BED file for visualization.

3. Input files and formats

(1) Possible input files for QC_SNP.sh (Depends on the selected options)

Genome annotation GTF file
Transcriptome reference sequences
Genome reference seqeuences
RNA-Seq fastq or fastq.gz file
Alignment files in SAM format
Reference sequences of Mitochondria
Reference sequences of Ribosomal RNA

(2) Possible input files for `ExpressionEstimation.sh` (Depends on the selected options)

Genome annotation GTF file
Transcriptome reference sequences
RNA-Seq fastq or fastq.gz file
Alignment files in SAM format

(3) Possible input files for DE.sh

Output files from ExpressionEstimation.sh: whole_GeneExpressionLevel_unique.txt or whole_ExonExpressionLevel_unique.txt files.

(4) Format specification of input files

The following annotations and references should be in separate files. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline.

Genome Annotation GTF file

Format from GTF 2.0 to GTF3.0 is required;

Transcriptome Reference Sequences

The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info.

“>$GenomeName_$AnnotationSource_$TranscriptsID=$Chromosome:$Start-$End [extra info]”

For example,

“>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none”

Genome Reference Sequences

The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info.

“>$chromsome”

For example,

“>chr1 dna:chromosome”

">chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF"

“>chrM”

4. Sample datasets

(1) c. elegans

We offer the sample datasets for c. elegans:

(1) RNA-Seq dataset

(2) Reference genome sequences file

(3) Reference annotation gtf file

(4) Reference transcriptome sequences file

(5) Commands for running three of the branches:

Before running the pipeline, create a new directory for output results:

mkdir c.elegans_Output

i) Branch 1: QC_SNP.sh

QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s

ii) Branch 2: ExpressionEstimation.sh

ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans

iii) Branch 4: FileFormatConversion.sh

FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome.sam -m -o c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome

FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -w -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

(2) Human

(1) RNA-Seq dataset

(2) Reference genome sequences file

(3) Mitochondria sequence file

(4) Reference annotation gtf file (gencodeV14)

(5) Reference transcriptome sequences file (gencodeV14)

(6) Commands for running three of the branches:

Before running the pipeline, create a new directory for output results:

mkdir Human_Output

Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs.

gunzip Human_gencodeV14_anno.gtf.gz
gunzip Human_gencodeV14_transcripts.fa.gz
gunzip Human_genome_GRCh37.fa.gz

i) Branch 1: QC_SNP.sh

QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s

ii) Branch 2: ExpressionEstimation.sh

ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human

iii) Branch 4: FileFormatConversion.sh

FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r  Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome

FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_transcriptome.sam -m -o Human_Output/QC_human_Bowtie2_transcriptome

FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -w -o Human_Output/QC_human_Bowtie2_genome

FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.bam -d -o Human_Output/QC_human_Bowtie2_genome