The following packages must be pre-installed:
Step 1: Download the source code to your directory, e.g '/home/user/rseqflow'.
Step 2: Enter your specified directory:
bash-3.2$ cd /home/user/rseqflow
Step 3: Extract the tar file:
bash-3.2$ tar -xvf RseqFlow_source.tar.gz
Step 4: Enter the directory:
bash-3.2$ cd /home/user/reseqflow/RseqFlow_source
Step 5: Set up some tools:
bash-3.2$ ./make.sh
Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run RseqFlow from a new terminal window.
bash-3.2$ source ./configure.sh
Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run RseqFlow. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file!
bash-3.2$ cat configure.sh >> /home/user/.bashrc
Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'.
bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python
Step 9: Now you can run the corresponding shell scripts for each branch.
QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets.
For pre-alignment quality control analysis only, a fastq(.gz) file is required as input, either a single fastq for single end data or Read1.fastq and Read2.fastq for paired end data.
For post-alignment quality control analysis, the fastq files(s), genome reference sequences, transcriptome reference sequences, and annotation file must be input. QC_SNP.sh will align the RNA-Seq dataset to the genome and transcriptome, then merge the alignment results for quality control and SNP calling.
You may input ribosomal RNA reference sequences and/or mitochondrial reference sequences if desired.
If bowtie2 indexes already exist in the same directory as the genome and/or transcriptome reference files, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference files for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the corresponding genome/transcriptome reference. This will bypass index creation for future runs and reduce run times.
If you already have RseqFlow alignment file(s) in SAM format for the genome and/or transcriptome, you may input the .sam files and the alignment step will be skipped. The post-alignment and/or SNP calling will be analyzed based on the merged result only.
Option | Argument | Description |
---|---|---|
-f/--fastq | .fastq or .fastq.gz | RNA-Seq dataset with single end reads in FASTQ format (fastq or fastq.gz) |
-1/--read1 | .fastq or .fastq.gz | The first reads file for paired end data in FASTQ format (fastq or fastq.gz) |
-2/--read2 | .fastq or .fastq.gz | The second reads file for paired end data in FASTQ format (fastq or fastq.gz) |
-g/--genome | .fa | Genome reference sequences in FASTA format |
-c/-transcriptome | .fa | Transcriptome reference sequences in FASTA format |
-a/--annotation | .gtf | Reference annotation in GTF format |
-o/--output-prefix | Prefix of output files (default setting is “QC_SNP_output”) | |
-p/--pre-QC | Generate pre-alignment QC reports based only on the RNA-Seq iput fastq file | |
-q/--QC | Generate post-QC reports based on alignment | |
-s/--SNP | Implement SNPs calling analysis | |
--ribo | .fa | Reference sequences of ribosomal RNA in FASTA format |
--mito | .fa | Reference sequences of Mitochondrial chromosome in FASTA format |
--gSam | .sam | Alignment to genome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to genome) |
--tSam | .sam | Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome) |
--cleanup | Delete temporary files (default setting is to save all files) |
Examples for running QC_SNP.sh:
QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup
This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files.
QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa
This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report.
QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam
This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam).
QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s
This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling.
QC_SNP.sh -1 Read1.fastq.gz -2 Read2.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –s --mito chrM.fa
This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments.
ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique.
If bowtie2 indexes already exist in the same directory as the transcriptome reference file, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference file for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the transcriptome reference. This will bypass index creation for future runs and reduce run times.
If a fastq file and the reference sequences of the transcriptome are supplied as input, ExpressionEstimation.sh will align the RNA-Seq dataset to the transcriptome and then implement expression level estimation.
If you already have an alignment file to the transcriptome, you may input the .sam file and the alignment step will be skipped. ExpressionEstimation.sh will implement expression level estimation.
Please note that ExpressionEstimation.sh is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run ExpressionEstimation.sh without a samfile to let RseqFlow create its own, otherwise there is no guarantee that ExpressionEstimation will use only unique alignments.
Option | Argument | Description |
---|---|---|
-f/--fastq | .fastq or .fastq.gz | The reads file for single end data in FASTQ format (fastq or fastq.gz) |
-1/--read1 | .fastq or .fastq.gz | The first reads file for paired end data in FASTQ format (fastq or fastq.gz) |
-2/--read2 | .fastq or .fastq.gz | The second reads file for paired end data in FASTQ format (fastq or fastq.gz) |
-c/-transcriptome | .fa | Transcriptome reference sequences in FASTA format |
-a/--annotation | .gtf | Reference annotation in GTF format |
-o/--output-prefix | Prefix of output files (default setting is “Expression_output”) | |
--tSam | .sam | Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome) |
--cleanup | Delete temporary files (default setting is to save all files) |
Examples for running ExpressionEstimation.sh:
ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup
This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted.
ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam
This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read.
The "de" command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.
Option | Argument | Description |
---|---|---|
--c1 | Condition1_ID | ID for condition1 (e.g. Control) |
--c2 | Condition2_ID | ID for condition2 (e.g. Case) |
-1/--f1 | Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1* | |
-2/--f2 | Comma separated file name list for condition2. The files should be the Gene/Exon expression file(s) for condition2* | |
-o/--output-prefix | Prefix of output files |
* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input.
Examples for running DE.sh de:
DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease
In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.
DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare
In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input.
FileFormatConversion.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG.
Option | Argument | Description |
---|---|---|
-i/--input | .sam, .bam, .mrf | Input file to convert, with the correct file suffix (.sam, .bam, or .mrf) |
-o/--output-prefix | Prefix of output files | |
-r/--reference | .fa | Reference sequence file in FASTA format that was used to align the input file. If a file is being converted from SAM to BAM format, the .sam file should contain either the appropriate @SQ headers or you need to use the --reference option to supply the reference file that was used to create the .sam file. |
-b/--toBAM | Convert SAM to BAM | |
-m/--toMRF | Convert SAM to MRF | |
-d/--toBED | Convert BAM to BED | |
-w/--toWIG | Convert SAM to WIG or MRF to WIG |
Examples for running FileFormatConversion.sh:
FileFormatConversion.sh -i in.sam -o out -b -r ref.fa
This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required.
FileFormatConversion.sh -i in.sam -o out –m
This command will convert the input in.sam file to a mrf file.
FileFormatConversion.sh -i in.bam -o out –d
This command will convert the input in.bam file to a BED file for visualization.
There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline.
The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info.
GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End Extra Info
For example:
>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none
The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info.
For example:
>chr1 dna:chromosome
>chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF
>chrM
The Genome Annotation GTF file must be in format GTF3.0.
RNA-Seq dataset
Reference genome sequences file
Reference annotation gtf file
Reference transcriptome sequences file
Before running the pipeline, create a new directory for output results:
mkdir c.elegans_Output
i) Branch 1: QC_SNP.sh
QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s
ii) Branch 2: ExpressionEstimation.sh
ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans
iii) Branch 4: FileFormatConversion.sh
FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome.sam -m -o c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -w -o c.elegans_Output/QC_c.elegans_Bowtie2_genome FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome
RNA-Seq dataset
Reference genome sequences file
Mitochondria sequence file
Reference annotation gtf file (gencodeV14)
Reference transcriptome sequences file (gencodeV14)
Before running the pipeline, create a new directory for output results:
mkdir Human_Output
Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs.
gunzip Human_gencodeV14_anno.gtf.gz gunzip Human_gencodeV14_transcripts.fa.gz gunzip Human_genome_GRCh37.fa.gz
i) Branch 1: QC_SNP.sh
QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s
ii) Branch 2: ExpressionEstimation.sh
ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human
iii) Branch 4: FileFormatConversion.sh
FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_transcriptome.sam -m -o Human_Output/QC_human_Bowtie2_transcriptome FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -w -o Human_Output/QC_human_Bowtie2_genome FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.bam -d -o Human_Output/QC_human_Bowtie2_genome