rseqflow Wiki

RNA-Seq analysis pipeline for QC, Expression, DifferentialExpression

Brought to you by: jherstein

UnixModeManual

Authors: Anonymous

Manual of RseqFlow Unix Run Mode

1. Package installation and configuration
2. Branch usage
3. Input files and formats
4. Sample datasets
- c. elegans
- Human

1. Package installation and configuration

The following packages must be pre-installed:

UNIX;
Python 2.7 and higher;
R 2.11 and higher;
GCC;
ImageMagick (optional) Required if you want a single combined QC PDF report

RseqFlow Installation and Configuration

Step 1: Download the source code to your directory, e.g '/home/user/rseqflow'.

Step 2: Enter your specified directory:

   bash-3.2$ cd /home/user/rseqflow

Step 3: Extract the tar file:

   bash-3.2$ tar -xvf RseqFlow_source.tar.gz

Step 4: Enter the directory:

   bash-3.2$ cd /home/user/reseqflow/RseqFlow_source

Step 5: Set up some tools:

   bash-3.2$ ./make.sh

Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run RseqFlow from a new terminal window.

   bash-3.2$ source ./configure.sh

Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run RseqFlow. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. **Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file! **

   bash-3.2$ cat configure.sh >> /home/user/.bashrc

Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'.

   bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python

Step 9: Now you can run the corresponding shell scripts for each branch.

2. Branch usage

Branch 1: Quality Control and SNP calling (QC_SNP.sh)

QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets.

For pre-alignment quality control analysis only, a fastq(.gz) file is required as input, either a single fastq for single end data or Read1.fastq and Read2.fastq for paired end data.

For post-alignment quality control analysis, the fastq files(s), genome reference sequences, transcriptome reference sequences, and annotation file must be input. QC_SNP.sh will align the RNA-Seq dataset to the genome and transcriptome, then merge the alignment results for quality control and SNP calling.

You may input ribosomal RNA reference sequences and/or mitochondrial reference sequences if desired.

If bowtie2 indexes already exist in the same directory as the genome and/or transcriptome reference files, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference files for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the corresponding genome/transcriptome reference. This will bypass index creation for future runs and reduce run times.

If you already have RseqFlow alignment file(s) in SAM format for the genome and/or transcriptome, you may input the .sam files and the alignment step will be skipped. The post-alignment and/or SNP calling will be analyzed based on the merged result only.

Options:

Option	Argument	Description
-f/--fastq	.fastq or .fastq.gz	RNA-Seq dataset with single end reads in FASTQ format (fastq or fastq.gz)
-1/--read1	.fastq or .fastq.gz	The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2	.fastq or .fastq.gz	The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-g/--genome	.fa	Genome reference sequences in FASTA format
-c/-transcriptome	.fa	Transcriptome reference sequences in FASTA format
-a/--annotation	.gtf	Reference annotation in GTF format
-o/--output-prefix		Prefix of output files (default setting is “QC_SNP_output”)
-p/--pre-QC		Generate pre-alignment QC reports based only on the RNA-Seq iput fastq file
-q/--QC		Generate post-QC reports based on alignment
-s/--SNP		Implement SNPs calling analysis
--ribo	.fa	Reference sequences of ribosomal RNA in FASTA format
--mito	.fa	Reference sequences of Mitochondrial chromosome in FASTA format
--gSam	.sam	Alignment to genome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to genome)
--tSam	.sam	Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup		Delete temporary files (default setting is to save all files)

Examples:

Examples for running QC_SNP.sh:

QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files.

QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa

This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report.

QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam

This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam).

QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s

This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling.

QC_SNP.sh -1 Read1.fastq.gz -2 Read2.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –s --mito chrM.fa

This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments.

Branch 2: Expression Level Estimation (ExpressionEstimation.sh)

ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique.

If bowtie2 indexes already exist in the same directory as the transcriptome reference file, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference file for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the transcriptome reference. This will bypass index creation for future runs and reduce run times.

If a fastq file and the reference sequences of the transcriptome are supplied as input, ExpressionEstimation.sh will align the RNA-Seq dataset to the transcriptome and then implement expression level estimation.

If you already have an alignment file to the transcriptome, you may input the .sam file and the alignment step will be skipped. ExpressionEstimation.sh will implement expression level estimation.

Please note that ExpressionEstimation.sh is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run ExpressionEstimation.sh without a samfile to let RseqFlow create its own, otherwise there is no guarantee that ExpressionEstimation will use only unique alignments.

Options:

Option	Argument	Description
-f/--fastq	.fastq or .fastq.gz	The reads file for single end data in FASTQ format (fastq or fastq.gz)
-1/--read1	.fastq or .fastq.gz	The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2	.fastq or .fastq.gz	The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-c/-transcriptome	.fa	Transcriptome reference sequences in FASTA format
-a/--annotation	.gtf	Reference annotation in GTF format
-o/--output-prefix		Prefix of output files (default setting is “Expression_output”)
--tSam	.sam	Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup		Delete temporary files (default setting is to save all files)

Examples:

Examples for running ExpressionEstimation.sh:

ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted.

ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam

This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read.

Branch 3: Differentially Expressed Gene Identification (DE.sh)

The "de" command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.

Options for DE.sh de:

Option	Argument	Description
--c1	Condition1_ID	ID for condition1 (e.g. Control)
--c2	Condition2_ID	ID for condition2 (e.g. Case)
-1/--f1		Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1^*
-2/--f2		Comma separated file name list for condition2. The files should be the Gene/Exon expression file(s) for condition2^*
-o/--output-prefix		Prefix of output files

^* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input.

Examples:

Examples for running DE.sh de:

DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.

DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare

In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input.

Branch 4: Alignment File Format Conversion (FileFormatConversion.sh)

FileFormatConversion.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG.

Options:

Option	Argument	Description
-i/--input	.sam, .bam, .mrf	Input file to convert, with the correct file suffix (.sam, .bam, or .mrf)
-o/--output-prefix		Prefix of output files
-r/--reference	.fa	Reference sequence file in FASTA format that was used to align the input file. If a file is being converted from SAM to BAM format, the .sam file should contain either the appropriate @SQ headers or you need to use the --reference option to supply the reference file that was used to create the .sam file.
-b/--toBAM		Convert SAM to BAM
-m/--toMRF		Convert SAM to MRF
-d/--toBED		Convert BAM to BED
-w/--toWIG		Convert SAM to WIG or MRF to WIG

Examples:

Examples for running FileFormatConversion.sh:

FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required.

FileFormatConversion.sh -i in.sam -o out –m

This command will convert the input in.sam file to a mrf file.

FileFormatConversion.sh -i in.bam -o out –d

This command will convert the input in.bam file to a BED file for visualization.

3. Input files and formats

Possible input files for QC_SNP.sh (Depends on the selected options):

Genome annotation GTF file
Transcriptome reference sequences
Genome reference seqeuences
RNA-Seq fastq or fastq.gz file
Alignment files in SAM format
Reference sequences of Mitochondria
Reference sequences of Ribosomal RNA

Possible input files for ExpressionEstimation.sh (Depends on the selected options):

Genome annotation GTF file
Transcriptome reference sequences
RNA-Seq fastq or fastq.gz file
Alignment files in SAM format

Possible input files for DE.sh:

Output files from ExpressionEstimation.sh:
- whole_GeneExpressionLevel_unique.txt
- whole_ExonExpressionLevel_unique.txt files

Format specification of input files

There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline.

Transcriptome Reference Sequences

The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info.

GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End Extra Info

For example:

>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none

Genome Reference Sequences

The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info.

For example:

>chr1 dna:chromosome

>chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF

>chrM

Genome Annotation

The Genome Annotation GTF file must be in format GTF3.0.

4. Sample datasets

c. elegans

Download the following datasets:

RNA-Seq dataset
Reference genome sequences file
Reference annotation gtf file
Reference transcriptome sequences file

Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

Before running the pipeline, create a new directory for output results:

mkdir c.elegans_Output

i) Branch 1: QC_SNP.sh

 QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s

ii) Branch 2: ExpressionEstimation.sh

 ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans

iii) Branch 4: FileFormatConversion.sh

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome.sam -m -o c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -w -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

Human

Download the following datasets:

RNA-Seq dataset
Reference genome sequences file
Mitochondria sequence file
Reference annotation gtf file (gencodeV14)
Reference transcriptome sequences file (gencodeV14)

Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

Before running the pipeline, create a new directory for output results:

 mkdir Human_Output

Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs.

 gunzip Human_gencodeV14_anno.gtf.gz
 gunzip Human_gencodeV14_transcripts.fa.gz
 gunzip Human_genome_GRCh37.fa.gz

i) Branch 1: QC_SNP.sh

 QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s

ii) Branch 2: ExpressionEstimation.sh

 ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human

iii) Branch 4: FileFormatConversion.sh

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r  Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_transcriptome.sam -m -o Human_Output/QC_human_Bowtie2_transcriptome

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -w -o Human_Output/QC_human_Bowtie2_genome

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.bam -d -o Human_Output/QC_human_Bowtie2_genome