Menu

UnixModeManual

Anonymous J.Herstein

Manual of RseqFlow Unix Run Mode

1. Package installation and configuration

The following packages must be pre-installed:

  • UNIX;
  • Python 2.7 and higher;
  • R 2.11 and higher;
  • GCC;
  • ImageMagick (optional) Required if you want a single combined QC PDF report

RseqFlow Installation and Configuration

Step 1: Download the source code to your directory, e.g '/home/user/rseqflow'.

Step 2: Enter your specified directory:

   bash-3.2$ cd /home/user/rseqflow

Step 3: Extract the tar file:

   bash-3.2$ tar -xvf RseqFlow_source.tar.gz

Step 4: Enter the directory:

   bash-3.2$ cd /home/user/reseqflow/RseqFlow_source

Step 5: Set up some tools:

   bash-3.2$ ./make.sh

Step 6: Set up PATH so that the system knows where to find the executable files. It is recommended to run step 7 after this to make these changes permanent, otherwise you will need to run ./configure.sh each time you run RseqFlow from a new terminal window.

   bash-3.2$ source ./configure.sh

Step 7: This step will make the changes to your PATH and PYTHONPATH variables from step 6 permanent. If you choose not to do this step, you will need to run configure.sh each time you run RseqFlow. To make the PATH and PYTHONPATH changes permanent, copy the commands in 'configure.sh' into your bash file either manually or with the following command. Please make sure the command contains ">>", not “>", otherwise, you will overwrite your original bash file!

   bash-3.2$ cat configure.sh >> /home/user/.bashrc

Step 8: If your system has multiple python versions, make sure you use version 2.7 or higher. Run 'pythonCompilerSet.sh' to set the proper python header in each of the python scripts. The example below assumes the path of the python executable is '/home/user/python2.7'.

   bash-3.2$ ./pythonCompilerSet.sh -p /home/user/python2.7/python

Step 9: Now you can run the corresponding shell scripts for each branch.

2. Branch usage

Branch 1: Quality Control and SNP calling (QC_SNP.sh)

QC_SNP.sh will implement quality control and/or SNP calling analysis for single end and paired end RNA-Seq datasets.

For pre-alignment quality control analysis only, a fastq(.gz) file is required as input, either a single fastq for single end data or Read1.fastq and Read2.fastq for paired end data.

For post-alignment quality control analysis, the fastq files(s), genome reference sequences, transcriptome reference sequences, and annotation file must be input. QC_SNP.sh will align the RNA-Seq dataset to the genome and transcriptome, then merge the alignment results for quality control and SNP calling.

You may input ribosomal RNA reference sequences and/or mitochondrial reference sequences if desired.

If bowtie2 indexes already exist in the same directory as the genome and/or transcriptome reference files, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference files for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the corresponding genome/transcriptome reference. This will bypass index creation for future runs and reduce run times.

If you already have RseqFlow alignment file(s) in SAM format for the genome and/or transcriptome, you may input the .sam files and the alignment step will be skipped. The post-alignment and/or SNP calling will be analyzed based on the merged result only.

Options:

Option Argument Description
-f/--fastq .fastq or .fastq.gz RNA-Seq dataset with single end reads in FASTQ format (fastq or fastq.gz)
-1/--read1 .fastq or .fastq.gz The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2 .fastq or .fastq.gz The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-g/--genome .fa Genome reference sequences in FASTA format
-c/-transcriptome .fa Transcriptome reference sequences in FASTA format
-a/--annotation .gtf Reference annotation in GTF format
-o/--output-prefix Prefix of output files (default setting is “QC_SNP_output”)
-p/--pre-QC Generate pre-alignment QC reports based only on the RNA-Seq iput fastq file
-q/--QC Generate post-QC reports based on alignment
-s/--SNP Implement SNPs calling analysis
--ribo .fa Reference sequences of ribosomal RNA in FASTA format
--mito .fa Reference sequences of Mitochondrial chromosome in FASTA format
--gSam .sam Alignment to genome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to genome)
--tSam .sam Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup Delete temporary files (default setting is to save all files)

Examples:

Examples for running QC_SNP.sh:

QC_SNP.sh -f Reads.fastq -o OutputPrefix –p --cleanup

This command will give pre-alignment quality analysis only, based on the RNA-Seq dataset (Reads.fq) and will delete temporary files.

QC_SNP.sh -f Reads.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –p -q --ribo rRNA.fa

This command will align the Reads.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and ribosomal RNA (rRNA.fa) and implement the pre and post alignment quality control analyses. It will also output the ribosomal RNA alignment report.

QC_SNP.sh -a Anno.gtf -o OutputPrefix -q --gSam AlignedToGenome.sam --tSam AlignedToTran.sam

This command will give post-alignment quality analysis based on the user specified alignment SAM files to the genome (AlignedToGenome.sam) and transcriptome(AlignedToTran.sam).

QC_SNP.sh -1 Read1.fastq -2 Read2.fastq -g RefGene.fa -c RefTran.fa –a Anno.gtf -o OutputPrefix -p –q –s

This command will align the paired end RNA-Seq datasets Read1.fastq and Read2.fastq to the genome (RefGene.fa) and transcriptome (RefTran.fa) and implement the pre and post alignment quality control analyses and SNP calling.

QC_SNP.sh -1 Read1.fastq.gz -2 Read2.fastq.gz -g RefGene.fa -c RefTran.fa -a Anno.gtf -o OutputPrefix –s --mito chrM.fa

This command will align the paired end Read1.fastq.gz and Read2.fastq.gz to the genome (RefGene.fa), transcriptome (RefTran.fa) and Mitochondrial chromosome (chrM.fa) and implement SNP calling base on the merged genome and transcriptome alignments.

Branch 2: Expression Level Estimation (ExpressionEstimation.sh)

ExpressionEstimation.sh will implement expression level estimation for gene/exon/splice junctions based on the alignment to the transcriptome for single end or paired end RNA-Seq datsets. Beginning in version 2.1 and later, only unique alignments are used in expression estimation. Prior to version 2.1, the best alignment for each read was used which may or may not have been unique.

If bowtie2 indexes already exist in the same directory as the transcriptome reference file, RseqFlow will use those existing indexes. If it cannot find precomputed indexes, it will create the indexes in the output directory. If you plan on using the same reference file for future runs, you may wish to move the files ending in .bt2 from the output directory into the directory of the transcriptome reference. This will bypass index creation for future runs and reduce run times.

If a fastq file and the reference sequences of the transcriptome are supplied as input, ExpressionEstimation.sh will align the RNA-Seq dataset to the transcriptome and then implement expression level estimation.

If you already have an alignment file to the transcriptome, you may input the .sam file and the alignment step will be skipped. ExpressionEstimation.sh will implement expression level estimation.

Please note that ExpressionEstimation.sh is expecting a bowtie2 created samfile. If your pre-existing samfile was not created with bowtie2, it is strongly suggested that you run ExpressionEstimation.sh without a samfile to let RseqFlow create its own, otherwise there is no guarantee that ExpressionEstimation will use only unique alignments.

Options:

Option Argument Description
-f/--fastq .fastq or .fastq.gz The reads file for single end data in FASTQ format (fastq or fastq.gz)
-1/--read1 .fastq or .fastq.gz The first reads file for paired end data in FASTQ format (fastq or fastq.gz)
-2/--read2 .fastq or .fastq.gz The second reads file for paired end data in FASTQ format (fastq or fastq.gz)
-c/-transcriptome .fa Transcriptome reference sequences in FASTA format
-a/--annotation .gtf Reference annotation in GTF format
-o/--output-prefix Prefix of output files (default setting is “Expression_output”)
--tSam .sam Alignment to transcriptome in SAM format (This option is for cases where RNA-Seq dataset has already been aligned to transcriptome)
--cleanup Delete temporary files (default setting is to save all files)

Examples:

Examples for running ExpressionEstimation.sh:

ExpressionEstimation.sh -f Reads.fastq -c RefTran.fa -a Anno.gtf -o OutputPrefix --cleanup

This command will align the single end Reads.fastq to the transcriptome reference sequence (RefTran.fa) and implement the expression level estimation using the distinct best alignment for each read. Temporary files will be deleted.

ExpressionEstimation.sh -a Anno.gtf -o OutputPrefix --tSam AlignedToTran.sam

This command uses the supplied .sam file that has already been aligned to the transcriptome. RseqFlow will implement the expression level estimation from the distinct best alignment for each read.

Branch 3: Differentially Expressed Gene Identification (DE.sh)

The "de" command will identify the differentially expressed genes for two conditions (e.g. case/control) using the output files from ExpressionEstimation.sh. If both conditions have only one sample, the ExonExpressionLevel_unique files from ExpressionEstimation.sh are used as input. Otherwise, if either condition has more than one sample, the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.

Options for DE.sh de:

Option Argument Description
--c1 Condition1_ID ID for condition1 (e.g. Control)
--c2 Condition2_ID ID for condition2 (e.g. Case)
-1/--f1 Comma separated file name list for condition1. The files should be the Gene/Exon expression file(s) for condition1*
-2/--f2 Comma separated file name list for condition2. The files should be the Gene/Exon expression file(s) for condition2*
-o/--output-prefix Prefix of output files

* If both conditions have only one sample each, use the ExonExpressionLevel_unique files from ExpressionEstimation.sh as input. Otherwise, if both conditions have more than one sample each, use the GeneExpressionLevel_unique files from ExpressionEstimation.sh as input.

Examples:

Examples for running DE.sh de:

DE.sh de --c1 disease --c2 normal --f1 disease_S1_whole_GeneExpressionLevel_unique.txt, disease_S2_whole_GeneExpressionLevel_unique.txt --f2 normal_S1_whole_GeneExpressionLevel_unique.txt -o HeartDisease

In this case, there are two samples with the disease condition, so the GeneExpressionLevel_unique files from ExpressionEstimation.sh are used as input.

DE.sh de --c1 Brain --c2 Colon --f1 Brain_whole_ExonExpressionLevel_unique.txt --f2 Colon_whole_ExonExpressionLevel_unique.txt –o TissueCompare

In this case, both conditions have only a single sample each, so the ExonExpressionLevel files from ExpressionEstimation.sh are used as input.

Branch 4: Alignment File Format Conversion (FileFormatConversion.sh)

FileFormatConversion.sh will convert between alignment file formats for storage or visualization convenience. It can convert SAM to BAM, MRF and WIG, BAM to BED, and MRF to WIG.

Options:

Option Argument Description
-i/--input .sam, .bam, .mrf Input file to convert, with the correct file suffix (.sam, .bam, or .mrf)
-o/--output-prefix Prefix of output files
-r/--reference .fa Reference sequence file in FASTA format that was used to align the input file. If a file is being converted from SAM to BAM format, the .sam file should contain either the appropriate @SQ headers or you need to use the --reference option to supply the reference file that was used to create the .sam file.
-b/--toBAM Convert SAM to BAM
-m/--toMRF Convert SAM to MRF
-d/--toBED Convert BAM to BED
-w/--toWIG Convert SAM to WIG or MRF to WIG

Examples:

Examples for running FileFormatConversion.sh:

FileFormatConversion.sh -i in.sam -o out -b -r ref.fa

This command will convert the input in.sam file to a bam file. If the sam file does not contain the appropriate "@SQ" header lines, the reference sequences file (ref.fa) is required.

FileFormatConversion.sh -i in.sam -o out –m

This command will convert the input in.sam file to a mrf file.

FileFormatConversion.sh -i in.bam -o out –d

This command will convert the input in.bam file to a BED file for visualization.

3. Input files and formats

Possible input files for QC_SNP.sh (Depends on the selected options):

  • Genome annotation GTF file
  • Transcriptome reference sequences
  • Genome reference seqeuences
  • RNA-Seq fastq or fastq.gz file
  • Alignment files in SAM format
  • Reference sequences of Mitochondria
  • Reference sequences of Ribosomal RNA

Possible input files for ExpressionEstimation.sh (Depends on the selected options):

  • Genome annotation GTF file
  • Transcriptome reference sequences
  • RNA-Seq fastq or fastq.gz file
  • Alignment files in SAM format

Possible input files for DE.sh:

  • Output files from ExpressionEstimation.sh:
    • whole_GeneExpressionLevel_unique.txt
    • whole_ExonExpressionLevel_unique.txt files

Format specification of input files

There should be three separate files: one for Transcriptome reference sequences, one for Genome reference sequences and one for the Genome Annotation file. RseqFlow will automatically split the files during processing, if necessary. All eukaryotic species with files in the required formats can be analyzed in the RseqFlow pipeline.

  • Transcriptome Reference Sequences

The transcript names must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome end location as long as there is a space separating the chromosome end from the extra info.

GenomeName_AnnotationSource_TranscriptsID=Chromosome:Start-End Extra Info

For example:

>hg19_wgEncodeGencodeManualV4_ENST00000480075=chr7:19757-35457 5'pad=0 3'pad=0 strand=- repeatMasking=none

  • Genome Reference Sequences

The chromosome name must begin with “>” and should meet the format requirements below. Extra information may follow the chromosome as long as there is a space separating the chromosome from the extra info.

For example:

>chr1 dna:chromosome

>chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 REF

>chrM

  • Genome Annotation

The Genome Annotation GTF file must be in format GTF3.0.

4. Sample datasets

c. elegans

Download the following datasets:

RNA-Seq dataset
Reference genome sequences file
Reference annotation gtf file
Reference transcriptome sequences file

Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

Before running the pipeline, create a new directory for output results:

mkdir c.elegans_Output

i) Branch 1: QC_SNP.sh

 QC_SNP.sh -f SRR514378.fastq -g genome_c.elegans.fa -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/QC_c.elegans -p -q -s

ii) Branch 2: ExpressionEstimation.sh

 ExpressionEstimation.sh -f SRR514378.fastq -c c.elegans_refseq_seq.fa -a c.elegans_refseq_anno.gtf -o c.elegans_Output/ExpressionEstimation_c.elegans

iii) Branch 4: FileFormatConversion.sh

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -r genome_c.elegans.fa -b -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome.sam -m -o c.elegans_Output/QC_c.elegans_Bowtie2_transcriptome

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.sam -w -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

 FileFormatConversion.sh -i c.elegans_Output/QC_c.elegans_Bowtie2_genome.bam -d -o c.elegans_Output/QC_c.elegans_Bowtie2_genome

Human

Download the following datasets:

RNA-Seq dataset
Reference genome sequences file
Mitochondria sequence file
Reference annotation gtf file (gencodeV14)
Reference transcriptome sequences file (gencodeV14)

Commands for running QC_SNP, ExpressionEstimation, and FileFormatConversion:

Before running the pipeline, create a new directory for output results:

 mkdir Human_Output

Unzip the reference files. You may wish to create a separate directory so that you can use these references for future runs.

 gunzip Human_gencodeV14_anno.gtf.gz
 gunzip Human_gencodeV14_transcripts.fa.gz
 gunzip Human_genome_GRCh37.fa.gz

i) Branch 1: QC_SNP.sh

 QC_SNP.sh -f ERR030893_2M.fastq.gz -g Human_genome_GRCh37.fa -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf --mito Human_chrM -o Human_Output/QC_human -p -q -s

ii) Branch 2: ExpressionEstimation.sh

 ExpressionEstimation.sh -f ERR030893_2M.fastq.gz -c Human_gencodeV14_transcripts.fa -a Human_gencodeV14_anno.gtf -o Human_Output/ExpressionEstimation_human

iii) Branch 4: FileFormatConversion.sh

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -r  Human_genome_GRCh37.fa -b -o Human_Output/QC_human_Bowtie2_genome

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_transcriptome.sam -m -o Human_Output/QC_human_Bowtie2_transcriptome

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.sam -w -o Human_Output/QC_human_Bowtie2_genome

 FileFormatConversion.sh -i Human_Output/QC_human_Bowtie2_genome.bam -d -o Human_Output/QC_human_Bowtie2_genome

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.