Home
Name Modified Size InfoDownloads / Week
QuickRNASeq_1.2.tgz 2017-02-28 5.2 MB
make_gex.pl 2016-04-16 18.7 kB
make_gex_anno.pl 2016-04-16 19.8 kB
QuickRNASeq_1.1.tgz 2016-04-15 7.4 MB
README.txt 2015-12-07 12.1 kB
QuickRNASeq_1.0.testRun.zip 2015-11-20 38.9 MB
QuickRNAseq_User_Guide.pdf 2015-11-20 1.1 MB
Totals: 7 Items   52.6 MB 1
#
#   QuickRNASeq pipeline for RNA-seq data analysis
#

Copyright (C) 2015, Shanrong Zhao, Baohong Zhang

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
 any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

#!!!
You need to download and install QuickRNAseq from http://quickrnaseq.sourceforge.net
#set environment variable to point to your installation directory
export QuickRNASeq={QuickRNASeq_installation_Directory}
#add QuickRNASeq to your system path
export PATH=$QuickRNASeq:$PATH
#!!!


#
## Prerequisite: installation of open sources
#
The following open source should be downloaded and installed
	1. download STAR from https://github.com/alexdobin/STAR/releases
	2. install SUBREAD packages http://subread.sourceforge.net/ 
	3. download JAR file from http://varscan.sourceforge.net/
	4. download and install RSeQC from http://rseqc.sourceforge.net/
	5. download and install samtools from http://sourceforge.net/projects/samtools/files/

To set the following software related parameters in configuration file. samtools is assumed 
to be in your system path since it is a basic utility tool for NGS data analysis.
		STAR_RNA=/hpc/grid/shared/ngsapp/STAR_2.4.0k/bin/Linux_x86_64
		FEATURECOUNTS=/hpc/grid/shared/ngsapp/subread-1.4.6/bin
		RSeQC=/afs/grid.pfizer.com/mm/proj01/app/python/bin
		VARSCAN_JAR=/hpc/grid/shared/ngsapp/bin/VarScan.v2.4.0.jar

#
## preparation of geneome/annotation/index for RNA-seq data analysis
#
An example on how to create GRCh38_Gencode23 index for RNA-seq data analysis step by step
	a. download genome Fastq ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/GRCh38.primary_assembly.genome.fa.gz
	b. download gene annotation in GTF: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.annotation.gtf.gz
	c. unzip and rename genome and GTF as: GRCh38.primary.genome.fa, GRCh38.gencode.v23.gtf and
		put them in the corresponding project data folder
	d.  prepare annotation and BED files using utility functions in QuickRNASeq
		1).  gtf2bed.pl  GRCh38.gencode.v23.gtf >  GRCh38.gencode.v23.bed
		2).  gtf2annot.pl GRCh38.gencode.v23.gtf > GRCh38.gencode.v23.annot
		3).  samtools faidx GRCh38.primary.genome.fa
	f.  create genome index as below:
		 /hpc/grid/shared/ngsapp/STAR_2.4.0k/bin/Linux_x86_64/STAR \
			--runThreadN 32 --runMode genomeGenerate --genomeDir /hpc/grid/shared/ngsdb/STAR/GRCh38_gencode23_100 \
			--genomeFastaFiles  /hpc/grid/shared/ngsdb/fasta/GRCh38.primary.genome.fa --sjdbGTFfile \
			/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.gtf --sjdbOverhang 99
	g.  To set the following reference geneome related parameters in configuration file
		GENOME_FASTA=/hpc/grid/shared/ngsdb/fasta/GRCh38.primary.genome.fa
		GENOME_INDEX=/hpc/grid/shared/ngsdb/STAR/GRCh38_gencode23_100
		GENOME_ANNOTATION=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.annot
		GTF_FILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.gtf
		BEDFILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.bed
		CHR_REGION=chr6:1-170805979

		
#
# Step #0: Prepare RNA-seq data, sample ID list, sample annotation
#
	a. prepare a sample annotation file and sample ID file
		1)  sample ID file, an unique ID per line. Assume the suffix of fastq files is fq.gz
			and its corresponding fastq files are: 
				sample_id_1.fq.gz
				sample_id_2.fq.gz.
		2) sample annotation file:
			column #1:  sample_id
			column #2:  subject_id
			column #3--#n (optional)
			For clinical RNA-seq, samples from the same subject should be assigned to the same subject_id.
	!!!NOTE!!!:
		QuickRNASeq can be run without a sample annotation, but it is strongly recommended that 
		a user should provide meaningful annotations for all samples. A proper annotation file 
		should be tab delimited, and QuickRNASeq requires the first and second columns correspond 
		to sample and subject identifiers, respectively. A sample name should start with a letter, 
		and does not contain any white space in the middle
		
	b. To set the following parameters in your configuration file
		# directory for FASTQ files
		FASTQ_DIR=/hpc/grid/ngsws/molmed/data/QuickRNASeq/fastq
		# suffix for fastq files:  fq.gz or fastq.gz or fastq  or fq
		FASTQ_SUFFIX=fq.gz
		# stranded or nonstranded. #0: nonstranded;  1: forward strandness;  2: reverse strandness
		STRAND=0     
		# sequencing type: pair or single
		SEQUENCE_TYPE=pair
		# sequencing depth #regular: 40-80M; #deep: >100M
		SEQUENCE_DEPTH=regular
		# COMPREHENSIVE_QC: true if performed; false if no individual sample QC
		COMPREHENSIVE_QC=true
		#log directory 
		LOGDIR=/hpc/grid/scratch/$USER/log

		
#
#Step #1: Run pipeline to process individual sample: mapping, counting and QC
#Note: 
#	1. This step is computationally intensive
#	2. Under the project folder, there is a separate result folder for each sample 
#

	export QuickRNASeq={QuickRNASeq_installation_Directory}
	export PATH=$QuickRNASeq:$PATH
	
	star-fc-qc.sh allIDs.txt project_configuration_file

	!!!NOTE!!!:
	The out-of-box QuickRNASeq is fully tested in a HPC cluster using IBM Platform LSF 
	(Load Sharing Facility), a powerful workload management platform for demanding, distributed
	HPC environments. IBM Platform LSF provides a comprehensive set of intelligent, 
	policy-driven scheduling features that enable you to utilize all of your compute 
	infrastructure resources and ensure optimal application performance. In addition to 
	LSF, there is a list of notable job scheduling software available. For a cluster 
	using a job scheduler other than LSF, star-fc-qc.sh (implementation of Step #1 in 
	Figure 1) needs to be twisted or modified. 
	For those people who have no access to a HPC cluster, we offer star-fc-qc.ws.sh, a
	customized script working in a standard Linux workstation. Of course, to analyze a 
	large RNA-seq dataset in a single workstation is not typical.
	
	#run the following command if you run the analysis in a standalone workstation
	#star-fc-qc.ws.sh allIDs.txt project_configuration_file

	
#
#Step #2: Merge results from individual samples and generate an integrated report
#Note: 
#	1. Please run this command under the project folder. 
#	2. If sample_annotation_file is not provided, a mock annotation will be generated.
#
	
	export GENOME_ANNOTATION=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.annot
	star-fc-qc.summary.sh allIDs.txt sample_annotation_file[optional]
	#bsub -app large "star-fc-qc.summary.sh allIDs.txt sample.annotation.txt &> Results.log"
	
	#run the command line below if no individual sample QC is performed 
	#star-fc.summary.sh allIDs.txt sample_annotation_file[optional]
	
	
#
#Step #3: Explore integrated and interactive project report
#

open the index.html file under Results folder and you will have access to all data
and figures, and drill down RNA-seq analysis results in an interactive way
	

		
#
# QuickRNASeq test run
#
Please refer to $QuickRNASeq/test_run folder for:
	1. 	allIDs.txt:  sample identifiers
	2.	sample.annotation.txt: annotation file
	3.  GTEx.config:  sample configuration file
	4.  master-cmd.sh: command lines for test runs 
	Note: run step #1 after step #1 finishes
		Adjust QuickRNASeq to point to QuickRNASeq installation folder
	
Please ALSO refer to $QuickRNASeq/star-fc-qc.config.template to create your own project
specific configuration file. 
	
	
	
########################################################################
P.S  A sample configuration file is as below.
Note:
	Please refer to star-fc-qc.config.template for more details
	You can copy star-fc-qc.config.template in QuickRNASeq package to your 
	project folder and then customize it
########################################################################

#
# RNA-seq project related information
#
FASTQ_DIR=/hpc/grid/ngsws/molmed/data/QuickRNASeq/fastq
FASTQ_SUFFIX=fq.gz
STRAND=0     
SEQUENCE_DEPTH=regular
SEQUENCE_TYPE=pair
COMPREHENSIVE_QC=true
LOGDIR=/hpc/grid/scratch/$USER/log

#
## Species-specific GenomeIndex and GTF
## Usually, you DON'T need to modify them unless a different geneome/annotation is used
#
GENOME_FASTA=/hpc/grid/shared/ngsdb/fasta/GRCh38.primary.genome.fa
GENOME_INDEX=/hpc/grid/shared/ngsdb/STAR/GRCh38_gencode23_100
GENOME_ANNOTATION=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.annot
GTF_FILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.gtf
BEDFILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.bed
CHR_REGION=chr6:1-170805979

#
##Software parameters
## Usually, you DON'T need to modify them unless you understand the impacts of these parameters
#
STAR_PARAMETER="--alignSJDBoverhangMin 1 $limitBAMsortRAM --outFilterMismatchNoverLmax 0.05 --outFilterScoreMinOverLread 0.90 --outFilterMatchNminOverLread 0.90 --alignIntronMax 1000000" 
FEATURECOUNTS_OVERLAP="--minReadOverlap 25"
VARSCAN_PARAMETER="--min-coverage 20 --min-reads2 4"

#
##Software locations
## Usually, you DON'T need to modify them unless you update packages
#
STAR_RNA=/hpc/grid/shared/ngsapp/STAR_2.4.0k/bin/Linux_x86_64
FEATURECOUNTS=/hpc/grid/shared/ngsapp/subread-1.4.6/bin
RSeQC=/afs/grid.pfizer.com/mm/proj01/app/python/bin
VARSCAN_JAR=/hpc/grid/shared/ngsapp/bin/VarScan.v2.4.0.jar



########################################################################
P.S  Descriptions on data and figure summary files under project folder
     Results/Summary
########################################################################

sample.annotation.txt			sample annotation file
star-mapping-summary.txt		read mapping summary for STAR run 
fc-counting-summary.txt			read counting summary for featureCounts run 
RSeQC-read-distribution.txt		The distribution of mapped reads along gene elements
Gene-count-table.txt			counting table
Gene-count-table.flt.txt		filtered counting table. A gene is filtered if having 0 read in more than 50% samples
Gene-fpkm-table.txt				RPKM table, calculated from Gene-count-table.txt
Gene-fpkm-table.flt.txt			filtered RPKM table, calculated from Gene-count-table.flt.txt
RNASeq-expr-QC.txt				correlation based QC of expression profile
RNASeq-expr-corr.txt			All-against-all gene expression correlation matrix
RNASeq-snp-corr.txt				All-against-all SNP correlation matrix
RNASeq-merged-metrics.txt		Merged RNA-seq metrics from mapping, counting, distribution and QC
read_map_sum.10x7.png			plot of star-mapping-summary.txt. Intended for powerpoint presentation.
read_map_sum.png				plot of star-mapping-summary.txt. Intended for interactive HTML presentation.
read_count_sum.10x7.png			plot of fc-counting-summary.txt. Intended for powerpoint presentation.
read_count_sum.png				plot of fc-counting-summary.txt. Intended for interactive HTML presentation.
read_dist_sum.10x7.png			plot of RSeQC-read-distribution.txt. Intended for powerpoint presentation.
read_dist_sum.png				plot of RSeQC-read-distribution.txt. Intended for interactive HTML presentation.
expr_count_RPKM.10x7.png		plot the number of genes with a give RPKM cutoff. Intended for powerpoint presentation.
expr_count_RPKM.png				plot the number of genes with a give RPKM cutoff. Intended for interactive HTML presentation.
RNASeq-expr-corr.txt.corr.9x7.png	correlation plot for RNASeq-expr-corr.txt.Intended for powerpoint presentation.
RNASeq-expr-corr.txt.corr.png		correlation plot for RNASeq-expr-corr.txt.Intended for interactive HTML presentation.
RNASeq-snp-corr.txt.corr.9x7.png	correlation plot for RNASeq-snp-corr.txt.Intended for powerpoint presentation.
RNASeq-snp-corr.txt.corr.png		correlation plot for RNASeq-snp-corr.txt.Intended for interactive HTML presentation.

Source: README.txt, updated 2015-12-07