#
# QuickRNASeq pipeline for RNA-seq data analysis
#
Copyright (C) 2015, Shanrong Zhao, Baohong Zhang
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
#!!!
You need to download and install QuickRNAseq from http://quickrnaseq.sourceforge.net
#set environment variable to point to your installation directory
export QuickRNASeq={QuickRNASeq_installation_Directory}
#add QuickRNASeq to your system path
export PATH=$QuickRNASeq:$PATH
#!!!
#
## Prerequisite: installation of open sources
#
The following open source should be downloaded and installed
1. download STAR from https://github.com/alexdobin/STAR/releases
2. install SUBREAD packages http://subread.sourceforge.net/
3. download JAR file from http://varscan.sourceforge.net/
4. download and install RSeQC from http://rseqc.sourceforge.net/
5. download and install samtools from http://sourceforge.net/projects/samtools/files/
To set the following software related parameters in configuration file. samtools is assumed
to be in your system path since it is a basic utility tool for NGS data analysis.
STAR_RNA=/hpc/grid/shared/ngsapp/STAR_2.4.0k/bin/Linux_x86_64
FEATURECOUNTS=/hpc/grid/shared/ngsapp/subread-1.4.6/bin
RSeQC=/afs/grid.pfizer.com/mm/proj01/app/python/bin
VARSCAN_JAR=/hpc/grid/shared/ngsapp/bin/VarScan.v2.4.0.jar
#
## preparation of geneome/annotation/index for RNA-seq data analysis
#
An example on how to create GRCh38_Gencode23 index for RNA-seq data analysis step by step
a. download genome Fastq ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/GRCh38.primary_assembly.genome.fa.gz
b. download gene annotation in GTF: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.annotation.gtf.gz
c. unzip and rename genome and GTF as: GRCh38.primary.genome.fa, GRCh38.gencode.v23.gtf and
put them in the corresponding project data folder
d. prepare annotation and BED files using utility functions in QuickRNASeq
1). gtf2bed.pl GRCh38.gencode.v23.gtf > GRCh38.gencode.v23.bed
2). gtf2annot.pl GRCh38.gencode.v23.gtf > GRCh38.gencode.v23.annot
3). samtools faidx GRCh38.primary.genome.fa
f. create genome index as below:
/hpc/grid/shared/ngsapp/STAR_2.4.0k/bin/Linux_x86_64/STAR \
--runThreadN 32 --runMode genomeGenerate --genomeDir /hpc/grid/shared/ngsdb/STAR/GRCh38_gencode23_100 \
--genomeFastaFiles /hpc/grid/shared/ngsdb/fasta/GRCh38.primary.genome.fa --sjdbGTFfile \
/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.gtf --sjdbOverhang 99
g. To set the following reference geneome related parameters in configuration file
GENOME_FASTA=/hpc/grid/shared/ngsdb/fasta/GRCh38.primary.genome.fa
GENOME_INDEX=/hpc/grid/shared/ngsdb/STAR/GRCh38_gencode23_100
GENOME_ANNOTATION=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.annot
GTF_FILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.gtf
BEDFILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.bed
CHR_REGION=chr6:1-170805979
#
# Step #0: Prepare RNA-seq data, sample ID list, sample annotation
#
a. prepare a sample annotation file and sample ID file
1) sample ID file, an unique ID per line. Assume the suffix of fastq files is fq.gz
and its corresponding fastq files are:
sample_id_1.fq.gz
sample_id_2.fq.gz.
2) sample annotation file:
column #1: sample_id
column #2: subject_id
column #3--#n (optional)
For clinical RNA-seq, samples from the same subject should be assigned to the same subject_id.
!!!NOTE!!!:
QuickRNASeq can be run without a sample annotation, but it is strongly recommended that
a user should provide meaningful annotations for all samples. A proper annotation file
should be tab delimited, and QuickRNASeq requires the first and second columns correspond
to sample and subject identifiers, respectively. A sample name should start with a letter,
and does not contain any white space in the middle
b. To set the following parameters in your configuration file
# directory for FASTQ files
FASTQ_DIR=/hpc/grid/ngsws/molmed/data/QuickRNASeq/fastq
# suffix for fastq files: fq.gz or fastq.gz or fastq or fq
FASTQ_SUFFIX=fq.gz
# stranded or nonstranded. #0: nonstranded; 1: forward strandness; 2: reverse strandness
STRAND=0
# sequencing type: pair or single
SEQUENCE_TYPE=pair
# sequencing depth #regular: 40-80M; #deep: >100M
SEQUENCE_DEPTH=regular
# COMPREHENSIVE_QC: true if performed; false if no individual sample QC
COMPREHENSIVE_QC=true
#log directory
LOGDIR=/hpc/grid/scratch/$USER/log
#
#Step #1: Run pipeline to process individual sample: mapping, counting and QC
#Note:
# 1. This step is computationally intensive
# 2. Under the project folder, there is a separate result folder for each sample
#
export QuickRNASeq={QuickRNASeq_installation_Directory}
export PATH=$QuickRNASeq:$PATH
star-fc-qc.sh allIDs.txt project_configuration_file
!!!NOTE!!!:
The out-of-box QuickRNASeq is fully tested in a HPC cluster using IBM Platform LSF
(Load Sharing Facility), a powerful workload management platform for demanding, distributed
HPC environments. IBM Platform LSF provides a comprehensive set of intelligent,
policy-driven scheduling features that enable you to utilize all of your compute
infrastructure resources and ensure optimal application performance. In addition to
LSF, there is a list of notable job scheduling software available. For a cluster
using a job scheduler other than LSF, star-fc-qc.sh (implementation of Step #1 in
Figure 1) needs to be twisted or modified.
For those people who have no access to a HPC cluster, we offer star-fc-qc.ws.sh, a
customized script working in a standard Linux workstation. Of course, to analyze a
large RNA-seq dataset in a single workstation is not typical.
#run the following command if you run the analysis in a standalone workstation
#star-fc-qc.ws.sh allIDs.txt project_configuration_file
#
#Step #2: Merge results from individual samples and generate an integrated report
#Note:
# 1. Please run this command under the project folder.
# 2. If sample_annotation_file is not provided, a mock annotation will be generated.
#
export GENOME_ANNOTATION=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.annot
star-fc-qc.summary.sh allIDs.txt sample_annotation_file[optional]
#bsub -app large "star-fc-qc.summary.sh allIDs.txt sample.annotation.txt &> Results.log"
#run the command line below if no individual sample QC is performed
#star-fc.summary.sh allIDs.txt sample_annotation_file[optional]
#
#Step #3: Explore integrated and interactive project report
#
open the index.html file under Results folder and you will have access to all data
and figures, and drill down RNA-seq analysis results in an interactive way
#
# QuickRNASeq test run
#
Please refer to $QuickRNASeq/test_run folder for:
1. allIDs.txt: sample identifiers
2. sample.annotation.txt: annotation file
3. GTEx.config: sample configuration file
4. master-cmd.sh: command lines for test runs
Note: run step #1 after step #1 finishes
Adjust QuickRNASeq to point to QuickRNASeq installation folder
Please ALSO refer to $QuickRNASeq/star-fc-qc.config.template to create your own project
specific configuration file.
########################################################################
P.S A sample configuration file is as below.
Note:
Please refer to star-fc-qc.config.template for more details
You can copy star-fc-qc.config.template in QuickRNASeq package to your
project folder and then customize it
########################################################################
#
# RNA-seq project related information
#
FASTQ_DIR=/hpc/grid/ngsws/molmed/data/QuickRNASeq/fastq
FASTQ_SUFFIX=fq.gz
STRAND=0
SEQUENCE_DEPTH=regular
SEQUENCE_TYPE=pair
COMPREHENSIVE_QC=true
LOGDIR=/hpc/grid/scratch/$USER/log
#
## Species-specific GenomeIndex and GTF
## Usually, you DON'T need to modify them unless a different geneome/annotation is used
#
GENOME_FASTA=/hpc/grid/shared/ngsdb/fasta/GRCh38.primary.genome.fa
GENOME_INDEX=/hpc/grid/shared/ngsdb/STAR/GRCh38_gencode23_100
GENOME_ANNOTATION=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.annot
GTF_FILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.gtf
BEDFILE=/hpc/grid/shared/ngsdb/annotation/gencode/GRCh38.gencode.v23.bed
CHR_REGION=chr6:1-170805979
#
##Software parameters
## Usually, you DON'T need to modify them unless you understand the impacts of these parameters
#
STAR_PARAMETER="--alignSJDBoverhangMin 1 $limitBAMsortRAM --outFilterMismatchNoverLmax 0.05 --outFilterScoreMinOverLread 0.90 --outFilterMatchNminOverLread 0.90 --alignIntronMax 1000000"
FEATURECOUNTS_OVERLAP="--minReadOverlap 25"
VARSCAN_PARAMETER="--min-coverage 20 --min-reads2 4"
#
##Software locations
## Usually, you DON'T need to modify them unless you update packages
#
STAR_RNA=/hpc/grid/shared/ngsapp/STAR_2.4.0k/bin/Linux_x86_64
FEATURECOUNTS=/hpc/grid/shared/ngsapp/subread-1.4.6/bin
RSeQC=/afs/grid.pfizer.com/mm/proj01/app/python/bin
VARSCAN_JAR=/hpc/grid/shared/ngsapp/bin/VarScan.v2.4.0.jar
########################################################################
P.S Descriptions on data and figure summary files under project folder
Results/Summary
########################################################################
sample.annotation.txt sample annotation file
star-mapping-summary.txt read mapping summary for STAR run
fc-counting-summary.txt read counting summary for featureCounts run
RSeQC-read-distribution.txt The distribution of mapped reads along gene elements
Gene-count-table.txt counting table
Gene-count-table.flt.txt filtered counting table. A gene is filtered if having 0 read in more than 50% samples
Gene-fpkm-table.txt RPKM table, calculated from Gene-count-table.txt
Gene-fpkm-table.flt.txt filtered RPKM table, calculated from Gene-count-table.flt.txt
RNASeq-expr-QC.txt correlation based QC of expression profile
RNASeq-expr-corr.txt All-against-all gene expression correlation matrix
RNASeq-snp-corr.txt All-against-all SNP correlation matrix
RNASeq-merged-metrics.txt Merged RNA-seq metrics from mapping, counting, distribution and QC
read_map_sum.10x7.png plot of star-mapping-summary.txt. Intended for powerpoint presentation.
read_map_sum.png plot of star-mapping-summary.txt. Intended for interactive HTML presentation.
read_count_sum.10x7.png plot of fc-counting-summary.txt. Intended for powerpoint presentation.
read_count_sum.png plot of fc-counting-summary.txt. Intended for interactive HTML presentation.
read_dist_sum.10x7.png plot of RSeQC-read-distribution.txt. Intended for powerpoint presentation.
read_dist_sum.png plot of RSeQC-read-distribution.txt. Intended for interactive HTML presentation.
expr_count_RPKM.10x7.png plot the number of genes with a give RPKM cutoff. Intended for powerpoint presentation.
expr_count_RPKM.png plot the number of genes with a give RPKM cutoff. Intended for interactive HTML presentation.
RNASeq-expr-corr.txt.corr.9x7.png correlation plot for RNASeq-expr-corr.txt.Intended for powerpoint presentation.
RNASeq-expr-corr.txt.corr.png correlation plot for RNASeq-expr-corr.txt.Intended for interactive HTML presentation.
RNASeq-snp-corr.txt.corr.9x7.png correlation plot for RNASeq-snp-corr.txt.Intended for powerpoint presentation.
RNASeq-snp-corr.txt.corr.png correlation plot for RNASeq-snp-corr.txt.Intended for interactive HTML presentation.