PPLine 0.9.8.4
PPLine is an automated Python-based pipeline for processing raw RNA-seq or exome sequencing data. PPLine provides:
• read mapping (STAR/Tophat2/bowtie/bowtie2), including splice-aware mapping
• gene and transcript expression quantification (HTSeq-count/Cufflinks)
• SNP calling with BQSR and indel realignment (GATK/samtools)
• SNP annotation (Annovar)
• alternatively spliced transcript discovery and quantification (Cufflinks)
• integration of the results, prediction of proteotypic peptides, and creation of ref/alt protein FASTA databases.
Quick start. Prerequisites
Download the PPLine.0.9.8.4.with.preinstalled.soft.tar.gz package, which includes all of these tools, so you do not need to install them separately.
However, you still need to install some other prerequisites:
sudo apt-get install python3 python3-dev python3-numpy python3-psutil python3-xlsxwriter python2.7-dev python-numpy perl default-jre wget curl bowtie bowtie2
PPLine accepts .fastq, .fastq.gz, .bam, .sra and .vcf files or SRA accessions as input. At the first start, PPLine will offer to download the human genome GTF/FASTA, dbSNP and Annovar databases.
To start PPLine, use the following syntax:
python3 PPLine.py --seq-type <rna-seq/exome> --in sample1_fwd.fastq,sample1_rev.fastq+sample2_fwd.fastq,sample2_rev.fastq
Use commas to separate forward and reverse reads and '+' to separate different samples or libraries. You can combine several file types (.fastq, .fastq.gz, .bam, .vcf) and both paired-end and single-end libraries in one run:
python3 PPLine.py --seq-type rna-seq --in SRR975563_1.fastq,SRR975563_2.fastq+SRR975564.fastq+SRR975565.bam
You may also provide SRA accession numbers:
python3 PPLine.py --seq-type rna-seq --in SRR975551+SRR975552+SRR975553
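The --in syntax described above (commas between the files of one sample, '+' between samples) can be illustrated with a small sketch. This is not PPLine's own parser: the function names parse_in_spec and input_kind are hypothetical, and guessing the input type from the file extension is an assumption made only for illustration.

```python
# Illustrative sketch only: PPLine's real argument handling may differ.
KNOWN_EXTENSIONS = (".fastq.gz", ".fastq", ".bam", ".sra", ".vcf")


def input_kind(name):
    """Guess the input type from the file name (assumed heuristic)."""
    lower = name.lower()
    for ext in KNOWN_EXTENSIONS:
        if lower.endswith(ext):
            return ext.lstrip(".")
    # No known extension: assume an SRA accession such as SRR975551.
    return "sra-accession"


def parse_in_spec(spec):
    """Split an --in value into samples; each sample is a list of entries."""
    samples = []
    for sample in spec.split("+"):          # '+' separates samples/libraries
        entries = [e.strip() for e in sample.split(",") if e.strip()]
        if entries:                         # ',' separates forward/reverse reads
            samples.append(entries)
    return samples


if __name__ == "__main__":
    spec = "SRR975563_1.fastq,SRR975563_2.fastq+SRR975564.fastq+SRR975565.bam"
    for entries in parse_in_spec(spec):
        print([(e, input_kind(e)) for e in entries])
```

A sample with two entries corresponds to a paired-end library, and a sample with one entry to a single-end library, a BAM file, or an accession.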
Tested and optimized for Ubuntu 14.04
Main switches:
Quantification of gene expression (read counts)
--assess-read-counts-for-ref-genes [y/n]
Default: yes
SNP calling and annotation
--enable-snp-analysis [y/n]
Default: no for RNA-Seq and yes for exome
Splicing analysis with discovery of novel splice junctions
--enable-novel-splice-junctions-search [y/n]
Default: no
Quantification of alternative splice isoforms (FPKM)
--assess-ref-transcript-abundance [y/n]
Default: yes
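When PPLine is launched from another script, the main switches can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The helper name build_command is hypothetical, and passing the literal values "y" and "n" is an assumption based on the [y/n] notation above.

```python
import shlex


def build_command(seq_type, in_spec, switches=None):
    """Build a PPLine argument list (sketch; assumes switches take 'y' or 'n')."""
    cmd = ["python3", "PPLine.py", "--seq-type", seq_type, "--in", in_spec]
    for flag, enabled in (switches or {}).items():
        cmd += [flag, "y" if enabled else "n"]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "rna-seq",
        "SRR975551+SRR975552+SRR975553",
        {
            "--enable-snp-analysis": True,
            "--enable-novel-splice-junctions-search": False,
        },
    )
    # Print a shell-quoted version for inspection before running it,
    # for example with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```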
A complete User's Manual is coming soon.
If you found PPLine useful, please cite: Krasnov GS, Dmitriev AA, Kudryavtseva AV et al. PPLine: An Automated Pipeline for SNP, SAP, and Splice Variant Detection in the Context of Proteogenomics. J Proteome Res. 2015 Sep 4;14(9):3729-37. doi: 10.1021/acs.jproteome.5b00490. PMID 26147802
Details
When you start PPLine for the first time, the STAR/bowtie/bowtie2 reference genome index is created. This may take up to 3 hours. PPLine also performs other preprocessing steps, such as reordering the genome to make it suitable for GATK, and will ask you to download the Annovar databases (this also takes time).
GATK still has issues with spliced RNA-Seq reads, so the GATK indel realignment procedure is executed only for exome sequencing data. GATK base quality recalibration is performed for both RNA-Seq and exome sequencing data.
You can use PPLine to translate previously obtained vcf or GTF files. To do this, do not supply the --in-fasta argument; instead supply --vcf [path to vcf file] or --in-gtf [path to GTF file with predicted alternatively spliced gene model]. See the following section for details.
The following files will be generated.
In the directory where each .fastq file is located:
1. vcf file with the detected mutations/SNPs
2. Tables with predicted protein splice variants and proteotypic peptides
3. BAM files and other outputs, placed in .tophat.results folders
4. Files with gene expression (.counts) and isoform expression (.FPKM.gtf, .FPKM.tsv) data
In the current directory (this location can be customized with the --output-snp-table-dir argument):
1. SNP.table.xlsx – summary table with SNP/SAP description and annotation for each sample (contains proteotypic peptides)
2. Proteins.Alt.fa – sequences of proteins containing SAP
3. Proteins.Ref.fa – sequences of wild-type proteins
Complete list of options:
--config-file
Configuration file (genome FASTA, GTF, Tophat, bowtie paths etc.). Default: config.txt
--seq-type
Experiment type, exome or RNA-Seq
--in
Input files (fastq, fastq.gz, SRA, BAM, vcf). Syntax: sample1.R1.fastq,sample1.R2.fastq+sample2.bam+....
--enable-snp-analysis
Enables or disables SNP calling. Default: yes
--enable-novel-splice-junctions-search
Enables or disables alternative splicing analysis. Default: yes
--assess-ref-transcript-abundance
Enable expression analysis [transcripts and genes, Cufflinks, FPKM]. Default: yes
--assess-read-counts-for-ref-genes
Enable expression analysis [genes, HTSeq-count, read count]. Default: yes
--bypass-annovar
Disables Annovar SNP/SAP annotation. Default: no
--cufflinks-thresholds
Custom coverage thresholds for Cufflinks isoforms assembly. Default: 30,100,300,1000,3000,10000
--output-snp-table-dir
Defines the output directory for SNP.table.xlsx, Proteins.Alt.fa and Proteins.Ref.fa files
--include-additional-info
Includes additional tables: SAP and splice saturation curve data, exon expression data (FPKM). Default: no
--aligner
Read aligner. STAR is default for RNA-Seq; bowtie2 or bowtie is default for exome
--bowtie-ver
Bowtie aligner version. Default: 2
--trim-qual
5’-tail trimming quality (for Trimmomatic). The greater the value, the more aggressively reads are trimmed. Default: 27
--trim-min-len
Minimum length a read must have after trimming in order to be kept
--trim-sliding-window-qual
Minimal quality within sliding window to keep. Default: 14
--trim-sliding-window-len
Length of sliding window. Default: 4
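The sliding-window options can be illustrated with a simplified sketch of the idea: scan the read from its start and cut at the first window whose mean quality falls below the threshold. PPLine delegates trimming to Trimmomatic, so this is not the code that actually runs. The function names are hypothetical, and the min_len value used in the example is an arbitrary illustration because no default for --trim-min-len is given above.

```python
def sliding_window_keep(quals, window_len=4, min_qual=14):
    """Return how many leading bases to keep (simplified sliding-window rule)."""
    for start in range(len(quals) - window_len + 1):
        window = quals[start:start + window_len]
        if sum(window) / window_len < min_qual:
            return start          # cut where the first failing window begins
    return len(quals)             # no window failed: keep the whole read


def survives(quals, min_len, window_len=4, min_qual=14):
    """True if the trimmed read is still at least min_len bases long."""
    return sliding_window_keep(quals, window_len, min_qual) >= min_len


if __name__ == "__main__":
    phred = [30, 30, 30, 30, 30, 10, 8, 5, 2, 2]
    kept = sliding_window_keep(phred)
    # min_len=3 is an arbitrary value chosen for this example.
    print(kept, survives(phred, min_len=3))
```

Raising the window quality threshold makes the cut happen earlier, so fewer bases are kept.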
--threads-num
Number of threads. Most, but not all, processing steps are parallelized. Default: CPU count-2
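To check which value the default would resolve to on a given machine, the rule can be reproduced in a few lines. This is a sketch: the floor of one thread on machines with one or two cores is an assumption, since the behavior in that case is not stated here.

```python
import os


def default_threads():
    """CPU count minus 2, with an assumed lower bound of 1 thread."""
    cpu_count = os.cpu_count() or 1   # os.cpu_count() may return None
    return max(1, cpu_count - 2)


if __name__ == "__main__":
    print(default_threads())
```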
--in-gtf
Use this option to override GTF declared in config.txt. Default: None
--out-transcripts
Output file with transcripts predicted as the result of alternative splicing. Default: None
--out-proteins
Output file with sequences of proteins predicted as the result of alternative splicing. Default: None
--out-snp-table
Output SNP table. This argument is used when PPLine is launched to process custom vcf file. Default: None
--out-sap-fasta
Output FASTA file with protein sequences containing SAP/mutations. This argument is used when PPLine is launched to process custom vcf file. Default: None
--out-sap-fasta-ref
Output FASTA file with reference protein sequences. This argument is used when PPLine is launched to process custom vcf file. Default: None
--out-splice-table
Output table with alternatively spliced proteins info. This argument is used when PPLine is launched to process custom GTF file. Default: None
--include-transcripts-in-splice-table
Switches whether to include transcripts sequences into splice table. This significantly increases output file size. Default: no
--gtf-with-translation-starts
If this argument is not specified, alternatively spliced transcripts are translated using the start codons from the reference GTF (declared in config.txt). To override this file, specify the argument.
--genome-fasta
Use this option to override genome sequence declared in config.txt
--exclude-nmd
Excludes ‘nonsense_mediated_decay’ entries from GTF file. Default: no
--vcf
Use this option when you launch PPLine only to process an existing vcf file (that is, when you do not specify --in-fastq). Format: sample1.vcf,sample2.vcf,sample3.vcf,...
--merge-vcf
Switches whether to merge vcf files or analyze samples separately. Default: no
--1000genomes-format
Switches whether input vcf files are in 1000-genomes-like format (DP and other characteristics are specified in the 8th column). Default: no
--min-q
Minimal SNP Phred quality score to pass. Default: 0
--min-alt-dp
Minimal share of reads with alternative allele. Default: 0
--max-indel-size
Maximal size of indel. Default: None
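The three filters above can be sketched as a single predicate over one variant. This is an illustration, not PPLine's implementation: the function name and parameters are hypothetical, the share of alternative-allele reads is assumed to be alt_reads / total_reads expressed as a fraction, and indel size is assumed to be the length difference between the ref and alt alleles.

```python
def passes_filters(ref, alt, qual, alt_reads, total_reads,
                   min_q=0, min_alt_share=0.0, max_indel_size=None):
    """Apply quality, allele-share and indel-size filters (assumed semantics)."""
    if qual < min_q:
        return False
    # Assumption: "share" means the fraction of reads carrying the alt allele.
    share = alt_reads / total_reads if total_reads else 0.0
    if share < min_alt_share:
        return False
    # Assumption: indel size is the ref/alt length difference (0 for a SNP).
    indel_size = abs(len(ref) - len(alt))
    if max_indel_size is not None and indel_size > max_indel_size:
        return False
    return True


if __name__ == "__main__":
    print(passes_filters("A", "G", qual=50, alt_reads=5, total_reads=20,
                         min_q=30, min_alt_share=0.2))
```

With the defaults listed above (0, 0 and None) every variant passes, which matches a pipeline that leaves filtering to the user.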
--directly-apply-snp
PPLine applies SNPs directly to the genome and then translates. Not recommended. Default: no
--bypass-synonymic-snp
Bypasses synonymous SNPs/mutations. Default: yes
--extract-only-protein-coding
Uses only CDS with ‘protein_coding’ status in reference GTF. Default: yes
--uniprot
Use this to override Uniprot file location specified in config.txt
--include-seq-in-table
Switches whether to include protein sequence in splice table. Default: yes
--exclude-incomplete-cds
Excludes incomplete CDS (e.g. length is not a multiple of 3). Default: no
--anno-list
External Annovar annotations. Format: EUR:hepg2.eur.dropped.txt+AFR:hepg2.afr.dropped.txt+GWAS1:hepg2.gwas.dropped.txt+.... This argument is only used when you launch PPLine to analyze a specified vcf with no --in-fasta provided. Default: None
--anno-dir
Directories with external Annovar annotations. These directories should contain files like anno.hg19_AMR.sites.2014_10_dropped, anno.hg19_SAS.sites.2014_10_dropped, anno.hg19_clinvar_20140929_dropped. Format: dir1,dir2,dir3,... This argument is only used when you launch PPLine to analyze specified vcf with no --in-fasta provided. Default: None
--separate-sift
External SIFT-generated annotations (additional). This argument is only used when you launch PPLine to analyze specified vcf with no --in-fasta provided. Default: None
--sam
PPLine can evaluate the coverage distribution across the proteins (declared in GTF). To do this, specify this argument. Format: Liver:reads1.sam,reads2.sam+HepG2:reads3.sam,reads4.sam. Default: None
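The labelled format used by --sam (and, with one file per label, by --anno-list) groups files under a name: '+' separates groups, ':' separates the label from its files, and ',' separates the files. A minimal sketch of reading such a value follows. The function name parse_labelled_groups is hypothetical and this is not PPLine's own parser.

```python
def parse_labelled_groups(spec):
    """Parse 'Label:file1,file2+Label2:file3' into {label: [files]} (sketch)."""
    groups = {}
    for group in spec.split("+"):
        label, sep, files = group.partition(":")
        if not sep:
            continue                      # skip malformed groups without a label
        groups[label.strip()] = [f.strip() for f in files.split(",") if f.strip()]
    return groups


if __name__ == "__main__":
    spec = "Liver:reads1.sam,reads2.sam+HepG2:reads3.sam,reads4.sam"
    for label, files in parse_labelled_groups(spec).items():
        print(label, files)
```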
--expression-info-dir
Output directory to place HTML-formatted reads coverage info (derived from SAM files). Default: ./Expression_info
--split-expression-info-by-chr
Switches whether to split expression info by chromosome. This is recommended because of the large number of output files (one HTML file per protein)
--limit-to-chr
Chromosome limit. Format: 18,19,20,X. Default: None
--ref-proteins-fasta
Sequences of reference proteins to perform negative selection of proteotypic peptides. Format: file1.fa,file2.fa,... By default, PPLine uses proteins translated from reference GTF (specified in config.txt) and UniprotKB (specified in config.txt).
--exclude-known-uniprot-proteins-from-in-gtf
Switches whether to exclude Uniprot-annotated proteins when parsing GTF file. Default: no
--exclude-enst-proteins-from-splice
Switches whether to exclude known proteins (annotated in reference GTF file) from alternative splicing analysis results. Default: no
--exclude-null-fpkm-proteins-from-splice
Switches whether to exclude proteins with zero-FPKM (when all reads are localized in non-coding areas) from alternative splicing analysis results. Default: yes
--trim-proteins-to-first-stop-codon
Switches whether to trim GTF-declared proteins to first stop codon. Default: yes
--pp-min-length
Minimal length of proteotypic peptides. Default: 6
--pp-max-length
Maximal length of proteotypic peptides. Default: 25
--pp-mis-cleavages
Allowed miscleavages of proteotypic peptides. Default: 2
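How the length and miscleavage limits interact can be shown with an in-silico digest. The sketch below assumes trypsin specificity (cleavage after K or R unless the next residue is P); the protease rules PPLine actually uses are not stated here, and the function names and example sequences are hypothetical. Peptides of the variant protein that also occur in a reference protein are removed, which mirrors the negative selection described for --ref-proteins-fasta. The zero-width split requires Python 3.7 or later.

```python
import re


def tryptic_peptides(protein, min_len=6, max_len=25, missed_cleavages=2):
    """Peptides from an assumed tryptic digest, within the length limits."""
    # Split after K or R when the next residue is not P.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages + 1, len(fragments))
        for j in range(i, last):
            peptide = "".join(fragments[i:j + 1])   # j - i missed cleavages
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


def proteotypic_peptides(alt_protein, ref_proteins, **limits):
    """Peptides of alt_protein that occur in none of the reference proteins."""
    reference = set()
    for ref in ref_proteins:
        reference |= tryptic_peptides(ref, **limits)
    return tryptic_peptides(alt_protein, **limits) - reference


if __name__ == "__main__":
    ref = "MASTEELLKAGGDLPRVNSQTWK"   # made-up sequence
    alt = "MASTEELLKAGGDMPRVNSQTWK"   # same sequence with one substitution
    print(sorted(proteotypic_peptides(alt, [ref], missed_cleavages=0)))
```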
--length-distrib-file
File with distribution of proteotypic peptides lengths. Default: None
--nonsyn-snp-fraction-distrib-file
File with the distribution of the non-synonymous SNP fraction (RnsSNP) in coding regions. The distribution of the fraction of known (dbSNP-annotated) SNPs is also written to this file. These distributions are useful for adjusting coverage or Phred quality thresholds and for evaluating FDR. Default: None
--ref-alt-splice-table
HTML with proteotypic peptides specific for various isoforms of reference (or alternatively spliced) proteins. Default: None
--modify-uniprot-id-in-fasta
Switches whether to modify Uniprot IDs in Proteins.Alt.fa and Proteins.Ref.fa. If you turn this option on, the IDs will look like this: sp|Q96NU1-s|SAM11_HUMAN (SAP-containing protein) or sp|Q96NU1-ns|SAM11_HUMAN (reference protein). This simple option is useful when you load FASTA protein databases into MaxQuant or other proteomic software. Default: yes
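The ID pattern shown for this option can be reproduced with a short helper, which may be handy when post-processing search results back to the original accessions. The function name tag_uniprot_id is hypothetical; only the "-s" and "-ns" suffixes come from the description above.

```python
def tag_uniprot_id(header, has_sap):
    """Append '-s' (SAP-containing) or '-ns' (reference) to the accession field."""
    parts = header.split("|")
    if len(parts) >= 3:                   # expected form: db|ACCESSION|ENTRY_NAME
        parts[1] += "-s" if has_sap else "-ns"
    return "|".join(parts)


if __name__ == "__main__":
    print(tag_uniprot_id("sp|Q96NU1|SAM11_HUMAN", has_sap=True))
    print(tag_uniprot_id("sp|Q96NU1|SAM11_HUMAN", has_sap=False))
```

Headers that do not follow the three-field form are returned unchanged.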
--add-full-indel-column
Switches whether to add Indel detailed info columns to SAP table. These columns contain full sequences of alt and ref alleles, as well as alternative open reading frame (ARF) length which is useful to discover novel long functional ARFs coming from either indels or mRNA editing.
Other info, if you don't want to use the preinstalled software
If you don't want to use pre-installed software (located in ./Useful_stuff/) you should download and install the following prerequisites:
• samtools and bcftools (http://www.htslib.org/download/ )
• bowtie2 (http://sourceforge.net/projects/bowtie-bio/files/bowtie2/ ) and/or bowtie (http://sourceforge.net/projects/bowtie-bio/files/bowtie/ )
• Tophat (https://ccb.jhu.edu/software/tophat/tutorial.shtml )
• GATK (https://www.broadinstitute.org/gatk/download/ )
• picard-tools (http://broadinstitute.github.io/picard/ )
• Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic )
• Annovar (http://annovar.openbioinformatics.org/en/latest/user-guide/download/ )
• Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/install/ )
• HTSeq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)
• STAR (https://github.com/alexdobin/STAR)
Once you have finished, edit config.txt to match your setup. If you downloaded PPLine with preinstalled software, you only need to specify the genome FASTA/GTF, Uniprot and dbSNP vcf paths.
ref_GTF = path to genome model file [like GRCh38.80.gtf]
ref_genome = path to genome sequence file [like GRCh38.dna.toplevel.fa]
uniprot_path = path to UniProtKB proteome file; used to reveal UniProtKB identifier [like uniprot.fa]
dbSNP = path to dbSNP vcf file [like dbSNP.all.hg19.20150102.vcf]
dbSNP_reordered = path to dbSNP vcf in karyotypic order, which is required by GATK. PPLine will automatically create this file and add this entry to config.txt, but this process takes a lot of memory (up to 40 GB) and time (1 hour).
samtools_path = path to samtools binary [/usr/bin/samtools]. For some reason, newer versions of samtools behave unexpectedly during SNP calling with PPLine.
bcftools_path = path to bcftools binary [/usr/bin/bcftools]
tophat_path = path to tophat executable [/usr/bin/tophat]
STAR_path = path to STAR aligner
bed_to_juncs_path = path to splice junctions list converter. It is usually provided together with Tophat [/usr/bin/bed_to_juncs]
GATK_path = path to GATK .jar-file [like this: /home/user/GATK/GenomeAnalysisTK.jar]
Annovar_path = path to Annovar directory containing annotate_variation.pl and convert2annovar.pl Perl scripts.
picard-tools_path = path to picard-tools binary [/usr/bin/picard-tools]
trimmomatic_path = path to Trimmomatic .jar-file [like /home/user/Trimmomatic-0.32/trimmomatic-0.32.jar]
adapters_seq_path = path to the file listing the sequencing adapters to be trimmed from reads. It is usually provided together with the Trimmomatic archive [like /home/user/Trimmomatic-0.32/adapters/TruSeq12.fas]
bowtie2_path = path to bowtie2 executable [/usr/bin/bowtie2]
bowtie2-build_path = path to bowtie2-build executable [/usr/bin/bowtie2-build]
bowtie_path = path to bowtie executable [/usr/bin/bowtie]
bowtie-build_path = path to bowtie-build executable [/usr/bin/bowtie-build]
cufflinks_path = path to Cufflinks binary [/usr/bin/cufflinks]
cuffcompare_path = path to cuffcompare binary [/usr/bin/cuffcompare]
htseq-count_path = path to HTSeq-count
fastq-dump_path = path to NCBI SRA toolkit fastq-dump
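Before a long run it can be useful to check that config.txt contains the entries you need. The sketch below assumes the file consists of plain "key = value" lines, as in the list above. Treating lines that start with '#' as comments is an assumption, and the function names and example paths are hypothetical. The four required keys are the ones named above as necessary when the preinstalled software is used.

```python
REQUIRED_KEYS = ("ref_GTF", "ref_genome", "uniprot_path", "dbSNP")


def parse_config(text):
    """Parse 'key = value' lines into a dict (assumed config.txt layout)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # '#' comments are an assumption
            continue
        key, sep, value = line.partition("=")
        if sep:
            config[key.strip()] = value.strip()
    return config


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not config.get(key)]


if __name__ == "__main__":
    example = """
    ref_GTF = /data/GRCh38.80.gtf
    ref_genome = /data/GRCh38.dna.toplevel.fa
    uniprot_path = /data/uniprot.fa
    """
    print(missing_keys(parse_config(example)))
```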