
PPLine 0.9.8.4

PPLine is an automated Python-based pipeline for processing raw RNA-seq or exome sequencing data. PPLine provides:

• read mapping (STAR/Tophat2/bowtie/bowtie2), including splice-aware mapping

• gene and transcript expression quantification (HTSeq-count/Cufflinks)

• SNP calling with BQSR and indel realignment (GATK/samtools)

• SNP annotation (Annovar)

• alternatively spliced transcript discovery and quantification (Cufflinks)

• integration of the results, prediction of proteotypic peptides, and creation of a ref/alt protein FASTA database.

Quick start. Prerequisites

Download the PPLine.0.9.8.4.with.preinstalled.soft.tar.gz package, which includes all of these tools, so you do not have to install them yourself.

However, you still need to install a few other prerequisites:

sudo apt-get install python3 python3-dev python3-numpy python3-psutil python3-xlsxwriter python2.7-dev python-numpy perl default-jre wget curl bowtie bowtie2

PPLine accepts .fastq, .fastq.gz, .bam, .sra and .vcf files or SRA accessions as input. On first start, PPLine will offer to download the human genome GTF/FASTA, dbSNP and Annovar databases.

To start PPLine, use the following syntax:

python3 PPLine.py --seq-type <rna-seq/exome> --in sample1_fwd.fastq,sample1_rev.fastq+sample2_fwd.fastq,sample2_rev.fastq

Use commas to separate forward and reverse reads and '+' to separate different samples or libraries. You can combine several file types (.fastq, .fastq.gz, .bam, .vcf) and both paired-end and single-end libraries in one run:

python3 PPLine.py --seq-type rna-seq --in SRR975563_1.fastq,SRR975563_2.fastq+SRR975564.fastq+SRR975565.bam

You may also provide SRA accession numbers:

python3 PPLine.py --seq-type rna-seq --in SRR975551+SRR975552+SRR975553
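The --in syntax described above can be illustrated with a small Python sketch. It is not PPLine's own parser: the function name, the file-kind labels and the regular expression for SRA run accessions are assumptions made for this example.

```python
import re

# Assumption: SRA run accessions look like SRR/ERR/DRR followed by digits.
SRA_ACCESSION = re.compile(r"^[SED]RR\d+$")


def parse_in_argument(value):
    """Split an --in string into samples; each sample is a list of (name, kind)."""
    samples = []
    for sample in value.split("+"):        # '+' separates samples or libraries
        files = []
        for name in sample.split(","):     # ',' separates forward and reverse reads
            name = name.strip()
            if SRA_ACCESSION.match(name):
                kind = "sra-accession"
            elif name.endswith((".fastq", ".fastq.gz")):
                kind = "fastq"
            elif name.endswith(".bam"):
                kind = "bam"
            elif name.endswith(".vcf"):
                kind = "vcf"
            elif name.endswith(".sra"):
                kind = "sra-file"
            else:
                kind = "unknown"
            files.append((name, kind))
        samples.append(files)
    return samples
```

For the mixed example above, this yields three samples: one paired-end fastq library, one single-end fastq library and one BAM file.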

Tested and optimized for Ubuntu 14.04

Main switches:

Quantification of gene expression (read counts)

--assess-read-counts-for-ref-genes [y/n]

Default: yes

SNP calling and annotation

--enable-snp-analysis [y/n]

Default: no for RNA-Seq and yes for exome

Splicing analysis with discovery of novel splice junctions

--enable-novel-splice-junctions-search [y/n]

Default: no

Quantification of alternative splice isoforms (FPKM)

--assess-ref-transcript-abundance [y/n]

Default: yes
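As a rough sketch, the four main switches and their documented defaults can be assembled into a command line from Python. The helper names are hypothetical, and passing the values as `y`/`n` is an assumption based on the `[y/n]` notation above.

```python
def default_switches(seq_type):
    """Defaults of the main switches as listed above (values assumed to be y/n)."""
    return {
        "--assess-read-counts-for-ref-genes": "y",
        # SNP analysis defaults to "no" for RNA-Seq and "yes" for exome.
        "--enable-snp-analysis": "y" if seq_type == "exome" else "n",
        "--enable-novel-splice-junctions-search": "n",
        "--assess-ref-transcript-abundance": "y",
    }


def build_command(seq_type, in_value, overrides=None):
    """Return the argument list for one PPLine run."""
    switches = default_switches(seq_type)
    switches.update(overrides or {})       # caller-supplied values win
    cmd = ["python3", "PPLine.py", "--seq-type", seq_type, "--in", in_value]
    for flag, value in switches.items():
        cmd += [flag, value]
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with file names.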

Complete User's Manual is coming soon.

If you find PPLine useful, please cite: Krasnov GS, Dmitriev AA, Kudryavtseva AV et al. PPLine: An Automated Pipeline for SNP, SAP, and Splice Variant Detection in the Context of Proteogenomics. J Proteome Res. 2015 Sep 4;14(9):3729-37. doi: 10.1021/acs.jproteome.5b00490. PMID 26147802

Details

When you start PPLine for the first time, it builds the STAR/bowtie/bowtie2 reference genome index. This may take up to 3 hours. PPLine also performs other preprocessing steps, such as reordering the genome to make it suitable for GATK, and asks you to download the Annovar databases (this also takes time).

GATK still has issues with spliced RNA-Seq reads, so the GATK indel realignment procedure is executed only for exome sequencing data. GATK base quality recalibration is performed for both RNA-Seq and exome sequencing data.

You can use PPLine to translate previously obtained vcf or GTF files. To do this, don’t supply the --in-fasta argument; instead supply --vcf [path to vcf file] or --in-gtf [path to GTF file with predicted alternatively spliced gene model]. See the following section for details.

The following files will be generated.

In the directory where each .fastq file is located:

1. vcf file with the mutations/SNPs found

2. Tables with predicted protein splice variants and proteotypic peptides

3. BAM files and others, placed in .tophat.results folders

4. Files with gene expression (.counts) and isoform expression (.FPKM.gtf, .FPKM.tsv) data

In the current directory (this location can be customized with the --output-snp-table-dir argument):

1. SNP.table.xlsx: summary table with SNP/SAP description and annotation for each sample (contains proteotypic peptides)

2. Proteins.Alt.fa: sequences of proteins containing SAP

3. Proteins.Ref.fa: sequences of wild-type proteins
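After a run, a short check like the following can confirm that the three summary files are present. This is a minimal sketch: the function name is hypothetical, and the directory argument should be whatever was passed to --output-snp-table-dir (the current directory by default).

```python
from pathlib import Path

# Summary files named in the list above.
SUMMARY_FILES = ("SNP.table.xlsx", "Proteins.Alt.fa", "Proteins.Ref.fa")


def missing_summary_files(output_dir="."):
    """Return the names of summary files that are not present in output_dir."""
    out = Path(output_dir)
    return [name for name in SUMMARY_FILES if not (out / name).is_file()]
```

An empty list means all three files were found.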

Complete list of options:

--config-file

Configuration file (genome FASTA, GTF, Tophat, bowtie paths etc.). Default: config.txt

--seq-type

Experiment type, exome or RNA-Seq

--in

Input files (fastq, fastq.gz, SRA, BAM, vcf). Syntax: sample1.R1.fastq,sample1.R2.fastq+sample2.bam+....

--enable-snp-analysis

Enables or disables SNP calling. Default: no for RNA-Seq and yes for exome

--enable-novel-splice-junctions-search

Enables or disables alternative splicing analysis. Default: no

--assess-ref-transcript-abundance

Enable expression analysis [transcripts and genes, Cufflinks, FPKM]. Default: yes

--assess-read-counts-for-ref-genes

Enable expression analysis [genes, HTSeq-count, read count]. Default: yes

--bypass-annovar

Disables Annovar SNP/SAP annotation. Default: no

--cufflinks-thresholds

Custom coverage thresholds for Cufflinks isoforms assembly. Default: 30,100,300,1000,3000,10000

--output-snp-table-dir

Defines the output directory for SNP.table.xlsx, Proteins.Alt.fa and Proteins.Ref.fa files

--include-additional-info

Includes additional tables: SAP and splice saturation curve data, exon expression data (FPKM). Default: no

--aligner

Read aligner. STAR is default for RNA-Seq; bowtie2 or bowtie is default for exome

--bowtie-ver

Bowtie aligner version. Default: 2

--trim-qual

5’-tail trimming quality (for Trimmomatic). The greater the value, the more strongly reads are trimmed. Default: 27

--trim-min-len

Minimum length a read must have after trimming to be kept

--trim-sliding-window-qual

Minimal quality within sliding window to keep. Default: 14

--trim-sliding-window-len

Length of sliding window. Default: 4
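To show how the sliding-window options interact, here is a simplified Python sketch of sliding-window quality trimming. It illustrates the general idea only and is not Trimmomatic's exact implementation: the function names are hypothetical, and the read is simply cut at the start of the first window whose mean quality falls below the threshold.

```python
def sliding_window_trim(qualities, window_len=4, min_qual=14):
    """Return how many leading bases to keep for a list of Phred scores."""
    n = len(qualities)
    if n < window_len:
        return n  # assumption: reads shorter than the window are left untouched
    for start in range(n - window_len + 1):
        window = qualities[start:start + window_len]
        if sum(window) / window_len < min_qual:
            return start  # cut where the first low-quality window begins
    return n


def survives(qualities, min_len, window_len=4, min_qual=14):
    """A read is kept only if enough bases remain (compare --trim-min-len)."""
    return sliding_window_trim(qualities, window_len, min_qual) >= min_len
```

With the default window of 4 and threshold of 14, a read whose last four bases all have quality 10 is cut just before those bases.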

--threads-num

Number of threads. Most, but not all, processing steps are parallelized. Default: CPU count - 2

--in-gtf

Use this option to override GTF declared in config.txt. Default: None

--out-transcripts

Output file with transcripts predicted as the result of alternative splicing. Default: None

--out-proteins

Output file with sequences of proteins predicted as the result of alternative splicing. Default: None

--out-snp-table

Output SNP table. This argument is used when PPLine is launched to process custom vcf file. Default: None

--out-sap-fasta

Output FASTA file with protein sequences containing SAP/mutations. This argument is used when PPLine is launched to process custom vcf file. Default: None

--out-sap-fasta-ref

Output FASTA file with reference protein sequences. This argument is used when PPLine is launched to process custom vcf file. Default: None

--out-splice-table

Output table with alternatively spliced proteins info. This argument is used when PPLine is launched to process custom GTF file. Default: None

--include-transcripts-in-splice-table

Switches whether to include transcripts sequences into splice table. This significantly increases output file size. Default: no

--gtf-with-translation-starts

If this argument is not specified, alternatively spliced transcripts are translated using the start codons from the reference GTF (declared in config.txt). Specify this argument to override that file.

--genome-fasta

Use this option to override genome sequence declared in config.txt

--exclude-nmd

Excludes ‘nonsense_mediated_decay’ entries from the GTF file. Default: no

--vcf

Use this option when you launch PPLine only to process known vcf file (when you don’t specify --in-fastq ). Format: sample1.vcf,sample2.vcf,sample3.vcf,...

--merge-vcf

Switches whether to merge vcf files or analyze samples separately. Default: no

--1000genomes-format

Switches whether input vcf files are formatted like 1000 Genomes files (DP and other characteristics are specified in the 8th column). Default: no

--min-q

Minimal SNP Phred quality score to pass. Default: 0

--min-alt-dp

Minimal share of reads with alternative allele. Default: 0

--max-indel-size

Maximal size of indel. Default: None
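The --min-q and --max-indel-size filters can be pictured with the sketch below, which works on a single tab-separated VCF data line (columns CHROM, POS, ID, REF, ALT, QUAL, ...). It is an illustration rather than PPLine's code: the function name is hypothetical, records with a missing QUAL (".") are assumed to pass, and symbolic ALT alleles are not handled. --min-alt-dp is left out because where allele depths are stored depends on the variant caller.

```python
def passes_filters(vcf_line, min_q=0.0, max_indel_size=None):
    """Apply QUAL and indel-size thresholds to one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alts, qual = fields[3], fields[4].split(","), fields[5]
    # Assumption: a missing quality value is not grounds for rejection.
    if qual != "." and float(qual) < min_q:
        return False
    if max_indel_size is not None:
        for alt in alts:
            # Indel size taken as the length difference between ALT and REF.
            if abs(len(alt) - len(ref)) > max_indel_size:
                return False
    return True
```

Header lines (those starting with `#`) should be skipped before calling this function.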

--directly-apply-snp

Applies SNPs directly to the genome and then translates it. Not recommended. Default: no

--bypass-synonymic-snp

Bypasses synonymous SNPs/mutations. Default: yes

--extract-only-protein-coding

Uses only CDS with ‘protein_coding’ status in reference GTF. Default: yes

--uniprot

Use this to override Uniprot file location specified in config.txt

--include-seq-in-table

Switches whether to include protein sequence in splice table. Default: yes

--exclude-incomplete-cds

Excludes incomplete CDS (e.g. length is not a multiple of 3). Default: no

--anno-list

External Annovar annotations. format: EUR:hepg2.eur.dropped.txt+AFR:hepg2.afr.dropped.txt+GWAS1:hepg2.gwas.dropped.txt+.... This argument is only used when you launch PPLine to analyze specified vcf with no --in-fasta provided. Default: None

--anno-dir

Directories with external Annovar annotations. These directories should contain files like anno.hg19_AMR.sites.2014_10_dropped, anno.hg19_SAS.sites.2014_10_dropped, anno.hg19_clinvar_20140929_dropped. Format: dir1,dir2,dir3,... This argument is only used when you launch PPLine to analyze specified vcf with no --in-fasta provided. Default: None

--separate-sift

External SIFT-generated annotations (additional). This argument is only used when you launch PPLine to analyze specified vcf with no --in-fasta provided. Default: None

--sam

PPLine can evaluate the coverage distribution across the proteins (declared in GTF). To do this, specify this argument. Format: Liver:reads1.sam,reads2.sam+HepG2:reads3.sam,reads4.sam. Default: None
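The --sam format groups SAM files under a label, with '+' between groups and ',' between files. A minimal parsing sketch follows; the function name is hypothetical and this is not PPLine's own parser.

```python
def parse_sam_argument(value):
    """Map each label to its list of SAM files, e.g. {'Liver': [...], ...}."""
    groups = {}
    for group in value.split("+"):              # '+' separates labelled groups
        label, _, files = group.partition(":")  # text before the first ':' is the label
        groups[label.strip()] = [f.strip() for f in files.split(",") if f.strip()]
    return groups
```

Because the label is split off at the first colon, paths that themselves contain a colon before the label separator would not be handled by this sketch.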

--expression-info-dir

Output directory to place HTML-formatted reads coverage info (derived from SAM files). Default: ./Expression_info

--split-expression-info-by-chr

Switches whether to split expression info by chromosome. This is recommended because of the large number of output files (one HTML file per protein)

--limit-to-chr

Chromosome limit. Format: 18,19,20,X. Default: None

--ref-proteins-fasta

Sequences of reference proteins to perform negative selection of proteotypic peptides. Format: file1.fa,file2.fa,... By default, PPLine uses proteins translated from reference GTF (specified in config.txt) and UniprotKB (specified in config.txt).

--exclude-known-uniprot-proteins-from-in-gtf

Switches whether to exclude Uniprot-annotated proteins when parsing GTF file. Default: no

--exclude-enst-proteins-from-splice

Switches whether to exclude known proteins (annotated in reference GTF file) from alternative splicing analysis results. Default: no

--exclude-null-fpkm-proteins-from-splice

Switches whether to exclude proteins with zero-FPKM (when all reads are localized in non-coding areas) from alternative splicing analysis results. Default: yes

--trim-proteins-to-first-stop-codon

Switches whether to trim GTF-declared proteins to first stop codon. Default: yes
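Trimming to the first stop codon, together with the completeness check behind --exclude-incomplete-cds, can be sketched at the nucleotide level as follows. The function names are hypothetical, the standard stop codons (TAA, TAG, TGA) are assumed, and the sequence is assumed to start in frame.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(cds):
    """A CDS is treated as complete here if its length is a multiple of 3."""
    return len(cds) % 3 == 0


def trim_to_first_stop(cds):
    """Return the coding sequence up to, but not including, the first in-frame stop."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):   # walk codon by codon from position 0
        if cds[i:i + 3] in STOP_CODONS:
            return cds[:i]
    return cds  # no in-frame stop codon found
```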

--pp-min-length

Minimal length of proteotypic peptides. Default: 6

--pp-max-length

Maximal length of proteotypic peptides. Default: 25

--pp-mis-cleavages

Number of allowed missed cleavages in proteotypic peptides. Default: 2
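The three --pp-* options can be illustrated with an in-silico digestion sketch. The README does not state which protease PPLine models, so trypsin (cleavage after K or R, but not before P) is an assumption here, and both function names are hypothetical. The second function mimics negative selection against reference proteins in the simplest possible way.

```python
def digest(protein, min_len=6, max_len=25, missed_cleavages=2):
    """Return candidate peptides from an assumed tryptic digestion."""
    sites = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR" and protein[i + 1] != "P":   # assumed trypsin rule
            sites.append(i + 1)
    sites.append(len(protein))
    peptides = set()
    for i in range(len(sites) - 1):
        # Joining k+1 adjacent fragments corresponds to k missed cleavages.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(sites))):
            peptide = protein[sites[i]:sites[j]]
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


def not_in_references(peptides, reference_proteins):
    """Keep peptides that occur in none of the reference protein sequences."""
    return {p for p in peptides if not any(p in ref for ref in reference_proteins)}
```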

--length-distrib-file

File with distribution of proteotypic peptides lengths. Default: None

--nonsyn-snp-fraction-distrib-file

File with the distribution of the non-synonymous SNP fraction (RnsSNP) in coding regions. The distribution of the fraction of known (dbSNP-annotated) SNPs is also written to this file. These distributions are useful for adjusting coverage or Phred quality thresholds and evaluating FDR. Default: None

--ref-alt-splice-table

HTML with proteotypic peptides specific for various isoforms of reference (or alternatively spliced) proteins. Default: None

--modify-uniprot-id-in-fasta

Switches whether to modify Uniprot IDs in Proteins.Alt.fa and Proteins.Ref.fa. If you turn this option on, the IDs look like this: sp|Q96NU1-s|SAM11_HUMAN (SAP-containing protein) or sp|Q96NU1-ns|SAM11_HUMAN (reference protein). This option is useful when you load FASTA protein databases into MaxQuant or other proteomics software. Default: yes
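The ID modification can be reproduced with a few lines of Python. This is a sketch of the naming scheme shown above, not PPLine's code; the function name is hypothetical, and headers without the usual three `|`-separated fields are returned unchanged.

```python
def tag_uniprot_header(header, has_sap):
    """Append -s (SAP-containing) or -ns (reference) to the accession field."""
    parts = header.split("|")
    if len(parts) < 3:
        return header  # assumption: leave non-UniProt-style headers untouched
    parts[1] += "-s" if has_sap else "-ns"
    return "|".join(parts)
```

For example, `tag_uniprot_header("sp|Q96NU1|SAM11_HUMAN", True)` gives `sp|Q96NU1-s|SAM11_HUMAN`.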

--add-full-indel-column

Switches whether to add detailed indel info columns to the SAP table. These columns contain the full sequences of alt and ref alleles, as well as the alternative open reading frame (ARF) length, which is useful for discovering novel long functional ARFs arising from either indels or mRNA editing.

Other info, if you don't want to use the preinstalled software

If you don't want to use the pre-installed software (located in ./Useful_stuff/), you should download and install the following prerequisites:

• samtools and bcftools (http://www.htslib.org/download/ )

• bowtie2 (http://sourceforge.net/projects/bowtie-bio/files/bowtie2/ ) and/or bowtie (http://sourceforge.net/projects/bowtie-bio/files/bowtie/ )

• Tophat (https://ccb.jhu.edu/software/tophat/tutorial.shtml )

• GATK (https://www.broadinstitute.org/gatk/download/ )

• picard-tools (http://broadinstitute.github.io/picard/ )

• Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic )

• Annovar (http://annovar.openbioinformatics.org/en/latest/user-guide/download/ )

• Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/install/ )

• HTSeq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)

• STAR (https://github.com/alexdobin/STAR)

Once you have finished, edit config.txt to match your setup. If you downloaded PPLine with the preinstalled software, you only need to specify the genome FASTA/GTF, Uniprot and dbSNP vcf paths.

ref_GTF = path to genome model file [like GRCh38.80.gtf]

ref_genome = path to genome sequence file [like GRCh38.dna.toplevel.fa]

uniprot_path = path to UniProtKB proteome file; used to reveal UniProtKB identifier [like uniprot.fa]

dbSNP = path to dbSNP vcf file [like dbSNP.all.hg19.20150102.vcf]

dbSNP_reordered = path to dbSNP vcf in karyotypic order, which is required by GATK. PPLine will automatically create this file and add this entry to config.txt, but this process takes a lot of memory (up to 40 Gb) and time (1 hour).

samtools_path = path to samtools binary [/usr/bin/samtools]. For unknown reasons, newer versions of samtools behave unexpectedly during SNP calling with PPLine.

bcftools_path = path to bcftools binary [/usr/bin/bcftools]

tophat_path = path to tophat executable [/usr/bin/tophat]

STAR_path = path to STAR aligner

bed_to_juncs_path = path to the splice junction list converter. It is usually provided together with Tophat [/usr/bin/bed_to_juncs]

GATK_path = path to GATK .jar-file [like this: /home/user/GATK/GenomeAnalysisTK.jar]

Annovar_path = path to Annovar directory containing annotate_variation.pl and convert2annovar.pl Perl scripts.

picard-tools_path = path to picard-tools binary [/usr/bin/picard-tools]

trimmomatic_path = path to Trimmomatic .jar-file [like /home/user/Trimmomatic-0.32/trimmomatic-0.32.jar]

adapters_seq_path = path to the file listing sequencing adapters to trim off. It is usually provided in the Trimmomatic archive [like /home/user/Trimmomatic-0.32/adapters/TruSeq12.fas]

bowtie2_path = path to bowtie2 executable [/usr/bin/bowtie2]

bowtie2-build_path = path to bowtie2-build executable [/usr/bin/bowtie2-build]

bowtie_path = path to bowtie executable [/usr/bin/bowtie]

bowtie-build_path = path to bowtie-build executable [/usr/bin/bowtie-build]

cufflinks_path = path to Cufflinks binary [/usr/bin/cufflinks]

cuffcompare_path = path to cuffcompare bin [/usr/bin/cuffcompare]

htseq-count_path = path to HTSeq-count

fastq-dump_path = path to NCBI SRA toolkit fastq-dump
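A quick sanity check of config.txt before a long run might look like the sketch below. It assumes the file consists of plain `key = value` lines as listed above; the function names and the treatment of `#` lines as comments are assumptions, not documented PPLine behavior.

```python
# Entries that must be set even with the preinstalled software, per the text above.
REQUIRED_KEYS = ("ref_GTF", "ref_genome", "uniprot_path", "dbSNP")


def parse_config(text):
    """Parse 'key = value' lines into a dict, skipping blank and '#' lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # assumption: '#' marks a comment
            continue
        key, sep, value = line.partition("=")
        if sep:
            config[key.strip()] = value.strip()
    return config


def missing_keys(config):
    """Return required entries that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

It can be applied to the file contents, for example `missing_keys(parse_config(open("config.txt").read()))`.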

Source: README.md, updated 2016-06-12