SampleSpecificDBGenerator Code

Generates custom UniProt-XML databases from RNA-Seq data

Brought to you by: acesnik

File	Date	Author	Commit
tests	2015-11-25	acesnik	[87e573] Initial commit
BedEntry.py	2015-11-25	acesnik	[87e573] Initial commit
README.txt	2015-11-30	acesnik	[3117ef] Added system requirements to README
license.txt	2015-11-30	acesnik	[f4455f] Added README and license information.
novelsplices.py	2015-11-26	acesnik	[ecb2f4] Accepts GTF files with headers, and annotates u...
refparse.py	2015-11-26	acesnik	[ecb2f4] Accepts GTF files with headers, and annotates u...
samplespecificdbgenerator.py	2015-11-30	acesnik	[211d8a] Updated usage. Minimum length filter now filter...
variantcalls.py	2015-12-17	acesnik	[14ed05] Handling of tabular peptide fasta headers

Read Me

SampleSpecificDBGenerator translates the results of RNA sequencing (RNA-Seq) data analysis into protein database entries and appends them to a UniProt-XML file. The output XML is condensed: much of the extraneous author information found in UniProt-XML files is removed, which gives faster search times in Morpheus.
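
As a rough illustration of what "condensed" could mean in practice, the sketch below drops the reference elements (which hold citations and author lists in UniProt-XML) from each entry, using only the standard library. Which elements the program itself removes is not stated in this README, so the condense function, its drop_tags default, and the sample accession P00000 are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

UNIPROT_NS = "http://uniprot.org/uniprot"  # namespace used by UniProt-XML files


def condense(xml_text, drop_tags=("reference",)):
    """Parse UniProt-XML text and remove the listed child elements of each entry.

    ASSUMPTION: dropping <reference> is a guess at what "condensed" means;
    the real program may remove a different set of elements.
    """
    root = ET.fromstring(xml_text)
    qualified = set("{%s}%s" % (UNIPROT_NS, tag) for tag in drop_tags)
    for entry in root.iter("{%s}entry" % UNIPROT_NS):
        for child in list(entry):  # iterate over a copy because we remove items
            if child.tag in qualified:
                entry.remove(child)
    return root


# Made-up, minimal entry for demonstration only.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00000</accession>
    <reference key="1">
      <citation type="journal article">
        <authorList><person name="Doe J."/></authorList>
      </citation>
    </reference>
    <sequence>MKTAYIAKQR</sequence>
  </entry>
</uniprot>"""

condensed = condense(SAMPLE)
entry = condensed.find("{%s}entry" % UNIPROT_NS)
# Print the tag names that remain in the entry after condensing.
print([child.tag.split("}")[1] for child in entry])
```

The accession and sequence survive while the citation block is gone, which is the kind of size reduction the paragraph above describes.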

Usage: samplespecificdbgenerator.py [options]

Options:
  -h, --help            show this help message and exit
  -x REFERENCE_XML, --reference_xml=REFERENCE_XML
                        Reference protein UniProt-XML file. Sequence variant
                        peptide entries are appended to this database to
                        generate the output UniProt-XML protein database.
  -p PROTEIN_FASTA, --protein_fasta=PROTEIN_FASTA
                        Reference protein FASTA file. Used to generate SAV
                        peptide entries. If no UniProt-XML is specified, SAV
                        and NSJ entries will be appended to this database to
                        generate an output database. By default, this output
                        will be a UniProt-XML protein database without PTM
                        annotations. If --output_fasta is selected, the output
                        will be a protein FASTA.
  -g GENE_MODEL, --gene_model=GENE_MODEL
                        GTF gene model file. Used to annotate NSJ peptide
                        entries.
  -v SNPEFF_VCF, --snpeff_vcf=SNPEFF_VCF
                        SnpEff VCF file with HGVS annotations (else read from
                        stdin).
  -b SPLICE_BED, --splice_bed=SPLICE_BED
                        BED file (tophat junctions.bed) with sequence column
                        added.
  -o OUTPUT, --output=OUTPUT
                        Output file path. Outputs UniProt-XML format unless
                        --output_fasta is selected.
  -z, --output_fasta    Output a FASTA-format database. Give the output file
                        path after the --output flag.
  -l LEADING_AA_NUM, --leading_aa_num=LEADING_AA_NUM
                        Leading number of AAs to output for SAV peptides.
                        Default: 33.
  -t TRAILING_AA_NUM, --trailing_aa_num=TRAILING_AA_NUM
                        Trailing number of AAs to output for SAV peptides.
                        Default: 33.
  -D NSJ_DEPTH_CUTOFF, --nsj_depth_cutoff=NSJ_DEPTH_CUTOFF
                        Keep only NSJs found at a depth above this value (BED
                        score field). Default: 0.
  -E SNV_DEPTH_CUTOFF, --snv_depth_cutoff=SNV_DEPTH_CUTOFF
                        Keep only SNVs found at a depth above this value (DP=#
                        field). Default: 0.
  -M MINIMUM_LENGTH, --minimum_length=MINIMUM_LENGTH
                        Keep only sequence variant peptides of length greater
                        than or equal to this value. Default: 0.
  -Q BED_SCORE_NAME, --bed_score_name=BED_SCORE_NAME
                        Include in the NSJ ID line score_name:score. Default:
                        "depth".
  -R REFERENCE, --reference=REFERENCE
                        Genome Reference Name for NSJ ID location.
                        Automatically pulled from genome_build header in GTF
                        if present.
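
To make the leading/trailing, depth cutoff, and minimum length options concrete, here is a minimal sketch of how such a window and such filters might be computed. It is not taken from the program's source: the function names (sav_peptide, vcf_depth, passes_cutoffs) are hypothetical, the clipping at the protein termini is an assumption, and the strict "above" comparison for depth simply follows the wording of the help text.

```python
import re


def sav_peptide(protein_seq, position, alt_aa, leading=33, trailing=33):
    """Return the variant residue flanked by reference residues.

    position is 1-based. ASSUMPTION: the window is simply clipped at the
    protein termini; the real tool may handle the edges differently.
    """
    idx = position - 1                 # convert to a 0-based index
    start = max(0, idx - leading)      # clip at the N-terminus
    left = protein_seq[start:idx]
    right = protein_seq[idx + 1:idx + 1 + trailing]  # slicing clips at the C-terminus
    return left + alt_aa + right


def vcf_depth(info_field):
    """Read the integer after DP= in a VCF INFO string; None if absent."""
    match = re.search(r"(?:^|;)DP=(\d+)", info_field)
    return int(match.group(1)) if match else None


def passes_cutoffs(peptide, depth, depth_cutoff=0, minimum_length=0):
    """Keep entries with depth above the cutoff and length >= the minimum.

    depth must be an integer (for an NSJ it would come from the BED score
    column, for an SNV from vcf_depth).
    """
    return depth > depth_cutoff and len(peptide) >= minimum_length


# Toy example with a made-up 10-residue protein and a Tyr5His substitution.
peptide = sav_peptide("MKTAYIAKQR", 5, "H", leading=2, trailing=3)
depth = vcf_depth("AC=1;DP=17;EFF=x")
print(peptide, depth, passes_cutoffs(peptide, depth, depth_cutoff=10, minimum_length=6))
```

With the default window of 33 residues on each side, a substitution in the middle of a long protein would give a 67-residue entry under these assumptions.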

SampleSpecificDBGenerator is based on the following papers:
1. Sheynkman, et al. "Discovery and Mass Spectrometric Analysis of Novel
Splice-Junction Peptides Using RNA-Seq." Mol Cell Proteomics 2013, 12,
2341-2353.
2. Sheynkman, et al. "Large-scale mass spectrometric detection of variant
peptides resulting from nonsynonymous nucleotide differences." J Proteome
Research 2014, 13, 228-240.
3. Sheynkman, et al. "Using Galaxy-P to leverage RNA-Seq for the discovery
of novel protein variations." BMC Genomics 2014, 15, 9.
4. Cesnik, et al. "Human Proteomic Variation Revealed by Combining RNA-Seq
Proteogenomics and Global Post-Translational Modification (G-PTM) Search
Strategy." In review.

Author information: Anthony Cesnik, UW-Madison

System requirements:
- 8 GB of RAM is recommended
- Python v2.7.10
  See https://www.python.org/downloads/ for installation instructions.
  The installer includes the "pip" package manager.
- Biopython Python package
  Install using the command: pip install biopython
  Or see http://biopython.org/wiki/Download for installation instructions.
- lxml Python package
  Install using the command: pip install lxml
  Or see http://lxml.de for installation instructions.
- If you encounter errors installing either package, we recommend
  trying an alternate package manager, such as Canopy, which can be found
  here: https://www.enthought.com/products/canopy/.
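
One quick way to confirm that both packages are installed is to try importing them. The helper below is a small sketch, not part of the program; missing_packages is a hypothetical name. Note that Biopython is imported as Bio, not biopython.

```python
import importlib

# Import names, not pip names: Biopython installs the package "Bio".
REQUIRED = ["Bio", "lxml"]


def missing_packages(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(REQUIRED)
    if absent:
        print("Missing packages: " + ", ".join(absent))
    else:
        print("Biopython and lxml are importable.")
```

The check uses a plain try/except around the import so that it runs on Python 2.7 as well as on newer interpreters.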

Version updates:
v0.0.2  November 26, 2015  Initial commit
v0.0.3  November 30, 2015  Updated usage information. Allows minimum length
                           cutoff to filter both SAV and NSJ peptide entries.
