Menu

Tree [14ed05] master /
 History

HTTPS access


File Date Author Commit
 tests 2015-11-25 acesnik acesnik [87e573] Initial commit
 BedEntry.py 2015-11-25 acesnik acesnik [87e573] Initial commit
 README.txt 2015-11-30 acesnik acesnik [3117ef] Added system requirements to README
 license.txt 2015-11-30 acesnik acesnik [f4455f] Added README and license information.
 novelsplices.py 2015-11-26 acesnik acesnik [ecb2f4] Accepts GTF files with headers, and annotates u...
 refparse.py 2015-11-26 acesnik acesnik [ecb2f4] Accepts GTF files with headers, and annotates u...
 samplespecificdbgenerator.py 2015-11-30 acesnik acesnik [211d8a] Updated usage. Minimum length filter now filter...
 variantcalls.py 2015-12-17 acesnik acesnik [14ed05] Handling of tabular peptide fasta headers

Read Me

SampleSpecificDBGenerator is a program that takes RNA sequencing data analysis results and translates them into protein database entries that are appended to a UniProt-XML file. This XML is condensed, meaning that much of the extraneous author information in UniProt-XMLs is removed for faster search times in Morpheus.

Usage: samplespecificdbgenerator.py [options]

Options:
  -h, --help            show this help message and exit
  -x REFERENCE_XML, --reference_xml=REFERENCE_XML
                        Reference protein UniProt-XML file. Sequence variant
                        peptide entries are appended to this database to
                        generate the ouptut UniProt-XML protein database.
  -p PROTEIN_FASTA, --protein_fasta=PROTEIN_FASTA
                        Reference protein FASTA file. Used to generate SAV
                        peptide entries. If no UniProt-XML is specified, SAV
                        and NSJ entries will be appended to this database to
                        generate an output database. By default, this output
                        will be a UniProt-XML protein database without PTM
                        annotations. If --output-fasta is selected, the output
                        will be a protein FASTA.
  -g GENE_MODEL, --gene_model=GENE_MODEL
                        GTF gene model file. Used to annotate NSJ peptide
                        entries.
  -v SNPEFF_VCF, --snpeff_vcf=SNPEFF_VCF
                        SnpEff VCF file with HGVS annotations (else read from
                        stdin).
  -b SPLICE_BED, --splice_bed=SPLICE_BED
                        BED file (tophat junctions.bed) with sequence column
                        added.
  -o OUTPUT, --output=OUTPUT
                        Output file path. Outputs UniProt-XML format unless
                        --output-fasta is selected.
  -z, --output_fasta    Output a FASTA-format database. Place path for output
                        file after the --output flag.
  -l LEADING_AA_NUM, --leading_aa_num=LEADING_AA_NUM
                        Leading number of AAs to output for SAV peptides.
                        Default: 33.
  -t TRAILING_AA_NUM, --trailing_aa_num=TRAILING_AA_NUM
                        Trailing number of AAs to output for SAV peptides.
                        Default: 33.
  -D NSJ_DEPTH_CUTOFF, --nsj_depth_cutoff=NSJ_DEPTH_CUTOFF
                        Keep only NSJs found with above this depth (BED score
                        field). Default: 0.
  -E SNV_DEPTH_CUTOFF, --snv_depth_cutoff=SNV_DEPTH_CUTOFF
                        Keep only SNVs found with above this depth (DP=#
                        field). Default: 0.
  -M MINIMUM_LENGTH, --minimum_length=MINIMUM_LENGTH
                        Keep only sequence variant peptides with greater than
                        or equal to this length. Default: 0.
  -Q BED_SCORE_NAME, --bed_score_name=BED_SCORE_NAME
                        Include in the NSJ ID line score_name:score. Default:
                        "depth."
  -R REFERENCE, --reference=REFERENCE
                        Genome Reference Name for NSJ ID location.
                        Automatically pulled from genome_build header in GTF
                        if present.

SampleSpecificDBGenerator is based on the following papers:
1. Sheynkman, et al. "Discovery and Mass Spectrometric Analysis of Novel 
Splice-Junction Peptides Using RNA-Seq." Mol Cell Proteomics 2013, 12, 
2341-2353.
2. Sheynkman, et al. "Large-scale mass spectrometric detection of variant
peptides resulting from nonsynonymous nucleotide differences." J Proteome 
Research 2014, 13, 228-240.
3. Sheynkman, et al. "Using Galaxy-P to leverage RNA-Seq for the discovery 
of novel protein variations." BMC Genomics 2014, 15, 9.
4. Cesnik, et al. "Human Proteomic Variation Revealed by Combining RNA-Seq 
Proteogenomics and Global Post-Translational Modification (G-PTM) Search 
Strategy." In review.

Author information: Anthony Cesnik, UW-Madison

System requirements:
- 8 GB of RAM is recommended
- python v2.7.10
See https://www.python.org/downloads/ for installation instructions.
This includes the “pip” package manager.
- Biopython python package
Install using the command: pip install biopython
Or see http://biopython.org/wiki/Download for installation instructions.
- Lxml python package
Install using the command: pip install lxml 
Or see http://lxml.de for installation instructions.
- If you encounter errors installing either package, we recommend
trying an alternate package manager, such as Canopy, which can be found
here: https://www.enthought.com/products/canopy/.


Version updates:
v0.0.2 November 26, 2015 Initial commit
v0.0.3 November 30, 2015 Updated usage information. Allows minimum length 
cutoff to filter both SAV and NSJ peptide entries.
Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.