====================================================================================================
CDSMAPPER
====================================================================================================
Code written by:
	Dr James K. Hane
	Post-Doctoral Fellow
	CSIRO Plant Industry (Floreat)
	Phone: 08 9333 6107
	Email: James.Hane@csiro.au
====================================================================================================
Version History:
----------------------------------------------------------------------------------------------------
version 0.6 - 11th November 2011
	tool		version
	---------------------------
	testpeptideframes		5 --> 6
	cdsmapper 			5.4 --> 5.6
	summarise_frametest		5.1
	compare2gff			5
	orf2gff				1 --> 1

version 0.5 - 11th October 2010

version 0.4 and prior - non-public release
====================================================================================================
Citation:
----------------------------------------------------------------------------------------------------
A publication describing these tools is in preparation.  In the meantime, cite:

	Bringans, S., Hane, J.K., Casey, T., Tan, Kar-Chun, Lipscombe, R., Solomon, P.S. and Oliver, R.P. (2009)
	Deep proteogenomics; high throughput gene validation by multidimensional liquid chromatography and mass spectrometry of proteins from the fungal wheat pathogen Stagonospora nodorum. 
	BMC Bioinformatics, 10 (1). p. 301.

====================================================================================================
Description:
----------------------------------------------------------------------------------------------------

Proteogenomics is the "direct-to-genome mapping" of peptide data, for the validation of gene annotations and gene discovery.  Whole genome sequencing can facilitate the characterisation or prediction of the entire protein content of an organism.  Correct representation of the proteome relies on the accurate annotation of protein-coding exons (CDSs) within gene models.  Gene model prediction is prone to errors, both in the identification of genes and in exon structure.  Proteogenomics can be used to provide supporting data which can increase the accuracy of gene models or identify new genes.

Here we describe the various bioinformatic techniques used to generate whole-genome amino-acid translation databases, map proteogenomic peptide data to a genome assembly and compare genome-mapped peptides to existing gene annotations.

1. Introduction 
Proteogenomics is similar in principle to proteomics, but while conventional proteomics matches mass spectra to a database of translated gene models, proteogenomics matches mass spectra to a database of translated open-reading frames derived from the sequenced genome.  This allows for a direct-to-genome mapping of peptides.  The major advantage of this is that proteogenomically-mapped peptide data can support the creation of gene models which may not have been detected by other methods.  Additionally, mapping directly to the genome can detect errors in the prediction of exon structure or translation frame-shifts in existing gene models which may have produced incorrect amino-acid translations.  Finally, proteogenomics can be coupled with protein-purification methods to target genes which code for proteins with specific physical properties, such as molecular weight, acidity, hydrophobicity and specific binding affinities.  This can be particularly useful for identifying gene models which lack homologs in related species or have abnormal G:C contents, exon lengths or structures, and are thus not reliably identified by gene-prediction algorithms.

2. Materials
2.1 Sequence data
1. Sequenced whole genome assembly of the organism-of-interest in FASTA format
3. Coordinates of predicted or manually-annotated gene model coding-sequence (CDS) features in GFF3 format
2.2 Software:
1. getorf (available from http://emboss.sourceforge.net/)
2. cdsmapper and miscellaneous Perl scripts (orf2gff, compare2gff, testpeptideframes and summarise_frametest) available for download from https://sourceforge.net/projects/cdsmapper/
3. Perl (www.perl.org)

3. Methods
3.1. 6-frame translation of the nucleotide sequence of the genome-of-interest into amino acid sequence.
1. Run getorf on the genome sequence of interest:
	getorf -sequence genome.fasta -outseq orf.fasta -minsize 30 -table 0 -find 0
	"genome.fasta" is the sequence of the genome-of-interest in FASTA format
	"orf.fasta" is the file containing the translated open-reading frames (ORFs)
	"-minsize 30" imposes a minimum threshold on the length of each open-reading frame of 30 nucleotides (10 amino acids)
	"-table" refers to the genetic code by which nucleotides are translated into amino acids.  "0" refers to the standard code.
	refer to http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for more information on alternate genetic codes.
	"-find 0" instructs getorf to report ORFs that lie between stop codons in the input sequence.
2. "orf.fasta" becomes the database against which peptide mass spectra are matched in software such as MASCOT.
3. Generate a GFF3 file ("orf.gff3") containing coordinates of ORFs from "orf.fasta" using orf2gff:
	perl orf2gff orf.fasta orf.gff3
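
As an illustration of what "-find 0" reports, the following Python sketch lists the regions lying between stop codons in a toy sequence.  It is not getorf's implementation: the function name "stop_to_stop_orfs" and the example sequence are hypothetical, only the three forward frames are scanned (getorf also reports the reverse strand), the standard genetic code is assumed, and sequence ends are simply treated as ORF boundaries.

```python
# Illustrative sketch only; NOT getorf's implementation.
# Assumptions: standard genetic code, forward strand only,
# sequence start/end treated as ORF boundaries.

STOPS = {"TAA", "TAG", "TGA"}


def stop_to_stop_orfs(seq, min_nt=30):
    """Return (start, end, frame) tuples for regions between stop codons.

    Coordinates are 1-based and inclusive; the stop codon itself is excluded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = frame  # 0-based start of the current open region
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - orf_start >= min_nt:
                    orfs.append((orf_start + 1, i, frame + 1))
                orf_start = i + 3
        # Region after the last stop codon, trimmed to whole codons.
        last_full = frame + ((len(seq) - frame) // 3) * 3
        if last_full - orf_start >= min_nt:
            orfs.append((orf_start + 1, last_full, frame + 1))
    return orfs


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 10 + "TAA" + "GCT" * 3
    for start, end, frame in stop_to_stop_orfs(toy):
        print(f"frame {frame}: {start}..{end}")
```

With the default "min_nt=30" the sketch applies the same 30-nucleotide (10 amino acid) threshold as "-minsize 30" above.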

3.2. Conversion of MASCOT protein reports to appropriate data format for input into CDSmapper.
1. Export MASCOT protein report to tabular format (CSV or XML), including the optional "peptide start" and "peptide stop" coordinate information.
2. Rearrange tabular MASCOT data to conform to the following format for cdsmapper input:
 	Tab-delimited text with column headings:
	Column   1: Name/ID of peptide (must be unique).
	Column   2: GFF feature type (recommended: "peptide").
	Column   3: DNA ("D") or Protein ("P").  For the purpose of mapping proteogenomic peptides use "P".
	Column   4: ORF ID (must match exactly to Parent attribute in ORF GFF3 file generated in 3.1.3).
	Column   5: Start coordinate of peptide on ORF (MASCOT protein report: "peptide start").
	Column   6: Stop coordinate of peptide on ORF (MASCOT protein report: "peptide stop").
	Column   7: Peptide orientation on ORF (must be "+").
	Columns 8+: Additional data (requires unique column heading).  Data in columns 8 and greater are converted to elements of the attributes column in the GFF3 output.
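
The column layout above can be produced with a few lines of Python once the MASCOT export has been parsed.  This is a minimal sketch under stated assumptions: the heading names in "HEADER", the dictionary keys ("orf_id", "pep_start", "pep_stop", "score"), the generated peptide IDs and the example ORF ID are all hypothetical, and parsing of the MASCOT CSV or XML export itself is not shown.

```python
# Illustrative sketch only. Heading names, dictionary keys and IDs are
# hypothetical; only the column order follows the format described above.
import csv
import io

HEADER = ["name", "type", "seqtype", "orf_id", "start", "stop", "strand", "score"]


def to_cdsmapper_rows(hits):
    """Turn parsed peptide hits (a list of dicts) into cdsmapper input rows."""
    rows = []
    for n, hit in enumerate(hits, start=1):
        rows.append([
            f"pep{n:05d}",     # column 1: unique peptide name/ID
            "peptide",         # column 2: GFF feature type
            "P",               # column 3: protein coordinates
            hit["orf_id"],     # column 4: must match the ORF GFF3 Parent
            hit["pep_start"],  # column 5: peptide start on the ORF
            hit["pep_stop"],   # column 6: peptide stop on the ORF
            "+",               # column 7: orientation on the ORF
            hit["score"],      # column 8+: extra data, unique heading
        ])
    return rows


def write_table(rows, handle):
    """Write the heading line and rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


if __name__ == "__main__":
    example = [{"orf_id": "orf_0001", "pep_start": 5, "pep_stop": 17, "score": 61.2}]
    buffer = io.StringIO()
    write_table(to_cdsmapper_rows(example), buffer)
    print(buffer.getvalue(), end="")
```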

3.3. Mapping of peptides to genome sequence using CDSmapper.
1. Run cdsmapper:
	perl cdsmapper mascotreport.table orf.gff3 peptides.gff3
	"mascotreport.table" is the file created in 3.2.2
	"orf.gff3" contains the coordinates of 6-frame translated ORFs generated in 3.1.3 (Parent attribute required).
	"peptides.gff3" contains the coordinates of proteogenomic peptides converted into genomic coordinates
	
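The coordinate conversion in this step can be pictured with the arithmetic below.  This is a sketch of the general idea, not cdsmapper's actual code: the function name "peptide_to_genome" is hypothetical, and it assumes 1-based inclusive coordinates, peptide positions counted in amino acids from the first residue of the ORF, and an ORF that is a single ungapped genomic interval.

```python
# Illustrative sketch only; NOT cdsmapper's implementation.
# Assumptions: 1-based inclusive coordinates, peptide positions in amino
# acids from the first ORF residue, ORF is one ungapped genomic interval.


def peptide_to_genome(orf_start, orf_end, orf_strand, pep_start, pep_stop):
    """Convert peptide-on-ORF coordinates to (start, end) on the genome."""
    if pep_start < 1 or pep_stop < pep_start:
        raise ValueError("peptide coordinates must satisfy 1 <= start <= stop")
    if orf_strand == "+":
        g_start = orf_start + 3 * (pep_start - 1)
        g_end = orf_start + 3 * pep_stop - 1
    elif orf_strand == "-":
        # On the reverse strand the first residue sits at the ORF's end.
        g_end = orf_end - 3 * (pep_start - 1)
        g_start = orf_end - 3 * pep_stop + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if g_start < orf_start or g_end > orf_end:
        raise ValueError("peptide extends beyond the ORF")
    return g_start, g_end


if __name__ == "__main__":
    print(peptide_to_genome(101, 160, "+", 2, 3))
    print(peptide_to_genome(101, 160, "-", 1, 10))
```

Each amino acid spans three nucleotides, so a peptide covering residues 2 to 3 of a forward-strand ORF starting at position 101 occupies genome positions 104 to 109.
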
3.4. Comparison of proteogenomic peptide coordinates to those of annotated genes to verify CDS exon boundaries.
1. The overlap of genomic coordinates between mapped peptides and other features (e.g. genes, CDS exons, UTRs) can be used to identify:
	a) features that are supported by the peptide data
	b) features that have annotated boundaries that conflict with the supporting peptide data
	c) features that lie near mapped peptides
2. Run compare2gff:
	perl compare2gff peptides.gff3 cds.gff3 output.table neighbourdistance
	"cds.gff3" contains the CDS exon coordinates in GFF3 format (Parent attribute required).
	"peptides.gff3" contains the coordinates of genome-mapped peptides generated in 3.3.1 (Parent attribute required).
	"neighbourdistance" is an optional parameter giving the distance (in bp) within which two non-overlapping features are reported as "nearby".  This can be useful for detecting peptides which support extending CDS exon boundaries into flanking annotated UTR regions.
	"output.table" is an overall summary of peptides and CDS features which overlap or, if "neighbourdistance" was defined, lie within the threshold distance of each other.
	compare2gff also generates additional outputs, prefixed "output.table" and suffixed:
	".hits1" and ".hits2", table of features which have overlapping or neighbouring features (1 and 2 correspond to the first and second GFF3 input files respectively)
	".nonhits1" and ".nonhits2", table of features which have no overlapping or neighbouring features (1 and 2 correspond to the first and second GFF3 input files respectively)

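The distinction between overlapping and "nearby" features can be sketched as follows.  This is not compare2gff's implementation: the function name "relation" is hypothetical, coordinates are assumed to be 1-based and inclusive on the same sequence, and the exact cutoff convention (here, "nearby" when the distance between the closest ends is at most the neighbour distance) is an assumption.

```python
# Illustrative sketch only; NOT compare2gff's implementation.
# Assumptions: 1-based inclusive coordinates on the same sequence; the
# "nearby" cutoff convention is a guess, not documented behaviour.


def relation(a_start, a_end, b_start, b_end, neighbour_distance=0):
    """Classify two intervals as 'overlap', 'nearby' or 'none'."""
    if a_start <= b_end and b_start <= a_end:
        return "overlap"
    # Distance between the closest ends of the two intervals.
    if a_end < b_start:
        distance = b_start - a_end
    else:
        distance = a_start - b_end
    if distance <= neighbour_distance:
        return "nearby"
    return "none"


if __name__ == "__main__":
    # A peptide 30 bp upstream of a CDS exon, tested with a 50 bp threshold.
    print(relation(100, 200, 230, 300, neighbour_distance=50))
```
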
3.5. Comparison of proteogenomic peptide coordinates to those of annotated genes to verify CDS frame and correct for frameshift errors.
1. Run testpeptideframes:
	perl testpeptideframes peptides.gff3 cds.gff3 frametest.table
	"peptides.gff3" contains the coordinates of genome-mapped peptides generated in 3.3.1 (Parent attribute required).
	"cds.gff3" contains the coordinates of CDS exons in GFF3 format (Parent attribute required).
	"frametest.table" is a tabular output comparing the reading frames of overlapping peptides and CDS features.
2. Run summarise_frametest:
	perl summarise_frametest frametest.table > summary.txt
	"frametest.table" is the output generated in 3.5.1.
	"summary.txt" is a short summary of the number of mapped peptides that were in and out of frame with CDS features and the number of genes (GFF3: CDS Parents) which were in and out of frame with mapped peptides.

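One way to picture the frame comparison is the modulo-3 check below.  This is a sketch of the general principle rather than the logic of testpeptideframes: the function name "in_frame" is hypothetical, the peptide and CDS feature are assumed to overlap already, coordinates are 1-based and inclusive, and the CDS phase is taken from column 8 of the GFF3 file (the number of bases to skip to reach the first complete codon).

```python
# Illustrative sketch only; NOT the logic of testpeptideframes.
# Assumptions: the features already overlap, coordinates are 1-based
# inclusive, and cds_phase is the GFF3 phase value (0, 1 or 2).


def in_frame(pep_start, pep_end, pep_strand,
             cds_start, cds_end, cds_strand, cds_phase=0):
    """Return True if the peptide shares strand and reading frame with the CDS."""
    if pep_strand != cds_strand:
        return False
    if cds_strand == "+":
        # First complete codon of the CDS starts at cds_start + phase.
        return (pep_start - (cds_start + cds_phase)) % 3 == 0
    if cds_strand == "-":
        # On the reverse strand, codons are counted back from cds_end.
        return ((cds_end - cds_phase) - pep_end) % 3 == 0
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(in_frame(104, 130, "+", 101, 400, "+"))
    print(in_frame(105, 131, "+", 101, 400, "+"))
```

A peptide that overlaps a CDS exon on the same strand but fails this check is a candidate for a frameshift or an incorrectly placed exon boundary.
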
3.6. Manual curation of gene annotations based on proteogenomic peptide conflicts and other supporting data.
1. The data in "output.table" (3.4.2) provides a list of genes which have potential errors in the annotation of CDS exon boundaries.
2. The data in "output.table.nonhits1" (3.4.2) provides a list of peptides which did not overlap any existing gene models and may support the creation of new gene annotations.
3. The data in "frametest.table" (3.5.1) provides a list of genes which have potential frameshift errors in their sequence and/or CDS annotations.
4. Load "peptides.gff3" and "cds.gff3" into a genome browser capable of facilitating manual annotation of genes (recommended: Argo, Apollo, GenomeView).  Reannotate gene models based on supporting peptide data.  Additional supporting data, such as cDNA or protein (tblastn) alignments from related species, can be combined with peptide alignments in a genome browser display to improve the reliability of manual gene curation.

4. Notes
1. Getorf user manual:  (http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/getorf.html)
2. Non-ACGTN characters (such as "-" characters which are commonly introduced into genome assemblies scaffolded by the ABySS assembler) may be ignored by getorf during the 6-frame genome translation step.  This will cause getorf to report incorrect ORF coordinates relative to the genome input sequence(s).  We recommend replacing non-standard base pair characters with "N".
3. Selection of MudPIT scoring thresholds instead of standard scoring is recommended prior to conversion of MASCOT protein reports.
4. GFF3 format: http://www.sequenceontology.org/gff3.shtml
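
The replacement of non-standard characters recommended in note 2 can be done with a short Python script such as the sketch below.  The function name "mask_nonstandard" and the file names are hypothetical.  Each offending character is replaced one-for-one with "N", so sequence lengths and therefore ORF coordinates are preserved; header lines are left untouched.

```python
# Illustrative sketch only. Replaces every character other than A, C, G, T
# or N (either case) in sequence lines with "N"; FASTA headers are kept as-is.
import re

_NON_STANDARD = re.compile(r"[^ACGTNacgtn]")


def mask_nonstandard(fasta_lines):
    """Yield FASTA lines with non-ACGTN sequence characters replaced by N."""
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield line
        else:
            yield _NON_STANDARD.sub("N", line)


if __name__ == "__main__":
    # Hypothetical file names; adjust to the assembly being processed.
    with open("genome.fasta") as src, open("genome.clean.fasta", "w") as dst:
        for cleaned in mask_nonstandard(src):
            dst.write(cleaned + "\n")
```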

Source: README, updated 2011-11-11