iPiG Wiki

Integrating PSMs into Genome browser visualisations

Status: Beta

Brought to you by: mkuhring

Output Formats

This page provides some information about the output files.

BED

Each line in BED format corresponds to a peptide mapped to the genome.
For a general description please see http://genome.ucsc.edu/FAQ/FAQformat.html#format1

Two BED files are written, one for the annotation-filtered mapping and one for the alternative mapping. Each file contains up to three tracks, one per uniqueness class (unique, non-unique, unmarked). Unmapped peptides are not exported to BED.

Modifications to the format are as follows:

As the format does not provide custom or additional columns, the name column (4) is used to present some peptide information (e.g. "3617a_AELGLPPLAEDSIQVVKSMR_z=3_sh=1_>").
Separated by underscores, it contains the query (with a modifier to make it unique), the peptide sequence, the charge (z), a shared value (sh) indicating the number of mappings of the peptide spectrum match to the genome, and a strand orientation indicator.
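
The name column can be split back into its parts with a few lines of Python. This is only a minimal sketch, not part of iPiG: the function and field names are hypothetical, and it assumes that none of the five parts contains an underscore itself.

```python
from typing import NamedTuple


class PeptideName(NamedTuple):
    """Parts of the BED name column (field names are illustrative)."""
    query: str
    sequence: str
    charge: int
    shared: int
    strand_marker: str


def parse_bed_name(name: str) -> PeptideName:
    # Assumption: exactly five underscore-separated parts,
    # with the charge written as "z=<int>" and shared as "sh=<int>".
    query, sequence, charge, shared, strand_marker = name.split("_")
    if not charge.startswith("z=") or not shared.startswith("sh="):
        raise ValueError(f"unexpected name layout: {name!r}")
    return PeptideName(query, sequence, int(charge[2:]), int(shared[3:]), strand_marker)


info = parse_bed_name("3617a_AELGLPPLAEDSIQVVKSMR_z=3_sh=1_>")
print(info.query, info.sequence, info.charge, info.shared, info.strand_marker)
```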

The score column (5) contains the peptide spectrum match score, so it is not scaled to the range 0 to 1000 as is usual for BED.

For coloring purposes, each track uses the parameter itemRgb="On" and each peptide line provides an RGB color (column 9), which depends on the score and the color settings in the configuration.

A short BED file example is provided here:

track name="darwin_top250_t2 (u,anno)" description="darwin_top250_t2 (unique, annotation mapped)" visibility=full itemRgb=On
chr7    129096330   129096390   3617a_AELGLPPLAEDSIQVVKSMR_z=3_sh=1_>   8.32    +   129096330   129096390   191,191,191 1   60, 0,
chr11   7614437 7618832 3617c_ESLILQVSVLTDQVEAQGEK_z=3_sh=1_>   9.79    +   7614437 7618832 191,191,191 2   18,42,  0,4353,
chr16   72094543    72094603    3618c_SPVGVQPILNEHTFCAGMSK_z=3_sh=1_>   28.79   +   72094543    72094603    255,159,0   1   60, 0,
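
As a rough illustration of how such a line can be read, the sketch below extracts the score and the genomic blocks of one peptide line. It assumes the file is tab-separated and uses the standard BED12 column order (blockCount, blockSizes and blockStarts in columns 10 to 12); the function name is hypothetical and not part of iPiG.

```python
from typing import List, Tuple


def parse_bed_line(line: str) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the PSM score and the (start, end) blocks of one BED line."""
    fields = line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    score = float(fields[4])  # PSM score, not the usual 0-1000 integer
    block_count = int(fields[9])
    # The lists carry a trailing comma, so empty entries are skipped.
    sizes = [int(x) for x in fields[10].split(",") if x.strip()]
    starts = [int(x) for x in fields[11].split(",") if x.strip()]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("block lists do not match blockCount")
    # Block starts are relative to chromStart in BED.
    blocks = [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]
    return score, blocks


# Second peptide line of the example above, rebuilt with tab separators.
example = "\t".join([
    "chr11", "7614437", "7618832", "3617c_ESLILQVSVLTDQVEAQGEK_z=3_sh=1_>",
    "9.79", "+", "7614437", "7618832", "191,191,191", "2", "18,42,", "0,4353,",
])
score, blocks = parse_bed_line(example)
print(score, blocks)  # 9.79 [(7614437, 7614455), (7618790, 7618832)]
```

The two blocks show how a peptide spanning two exons is stored in a single BED line.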

GFF3

In a GFF3 file a peptide is represented as a feature.
For a general description please see http://www.sequenceontology.org/gff3.shtml.

Two GFF3 files are written, one for the annotation-filtered mapping and one for the alternative mapping. Unmapped peptides are not exported to GFF3.

Modifications to the format are as follows:

The SOURCE is "ipig".

The values in the TYPE column are custom names rather than standard feature types. They start with "peptide", followed by a mark for the uniqueness (unique, non-unique, unmarked) and a classification of the score into three groups (depending on the user parameters threshold1 and threshold2), e.g. "peptide_unique_mid". The different feature type names might prompt some genome browsers to create a separate track per type. This can result in different colors for the uniqueness classes and the score ranges (e.g. in Geneious).
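
A TYPE value can be taken apart again as sketched below. The function name is hypothetical, and the sketch assumes that the uniqueness mark is written exactly as listed above (with a hyphen in "non-unique") and that the score groups are named low, mid and high as in the example file.

```python
from typing import Tuple


def parse_feature_type(feature_type: str) -> Tuple[str, str]:
    """Split e.g. "peptide_unique_mid" into (uniqueness, score_class)."""
    prefix, uniqueness, score_class = feature_type.split("_")
    if prefix != "peptide":
        raise ValueError(f"not an iPiG peptide type: {feature_type!r}")
    return uniqueness, score_class


print(parse_feature_type("peptide_unique_mid"))       # ('unique', 'mid')
print(parse_feature_type("peptide_non-unique_high"))  # ('non-unique', 'high')
```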

The SCORE is the same as the peptide spectrum match score.

ATTRIBUTES are:

ID, which is the query (with a modifier to make it unique). The ID is responsible for grouping parts of the same feature that are spread over different lines, so the genome browser should handle those as one element.
Name, which is the peptide sequence. This keeps the sequence available in this format and allows it to be compared with translations (e.g. in Geneious).
z, custom attribute representing the charge.
shared, custom attribute representing the number of positions at which the peptide occurs (including the current one).
mods and modpos, indicating peptide modifications and their positions.

A short GFF3 file example is provided here:

chr11   ipig    peptide_unique_mid  5254203 5254238 32.16   -   0   ID=280e; Name=KHALANAVGAVV; z=3; shared=1; mods="Deamidated (NQ)"; modpos=0.000000100000.0; 
chr3    ipig    peptide_unique_low  58109099    58109134    17.08   +   0   ID=280f; Name=VVASGPGLEHGK; z=3; shared=1; mods=""; modpos=; 
chr11   ipig    peptide_unique_high 5246837 5246872 51.71   -   0   ID=281d; Name=KHALANAVGAVV; z=3; shared=1; mods="Deamidated (NQ)"; modpos=0.000000100000.0;
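
The attributes column of such a line can be turned into a dictionary as sketched below. This is a hedged example with a hypothetical function name; it assumes that quoted values (such as mods) never contain a semicolon.

```python
from typing import Dict


def parse_attributes(column: str) -> Dict[str, str]:
    """Parse "key=value; key=value; ..." into a dictionary of strings."""
    attributes = {}
    for part in column.split(";"):
        part = part.strip()
        if not part:  # skip the empty piece after the trailing semicolon
            continue
        key, _, value = part.partition("=")
        attributes[key] = value.strip('"')  # drop quotes around e.g. mods
    return attributes


attrs = parse_attributes(
    'ID=280e; Name=KHALANAVGAVV; z=3; shared=1; '
    'mods="Deamidated (NQ)"; modpos=0.000000100000.0; '
)
print(attrs["ID"], attrs["Name"], attrs["z"], attrs["shared"], attrs["mods"])
```

All values are returned as strings; empty attributes such as `mods=""` or `modpos=` become empty strings.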

Txt (tab separated)

Two text files are written for the mapped peptides, one for the annotation-filtered mapping and one for the alternative mapping. Unmapped peptides are exported to a separate text file.

The format of the text files is an extension of the format of the input text files for peptide spectrum matches, as described at [Input Formats].

The text file with mapped peptides is extended with columns indicating the positions in the genome: chrom, strand, start_pos, stop_pos and shared. The start_pos and stop_pos columns contain comma-separated position lists, which allows exon-spanning peptides to be represented. The shared column indicates the number of mappings of the peptide spectrum match to the genome.

A short text file example is provided here:

prot_acc    prot_desc   pep_query   pep_isunique    pep_exp_z   pep_score   pep_seq pep_var_mod pep_var_mod_pos chrom   strand  start_pos   stop_pos    shared
112821681   G protein-regulated inducer of neurite outgrowth 1 [Homo sapiens]   9   1   2   7.23    KALGSAR         chr5    -   176024627,  176024648,  1
41281496    mediator complex subunit 24 isoform 1 [Homo sapiens]    9   1   2   6.47    LSCHGK          chr17   -   38191589,38191974,  38191602,38191979,  1
154800453   tastin isoform 1 [Homo sapiens] 10  1   3   11.66   IGILQQLLR           chr12   +   49723930,   49723957,   1
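
The start_pos and stop_pos lists can be paired into one block per exon part, as in the following sketch. The function name is hypothetical, and the pairing by list index is an assumption based on the example rows.

```python
from typing import List, Tuple


def exon_blocks(start_pos: str, stop_pos: str) -> List[Tuple[int, int]]:
    """Pair the comma-separated start and stop lists into (start, stop) blocks."""
    # Both lists end with a trailing comma, so empty entries are skipped.
    starts = [int(x) for x in start_pos.split(",") if x.strip()]
    stops = [int(x) for x in stop_pos.split(",") if x.strip()]
    if len(starts) != len(stops):
        raise ValueError("start_pos and stop_pos differ in length")
    return list(zip(starts, stops))


# Values taken from the second example row (peptide LSCHGK on chr17).
blocks = exon_blocks("38191589,38191974,", "38191602,38191979,")
print(blocks)  # [(38191589, 38191602), (38191974, 38191979)]
```

In the example rows, the differences stop minus start add up to three times the peptide length (here 13 + 5 = 18 for six residues). This is an observation from the examples, not a documented guarantee.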

The text file with unmapped peptides is extended with only one column, which indicates why a peptide spectrum match could not be mapped and provides the references/links that were obtained. The reasons are:

noProt, indicates that no corresponding entry/line was found in the mapping file.
noGene, indicates that no corresponding gene was found in the annotation file, although references were given by the mapping file.
noMatch, indicates that the peptide sequence could not be found in the translation of the linked gene.