
Input Formats

Mathias Kuhring

To map peptide spectrum matches (PSMs) to the genome, iPiG requires a file with the PSMs of interest and several files containing information about gene locations and protein-gene connections, as described below.

Example files are provided in the "examples" folder that comes with the tool (except for mzIdentML and chromosome files). Please note that they are only excerpts that illustrate the formats; they are not suitable for meaningful analyses.

Peptide spectrum matches

The PSMs of interest should be provided in the mzIdentML format (*.mzid), an XML-based standard for peptide and protein identifications from mass spectra. The integrity of the input file is checked against an XML schema (currently: mzIdentML1.0.0.xsd).
For more details about the format and some example files, visit the webpage of the HUPO Proteomics Standards Initiative (http://www.psidev.info/index.php?q=node/403).

iPiG imports every "PeptideEvidence" element from the "SpectrumIdentificationItem" elements and pools it with the necessary information from the referenced peptides ("Peptide_ref" attribute) and proteins ("DBSequence_Ref" attribute), namely the peptide sequences and modifications as well as the protein accessions and descriptions.
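
The pooling described above can be sketched as follows. This is only an illustration, not iPiG's actual implementation. It assumes that each "PeptideEvidence" is a child of a "SpectrumIdentificationItem", that the protein accession is stored in the "accession" attribute of the referenced element, and that the peptide sequence sits in a child element named "peptideSequence" (matched case-insensitively). The function names and the keys of the returned dictionaries are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def peptide_sequence(peptide):
    """Return the text of the peptide's sequence child, or None."""
    if peptide is None:
        return None
    for child in peptide:
        # Assumption: the child is called 'peptideSequence' (any casing).
        if local(child.tag).lower() == "peptidesequence":
            return (child.text or "").strip()
    return None


def collect_psms(root):
    """Return one dict per PeptideEvidence, pooled with referenced data."""
    # Index every element that carries an 'id' so references can be resolved.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    psms = []
    for item in root.iter():
        if local(item.tag) != "SpectrumIdentificationItem":
            continue
        peptide = by_id.get(item.get("Peptide_ref"))
        for child in item:
            if local(child.tag) != "PeptideEvidence":
                continue
            protein = by_id.get(child.get("DBSequence_Ref"))
            psms.append({
                "evidence_id": child.get("id"),
                "peptide_sequence": peptide_sequence(peptide),
                "protein_accession":
                    protein.get("accession") if protein is not None else None,
            })
    return psms
```

For a file on disk, the root element could be obtained with `ET.parse("my_psms.mzid").getroot()` (the file name is a placeholder).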

As an alternative to the mzIdentML format, iPiG can import a simple tab-separated text file. For example, such files can be extracted from Mascot Search Results exported as CSV files by removing all header lines except the column titles and removing those columns which are not included in the example below.

Please note that the names and order of the remaining columns are important, since these files are verified by their header line. In addition, it may sometimes be necessary to add some columns manually, e.g. "pep_isunique".

The data in the text file should look like this:

prot_acc    prot_desc   pep_query   pep_isunique    pep_exp_z   pep_score   pep_seq pep_var_mod pep_var_mod_pos
CPSM_HUMAN  Carbamoyl-phosphate synthase ammonia, mitochondrial OS=Homo sapiens GN=CPS1 PE=1 SV=2   99  1   2   5.36    ASRSFPFVSK
FAS_HUMAN   Fatty acid synthase OS=Homo sapiens GN=FASN PE=1 SV=3   149 1   2   60.67   VGDPQELNGITR
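
Before handing such a file to iPiG, the header line can be checked with a few lines of Python. This is a minimal sketch under the assumption that the header must match the column names above exactly and in the same order; how strictly iPiG itself verifies the header is not specified here. The function name and the padding of short rows are illustrative choices.

```python
import csv

# Column titles exactly as in the example above.
EXPECTED_HEADER = [
    "prot_acc", "prot_desc", "pep_query", "pep_isunique", "pep_exp_z",
    "pep_score", "pep_seq", "pep_var_mod", "pep_var_mod_pos",
]


def read_psm_table(lines):
    """Parse tab-separated PSM lines; raise ValueError on a wrong header."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError("unexpected header: %r" % (header,))
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short rows: the trailing modification columns may be empty.
        fields = fields + [""] * (len(EXPECTED_HEADER) - len(fields))
        rows.append(dict(zip(EXPECTED_HEADER, fields)))
    return rows
```

An open file object can be passed directly, e.g. `read_psm_table(open("psms.txt", newline=""))` (the file name is a placeholder).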


Gene Annotations

UCSC and Ensembl genes in the UCSC table format (tab-separated) are currently supported as gene annotations, so the UCSC Table Browser is the recommended source (http://genome.ucsc.edu/cgi-bin/hgTables?command=start).

For example, when mapping PSMs obtained from a human sample, we recommend going to the Table Browser and using the following settings:
clade: "Mammal", genome: "Human", assembly: "Feb. 2009 (GRCh37/hg19)" (or the latest),
group: "Genes and Gene Prediction Tracks", track: "UCSC Genes" or "Ensembl Genes",
table: "knownGene" or "ensGene", respectively, region: "genome", output format: "all fields from selected table"
Enter a file name (*.txt) in the field "output file" and download the file via the "get output" button.

The data in the file should look like this:

Example UCSC Genes:
#name   chrom   strand  txStart txEnd   cdsStart    cdsEnd  exonCount   exonStarts  exonEnds    proteinID   alignID
uc009vjk.2  chr1    +   322036  326938  324342  325605  3   322036,324287,324438,   322228,324345,326938,   C9J4L2  uc009vjk.2
uc001aau.3  chr1    +   323891  328581  324342  325605  3   323891,324287,324438,   324060,324345,328581,   C9J4L2  uc001aau.3
Example Ensembl Genes:
#bin    name    chrom   strand  txStart txEnd   cdsStart    cdsEnd  exonCount   exonStarts  exonEnds    score   name2   cdsStartStat    cdsEndStat  exonFrames
9   ENST00000472741 chr1    -   1026425 1051467 1051467 1051467 3   1026425,1027370,1051439,    1026945,1027483,1051467,    0   ENSG00000131591 none    none    -1,-1,-1,
34  ENST00000478275 chr1    -   212859759   212872097   212872097   212872097   2   212859759,212870302,    212860321,212872097,    0   ENSG00000123685 none    none    -1,-1,
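
Both tables can be read with the same header-driven code, since the column names appear in the "#" line. The following is a minimal sketch, not iPiG's own parser; the function name and the added "exons" key are hypothetical. It relies on the layout shown above, where "exonStarts" and "exonEnds" are comma-separated lists with a trailing comma.

```python
def parse_gene_table(lines):
    """Yield one dict per annotation row, keyed by the '#' header names."""
    header = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            header = line[1:].split("\t")
            continue
        if header is None:
            raise ValueError("data row found before the '#' header line")
        record = dict(zip(header, line.split("\t")))
        # The lists end with a comma, so drop the empty last item.
        starts = [int(x) for x in record["exonStarts"].split(",") if x]
        ends = [int(x) for x in record["exonEnds"].split(",") if x]
        if not (len(starts) == len(ends) == int(record["exonCount"])):
            raise ValueError("exon lists do not match exonCount for %s"
                             % record["name"])
        record["exons"] = list(zip(starts, ends))
        yield record
```

Because the columns are looked up by name, the extra leading "bin" column of the Ensembl table needs no special handling.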


Amino Acid Sequences

The amino acid sequences have to correspond to the gene annotations, therefore they have to be in the UCSC table format (tab-separated) as well.
The UCSC Table Browser is a recommended source for this data, too (http://genome.ucsc.edu/cgi-bin/hgTables?command=start).

Following the example with human PSMs, the parameters are the same except for one field:
table: "knownGenePep" or "ensPep", respectively
The data can be saved as described above.

The data in the file should look like this:

Example UCSC Genes:
#name   seq
uc010nxq.1      MSESINFSHNLGQLLSPPRCVVMPGMPFPSIRSPELQKTTADLDHTLVSVPSVAESLHHPEITFLTAFCLPSFTRSRPLPDRQLHHCLALCPSFALPAGDGVCHGPGLQGSCYKGETQESVESRVLPGPRHRH
uc001adj.1  MQRWIMEKTAEHFQEAMEESKTHFRAVDPDGDGHVSWDEYKVKFLASKGHSEKEVADAIRLNEELKVDEESECSARLPPPVSGILCVRACVVCT
Example Ensembl Genes:
#name   seq
ENST00000004921 MKGLAAALLVLVCTMALCSCAQVGTNKELCCLVYTSWQIPQKFIVDYSETSPQCPKPGVILLTKRGRQICADPNKKWVQKYISDLKLNA
ENST00000005180 MMGLSLASAVLLASLLSLHLGTATRGSDISKTCCFQYSHKPLPWTWVRSYEFTSNSCSQRAVIFTTKRGKKVCTHPRKKWVQKYISLLKTPKQL
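
Since the sequences have to correspond to the gene annotations, it can be useful to check which annotated names have no sequence entry. The sketch below assumes the two-column, tab-separated layout shown above; both function names are hypothetical. A name without a sequence is not necessarily an error, as annotation tables may also list transcripts without a coding region.

```python
def read_sequence_table(lines):
    """Return a dict mapping each name to its amino acid sequence."""
    sequences = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#name seq' header
        name, seq = line.split("\t", 1)
        sequences[name] = seq.strip()
    return sequences


def names_without_sequence(annotation_names, sequences):
    """Return the sorted annotation names that have no sequence entry."""
    return sorted(set(annotation_names) - set(sequences))
```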


ID Mapping

For the ID mapping, only the tab-separated UniProt ID-mapping file is currently supported. It can be downloaded from the UniProt FTP server for several species
(ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/).

For example, for a human ID-mapping file, browse the FTP directory, change to the "by organism" directory and download the corresponding tab file (like "HUMAN*.tab").
For an example, open "examples\Uniprot_idmapping.tab" with a text editor.

Proteome fasta file (optional)

The ID mapping can be supported by the headers of a proteome FASTA file, e.g. the one containing the proteins used for the peptide identifications. For the human PSMs example, you can download the human proteome in FASTA format from the UniProt FTP server (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomes/).

The data in the file should look like this:

>sp|A0A183|LCE6A_HUMAN Late cornified envelope protein 6A OS=Homo sapiens GN=LCE6A PE=2 SV=1
MSQQKQQSWKPPNVPKCSPPQRSNPCLAPYSTPCGAPHSEGCHSSSQRPEVQKPRRARQK
LRCLSRGTTYHCKEEECEGD
>tr|A0A4R5|A0A4R5_HUMAN Keratin 19 (Fragment) OS=Homo sapiens GN=keratin 19 PE=2 SV=1
TIENARIVLQINNAQLAADDF
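
The headers carry both an accession (e.g. A0A183) and an entry name (e.g. LCE6A_HUMAN, the same kind of identifier as in the "prot_acc" column of the tab-separated PSM example). The following sketch shows one way to pull both out of headers shaped like those above. It is an illustration only; how iPiG uses the headers internally is not described here, and the function name and dictionary keys are hypothetical.

```python
def parse_fasta_headers(lines):
    """Return one dict per header of the form '>db|accession|entry_name ...'."""
    entries = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line
        identifier, _, description = line[1:].strip().partition(" ")
        parts = identifier.split("|")
        if len(parts) != 3:
            continue  # header does not follow the three-part layout
        database, accession, entry_name = parts
        entries.append({
            "database": database,
            "accession": accession,
            "entry_name": entry_name,
            "description": description,
        })
    return entries
```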


Reference Chromosomes (for GeneControl)

To use GeneControl, a set of reference chromosomes (or scaffolds) covering all genes in the annotation file is required. One FASTA file (*.fa!) has to be provided per chromosome, all in a single folder, and the file names must correspond to the chromosome names in the annotations (e.g. chr1.fa, chrX.fa, chrIV.fa, etc.).
Again, UCSC is a good source for the data (http://hgdownload.cse.ucsc.edu/downloads.html).
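
The naming requirement can be checked up front with a small helper. This is a hedged sketch with a hypothetical function name: it takes the chromosome names from the "chrom" column of the annotation file and a listing of the file names in the chromosome folder, and reports every chromosome for which no matching "*.fa" file is present.

```python
def missing_chromosome_files(chrom_names, file_names):
    """Return the sorted chromosome names that lack a '<name>.fa' file."""
    available = set(file_names)
    return sorted(name for name in set(chrom_names)
                  if name + ".fa" not in available)
```

The file listing could come from `os.listdir(folder)`; an empty result means every annotated chromosome has a file with the expected name.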

For example, to get a human chromosome reference set, choose "human" from the vertebrates at the suggested download site, continue with "Full data set" and download the "chromFa.tar.gz" file from the bottom of the page. Extract the archive to a folder of your choice, which can be selected in GeneControl later.


Related

Wiki: Command Line
Wiki: Downloader
Wiki: GeneControl GUI
Wiki: Home
Wiki: Output Formats
Wiki: iPiG GUI
