
Annotation Corpora

Marco Mina

An annotation corpus is a set of annotations relating genes or proteins to GO terms in the Gene Ontology.

FastSemSim requires you to provide an annotation corpus as input data.
A few example annotation corpora are included in the examples directory.
In general, when using FastSemSim you have to specify the path of the file containing the annotation corpus needed for your analysis.

The following sections contain notes on the supported file formats and on how to download or build an up-to-date annotation corpus.

Supported file formats

FastSemSim is able to load flat files and GAF-2 formatted files.

Flat files

Flat files are simple text files in which each row contains an association between a protein or gene and an ontology term.
The current row format is:
obj_id [tab] term_id

Example:
Q3LHL9 GO:0005634
Q3S4X5 GO:0005730

Please note that the columns are tab-separated.
I plan to design a more flexible parser for future versions of FastSemSim, able to load flat files with different formats and additional information columns.
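
To check a flat file before passing it to FastSemSim, the row format above can be read with a few lines of Python. This is only a minimal sketch, not FastSemSim's own parser: the function name `load_flat_corpus` and the choice to return a dictionary of sets are assumptions made for illustration.

```python
from collections import defaultdict


def load_flat_corpus(path):
    """Read 'obj_id [tab] term_id' rows into {obj_id: {term_id, ...}}.

    Hypothetical helper for illustration; not part of FastSemSim.
    """
    annotations = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # ignore blank lines
            fields = line.split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"line {line_number}: expected 2 tab-separated "
                    f"columns, found {len(fields)}"
                )
            obj_id, term_id = (field.strip() for field in fields)
            annotations[obj_id].add(term_id)
    return dict(annotations)
```

A row that uses spaces instead of a tab raises a ValueError with the offending line number, which makes this kind of formatting problem easy to spot.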

GAF-2 files

This is the standard file format used by the Gene Ontology Consortium and (usually) by other websites (e.g. SwissProt-UniProt).

More information about the file format can be found in the Gene Ontology Consortium's documentation of the GAF 2.0 format.
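
As a rough illustration of how a GAF-2 file maps onto the same object/term pairs used by flat files, the sketch below reads the DB Object ID (column 2), Qualifier (column 4) and GO ID (column 5) of each row and skips the comment lines starting with "!". It is not FastSemSim's own loader: the function name `load_gaf2_corpus` and the `skip_not` option (which drops annotations carrying the NOT qualifier) are assumptions made for this example.

```python
from collections import defaultdict


def load_gaf2_corpus(path, skip_not=True):
    """Read a GAF 2.0 file into {object_id: {go_id, ...}}.

    Hypothetical helper for illustration; not part of FastSemSim.
    """
    annotations = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Header and comment lines start with '!'.
            if line.startswith("!") or not line.strip():
                continue
            fields = line.rstrip("\r\n").split("\t")
            # GAF 2.0 rows have 17 columns; the last two are optional.
            if len(fields) < 15:
                continue  # skip rows that are too short to be valid
            obj_id, qualifier, go_id = fields[1], fields[3], fields[4]
            # Qualifiers are pipe-separated; NOT negates the annotation.
            if skip_not and "NOT" in qualifier.split("|"):
                continue
            annotations[obj_id].add(go_id)
    return dict(annotations)
```

Whether negated (NOT) annotations should be dropped depends on the analysis, so the sketch leaves that choice to the caller.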

Retrieving annotation corpora

Up-to-date annotation corpora can be downloaded from the Gene Ontology website (http://www.geneontology.org/GO.downloads.annotations.shtml) or from the UniProtKB-GOA website (http://www.ebi.ac.uk/GOA/downloads.html).
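
Once you have the direct URL of a corpus file from one of the pages above, the download can be scripted with the Python standard library. The sketch below is only an example: the function name `download_corpus` and the 60-second default timeout are assumptions, and no specific file URL is implied.

```python
import shutil
import urllib.request


def download_corpus(url, dest_path, timeout=60):
    """Stream the file at `url` into `dest_path` and return that path.

    Hypothetical helper for illustration; not part of FastSemSim.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        # Copy in chunks so large corpora are never held fully in memory.
        shutil.copyfileobj(response, out)
    return dest_path
```

If the downloaded file is compressed (for example with gzip), it may need to be decompressed before being passed to FastSemSim.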
