FastSemSim Core

Marco Mina

FastSemSim Library

FastSemSim is a library written entirely in Python. Its purpose is to enable the use of semantic similarity measures in Python. Compared to tools available for other platforms, such as R, it is much faster and more flexible.

FastSemSim does not require any additional package to be installed on the machine. It relies on the standard xml library embedded in Python.

Several example and test applications are available in the examples and test folders. These simple applications explain in detail the few steps required to use FastSemSim.

The Gene Ontology and two example annotation corpora are included too. While these files are perfect for learning how to use FastSemSim, they might become obsolete. You can easily retrieve the latest versions online. FastSemSim is able to parse the files provided by the Gene Ontology Consortium as they are.

See [Gene Ontology] and [Annotation Corpora] for additional details.

Conceptually, FastSemSim can be divided into two components: GO and SemSim.

GO component

Its purpose is to parse the Gene Ontology as well as annotation corpora.
It consists of two parts: (i) a set of classes to handle the Gene Ontology (see the file GeneOntology.py), and (ii) a set of files for parsing and using annotation corpora (see the file AnnotationCorpus.py).

GeneOntology.py

GeneOntology.py provides basic routines to read XML-formatted obo files describing the Gene Ontology. Currently, the DAG of the Gene Ontology is stored in 2 variables, node_edge and edge_node: the former lists, for each node, the edges involving it; the latter keeps track of all the edges, mapping each edge to its end nodes. The 3 supplementary tables "alt_ids", "edge_types", and "obsolete_ids" provide additional information about each GO term or GO edge.
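The two-table layout described above can be sketched with the standard xml library. This is a minimal illustration, not the code in GeneOntology.py: the function name load_dag, the element names (term, id, is_a, alt_id, is_obsolete), the direction of the alt_ids mapping, and the made-up term ids are all assumptions that should be checked against the real obo-xml file and the library source.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an obo-xml file; ids and element names are assumptions.
SAMPLE = """<obo>
  <term><id>T:1</id><name>root</name></term>
  <term><id>T:2</id><name>child</name><alt_id>T:9</alt_id><is_a>T:1</is_a></term>
  <term><id>T:3</id><name>old</name><is_obsolete>1</is_obsolete></term>
</obo>"""


def load_dag(xml_text):
    """Build node/edge tables from obo-xml text (hypothetical helper)."""
    node_edge = {}        # term id -> ids of the edges involving that term
    edge_node = {}        # edge id -> (child term id, parent term id)
    edge_types = {}       # edge id -> relationship name
    alt_ids = {}          # alternative id -> primary id (assumed direction)
    obsolete_ids = set()  # ids of terms flagged as obsolete

    root = ET.fromstring(xml_text)
    for term in root.iter("term"):
        term_id = term.findtext("id")
        if term.findtext("is_obsolete") == "1":
            obsolete_ids.add(term_id)
            continue
        node_edge.setdefault(term_id, [])
        for alt in term.findall("alt_id"):
            alt_ids[alt.text] = term_id
        for parent in term.findall("is_a"):
            edge_id = len(edge_node)  # next free integer id
            edge_node[edge_id] = (term_id, parent.text)
            edge_types[edge_id] = "is_a"
            node_edge[term_id].append(edge_id)
            node_edge.setdefault(parent.text, []).append(edge_id)
    return node_edge, edge_node, edge_types, alt_ids, obsolete_ids


node_edge, edge_node, edge_types, alt_ids, obsolete_ids = load_dag(SAMPLE)
```

Keeping both tables means that the edges of a node and the end nodes of an edge can each be found with a single dictionary lookup, at the cost of storing every edge id twice in node_edge.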

AnnotationCorpus.py

Routines and structures to handle annotation corpora. Together with the other AnnotationCorpus*.py files, it parses GOA annotation files as well as plain annotation files, in which each line is a "term" - "object" pair or vice versa.

AnnotationCorpus variables:
annotations: for each object in the annotation corpus, the GO terms annotated to it
reverse_annotations: for each GO term, the objects annotated with it
obj_set: the objects involved in annotations
term_set: the GO terms involved in annotations
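As an illustration of how these four variables relate to each other, the sketch below fills them from a plain annotation file. It is not the AnnotationCorpus implementation: the function name load_plain_pairs, the term_first flag, the whitespace separator, and the use of sets as containers are assumptions made for the example.

```python
from collections import defaultdict


def load_plain_pairs(lines, term_first=False):
    """Build the four corpus variables from "object term" lines (hypothetical helper)."""
    annotations = defaultdict(set)          # object -> GO terms
    reverse_annotations = defaultdict(set)  # GO term -> objects
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        if term_first:
            term, obj = fields
        else:
            obj, term = fields
        annotations[obj].add(term)
        reverse_annotations[term].add(obj)
    return {
        "annotations": dict(annotations),
        "reverse_annotations": dict(reverse_annotations),
        "obj_set": set(annotations),
        "term_set": set(reverse_annotations),
    }


# Made-up object and term names, for illustration only.
corpus = load_plain_pairs(["obj1 T1", "obj1 T2", "obj2 T1", ""])
```

Here obj_set and term_set are simply the keys of the two dictionaries, so the four variables always stay consistent with each other.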

AnnotationCorpus functions:
load: load an annotation corpus file. Format specification is required. Populates variables described above.
parse: see load
check_consistency: check whether the GO terms used are up to date with the current GeneOntology version.
sanitize: remove annotations involving GO terms that are obsolete in the current GeneOntology version, and resolve alternative id mappings.
constrain: filter annotations according to several parameters:
- object taxonomy
- annotation type
It is possible to filter data directly at parse time, which avoids holding unwanted annotations from huge files in memory.
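The filtering and sanitizing steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the library's own functions: the names iter_gaf and sanitize_annotations are hypothetical, "annotation type" is taken to mean the evidence code, alt_ids is assumed to map an alternative id to its primary id, and the column positions follow the GOA/GAF tab-separated layout (object id in column 2, GO id in column 5, evidence code in column 7, taxon in column 13).

```python
def iter_gaf(lines, taxon=None, evidence_codes=None):
    """Yield (object id, GO id) pairs, dropping filtered rows while parsing."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # blank line or GAF comment/header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # too short to carry a taxon column
        obj_id, go_id, evidence, taxa = cols[1], cols[4], cols[6], cols[12]
        if taxon is not None and taxon not in taxa.split("|"):
            continue
        if evidence_codes is not None and evidence not in evidence_codes:
            continue
        yield obj_id, go_id


def sanitize_annotations(annotations, alt_ids, obsolete_ids):
    """Map alternative ids to primary ids, then drop obsolete terms."""
    cleaned = {}
    for obj, terms in annotations.items():
        kept = set()
        for term in terms:
            term = alt_ids.get(term, term)
            if term not in obsolete_ids:
                kept.add(term)
        if kept:  # objects left without any term are removed
            cleaned[obj] = kept
    return cleaned
```

Because iter_gaf is a generator, rows that fail a filter are discarded as soon as they are read, so only the surviving pairs ever need to be stored.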

SemSim component

Several files are present here. The most important are:
SemSimUtils.py: basic functions and routines useful for several SS measures (IC, term roots, term ancestors and offspring, shortest path, common ancestors, ...)
ObjSemSim.py: template for object semantic similarity measures. Provides a description of a common interface to any object SS measure
TermSemSim.py: template for term semantic similarity measures. Provides a description of a common interface to any term SS measure
MixSemSim.py: template for mixing strategies (max, avg, BMA, ...)
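To show how the pieces listed above fit together, here is a small self-contained sketch: ancestors and IC as utilities, one term measure, and the max, avg, and BMA mixing strategies. It does not reproduce the interfaces of SemSimUtils.py, TermSemSim.py, or MixSemSim.py. All function names, the toy DAG, and the toy annotations are hypothetical; the term measure shown is Resnik's (IC of the most informative common ancestor), and BMA is taken in the variant that divides the summed best matches by the total number of terms.

```python
import math

# Toy DAG (term -> parents) and toy corpus (object -> terms); names are made up.
parents = {"root": set(), "a": {"root"}, "b": {"root"}, "c": {"a", "b"}}
annotations = {"o1": {"a"}, "o2": {"c"}, "o3": {"b"}, "o4": {"c"}}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(N / n_t): n_t objects carry t or one of its descendants."""
    counts = dict.fromkeys(parents, 0)
    for terms in annotations.values():
        covered = set().union(*(ancestors(t) for t in terms))
        for term in covered:
            counts[term] += 1
    total = len(annotations)
    return {t: math.log(total / n) for t, n in counts.items() if n > 0}


ic = information_content()


def resnik(t1, t2):
    """Term similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max((ic.get(t, 0.0) for t in common), default=0.0)


def mix(terms1, terms2, term_sim=resnik, strategy="BMA"):
    """Combine pairwise term scores into one object score (non-empty inputs)."""
    terms1, terms2 = list(terms1), list(terms2)
    scores = [[term_sim(a, b) for b in terms2] for a in terms1]
    if strategy == "max":
        return max(max(row) for row in scores)
    if strategy == "avg":
        return sum(map(sum, scores)) / (len(terms1) * len(terms2))
    if strategy == "BMA":
        best_rows = [max(row) for row in scores]
        best_cols = [max(col) for col in zip(*scores)]
        return (sum(best_rows) + sum(best_cols)) / (len(terms1) + len(terms2))
    raise ValueError("unknown strategy: " + strategy)
```

Passing the term measure as a parameter mirrors the template idea: any function that scores two terms can be plugged into any mixing strategy, for example mix({"a", "c"}, {"b"}, strategy="max").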


Related

Wiki: Annotation Corpora
Wiki: Gene Ontology