-- GNAT -- Gene mention Normalization Across Taxa --
----------------------------------------------------

[Note for the notoriously impatient: If you just want to run GNAT, see the
"Main classes" section further down and follow the instructions in
INSTALLATION.txt]


GNAT is a package to recognize and identify gene names in biomedical text.

GNAT performs a series of steps: first, it recognizes names of genes, species,
and other terms in each text it is given; next, it assigns candidate IDs (of
all possible genes that share the found name); and finally, it disambiguates
between the candidates to find a single identifier (or as few as possible)
pointing to an EntrezGene record for that gene.

GNAT comes in two parts, a client and a server, although both can run on the 
same machine.

Server
------
A server is able to handle
- requests for named entity recognition: clients submit plain texts, PubMed
  IDs, and/or PMC IDs; candidate IDs are returned with positional and other
  information;
  - a server can cache results if PubMed IDs / PMC IDs are used
- requests to obtain information on individual genes; an SQL database is
  needed on the server side (or at least on a machine visible to the server).
Not every server has to provide each of these functions, so the client can
invoke a mix of servers to address them separately.

There are two options for server-side NER requests: services and dictionaries
(see packages gnat.server.* and gnat.server.dictionary.*). Dictionaries handle
the actual named entity recognition requests. Services are basically wrappers
around dictionaries: they can download and pre-process texts before sending
them to dictionaries, and they can cache the results. As dictionaries run on
individual ports, and many dictionaries might be needed, a service can 'hide'
all these functions behind a single port that is visible from the outside. An
exemplary wrapper service is gnat.server.GnatService, which takes texts, PubMed
IDs, and PMC IDs, and can perform gene NER, GO term NER, species NER, and
gene normalization (although in a development environment this last step
should typically be handled by the client).
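
The wrapper idea can be illustrated with a minimal Java sketch. This is not
GNAT's code: the class ServiceFacadeSketch, its methods, and the toy
"dictionaries" (plain functions instead of processes listening on ports) are
all hypothetical and only show how one entry point could pre-process a text,
forward it to several dictionaries, and cache the result per document ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: one service entry point in front of several dictionaries. */
final class ServiceFacadeSketch {

    /** Maps a task name (e.g. "gene") to a stand-in for a dictionary server. */
    private final Map<String, Function<String, List<String>>> dictionaries =
            new LinkedHashMap<>();

    /** Results cached per document ID, as a service may do for PubMed/PMC IDs. */
    private final Map<String, List<String>> cache = new LinkedHashMap<>();

    /** Counts dictionary invocations, only to make the effect of the cache visible. */
    int dictionaryCalls = 0;

    void register(String task, Function<String, List<String>> dictionary) {
        dictionaries.put(task, dictionary);
    }

    /** Pre-processes the text, asks every registered dictionary, caches by ID. */
    List<String> annotate(String documentId, String text) {
        List<String> cached = cache.get(documentId);
        if (cached != null) {
            return cached;
        }
        // Assumed pre-processing step: collapse runs of whitespace.
        String normalized = text.trim().replaceAll("\\s+", " ");
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, Function<String, List<String>>> e : dictionaries.entrySet()) {
            dictionaryCalls++;
            for (String hit : e.getValue().apply(normalized)) {
                results.add(e.getKey() + ":" + hit);
            }
        }
        cache.put(documentId, results);
        return results;
    }

    public static void main(String[] args) {
        ServiceFacadeSketch service = new ServiceFacadeSketch();
        service.register("gene",
                t -> t.contains("p53") ? List.of("p53") : List.<String>of());
        service.register("species",
                t -> t.contains("mouse") ? List.of("mouse") : List.<String>of());
        System.out.println(service.annotate("doc1", "p53  in   mouse"));
        // Same document ID: answered from the cache, no dictionary is called.
        System.out.println(service.annotate("doc1", "ignored"));
    }
}
```

In a real deployment each registered function would presumably be a network
call to a dictionary on its own port; the sketch keeps everything in memory so
that it can be run without any GNAT components.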

For more information on dictionaries, see dictionaries/readme.dictionaries.txt.

To obtain the information on individual genes that is required for
disambiguation (normalization, identification, grounding) of genes, the client
typically needs access to a local (or visible) database. However, this
information can also be provided through a gnat.server.GeneService, again
essentially a wrapper that hides the actual database from direct, outside
access and also makes access by the client more transparent.

Client
------
A client is responsible for
- loading and pre-processing texts (or obtaining IDs if an invoked server can
  handle the download/pre-processing/caching);
- invoking a series of filtering steps to recognize and disambiguate gene
  names, and potentially other entities such as species names, GeneOntology
  terms, MeSH terms, etc.; some of these steps can be provided by servers,
  which have to be called at the proper time in a processing pipeline by the
  client;
- loading a repository of genes that contains information on individual genes,
  needed for disambiguation (as mentioned above, this can be obtained using
  a GeneService running remotely);
- storing the final results.

The filtering steps in a client's processing pipeline are meant to
- annotate entities, such as genes and species;
- assign identifiers to these entities, or lists of candidates;
- remove false positive annotations;
- rank candidate IDs;
- disambiguate gene candidate IDs;
- contact servers to perform any of these steps if available.
Filters can be found in the gnat.client.filter.* package; for instance, filters
concerned with NER are in gnat.client.filter.ner.*, and others for step-wise
identification are in gnat.client.filter.nei.*.
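
As a rough illustration of how such a chain of filtering steps fits together,
here is a self-contained Java sketch. It does not use GNAT's actual Filter
classes; the names PipelineSketch, Candidate, Filter, removeBelow, rankByScore,
and keepBestPerName, as well as the scores and IDs, are made up for this
example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a client-side filter pipeline; not GNAT's classes. */
final class PipelineSketch {

    /** A candidate annotation: matched name, candidate gene ID, and a score. */
    static final class Candidate {
        final String name;
        final String geneId;
        final double score;

        Candidate(String name, String geneId, double score) {
            this.name = name;
            this.geneId = geneId;
            this.score = score;
        }
    }

    /** One filtering step: takes the current candidates, returns the revised list. */
    interface Filter {
        List<Candidate> apply(List<Candidate> candidates);
    }

    /** Applies the filters in the given order, feeding each one's output to the next. */
    static List<Candidate> run(List<Candidate> initial, List<Filter> filters) {
        List<Candidate> current = new ArrayList<>(initial);
        for (Filter filter : filters) {
            current = filter.apply(current);
        }
        return current;
    }

    /** Removes likely false positives: candidates scoring below the threshold. */
    static Filter removeBelow(double threshold) {
        return candidates -> {
            List<Candidate> kept = new ArrayList<>();
            for (Candidate c : candidates) {
                if (c.score >= threshold) {
                    kept.add(c);
                }
            }
            return kept;
        };
    }

    /** Ranks candidates by descending score. */
    static Filter rankByScore() {
        return candidates -> {
            List<Candidate> sorted = new ArrayList<>(candidates);
            sorted.sort((a, b) -> Double.compare(b.score, a.score));
            return sorted;
        };
    }

    /** Disambiguates: keeps only the first (best-ranked) candidate per name. */
    static Filter keepBestPerName() {
        return candidates -> {
            Set<String> seen = new HashSet<>();
            List<Candidate> kept = new ArrayList<>();
            for (Candidate c : candidates) {
                if (seen.add(c.name)) {
                    kept.add(c);
                }
            }
            return kept;
        };
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("p53", "id-2", 0.6),
                new Candidate("p53", "id-1", 0.9),
                new Candidate("cat", "id-3", 0.1));
        List<Candidate> result = run(candidates,
                List.of(removeBelow(0.5), rankByScore(), keepBestPerName()));
        for (Candidate c : result) {
            System.out.println(c.name + " -> " + c.geneId); // prints "p53 -> id-1"
        }
    }
}
```

The order of the filters matters: ranking has to run before the
disambiguation step, which simply keeps the first candidate it sees for each
name.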






Requirements
------------

Client and Server:
- GNAT Jars and configuration files
- Third-party libraries, see libraries.readme.txt

Server:
- dictionaries, each will take 100MB to 2GB of memory (1-2GB for species with 
  many known genes and aliases, such as mouse, human, rat, fruit fly)
- RDBMS, for instance, MySQL, to save gene-specific information such as 
  species, GO terms, tissue specificity, protein length, interactions, ...




Main classes
------------

To run GNAT, check the Pipeline classes in gnat.client.*, for example, the 
DefaultPipeline.

Several such classes are provided for users who are not interested in
developing their own pipelines, but would rather run GNAT as-is, using
predefined filtering steps and output formats. Probably the most useful
supported input/output formats are plain text with tabular output, and XML
input with XML output:
- JustAnnotate: takes a directory of *.txt files as input, runs GNAT on each 
  such text file, and prints a list of recognized genes and IDs in a tabular 
  format. To see examples, take a look at the files in the folders texts/test/ 
  and texts/test100/. Conventions: if your files originate from PubMed, name
  them accordingly as <pubmedid>.txt.
- JustAnnotateInline: takes a directory of *.xml files as input, runs GNAT on
  each, and annotates each text.
  Supported input formats are 
  - plain text in .txt files (output will be enclosed by <document> tags)
  - MedlineCitation and PubmedArticle, as single documents or as collections
    inside MedlineCitationSet and PubmedArticleSet, respectively.
  GNAT guesses the file type from the file names/extensions, therefore it is
  best to follow these conventions:
  - *.medline.xml files contain single documents
  - *.medlines.xml files contain Sets
  - NCBI Medline FTP download file names (medline[year]n[number].xml)
    can contain either MedlineCitation or PubmedArticle Sets; those files can
    optionally be GZipped (.gz); the output will in any case be unzipped.
  The current DTD for PubmedArticle(Set)s can be found here:
    http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_120101.dtd
  and MedlineCitation(Sets) here:
    http://www.nlm.nih.gov/databases/dtd/nlmmedlinecitationset_120101.dtd
- AnnotateMedline: takes a directory of Medline XML files (gzipped or
  uncompressed) that were downloaded from NCBI FTP, where the file names are
  like "medline12n0123.xml.gz"; annotates genes using <GNAT> XML tags and
  writes the result into a given output directory, one XML output file per
  input file.
- Example files can be found in texts/test and texts/test_xml/.
- To call JustAnnotate, JustAnnotateInline, and AnnotateMedline, see the shell 
  scripts in the scripts/ folder.
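
The file name conventions listed for JustAnnotateInline can be summarized in a
small Java sketch. This is not the code GNAT uses to guess file types; the
class FileTypeGuess and its Kind enum are hypothetical, and the regular
expression assumes that [year] and [number] are runs of digits, as in
"medline12n0123.xml.gz".

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: classify input files by the naming conventions above. */
final class FileTypeGuess {

    enum Kind { PLAIN_TEXT, SINGLE_DOCUMENT_XML, DOCUMENT_SET_XML, NCBI_MEDLINE_FILE, UNKNOWN }

    // medline[year]n[number].xml, optionally gzipped (digit runs are an assumption).
    private static final Pattern NCBI_FILE =
            Pattern.compile("medline\\d+n\\d+\\.xml(\\.gz)?");

    static Kind guess(String fileName) {
        if (NCBI_FILE.matcher(fileName).matches()) {
            return Kind.NCBI_MEDLINE_FILE;
        }
        if (fileName.endsWith(".medlines.xml")) {
            return Kind.DOCUMENT_SET_XML;
        }
        if (fileName.endsWith(".medline.xml")) {
            return Kind.SINGLE_DOCUMENT_XML;
        }
        if (fileName.endsWith(".txt")) {
            return Kind.PLAIN_TEXT;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        String[] names = {
            "18689813.txt", "18689813.medline.xml", "batch.medlines.xml",
            "medline12n0123.xml.gz", "notes.pdf"
        };
        for (String name : names) {
            System.out.println(name + " -> " + guess(name));
        }
    }
}
```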



Configuration files
-------------------

- isgn_properties.xml - main configuration file for client and server
- server_properties.xml - additional configuration settings for servers
- config/
  - taxonToServerPort.txt - mapping of NCBI taxon IDs to the server addresses
    and ports that serve gene NER for each respective species
  - runAdditionalFilters.txt - loaded by gnat.filter.RunAdditionalFilters
    to set up additional gnat.filter.Filter steps at runtime
  - geneNerDictionarServers.txt 
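
To make the idea of a taxon-to-server mapping concrete, the following Java
sketch parses such a mapping into a lookup table. The actual layout of
taxonToServerPort.txt is not described in this README, so the format used here
(one "<taxonId> <host> <port>" line per species, with "#" starting a comment)
is purely an assumption; check the file shipped with GNAT for the real format.
The class TaxonMappingSketch and the port numbers are likewise hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: parse a taxon ID to host/port mapping (format assumed). */
final class TaxonMappingSketch {

    /** Address of a dictionary server that handles one species. */
    static final class Endpoint {
        final String host;
        final int port;

        Endpoint(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    /** Parses lines of the assumed form "<taxonId> <host> <port>". */
    static Map<Integer, Endpoint> parse(List<String> lines) {
        Map<Integer, Endpoint> mapping = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = line.split("\\s+");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + raw);
            }
            mapping.put(Integer.parseInt(fields[0]),
                    new Endpoint(fields[1], Integer.parseInt(fields[2])));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // 9606 and 10090 are the NCBI taxon IDs for human and mouse;
        // the host and ports are invented for this example.
        Map<Integer, Endpoint> mapping = parse(List.of(
                "# taxon host port",
                "9606 localhost 56001",
                "10090 localhost 56002"));
        Endpoint human = mapping.get(9606);
        System.out.println(human.host + ":" + human.port); // prints "localhost:56001"
    }
}
```

To read a real file instead of the in-memory list, the lines could be loaded
with java.nio.file.Files.readAllLines and passed to parse.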



Publications:
[1] Jörg Hakenberg, Martin Gerner, Maximilian Haeussler, Illés Solt, Conrad
    Plake, Michael Schroeder, Graciela Gonzalez, Goran Nenadic, and Casey M. 
    Bergman: "The GNAT library for local and remote gene mention normalization"
    Bioinformatics, 2011, Advance access 3 August 2011
    DOI: 10.1093/bioinformatics/btr455

[2] Jörg Hakenberg, Conrad Plake, Loic Royer, Hendrik Strobelt, Ulf Leser, 
    Michael Schroeder: "Gene mention normalization and interaction extraction
    with context models and sentence motifs". Genome Biology 2008, 9:S14.
    Abstract: http://genomebiology.com/2008/9/S2/S14/abstract
    PubMed:   http://www.ncbi.nlm.nih.gov/sites/entrez/18834492

[3] Jörg Hakenberg, Conrad Plake, Robert Leaman, Michael Schroeder, and 
    Graciela Gonzalez: "Inter-species normalization of gene mentions with
    GNAT".
    Bioinformatics (2008) 24 (16):i126-i132.
    DOI:    10.1093/bioinformatics/btn299
    PubMed: http://www.ncbi.nlm.nih.gov/pubmed/18689813

More on gene mention normalization:
[4] Florian Leitner et al.: "Overview of BioCreative II gene normalization".
    Genome Biology 2008, 9:S3.
    Abstract: http://genomebiology.com/2008/9/S2/S3/abstract
    PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez/18834494

[5] Genome Biology's Special Issue on BioCreative II:
    http://genomebiology.com/supplements/9/S2

[6] Minlie Huang, Jingchen Liu, and Xiaoyan Zhu: "GeneTUKit: a software for 
    document-level gene normalization".
    Bioinformatics, 2011, 27(7):1032-1033
    DOI: 10.1093/bioinformatics/btr042 
Source: README.txt, updated 2012-09-10