
Success_story

Anonymous Igor


We successfully use IBM's text analysis technology to identify chemical names and other entities in unstructured text. Once identified, the chemical names are converted into their chemical structures using name=structure programs. This produces SMILES strings representing the chemical structures, which are subsequently used in computational calculations and as input for other applications. Using this technology, we analyze millions of patents and Medline abstracts and generate a large database of molecular structures derived from the text of those documents. This work effectively renders the scientific and patent literature searchable by structure/substructure search applications.

The combined technologies for reading and processing molecular structures allow researchers to build large databases from previously inaccessible literature, relevant in areas such as patents, pharmaceuticals, publishing, health care, and environmental science. Recently, we migrated the above databases into International Chemical Identifier (InChI) format and developed search and similarity analysis algorithms for the InChI database. In addition to identifying chemical entities, we have also developed a family of annotators that identify and extract proteins, genes, cell lines, cell types and a host of other domain-specific entities.

The combined data derived from the text analytics and subsequent post-processing operations (such as co-occurrence analysis) are retained in a data warehouse that is integrated with our Business Insights Workbench (BIW) application, which provides e-classification, visualization, and OLAP analysis capabilities. Integrating these operations with our BlueGene supercomputer has enabled us to process more than 11 million documents to date. We currently index more than 100,000 documents per month, including all US, EP and WO documents, which we index weekly. Additionally, our computing environment is capable of indexing approximately 1 billion web pages in approximately 3 hours, retrieving and indexing biologically relevant information.
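As a rough illustration of the co-occurrence analysis mentioned above, the following minimal sketch counts how many documents mention each pair of entities. It is not the production implementation: the function name `cooccurrence_counts`, the input shape (one list of entity strings per document), and the sample entities are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotated_docs):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in annotated_docs:
        # Deduplicate within a document and sort, so each pair has one
        # canonical ordering and is counted at most once per document.
        unique = sorted(set(entities))
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical annotator output: one list of entity strings per document.
docs = [
    ["aspirin", "COX-1", "aspirin"],
    ["aspirin", "COX-1", "COX-2"],
    ["ibuprofen", "COX-2"],
]

pair_counts = cooccurrence_counts(docs)
print(pair_counts.most_common(3))
```

Pairs are stored as sorted tuples, so a lookup must use the same ordering (for example `("COX-1", "aspirin")`, since uppercase letters sort before lowercase ones). Counting per document rather than per mention keeps a single long document from dominating the statistics; a real pipeline might instead count within sentences or weight pairs by frequency.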


''Stephen K Boyer''