BioC Wiki

We describe a simple XML format to share text documents and annotations.

Status: Beta

Brought to you by: rezarta

BioC

***BioC: A Minimalist Approach to Interoperability for Biomedical Text Processing***

We describe a simple XML format to share text documents and annotations. The format allows a large number of different annotations to be represented. We provide simple code to hold this data, read it, write it back to XML files, and perform some sample processing.
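As a rough illustration of what "holding this data" could look like, the sketch below models a document with stand-off annotations in Python. The class and field names (`Annotation`, `Document`, `offset`, `length`, `infons`) are assumptions made for this example and are not taken from the code distributed with BioC.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A labeled span of text, stored as an offset and a length (hypothetical layout)."""
    id: str
    offset: int
    length: int
    infons: dict = field(default_factory=dict)  # free-form key/value labels


@dataclass
class Document:
    """A text document together with its annotations (hypothetical layout)."""
    id: str
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        # Recover the annotated span from the document text.
        return self.text[ann.offset:ann.offset + ann.length]


# Sample processing: list the text covered by each annotation.
doc = Document(id="doc1", text="Aspirin reduces fever.")
doc.annotations.append(
    Annotation(id="a1", offset=0, length=7, infons={"type": "chemical"})
)
spans = [doc.covered_text(a) for a in doc.annotations]
```

Keeping annotations separate from the text, and pointing into it by offset, is one way to let many different annotations coexist on the same document without altering it.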

The Problem

NLP and text mining tools are essential for searching for and extracting information from text. Strong research efforts have produced useful tools, and manually labeled text corpora have been created to improve them. To encourage combining these efforts into larger and more capable systems, it is highly desirable to have a common interchange format for representing, storing, and exchanging data in a simple manner between different NLP systems and text mining tools.

BioC goals

simplicity
interoperability
broad use
reuse

The most significant difference from previous efforts is our emphasis on simplicity of use. Learning to use a format, or a software module that processes that format, should require little investment. We are interested in reuse, and we focus on common NLP tasks that are broadly useful for text mining.

XML File Format

XML is easily written and read in any computer language.
XML is portable between different operating systems.
XML is well known and familiar to many people.
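To make the "easily written and read" point concrete, here is a minimal sketch that writes a document with one annotation to XML and reads it back, using only Python's standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `id`, `text`, `annotation`, `offset`, `length`) and the helpers `to_xml` and `from_xml` are illustrative assumptions; they are not meant to reproduce the actual BioC DTD.

```python
import xml.etree.ElementTree as ET


def to_xml(doc_id, text, annotations):
    """Serialize one document; annotations are (offset, length, label) tuples."""
    doc = ET.Element("document")  # hypothetical element names throughout
    ET.SubElement(doc, "id").text = doc_id
    ET.SubElement(doc, "text").text = text
    for offset, length, label in annotations:
        ann = ET.SubElement(
            doc, "annotation", offset=str(offset), length=str(length)
        )
        ann.text = label
    return ET.tostring(doc, encoding="unicode")


def from_xml(xml_string):
    """Parse the XML produced by to_xml back into plain Python values."""
    doc = ET.fromstring(xml_string)
    annotations = [
        (int(a.get("offset")), int(a.get("length")), a.text)
        for a in doc.findall("annotation")
    ]
    return doc.findtext("id"), doc.findtext("text"), annotations


original = ("doc1", "Aspirin reduces fever.", [(0, 7, "chemical")])
xml_string = to_xml(*original)
restored = from_xml(xml_string)
```

Because the serialized form is plain XML text, the same string could be parsed by an XML library in any other language or on any other operating system.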