Read Me

The bsdata distribution is a collection of software for processing data -
textual data in particular - working within a Unix-like process pipeline.

Prerequisites

The following software is required to use this distribution:

  * A POSIX-like shell and environment (for the high-level scripts)
  * Python (tested with 2.5.4, for the tools)
  * cmdsyntax (command option processing)
  * xsltproc (XML file parsing)
  * NCBI Tools (ASN.1 file parsing, optional)
  * iixr (text indexing)

Most Unix-based operating systems will provide the necessary commands for the
high-level scripts, but these commands may be provided separately or
explicitly on some platforms by packages such as GNU Coreutils and Findutils.
Amongst the commands used are the following:

  cat, cp, grep, gunzip, head, mv, rm, sort, tail, tee, xargs
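
Whether these commands are available can be checked before running the
high-level scripts. The following is a minimal sketch, not part of this
distribution: the check_commands function name is hypothetical, and the sketch
relies only on the standard "command -v" facility of a POSIX-like shell.

```shell
# Report any of the named commands that cannot be found on the PATH.
# Returns 0 if all are present, 1 otherwise.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" > /dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The commands listed above, as used by the high-level scripts.
check_commands cat cp grep gunzip head mv rm sort tail tee xargs \
    || echo "some required commands are missing" >&2
```

Other prerequisites that provide programs, such as python and xsltproc, could
be added to the argument list in the same way.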

The xsltproc program is provided by libxslt, which depends on libxml2.

The NCBI Software Development Toolkit (NCBI Tools) is required to parse ASN.1
files provided by Entrez Gene.

See the "Resources" section for download information.

Configuring the Software

A configuration script called bsdata-config is located in the scripts
directory of this distribution. It may be edited or copied to another location
on the PATH of any user running the software.

Before continuing, enter the distribution directory (normally containing this
README.txt file) and copy the bsdata-config file into the current directory as
follows:

  cp scripts/bsdata-config .

The details in the file can now be reviewed and edited. If an installation is
performed, any edits after installation can be incorporated into that
installation by once again running the command given in "Performing an
Installation" in the distribution directory.

Configuring an Installation of the Software

Once the prerequisites have been installed, the software can be run from the
distribution directory. Alternatively, a system-wide installation can be
performed or prepared using the script provided.

If a system-wide installation is to reside in a directory hierarchy other than
the conventional system root (that being /, with programs situated in
/usr/bin, and so on), the bsdata-config file should be modified to reflect
this by changing the SYSPREFIX setting. For example:

  SYSPREFIX=/home/bioscape                    # system-wide installation root

This setting states the directory at the top of the desired hierarchy. Even
if a system-wide installation ends up with inappropriate settings, such
settings can be overridden as described in "Configuring the Software".

Performing an Installation

With the bsdata-config file modified, the script can then be run:

  python install --prefix=/home/bioscape/usr

Note that SYSPREFIX will be /home/bioscape in this case: the script
needs the additional "/usr" to know where to install programs and resources.
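
The relationship between SYSPREFIX and the --prefix value described above can
be sketched as a small shell function. This is an illustration only: the
prefix_for name is hypothetical, and the function merely appends "/usr" to a
given SYSPREFIX, taking care of the conventional root (/) and of any trailing
slashes.

```shell
# Print the --prefix value corresponding to a given SYSPREFIX.
prefix_for() {
    sysprefix=$1
    # Strip trailing slashes, so that "/" becomes "" and "/opt/x/" becomes
    # "/opt/x".
    while [ "${sysprefix%/}" != "$sysprefix" ]; do
        sysprefix=${sysprefix%/}
    done
    printf '%s/usr\n' "$sysprefix"
}

prefix_for /home/bioscape    # prints /home/bioscape/usr
prefix_for /                 # prints /usr
```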


With a SYSPREFIX other than / (the conventional system root), such as
/home/bioscape, the PATH and PYTHONPATH variables in the environment need to
be modified so that the shell can find the installed programs and libraries.

To obtain suggested definitions of these variables, run the following command
in the distribution directory of this software:

  scripts/bsdata-show-settings --suggested

The command should produce output resembling the following for a SYSPREFIX of
/home/bioscape:

  export PATH=/home/bioscape/usr/bin:$PATH
  export PYTHONPATH=/home/bioscape/usr/lib/python2.6/site-packages:$PYTHONPATH

These definitions can be executed in the shell, and they can also be saved in
the appropriate shell configuration file, such as .profile, .bashrc,
.bash_profile or any other appropriate file in a user's home directory.
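
When such definitions are saved in a configuration file that may be read more
than once, the same directory can end up in PATH repeatedly. The following
sketch shows one way of avoiding this; the prepend_once function name is
hypothetical, and the directories are those from the example output above. It
also avoids leaving a trailing colon when PYTHONPATH was previously unset.

```shell
# Print a colon-separated list with the given directory at the front, leaving
# the list unchanged if the directory is already present.
prepend_once() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*)
            printf '%s\n' "$list"
            ;;
        *)
            if [ -n "$list" ]; then
                printf '%s:%s\n' "$dir" "$list"
            else
                printf '%s\n' "$dir"
            fi
            ;;
    esac
}

PATH=$(prepend_once /home/bioscape/usr/bin "$PATH")
PYTHONPATH=$(prepend_once /home/bioscape/usr/lib/python2.6/site-packages "${PYTHONPATH:-}")
export PATH PYTHONPATH
```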

Reserving a Location for Data

Before any operations can be performed using the software installation,
various data and resource locations must be initialised. This can be done as
follows:

  bsdata-show-settings --make-dirs

Any required directories that are not already present will be reported as
being created.

Downloading Data

Given a set of data sources defined in the bsdata-config file, the following
command can be used to download and make data from these sources available:

  bsdownload --all

Processing the Data

A set of special "bootstrap" scripts is provided that performs the necessary
processing steps, taking the downloaded data and processing it to produce a
term lexicon and then a set of search results indicating the locations of such
terms in the downloaded textual resources.
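
The idea of locating lexicon terms in text can be illustrated with standard
tools alone. The following is a toy sketch and not how the bootstrap scripts
work (they rely on search indexes): the find_terms name is hypothetical, and
the sketch assumes a lexicon file with one term per line and a text file with
one sentence per line.

```shell
# Print each sentence containing at least one lexicon term, prefixed by its
# line number in the sentence file.
find_terms() {
    lexicon=$1
    sentences=$2
    # -F treats terms as fixed strings, -f reads them from the lexicon file,
    # and -n prefixes each matching line with its line number.
    grep -n -F -f "$lexicon" "$sentences"
}
```

An empty line in the lexicon file would match every sentence, so such lines
should be removed beforehand.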

The top-level script is the bsbootstrap script and is run as follows:

  bsbootstrap

This script can be run with various options in order to generate updated
results, either involving updated textual resources, where the lexicon remains
the same but where the text to be searched has been updated...

  bsbootstrap --update-results

...or where the lexicon has changed (and possibly the textual resources, too)
and where a different set of results would be produced:

  bsbootstrap --update

The script can be run as a background process as follows:

  nohup bsbootstrap > log 2>&1 &

Once running, it can be controlled using the --show and --stop options, with
the latter attempting to stop the process and any work it is performing.

The output of this processing will appear in the data directory defined in the
bsdata-config file with the following subdirectories being used:

  biocreative2normalisation (unpacked BioCreative data)
  gene (Entrez Gene data archives, processed data and gene name lexicon)
  indexes (text search indexes for the lexicon and textual resources)
  pubmed (PubMed data archives)
  results (text search results)
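
To see which of these subdirectories are present after a run, something like
the following sketch could be used. The missing_subdirs name is hypothetical,
and the data directory must be given explicitly, being whatever location has
been defined in the bsdata-config file.

```shell
# Print each expected subdirectory that is absent from the given data
# directory.
missing_subdirs() {
    datadir=$1
    for name in biocreative2normalisation gene indexes pubmed results; do
        [ -d "$datadir/$name" ] || printf '%s\n' "$name"
    done
}
```

No output indicates that all five subdirectories exist.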

The results have two forms:

  * A set of raw "verified" results providing details of the search term and
    the result context.

  * A collection of "shown" fragments covering all document sentences, with
    result sentences being given in terms of fragments before, between and
    after results (and thus providing no result information), as well as
    fragments providing results (and thus search term details).

Using the Programs Separately

In order to search textual documents and provide search results, a number of
separate programs have been combined. These programs can also be used
separately to perform certain tasks, and they are described briefly below.

Some programs that perform tasks of a more general nature:

  biocreative2sentences (show each BioCreative sentence on a separate line)
  gene2lexicon (convert gene_info into separate entity-specific files)
  gene_filter_by_taxid (filter gene_info by taxonomy identifier)
  gene_filter_numeric_terms (remove numeric terms from gene_info)
  gene_filter_systematic_terms (remove systematic terms from gene_info)
  gene_filter_uninformative_terms (remove uninformative terms from gene_info)
  pubmed2sentences (show each PubMed sentence on a separate line)

Some programs that are more specific to this system:

  bscheckfile (check the existence of source data newer than the output data)
  bsdownload (download data from configured sources)
  bsdownload-biocreative (download BioCreative data)
  bsdownload-gene (download Entrez Gene data)
  bsdownload-pubmed (download PubMed data)
  bsindex (index document data)
  bsindex-auto (a wrapper around bsindex)
  bslexicon (make an index of terms from a lexicon)
  bslexicon-auto (a wrapper around bslexicon)
  bsmanifest-biocreative (show available BioCreative files)
  bsmanifest-pubmed (show available PubMed files)
  bsparallel (a script that lets programs run in parallel)
  bsparse-auto (a wrapper around the sentence generation programs)
  bssearch (search document data)
  bssearch-auto (a wrapper around bssearch)
  bsshow (show results in context)
  bsshow-auto (a wrapper around bsshow)
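
The kind of check described for bscheckfile, where source data newer than the
output data indicates that work needs to be done, can be sketched using the
standard find command. This is not the bscheckfile implementation: the
needs_update name and its behaviour for a missing output file are assumptions
made for illustration.

```shell
# Succeed (return 0) if the output is missing or older than the source.
needs_update() {
    src=$1
    out=$2
    # A missing output always needs to be produced.
    [ -e "$out" ] || return 0
    # find prints the source name only if it was modified after the output;
    # -prune prevents descent should the source be a directory.
    [ -n "$(find "$src" -prune -newer "$out" -print)" ]
}
```

Such a function could guard a processing step, as in
"needs_update source output && process source > output", where the file and
command names are placeholders.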

Bootstrapping programs:

  bsbootstrap (bootstrap the entire system)
  bsbootstrap-gene (bootstrap gene-related resources)
  bsbootstrap-text (bootstrap textual resources)
  bsbootstrap-text-pipeline (the text processing pipeline)

Configuration-related files:

  bsdata-config (the general system configuration)
  bsdata-config-biocreative (BioCreative-specific configuration)
  bsdata-show-settings (show the system configuration)

Contact, Copyright and Licence Information

The current Web page for this software at the time of release is:

The current maintainer can be contacted at the following e-mail address:

Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/gpl-3.0.txt for more information.

Resources

The following locations provide the prerequisites for this system:

  NCBI Tools

The intention is that operating system packages should provide such
prerequisites, but there remains a possibility that not all prerequisites will
be packaged for all operating system distributions.