irefindex DrugBank Parser

Tools for building/working with consolidated protein interaction data

Brought to you by: ianoslo, pboddie

File	Date	Author	Commit
docs	2013-03-26	Paul Boddie	[f8ccab] Added copyright and licensing information.
reports	2011-03-11	Paul Boddie	[1fb745] Moved various database operations into differen...
sql	2011-10-05	Paul Boddie	[599f80] Tidied up the scripts, removing the default pro...
tools	2011-10-05	Paul Boddie	[599f80] Tidied up the scripts, removing the default pro...
README.txt	2011-10-05	Paul Boddie	[599f80] Tidied up the scripts, removing the default pro...
database_action.py	2011-03-11	Paul Boddie	[1fb745] Moved various database operations into differen...
export_data	2011-10-05	Paul Boddie	[599f80] Tidied up the scripts, removing the default pro...
parse_data	2011-10-05	Paul Boddie	[599f80] Tidied up the scripts, removing the default pro...

Read Me

This software is concerned with the parsing of DrugBank data and the eventual
production of MITAB output data for consumption by other tools and systems.
See the following page for more information on the supported MITAB format:

http://irefindex.uio.no/wiki/DrugBank_MITAB2.6_File_Format

Prerequisites
-------------

The following programs are required to use the parser:

  * Python (tested with 2.5.4): http://www.python.org/
  * PostgreSQL (tested with 8.1.17): http://www.postgresql.org/
  * iRefIndex MITAB Parser: http://irefindex.uio.no/wiki/iRefIndex_MITAB2.6_Parser

A database module such as pyPgSQL is not required for Python since the native
PostgreSQL tools are used to create and populate the database.

Processing the DrugBank data also requires data from UniProt and
iRefIndex:

UniProt ("text" format data):	http://www.uniprot.org/downloads
iRefIndex (MITAB format data):	http://irefindex.uio.no/

Running the Parser
------------------

A shell script is provided for the complete parsing workflow:

  ./parse_data --all

This will parse UniProt data and DrugBank XML data, performing SEGUID checksum
operations on protein sequences, and preparing files for database import.

To restrict parsing and processing to a single dataset, specify --uniprot or
--drugbank instead of --all after the command name. Specifying --all causes
both datasets to be processed.
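
As an illustration of the checksum step mentioned above, SEGUID is commonly
defined as the base64-encoded SHA-1 digest of the upper-cased sequence, with
the trailing padding removed. The sketch below follows that definition. It is
not taken from the parser's own sources: the function name "seguid" is
hypothetical, and the code assumes Python 3 rather than the Python 2.5 the
tools were tested with.

```python
import base64
import hashlib


def seguid(sequence):
    """Return a SEGUID-style checksum for a protein sequence."""
    # Normalise: strip surrounding whitespace and upper-case the residues.
    normalised = sequence.strip().upper()
    digest = hashlib.sha1(normalised.encode("ascii")).digest()
    # A 20-byte digest encodes to 28 base64 characters ending in "=";
    # removing the padding leaves 27 characters.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # Arbitrary example residues, not a real database entry.
    print(seguid("MKTAYIAKQR"))
```

Because the sequence is normalised first, the same protein yields the same
checksum regardless of letter case in the source data.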

Creating the Database
---------------------

A database can be created using the usual PostgreSQL tools:

  createdb -E unicode drugbank

This database is initialised as follows:

  psql -f sql/init_drugbank.sql drugbank
  psql -f sql/init_uniprot.sql drugbank
  psql -f sql/init_irefindex.sql drugbank

Should the database tables need to be dropped (perhaps in case of problems
with the import), the following commands can be used:

  psql -f sql/drop_drugbank.sql drugbank
  psql -f sql/drop_uniprot.sql drugbank
  psql -f sql/drop_irefindex.sql drugbank

Populating the Database
-----------------------

To prepare iRefIndex data for import into the database, the MITAB parser for
iRefIndex must first be run on the MITAB file downloaded from the iRefIndex
downloads site. Some of the data produced can then be exported from the MITAB
parser's database into the data directory, ready for import into the database:

  python database_action.py <mitab database> sql/export_irefindex.sql

For example:

  python database_action.py mitab sql/export_irefindex.sql

The database is then populated as follows:

  python database_action.py drugbank sql/import_drugbank.sql
  python database_action.py drugbank sql/import_uniprot.sql
  python database_action.py drugbank sql/import_irefindex.sql

Indexes will be created to assist the workflow, but for other applications
additional indexes may be required.
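
The creation, initialisation and population commands documented above always
run in the same order, so they can be collected into a single list. The sketch
below is one possible way to do that in Python 3. The helper names
"setup_commands" and "run_all" are hypothetical and not part of this
distribution. It assumes that parse_data and the iRefIndex export step have
already been run, and it only prints the commands unless dry_run is disabled.

```python
import subprocess

# Dataset names as used in the sql/init_*.sql and sql/import_*.sql filenames.
DATASETS = ("drugbank", "uniprot", "irefindex")


def setup_commands(database="drugbank"):
    """List the documented commands for creating and populating a database."""
    commands = [["createdb", "-E", "unicode", database]]
    for name in DATASETS:
        commands.append(["psql", "-f", "sql/init_%s.sql" % name, database])
    for name in DATASETS:
        commands.append(
            ["python", "database_action.py", database, "sql/import_%s.sql" % name]
        )
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check_call raises CalledProcessError on a non-zero exit status,
            # so a failed step stops the sequence.
            subprocess.check_call(command)


if __name__ == "__main__":
    run_all(setup_commands())
```

Passing a different database name to setup_commands changes the final
argument of every command consistently.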

Processing DrugBank Data and Exporting MITAB Data
-------------------------------------------------

In order to be able to create a complete MITAB format file, some processing is
required; this is performed in a shell script:

  ./export_data drugbank --legacy

Here, "drugbank" can be replaced with a different database name, if one has
been specified with the database-related commands above.

The --legacy argument indicates that additional processing is performed on
sequences and signatures when signatures/checksums/hashes are produced.

The processing involves a sequence of steps:

 1. The initial step creates a table of interactors which will be used as the
    basis for the interactor information in the MITAB file, as well as
    providing the necessary ingredients for the computation of interaction
    checksums.

 2. A table of interactions is exported for interaction checksum computation.

 3. The checksum computation is performed, adding a checksum to each complete
    interaction provided by the interaction table.

 4. The completed interaction table is re-imported.

 5. The interactor and interaction data is combined to produce a table of
    MITAB records.

 6. The MITAB data is exported and a metadata header is added.

The resulting MITAB file will reside in the data directory with the name
drugbank_mitab.txt.
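
Steps 2 to 4 above move the interaction table out of the database so that a
checksum can be computed for each interaction. As a rough illustration, the
sketch below assumes the scheme used for iRefIndex interaction identifiers:
the interactor identifiers are sorted, concatenated, hashed with SHA-1 and
base64-encoded. Whether export_data computes exactly this, and how --legacy
alters it, is not stated in this document. The function name is hypothetical
and the code assumes Python 3.

```python
import base64
import hashlib


def interaction_checksum(interactor_ids):
    """Return a checksum that is the same for any ordering of the interactors."""
    # Sorting first makes the result independent of participant order.
    joined = "".join(sorted(interactor_ids))
    digest = hashlib.sha1(joined.encode("ascii")).digest()
    # Drop the single "=" of base64 padding, as SEGUID-style checksums do.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # Placeholder identifiers, not real interactor checksums.
    print(interaction_checksum(["interactorB", "interactorA"]))
```

Sorting before hashing means that the same pair of interactors receives the
same checksum whichever participant is listed first.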

Structure of DrugBank Records
-----------------------------

Each drug record contains a collection of external identifiers for the drug,
qualified by resource names such as "National Drug Code Directory" and
"GenBank".

Each record also describes interactions between drugs and other drugs, and
between drugs and "partners" acting as carriers, enzymes, targets and
transporters.

Each partner also provides external identifiers qualified by resource names
such as "GenBank Gene Database" or "UniProtKB". Gene and protein sequences
are also provided in partner records.

Drugs which are classified as "biotech" drugs may themselves also provide
protein sequence information. Thus, interaction partners can be regarded as a
wider set containing all drugs and the "partners" described above.

Data Representations and Restrictions
-------------------------------------

Only the partners in drug-partner interactions which have UniProt accessions
are stored in the drugbank_partners table. These are effectively "protein
partners".

From this table and the information stored in the drugbank_alternatives table,
a wider set of "partner interactors" is produced containing the protein
partners from the drugbank_partners table together with any drugs which are
also proteins, as well as those drugs which are small molecules.
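
The selection rule described above can be sketched as a simple filter. The
dictionary keys ("id", "uniprot", "type", "sequence") and the type value
"small molecule" are assumptions made for this example and do not reflect the
actual table columns. The code assumes Python 3.

```python
def partner_interactors(partners, drugs):
    """Combine protein partners and qualifying drugs into one list.

    partners: iterable of dicts with "id" and an optional "uniprot" accession.
    drugs: iterable of dicts with "id", "type" and an optional "sequence".
    """
    interactors = []
    for partner in partners:
        # Partners without a UniProt accession are not kept.
        if partner.get("uniprot"):
            interactors.append(("partner", partner["id"]))
    for drug in drugs:
        # Keep drugs that carry a protein sequence, and small molecules.
        if drug.get("sequence") or drug.get("type") == "small molecule":
            interactors.append(("drug", drug["id"]))
    return interactors


if __name__ == "__main__":
    # Placeholder records, not real DrugBank data.
    example_partners = [{"id": "p1", "uniprot": "EXAMPLE1"}, {"id": "p2"}]
    example_drugs = [
        {"id": "d1", "type": "biotech", "sequence": "MKT"},
        {"id": "d2", "type": "small molecule"},
    ]
    print(partner_interactors(example_partners, example_drugs))
```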

Notes on Various Data Types
---------------------------

According to http://en.wikipedia.org/wiki/CAS_registry_number, restrictions
exist on the inclusion of CAS numbers in databases. These numbers are
therefore not currently included.

Notes on Databases and Resources
--------------------------------

DrugBank qualifies external identifiers using a range of resource or database
names. In order to maintain consistency with iRefIndex, whose MITAB
information is of particular interest, some of these identifiers are mapped to
their form as employed by iRefIndex in its MITAB output. Thus, a mapping file
is provided for this purpose and can be found here:

  sql/dbname_mapping.txt
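
The layout of the mapping file is not described here. Purely as an
illustration, the sketch below assumes one mapping per line, with the DrugBank
name and the iRefIndex name separated by a tab. The function names and the
example entry are hypothetical, and the code assumes Python 3.

```python
def load_mapping(lines):
    """Build a dict from assumed "DrugBank name<TAB>iRefIndex name" lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Skip blank lines.
            continue
        source, target = line.split("\t", 1)
        mapping[source] = target
    return mapping


def map_name(mapping, name):
    """Return the mapped name, or the original name if no mapping exists."""
    return mapping.get(name, name)


if __name__ == "__main__":
    # Invented entry for demonstration; not copied from the real file.
    example = load_mapping(["UniProtKB\tuniprotkb\n"])
    print(map_name(example, "UniProtKB"))
    print(map_name(example, "GenBank"))
```

Names without an entry pass through unchanged, so only the identifiers that
differ between DrugBank and iRefIndex need to be listed.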

References
----------

Data file:              http://drugbank.ca/system/downloads/current/drugbank.xml.zip
Schema:                 http://drugbank.ca/docs/drugbank.xsd
Format documentation:   http://drugbank.ca/documentation
UniProt documentation:  http://au.expasy.org/sprot/userman.html
