irefindex MITAB parser

Tools for building/working with consolidated protein interaction data

Brought to you by: ianoslo, pboddie
File	Date	Author	Commit
docs	2011-03-17	Paul Boddie	[85b470] Updated copyright information.
reports	2011-10-21	Paul Boddie	[73f2c2] Sort the output for convenience.
sql	2011-09-19	Paul Boddie	[fff4b1] Added aggregate function removal to the appropr...
.hgtags	2011-11-29	Paul Boddie	[a8736a] Added tag snapshot-2011-11-29 for changeset 8b9...
README.txt	2011-11-29	Paul Boddie	[8b9369] Updated the release notes.
database_action.py	2011-03-11	Paul Boddie	[ffcfbd] Fixed processing of MITAB2.6 files to handle bo...
parse_mitab.py	2011-07-18	Paul Boddie	[2f7cdf] Improved the usage message.
Read Me

The mitab distribution contains a tool that has been developed to parse the
MITAB files produced in the iRefIndex build process. See the following pages
for more information on the supported MITAB formats:

http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_9.0
http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_8.0
http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_7.0
http://irefindex.uio.no/wiki/README_iRefIndex_MITAB_7.0

The format employed by the current iRefIndex release is documented on the
following page:

http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex

Important Notices
-----------------

In releases of iRefIndex prior to 9.0, RIGIDs were incorrectly computed:
instead of a simple concatenation of ROGIDs and the application of the hashing
function, ROGIDs were processed in a similar fashion to protein sequence
information, resulting in the removal of non-alphanumeric characters and the
conversion of the processed text to upper case.
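
The difference between the two computations can be sketched in Python 3 as
follows. This is an illustration only: the README does not name the hashing
function, so a base64-encoded SHA-1 digest is assumed here, and the function
names and sample ROGIDs are hypothetical.

```python
import base64
import hashlib
import re


def digest(text):
    """Return a base64-encoded SHA-1 digest without padding (assumed hash)."""
    raw = hashlib.sha1(text.encode("ascii")).digest()
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def sanitise(rogid):
    """Treat a ROGID like sequence text: drop non-alphanumerics, upper-case."""
    return re.sub(r"[^A-Za-z0-9]", "", rogid).upper()


def rigid_intended(rogids):
    """Hash the simple concatenation of the ROGIDs."""
    return digest("".join(rogids))


def rigid_before_9(rogids):
    """Hash the concatenation of the inappropriately processed ROGIDs."""
    return digest("".join(sanitise(r) for r in rogids))


# Hypothetical ROGIDs, invented for illustration.
rogids = ["aB/cd+EfGh9606", "Zy/xw+VuTs9606"]
print(rigid_intended(rogids))
print(rigid_before_9(rogids))
```

The two printed values differ because the second function hashes different
input text, even though both start from the same pair of ROGIDs.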

In order to determine whether such incorrect RIGIDs still maintained their
desirable properties, particularly that of referring to distinct interactions,
the following reports were devised:

  * The check_interactors.sql report attempts to determine whether the
    inappropriately processed ROGIDs correspond unambiguously to the original
    ROGIDs. This should be sufficient to determine whether the nature of the
    computed RIGIDs has been compromised by the inappropriate processing,
    since any incorrect ROGID corresponding to multiple original ROGIDs has
    the potential to produce a RIGID describing more than one distinct
    interaction. Where a one-to-one correspondence is upheld, combinations of
    ROGIDs should refer to distinct interactions, and only a weakness in the
    hashing algorithm should produce a collision that undermines the desirable
    property of a RIGID as a distinct interaction label.

  * The check_interactions.sql report attempts to show that RIGIDs describe
    interactions involving distinct sets of interactors. Unfortunately, it is
    not possible to obtain sufficient information from the MITAB data to show
    this for non-binary interactions since only collections of distinct
    interactors are included in the MITAB files, yet non-binary interactions
    could involve the same interactor many times, and such repetition of an
    interactor's ROGID would have to be included in the necessary RIGID
    computation.
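
The one-to-one test behind the first report can be sketched in Python 3. This
is not the SQL found in check_interactors.sql; it only illustrates the idea,
using a hypothetical sanitise function and invented ROGIDs.

```python
import re
from collections import defaultdict


def sanitise(rogid):
    """Drop non-alphanumeric characters and convert to upper case."""
    return re.sub(r"[^A-Za-z0-9]", "", rogid).upper()


def ambiguous_rogids(rogids):
    """Map each processed ROGID to its originals, keeping ambiguous ones."""
    originals = defaultdict(set)
    for rogid in rogids:
        originals[sanitise(rogid)].add(rogid)
    # Only processed forms shared by more than one original are a problem.
    return {key: sorted(values)
            for key, values in originals.items() if len(values) > 1}


# Hypothetical ROGIDs: the first two collapse to the same processed form.
rogids = ["ab/Cd9606", "AB+cd9606", "Qr/st9606"]
print(ambiguous_rogids(rogids))
```

An empty result corresponds to the one-to-one case described above, where
only a hash collision could make a RIGID refer to more than one interaction.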

Prerequisites
-------------

The following programs are required to use the parser:

  * Python (tested with 2.3.5 and 2.5.4): http://www.python.org/
  * PostgreSQL (tested with 8.1.x, 8.3.x and 9.0.x): http://www.postgresql.org/

A database module such as pyPgSQL is not required for Python since the native
PostgreSQL tools are used to create and populate the database.

Running the Parser
------------------

Given a directory for the iRefIndex output files such as...

  /home/irefindex/output

...run the parser as follows:

  python parse_mitab.py /home/irefindex/output/All.mitab.10182011.txt

It will be necessary to change the date details included in the above filename
to match the actual name of the appropriate file found in your own output
directory. Note that any compressed version of the file should first be
uncompressed and the uncompressed version used with this software.
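
Finding the appropriate filename can be automated. The Python 3 sketch below
assumes that uncompressed files follow the All.mitab.*.txt naming shown
above; find_mitab_files and parser_command are hypothetical helper names, not
part of this distribution.

```python
import glob
import os


def find_mitab_files(output_dir):
    """Return uncompressed All.mitab.*.txt files in output_dir, by name."""
    pattern = os.path.join(output_dir, "All.mitab.*.txt")
    return sorted(glob.glob(pattern))


def parser_command(mitab_file):
    """Build the parser command line shown above for a given MITAB file."""
    return ["python", "parse_mitab.py", mitab_file]


if __name__ == "__main__":
    # Print one command per candidate file; nothing is executed here.
    for filename in find_mitab_files("/home/irefindex/output"):
        print(" ".join(parser_command(filename)))
```

Because the pattern ends in .txt, files that still carry a compression suffix
are skipped rather than passed to the parser.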

When one of the following options is specified, warnings are written to
standard output:

  --all         All warnings will be displayed
  --serious     Only serious warnings will be displayed
  --trivial     Only trivial warnings will be displayed

Consider redirecting standard output to a file in order to record warnings.

Known Warnings
--------------

Trivial warnings are generally about things which could be corrected, but
which are unlikely to indicate a problem with the data. For example:

  For interaction type vocabulary term, empty value given as -

Serious warnings may indicate a problem with the data that goes beyond a mere
lack of compliance with the format. However, some of these warnings, although
worth a brief inspection, are not likely to be remedied in future releases due
to the nature of the underlying data. For example:

  For database vocabulary term, generic code MI:0000 has non-empty names
  section: ophid

In the above example, it is regarded as unlikely that a vocabulary term will
be assigned for the OPHID database or that the data source responsible will
make use of such a term. Consequently, although the warning is valid (and a
database reference to OPHID will employ "MI:0000" as a vocabulary term code),
it can effectively be ignored.

  For method vocabulary term, generic code MI:0000 has non-empty names
  section: 10024883

In the above example, a method has been described using "MI:0000" and
something that appears to resemble a PubMed identifier. This way of describing
methods has previously appeared in the underlying data, but the cause of this
phenomenon has since been fixed in iRefIndex 9. However, earlier versions of
iRefIndex are likely to contain a large number of such records.

  For method vocabulary term, generic code MI:0000 has non-empty names
  section: OPHID Predicted Protein Interaction

In the above example, an unrecognised method name has been used (as is the
case in the previous example). Once again, unless a suitable vocabulary term
can be found in order to choose a more appropriate code, the data has to be
accepted in its imperfect form.
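
The kind of check that produces these warnings can be sketched in Python 3.
The rules below are inferred from the warning texts in this section and are
not taken from parse_mitab.py; the CODE(names) term layout, the check_term
function and the level labels are assumptions made for illustration.

```python
import re

# Assumed term layout: a PSI-MI code followed by a names section in brackets.
TERM = re.compile(r"^(MI:\d{4})\((.*)\)$")


def check_term(kind, value):
    """Return (level, message) for a questionable term, or None otherwise."""
    if value == "-":
        return ("TRIVIAL",
                "For %s vocabulary term, empty value given as -" % kind)
    match = TERM.match(value)
    if match is None:
        return None  # other layouts are outside the scope of this sketch
    code, names = match.groups()
    if code == "MI:0000" and names not in ("", "-"):
        return ("SERIOUS",
                "For %s vocabulary term, generic code MI:0000 has "
                "non-empty names section: %s" % (kind, names))
    return None


print(check_term("database", "MI:0000(ophid)"))
print(check_term("method", "MI:0018(two hybrid)"))
```

The first call reports the OPHID case discussed above; the second returns
None because a specific code accompanies the name.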

  Possible oversized PubMed identifier for 2TXb2Q8o1LU0/z9eFKK/g/gt3qI:
  4294967295

Some PubMed identifiers are clearly invalid. Unfortunately, it is difficult to
correct such entries to use the intended PubMed identifier, and such
references will effectively be useless in any further processing or analysis.
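
The value in the example, 4294967295, equals 2^32 - 1, the largest unsigned
32-bit integer. A minimal Python 3 sketch of a size check follows; the limit
used (the largest signed 32-bit integer) is an assumption and not necessarily
the threshold that parse_mitab.py applies.

```python
# Assumed limit: the largest signed 32-bit integer (not taken from the parser).
MAX_PLAUSIBLE_PMID = 2**31 - 1


def is_oversized_pmid(value, limit=MAX_PLAUSIBLE_PMID):
    """Return True if value is a decimal identifier greater than limit."""
    return value.isdigit() and int(value) > limit


print(is_oversized_pmid("4294967295"))  # True
print(is_oversized_pmid("10024883"))    # False
```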

  Score entry has illegal format: -

The above format-related warning is currently regarded as serious, but may be
reclassified as trivial in later releases.

To collect only serious warnings from a log of all warnings, the following
command invocation can be used on systems providing the grep tool:

  grep SERIOUS logfile > logfile-serious

A summary of the distinct warning types in the resulting log can be generated
on systems providing the cut and sort tools:

  cut -d ':' -f 3- logfile-serious | sort -u
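
On systems without these tools, the same filtering can be approximated in
Python 3. The sketch below mirrors the grep, cut and sort steps; the function
name summarise_serious is hypothetical, and the sample log lines in the usage
note are invented, since the exact log layout is not described here.

```python
def summarise_serious(lines):
    """Mimic: grep SERIOUS | cut -d ':' -f 3- | sort -u"""
    messages = set()
    for line in lines:
        line = line.rstrip("\n")
        if "SERIOUS" not in line:
            continue
        if ":" in line:
            # Like cut -f 3-: keep the third field onwards, rejoined with ':'.
            line = ":".join(line.split(":")[2:])
        messages.add(line)
    return sorted(messages)


if __name__ == "__main__":
    import sys
    # Usage: python summarise.py logfile
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as log:
            for message in summarise_serious(log):
                print(message)
```

Given two invented lines such as "SERIOUS:12: Score entry has illegal format:
-" and "SERIOUS:40: Score entry has illegal format: -", the function returns
a single entry, since both share the same text after the second colon.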

Creating the Database
---------------------

A database can be created using the usual PostgreSQL tools:

  createdb -E unicode mitab_irefindex

This database is initialised as follows:

  psql -f sql/init_mitab.sql mitab_irefindex

Should the database tables need to be dropped (perhaps in case of problems
with the import), the following command can be used:

  psql -f sql/drop_mitab.sql mitab_irefindex

Populating the Database
-----------------------

The database is populated as follows:

  python database_action.py mitab_irefindex sql/import_mitab.sql

As a result, a number of tables representing the structure of the data should
be available in the database. For applications built to use this data, indexes
may need creating in order to make querying more efficient.
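
The steps from the preceding sections can be collected into a small driver
script. The Python 3 sketch below only assembles the commands already shown
in this document; build_steps and run_steps are hypothetical helper names,
and the script assumes that it is run from the top-level directory of the
distribution with the PostgreSQL tools available on the PATH.

```python
import subprocess


def build_steps(mitab_file, database="mitab_irefindex"):
    """Return the documented commands, in the order the README gives them."""
    return [
        ["python", "parse_mitab.py", mitab_file],
        ["createdb", "-E", "unicode", database],
        ["psql", "-f", "sql/init_mitab.sql", database],
        ["python", "database_action.py", database, "sql/import_mitab.sql"],
    ]


def run_steps(steps):
    """Run each command in turn, stopping at the first failure."""
    for argv in steps:
        print(" ".join(argv))
        subprocess.check_call(argv)
```

Calling run_steps(build_steps("/home/irefindex/output/All.mitab.10182011.txt"))
would then parse the file, create and initialise the database, and import the
data; check_call raises an exception if any command exits with an error.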

Inspecting the Database
-----------------------

A number of reports can be generated from the database. For example:

  python database_action.py mitab_irefindex reports/interactions_by_source.sql

Report and export files are written to the default or specified data
directory.

Differences between Parsing MITAB2.5 and MITAB2.6 Files
-------------------------------------------------------

  * Since gene identifiers are no longer provided in MITAB2.6, the
    mitab_interactor_genes table will be empty when parsing MITAB2.6 format
    files. This table can be populated using the populate_interactor_genes.sql
    template with the database_action.py script.

Contact, Copyright and Licence Information
------------------------------------------

The current Web page for this software at the time of release is:

http://irefindex.uio.no/wiki/iRefIndex_MITAB2.6_Parser

The author can be contacted at the following e-mail address:

paul.boddie@biotek.uio.no

Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/gpl-3.0.txt for more information.

New in mitab snapshot 2011-11-29 (Changes since mitab snapshot 2011-03-17)
--------------------------------------------------------------------------

  * Added reports testing for RIG identifier collisions and collisions of
    inappropriately "sanitised" ROG identifiers.
  * Fixed the interactor genes population template.
  * Added support for parsing non-integer scores, albeit producing warnings
    and not importing such scores into the database.
  * Added a test for differing numbers of columns per line.
  * Improved and expanded various reports.

New in mitab snapshot 2011-03-17 (Changes since mitab snapshot 2011-02-23)
--------------------------------------------------------------------------

  * Added interaction and canonical interaction integer RIG identifier tables.
  * Added support for "standard" MITAB2.6 format files which lack the full
    range of iRefIndex fields.
  * Added schema support in order to populate the same database with different
    versions of data or data from different sources.
  * Enhanced the database_action.py script so that database templates can make
    use of extra parameters.
  * Excluded incomplete interactions, relevant to non-iRefIndex data sources.
  * Excluded information for interactions not providing RIG identifiers and
    interactors not providing ROG identifiers, relevant to non-iRefIndex data
    sources.

New in mitab snapshot 2011-02-23 (Changes since mitab snapshot 2010-06-29)
--------------------------------------------------------------------------

  * Added support for canonical iRefIndex data in the MITAB2.6 format.
  * Introduced a data directory for tidier handling of database import files.
  * Changed the import template to use the \copy command instead of the SQL
    copy statement, since the latter is generally a superuser-only statement
    in PostgreSQL.
  * Merged the database access/modification scripts into a single script.
  * Added various reports and export templates.
  * Changed the handling of MI:0000 names/descriptions in order to more easily
    detect and report inappropriate combinations of names and such codes, and
    to include such combinations in the database for further analysis.

New in mitab snapshot 2010-06-29 (Changes since mitab snapshot 2010-03-29)
--------------------------------------------------------------------------

  * Fixed taxonomy identifier processing for iRefIndex 7.0 (thanks to Jon
    Lees' testing).

New in mitab snapshot 2010-03-29 (Changes since mitab snapshot 2009-08-31)
--------------------------------------------------------------------------

  * Changed uid-related fields, dropping the redundant "irefindex:" prefix.
  * Split aliases, alternatives and interaction identifiers from the MITAB
    file, creating a dbname column in the associated tables, and removing the
    prefix from the uid-related columns.
  * Changed the tables to hold only pertinent information for each particular
    concept. This means that a number of interactor-related tables no longer
    hold interaction information.
  * Upheld "MI:0000" code usage for things like source databases.
  * Fixed taxonomy identifiers so that -1 really means "in vitro" as described
    in the PSI-MI MITAB specification.
  * Fixed empty list recognition, omitting collections if "-" was specified.
  * Omitted interaction identifiers where "-" is given for a particular
    database.
  * Improved the warning framework and adjusted the levels of certain
    warnings.
  * Changed the templating to employ a temporary file, executed in one step.
  * Added tolerance of the iRefIndex 6.1 MITAB format which had one additional
    column that this software currently ignores.