Read Me
The mitab distribution contains a tool that has been developed to parse the
MITAB files produced in the iRefIndex build process. See the following pages
for more information on the supported MITAB formats:
http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_9.0
http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_8.0
http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_7.0
http://irefindex.uio.no/wiki/README_iRefIndex_MITAB_7.0
The format employed by the current iRefIndex release is documented on the
following page:
http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex
Important Notices
-----------------
In releases of iRefIndex prior to 9.0, RIGIDs were incorrectly computed:
instead of a simple concatenation of ROGIDs and the application of the hashing
function, ROGIDs were processed in a similar fashion to protein sequence
information, resulting in the removal of non-alphanumeric characters and the
conversion of the processed text to upper case.
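
The difference between the two computations can be illustrated with a minimal
sketch. It assumes that the hashing function is SHA-1 with a base64-encoded
digest and that ROGIDs are sorted before concatenation; neither detail is
stated in this document. The function names and the sample ROGIDs are
hypothetical, and the code is written for a current Python 3 rather than being
part of the parser.

```python
import base64
import hashlib
import re

def digest(text):
    """Base64-encoded SHA-1 digest without padding (ASSUMED hashing function)."""
    raw = hashlib.sha1(text.encode("ascii")).digest()
    return base64.b64encode(raw).decode("ascii").rstrip("=")

def sanitise(text):
    """Mimic the faulty processing: drop non-alphanumerics, then upper-case."""
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()

def rigid_intended(rogids):
    """Hash the plain concatenation of the ROGIDs (sorting is an assumption)."""
    return digest("".join(sorted(rogids)))

def rigid_faulty(rogids):
    """Hash the concatenation after sequence-style processing (pre-9.0 behaviour)."""
    return digest(sanitise("".join(sorted(rogids))))

# Made-up ROGIDs: the first member of each pair differs only in case and
# punctuation, which the faulty processing discards.
pair_a = ["aB/cd9606", "xy+zq9606"]
pair_b = ["ab+CD9606", "xy+zq9606"]

print(rigid_intended(pair_a) == rigid_intended(pair_b))  # distinct inputs are hashed
print(rigid_faulty(pair_a) == rigid_faulty(pair_b))      # identical inputs are hashed
```

In this sketch the two pairs receive different identifiers under the intended
computation but the same identifier under the faulty one, which is the kind of
ambiguity the reports described below are meant to detect.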
In order to determine whether such incorrect RIGIDs still maintained their
desirable properties, particularly that of referring to distinct interactions,
the following reports were devised:
* The check_interactors.sql report attempts to determine whether the
inappropriately processed ROGIDs correspond unambiguously to the original
ROGIDs. This should be sufficient to determine whether the nature of the
computed RIGIDs has been compromised by the inappropriate processing,
since any incorrect ROGID corresponding to multiple original ROGIDs has
the potential to produce a RIGID describing more than one distinct
interaction. Where a one-to-one correspondence is upheld, combinations of
ROGIDs should refer to distinct interactions, and only a weakness in the
hashing algorithm should produce a collision that undermines the desirable
property of a RIGID as a distinct interaction label.
* The check_interactions.sql report attempts to show that RIGIDs describe
interactions involving distinct sets of interactors. Unfortunately, it is
not possible to obtain sufficient information from the MITAB data to show
this for non-binary interactions since only collections of distinct
interactors are included in the MITAB files, yet non-binary interactions
could involve the same interactor many times, and such repetition of an
interactor's ROGID would have to be included in the necessary RIGID
computation.
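
The idea behind the first report can be sketched outside the database as
follows. This is not the SQL used by check_interactors.sql; it is an
illustrative Python 3 function with hypothetical names and made-up sample
identifiers, which groups ROGIDs by their inappropriately processed form and
reports any form shared by more than one original ROGID.

```python
import re
from collections import defaultdict

def sanitise(rogid):
    """Mimic the faulty processing: drop non-alphanumerics, then upper-case."""
    return re.sub(r"[^A-Za-z0-9]", "", rogid).upper()

def ambiguous_rogids(rogids):
    """Map each processed ROGID to its originals where there is more than one."""
    groups = defaultdict(set)
    for rogid in rogids:
        groups[sanitise(rogid)].add(rogid)
    return {
        processed: sorted(originals)
        for processed, originals in groups.items()
        if len(originals) > 1
    }

# Made-up identifiers: the first two collapse to the same processed form.
sample = ["aB/cd9606", "ab+CD9606", "xy+zq9606"]
print(ambiguous_rogids(sample))
```

An empty result would indicate a one-to-one correspondence for the supplied
ROGIDs; any entry in the result marks a processed ROGID that could contribute
to a RIGID describing more than one distinct interaction.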
Prerequisites
-------------
The following programs are required to use the parser:
* Python (tested with 2.3.5 and 2.5.4): http://www.python.org/
* PostgreSQL (tested with 8.1.x, 8.3.x and 9.0.x): http://www.postgresql.org/
A database module such as pyPgSQL is not required for Python since the native
PostgreSQL tools are used to create and populate the database.
Running the Parser
------------------
Given a directory for the iRefIndex output files such as...
  /home/irefindex/output
...run the parser as follows:
  python parse_mitab.py /home/irefindex/output/All.mitab.10182011.txt
It will be necessary to change the date details included in the above filename
to match the actual name of the appropriate file found in your own output
directory. Note that any compressed version of the file should first be
uncompressed and the uncompressed version used with this software.
Warnings are written to standard output when one of the following options is
specified:
  --all      All warnings will be displayed
  --serious  Only serious warnings will be displayed
  --trivial  Only trivial warnings will be displayed
Consider redirecting standard output to a file in order to record warnings.
Known Warnings
--------------
Trivial warnings are generally about things which could be corrected, but
which are unlikely to indicate a problem with the data. For example:
  For interaction type vocabulary term, empty value given as -
Serious warnings may indicate a problem with the data that goes beyond a mere
lack of compliance with the format. However, some of these warnings, although
worth a brief inspection, are not likely to be remedied in future releases due
to the nature of the underlying data. For example:
  For database vocabulary term, generic code MI:0000 has non-empty names
  section: ophid
In the above example, it is regarded as unlikely that a vocabulary term will
be assigned for the OPHID database or that the data source responsible will
make use of such a term. Consequently, although the warning is valid (and a
database reference to OPHID will employ "MI:0000" as a vocabulary term code),
it can effectively be ignored.
  For method vocabulary term, generic code MI:0000 has non-empty names
  section: 10024883
In the above example, a method has been described using "MI:0000" and
something that appears to resemble a PubMed identifier. This way of describing
methods appeared in the underlying data for earlier releases, and its cause
was fixed in iRefIndex 9.0. Earlier versions of iRefIndex are therefore likely
to contain a large number of such records.
  For method vocabulary term, generic code MI:0000 has non-empty names
  section: OPHID Predicted Protein Interaction
In the above example, an unrecognised method name has been used (as is the
case in the previous example). Once again, unless a suitable vocabulary term
can be found in order to choose a more appropriate code, the data has to be
accepted in its imperfect form.
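  Possible oversized PubMed identifier for 2TXb2Q8o1LU0/z9eFKK/g/gt3qI:
  4294967295
Some PubMed identifiers are clearly invalid. The value shown above, 4294967295,
is the largest unsigned 32-bit integer (2^32 - 1), which suggests a
placeholder or an overflow rather than a genuine identifier. Unfortunately, it
is difficult to correct such entries to use the intended PubMed identifier,
and such references will effectively be useless in any further processing or
analysis.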
  Score entry has illegal format: -
The above format-related warning is currently regarded as serious, but may be
reclassified as trivial in later releases.
To collect only serious warnings from a log of all warnings, the following
command can be used on systems providing the grep tool:
  grep SERIOUS logfile > logfile-serious
A summary of the distinct warning types in the resulting log can be generated
on systems providing the cut and sort tools:
  cut -d ':' -f 3- logfile-serious | sort -u
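
Where these tools are unavailable, the same filtering can be approximated in
Python. The following sketch assumes, as the commands above imply, that
serious warnings contain the text SERIOUS and that the message starts at the
third colon-separated field. The function names are hypothetical, and the
sample lines (including the TRIVIAL marker and the numeric second field) are
invented purely to illustrate that layout; real log lines may differ.

```python
def serious_lines(lines):
    """Keep lines containing 'SERIOUS', like: grep SERIOUS logfile"""
    return [line for line in lines if "SERIOUS" in line]

def warning_types(lines):
    """Return sorted distinct messages, like: cut -d ':' -f 3- | sort -u

    Sorting is by code point, which may differ from a locale-aware sort.
    """
    messages = set()
    for line in lines:
        text = line.rstrip("\n")
        fields = text.split(":")
        if len(fields) > 1:
            # Keep the third field onwards, rejoined with the delimiter.
            messages.add(":".join(fields[2:]))
        else:
            # cut passes through lines that contain no delimiter at all.
            messages.add(text)
    return sorted(messages)

# Invented sample lines; the layout is an assumption, not taken from the parser.
sample = [
    "SERIOUS:12:Score entry has illegal format: -\n",
    "TRIVIAL:13:For interaction type vocabulary term, empty value given as -\n",
    "SERIOUS:14:Score entry has illegal format: -\n",
]
for message in warning_types(serious_lines(sample)):
    print(message)
```

To process a real log, the lines of the file can be passed in directly, for
example with warning_types(serious_lines(open("logfile"))).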
Creating the Database
---------------------
A database can be created using the usual PostgreSQL tools:
  createdb -E unicode mitab_irefindex
This database is initialised as follows:
  psql -f sql/init_mitab.sql mitab_irefindex
Should the database tables need to be dropped (perhaps in case of problems
with the import), the following command can be used:
  psql -f sql/drop_mitab.sql mitab_irefindex
Populating the Database
-----------------------
The database is populated as follows:
  python database_action.py mitab_irefindex sql/import_mitab.sql
As a result, a number of tables representing the structure of the data should
be available in the database. For applications built to use this data, indexes
may need creating in order to make querying more efficient.
Inspecting the Database
-----------------------
A number of reports can be generated from the database. For example:
  python database_action.py mitab_irefindex reports/interactions_by_source.sql
Report and export files are written to the default or specified data
directory.
Differences between Parsing MITAB2.5 and MITAB2.6 Files
-------------------------------------------------------
* Since gene identifiers are no longer provided in MITAB2.6, the
mitab_interactor_genes table will be empty when parsing MITAB2.6 format
files. This table can be populated using the populate_interactor_genes.sql
template with the database_action.py script.
Contact, Copyright and Licence Information
------------------------------------------
The current Web page for this software at the time of release is:
http://irefindex.uio.no/wiki/iRefIndex_MITAB2.6_Parser
The author can be contacted at the following e-mail address:
paul.boddie@biotek.uio.no
Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/gpl-3.0.txt for more information.
New in mitab snapshot 2011-11-29 (Changes since mitab snapshot 2011-03-17)
--------------------------------------------------------------------------
* Added reports testing for RIG identifier collisions and collisions of
inappropriately "sanitised" ROG identifiers.
* Fixed the interactor genes population template.
* Added support for parsing non-integer scores, albeit producing warnings
and not importing such scores into the database.
* Added a test for differing numbers of columns per line.
* Improved and expanded various reports.
New in mitab snapshot 2011-03-17 (Changes since mitab snapshot 2011-02-23)
--------------------------------------------------------------------------
* Added interaction and canonical interaction integer RIG identifier tables.
* Added support for "standard" MITAB2.6 format files which lack the full
range of iRefIndex fields.
* Added schema support in order to populate the same database with different
versions of data or data from different sources.
* Enhanced the database_action.py script so that database templates can make
use of extra parameters.
* Excluded incomplete interactions, relevant to non-iRefIndex data sources.
* Excluded information for interactions not providing RIG identifiers and
interactors not providing ROG identifiers, relevant to non-iRefIndex data
sources.
New in mitab snapshot 2011-02-23 (Changes since mitab snapshot 2010-06-29)
--------------------------------------------------------------------------
* Added support for canonical iRefIndex data in the MITAB2.6 format.
* Introduced a data directory for tidier handling of database import files.
* Changed the import template to use the \copy command instead of the SQL
copy statement, since the latter is generally a superuser-only statement
in PostgreSQL.
* Merged the database access/modification scripts into a single script.
* Added various reports and export templates.
* Changed the handling of MI:0000 names/descriptions in order to more easily
detect and report inappropriate combinations of names and such codes, and
to include such combinations in the database for further analysis.
New in mitab snapshot 2010-06-29 (Changes since mitab snapshot 2010-03-29)
--------------------------------------------------------------------------
* Fixed taxonomy identifier processing for iRefIndex 7.0 (thanks to Jon
Lees' testing).
New in mitab snapshot 2010-03-29 (Changes since mitab snapshot 2009-08-31)
--------------------------------------------------------------------------
* Changed uid-related fields, dropping the redundant "irefindex:" prefix.
* Split aliases, alternatives and interaction identifiers from the MITAB
file, creating a dbname column in the associated tables, and removing the
prefix from the uid-related columns.
* Changed the tables to hold only pertinent information for each particular
concept. This means that a number of interactor-related tables no longer
hold interaction information.
* Upheld "MI:0000" code usage for things like source databases.
* Fixed taxonomy identifiers so that -1 really means "in vitro" as described
in the PSI-MI MITAB specification.
* Fixed empty list recognition, omitting collections if "-" was specified.
* Omitted interaction identifiers where "-" is given for a particular
database.
* Improved the warning framework and adjusted the levels of certain
warnings.
* Changed the templating to employ a temporary file, executed in one step.
* Added tolerance of the iRefIndex 6.1 MITAB format which had one additional
column that this software currently ignores.