The irdata distribution is a collection of software for building iRefIndex
releases.
The following software is required to use this distribution:
* PostgreSQL (to host the database)
* The PostgreSQL client program, psql, and database management tools
* A POSIX-like shell and environment (for the high-level scripts)
* Python (tested with 2.5.4, for the tools)
* cmdsyntax (command option processing)
* libxml2dom (HTML parsing for the manifest generation)
* libxml2 (required by libxml2dom)
* The jar utility (required to package iRefScape data)
Most Unix-based operating systems will provide the necessary commands for the
high-level scripts, but these commands may be provided separately or
explicitly on some platforms by packages such as GNU Coreutils and Findutils.
Amongst the commands used are the following:
cat, cp, grep, gunzip, head, mv, rm, sort, tail, tee, xargs
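Whether these commands are present can be checked before running the
high-level scripts. The following is a minimal sketch and not part of the
distribution: the check_commands function is a hypothetical helper, and it
relies only on the standard "command -v" facility of POSIX shells.

```shell
#!/bin/sh
# Report which of the given commands cannot be found on the PATH.
# check_commands is a hypothetical helper, not part of irdata.
check_commands() {
    missing=""
    for cmd in "$@"; do
        # command -v exits with a non-zero status when nothing is found
        if ! command -v "$cmd" >/dev/null 2>&1; then
            missing="$missing $cmd"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing commands:$missing"
        return 1
    fi
    echo "All commands found."
}

check_commands cat cp grep gunzip head mv rm sort tail tee xargs \
    || echo "Install the missing commands before continuing."
```

The same function can be given other names, such as psql or mysql, to check
for the database client programs.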
In addition, where a previous release resides in a database system such as
MySQL, the MySQL client program, mysql, must be installed.
See the "Resources" section for download information.
The documentation is in a format that can be used with MoinMoin (and the
ImprovedTableParser extension) for deployment on the Web.
See docs/pages/Project for details of how this distribution is arranged and
See docs/pages/Schema for information about the database schema.
See docs/pages/Sources for details of data source formats and issues.
Configuring the Software
A configuration script called irdata-config is located in the scripts
directory of this distribution. It may be edited or copied to another location
on the PATH of any user running the software.
Before continuing, enter the distribution directory (normally containing this
README.txt file) and copy the irdata-config file into the current directory as
follows:
cp scripts/irdata-config .
The details in the file can now be reviewed and edited. If an installation is
performed, any edits after installation can be incorporated into that
installation by once again running the command given in "Performing an
Installation" in the distribution directory.
Reserving a Location for Data
The configuration script contains a setting dedicated to the data downloaded
and processed by the software. By default, it looks like this:
DATA= # user defined data directory location
Left in this state, the system will attempt to locate the data relative to the
installed software. However, it can be beneficial to explicitly choose a
location, especially if the data will reside in a separate partition from the
installed software. For example:
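As an illustration only, the setting might be changed as sketched below. The
path is an assumption chosen to match the /home/irefindex examples used
elsewhere in this document; it is not a required value.

```shell
# Hypothetical location; any writable directory with sufficient space will do.
DATA=/home/irefindex/data       # user defined data directory location
```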
Note that this DATA setting is not connected with the database system that
will also be used to store and process data during the build process. See below
for database system configuration information.
Fine-Tuning the Data Source Details
In the section of the configuration script concerned with source locations and
details, the VERSION and DOWNLOAD_FILES settings specific to data sources may
need updating to take new releases of data into account. Unfortunately, this
cannot be done automatically due to the complexity of having to deal with the
widely differing mechanisms employed by data providers to publish their data.
In the "Downloading Source Data" section below, a method is provided to view
the locations of configured data sources, and manual inspection of each
resource's Web site may then lead to the discovery of new data. The details of
such new data can then be provided in the configuration and any system-wide
configuration updated as described in "Performing an Installation".
Configuring an Installation of the Software
Once the prerequisites have been installed, the software can be run from the
distribution directory. If you choose to do this, you can skip this and the
following installation sections. Make sure, in this case, to leave SYSPREFIX
blank in the irdata-config file:
SYSPREFIX= # system-wide installation root
Alternatively, a system-wide installation can be performed or prepared using
the setup.py script provided. You can choose the conventional system root as
follows, although this is not recommended:
SYSPREFIX=/ # system-wide installation root
The reason for not recommending this is that programs would be installed in
/usr/bin, and other resources in other locations that should normally be
managed by the system's package manager. If you would prefer to install the
software centrally in this way, please consider using a packaged version of
the software.
If a system-wide installation is to reside in a directory hierarchy other than
the conventional system root, the SYSPREFIX setting should be adjusted to
reflect this. For example:
SYSPREFIX=/home/irefindex # system-wide installation root
This setting specifies the directory at the top of the desired hierarchy. Upon
installing the software, given this example, programs would be placed in
Even if a system-wide installation ends up with inappropriate settings, such
settings can be overridden as described in "Configuring the Software".
Performing an Installation
With the irdata-config file modified, the setup.py script can then be run:
python setup.py install --prefix=/home/irefindex/usr
Note that SYSPREFIX will be /home/irefindex in this case: the setup.py script
needs the additional "/usr" to know where to install programs and resources.
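The relationship between SYSPREFIX and the --prefix argument can be sketched
as follows. This is only an illustration of the rule described above; the
PREFIX variable is a name introduced here and is not used by the distribution.
The command is printed rather than executed.

```shell
#!/bin/sh
# Derive the setup.py prefix from SYSPREFIX by appending "/usr".
SYSPREFIX=/home/irefindex            # example value from this document

# Strip any trailing slash so that "/" yields "/usr" rather than "//usr".
PREFIX="${SYSPREFIX%/}/usr"

# Only print the command here; run it manually once the value looks right.
echo "python setup.py install --prefix=$PREFIX"
```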
Setting PATH and PYTHONPATH
With a SYSPREFIX other than / (the conventional system root), such as
/home/irefindex, the PATH and PYTHONPATH variables in the environment need to
be modified so that the shell can find the installed programs and libraries.
To obtain suggested definitions of these variables, run the following command
in the distribution directory of this software:
The command should produce output resembling the following for a SYSPREFIX of
These definitions can be executed in the shell, and they can also be saved in the
appropriate shell configuration file, such as in .profile, .bashrc,
.bash_profile or any other appropriate file in a user's home directory.
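The general shape of such definitions is sketched below. This is an
assumption-laden illustration and not the output of the distribution's own
command: the library directory in particular depends on the Python version and
the operating system, and PYVERSION is a name introduced here for convenience.

```shell
#!/bin/sh
# Assumed layout: programs under $SYSPREFIX/usr/bin and Python libraries under
# $SYSPREFIX/usr/lib/python<version>/site-packages. The library path is an
# assumption; it varies between Python versions and operating systems.
SYSPREFIX=/home/irefindex
PYVERSION=2.5                        # hypothetical; match the Python in use

export PATH="$SYSPREFIX/usr/bin:$PATH"
# Append any existing PYTHONPATH only if one is already defined.
export PYTHONPATH="$SYSPREFIX/usr/lib/python$PYVERSION/site-packages${PYTHONPATH:+:$PYTHONPATH}"
```

Prefer the definitions suggested by the software itself where they differ from
this sketch.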
Initialising a Location for Data
Before any operations can be performed using the software installation,
various data and resource locations must be initialised. This can be done as
follows:
Any required directories that are not already present will be reported as
Creating a Database Cluster
On systems that already provide databases, it may not be necessary to create a
database cluster. Nevertheless, it can be worth checking to see if any
existing database cluster is appropriately configured, and this is described
below.
Due to limitations with PostgreSQL and the interaction between locales and the
sorting/ordering of textual data, it is essential that the database be
initialised in a "cluster" with a locale that employs the ordering defined for
ASCII character values. Such a cluster can be defined as follows:
initdb -D /home/irefindex/pgdata --no-locale
On Debian-based systems (including Ubuntu and derivatives), a cluster can be
defined using a special command, in the following example specifying a
PostgreSQL version of 8.2 and a cluster name of irdata:
pg_createcluster --locale=C 8.2 irdata
Note that the cluster's data directory is different from the data directory
employed by this software to collect source data and to deposit processed
data.
Note also that the default location of clusters is typically in the
/var/lib/postgresql region of the filesystem, at least for Debian packages of
PostgreSQL, which can lead to disk space issues since /var is often given a
partition of limited size or resides within the root partition which may
itself have a limited size.
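The space available in the default cluster location can be inspected before
creating a cluster. The following is a minimal sketch: available_kb is a
hypothetical helper, and the directories tested are the typical Debian
locations mentioned above, which may not apply to other systems.

```shell
#!/bin/sh
# Print the available space, in kilobytes, on the filesystem holding a path.
# available_kb is a hypothetical helper; df -P gives portable POSIX output.
available_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Fall back to /var if PostgreSQL has not created its directory yet.
for dir in /var/lib/postgresql /var; do
    if [ -d "$dir" ]; then
        echo "$dir: $(available_kb "$dir") KB available"
        break
    fi
done
```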
To choose an alternative location for a cluster, add the -d option:
pg_createcluster --locale=C -d /home/irefindex/databases 8.2 irdata
A cluster can be started as follows:
pg_ctl -D /home/irefindex/pgdata start
On Debian-based systems, the following command is used instead:
pg_ctlcluster 8.2 irdata start
To list the available clusters on Debian-based systems:
This should show, amongst other things, the location, status, locale and port
number associated with each of the available clusters.
Connecting to Databases and Clusters
See the documentation for PostgreSQL and the various tools (createdb, psql)
for details of connecting to a specific cluster. Generally, the -p option is
used to direct an operation towards a particular cluster. For example, for a
cluster listening on port 5433, the following command lists the available
databases:
psql -p 5433 -l
Any connection options must be given in the configuration of this software
using the PSQL_OPTIONS setting. For example, for a cluster listening on port
5433 the following could be used in the configuration file:
PSQL_OPTIONS="--psql-options -p 5433"
If the use of a separate cluster is undesirable, PostgreSQL 9.1 or later could
be used by employing various explicit "collate" declarations in certain column
declarations or in various SQL statements where ROG identifiers are being
retrieved in a particular order. This is not currently supported.
Creating a Database User
It is recommended that iRefIndex be run using a separate database user or
role, and this user can be set up as follows:
(Additional connection options should be specified to affect the appropriate
cluster.)
Although making the new user a superuser may appear excessive, doing so will
allow the user to create databases, tables and other objects without any
The choice of username can also be important. PostgreSQL is able to associate
system users with database users, and so any database user should have the
same name as the system user running the iRefIndex software in order to take
advantage of this feature. If the way databases are managed in your own
environment diverges from this practice, you may choose another username
instead, but this will then need to be specified in the connection options
Creating the Database
Once a database cluster has been started, a database can then be created using
the usual PostgreSQL tools:
(Additional connection options should be specified to affect the appropriate
cluster.)
Configuring the Database
PostgreSQL configuration can be challenging. An example configuration can be
found in the docs directory in the form of the postgresql.conf file. Although
the settings have been known to change from one release of PostgreSQL to the
next, the following appear to be crucial:
shared_buffers (25% of RAM where 1GB or more is available)
effective_cache_size (50% of RAM)
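The proportions above can be turned into concrete values with a little
arithmetic. The following sketch assumes a Linux host exposing /proc/meminfo
and a PostgreSQL release that accepts values with an MB unit; suggest_settings
is a hypothetical helper and does not treat systems with less than 1GB of RAM
differently.

```shell
#!/bin/sh
# Suggest values following the guidance above: 25% of RAM for shared_buffers
# and 50% of RAM for effective_cache_size. suggest_settings is a hypothetical
# helper taking the total memory in kilobytes.
suggest_settings() {
    total_kb=$1
    echo "shared_buffers = $((total_kb / 4 / 1024))MB"
    echo "effective_cache_size = $((total_kb / 2 / 1024))MB"
}

# On Linux, total memory can be read from /proc/meminfo (an assumption about
# the host; other systems expose this information differently).
if [ -r /proc/meminfo ]; then
    suggest_settings "$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)"
fi
```

The printed lines are only suggestions to be reviewed before being copied into
postgresql.conf.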
For non-interactive systems, the autovacuum feature can be switched off. This
helps to avoid contention due to table locking performed by the autovacuum
More information can be found in the PostgreSQL documentation:
The shared memory limit for the system on which the database will be hosted
will need to be checked and possibly changed. This topic is covered in the
above documentation, but to summarise, the kernel.shmmax parameter should be
inspected as follows:
It can be set to a certain number of bytes as follows:
sysctl -w kernel.shmmax=1073741824 # 1024 * 1024 * 1024 == 1GB
To make this setting permanent, either edit the /etc/sysctl.conf file or add a
file to the /etc/sysctl.d directory, if present, and write the following line:
kernel.shmmax = 1073741824
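For limits other than 1GB, the number of bytes can be computed in the shell.
This is a small sketch; SHMMAX_GB and SHMMAX_BYTES are names introduced here
for illustration only.

```shell
#!/bin/sh
# Compute a kernel.shmmax value in bytes from a size in gigabytes.
SHMMAX_GB=1
SHMMAX_BYTES=$((SHMMAX_GB * 1024 * 1024 * 1024))

# Print the line in the form used by /etc/sysctl.conf.
echo "kernel.shmmax = $SHMMAX_BYTES"
```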
For all operations changing the configuration, you will need to have root
privileges.
Initialising the Database
(Note that you can defer this step until you are ready to import data,
described in the "Importing Source Data" section below.)
Once the database system has been started, the database used by this software
can be initialised using the following command:
Should the need arise for the removal of schema information from the database,
the following command can be used:
irdrop --drop --all
However, it may be more convenient to issue the dropdb command on the database
and recreate it as described above.
To drop only the build products and not imported source data, run the
following command:
irdrop --drop --build
To reinitialise the build products, the following is then required:
irinit --init --build
If you need to reinitialise the database, you can jump ahead to "Importing
Source Data" after doing so, or if only the build products have been
reinitialised, you can jump ahead to "Finishing the Build" instead.
Downloading Source Data
Source data is downloaded using the following command:
Any sources that could not be downloaded in their entirety will be reported as
having failed. It is then necessary to attempt to download them individually
and potentially investigate any underlying problems with each of the download
The locations of published data can be shown using the following command:
For nicer tabulation, use the column command in addition to the above:
irdownload --show-locations | column -t
In most cases, a plain URL is listed, and with this information it is then
generally possible to manually inspect a download site and to find any new,
updated or moved data files. This information can then be added to the
configuration as noted in the "Fine-Tuning the Data Source Details" section
above.
Generating Manifest Information
Manifest/release information for the data sources is generated using the
following command:
Any sources that could not provide manifests will be reported as having
failed. Re-running irmanifest with specific source names will add information
for those sources to the manifest file, although some investigation of
problems related to manifest/release information retrieval may be necessary.
Unpacking Source Data
The downloaded data is typically provided in the form of compressed archives
potentially containing many individual files. Before parsing can be performed,
such archives must be unpacked, and this can be done for all sources as
follows:
Additional options are available to uncompress all downloaded files, which can
be useful for inspecting the data, but the parsing process should be able to
handle compressed single files in gzip format and thus avoid expanding such
files in the filesystem.
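The general idea of reading a gzip-compressed file without expanding it in the
filesystem can be illustrated with the standard tools. This sketch is not how
the parsing tools are implemented; the file name and contents are made up, and
mktemp is assumed to be available.

```shell
#!/bin/sh
# Read a gzip-compressed file through a pipe so that no uncompressed copy is
# written to the filesystem. The file and its contents are made up.
workdir=$(mktemp -d)
printf 'first line\nsecond line\n' > "$workdir/example.txt"
gzip "$workdir/example.txt"          # leaves only example.txt.gz

# gunzip -c writes the uncompressed data to standard output.
first=$(gunzip -c "$workdir/example.txt.gz" | head -n 1)
echo "$first"

rm -r "$workdir"
```

The same pattern is convenient for a quick look at a large downloaded file
when disk space is limited.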
Parsing Source Data
The source data must be parsed and converted to a form that can be imported
into the database. Before attempting to parse data, the presence of the
required data files should be established:
irparse --no-parse --all
It is also recommended that XML data is checked for correctness using a
command of the following form:
irparse --check --all
See the section below on handling invalid source data if this command produces
errors.
Parsing of the source data is done as follows:
Once parsed, the import data will reside in an "import" subdirectory of the
main data directory. Thus, if the main data directory is /home/irefindex/data
then the import data will reside in /home/irefindex/data/import. Parsing
errors will be reported on standard error.
Handling Invalid Source Data
Currently, the only serious case of invalid data is the lack of proper
encoding information in BIND Translation data files, causing errors resembling
the following:
irparse-source: Examining BIND_TRANSLATION...
irparse-source: File /home/irefindex/var/lib/irdata/data/BIND_Translation/taxid10090_PSIMI25.xml in source BIND_TRANSLATION failed.
irparse-source: File /home/irefindex/var/lib/irdata/data/BIND_Translation/taxid9606_PSIMI25.xml in source BIND_TRANSLATION failed.
irparse-source: Source BIND_TRANSLATION had invalid data.
These files can be fixed by adding a proper XML declaration with encoding
details as follows:
irparse --fix BIND_TRANSLATION
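To see which files lack such a declaration, a check along the following lines
can help. This is a sketch only and does not reproduce what irparse does:
has_encoding_declaration is a hypothetical helper, the sample files are made
up, and the UTF-8 value is a placeholder rather than a statement about the
encoding actually needed by the BIND Translation files.

```shell
#!/bin/sh
# Report whether a file starts with an XML declaration naming an encoding.
# has_encoding_declaration is a hypothetical helper, not part of irparse.
has_encoding_declaration() {
    head -n 1 "$1" | grep -q '^<?xml[^>]*encoding='
}

# Create two made-up sample files: one without and one with a declaration.
workdir=$(mktemp -d)
printf '<entrySet/>\n' > "$workdir/bad.xml"
printf '<?xml version="1.0" encoding="UTF-8"?>\n<entrySet/>\n' > "$workdir/good.xml"

for f in "$workdir/bad.xml" "$workdir/good.xml"; do
    if has_encoding_declaration "$f"; then
        echo "$(basename "$f"): declaration present"
    else
        echo "$(basename "$f"): declaration missing"
    fi
done

rm -r "$workdir"
```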
Although the --fix option can be used for all data sources, this is not
generally recommended because the nature of errors may vary and need proper
Fixed sources can be parsed individually once fixed. For example:
Importing Source Data
Source data is imported into the database using the following command:
Each imported source should have its name emitted on standard output. Errors
are produced on standard error.
To perform a cursory check for the presence of data for all sources, run the
following command:
irimport --check --all
A list of imported sources will be produced on standard output. Any missing
sources will be reported in messages written to standard error.
Obtaining Integer Identifiers from Previous Releases
Although iRefIndex employs unique identifiers in the form of RIG and ROG
identifiers, it also maintains sequential numbering for interactions and
interactors in order to more easily support applications whose notion of
identifiers is limited to integers. Since correspondences between identifier
types will have been defined by previous iRefIndex releases, such resources
should be extracted from their release databases and then imported into the
current release database in order to refer to known entities in a fashion
consistent with previous releases.
Integer identifiers are obtained from a previous release using the following
command:
irprevious --pgsql <database>
In the above form, with <database> substituted with an actual database name,
the identifiers will be exported from a PostgreSQL database system.
For MySQL-based releases of iRefIndex, the following command is required:
irprevious --mysql -h <host> -u <username> -p -A -D <database>
In this form, each of the placeholders must be substituted with the relevant
values. In addition, other options may be employed after the --mysql argument
in addition to or in place of those shown in order to connect to the database
Finishing the Build
Once the source data resides in the database, it is processed by a sequence of
operations that can be invoked as follows:
If reports are to be generated, this can be done by specifying the --reports
option when building or by running the command with only that option
The report output includes a summary Wiki page featuring a selection of
individual reports which can be published when the build has been completed.
Output files are generated using the following command:
The primary output format is PSI-MI TAB, also known as MITAB.
Uploading the Output Files
Traditionally, iRefIndex releases have been published in a directory structure
having a particular form. Given a particular root directory for an area of the
filesystem exposed via FTP or HTTP (or another mechanism) for the purpose of
downloading the release data, such as...
...the following command can be used to copy the MITAB release data into such
a directory structure:
irupload --upload /home/ftp/irefindex --mitab
The result of this command will be the construction of a hierarchy of
directories of the following form:
Thus, the following hierarchy will be created for the example root directory
given above and a release number of 10.0:
Similarly, the iRefScape data can be published as follows:
irupload --upload /home/ftp/irefindex --irefscape
The result of this command will be a different hierarchy:
And, for the example root directory and release number, the following
hierarchy will be created:
It has also been the accepted convention to provide a symbolic link to direct
users to the "current" release. This link can be set up in the published
directory hierarchy by using the following commands:
irupload --update-current /home/ftp/irefindex --mitab
irupload --update-current /home/ftp/irefindex --irefscape
Thus, the current release can be updated after the release data has been
After issuing the above commands, symbolic links will be created in the
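The effect of pointing a "current" link at a release can be illustrated with
ordinary shell commands. This sketch does not reproduce what irupload does:
update_current is a hypothetical helper, and the directory names are
placeholders created in a temporary location, not the real published layout.

```shell
#!/bin/sh
# Point a "current" symbolic link at a release directory. The directory names
# are placeholders in a temporary location, not the real published layout.
root=$(mktemp -d)
mkdir "$root/release-9.0" "$root/release-10.0"

update_current() {
    # Remove any existing link first so that ln does not create the new link
    # inside the directory that the old link points to.
    rm -f "$1/current"
    ln -s "$2" "$1/current"
}

update_current "$root" release-9.0
update_current "$root" release-10.0   # the link now refers to release-10.0
ls -l "$root/current"
```

A relative link target is used so that the link remains valid if the whole
hierarchy is moved or exposed under a different path by a server.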
Contact, Copyright and Licence Information
The current Web page for this software at the time of release is:
The current maintainer can be contacted at the following e-mail address:
Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/gpl-3.0.txt for more information.
The following locations provide the prerequisites for this system:
(jar should be provided by the JDK)
The intention is that operating system packages should provide such
prerequisites, but there remains a possibility that not all prerequisites will
be packaged for all operating system distributions.