Read Me
Introduction
============
PyCancerDB is a system for searching and updating the Cancer Proteomics
Database.
Getting Started
===============
Deploying an instance of the Cancer Proteomics Database software requires the
following steps, each of which is documented in a section of this document:
1. Obtain and install dependencies.
2. Install and configure the database system.
3. Initialise the database (either from scratch or using existing data).
4. Deploy this software as a Web application.
See the docs/maintenance.txt file for instructions on maintaining a deployed
instance of the software and performing related tasks.
Dependencies
============
The bundled markup.py file originates from the markup-1.9 package, obtained
from the following location:
http://markup.sourceforge.net/
You should not need to update this file unless the system misbehaves and
markup.py turns out to be the cause.
psycopg2 is required to access PostgreSQL:
http://initd.org/psycopg/
On Red Hat and Debian systems, install the python-psycopg2 package.
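To confirm that the dependency is available before continuing, a check along
the following lines may help. This is only a sketch: the check_python_module
function and the PYTHON variable are illustrative names that are not part of
this distribution, and the interpreter used by the Web server may differ from
the one found on the command line.

```shell
# Hypothetical helper: succeed if the named module can be imported by the
# interpreter given in PYTHON (an assumed variable, defaulting to "python").
check_python_module() {
    "${PYTHON:-python}" -c "import $1" 2>/dev/null
}

# Example:
#   check_python_module psycopg2 || echo "psycopg2 is not available"
```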
Database System
===============
Currently, only PostgreSQL is supported:
http://www.postgresql.org/
On Red Hat systems, the postgresql, postgresql-libs and postgresql-server
packages provide the necessary functionality.
On Debian systems, install the postgresql package.
Private Installations of the Database
-------------------------------------
A database cluster must be initialised before the database system can be
started; for PostgreSQL, this is done as follows:
initdb --locale=C -D <cluster directory>
For example:
initdb --locale=C -D /storage/data/pycdb
Here, the specified directory will be created and populated by the initdb
program. After this, the system can be started as follows:
pg_ctl -D <cluster directory> -l <log filename> start
For example:
pg_ctl -D /storage/data/pycdb -l /storage/data/pycdb_log.txt start
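Where the start command is run repeatedly (from a login script, for example),
it may be convenient to check first whether the server is already running.
The following sketch relies on "pg_ctl status" returning a non-zero exit code
when no server is running; the start_cluster_if_stopped function is a
hypothetical name and not part of this distribution.

```shell
# Hypothetical helper: start the cluster in the given directory unless
# pg_ctl reports that a server is already running there.
start_cluster_if_stopped() {
    cluster_dir=$1
    log_file=$2
    if pg_ctl -D "$cluster_dir" status >/dev/null 2>&1; then
        echo "already running: $cluster_dir"
    else
        pg_ctl -D "$cluster_dir" -l "$log_file" start
    fi
}

# Example:
#   start_cluster_if_stopped /storage/data/pycdb /storage/data/pycdb_log.txt
```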
Central Installations of the Database
-------------------------------------
On Debian systems, a set of commands is available to manage database
"clusters", which are collections of individual databases. Thus, to initialise
the database system, a command of the following form is required:
sudo pg_createcluster --locale=C <version> <name>
For example:
sudo pg_createcluster --locale=C 9.1 pycdb
This creates a cluster with the given name for PostgreSQL 9.1 (assuming that
this is the version in use). The database system can be started for this
cluster as follows:
sudo pg_ctlcluster <version> <name> start
For example:
sudo pg_ctlcluster 9.1 pycdb start
The cluster will be assigned a particular port number, possibly 5433 for the
first new cluster, and thus any database commands will need an additional
option (-p <port>) to access this cluster. Alternatively, the default cluster
can be dropped if it is not already in use, which removes any data it holds:
sudo pg_dropcluster 9.1 main
A new cluster may then be created, with the default port being set explicitly
if required:
sudo pg_createcluster -p 5432 --locale=C 9.1 pycdb
To list the available clusters, use the pg_lsclusters command.
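Since the port number is needed by later commands, it can be read from the
pg_lsclusters output instead of being copied by hand. The sketch below assumes
the usual column layout (version, cluster name, port, and so on); the
cluster_port function is a hypothetical name.

```shell
# Hypothetical helper: print the port of the cluster with the given version
# and name, as reported by pg_lsclusters (columns: Ver Cluster Port ...).
cluster_port() {
    # Appending "" forces string comparison, so "9.1" never matches "9.10".
    pg_lsclusters | awk -v ver="$1" -v name="$2" \
        '($1 "") == (ver "") && ($2 "") == (name "") { print $3 }'
}

# Example:
#   createuser -p "$(cluster_port 9.1 pycdb)" pycdb
```

Nothing is printed if no cluster matches, so the result is worth checking
before passing it to other commands.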
----
On Red Hat systems, if the database system is to be managed centrally, the
following definition needs to be placed in the /etc/sysconfig/pgsql/postgresql
file:
PGDATA=/storage/data/pycdb
To stop SELinux complaining and refusing to let the database initialise and
start, first run the following commands:
sudo semanage fcontext -a -t var_t /storage
sudo semanage fcontext -a -t postgresql_db_t /storage/data/pycdb
sudo restorecon /storage
Then, use the service program to initialise the database:
sudo service postgresql initdb
Since the database will only be available to local processes, the
authentication settings in /storage/data/pycdb/pg_hba.conf need to be changed
so that the "METHOD" column employs "trust" instead of "ident". This allows a
database user that does not correspond to any system account to access the
database.
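The change to the "METHOD" column can be previewed with a small filter such as
the one sketched below. It assumes that "ident" appears as the last field on
the relevant lines, leaves comment lines alone, and writes to standard output
without modifying the file; the hba_ident_to_trust function is a hypothetical
name.

```shell
# Hypothetical helper: print the contents of a pg_hba.conf file with a
# trailing "ident" method replaced by "trust" on non-comment lines.
# The file itself is not modified, so the result can be reviewed first.
hba_ident_to_trust() {
    sed -e '/^[[:space:]]*#/!s/\([[:space:]]\)ident[[:space:]]*$/\1trust/' "$1"
}

# Example:
#   hba_ident_to_trust /storage/data/pycdb/pg_hba.conf > pg_hba.conf.new
```

Lines using other methods, or "ident" followed by further options, are left
unchanged and would need editing by hand.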
It should now be possible to start the database system:
sudo service postgresql start
Before proceeding, it is useful to define a role or user for yourself in order
to make administration and use of the database more convenient. The following
command switches to the system's postgres account which provides superuser
privileges:
sudo -u postgres -i
Run the createuser command with your own username, indicating that the new
user/role is to be a superuser. Then, log out of the postgres account.
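Whether createuser asks about superuser status interactively depends on the
PostgreSQL version, so the --superuser option can be given to state this
explicitly. The following is a sketch to be run from the postgres account; the
create_superuser_role function is a hypothetical name, and "alice" stands in
for your own username.

```shell
# Hypothetical helper: create a superuser role with the given name without
# relying on interactive prompts.
create_superuser_role() {
    createuser --superuser "$1"
}

# Example (as the postgres user):
#   create_superuser_role alice
```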
Configuration
-------------
First, a database user needs to be created:
createuser <database user>
For example:
createuser pycdb
Where an explicit port number is required for the database system, it can be
specified as in the following example:
createuser -p 5433 pycdb
Then, a database needs to be created:
createdb -E unicode -T template0 <database>
For example:
createdb -E unicode -T template0 pycdb
Where an explicit port number is required for the database system, it can be
specified as in the following example:
createdb -p 5433 -E unicode -T template0 pycdb
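The two configuration commands, with or without an explicit port number, can
be combined as in the following sketch. The create_user_and_database function
is a hypothetical name that is not part of this distribution; it merely runs
the commands shown above in sequence and stops if the first one fails.

```shell
# Hypothetical helper: create the database user and then the database.
# Usage: create_user_and_database <database user> <database> [<port>]
create_user_and_database() {
    db_user=$1
    database=$2
    port=$3
    if [ -n "$port" ]; then
        createuser -p "$port" "$db_user" &&
        createdb -p "$port" -E unicode -T template0 "$database"
    else
        createuser "$db_user" &&
        createdb -E unicode -T template0 "$database"
    fi
}

# Examples:
#   create_user_and_database pycdb pycdb
#   create_user_and_database pycdb pycdb 5433
```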
Initialising the Database
=========================
The database can be initialised in a number of different ways:
* Starting with an empty database
* Starting with a database populated from previously exported data
* Using a dump from the Cell Death Proteomics Database schema
Creating an Empty Database
--------------------------
A script can be invoked to set up an empty database:
scripts/rebuild_database.sh <database> <database user> <data directory>
Here is an example of the script being run with real arguments:
scripts/rebuild_database.sh pycdb pycdb /storage/data
Here, /storage/data is the location of potentially large data files that may
be downloaded during the initialisation process. The current directory "." or
other locations may also be used.
The result should be a database capable of holding the managed
publication-related data, but without any actual records related to
publications. However, various resources related to protein identifiers will
be incorporated into the database.
Where an explicit port number is required for the database system, it can be
specified as in the following example:
scripts/rebuild_database.sh -p 5433 pycdb pycdb /storage/data
Populating the Database from Exported Data
------------------------------------------
The script used above to create an empty database can also set up a
pre-initialised database containing data from a file exported from another
instance of the Cancer Proteomics Database. In this case, it is invoked as
follows:
scripts/rebuild_database.sh <database> <database user> <data directory> <database export file>
Here is an example of the script being run with real arguments:
scripts/rebuild_database.sh pycdb pycdb /storage/data export.tsv
Here, export.tsv is a file exported from the Web interface of another
deployment of the Cancer Proteomics Database software, although in principle
any file providing data in the same form can also be used.
Where an explicit port number is required for the database system, it can be
specified as in the following example:
scripts/rebuild_database.sh -p 5433 pycdb pycdb /storage/data export.tsv
See the docs/maintenance.txt file for more information about preparing data
for import.
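Because rebuilding may involve lengthy downloads, it can be worth checking the
arguments before invoking the script. The wrapper sketched below is not part
of this distribution: the checked_rebuild function and the REBUILD_SCRIPT
variable are hypothetical names, and the check assumes that the data directory
should already exist, which the script itself may not require.

```shell
# Hypothetical wrapper: check the arguments before invoking the script.
# REBUILD_SCRIPT is an assumed variable for overriding the script location.
checked_rebuild() {
    database=$1
    db_user=$2
    data_dir=$3
    export_file=$4
    if [ ! -d "$data_dir" ]; then
        echo "data directory not found: $data_dir" >&2
        return 1
    fi
    if [ -n "$export_file" ] && [ ! -r "$export_file" ]; then
        echo "export file not readable: $export_file" >&2
        return 1
    fi
    # Pass the export file only when one was given.
    "${REBUILD_SCRIPT:-scripts/rebuild_database.sh}" \
        "$database" "$db_user" "$data_dir" ${export_file:+"$export_file"}
}

# Examples:
#   checked_rebuild pycdb pycdb /storage/data
#   checked_rebuild pycdb pycdb /storage/data export.tsv
```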
Populating the Database from Cell Death Proteomics Data
-------------------------------------------------------
The Cell Death Proteomics Database (CDPdb) is the predecessor of the Cancer
Proteomics Database, and its data traditionally resided in an SQLite database.
Obtaining such data in a usable form is discussed below.
A script can be invoked to populate the database from CDPdb data. It is
invoked in the following way:
scripts/make_database.sh <database> <database user> <data directory> <database dump>
Here is an example of the script being run with real arguments:
scripts/make_database.sh pycdb pycdb /storage/data db.psqldump
Here, db.psqldump is a dump of the original SQLite database, converted for
import into PostgreSQL.
Where an explicit port number is required for the database system, it can be
specified as in the following example:
scripts/make_database.sh -p 5433 pycdb pycdb /storage/data db.psqldump
Making a Suitable Database Dump File from Cell Death Proteomics Data
--------------------------------------------------------------------
The convert_sqlite_dump.sh script can be used to convert a database dump
either produced directly from an SQLite database or provided by a previously
obtained dump file. Here is the command syntax:
scripts/convert_sqlite_dump.sh [ <database file> ]
For example, to access a database directly and to write the dump file ready
for PostgreSQL to use:
scripts/convert_sqlite_dump.sh CancerDatabase.sqlite > db.psqldump
To convert an existing SQLite dump file:
scripts/convert_sqlite_dump.sh < db.sqldump > db.psqldump
In the latter example, the existing dump file db.sqldump is presented as
standard input to the script, and in both examples, the converted dump file
db.psqldump is produced from the script's standard output.
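Where both the intermediate SQLite dump and the converted file are to be kept,
the two stages can be chained as sketched below. This assumes that a dump
written by the sqlite3 command-line tool using its .dump command is in the
form the script expects, which has not been verified here; the
dump_and_convert function and the CONVERT_SCRIPT variable are hypothetical
names.

```shell
# Hypothetical helper: dump an SQLite database to a file, then convert that
# dump for PostgreSQL. CONVERT_SCRIPT is an assumed override variable.
dump_and_convert() {
    sqlite_file=$1
    sqlite_dump=$2
    converted_dump=$3
    sqlite3 "$sqlite_file" .dump > "$sqlite_dump" &&
    "${CONVERT_SCRIPT:-scripts/convert_sqlite_dump.sh}" \
        < "$sqlite_dump" > "$converted_dump"
}

# Example:
#   dump_and_convert CancerDatabase.sqlite db.sqldump db.psqldump
```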
Deployment
==========
The install.sh script can be used to copy the resources belonging to this
application into a suitable location. For example:
./install.sh /storage/www/pycdb
Where SELinux is active, such as on Red Hat systems, the following additional
commands are required to make these resources available to the Web server:
sudo semanage fcontext -a -t default_t /storage
sudo semanage fcontext -a -t default_t /storage/www
sudo semanage fcontext -a -t httpd_sys_content_t '/storage/www/pycdb(/.*)?'
sudo restorecon -R /storage
sudo restorecon -R /storage/www
sudo restorecon -R /storage/www/pycdb
For access to networked resources, such as the retrieval of article metadata
from Europe PubMed Central and other locations, the following command is
needed:
sudo semanage boolean --on httpd_can_network_connect
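The state of the setting can be confirmed afterwards with getsebool, which
reports booleans in the form "<name> --> on". The following sketch wraps that
check; the selinux_boolean_is_on function is a hypothetical name.

```shell
# Hypothetical helper: succeed if the named SELinux boolean is reported as
# "on" by getsebool, whose output has the form "<name> --> on".
selinux_boolean_is_on() {
    [ "$(getsebool "$1" | awk '{ print $3 }')" = "on" ]
}

# Example:
#   selinux_boolean_is_on httpd_can_network_connect || echo "still off"
```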
To configure Apache for the deployed application, copy and adjust either the
pycdb.conf file or the pycdb-virtual-host.conf file found in the resources
directory of this distribution, placing the chosen file in the appropriate
location. For example:
cp resources/pycdb.conf /etc/httpd/conf.d/
If the virtual host configuration is used (pycdb-virtual-host.conf), make sure
that the main Apache configuration file has named virtual hosts enabled. For
example, the /etc/httpd/conf/httpd.conf file (on Red Hat) will need to contain
a line as follows:
NameVirtualHost *:80
Finally, restart or start Apache as follows:
sudo service httpd restart
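Since a mistake in the copied configuration file can prevent Apache from
starting, the configuration can be checked before restarting. The sketch below
uses "apachectl configtest" for this purpose; the restart_apache_if_valid
function is a hypothetical name, and the check may itself need superuser
privileges on some systems.

```shell
# Hypothetical helper: restart Apache only if its configuration parses.
restart_apache_if_valid() {
    if apachectl configtest; then
        sudo service httpd restart
    else
        echo "Apache configuration has errors; not restarting" >&2
        return 1
    fi
}

# Example:
#   restart_apache_if_valid
```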
Other Documents
===============
The docs directory contains a collection of documents describing the design
and implementation of this system and the database representation.
Contact, Copyright and Licence Information
==========================================
At the time of release, the Web page for this software is:
https://sourceforge.net/p/pycancerdb/
The current maintainer can be contacted at the following e-mail address:
paul@boddie.org.uk
Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/gpl-3.0.txt for more information.