Laboratório de Biodados

MaLe-PSI-BLAST

User guide for Standalone MaLe-PSI-BLAST installation and usage

Authors:

  • Henrique Assis Lucas Ribeiro
  • Tetsu Sakamoto
  • J Miguel Ortega

1. Introduction

MaLe-PSI-BLAST is a tool based on BLAST/PSI-BLAST, supervised machine learning techniques and text mining
that suggests annotations for uncharacterized proteins.

2. Installation

2.1. Requirements

Before downloading and installing MaLe-PSI-BLAST, make sure that your computer or server meets the following requirements:

  • UNIX platform;
  • 6 GB of free disk space;
  • MySQL;
  • Java;
  • Perl.
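
The requirements above can be checked from a terminal before going further. The following is a minimal sketch and is not part of the MaLe-PSI-BLAST distribution: the function names missing_tools and has_free_space are made up for this example, and the threshold used in the usage lines is simply the 6 GB from the list, expressed in 1 KB blocks.

```shell
# Print every command name given as an argument that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed if directory $1 has at least $2 kilobytes available.
# "df -Pk" prints one line per file system in 1 KB blocks;
# column 4 of the second line is the available space.
has_free_space() {
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail" -ge "$2" ]
}
```

After pasting both functions into your shell, you could run:

> missing_tools mysql java perl
> has_free_space . 6291456 || echo "less than 6 GB free here"

The first command prints nothing when all three programs are found.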

2.1.1. MySQL

MaLe-PSI-BLAST requires MySQL with a database containing UniProt accession IDs together with their taxonomic data and protein descriptions. If you are not familiar with installing MySQL on a UNIX platform, follow these steps.
If you're using Fedora, CentOS or RedHat, type this command to download and install MySQL:

> yum install mysql mysql-server

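If you're using Debian or Ubuntu, use this command instead:

> apt-get install mysql-server mysql-client

After the installation, type these commands to configure MySQL (they apply to Fedora, CentOS and RedHat; on Debian and Ubuntu the service is usually named mysql rather than mysqld and is normally started by the package installer):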

> chkconfig  mysqld on
> mysql_install_db
> service mysqld start

Then, create an administrator account (usually referred to as root) and a password for it by typing:

> mysqladmin -u root password "[yourpassword]"

Now you are able to access the administrator account of your MySQL with the username “root” and password “[yourpassword]” by typing in the Terminal:

> mysql -u root -p

You can use this user account or create another to store the database required by MaLe-PSI-BLAST (see Creating and configuring MaLe-PSI-BLAST database).
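
If you prefer a separate account over root, the sketch below prints the SQL statements that would create one and give it access to a single database. It is only an illustration: the function name account_sql and the account name maleuser are hypothetical, and malepsi is the database name used later in this guide. Values containing single quotes are not escaped.

```shell
# Print the SQL that creates a local MySQL account and grants it full
# access to one database. Arguments: USER PASSWORD DATABASE.
# Assumption: none of the arguments contains a single quote.
account_sql() {
    cat <<EOF
CREATE USER '$1'@'localhost' IDENTIFIED BY '$2';
GRANT ALL PRIVILEGES ON $3.* TO '$1'@'localhost';
EOF
}
```

The statements can be reviewed first and then sent to MySQL through the administrator account:

> account_sql maleuser "[yourpassword]" malepsi | mysql -u root -p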

2.1.2. Java

The main code of MaLe-PSI-BLAST was developed in Java (v.1.6.0). To install Java on your computer, type the command below that corresponds to your Unix distribution in a Terminal:
Debian, Ubuntu, etc.

> sudo apt-get install openjdk-7-jre

Fedora, Oracle Linux, Red Hat Enterprise Linux, etc.

> su -c "yum install java-1.7.0-openjdk"

2.1.3. Perl

MaLe-PSI-BLAST has a pipeline written in Perl that generates a phylogenetic tree and calculates the phylogenetic distance between the query and every sequence retrieved by the analysis. The Perl scripts used in MaLe-PSI-BLAST were written and tested on Perl v.5.18.2 (and also tested on Perl v.5.10.1). They require some modules from BioPerl v.1.6.1, which are already included in MaLe-PSI-BLAST.

Perl is usually already installed on a UNIX platform. If it is not, download the source code from http://www.perl.org/get.html and type the following commands in a Terminal (you have to be logged in as root):

> tar -xvzf perl-x.x.x.tar.gz       #1
> cd perl-x.x.x                     #1

> sh Configure -de
> make                              #2
> make test
> make install

#1: replace “x.x.x” with the Perl version that you downloaded.
#2: the gcc compiler is required to build Perl.

2.2. Downloading MaLe-PSI-BLAST

MaLe-PSI-BLAST is available at https://sourceforge.net/projects/malepsiblast/files/latest/download
After downloading the file, open a terminal, go to the directory where MaLe-PSI-BLAST was downloaded and extract the contents of MaLe-PSI-BLAST.tar.gz by typing:

> tar -zxvf MaLe-PSI-BLAST.tar.gz

2.3. Creating and configuring MaLe-PSI-BLAST database

MaLe-PSI-BLAST uses a database with four tables: idmapping, id_txis_2012_02, tax_names and tax_simple. It is configured to read this information from a MySQL database. Here we assume that you have MySQL installed on your computer and a MySQL account (if not, see section MySQL in Requirements). To create and fill a MySQL database for MaLe-PSI-BLAST, follow these procedures:

2.3.1. Creating a database for MaLe-PSI-BLAST

Open a Terminal and access your account on MySQL by typing the command:

> mysql --user=[yourUsername] -p

In the MySQL environment, create a database named malepsi by typing this command:

mysql > create database malepsi;

All tables required by MaLe-PSI-BLAST will be added to the malepsi database. To confirm that the database was created, type the command

mysql > show databases;

and look for the name malepsi in the list. If you find it, you can exit MySQL (by typing “exit”) and proceed to the next step.

2.3.2. Populating malepsi database

All tables required by MaLe-PSI-BLAST are in the folder MaLe-PSI-BLAST/resources/sql_dump. If you enter this folder in your console by typing

> cd MaLe-PSI-BLAST/resources/sql_dump

you will find four files, one for each table that must be in the malepsi database:

  • idmapping.sql
  • id_txis_2012_02.sql
  • tax_names.sql
  • tax_simple.sql

To add these tables to the malepsi database, type the following commands:

> mysql --user=[username] --password=[password] malepsi < idmapping.sql
> mysql --user=[username] --password=[password] malepsi < id_txis_2012_02.sql
> mysql --user=[username] --password=[password] malepsi < tax_names.sql
> mysql --user=[username] --password=[password] malepsi < tax_simple.sql

Here [username] and [password] are the name of your MySQL account and its password, respectively. This procedure may take a couple of hours. Afterwards, check that all four tables are in the malepsi database by accessing your MySQL account (> mysql -u [username] -p) and typing the following commands in the MySQL environment:

mysql > use malepsi;
mysql > show tables;

This shows a list of all tables in the malepsi database. Make sure that the names of all four tables are listed. If everything is correct, go to the next step to configure MaLe-PSI-BLAST to access the database that you just created.
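
The loading and checking steps of this section can also be scripted. The sketch below is one possible way to do it and is not shipped with MaLe-PSI-BLAST: the function names load_tables and missing_tables are made up, while the mysql options are the same ones used in the commands above. The import stops at the first missing file or failed load.

```shell
# The four tables (and dump file base names) listed in this section.
TABLES="idmapping id_txis_2012_02 tax_names tax_simple"

# Load each dump into a database.
# Arguments: USERNAME PASSWORD DATABASE DUMP_DIRECTORY.
load_tables() {
    for t in $TABLES; do
        [ -f "$4/$t.sql" ] || { echo "missing dump: $4/$t.sql" >&2; return 1; }
        mysql --user="$1" --password="$2" "$3" < "$4/$t.sql" || return 1
    done
}

# Read table names on standard input (one per line) and print the
# expected tables that are absent from that list.
missing_tables() {
    listed=$(cat)
    for t in $TABLES; do
        printf '%s\n' "$listed" | grep -qxF "$t" || echo "$t"
    done
}
```

After pasting the functions into your shell, you could run, from the directory that contains the MaLe-PSI-BLAST folder:

> load_tables [username] [password] malepsi MaLe-PSI-BLAST/resources/sql_dump
> mysql --user=[username] -p -N -e "show tables" malepsi | missing_tables

The second command prints nothing when all four tables are present.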

2.3.3. Configuring MaLe-PSI-BLAST to access the database

In the MaLe-PSI-BLAST folder, there is a file called config.xml where you can configure some parameters of MaLe-PSI-BLAST (for more details see Advanced topics), including the location of the database required by MaLe-PSI-BLAST. Here, we will edit config.xml to set the correct information about the database that you created in the previous section.
First, go to the MaLe-PSI-BLAST folder and open config.xml with the “vi” editor by typing in your console:

> vi config.xml

You will see text in XML format with an sql tag (<sql>) in the middle of the file. In <sql> there are four more tags named value (<value>), like this:

<sql>
    <value name="database">malepsi</value>
    <value name="host">localhost</value>
    <value name="user">username</value>
    <value name="password">password</value>
...

The value tags named “database”, “host”, “user” and “password” carry the information that lets MaLe-PSI-BLAST access your MySQL account and the database. Edit this part of the file according to your own MySQL account and database. For example, if your username is “root”, your password is “root123” and the database is named “malepsi”, the file has to be edited as follows:

...
<sql>
    <value name="database">malepsi</value>
    <value name="host">localhost</value>
    <value name="user">root</value>
    <value name="password">root123</value>
...

You can edit this file using any text editor. If you are restricted to editing it in a Terminal and are not familiar with “vi”, follow these steps.
After opening config.xml with the command

> vi config.xml

press the “i” key on your keyboard. This puts you in “INSERT” mode, which allows you to write or erase characters in the file as in a normal text editor. Use the arrow keys to move the cursor to the part of the file that needs modification and replace it with your MySQL account data (username and password) and the MaLe-PSI-BLAST database name.

After finishing all modifications, press the “ESC” key and then type “:x!” (without the double quotes) followed by ENTER. This saves all modifications to the file and exits “vi”. If you made a mistake and want to leave the editor without saving the modifications, press the “ESC” key and then type “:q!” (without the double quotes) followed by ENTER.
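
As an alternative to an interactive editor, the same change can be made with sed. The function below is a sketch under stated assumptions, not a tool provided by MaLe-PSI-BLAST: the name set_config_value is made up, each <value> tag is assumed to sit on a single line as in the excerpt above, and the new value must not contain the characters |, & or \ because they are special to sed in this command. Only lines between <sql> and </sql> (or the end of the file) are touched.

```shell
# Replace the text of <value name="NAME">...</value> inside the <sql> block.
# Arguments: FILE NAME VALUE.
set_config_value() {
    tmp="$1.tmp.$$"
    # Write the edited copy to a temporary file, then move it into place.
    sed "/<sql>/,/<\/sql>/s|\(<value name=\"$2\">\)[^<]*\(</value>\)|\1$3\2|" "$1" > "$tmp" &&
        mv "$tmp" "$1"
}
```

With the example account used above, the calls would be:

> set_config_value config.xml user root
> set_config_value config.xml password root123

It is a good idea to copy config.xml before trying this, since the layout of the rest of the file is not shown in this guide.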

Now you’re able to run MaLe-PSI-BLAST on your computer. Go to the MaLe-PSI-BLAST usage section to see how it works.

3. MaLe-PSI-BLAST usage

If you followed all the MaLe-PSI-BLAST installation instructions, your computer is ready to run a MaLe-PSI-BLAST analysis. To run MaLe-PSI-BLAST, go to the directory /MaLe-PSI-BLAST/bin/ and type the following command line (see Input parameters for more detail):

> java -jar MaLe-PSI-BLAST.jar -in [fastafile] -pid [pidnumber - optional]

3.1. Input parameters

-in [fastafile]: Receives the name of a file containing a single protein sequence in FASTA format. Optionally, if you are interested in determining the last common ancestor (LCA) of the set of organisms that have a protein retrieved in your MaLe-PSI-BLAST analysis, you have to tell MaLe-PSI-BLAST which organism your sequence belongs to. To do this, look up the taxonomy ID of the organism (taxonomy IDs can be obtained at http://www.ncbi.nlm.nih.gov/taxonomy) and include the “taxid=[taxIDnumber]” tag in the sequence header, as in the example below:

>sp|P61916|NPC2_HUMAN Epididymal secretory protein E1 GN=NPC2 PE=1 SV=1 taxid=9606
MRFLAATFLLLALSTAAQAEPVQFKDCGSVDGVIKEVNVSPCPTQPCQLSKGQSYSVNVT
FTSNIQSKSSKAVVHGILMGVPVPFPIPEPDGCKSGINCPIQKDKTYSYLNKLPVKSEYP
SIKLVVEWQLQDDKNQSLFCWEIPVQIVSHL

In this case, “9606” is the taxonomy ID for Homo sapiens.
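
Adding the tag by hand is easy for one file; for several files it can be scripted. The following sketch appends the tag to each header line that does not already carry one. The function name add_taxid and the file names in the usage line are hypothetical.

```shell
# Append " taxid=ID" to every FASTA header line (lines starting with ">")
# that does not already contain a taxid= tag. Sequence lines are unchanged.
# Arguments: FILE ID. The result is written to standard output.
add_taxid() {
    awk -v id="$2" '
        /^>/ && $0 !~ / taxid=[0-9]+/ { print $0 " taxid=" id; next }
        { print }
    ' "$1"
}
```

For example, to tag a human sequence stored in query.fasta:

> add_taxid query.fasta 9606 > query_taxid.fasta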

-pid [pidnumber - optional]: Receives a number that identifies your task. If this parameter is omitted, MaLe-PSI-BLAST generates a random number and assigns it to your task.
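
Since -in expects a file with a single sequence, a FASTA file holding several proteins has to be split before it can be analyzed. The sketch below shows one way to do this with awk. The function name split_fasta and the output file naming (seq_0001.fasta, seq_0002.fasta, ...) are choices made for this example only.

```shell
# Split a multi-sequence FASTA file into one file per sequence.
# Arguments: INPUT OUTPUT_DIRECTORY.
split_fasta() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)   # release the previous file
            out = sprintf("%s/seq_%04d.fasta", dir, ++n)
        }
        out != "" { print > out }       # lines before the first header are skipped
    ' "$1"
}

# Each resulting file can then be passed to MaLe-PSI-BLAST in turn, e.g.:
#   for f in split/seq_*.fasta; do java -jar MaLe-PSI-BLAST.jar -in "$f"; done
```

The loop in the final comment uses only the -in option described above; adjust the paths to where the jar file and the split files are located.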

3.2. Outputs

After the analysis, MaLe-PSI-BLAST produces two output files: a ".nhx" file with the tree and a hit table, which overwrites the input file.
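
Because the hit table is written over the input file, you may want to keep a copy of your sequence before starting an analysis. A minimal sketch follows; the function name keep_original and the ".orig" suffix are arbitrary choices for this example.

```shell
# Copy FILE to FILE.orig unless that copy already exists, so the original
# sequence survives a run that writes its results over FILE.
# Argument: FILE.
keep_original() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    [ -e "$1.orig" ] || cp -p "$1" "$1.orig"
}
```

For example, with a hypothetical input file query.fasta:

> keep_original query.fasta && java -jar MaLe-PSI-BLAST.jar -in query.fasta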

3.2.1. MaLe-PSI-BLAST main table

This table contains the main result of MaLe-PSI-BLAST. All proteins retrieved by the MaLe-PSI-BLAST analysis are summarized here, together with the statistical analysis of the protein clustering. See below for the description of each column:

  • Iteration: PSI-BLAST iteration in which the protein was retrieved during the analysis.
  • Uniprot: UniProt accession number of the retrieved protein.
  • Identity: Percentage of identity between the retrieved protein and the query.
  • Alignment: Size of the alignment between the retrieved protein and the query.
  • Phylogenetic distance: Distance between the retrieved protein and the query based on a phylogenetic tree. For more details about the phylogenetic tree constructed by MaLe-PSI-BLAST, refer to Phylogenetic tree.
  • SelfScore: The ratio of the query self-score in the iteration.
  • Confidence: Confidence index for the current hit, calculated by machine learning.
  • E-value: Expect value from the PSI-BLAST analysis.
  • LCA TxID: Taxonomy ID of the last common ancestor (LCA) of the query organism and the organism of the retrieved protein.
  • LCA: Code that indicates the level of the last common ancestor (LCA). It ranges from 0 to 18. The higher the LCA level, the more recent the LCA of the query organism and the organism of the retrieved protein.
  • Organism: Taxonomic name of the organism to which the retrieved protein belongs.
  • Description relevance: A measure of how much the current annotation agrees with the weighted consensus.
  • Description: UniProt description of the retrieved protein.
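
Once the table is available, hits can be filtered on any numeric column. The sketch below keeps the rows whose value in a chosen column is at least a given minimum. It rests on assumptions that this guide does not confirm: that the table is a tab-separated text file and that its first line holds the column names listed above. The function name filter_hits, the file name and the threshold in the usage line are hypothetical.

```shell
# Print the header line and every row whose COLUMN value is >= MIN.
# Arguments: FILE COLUMN MIN.
# ASSUMPTION: FILE is tab-separated and its first line holds the column names.
filter_hits() {
    awk -F '\t' -v col="$2" -v min="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print; next
        }
        $c + 0 >= min + 0 { print }
    ' "$1"
}
```

For example, to keep hits with a confidence of at least 0.5 (an arbitrary cutoff):

> filter_hits hits.tsv Confidence 0.5

If your table uses a different separator or has no header line, the awk program has to be adapted.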

3.2.2. Phylogenetic tree

MaLe-PSI-BLAST includes a set of programs and Perl scripts that automatically generate a phylogenetic tree from its results. The tree is built using MUSCLE (ref) as the sequence aligner and FastTree (ref) as the phylogenetic tree constructor. In addition, a Perl script colors the tree according to the LCA level between your query and each retrieved protein.

The tree is in NHX format and is designed to be viewed with PhyloWidget (ref, www.phylowidget.org/), a program for phylogenetic tree visualization.

4. Advanced topics

5. References