README file for psi-square_algorithm

This program is designed to find groups of highly similar vectors.  Vectors
can be gene expression profiles, gene occurrence patterns, protein interaction
lists, etc.

Given a vector database and a query (one or more vectors), both in
tab-delimited format, psi-square outputs a subset of the database that is
similar to the query.  For details of the psi-square algorithm see ref..
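
As an illustration of the tab-delimited input, here is a minimal sketch that
parses such a file into a dictionary of vectors.  It assumes (this README
does not specify the layout) that each line holds an identifier in the first
column followed by numeric values.  The function name read_vectors and the
sample identifiers are hypothetical and are not part of psi_square.py.

```python
import io

def read_vectors(handle):
    """Parse tab-delimited lines into {identifier: [float, ...]}.

    Assumed layout (not confirmed by this README): identifier in the
    first column, numeric vector components in the remaining columns.
    """
    vectors = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors

# Two made-up binary vectors standing in for a real database file.
sample = io.StringIO("vecA\t1\t0\t1\nvecB\t0\t0\t1\n")
vectors = read_vectors(sample)
```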

The main script is psi_square.py.

usage: psi_square.py [options] <query>
       psi_square.py --stats

Additional functionality is provided by psi_op.py.  This script performs set
operations on psi-square outputs: it takes one or more psi-square output
files and outputs their intersection or union, depending on the command line
option.  It is useful for comparing the performance of different distance
measures, threshold parameters and so on.

usage: psi_op.py [options] <vector-file>...
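
The set operations described above can be sketched as follows.  This is only
an illustration of intersection and union on lists of vector identifiers; it
is not the code of psi_op.py, and the function name combine_ids and the
identifiers are hypothetical.

```python
def combine_ids(id_lists, mode="intersection"):
    """Combine several collections of identifiers by intersection or union."""
    sets = [set(ids) for ids in id_lists]
    if not sets:
        return set()
    if mode == "intersection":
        return set.intersection(*sets)
    if mode == "union":
        return set.union(*sets)
    raise ValueError("mode must be 'intersection' or 'union'")

# Made-up match lists from two runs with different parameters.
run_a = ["vecA", "vecB", "vecC"]
run_b = ["vecA", "vecC", "vecD"]
common = combine_ids([run_a, run_b], "intersection")
either = combine_ids([run_a, run_b], "union")
```

Matches that survive the intersection were found under both parameter
settings, which is one simple way to compare settings.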


IMPORTANT STEPS:

1) Get Python.  It's available from 'http://www.python.org/' and runs on most
   platforms (Linux, Windows, Mac, etc.).  You will need version 2.4 or
   higher.

2) Put your database and query files in tab-delimited format.  The scripts
   (psi_square.py, psi_op.py) either need to be in the same directory as these
   files or else in an appropriate directory in your $PATH.  (The former
   option is the simplest if you're not familiar with $PATH.)

   The module ewsa_quantile.py can be placed in the same directory as
   the scripts.  It's only required if you plan to use the -E flag.

   The database can either be specified as the value of the environment
   variable PSISQUAREDB or specified explicitly using the -D flag.

3) To run the program type at the command prompt:

	psi_square.py [options] <query>

   Depending on the platform you're using, you might need to use one of these
   forms:

	./psi_square.py [options] <query>

	python psi_square.py [options] <query>

   There is additional information on running Python programs under Windows on
   the Python web site.
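
Step 2 names two ways to specify the database: the -D flag and the
PSISQUAREDB environment variable.  A lookup of this kind could be sketched as
below.  The precedence (explicit value first, environment variable second) is
an assumption, not something this README states, and resolve_database is a
hypothetical helper.

```python
import os

def resolve_database(explicit_path=None, environ=None):
    """Return the database path: explicit value first, else PSISQUAREDB.

    Assumption: an explicit path (as given with -D) wins over the
    environment variable.
    """
    if environ is None:
        environ = os.environ
    if explicit_path:
        return explicit_path
    path = environ.get("PSISQUAREDB")
    if not path:
        raise ValueError("no database given: use -D or set PSISQUAREDB")
    return path

from_flag = resolve_database("whog_sorgen.txt", {"PSISQUAREDB": "other.txt"})
from_env = resolve_database(None, {"PSISQUAREDB": "other.txt"})
```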



PSI-SQUARE COMMAND LINE OPTIONS:

Psi-square sensitivity/specificity, type of search and output formats are all
controlled via command line options.

I. CONTROL THE TYPE OF SEARCH:

1) Psi-square can be applied to binary as well as to floating point vectors.
The vector type is controlled via parameter -k.

-k BINS: number of bins to discretize vector values into [default 80];
	0: the number of bins is estimated heuristically;
	2: binary vector search.

2) Psi-square can search the database in order to find vectors similar to
queries using a variety of different distance measures.  The type of distance
measure to be used is controlled via parameter -d.

-d METRIC, --distance-metric=METRIC: type of distance measure to be used;
	0: Correlation coefficient [default];
	1: Euclidean distance;
	2: Manhattan distance;
	3: Complement to extended Jaccard similarity index;
	4: Generalized average-based (GA) distance.
	REMEMBER: if -d is 4, you have to set the degree for GA distance
	(option -e).
	REMEMBER: if -d is not 0, the threshold should be inverted relative to
	correlation-related thresholds, since distance = 1 - correlation.
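
For orientation, measures 0 to 3 can be written out in a few lines.  This is
a sketch using the textbook definitions, not the code of psi_square.py; in
particular, the extended Jaccard index is taken here to be the Tanimoto
coefficient, which is an assumption.  All function names are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def correlation(x, y):
    """Pearson correlation coefficient (undefined for constant vectors)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    return dot(dx, dy) / math.sqrt(dot(dx, dx) * dot(dy, dy))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_complement(x, y):
    """1 - extended Jaccard similarity (assumed: Tanimoto coefficient)."""
    xy = dot(x, y)
    return 1.0 - xy / (dot(x, x) + dot(y, y) - xy)

x = [1, 0, 1, 1]
y = [1, 0, 0, 1]
corr = correlation(x, y)
corr_as_distance = 1.0 - corr  # the "distance = 1 - correlation" relation
```

The last line shows why thresholds are inverted: a high correlation
corresponds to a small distance, so a correlation threshold of 0.9 plays the
role of a distance threshold of 0.1.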

3) Psi-square can search the database using GA-based distance measures (valid
only with -k 2).  The degree of the GA distance is controlled via parameter
-e (--degree).

-e DEGREE, --degree=DEGREE: power for generalized average-based (GA) distance.
The GA distance between vectors X and Y is computed as d(X,Y) = 1 - (X,Y)/A,
where A = ((X,X)^degree + (Y,Y)^degree)^(1/degree).
	0: degree = 0, geometric average [default];
	-100: degree = -infinity, complement to the Simpson index;
	100: degree = +infinity;
	n: degree = n, any number in the range [-inf,+inf].
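
The GA distance can be illustrated for binary vectors as follows.  This is a
sketch under stated assumptions, not the implementation in psi_square.py:
(X,Y) is taken to be the dot product, and A is computed as a power mean with
a division by 2 inside the bracket, so that degree 0 gives the geometric
average and degrees of -100 and 100 give the minimum and maximum, matching
the special cases listed above.  The function names are hypothetical.

```python
import math

def generalized_mean(a, b, degree):
    """Power mean of two positive numbers (assumed normalisation by 2)."""
    if degree == 0:
        return math.sqrt(a * b)   # limit degree -> 0: geometric average
    if degree <= -100:
        return min(a, b)          # stands in for degree = -infinity
    if degree >= 100:
        return max(a, b)          # stands in for degree = +infinity
    return ((a ** degree + b ** degree) / 2.0) ** (1.0 / degree)

def ga_distance(x, y, degree=0):
    """d(X,Y) = 1 - (X,Y)/A for non-zero vectors X and Y."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return 1.0 - xy / generalized_mean(xx, yy, degree)

x = [1, 1, 1, 0]
y = [1, 1, 0, 0]
d_geometric = ga_distance(x, y, 0)
d_simpson = ga_distance(x, y, -100)
d_max = ga_distance(x, y, 100)
```

With degree -100 the distance is 0 whenever one binary pattern is contained in
the other, which is the behaviour of the complement to the Simpson index.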

4) Prior to running searches, it's possible to compute distance distribution
statistics for a database.  This makes it easier to choose a good threshold
for distances.  This option is controlled via parameter --stats.
Additionally, you will need to provide the database file and the type of
distance measure (-d), so the command line should include these options, e.g.:

	python psi_square.py -D whog_sorgen.txt -d 4 --degree -2 --stats

The output is sent to stdout.
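
To show how a distance distribution can guide the choice of threshold, the
sketch below computes quantiles of all pairwise distances in a small set of
vectors.  The actual output of --stats is not described in this README, so
this is not a reproduction of it; distance_quantiles and the sample data are
hypothetical.

```python
from itertools import combinations

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def distance_quantiles(vectors, distance, probs=(0.5, 0.9)):
    """Return {p: distance value at quantile p} over all vector pairs."""
    dists = sorted(distance(x, y) for x, y in combinations(vectors, 2))
    result = {}
    for p in probs:
        index = min(int(p * len(dists)), len(dists) - 1)
        result[p] = dists[index]
    return result

vectors = [[0, 0], [0, 1], [1, 1], [1, 0]]
quantiles = distance_quantiles(vectors, manhattan)
```

A threshold placed at a low quantile keeps only pairs that are unusually
close relative to the database as a whole.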

5) Use a fixed number of iterations.

-p PASSES, --passes=PASSES


II. CONTROL THE SPECIFICITY/SENSITIVITY PARAMETERS

-c CORRELATION_THRESHOLD: threshold for correlation (or distance) between two
vectors to be considered significant [default 0.9]
-C SECONDARY_CORRELATION_THRESHOLD: threshold for correlation (or distance)
to the query that vectors found in additional passes must satisfy
[default 0.4 for -d 0, infinity otherwise]
-s CUMULATIVE_SCORE_THRESHOLD: threshold for score distribution [default 0.99]
-z SCALE, --scale SCALE: scale for score distribution [default 1000]
-m MIN_INITIAL, --min-initial=MIN_INITIAL: minimum number of 'close' vectors
in first iteration required to continue [default 3]


III. CONTROL THE OUTPUT

-t, --truncate: truncate gene identifiers (to 60 characters)
-o FILE, --output=FILE: output file [default is stdout]
-r PREFIX, --results=PREFIX: write results to tab-delimited files, using
PREFIX to generate the filenames.  This option is useful when you have a list
of queries.  For example, with 9 queries and -r 1, the matches found
separately for each query are placed in files 1.q0.txt, ... ,1.q8.txt.  You
can then use psi_op.py to take e.g. the intersection or union of
1.q0.txt, ... ,1.q8.txt.
-b N, --show-best=N: show only the best N vectors for each iteration [default
is to show all]
-v, --verbose: be verbose
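
The file naming used by -r can be reproduced with a short helper, for example
to build the argument list for psi_op.py.  The pattern PREFIX.qN.txt is taken
from the description of -r above; the function result_filenames is
hypothetical.

```python
def result_filenames(prefix, n_queries):
    """Names of the per-query result files written with -r PREFIX."""
    return ["%s.q%d.txt" % (prefix, i) for i in range(n_queries)]

names = result_filenames("1", 9)
```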


EXAMPLES

Let's say you have a database called whog_sorgen.txt containing phyletic
patterns and a file cog1298 containing COG1298, the FlhA component of the
flagellar biosynthesis pathway, that you want to use as a query in order to
find the patterns in whog_sorgen.txt most similar to COG1298.

Type at your prompt...

1. python psi_square.py -D whog_sorgen.txt -k 2 -d 0 -c 0.6 -s 0.999 cog1298 > cog1298.dcor

File cog1298.dcor will contain the list of matches found and their
correlations with the query.

2. python psi_square.py -D whog_sorgen.txt -k 2 -d 4 --degree -100 -c 0.1 -s 0.999 cog1298  > cog1298.d_inf

File cog1298.d_inf will contain the list of matches found and their GA-based
distances (the complement to the Simpson similarity index).

3. python psi_square.py -D whog_sorgen.txt -k 2 -d 0 -c 0.6 -s 0.999 -r 1 cog1298 > cog1298.dcor

File cog1298.dcor will contain the list of matches found and their
correlations with the query.  In addition, the file 1.q0.txt is created,
which in the case of a single query is essentially the same as the output
file.
Source: psi_readme.txt, updated 2010-07-09