Name                            Modified    Size       Downloads / Week
exampledatabase_cogup14714.txt  2010-07-09  3.4 MB     4
exampleinput.txt                2010-07-09  230 Bytes  3
psi_readme.txt                  2010-07-09  6.8 kB     3
psi_square.py                   2010-07-09  31.0 kB    1
Totals: 4 items                             3.4 MB     11
README file for psi-square_algorithm

This program is designed to find groups of highly similar vectors. Vectors can be gene expression profiles, gene occurrence patterns, protein interaction lists, etc. Given a vector database and a query (one or more vectors), both in tab-delimited format, psi-square outputs the subset of the database that is similar to the query. For the details of the psi-square algorithm see ref..

The main script is psi_square.py.

usage: psi_square.py [options] <query>
       psi_square.py --stats

Additional functionality is provided by psi_op.py. This script performs set operations on psi-square outputs: it takes one or more psi-square output files and writes their intersection or union, depending on the command line option. It is useful for evaluating the performance of different distance measures, threshold parameters and so on.

usage: psi_op.py [options] <vector-file>...

IMPORTANT STEPS:

1) Get Python. It's available from 'http://www.python.org/' and runs on most platforms (Linux, Windows, Mac, etc.). You will need version 2.4 or higher.

2) Put your database and query files in tab-delimited format. The scripts (psi_square.py, psi_op.py) need to be either in the same directory as these files or in an appropriate directory in your $PATH. (The former option is the simplest if you're not familiar with $PATH.) The module ewsa_quantile.py can be placed in the same directory as the scripts. It's only required if you plan to use the -E flag. The database can be specified either as the value of the environment variable PSISQUAREDB or explicitly using the -D flag.

3) To run the program, type at the command prompt:

   psi_square.py [options] <query>

   Depending on the platform you're using, you might need to use one of these forms:

   ./psi_square.py [options] <query>
   python psi_square.py [options] <query>

   There is additional information on running Python programs under Windows on the Python web site.
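The two ways of naming the database in step 2 (the PSISQUAREDB environment variable or the -D flag) can be sketched from a wrapper script as follows. This is only an illustration: the helper names build_invocation and run_search are hypothetical and not part of the distribution, the wrapper itself assumes Python 3.5 or later for subprocess.run, and the interpreter name "python" is assumed to point at a Python version that psi_square.py runs under.

```python
import os
import subprocess


def build_invocation(query, database, use_env=False, extra_options=(),
                     interpreter="python"):
    """Return (argv, env) for one psi_square.py run.

    Hypothetical helper: only the -D flag and the PSISQUAREDB variable
    come from the README; everything else is an illustrative assumption.
    """
    argv = [interpreter, "psi_square.py"]
    env = dict(os.environ)
    if use_env:
        # Database is taken from the environment variable.
        env["PSISQUAREDB"] = database
    else:
        # Database is passed explicitly on the command line.
        argv += ["-D", database]
    argv += list(extra_options)
    argv.append(query)
    return argv, env


def run_search(query, database, **kwargs):
    """Run psi_square.py and return what it wrote to stdout."""
    argv, env = build_invocation(query, database, **kwargs)
    result = subprocess.run(argv, env=env, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout
```

Calling build_invocation("cog1298", "whog_sorgen.txt", extra_options=["-k", "2"]) yields the same argument list as typing "python psi_square.py -D whog_sorgen.txt -k 2 cog1298" at the prompt; with use_env=True the -D flag is left out and the database name travels in PSISQUAREDB instead.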
PSI-SQUARE COMMAND LINE OPTIONS:

Psi-square sensitivity/specificity, the type of search and the output formats are all controlled via command line options.

I. CONTROL THE TYPE OF SEARCH:

1) Psi-square can be applied to binary as well as to floating point vectors. This is controlled via parameter -k.

   -k BINS: number of bins to discretize vector values into [default 80];
      a value of 0 means this parameter will be estimated heuristically;
      -k 2: for binary vector search.

2) Psi-square can search the database for vectors similar to the queries using a variety of distance measures. The type of distance measure to be used is controlled via parameter -d.

   -d METRIC, --distance-metric=METRIC: type of distance measure to be used;
      0: Correlation coefficient [default];
      1: Euclidean distance;
      2: Manhattan distance;
      3: Complement to extended Jaccard similarity index;
      4: Generalized average-based (GA) distance.

   REMEMBER: if d == 4 you have to set the degree for GA distance (option -e).
   REMEMBER: if d is not 0, the threshold should be inverted relative to correlation-related thresholds.
   REMEMBER: d = 1 - corr

3) Psi-square can search the database using GA-based distance measures (valid only with k=2). The type of distance measure to be used is controlled via parameter --degree.

   -e DEGREE, --degree=DEGREE: power for generalized average-based (GA) distance.
      The GA distance between vectors X, Y is computed as
         d(X,Y) = 1 - (X,Y)/A, where A = ((X,X)^degree + (Y,Y)^degree)^(1/degree).
      0: degree = 0, geometric average [default];
      -100: degree == -infinity, complement to Simpson index;
      100: degree == +infinity;
      n: degree == n, any number in the range [-inf,+inf].

4) Prior to running searches, it's possible to compute distance distribution statistics for a database. This makes it easier to choose a good threshold for distances. This option is controlled via parameter --stats.
   Additionally, you will need to provide the database file and the type of distance measure (-d), so the command line should include both, e.g.:

   python psi_square.py -D whog_sorgen.txt -d 4 --degree -2 --stats

   The output is sent to stdout.

5) Use a fixed number of iterations.

   -p PASSES, --passes=PASSES

II. CONTROL THE SPECIFICITY/SENSITIVITY PARAMETERS

   -c CORRELATION_THRESHOLD: threshold for correlation (or distance) between two vectors to be considered significant [default 0.9]
   -C SECONDARY_CORRELATION_THRESHOLD: threshold for correlation (distance) which vectors found in additional passes must satisfy with respect to the query [default 0.4 for d == 0, infinity otherwise]
   -s CUMULATIVE_SCORE_THRESHOLD: threshold for score distribution [default 0.99]
   -z SCALE, --scale SCALE: scale for score distribution [default 1000]
   -m MIN_INITIAL, --min-initial=MIN_INITIAL: minimum number of 'close' vectors in first iteration required to continue [default 3]

III. CONTROL THE OUTPUT

   -t, --truncate: truncate gene identifiers (to 60 characters)
   -o FILE, --output=FILE: output file [default is stdout]
   -r PREFIX, --results=PREFIX: write results to tab-delimited files, using PREFIX to generate the filenames. This option is useful when you have a list of queries. For example, with 9 queries and -r 1, the matches found separately for each of the nine queries are placed in files 1.q0.txt, ..., 1.q8.txt. You can then use psi_op.py to take, e.g., the intersection or union of 1.q0.txt, ..., 1.q8.txt.
   -b N, --show-best=N: show only the best N vectors for each iteration [default is to show all]
   -v, --verbose: be verbose

EXAMPLES

Let's say you have a database called whog_sorgen.txt containing phyletic patterns, and a file cog1298 containing COG1298, component FlhA of the flagellar biosynthesis pathway, that you want to use as a query in order to find the patterns in whog_sorgen.txt most similar to COG1298. Type at your prompt...

1.
   python psi_square.py -D whog_sorgen.txt -k 2 -d 0 -c 0.6 -s 0.999 cog1298 > cog1298.dcor

   File cog1298.dcor will contain the list of matches found and their correlations with the query.

2.
   python psi_square.py -D whog_sorgen.txt -k 2 -d 4 --degree -100 -c 0.1 -s 0.999 cog1298 > cog1298.d_inf

   File cog1298.d_inf will contain the list of matches found and their GA-based distances (complement to the Simpson similarity index).

3.
   python psi_square.py -D whog_sorgen.txt -k 2 -d 0 -c 0.6 -s 0.999 -r 1 cog1298 > cog1298.dcor

   File cog1298.dcor will contain the list of matches found and their correlations with the query. In addition, file 1.q0.txt is written; for a single query it is essentially the same as the output file.
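To make the GA distance used in example 2 more concrete, here is a minimal sketch of the formula given under option -e, for binary (0/1) vectors. It is not taken from psi_square.py. The function names power_mean and ga_distance are hypothetical, and the sketch assumes the usual power-mean normalisation, A = (((X,X)^degree + (Y,Y)^degree)/2)^(1/degree). The factor 1/2 does not appear in the formula above; it is assumed here because it is the form for which degree 0 gives the geometric average and degree -infinity gives the minimum, matching the special cases listed for -e.

```python
import math


def power_mean(a, b, degree):
    """Generalized (power) mean of two positive numbers."""
    if degree == -math.inf:
        return min(a, b)          # limit for degree -> -infinity
    if degree == math.inf:
        return max(a, b)          # limit for degree -> +infinity
    if degree == 0:
        return math.sqrt(a * b)   # limit for degree -> 0: geometric mean
    return ((a ** degree + b ** degree) / 2.0) ** (1.0 / degree)


def ga_distance(x, y, degree=0):
    """d(X, Y) = 1 - (X, Y) / A for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    xy = sum(a * b for a, b in zip(x, y))   # (X, Y): shared 1s
    xx = sum(a * a for a in x)              # (X, X): 1s in X
    yy = sum(b * b for b in y)              # (Y, Y): 1s in Y
    if xx == 0 or yy == 0:
        raise ValueError("GA distance is undefined for an all-zero vector")
    return 1.0 - xy / power_mean(xx, yy, degree)
```

For x = [1, 1, 0, 1] and y = [1, 0, 0, 1] the two patterns share two positions, so ga_distance(x, y, -math.inf) is 1 - 2/min(3, 2) = 0, the complement of the Simpson index, while ga_distance(x, y, 1) is 1 - 2/2.5 = 0.2.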
Source: psi_readme.txt, updated 2010-07-09