Re: [Biskit-general] database location
From: Thomas E. <te...@gm...> - 2009-12-13 21:57:46
Raik,

I really appreciate your kind assistance. But apart from running Biskit, I also need to understand what is happening behind the calculations. I have some brief questions about the available arguments:

From what I understand, search_sequences.py performs several clustering iterations, lowering -simcut and -lencut each time, until it reaches a number of clusters that is lower than -limit (50 by default). Is 50 the optimum value?

I think I understand what the arguments mean, all except -aln. What do you mean when you suggest setting -aln to 100?

Is it right to set -simlen to 1.0 when increasing -simcut to 5, to filter very close homologs?

Why don't you use -mode accurate in T-Coffee (align.py)?

I ran the pipeline up to the sequence-structure alignment using T-Coffee (align.py), but the resulting final.pir_aln (59 sequences) does not contain my target sequence, although I named the target file target.fasta. The maximum sequence similarity of my target with PDB structures is ~42%. Is this the reason it is missing from my alignment file? Apart from that, the alignment is very long (4510 aa) and contains long gaps. What should I do to improve it?

thanks,
Thomas

2009/12/13 Raik Gruenberg <rai...@cr...>

> Hi again,
>
> Thomas Evangelidis wrote:
>
>> Raik,
>>
>> I followed the stack trace and found out that the target sequence must be
>> named target.fasta. Perhaps you should point this out somewhere in the
>> documentation, as there is no argument to set the desired file name.
>
> In theory, all of the modeling scripts should have an optional argument to
> give an alternative input file/folder. Please try running:
>
> search_sequences.py --help
> or align.py --help
> etc.
>
> This should give you all the input options. However, all the scripts have
> default input and output file names so that the default pipeline can be run
> in a fresh folder starting from a single target.fasta without any options.
> The problem is that we usually test the pipeline only with these default
> input/output names. So sometimes we miss bugs that occur only with
> non-default parameters.
>
>> Now align.py works but t_coffee (Version_5.72) takes ages to finish. It
>> may be the large number of sequences (366) in sequences.nr.fasta and
>> templates (38) in templates/templates.fasta that slows it down.
>>
>> Could you guide me to optimize the performance on my dual core machine?
>> I've noticed t_coffee eats up almost all the memory (RAM 4GB), is it
>> possible to distribute the jobs to both processors?
>
> Mhm... this question is better suited for the T-Coffee mailing list. From
> my experience, large alignments >100 entries do not improve the result
> (rather the opposite). So it would be better if you restrict the number of
> sequence hits before going to the alignment step. search_templates and
> search_sequences have several options to play around with:
>
> search_templates.py --help
> ...
> Options:
>     -q       fasta file with query sequence (default: ./target.fasta)
>     -o       output folder for results (default: .)
>     -log     log file (default: STDOUT)
>     -db      sequence data base
>     -limit   Largest number of clusters allowed
>     -e       E-value cutoff for sequence search
>     -aln     number of alignments to be returned
>     -simcut  similarity threshold for blastclust (score < 3 or % identity)
>     -simlen  length threshold for clustering
>     -ncpu    number of CPUs for clustering
>     -psi     int, use PSI Blast with specified number of iterations
> ...
>
> If you get lots of sequences in this step, I would recommend using
>     -aln 100 (or less)
> If you have lots of very, very similar sequences you can try:
>     -simcut 5 (or more)
> If you have some closely related sequences but also many far away ones, it
> is a very good idea to exclude the far away ones by imposing a stricter
> E-value cutoff:
>     -e 0.0000001
>
> The idea is to get template PDBs that are as closely related to your
> target as possible, plus a bunch of additional sequences (without
> structure) that help you to fill the "relation gaps" within the structures.
>
> Good luck!
> Raik
>
>> thanks in advance,
>> Thomas
>>
>> 2009/12/11 Thomas Evangelidis <Tho...@po... <mailto:Tho...@po...>>
>>
>> Raik,
>>
>> Firstly:
>>
>>     > Finally, an irrelevant question: does pvm distribute jobs to both
>>     > my CPUs or is it just for individual computers?
>>
>>     By default, jobs go to one CPU on one host. If you add the same
>>     host twice, it will get two jobs in parallel that end up on each CPU.
>>     I guess that's what you want. The hosts.py configuration of Biskit
>>     allows you to define all that.
>>
>> Could you be more specific? I tried several combinations of nodes_own
>> and cpus_own in Biskit/data/defaults/hosts.py like:
>>
>>     nodes_own = ['localhost', 'localhost']
>>     cpus_own = ['localhost', 'localhost', 'localhost', 'localhost']
>>
>> or
>>
>>     nodes_own = ['localhost']
>>     cpus_own = ['localhost', 'localhost']
>>
>> But I haven't noticed both CPUs being occupied. How should I modify it?
>>
>> As for search_sequences.py, I ran it from ~/Documents with my own
>> sequence and it worked. So did search_templates.py and clean_templates.py,
>> but align.py returns:
>>
>> /usr/local/lib/python2.6/dist-packages/Bio/Fasta/__init__.py:68:
>> DeprecationWarning: Bio.Fasta is deprecated. Please use the "fasta"
>> support in Bio.SeqIO (or Bio.AlignIO) instead.
>>   'Bio.SeqIO (or Bio.AlignIO) instead.', DeprecationWarning)
>>
>> Error: Error while building alingnments.
>> <type 'exceptions.OSError'> in
>> /home/thomas2/Documents/biskit/scripts/Mod/align.py line 123:
>> (2, 'No such file or directory').
>> TraceBack:
>> align: 123 (<module>) a.align_for_modeller_inp()
>> Aligner: 296 (align_for_modeller_inp) self.repair_target_fasta( f_target )
>> Aligner: 416 (repair_target_fasta) os.rename( fname, bak_fname)
>>
>> Traceback (most recent call last):
>>   File "/home/thomas2/Documents/biskit/scripts/Mod/align.py", line 128, in <module>
>>     EHandler.error( 'Error while building alingnments.')
>>   File "/usr/local/lib/python2.6/dist-packages/Biskit/ErrorHandler.py", line 77, in error
>>     raise NormalError
>> Biskit.Errors.NormalError
>>
>> thanks,
>> Thomas
>>
>> _______________________________________________
>> Biskit-general mailing list
>> Bis...@li... <mailto:Bis...@li...>
>> https://lists.sourceforge.net/lists/listinfo/biskit-general
>
> --
> ________________________________
>
> Dr. Raik Gruenberg
> http://www.raiks.de/contact.html
> ________________________________
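
The iterative clustering described at the top of this message (relax the similarity threshold step by step until no more than -limit clusters remain) can be sketched roughly as below. This is only an illustration of the idea as described in the thread, not Biskit's actual code: the function names, the step size, and the use of Python's difflib as a stand-in for a blastclust score are all assumptions, and "at most limit clusters" is one reading of "Largest number of clusters allowed".

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy stand-in for a clustering score: percent similarity of two strings.

    Assumption: a real pipeline would use blastclust here, not difflib.
    """
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def cluster(seqs, simcut):
    """Greedy clustering: a sequence joins the first cluster that already
    contains a member at least `simcut` percent similar to it."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(similarity(s, member) >= simcut for member in c):
                c.append(s)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([s])
    return clusters


def cluster_until_limit(seqs, simcut=99.0, step=10.0, limit=50, min_simcut=0.0):
    """Lower the (hypothetical) threshold until at most `limit` clusters remain.

    Returns the clusters and the threshold that produced them. Stops at
    `min_simcut` even if the limit could not be reached.
    """
    while True:
        clusters = cluster(seqs, simcut)
        if len(clusters) <= limit or simcut <= min_simcut:
            return clusters, simcut
        simcut = max(min_simcut, simcut - step)


if __name__ == "__main__":
    seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "CCCCCCCCCA", "GGGGGGGGGG"]
    clusters, used = cluster_until_limit(seqs, simcut=99.0, step=10.0, limit=3)
    print(len(clusters), "clusters at threshold", used)
```

In this toy version a lower threshold merges more sequences into the same cluster, so the cluster count shrinks with each iteration; whether 50 is a good limit for a given target is a separate question that the sketch does not answer.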