Re: [Biskit-general] database location

Raik,

Thank you very much for your kind assistance. Apart from running Biskit, I
would also like to understand what is happening behind the calculations. I
have some brief questions about the available arguments:

From what I understand, search_sequences.py performs several clustering
iterations lowering -simcut and -lencut each time, until it reaches a number
of clusters which is lower than -limit (50 by default). Is 50 the optimum
value?
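
To make sure I have the logic right, here is a minimal Python sketch of the loop as I picture it. It is not taken from the Biskit source: the function name relax_until_limit, the step sizes, the round limit and the toy clustering function are all my own assumptions, and I am guessing that the loop stops once the cluster count is no larger than the limit.

```python
def relax_until_limit(cluster_fn, simcut, lencut, limit=50,
                      sim_step=0.1, len_step=0.1, max_rounds=20):
    """Re-cluster with progressively lower thresholds until the number
    of clusters is at most `limit`.

    Hypothetical sketch, NOT Biskit code: cluster_fn(simcut, lencut)
    stands in for whatever actually calls the clustering program and
    must return a list of clusters.
    """
    clusters = cluster_fn(simcut, lencut)
    rounds = 0
    while len(clusters) > limit and rounds < max_rounds:
        # Loosen both thresholds a little, never going below zero.
        simcut = max(simcut - sim_step, 0.0)
        lencut = max(lencut - len_step, 0.0)
        clusters = cluster_fn(simcut, lencut)
        rounds += 1
    return clusters, simcut, lencut


def toy_cluster(simcut, lencut):
    """Toy stand-in: stricter similarity threshold -> more clusters."""
    n = int(round(200 * simcut))
    return [[i] for i in range(n)]


clusters, final_simcut, final_lencut = relax_until_limit(
    toy_cluster, simcut=1.0, lencut=1.0, limit=50)
print(len(clusters), final_simcut, final_lencut)
```

If this picture is right, the limit only decides when the loop stops, so whether 50 is optimal would depend on how many sequences the later alignment step can handle.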

I think I understand what all the arguments mean, except -aln. What do
you mean when you suggest setting -aln to 100?

Is it right to set -simlen to 1.0 when increasing -simcut to 5, to filter
very close homologs?

Why don't you use -mode accurate in T-COFFEE (align.py)?

I ran the pipeline up to the sequence-structure alignment using T-Coffee
(align.py), but the resulting final.pir_aln (59 sequences) does not contain
my target sequence, although I named the target file target.fasta. The
maximum sequence similarity of my target to PDB structures is ~42%. Is this
the reason it is missing from my alignment file? Apart from that, the
alignment is very long (4510 aa) and contains long gaps. What should I do to
improve it?
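
For reference, this is roughly how I checked the alignment file. It is a small sketch in plain Python, assuming final.pir_aln follows the usual PIR layout (a ">P1;name" header, one description line, then the sequence ending in "*"). The helper names pir_entries and gap_fraction and the entry name "target" are my own assumptions; the example data below is made up.

```python
def pir_entries(text):
    """Return {entry_id: aligned_sequence} for a PIR-formatted string.

    Assumes each record is: '>XX;id' header, one description line,
    then sequence lines terminated by '*'.
    """
    entries = {}
    current = None
    skip_description = False
    for line in text.splitlines():
        if line.startswith(">"):
            # Keep the identifier after the ';' in '>P1;identifier'.
            current = line.split(";", 1)[-1].strip()
            entries[current] = ""
            skip_description = True   # the next line is the description
        elif current is not None:
            if skip_description:
                skip_description = False
                continue
            entries[current] += line.strip().rstrip("*")
    return entries


def gap_fraction(seq):
    """Fraction of alignment columns that are gaps ('-') in one sequence."""
    return seq.count("-") / len(seq) if seq else 0.0


# Made-up example data, only to show the expected layout.
example = """>P1;target
sequence:target
MK--LV*
>P1;1abc
structureX:1abc
MKAALV*
"""

entries = pir_entries(example)
print("target" in entries)
for name, seq in entries.items():
    print(name, len(seq), round(gap_fraction(seq), 2))
```

On the real file I would read the text with open("final.pir_aln").read() instead of the example string. Entries with a very high gap fraction seem to be what stretches my alignment.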

thanks,
Thomas

2009/12/13 Raik Gruenberg <rai...@cr...>

> Hi again,
>
>
> Thomas Evangelidis wrote:
>
>> Raik,
>>
>>
>> I followed the stack trace and found out that the target sequence must be
>> named target.fasta. Perhaps you should point this out somewhere in the
>> documentation, as there is no argument to set the desired file name.
>>
>
> In theory, all of the modeling scripts should have an optional argument to
> give an alternative input file/folder. Please try running:
>
> search_sequences.py --help
> or align.py --help
> etc.
>
> This should give you all the input options. However, all the scripts have
> default input and output file names so that the default pipeline can be run
> in a fresh folder starting from a single target.fasta without any options.
> The problem is that we usually test the pipeline only with these default
> input/output names. So sometimes we miss bugs that occur only with the
> non-default parameters.
>
>
>
>> Now align.py works but t_coffee (Version_5.72) takes ages to finish. It
>> may be the large number of sequences (366) in sequences.nr.fasta and
>> templates (38) in templates/templates.fasta that slows it down.
>>
>> Could you guide me to optimize the performance on my dual core machine?
>> I've noticed t_coffee eats up almost all the memory (RAM 4GB), is it
>> possible to distribute the jobs to both processors?
>>
>
> Mhm... this question is better suited for the T-Coffee mailing list. From
> my experience, large alignments >100 entries do not improve the result
> (rather the opposite). So it would be better if you restrict the number of
> sequence hits before going to the alignment step. search_templates and
> search_sequences have several options to play around with:
>
> search_templates.py --help
> ...
> Options:
>    -q       fasta file with query sequence (default: ./target.fasta)
>    -o       output folder for results      (default: .)
>    -log     log file                       (default: STDOUT)
>    -db      sequence data base
>    -limit   Largest number of clusters allowed
>    -e       E-value cutoff for sequence search
>    -aln     number of alignments to be returned
>    -simcut  similarity threshold for blastclust (score < 3 or % identity)
>    -simlen  length threshold for clustering
>    -ncpu    number of CPUs for clustering
>    -psi     int, use PSI Blast with specified number of iterations
> ...
>
> If you get lots of sequences in this step, I would recommend using
> -aln 100 (or less)
> If you have lots of very very similar sequences you can try:
> -simcut 5 (or more)
> If you have some closely related sequences but also many far away ones, it
> is a very good idea to exclude the far away ones by imposing a stricter
> (lower) E-value cutoff:
> -e 0.0000001
>
> The idea is to get template PDBs that are as closely related as possible to
> your target, plus a number of additional sequences (without structure) that
> help you to fill the "relation gaps" within the structures.
>
> Good luck!
> Raik
>
>
>> thanks in advance,
>> Thomas
>>
>> 2009/12/11 Thomas Evangelidis <Tho...@po...>
>>
>>
>>
>>    Raik,
>>
>>    Firstly:
>>
>>
>>        > Finally, an unrelated question: does pvm distribute jobs to
>>        > both my CPUs or is it just for individual computers?
>>
>>        By default, jobs go to one CPU on one host. If you add the same
>>        host twice, it will get two jobs in parallel that end up on each
>>        CPU. I guess that's what you want. The hosts.py configuration of
>>        Biskit allows you to define all that.
>>
>>    Could you be more specific? I tried several combinations of nodes_own
>>    and cpus_own in Biskit/data/defaults/hosts.py like:
>>    nodes_own = ['localhost', 'localhost']
>>    cpus_own = ['localhost', 'localhost', 'localhost', 'localhost']
>>
>>    or nodes_own = ['localhost']
>>    cpus_own = ['localhost', 'localhost']
>>
>>    But I haven't seen both CPUs in use. How should I modify it?
>>
>>
>>    As for search_sequences.py, I ran it from ~/Documents with my own
>>    sequence and it worked. So did search_templates.py and
>>    clean_templates.py, but align.py returns:
>>
>>    /usr/local/lib/python2.6/dist-packages/Bio/Fasta/__init__.py:68:
>>    DeprecationWarning: Bio.Fasta is deprecated. Please use the "fasta"
>>    support in Bio.SeqIO (or Bio.AlignIO) instead.
>>      'Bio.SeqIO (or Bio.AlignIO) instead.', DeprecationWarning)
>>
>>    Error: Error while building alingnments.
>>            <type 'exceptions.OSError'> in
>>    /home/thomas2/Documents/biskit/scripts/Mod/align.py line 123:
>>            (2, 'No such file or directory').
>>    TraceBack:
>>    align: 123 (<module>) a.align_for_modeller_inp()
>>    Aligner: 296 (align_for_modeller_inp) self.repair_target_fasta(
>>    f_target )
>>    Aligner: 416 (repair_target_fasta) os.rename( fname,  bak_fname)
>>
>>
>>    Traceback (most recent call last):
>>      File "/home/thomas2/Documents/biskit/scripts/Mod/align.py", line
>>    128, in <module>
>>        EHandler.error( 'Error while building alingnments.')
>>      File
>>    "/usr/local/lib/python2.6/dist-packages/Biskit/ErrorHandler.py", line
>>    77, in error
>>        raise NormalError
>>    Biskit.Errors.NormalError
>>
>>
>>    thanks,
>>    Thomas
>>
>>
>>
>>
>>
>>
>>
>>    _______________________________________________
>>    Biskit-general mailing list
>>    Bis...@li...
>>    <mailto:Bis...@li...>
>>
>>    https://lists.sourceforge.net/lists/listinfo/biskit-general
>>
>>
>>
> --
> ________________________________
>
> Dr. Raik Gruenberg
> http://www.raiks.de/contact.html
> ________________________________
>