Thread: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl

Hello all,

Where can I find the script parseBlastFilesForSimilarity.pl?
I'm trying to run LoadBlastSimFast...

Poliana

poliana-

oops, the usage statement for LoadBlastSimFast is out of date. it
should instruct you to use the blastSimilarity command instead.

LoadBlastSimFast makes a big assumption: that the subject and query
sequences are already in GUS, and that their def. lines have GUS
primary keys.

Are your sequences already loaded into GUS?

steve

Hi Steve,

On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
> poliana-
> 
> oops, the usage statement for LoadBlastSimFast is out of date.   it 
> should instruct you to use the blastSimilarity command.
> 
> LoadBlastSimFast makes a big assumption, that the subject and query 
> sequences are in GUS, and their def. lines have GUS primary keys. 
> 
> Are your sequences already loaded into GUS?

They are not. Is there any howto or set of tips for that plugin? We will
certainly need plugins to load "Interpro" and "ORF finding" results
into GUS. If none are available, then maybe we will have to write
them ourselves.

Cheers, Alberto

what are you planning on blasting?

steve

A couple more comments. The "big assumption" (which also applies to the
LoadBLATAlignments plugin) might seem restrictive and counterintuitive at
first (so it is easy to assume otherwise), and it imposes more work if you
just want to experiment with your datasets (which could be heterogeneous).

But think about it: this is a necessary safeguard to protect data integrity
in GUS. It is dictated by the foreign key constraints (query sequence
and subject sequence) on the similarity table. In practice, this forces you
to think carefully about your datasets (e.g. how to organize them if you
have gene trap tags from half a dozen sources and you want to align them
all to the genome).

-Thomas
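
To make the foreign key point concrete, here is a minimal sketch using
Python's sqlite3 module. The two tables (sequence, similarity) and all
column names are hypothetical stand-ins, not the actual GUS schema. The
only thing it tries to show is that a similarity row is rejected when one
of the sequences it points to has not been loaded.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a toy sequence/similarity schema.

    Table and column names are illustrative assumptions, not GUS tables.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE sequence ("
        " sequence_id INTEGER PRIMARY KEY,"
        " source TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE similarity ("
        " similarity_id INTEGER PRIMARY KEY,"
        " query_id INTEGER NOT NULL REFERENCES sequence(sequence_id),"
        " subject_id INTEGER NOT NULL REFERENCES sequence(sequence_id),"
        " score REAL)"
    )
    return conn


def try_insert_similarity(conn, query_id, subject_id, score) -> bool:
    """Insert one similarity row; return False if a foreign key is violated."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO similarity (query_id, subject_id, score)"
                " VALUES (?, ?, ?)",
                (query_id, subject_id, score),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_demo_db()
conn.execute("INSERT INTO sequence (sequence_id, source) VALUES (1, 'lab fasta')")
conn.execute("INSERT INTO sequence (sequence_id, source) VALUES (2, 'NR')")
conn.commit()

# Both sequences are loaded, so the row is accepted.
print(try_insert_similarity(conn, 1, 2, 250.0))
# Subject 99 was never loaded, so the foreign key constraint rejects the row.
print(try_insert_similarity(conn, 1, 99, 180.0))
```

In this toy setup the second insert fails for the same reason a loader
cannot record a hit against a subject that is not in the database.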

Query:

Either sequences from genbank (genbank format) or sequences generated in
the lab (fasta format)

Blastable databases (all are formatted databases from NCBI):

NR
NT
EST

Alberto

On Fri, 2005-02-11 at 10:34 -0500, Steve Fischer wrote:
> for the blast, what are the query sequences and what are the blastable 
> databases?
> 
> steve
> 
> Alberto Davila wrote:
> 
> >Basically, we will use sequences (loaded into GUS with the GBParser) for
> >NCBI BLAST (blastx, blastp and tblastx); the same sequences will also be
> >used for InterPro analyses. The results of both (BLAST and InterPro) will
> >be loaded into GUS. We will parse specific fields from the BLAST results,
> >namely:
> >
> >  `Gi` 
> >  `Accession` 
> >  `Description` 
> >  `E_value` 
> >  `Score` 
> >  `Length` 
> >  `Frame_Query` 
> >  `Frame_Hit` 
> >  `Identical` 
> >  `Hsp_Frac_Identical` 
> >  `Conserved` 
> >  `Hsp_Frac_Conserved`
> >  `Query_Start`
> >  `Query_End` 
> >  `Hit_Start` 
> >  `Hit_End` 
> >  `Hsp_Align` 
> >  `database_letters` 
> >  `database_entries` 
> >
> >We already have a Bioperl parser for that (specific to another system,
> >GARSA) that could be adapted to GUS. The problem is that we are not sure
> >which tables should be used to store those data in GUS.
> >
> >Cheers, Alberto
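
As a rough illustration of the field list quoted above, the parsed values
could be carried around in a small record type before deciding which
tables they belong in. This is only a sketch in Python: the class name,
the helper function and the choice of types are assumptions, only a subset
of the listed fields is included, and "Length" is assumed here to be the
alignment length used as the denominator for the identity fraction.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class BlastHspRecord:
    """One parsed hit/HSP; field names follow the list in the message."""
    gi: str
    accession: str
    description: str
    e_value: float
    score: float
    length: int       # assumed to be the alignment length
    identical: int
    conserved: int
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int

    @property
    def frac_identical(self) -> float:
        """Identical positions divided by the (assumed) alignment length."""
        return self.identical / self.length if self.length else 0.0


def record_from_fields(fields: Mapping[str, str]) -> BlastHspRecord:
    """Build a record from string values keyed by the names in the list."""
    return BlastHspRecord(
        gi=fields["Gi"],
        accession=fields["Accession"],
        description=fields["Description"],
        e_value=float(fields["E_value"]),
        score=float(fields["Score"]),
        length=int(fields["Length"]),
        identical=int(fields["Identical"]),
        conserved=int(fields["Conserved"]),
        query_start=int(fields["Query_Start"]),
        query_end=int(fields["Query_End"]),
        hit_start=int(fields["Hit_Start"]),
        hit_end=int(fields["Hit_End"]),
    )


# Made-up values, purely to exercise the type.
example = record_from_fields({
    "Gi": "0000001", "Accession": "XX000001", "Description": "example hit",
    "E_value": "1e-30", "Score": "250", "Length": "100",
    "Identical": "90", "Conserved": "95",
    "Query_Start": "1", "Query_End": "300",
    "Hit_Start": "10", "Hit_End": "109",
})
print(example.accession, example.frac_identical)
```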

(what is NT?)

which of these (genbank, your fasta, NRDB, NT, EST) have you loaded into 
gus?

steve

All the blastable databases I mentioned are standard databases from NCBI
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/blastdb.txt):

NT = nucleotides

~30000 entries from genbank (genbank format) are loaded into GUS now.

I'm not sure about your "NRDB". I know NR from NCBI, which is a collection
of amino acid entries; could it be the same?

Alberto

ok.  NR = NRDB

the way we have used gus with similarities is that both the query and 
subject are loaded into gus.  As thomas explained, the similarity table 
captures similarity between sequences that are in gus. 

our approach has always been to just load (warehouse) the entire subject 
database (NR, EST) that we are blasting against.

the current plugins and blastSimilarity are set up for this.

obviously, this takes a lot of disk space.  two major efficiencies that
we don't currently have plugins for would be:
  1. store in gus only a *reference* to the external sequence (i.e.,
     don't store the ACGTs).
  2. store in gus only the sequences that actually have similarities.

steve
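
Since no plugin exists for the second efficiency, the following is only a
sketch of the idea in Python, not GUS code. It assumes the BLAST hits are
available as tab-separated lines whose second column is the subject ID,
and that the subject ID is the first word of each FASTA def. line. Both
are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Set


def subject_ids_with_hits(hit_rows: Iterable[str]) -> Set[str]:
    """Collect subject IDs from 'query<TAB>subject<TAB>...' lines."""
    ids = set()
    for row in hit_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = row.split("\t")
        if len(columns) >= 2:
            ids.add(columns[1])
    return ids


def filter_fasta(fasta_lines: Iterable[str], wanted_ids: Set[str]) -> Iterator[str]:
    """Yield only the FASTA records whose first def. line word is wanted."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header_words = line[1:].split()
            keep = bool(header_words) and header_words[0] in wanted_ids
        if keep:
            yield line


hits = ["q1\tsubjA\t250", "q2\tsubjC\t90"]
fasta = [
    ">subjA first\n", "ACGT\n",
    ">subjB second\n", "GGCC\n",
    ">subjC third\n", "TTAA\n",
]
# Only subjA and subjC have hits, so subjB is dropped before loading.
print("".join(filter_fasta(fasta, subject_ids_with_hits(hits))), end="")
```

The filtered FASTA stream is what would then be handed to whatever loader
is in use, so that only sequences with similarities end up stored.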

Hey Steve, Thomas,

Thanks a lot for the tips, really helpful. Now, a few more questions:

> ok.  NR = NRDB
> 
> the way we have used gus with similarities is that both the query and 
> subject are loaded into gus.  As thomas explained, the similarity table 
> captures similarity between sequences that are in gus. 
> 
> our approach has always been to just load (warehouse) the entire subject 
> database (NR, EST) that we are blasting against.
> 
> the current plugins and blastSimilarity are set up for this.
> 
> obviously, this takes a lot of disk space.  two major efficiencies that 
> we don't currently have plugins for would be:
>   1. to only store in gus a *reference* to the external sequence (ie, 
> don't store the actgs).
>   2. only store in gus the sequences that actually have similarities

Option 2 sounds better for us, since we will be blasting against several
databases (> 10 GB databases).

What about plugins to load InterPro and "gene finder" (Glimmer, etc.)
results? Are there any at all?

Cheers, Alberto

alberto-

we've never loaded interpro, so there isn't a plugin. 

i believe plasmodb has loaded glimmer results, though i'm not sure.   i 
have asked a plasmodb developer to answer that question.

steve

I was going to give the same answer steve gave for interpro and gene
finding results.

For loading sequences into GUS, the dilemma with option 2 is: how do you
know which sequences to load at load time, which is before you actually
have the similarity results? One solution would be to load the complete
dataset(s) initially, then delete the sequences without similarities
after loading the similarity results.

-Thomas
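
The "load everything, then prune" idea can be sketched as follows, again
with Python's sqlite3 module and a hypothetical two-table schema (the
sequence and similarity tables, the dataset column and the function names
are all assumptions, not GUS tables or plugins). Only subject sequences
from the named dataset are pruned, so query sequences without hits are
left alone.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Toy schema: every sequence belongs to a dataset (e.g. 'lab', 'NR')."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sequence (
            sequence_id INTEGER PRIMARY KEY,
            dataset TEXT NOT NULL
        );
        CREATE TABLE similarity (
            similarity_id INTEGER PRIMARY KEY,
            query_id INTEGER NOT NULL REFERENCES sequence(sequence_id),
            subject_id INTEGER NOT NULL REFERENCES sequence(sequence_id)
        );
        """
    )
    return conn


def prune_unmatched_subjects(conn: sqlite3.Connection, dataset: str) -> int:
    """Delete sequences of one dataset that no similarity row uses as subject.

    Returns the number of sequences removed.
    """
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM sequence"
            " WHERE dataset = ?"
            " AND sequence_id NOT IN (SELECT subject_id FROM similarity)",
            (dataset,),
        )
    return cursor.rowcount


conn = make_demo_db()
# Step 1: load the complete datasets (one query, three NR subjects).
conn.executemany(
    "INSERT INTO sequence (sequence_id, dataset) VALUES (?, ?)",
    [(1, "lab"), (10, "NR"), (11, "NR"), (12, "NR")],
)
# Step 2: load the similarity results (only subject 10 was hit).
conn.execute("INSERT INTO similarity (query_id, subject_id) VALUES (1, 10)")
conn.commit()
# Step 3: delete the NR sequences that ended up without any similarity.
removed = prune_unmatched_subjects(conn, "NR")
remaining = [row[0] for row in conn.execute(
    "SELECT sequence_id FROM sequence ORDER BY sequence_id")]
print(removed, remaining)
```

The cost of this approach is that the full subject database still has to
be loaded once, so it saves disk space in the long run but not load time.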

On Fri, 11 Feb 2005, Steve Fischer wrote:

> alberto-
>
> we've never loaded interpro, so there isn't a plugin. 
> i believe plasmodb has loaded glimmer results, though i'm not sure.   i have 
> asked a plasmodb developer to answer that question.
>
> steve

We are doing this for Garsa (another system). Basically, we have a
BioPerl parser (Bio::SearchIO) that reads the Blast results file and
extracts all the needed info (into the "Blast_Hit" table), and also loads
into a given table (e.g. External_DB) all the sequences (in fasta format)
showing similarity to the queries. In the end, "Blast_Hit" and
"External_DB" are populated by the same script.
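
A minimal sketch of the "only store the sequences that actually have similarities" idea, written in Python rather than the BioPerl used by the Garsa parser. It assumes tab-separated Blast output whose second column is the subject identifier, and a fasta file whose headers start with that same identifier. The function names are hypothetical and nothing here touches GUS or Garsa tables.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def hit_subject_ids(blast_tab_lines: Iterable[str]) -> Set[str]:
    """Collect subject identifiers (assumed to be column 2) from tabular Blast lines."""
    ids: Set[str] = set()
    for line in blast_tab_lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs; the identifier is the first word of the header."""
    ident = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                yield ident, "".join(chunks)
            words = line[1:].split()
            ident = words[0] if words else ""
            chunks = []
        elif line and ident is not None:
            chunks.append(line)
    if ident is not None:
        yield ident, "".join(chunks)


def sequences_with_hits(fasta_lines: Iterable[str],
                        blast_tab_lines: Iterable[str]) -> Dict[str, str]:
    """Keep only the subject sequences that appear in at least one Blast hit."""
    wanted = hit_subject_ids(blast_tab_lines)
    return {ident: seq for ident, seq in read_fasta(fasta_lines) if ident in wanted}
```

Because the filter runs after the Blast search, only the subjects that were actually hit would need to be loaded, instead of the whole subject database.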

Regarding Interpro and Glimmer, the main problem is knowing which
tables we should load the parsed results into.

Alberto

On Fri, 2005-02-11 at 13:21 -0500, Y. Thomas Gan wrote:
> I was going to give the same answer steve gave for interpro and gene 
> finding results.
> 
> For loading sequences into GUS, the dilemma with option 2 is: how do you 
> know which sequence to load when you load (which is before you actually 
> have the similarity result)? One solution would be to initially load 
> complete dataset(s) but delete those without similarity after loading 
> similarity results.
> 
> -Thomas

see below

Alberto Davila wrote:

>We are doing this for Garsa (another system). Basically, we have a
>BioPerl parser (Bio::SearchIO) that reads the Blast results file and
>extracts all the needed info (into the "Blast_Hit" table), and also loads
>into a given table (e.g. External_DB) all the sequences (in fasta format)
>showing similarity to the queries. In the end, "Blast_Hit" and
>"External_DB" are populated by the same script.
>
wow, great.  could you make a gus plugin from that?

>Regarding Interpro and Glimmer, the main problem is knowing which
>tables we should load the parsed results into.
>
describe the info you want to store.

steve


Hi Steve

I need to insert data (Blast results) into GUS, such as:

----------------------------------------------------
data extracted by our script
----------------------------------------------------
query_name
name
accession
description
significance
raw_score
length
num_identical
frac_identical
num_conserved
frac_conserved
start('query') 
end('query')
start('hit')
end('hit')
----------------------------------------------------

Analyzing the LoadBlastSimFast plugin, I verified that it inserts into the
tables DoTs.Similarity and DoTs.SimilaritySpan, both of which accept only
numeric data.
Are there other tables in GUS that store Blast results?

Poliana


Poliana-

the only blast plugins we have are LoadBlastSimFast and 
LoadBlastSimilarityPK.

the only tables are Similarity and SimilaritySpan

steve
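
A rough sketch, in Python, of how the fields Poliana lists could be separated into numeric values (the kind of data a similarity-style table accepts) and text values that describe the subject sequence itself. Earlier in the thread steve explains that the similarity table relates sequences already loaded into GUS, so the text fields presumably belong with the loaded sequence records. The names `NUMERIC_FIELDS`, `TEXT_FIELDS` and `split_hit` are hypothetical, `query_start` and friends stand in for `start('query')` and so on, and no real GUS column names are implied.

```python
from typing import Dict, Tuple

# Fields from the parser output that hold numbers.
NUMERIC_FIELDS = (
    "significance", "raw_score", "length",
    "num_identical", "frac_identical",
    "num_conserved", "frac_conserved",
    "query_start", "query_end", "hit_start", "hit_end",
)

# Fields that are free text and describe the query or the subject sequence.
TEXT_FIELDS = ("query_name", "name", "accession", "description")


def split_hit(hit: Dict[str, str]) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Split one parsed hit into (numeric values, descriptive text values)."""
    numeric = {key: float(hit[key]) for key in NUMERIC_FIELDS}
    text = {key: hit[key] for key in TEXT_FIELDS}
    return numeric, text
```

A missing field raises `KeyError`, which makes an incomplete parser record fail before anything is loaded.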

>>>>>>>>>>>>>>On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                            
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>poliana-
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>oops, the usage statement for LoadBlastSimFast is out of date.
>>>>>>>>>>>>>>>it should instruct you to use the blastSimilarity command.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>LoadBlastSimFast makes a big assumption, that the subject and
>>>>>>>>>>>>>>>query sequences are in GUS, and their def. lines have GUS primary
>>>>>>>>>>>>>>>keys.
>>>>>>>>>>>>>>>Are your sequences already loaded into GUS?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                              
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>They are not, there would be any howto/tips for that plugin ? We
>>>>>>>>>>>>>>will
>>>>>>>>>>>>>>certainly need a plugin to load "Interpro" and "ORF finding"
>>>>>>>>>>>>>>results
>>>>>>>>>>>>>>into GUS... If they are not available, then maybe we will have to
>>>>>>>>>>>>>>write
>>>>>>>>>>>>>>them ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Cheers, Alberto
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                            
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>steve
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Poliana Mateus wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                              
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Hello all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Where can find the script parseBlastFilesForSimilarity.pl??
>>>>>>>>>>>>>>>>I'm trying to run LoadBlastSimFast...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Poliana
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                                
>>>>>>>>>>>>>>>>

Hi Alberto,

The PlasmoDB project uses a plugin to load GlimmerM results: the
GUS::Common::Plugin::ImportPlasmoDBPrediction plugin in the Sanger CVS
repository. Please note, however, that this plugin is not generalized
and has so far been used only for the PlasmoDB project.
It would be useful to generalize it some day, so that everyone can
benefit.

Bindu

On Feb 11, 2005, at 12:44 PM, Alberto Davila wrote:

> Hey Steve, Thomas,
>
> Thanks a lot for the tips, really helpful. Now, a few more questions:
>
>> ok.  NR = NRDB
>>
>> the way we have used gus with similarities is that both the query and
>> subject are loaded into gus.  As thomas explained, the similarity table
>> captures similarity between sequences that are in gus.
>>
>> our approach has always been to just load (warehouse) the entire subject
>> database (NR, EST) that we are blasting against.
>>
>> the current plugins and blastSimilarity are set up for this.
>>
>> obviously, this takes a lot of disk space.  two major efficiencies that
>> we don't currently have plugins for would be:
>>   1. to only store in gus a *reference* to the external sequence (ie,
>>      don't store the actgs).
>>   2. only store in gus the sequences that actually have similarities
>
> Option 2 sounds better for us, since we will be blasting against several
> databases (> 10GB databases)
>
> What about the plugins to load Interpro and "gene finder" (glimmer, etc)
> results? Are there any at all?
>
> Cheers, Alberto
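
Option 2 in the quoted message (store only the subject sequences that actually have similarities) could be approached roughly as sketched below. This is not a GUS plugin and does not use the blastSimilarity output format. It assumes, purely for illustration, that the BLAST hits are available as tab-separated rows with the subject ID in the second column and that the subject database is available as FASTA text. The function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def subject_ids_with_hits(tabular_lines: Iterable[str]) -> Set[str]:
    """Collect subject IDs from tab-separated hit rows.

    Assumption: column 1 is the query ID and column 2 is the subject ID.
    Blank lines and lines starting with '#' are skipped.
    """
    ids: Set[str] = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (defline without '>', sequence) pairs from FASTA text lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def sequences_to_load(
    fasta_lines: Iterable[str], tabular_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Keep only FASTA records whose ID (first word of the defline) has a hit."""
    wanted = subject_ids_with_hits(tabular_lines)
    kept = []
    for header, sequence in read_fasta(fasta_lines):
        words = header.split()
        seq_id = words[0] if words else ""
        if seq_id in wanted:
            kept.append((header, sequence))
    return kept


if __name__ == "__main__":
    # Made-up example data, not real BLAST output.
    hits = ["q1\tsubjA\t98.5", "q2\tsubjC\t77.0", "# comment line"]
    fasta = [">subjA first", "ACGT", "AC", ">subjB second", "GGGG",
             ">subjC third", "TTTT"]
    for header, sequence in sequences_to_load(fasta, hits):
        print(header, sequence)
```

Only the records for `subjA` and `subjC` would then need to be loaded. Note that, per Steve's earlier point, LoadBlastSimFast still expects the definition lines to carry GUS primary keys, so a filtered subset would have to be loaded into GUS before the similarities themselves.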

Thanks Bindu,

We will have a look at it. Also, I just found a BioPerl module for
GlimmerM:

http://doc.bioperl.org/bioperl-live/Bio/Tools/Glimmer.html

Cheers, Alberto

On Mon, 2005-02-14 at 14:45 -0500, Bindu Gajria wrote:
> hi Alberto -
> PlasmoDB project uses a plugin to load the GlimmerM results; it is
> GUS::Common::Plugin::ImportPlasmoDBPrediction plugin in the Sanger cvs 
> repository. however, please note that this plugin is not generalized, 
> and has been used here only for the PlasmoDB project so far.
> It would be useful to generalize this plugin some day, so that all can 
> benefit.
> 
> Bindu