Re: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl

I think the way to do it would be to make a new plugin 
LoadSimilaritySequences.

It would be like LoadBlastSimFast and LoadBlastSimilaritiesPK, in that 
it would read the output of blastSimilarity.  But, unlike them, the 
subject sequences would have source_ids, not na_sequence_ids.  (The 
query sequences would still be stored in gus and extracted using 
dumpSequencesFromTable.)

the plugin would:
  - take as an argument the ExternalDatabase and its Version (eg, NRDB 1.3)
  - call the plugin superclass's getExtDbRelId() to get the 
    external_database_release_id
  - use that id to query to get all (source_id, na_sequence_id) pairs 
    that exist for that external database release
  - put that in a hash, with source_id as key
  - take as an argument the file holding the similarities
  - optionally take as an argument a fasta.gz file holding the subject 
    database
  - run through all the similarities in the input file
  - if a subject sequence is not already in the db (not found in hash), 
    add it (optionally including the actgs if the fasta file is provided)
  - then, use that sequence's na_sequence_id to form the Similarity
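
to make the flow above concrete, here is a rough sketch in Python (not an actual GUS plugin). everything in it is an assumption for illustration: the function name, the tuple layouts, and the way new ids are handed out are hypothetical, the database is replaced by in-memory lists, and the similarities are assumed to be already parsed into (query na_sequence_id, subject source_id, score) tuples rather than read from real blastSimilarity output.

```python
from itertools import count


def load_similarity_sequences(existing_pairs, similarities, subject_seqs=None):
    """Hypothetical sketch of the proposed LoadSimilaritySequences flow.

    existing_pairs: (source_id, na_sequence_id) pairs already stored for the
                    external database release (stand-in for the db query).
    similarities:   (query_na_sequence_id, subject_source_id, score) tuples
                    (stand-in for the parsed similarity file).
    subject_seqs:   optional dict of source_id -> residues (stand-in for the
                    optional fasta.gz file holding the subject database).
    """
    # the hash with source_id as key
    id_by_source = dict(existing_pairs)

    # assumption: new ids simply continue after the largest known id;
    # a real database would assign them from its own sequence
    next_id = count(max(id_by_source.values(), default=0) + 1)

    new_sequences = []    # subject rows that would be inserted
    similarity_rows = []  # similarity rows that would be inserted

    for query_id, source_id, score in similarities:
        subject_id = id_by_source.get(source_id)
        if subject_id is None:
            # subject not in the db yet: add it, with residues if we have them
            subject_id = next(next_id)
            residues = subject_seqs.get(source_id) if subject_seqs else None
            new_sequences.append((subject_id, source_id, residues))
            id_by_source[source_id] = subject_id
        # form the similarity using the subject's na_sequence_id
        similarity_rows.append((query_id, subject_id, score))

    return new_sequences, similarity_rows


# made-up example data
new_seqs, sims = load_similarity_sequences(
    existing_pairs=[("gi|1", 100)],
    similarities=[(7, "gi|1", 55.0), (7, "gi|2", 42.0), (8, "gi|2", 40.0)],
    subject_seqs={"gi|2": "ACGT"},
)
```

the point of updating the hash as new subjects are added is that a subject hit by several queries is only stored once.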

steve

Y. Thomas Gan wrote:

> I was going to give the same answer steve gave for interpro and gene 
> finding results.
>
> For loading sequences into GUS, the dilemma with option 2 is: how do 
> you know which sequences to load at load time (which is before you 
> actually have the similarity results)? One solution would be to 
> initially load the complete dataset(s) but delete the sequences without 
> similarities after loading the similarity results.
>
> -Thomas
>
> On Fri, 11 Feb 2005, Steve Fischer wrote:
>
>> alberto-
>>
>> we've never loaded interpro, so there isn't a plugin. i believe 
>> plasmodb has loaded glimmer results, though i'm not sure.   i have 
>> asked a plasmodb developer to answer that question.
>>
>> steve
>>
>> Alberto Davila wrote:
>>
>>> Hey Steve, Thomas,
>>>
>>> Thanks a lot for the tips, really helpful.. now, few more questions:
>>>
>>>
>>>> ok.  NR = NRDB
>>>>
>>>> the way we have used gus with similarities is that both the query 
>>>> and subject are loaded into gus.  As thomas explained, the 
>>>> similarity table captures similarity between sequences that are in 
>>>> gus. our approach has always been to just load (warehouse) the 
>>>> entire subject database (NR, EST) that we are blasting against.
>>>>
>>>> the current plugins and blastSimilarity are set up for this.
>>>>
>>>> obviously, this takes a lot of disk space.  two major efficiencies 
>>>> that we don't currently have plugins for would be:
>>>>  1. to only store in gus a *reference* to the external sequence 
>>>> (ie, don't store the actgs).
>>>>  2. only store in gus the sequences that actually have similarities
>>>>
>>>
>>> Option 2 sounds better for us, since we will be blasting against several
>>> databases (> 10GB databases)
>>>
>>> What about plugins to load Interpro and "gene finder" (glimmer, 
>>> etc) results? Are there any at all?
>>>
>>> Cheers, Alberto
>>>
>>>
>>>> steve
>>>>
>>>> Alberto Davila wrote:
>>>>
>>>>
>>>>> All the blastable databases I mentioned are standard databases 
>>>>> from NCBI
>>>>> (ftp://ftp.ncbi.nlm.nih.gov/blast/db/blastdb.txt):
>>>>>
>>>>> NT = nucleotides
>>>>>
>>>>> ~30000 entries from genbank (genbank format) are loaded into GUS now.
>>>>>
>>>>> Not sure about your "NRDB". I know NR from NCBI, which is a 
>>>>> collection of
>>>>> amino acid entries. Could it be the same?
>>>>>
>>>>> Alberto
>>>>>
>>>>> On Fri, 2005-02-11 at 10:43 -0500, Steve Fischer wrote:
>>>>>
>>>>>
>>>>>
>>>>>> (what is NT?)
>>>>>>
>>>>>> which of these (genbank, your fasta, NRDB, NT, EST) have you 
>>>>>> loaded into gus?
>>>>>>
>>>>>> steve
>>>>>>
>>>>>> Alberto Davila wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Query:
>>>>>>>
>>>>>>> Either sequences from genbank (genbank format) or sequences 
>>>>>>> generated in
>>>>>>> the lab (fasta format)
>>>>>>>
>>>>>>> Blastable databases (all are formatted databases from NCBI):
>>>>>>>
>>>>>>> NR
>>>>>>> NT
>>>>>>> EST
>>>>>>>
>>>>>>> Alberto
>>>>>>>
>>>>>>> On Fri, 2005-02-11 at 10:34 -0500, Steve Fischer wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> for the blast, what are the query sequences and what are the 
>>>>>>>> blastable databases?
>>>>>>>>
>>>>>>>> steve
>>>>>>>>
>>>>>>>> Alberto Davila wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Basically we will use sequences (loaded into GUS with the 
>>>>>>>>> GBParser) for
>>>>>>>>> NCBI Blast (Blastx, Blastp and TBlastX), the same sequences 
>>>>>>>>> will be also
>>>>>>>>> used for Interpro analyses. Results of both (Blast and 
>>>>>>>>> Interpro) will be
>>>>>>>>> loaded into GUS. We will parse specific things from the Blast 
>>>>>>>>> results, I
>>>>>>>>> would say:
>>>>>>>>>
>>>>>>>>> `Gi` `Accession` `Description` `E_value` `Score` `Length` 
>>>>>>>>> `Frame_Query` `Frame_Hit` `Identical` `Hsp_Frac_Identical` 
>>>>>>>>> `Conserved` `Hsp_Frac_Conserved` `Query_Start` `Query_End` 
>>>>>>>>> `Hit_Start` `Hit_End` `Hsp_Align` `database_letters` 
>>>>>>>>> `database_entries`
>>>>>>>>>
>>>>>>>>> We already have a Bioperl parser for that (specific for another 
>>>>>>>>> system: GARSA) that could be adapted to GUS, problem being we 
>>>>>>>>> are not sure what tables should be used to store those data 
>>>>>>>>> in GUS.
>>>>>>>>>
>>>>>>>>> Cheers, Alberto
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 2005-02-11 at 10:06 -0500, Steve Fischer wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> what are you planning on blasting?
>>>>>>>>>>
>>>>>>>>>> steve
>>>>>>>>>>
>>>>>>>>>> Alberto Davila wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hi Steve,
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> poliana-
>>>>>>>>>>>>
>>>>>>>>>>>> oops, the usage statement for LoadBlastSimFast is out of 
>>>>>>>>>>>> date. it should instruct you to use the blastSimilarity 
>>>>>>>>>>>> command.
>>>>>>>>>>>>
>>>>>>>>>>>> LoadBlastSimFast makes a big assumption, that the subject 
>>>>>>>>>>>> and query sequences are in GUS, and their def. lines have 
>>>>>>>>>>>> GUS primary keys. Are your sequences already loaded into GUS?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> They are not. Are there any howto/tips for that plugin? 
>>>>>>>>>>> We will
>>>>>>>>>>> certainly need a plugin to load "Interpro" and "ORF finding" 
>>>>>>>>>>> results
>>>>>>>>>>> into GUS... If they are not available, then maybe we will 
>>>>>>>>>>> have to write
>>>>>>>>>>> them ...
>>>>>>>>>>>
>>>>>>>>>>> Cheers, Alberto
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> steve
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Poliana Mateus wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where can I find the script parseBlastFilesForSimilarity.pl?
>>>>>>>>>>>>> I'm trying to run LoadBlastSimFast...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Poliana
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>
>>>>>
>>>>>