From: Steve F. <sfi...@pc...> - 2005-02-11 18:35:29
I think the way to do it would be to make a new plugin, LoadSimilaritySequences. It would be like LoadBlastSimFast and LoadBlastSimilaritiesPK, in that it would read the output of blastSimilarity. But, unlike them, the subject sequences would have source_ids, not na_sequence_ids. (the query sequences would still be stored in gus and extracted using dumpSequencesFromTable)

the plugin would:
 - take as an argument the ExternalDatabase and its Version (eg, NRDB 1.3)
 - call the plugin superclass's getExtDbRelId() to get the external_database_release_id.
 - use that id to query to get all (source_id, na_sequence_id) pairs that exist for that external database release
 - put those pairs in a hash, with source_id as key
 - take as an argument the file holding the similarities
 - optionally take as an argument a fasta.gz file holding the subject database.
 - run through all the similarities in the input file.
 - if a subject sequence is not already in the db (not found in the hash), add it (optionally including the actgs if the fasta file is provided)
 - then, use that sequence's na_sequence_id to form the Similarity

steve

Y. Thomas Gan wrote:

> I was going to give the same answer steve gave for interpro and gene
> finding results.
>
> For loading sequences into GUS, the dilemma with option 2 is: how do
> you know which sequence to load when you load (which is before you
> actually have the similarity result)? One solution would be to
> initially load complete dataset(s) but delete those without similarity
> after loading similarity results.
>
> -Thomas
>
> On Fri, 11 Feb 2005, Steve Fischer wrote:
>
>> alberto-
>>
>> we've never loaded interpro, so there isn't a plugin. i believe
>> plasmodb has loaded glimmer results, though i'm not sure. i have
>> asked a plasmodb developer to answer that question.
>>
>> steve
>>
>> Alberto Davila wrote:
>>
>>> Hey Steve, Thomas,
>>>
>>> Thanks a lot for the tips, really helpful.. now, few more questions:
>>>
>>>> ok.
>>>> NR = NRDB
>>>>
>>>> the way we have used gus with similarities is that both the query
>>>> and subject are loaded into gus. As thomas explained, the
>>>> similarity table captures similarity between sequences that are in
>>>> gus. our approach has always been to just load (warehouse) the
>>>> entire subject database (NR, EST) that we are blasting against.
>>>>
>>>> the current plugins and blastSimilarity are set up for this.
>>>>
>>>> obviously, this takes a lot of disk space. two major efficiencies
>>>> that we don't currently have plugins for would be:
>>>> 1. to only store in gus a *reference* to the external sequence
>>>>    (ie, don't store the actgs).
>>>> 2. only store in gus the sequences that actually have similarities
>>>>
>>>
>>> Option 2 sounds better for us, since we will be blasting against several
>>> databases (> 10GB databases)
>>>
>>> What about the plugins to load Interpro and "gene finder" (glimmer,
>>> etc) results? Are there any at all?
>>>
>>> Cheers, Alberto
>>>
>>>> steve
>>>>
>>>> Alberto Davila wrote:
>>>>
>>>>> All the blastable databases I mentioned are standard databases
>>>>> from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/blastdb.txt):
>>>>>
>>>>> NT = nucleotides
>>>>>
>>>>> ~30000 entries from genbank (genbank format) are loaded into GUS now.
>>>>>
>>>>> Not sure about your "NRDB", I know NR from NCBI that is a
>>>>> collection of amino acid entries, could it be the same?
>>>>>
>>>>> Alberto
>>>>>
>>>>> On Fri, 2005-02-11 at 10:43 -0500, Steve Fischer wrote:
>>>>>
>>>>>> (what is NT?)
>>>>>>
>>>>>> which of these (genbank, your fasta, NRDB, NT, EST) have you
>>>>>> loaded into gus?
>>>>>>
>>>>>> steve
>>>>>>
>>>>>> Alberto Davila wrote:
>>>>>>
>>>>>>> Query:
>>>>>>>
>>>>>>> Either sequences from genbank (genbank format) or sequences
>>>>>>> generated in the lab (fasta format)
>>>>>>>
>>>>>>> Blastable databases (all are formatted databases from NCBI):
>>>>>>>
>>>>>>> NR
>>>>>>> NT
>>>>>>> EST
>>>>>>>
>>>>>>> Alberto
>>>>>>>
>>>>>>> On Fri, 2005-02-11 at 10:34 -0500, Steve Fischer wrote:
>>>>>>>
>>>>>>>> for the blast, what are the query sequences and what are the
>>>>>>>> blastable databases?
>>>>>>>>
>>>>>>>> steve
>>>>>>>>
>>>>>>>> Alberto Davila wrote:
>>>>>>>>
>>>>>>>>> Basically we will use sequences (loaded into GUS with the
>>>>>>>>> GBParser) for NCBI Blast (Blastx, Blastp and TBlastX), the same
>>>>>>>>> sequences will be also used for Interpro analyses. Results of
>>>>>>>>> both (Blast and Interpro) will be loaded into GUS. We will parse
>>>>>>>>> specific things from the Blast results, I would say:
>>>>>>>>>
>>>>>>>>> `Gi` `Accession` `Description` `E_value` `Score` `Length`
>>>>>>>>> `Frame_Query` `Frame_Hit` `Identical` `Hsp_Frac_Identical`
>>>>>>>>> `Conserved` `Hsp_Frac_Conserved` `Query_Start` `Query_End`
>>>>>>>>> `Hit_Start` `Hit_End` `Hsp_Align` `database_letters`
>>>>>>>>> `database_entries`
>>>>>>>>>
>>>>>>>>> We already have a Bioperl parser for that (specific for another
>>>>>>>>> system: GARSA) that could be adapted to GUS, problem being we
>>>>>>>>> are not sure what tables should be used to store those data in
>>>>>>>>> GUS.
>>>>>>>>>
>>>>>>>>> Cheers, Alberto
>>>>>>>>>
>>>>>>>>> On Fri, 2005-02-11 at 10:06 -0500, Steve Fischer wrote:
>>>>>>>>>
>>>>>>>>>> what are you planning on blasting?
>>>>>>>>>>
>>>>>>>>>> steve
>>>>>>>>>>
>>>>>>>>>> Alberto Davila wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Steve,
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
>>>>>>>>>>>
>>>>>>>>>>>> poliana-
>>>>>>>>>>>>
>>>>>>>>>>>> oops, the usage statement for LoadBlastSimFast is out of
>>>>>>>>>>>> date. it should instruct you to use the blastSimilarity
>>>>>>>>>>>> command.
>>>>>>>>>>>>
>>>>>>>>>>>> LoadBlastSimFast makes a big assumption, that the subject
>>>>>>>>>>>> and query sequences are in GUS, and their def. lines have
>>>>>>>>>>>> GUS primary keys. Are your sequences already loaded into GUS?
>>>>>>>>>>>>
>>>>>>>>>>> They are not, there would be any howto/tips for that plugin ?
>>>>>>>>>>> We will certainly need a plugin to load "Interpro" and "ORF
>>>>>>>>>>> finding" results into GUS... If they are not available, then
>>>>>>>>>>> maybe we will have to write them ...
>>>>>>>>>>>
>>>>>>>>>>> Cheers, Alberto
>>>>>>>>>>>
>>>>>>>>>>>> steve
>>>>>>>>>>>>
>>>>>>>>>>>> Poliana Mateus wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where can find the script parseBlastFilesForSimilarity.pl??
>>>>>>>>>>>>> I'm trying to run LoadBlastSimFast...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Poliana
>>
>> _______________________________________________
>> Gusdev-gusdev mailing list
>> Gus...@li...
>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
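
For readers following the LoadSimilaritySequences proposal at the top of this message, the core lookup-or-add step can be sketched roughly as below. This is only an illustration in Python, not GUS plugin code (a real plugin would be written in Perl against the GUS object layer). Every name here (`build_subject_index`, `load_similarities`, the tuple layouts, the id counter standing in for database-assigned na_sequence_ids) is a hypothetical stand-in, and the database query and file parsing are replaced by in-memory data.

```python
from itertools import count


def build_subject_index(rows):
    """Build the source_id -> na_sequence_id hash for one external
    database release. `rows` stands in for the result of the
    (source_id, na_sequence_id) query; it is an assumption, not a GUS call."""
    return {source_id: na_sequence_id for source_id, na_sequence_id in rows}


def load_similarities(hits, subject_index, new_ids, residues=None):
    """Walk the similarities; add any subject not yet in the index.

    hits          -- iterable of (query_na_sequence_id, subject_source_id, score)
    subject_index -- dict from build_subject_index (updated in place)
    new_ids       -- iterator yielding fresh na_sequence_ids (hypothetical
                     stand-in for ids the database would assign)
    residues      -- optional dict source_id -> sequence, standing in for
                     the optional fasta.gz subject file
    """
    residues = residues or {}
    new_subjects = []   # rows that would be inserted as new subject sequences
    similarities = []   # rows that would be inserted as Similarity records
    for query_id, subject_source_id, score in hits:
        subject_id = subject_index.get(subject_source_id)
        if subject_id is None:
            # Subject not in the db yet: register it, with residues if known.
            subject_id = next(new_ids)
            subject_index[subject_source_id] = subject_id
            new_subjects.append(
                (subject_id, subject_source_id, residues.get(subject_source_id))
            )
        similarities.append((query_id, subject_id, score))
    return similarities, new_subjects


# Example with made-up ids: one subject already loaded, one new.
index = build_subject_index([("gi|111", 5001)])
hits = [(42, "gi|111", 310.0), (42, "gi|222", 120.5), (43, "gi|222", 98.0)]
sims, added = load_similarities(hits, index, count(9000), {"gi|222": "ACGT"})
```

Because the index is updated as soon as a subject is added, a subject hit by several queries is only added once, which is the point of keeping the hash keyed by source_id.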