From: davila <da...@io...> - 2005-02-14 15:58:26
Hey Ed,

Great, I will look forward to it... Poliana has just started looking at the code, since we are in a rush to meet some deadlines. Anyway, she will contact you by Friday to check your progress with the document ;-)

We are learning little by little about genomics databases (not too bad), and then "hope" to motivate my colleagues (the real DB experts, not beginners like me) at the Federal University of Rio de Janeiro and IME to offer a course on "Genomic Databases" as part of the graduate programme for the second half of 2005. GUS and Chado schemas should (hopefully) be a topic.

Alberto

-----Original Message-----
From: Ed Robinson [mailto:ero...@ug...]
Sent: Mon 2/14/2005 11:46 AM
To: davila; Steve Fischer
Cc: Y. Thomas Gan; Poliana Mateus; gus...@li...
Subject: Re: RES: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl

Alberto,

Poliana may be interested in a GUS developer's guide I am trying to write this week for the course. I just went through the nightmare of learning how to correctly write GUS plugins against a completely undocumented API, with little help or pointers to where that API can be found in the source code. There is a plugin description on the wiki, but absolutely NO API documentation for the GUS Model. By Friday I should have a document written for this, with an API for the Plugin class and a general API for the GUS Model. It will also include other points on debugging GUS and some best practices I have collected in my notes.

-Ed

---- Original message ----
>Date: Sun, 13 Feb 2005 18:01:22 -0300
>From: "davila" <da...@io...>
>Subject: RES: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl
>To: "Steve Fischer" <sfi...@pc...>
>Cc: "Y. Thomas Gan" <yon...@pc...>, "Poliana Mateus" <pol...@gm...>, <gus...@li...>
>
>Steve,
>* see below
>
>Alberto Davila wrote:
>
>>We are doing this for Garsa (another system) ..
>>basically we have a
>>bioperl parser (Bio::SearchIO) that reads the Blast results file and
>>extracts all the needed info (into the "Blast_Hit" table)... and also loads
>>into a given table (eg: External_DB) all the sequences (in fasta format)
>>presenting similarity with the queries... at the end we have "Blast_Hit"
>>and "External_DB" populated by the same script.
>>
>wow, great. could you make a gus plugin from that?
>
>Should not be a big problem, I will ask Poliana to do that... she may occasionally contact you asking for some details... in the end we will put the things being debugged/developed by us at www.biowebdb.org and also provide them to any interested people. In an ideal world, nobody should suffer twice with the same "bug" ;-)
>
>>Regarding Interpro and Glimmer, the main problem is to know in which
>>tables we should load the parsed results ?
>>
>* describe the info you want to store.
>
>Basically this:
>
>Frame_Hit, Method, Method_Accession, Accession, Hit_Status, Query_Start, Query_End, Description, E_value
>
>Again, I am asking Poliana to take care of that.
>
>* steve
>
>Cheers, Alberto
>
>
>Alberto M. R. Dávila, PhD
>Kinetoplastid Biology and Disease (Biomed Central)
>http://www.kinetoplastids.com
>http://www.darwin.fiocruz.br
>DBBM / Instituto Oswaldo Cruz / FIOCRUZ
>Av. Brasil 4365
>Rio de Janeiro, RJ, Brasil
>CEP 21045-900
>Email: da...@fi...
>       amr...@ya...
>Phone: 55-21-3865-8229 / 3865-8206
>Fax: 55-21-2590-3495
>-------------------------------------------------
>The BiowebDB consortium: http://www.biowebdb.org
>
>>Alberto
>>
>>On Fri, 2005-02-11 at 13:21 -0500, Y. Thomas Gan wrote:
>>
>>>I was going to give the same answer steve gave for interpro and gene
>>>finding results.
>>>
>>>For loading sequences into GUS, the dilemma with option 2 is: how do you
>>>know which sequence to load when you load (which is before you actually
>>>have the similarity result)?
>>>One solution would be to initially load
>>>complete dataset(s) but delete those without similarity after loading
>>>similarity results.
>>>
>>>-Thomas
>>>
>>>On Fri, 11 Feb 2005, Steve Fischer wrote:
>>>
>>>>alberto-
>>>>
>>>>we've never loaded interpro, so there isn't a plugin.
>>>>i believe plasmodb has loaded glimmer results, though i'm not sure. i have
>>>>asked a plasmodb developer to answer that question.
>>>>
>>>>steve
>>>>
>>>>Alberto Davila wrote:
>>>>
>>>>>Hey Steve, Thomas,
>>>>>
>>>>>Thanks a lot for the tips, really helpful.. now, a few more questions:
>>>>>
>>>>>>ok. NR = NRDB
>>>>>>
>>>>>>the way we have used gus with similarities is that both the query and
>>>>>>subject are loaded into gus. As thomas explained, the similarity table
>>>>>>captures similarity between sequences that are in gus.
>>>>>>our approach has always been to just load (warehouse) the entire subject
>>>>>>database (NR, EST) that we are blasting against.
>>>>>>
>>>>>>the current plugins and blastSimilarity are set up for this.
>>>>>>
>>>>>>obviously, this takes a lot of disk space. two major efficiencies that we
>>>>>>don't currently have plugins for would be:
>>>>>> 1. to only store in gus a *reference* to the external sequence (ie, don't
>>>>>>store the actgs).
>>>>>> 2. only store in gus the sequences that actually have similarities
>>>>>>
>>>>>Option 2 sounds better for us, since we will be blasting against several
>>>>>databases (> 10GB databases)
>>>>>
>>>>>What about the plugins to load Interpro and "gene finder" (glimmer, etc)
>>>>>results? Are there any at all?
>>>>>
>>>>>Cheers, Alberto
>>>>>
>>>>>>steve
>>>>>>
>>>>>>Alberto Davila wrote:
>>>>>>
>>>>>>>All the blastable databases I mentioned are standard databases from NCBI
>>>>>>>(ftp://ftp.ncbi.nlm.nih.gov/blast/db/blastdb.txt):
>>>>>>>
>>>>>>>NT = nucleotides
>>>>>>>
>>>>>>>~30000 entries from genbank (genbank format) are loaded into GUS now.
>>>>>>>
>>>>>>>Not sure about your "NRDB". I know NR from NCBI, which is a collection of
>>>>>>>amino acid entries; could it be the same?
>>>>>>>
>>>>>>>Alberto
>>>>>>>
>>>>>>>On Fri, 2005-02-11 at 10:43 -0500, Steve Fischer wrote:
>>>>>>>
>>>>>>>>(what is NT?)
>>>>>>>>
>>>>>>>>which of these (genbank, your fasta, NRDB, NT, EST) have you loaded into
>>>>>>>>gus?
>>>>>>>>
>>>>>>>>steve
>>>>>>>>
>>>>>>>>Alberto Davila wrote:
>>>>>>>>
>>>>>>>>>Query:
>>>>>>>>>
>>>>>>>>>Either sequences from genbank (genbank format) or sequences generated in
>>>>>>>>>the lab (fasta format)
>>>>>>>>>
>>>>>>>>>Blastable databases (all are formatted databases from NCBI):
>>>>>>>>>
>>>>>>>>>NR
>>>>>>>>>NT
>>>>>>>>>EST
>>>>>>>>>
>>>>>>>>>Alberto
>>>>>>>>>
>>>>>>>>>On Fri, 2005-02-11 at 10:34 -0500, Steve Fischer wrote:
>>>>>>>>>
>>>>>>>>>>for the blast, what are the query sequences and what are the blastable
>>>>>>>>>>databases?
>>>>>>>>>>
>>>>>>>>>>steve
>>>>>>>>>>
>>>>>>>>>>Alberto Davila wrote:
>>>>>>>>>>
>>>>>>>>>>>Basically we will use sequences (loaded into GUS with the GBParser) for
>>>>>>>>>>>NCBI Blast (Blastx, Blastp and TBlastX); the same sequences will also be
>>>>>>>>>>>used for Interpro analyses. Results of both (Blast and Interpro) will be
>>>>>>>>>>>loaded into GUS.
>>>>>>>>>>>We will parse specific things from the Blast results, I
>>>>>>>>>>>would say:
>>>>>>>>>>>
>>>>>>>>>>>`Gi` `Accession` `Description` `E_value` `Score` `Length`
>>>>>>>>>>>`Frame_Query` `Frame_Hit` `Identical` `Hsp_Frac_Identical`
>>>>>>>>>>>`Conserved` `Hsp_Frac_Conserved` `Query_Start`
>>>>>>>>>>>`Query_End` `Hit_Start` `Hit_End` `Hsp_Align` `database_letters`
>>>>>>>>>>>`database_entries`
>>>>>>>>>>>
>>>>>>>>>>>We already have a Bioperl parser for that (specific to another system,
>>>>>>>>>>>GARSA) that could be adapted to GUS; the problem is that we are not sure
>>>>>>>>>>>which tables should be used to store those data in GUS.
>>>>>>>>>>>
>>>>>>>>>>>Cheers, Alberto
>>>>>>>>>>>
>>>>>>>>>>>On Fri, 2005-02-11 at 10:06 -0500, Steve Fischer wrote:
>>>>>>>>>>>
>>>>>>>>>>>>what are you planning on blasting?
>>>>>>>>>>>>
>>>>>>>>>>>>steve
>>>>>>>>>>>>
>>>>>>>>>>>>Alberto Davila wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>Hi Steve,
>>>>>>>>>>>>>
>>>>>>>>>>>>>On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>poliana-
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>oops, the usage statement for LoadBlastSimFast is out of date.
>>>>>>>>>>>>>>it should instruct you to use the blastSimilarity command.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>LoadBlastSimFast makes a big assumption, that the subject and
>>>>>>>>>>>>>>query sequences are in GUS, and their def. lines have GUS primary
>>>>>>>>>>>>>>keys.
>>>>>>>>>>>>>>Are your sequences already loaded into GUS?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>They are not. Are there any howto/tips for that plugin?
>>>>>>>>>>>>>We will
>>>>>>>>>>>>>certainly need a plugin to load "Interpro" and "ORF finding" results
>>>>>>>>>>>>>into GUS... If they are not available, then maybe we will have to write
>>>>>>>>>>>>>them ...
>>>>>>>>>>>>>
>>>>>>>>>>>>>Cheers, Alberto
>>>>>>>>>>>>>
>>>>>>>>>>>>>>steve
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Poliana Mateus wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Hello all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Where can I find the script parseBlastFilesForSimilarity.pl?
>>>>>>>>>>>>>>>I'm trying to run LoadBlastSimFast...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Poliana
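
The "option 2" idea discussed in this thread (parse the Blast results first, then keep only the subject sequences that actually have hits) can be sketched roughly as below. This is only an illustration in Python, not the GARSA Bioperl parser and not a GUS plugin. It assumes the Blast results are in the standard 12-column tabular format; the function names (`parse_tabular_hits`, `read_fasta`, `subjects_with_hits`), the dictionary keys and the e-value cutoff are all made up for the example, and nothing here touches the GUS schema or its tables.

```python
import io

# Column order of the standard 12-column BLAST tabular output.
# The key names are hypothetical labels chosen for this sketch.
TABULAR_FIELDS = (
    "query_id", "subject_id", "pct_identity", "length", "mismatches",
    "gap_opens", "query_start", "query_end", "hit_start", "hit_end",
    "e_value", "bit_score",
)
INT_FIELDS = ("length", "mismatches", "gap_opens",
              "query_start", "query_end", "hit_start", "hit_end")


def parse_tabular_hits(handle, max_e_value=1e-5):
    """Yield one dict per hit line whose e-value is at or below the cutoff."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        values = line.split("\t")
        if len(values) < len(TABULAR_FIELDS):
            continue  # not a complete hit line
        hit = dict(zip(TABULAR_FIELDS, values))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        hit["pct_identity"] = float(hit["pct_identity"])
        hit["bit_score"] = float(hit["bit_score"])
        e_value = hit["e_value"]
        if e_value.startswith("e"):
            e_value = "1" + e_value  # tolerate values written like "e-102"
        hit["e_value"] = float(e_value)
        if hit["e_value"] <= max_e_value:
            yield hit


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def subjects_with_hits(hits, fasta_handle):
    """Return only the subject sequences that appear in at least one hit."""
    wanted = {hit["subject_id"] for hit in hits}
    return [(seq_id, seq) for seq_id, seq in read_fasta(fasta_handle)
            if seq_id in wanted]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for a results file and a subject database.
    results = io.StringIO("\t".join(
        ["q1", "subjA", "98.5", "200", "3", "0",
         "1", "200", "5", "204", "1e-50", "380"]) + "\n")
    database = io.StringIO(">subjA example\nACGT\n>subjB example\nTTTT\n")
    hits = list(parse_tabular_hits(results))
    for seq_id, seq in subjects_with_hits(hits, database):
        print(seq_id, len(seq))
```

Reading the hits first and filtering the subject file afterwards avoids Thomas's dilemma of having to know which sequences to load before the similarity results exist, at the cost of one extra pass over the subject database.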