From: Aaron J. M. <am...@pc...> - 2005-02-15 02:33:30
FYI, if you wish to store details of a pairwise sequence alignment,
it's arguably preferable not to store three separate alignment strings
(query-with-gaps, target-with-gaps and similarity string), since these
are not quite independent attributes (you cannot truly change one
without changing the others, in some sense), but rather to save only
the location of gaps (requiring one to reconstruct the various
alignment strings if/when necessary). EnsEMBL chose to use the "CIGAR"
string representation of an alignment; this format has made its way
into GFF3 as well:

http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html

The FASTA programs have a similar (but more expressive) alignment
"encoding" that includes possibilities for forward and backward
frameshifts (i.e. for protein-to-DNA alignments). Either of these
encodings is also (somewhat) more "computable" than the raw string
alignment representation.

-Aaron

On Feb 14, 2005, at 8:27 PM, davila wrote:

> Hi Steve,
>
> I would like to know if you think it would be interesting to expand the
> "Similarity and SimilaritySpan" tables? Some blast results,
> eg: query_string, hit_string, homology_string and alignment don't
> appear to be represented in those tables (of course, I might be
> wrong)...
>
> Ideally, those tables should be able to store most data parsed from
> Blast results; an example of the most important data is listed in the
> Bio::SearchIO system of Bioperl:
> http://bioperl.org/HOWTOs/SearchIO/use.html
>
> Cheers, Alberto
>
>
> -----Original Message-----
> From: Steve Fischer [mailto:sfi...@pc...]
> Sent: Mon 14/2/2005 17:53
> To: Poliana Mateus
> Cc: davila; gus...@li...
> Subject: Re: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl
>
> Poliana-
>
> the only blast plugins we have are LoadBlastSimFast and
> LoadBlastSimilarityPK.
>
> the only tables are Similarity and SimilaritySpan
>
> steve
>
> Poliana Mateus wrote:
>
> >Hi Steve
> >
> >I need to insert data into GUS (BLAST results) such as:
> >
> >----------------------------------------------------
> >data extracted by our script
> >----------------------------------------------------
> >query_name
> >name
> >accession
> >description
> >significance
> >raw_score
> >length
> >num_identical
> >frac_identical
> >num_conserved
> >frac_conserved
> >start('query')
> >end('query')
> >start('hit')
> >end('hit')
> >----------------------------------------------------
> >
> >Analyzing the LoadBlastSimFast Plugin I verified that it inserts into
> >tables DoTs.Similarity and DoTs.SimilaritySpan; both only accept
> >numeric data.
> >Are there other tables in GUS that store BLAST results?
> >
> >Poliana
> >
> >On Fri, 11 Feb 2005 13:50:32 -0500, Steve Fischer
> ><sfi...@pc...> wrote:
> >
> >>see below
> >>
> >>Alberto Davila wrote:
> >>
> >>>We are doing this for Garsa (another system) .. basically we have a
> >>>bioperl parser (Bio::SearchIO) that reads the Blast results file and
> >>>extracts all the needed info (to the "Blast_Hit" table)... and also loads
> >>>into a given table (eg: External_DB) all the sequences (in fasta format)
> >>>presenting similarity with the queries... at the end we have "Blast_Hit"
> >>>and "External_DB" populated with the same script.
> >>>
> >>wow, great. could you make a gus plugin from that?
> >>
> >>>Regarding Interpro and Glimmer, the main problem is to know in which
> >>>tables we should load the parsed results ?
> >>>
> >>describe the info you want to store.
> >>
> >>steve
> >>
> >>>Alberto
> >>>
> >>>On Fri, 2005-02-11 at 13:21 -0500, Y. Thomas Gan wrote:
> >>>
> >>>>I was going to give the same answer steve gave for interpro and gene
> >>>>finding results.
> >>>>
> >>>>For loading sequences into GUS, the dilemma with option 2 is: how do you
> >>>>know which sequence to load when you load (which is before you actually
> >>>>have the similarity result)? One solution would be to initially load
> >>>>complete dataset(s) but delete those without similarity after loading
> >>>>similarity results.
> >>>>
> >>>>-Thomas
> >>>>
> >>>>On Fri, 11 Feb 2005, Steve Fischer wrote:
> >>>>
> >>>>>alberto-
> >>>>>
> >>>>>we've never loaded interpro, so there isn't a plugin.
> >>>>>i believe plasmodb has loaded glimmer results, though i'm not sure. i have
> >>>>>asked a plasmodb developer to answer that question.
> >>>>>
> >>>>>steve
> >>>>>
> >>>>>Alberto Davila wrote:
> >>>>>
> >>>>>>Hey Steve, Thomas,
> >>>>>>
> >>>>>>Thanks a lot for the tips, really helpful.. now, a few more questions:
> >>>>>>
> >>>>>>>ok. NR = NRDB
> >>>>>>>
> >>>>>>>the way we have used gus with similarities is that both the query and
> >>>>>>>subject are loaded into gus. As thomas explained, the similarity table
> >>>>>>>captures similarity between sequences that are in gus.
> >>>>>>>our approach has always been to just load (warehouse) the entire subject
> >>>>>>>database (NR, EST) that we are blasting against.
> >>>>>>>
> >>>>>>>the current plugins and blastSimilarity are set up for this.
> >>>>>>>
> >>>>>>>obviously, this takes a lot of disk space. two major efficiencies that we
> >>>>>>>don't currently have plugins for would be:
> >>>>>>>1. to only store in gus a *reference* to the external sequence (ie, don't
> >>>>>>>store the actgs).
> >>>>>>>2. only store in gus the sequences that actually have similarities
> >>>>>>>
> >>>>>>Option 2 sounds better for us, since we will be blasting against several
> >>>>>>databases (> 10GB databases)
> >>>>>>
> >>>>>>What about the plugins to load Interpro and "gene finder" (glimmer, etc)
> >>>>>>results ? Are there any at all ?
> >>>>>>
> >>>>>>Cheers, Alberto
> >>>>>>
> >>>>>>>steve
> >>>>>>>
> >>>>>>>Alberto Davila wrote:
> >>>>>>>
> >>>>>>>>All the blastable databases I mentioned are standard databases from NCBI
> >>>>>>>>(ftp://ftp.ncbi.nlm.nih.gov/blast/db/blastdb.txt):
> >>>>>>>>
> >>>>>>>>NT = nucleotides
> >>>>>>>>
> >>>>>>>>~30000 entries from genbank (genbank format) are loaded into GUS now.
> >>>>>>>>
> >>>>>>>>Not sure about your "NRDB", I know NR from NCBI that is a collection of
> >>>>>>>>amino acid entries, could it be the same ?
> >>>>>>>>
> >>>>>>>>Alberto
> >>>>>>>>
> >>>>>>>>On Fri, 2005-02-11 at 10:43 -0500, Steve Fischer wrote:
> >>>>>>>>
> >>>>>>>>>(what is NT?)
> >>>>>>>>>
> >>>>>>>>>which of these (genbank, your fasta, NRDB, NT, EST) have you loaded into
> >>>>>>>>>gus?
> >>>>>>>>>
> >>>>>>>>>steve
> >>>>>>>>>
> >>>>>>>>>Alberto Davila wrote:
> >>>>>>>>>
> >>>>>>>>>>Query:
> >>>>>>>>>>
> >>>>>>>>>>Either sequences from genbank (genbank format) or sequences generated in
> >>>>>>>>>>the lab (fasta format)
> >>>>>>>>>>
> >>>>>>>>>>Blastable databases (all are formatted databases from NCBI):
> >>>>>>>>>>
> >>>>>>>>>>NR
> >>>>>>>>>>NT
> >>>>>>>>>>EST
> >>>>>>>>>>
> >>>>>>>>>>Alberto
> >>>>>>>>>>
> >>>>>>>>>>On Fri, 2005-02-11 at 10:34 -0500, Steve Fischer wrote:
> >>>>>>>>>>
> >>>>>>>>>>>for the blast, what are the query sequences and what are the blastable
> >>>>>>>>>>>databases?
> >>>>>>>>>>>
> >>>>>>>>>>>steve
> >>>>>>>>>>>
> >>>>>>>>>>>Alberto Davila wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>Basically we will use sequences (loaded into GUS with the GBParser) for
> >>>>>>>>>>>>NCBI Blast (Blastx, Blastp and TBlastX), the same sequences will also be
> >>>>>>>>>>>>used for Interpro analyses. Results of both (Blast and Interpro) will be
> >>>>>>>>>>>>loaded into GUS. We will parse specific things from the Blast results, I
> >>>>>>>>>>>>would say:
> >>>>>>>>>>>>
> >>>>>>>>>>>>`Gi` `Accession` `Description` `E_value` `Score` `Length`
> >>>>>>>>>>>>`Frame_Query` `Frame_Hit` `Identical` `Hsp_Frac_Identical`
> >>>>>>>>>>>>`Conserved` `Hsp_Frac_Conserved`
> >>>>>>>>>>>>`Query_Start`
> >>>>>>>>>>>>`Query_End` `Hit_Start` `Hit_End` `Hsp_Align` `database_letters`
> >>>>>>>>>>>>`database_entries`
> >>>>>>>>>>>>We already have a Bioperl parser for that (specific for another system:
> >>>>>>>>>>>>GARSA) that could be adapted to GUS, problem being we are not sure what
> >>>>>>>>>>>>tables should be used to store those data in GUS.
> >>>>>>>>>>>>
> >>>>>>>>>>>>Cheers, Alberto
> >>>>>>>>>>>>
> >>>>>>>>>>>>On Fri, 2005-02-11 at 10:06 -0500, Steve Fischer wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>>what are you planning on blasting?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>steve
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Alberto Davila wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>Hi Steve,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>poliana-
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>oops, the usage statement for LoadBlastSimFast is out of date.
> >>>>>>>>>>>>>>>it should instruct you to use the blastSimilarity command.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>LoadBlastSimFast makes a big assumption, that the subject and
> >>>>>>>>>>>>>>>query sequences are in GUS, and their def. lines have GUS primary
> >>>>>>>>>>>>>>>keys.
> >>>>>>>>>>>>>>>Are your sequences already loaded into GUS?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>They are not. Would there be any howto/tips for that plugin ? We will
> >>>>>>>>>>>>>>certainly need a plugin to load "Interpro" and "ORF finding" results
> >>>>>>>>>>>>>>into GUS... If they are not available, then maybe we will have to write
> >>>>>>>>>>>>>>them ...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>Cheers, Alberto
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>steve
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Poliana Mateus wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>Hello all,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>Where can I find the script parseBlastFilesForSimilarity.pl?
> >>>>>>>>>>>>>>>>I'm trying to run LoadBlastSimFast...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>Poliana
> >>>>>>>>>>>>>>>>

--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  am...@pc...
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA 19104-6017      fax:    215-746-6697
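
To make the gap-location idea at the top of this message concrete, here is a minimal Python sketch of a CIGAR-like encoding. It is only an illustration under stated assumptions: the function names are hypothetical, the operations are limited to M (aligned pair), I (residue present in the query only) and D (residue present in the target only), and every run is written as count followed by letter. The exact conventions used by EnsEMBL and by GFF3 may differ in detail and are not reproduced here, and the frameshift operations of the FASTA encoding are not covered.

```python
import re
from itertools import groupby


def alignment_to_cigar(query_aln: str, target_aln: str, gap: str = "-") -> str:
    """Encode two gapped alignment strings as run-length M/I/D operations."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    ops = []
    for q, t in zip(query_aln, target_aln):
        if q == gap and t == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if t == gap:
            ops.append("I")  # residue in the query only (gap in the target)
        elif q == gap:
            ops.append("D")  # residue in the target only (gap in the query)
        else:
            ops.append("M")  # aligned pair, identical or not
    # Collapse consecutive identical operations into "<count><op>" runs.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def cigar_to_alignment(cigar: str, query: str, target: str, gap: str = "-"):
    """Rebuild the gapped strings from the ungapped sequences and the CIGAR."""
    if not re.fullmatch(r"(\d+[MID])+", cigar):
        raise ValueError("unsupported CIGAR string")
    q_out, t_out = [], []
    qi = ti = 0  # current positions in the ungapped query and target
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if op == "M":
            q_out.append(query[qi:qi + n])
            t_out.append(target[ti:ti + n])
            qi += n
            ti += n
        elif op == "I":
            q_out.append(query[qi:qi + n])
            t_out.append(gap * n)
            qi += n
        else:  # "D"
            q_out.append(gap * n)
            t_out.append(target[ti:ti + n])
            ti += n
    if qi != len(query) or ti != len(target):
        raise ValueError("CIGAR does not span both sequences")
    return "".join(q_out), "".join(t_out)


def similarity_string(query_aln: str, target_aln: str, gap: str = "-") -> str:
    """Derive a simple identity line: '|' for identical residues, ' ' otherwise."""
    return "".join(
        "|" if q == t and q != gap else " " for q, t in zip(query_aln, target_aln)
    )


if __name__ == "__main__":
    cigar = alignment_to_cigar("ACGT-ACGTT", "ACGTTAC--T")
    q_aln, t_aln = cigar_to_alignment(cigar, "ACGTACGTT", "ACGTTACT")
    print(cigar)
    print(q_aln)
    print(similarity_string(q_aln, t_aln))
    print(t_aln)
```

With this layout a database row would hold only the two ungapped sequences (or references to them), their start and end coordinates, and the short encoded string. The three display strings are derived on demand, so they cannot drift out of sync with one another.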