Re: RES: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl

FYI, if you wish to store details of a pairwise sequence alignment,
it's arguably preferable not to store three separate alignment strings
(query-with-gaps, target-with-gaps and similarity string), since these
are not independent attributes (you cannot really change one without
changing the others), but rather to save only the locations of the
gaps and reconstruct the various alignment strings if/when necessary.
EnsEMBL chose to use the "CIGAR" string representation of an
alignment; this format has made its way into GFF3 as well:

   http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html

The FASTA programs have a similar (but more expressive) alignment
"encoding" that can also represent forward and backward frameshifts
(i.e. for protein-to-DNA alignments).  Either of these encodings is
also (somewhat) more "computable" than the raw string alignment
representation.
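
To make the gap-only idea concrete, here is a minimal Python sketch. It is
not EnsEMBL's or GFF3's exact specification: the function names to_cigar and
from_cigar are made up for illustration, and the convention used (explicit
counts; M = aligned pair, I = residue only in the query, D = residue only in
the target) is an assumption, since tools differ on these details. It encodes
two gapped strings as one run-length string and rebuilds all three alignment
strings from the ungapped sequences.

```python
import re
from itertools import groupby

GAP = "-"

def to_cigar(query_aln, target_aln):
    """Encode two equal-length gapped strings as run-length M/I/D operations."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    ops = []
    for q, t in zip(query_aln, target_aln):
        if q == GAP and t == GAP:
            raise ValueError("column with a gap in both sequences")
        if t == GAP:
            ops.append("I")  # residue present in the query only
        elif q == GAP:
            ops.append("D")  # residue present in the target only
        else:
            ops.append("M")  # aligned pair (match or mismatch)
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))

def from_cigar(cigar, query_seq, target_seq):
    """Rebuild (gapped query, gapped target, similarity line) from ungapped sequences."""
    q_out, t_out = [], []
    qi = ti = 0
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if op == "M":
            q_out.append(query_seq[qi:qi + n])
            t_out.append(target_seq[ti:ti + n])
            qi += n
            ti += n
        elif op == "I":
            q_out.append(query_seq[qi:qi + n])
            t_out.append(GAP * n)
            qi += n
        else:  # "D"
            q_out.append(GAP * n)
            t_out.append(target_seq[ti:ti + n])
            ti += n
    q_aln, t_aln = "".join(q_out), "".join(t_out)
    # Similarity line: "|" for identical residues, blank otherwise.
    sim = "".join("|" if a == b and a != GAP else " " for a, b in zip(q_aln, t_aln))
    return q_aln, t_aln, sim

cigar = to_cigar("ACG-TAC", "ACGTT-C")
print(cigar)  # 3M1D1M1I1M
print(from_cigar(cigar, "ACGTAC", "ACGTTC"))
```

Only the short operation string needs to be stored next to the start/end
coordinates; the three display strings become derived data, so they cannot
drift out of sync with one another.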

-Aaron

On Feb 14, 2005, at 8:27 PM, davila wrote:

> Hi Steve,
>
> I wonder whether you think it would be worthwhile to expand the
> "Similarity" and "SimilaritySpan" tables? Some BLAST results,
> e.g. query_string, hit_string, homology_string and alignment, don't
> appear to be represented in those tables (of course, I might be
> wrong)...
>
> Ideally, those tables should be able to store most of the data parsed
> from BLAST results; an example of the most important data is listed
> in the Bio::SearchIO system of BioPerl:
> http://bioperl.org/HOWTOs/SearchIO/use.html
>
> Cheers, Alberto
>
>
> 	-----Original Message-----
> 	From: Steve Fischer [mailto:sfi...@pc...]
> 	Sent: Mon 14/2/2005 17:53
> 	To: Poliana Mateus
> 	Cc: davila; gus...@li...
> 	Subject: Re: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl
>
>
> 	Poliana-
>
> 	the only blast plugins we have are LoadBlastSimFast and
> 	LoadBlastSimilarityPK.
>
> 	the only tables are Similarity and SimilaritySpan
>
> 	steve
>
> 	Poliana Mateus wrote:
>
> 	>Hi Steve
> 	>
> 	>I need to insert data (BLAST results) into GUS, such as:
> 	>
> 	>----------------------------------------------------
> 	>data extracted by our script
> 	>----------------------------------------------------
> 	>query_name
> 	>name
> 	>accession
> 	>description
> 	>significance
> 	>raw_score
> 	>length
> 	>num_identical
> 	>frac_identical
> 	>num_conserved
> 	>frac_conserved
> 	>start('query')
> 	>end('query')
> 	>start('hit')
> 	>end('hit')
> 	>----------------------------------------------------
> 	>
> 	>Analyzing the LoadBlastSimFast plugin, I verified that it inserts
> 	>into the tables DoTs.Similarity and DoTs.SimilaritySpan, both of
> 	>which accept only numeric data.
> 	>Are there other tables in GUS that store BLAST results?
> 	>
> 	>Poliana
> 	>
> 	>
> 	>
> 	>
> 	>
> 	>
> 	>On Fri, 11 Feb 2005 13:50:32 -0500, Steve Fischer
> 	><sfi...@pc...> wrote:
> 	>
> 	>
> 	>>see below
> 	>>
> 	>>Alberto Davila wrote:
> 	>>
> 	>>
> 	>>
> 	>>>We are doing this for Garsa (another system) .. basically we have a
> 	>>>bioperl parser (Bio::SearchIO) that reads the Blast results file and
> 	>>>extracts all the needed info (to the "Blast_Hit" table)... and also loads
> 	>>>into a given table (eg: External_DB) all the sequences (in fasta format)
> 	>>>presenting similarity with the queries... at the end we have "Blast_Hit"
> 	>>>and "External_DB" populated with the same script.
> 	>>>
> 	>>>
> 	>>>
> 	>>>
> 	>>>
> 	>>wow, great.  could you make a gus plugin from that?
> 	>>
> 	>>
> 	>>
> 	>>>Regarding Interpro and Glimmer, the main problem is to know in which
> 	>>>tables we should load the parsed results ?
> 	>>>
> 	>>>
> 	>>>
> 	>>>
> 	>>>
> 	>>describe the info you want to store.
> 	>>
> 	>>steve
> 	>>
> 	>>
> 	>>
> 	>>>Alberto
> 	>>>
> 	>>>On Fri, 2005-02-11 at 13:21 -0500, Y. Thomas Gan wrote:
> 	>>>
> 	>>>
> 	>>>
> 	>>>
> 	>>>>I was going to give the same answer steve gave for interpro and gene
> 	>>>>finding results.
> 	>>>>
> 	>>>>For loading sequences into GUS, the dilemma with option 2 is: how do you
> 	>>>>know which sequence to load when you load (which is before you actually
> 	>>>>have the similarity result)? One solution would be to initially load
> 	>>>>complete dataset(s) but delete those without similarity after loading
> 	>>>>similarity results.
> 	>>>>
> 	>>>>-Thomas
> 	>>>>
> 	>>>>On Fri, 11 Feb 2005, Steve Fischer wrote:
> 	>>>>
> 	>>>>
> 	>>>>
> 	>>>>
> 	>>>>
> 	>>>>>alberto-
> 	>>>>>
> 	>>>>>we've never loaded interpro, so there isn't a plugin.
> 	>>>>>i believe plasmodb has loaded glimmer results, though i'm not sure.   i have
> 	>>>>>asked a plasmodb developer to answer that question.
> 	>>>>>
> 	>>>>>steve
> 	>>>>>
> 	>>>>>Alberto Davila wrote:
> 	>>>>>
> 	>>>>>
> 	>>>>>
> 	>>>>>
> 	>>>>>
> 	>>>>>>Hey Steve, Thomas,
> 	>>>>>>
> 	>>>>>>Thanks a lot for the tips, really helpful.. now, a few more questions:
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>>ok.  NR = NRDB
> 	>>>>>>>
> 	>>>>>>>the way we have used gus with similarities is that both the query and
> 	>>>>>>>subject are loaded into gus.  As thomas explained, the similarity table
> 	>>>>>>>captures similarity between sequences that are in gus.
> 	>>>>>>>our approach has always been to just load (warehouse) the entire subject
> 	>>>>>>>database (NR, EST) that we are blasting against.
> 	>>>>>>>
> 	>>>>>>>the current plugins and blastSimilarity are set up for this.
> 	>>>>>>>
> 	>>>>>>>obviously, this takes a lot of disk space.  two major efficiencies that we
> 	>>>>>>>don't currently have plugins for would be:
> 	>>>>>>>1. to only store in gus a *reference* to the external sequence (ie, don't
> 	>>>>>>>store the actgs).
> 	>>>>>>>2. only store in gus the sequences that actually have similarities
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>Option 2 sounds better for us, since we will be blasting against several
> 	>>>>>>databases (> 10GB databases)
> 	>>>>>>
> 	>>>>>>What about the plugins to load Interpro and "gene finder" (glimmer, etc)
> 	>>>>>>results ? Are there any at all ?
> 	>>>>>>
> 	>>>>>>Cheers, Alberto
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>>steve
> 	>>>>>>>
> 	>>>>>>>Alberto Davila wrote:
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>>All the blastable databases I mentioned are standard databases from NCBI
> 	>>>>>>>>(ftp://ftp.ncbi.nlm.nih.gov/blast/db/blastdb.txt):
> 	>>>>>>>>
> 	>>>>>>>>NT = nucleotides
> 	>>>>>>>>
> 	>>>>>>>>~30000 entries from genbank (genbank format) are loaded into GUS now.
> 	>>>>>>>>
> 	>>>>>>>>Not sure about your "NRDB", I know NR from NCBI that is a collection of
> 	>>>>>>>>amino acid entries, could it be the same ?
> 	>>>>>>>>
> 	>>>>>>>>Alberto
> 	>>>>>>>>
> 	>>>>>>>>On Fri, 2005-02-11 at 10:43 -0500, Steve Fischer wrote:
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>>(what is NT?)
> 	>>>>>>>>>
> 	>>>>>>>>>which of these (genbank, your fasta, NRDB, NT, EST) have you loaded into
> 	>>>>>>>>>gus?
> 	>>>>>>>>>
> 	>>>>>>>>>steve
> 	>>>>>>>>>
> 	>>>>>>>>>Alberto Davila wrote:
> 	>>>>>>>>>
> 	>>>>>>>>>
> 	>>>>>>>>>
> 	>>>>>>>>>
> 	>>>>>>>>>
> 	>>>>>>>>>
> 	>>>>>>>>>
> 	>>>>>>>>>>Query:
> 	>>>>>>>>>>
> 	>>>>>>>>>>Either sequences from genbank (genbank format) or sequences generated
> 	>>>>>>>>>>in the lab (fasta format)
> 	>>>>>>>>>>
> 	>>>>>>>>>>Blastable databases (all are formatted databases from NCBI):
> 	>>>>>>>>>>
> 	>>>>>>>>>>NR
> 	>>>>>>>>>>NT
> 	>>>>>>>>>>EST
> 	>>>>>>>>>>
> 	>>>>>>>>>>Alberto
> 	>>>>>>>>>>
> 	>>>>>>>>>>On Fri, 2005-02-11 at 10:34 -0500, Steve Fischer wrote:
> 	>>>>>>>>>>
> 	>>>>>>>>>>
> 	>>>>>>>>>>
> 	>>>>>>>>>>
> 	>>>>>>>>>>
> 	>>>>>>>>>>
> 	>>>>>>>>>>
> 	>>>>>>>>>>
> 	>>>>>>>>>>>for the blast, what are the query sequences and what are the blastable
> 	>>>>>>>>>>>databases?
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>steve
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>Alberto Davila wrote:
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>
> 	>>>>>>>>>>>>Basically we will use sequences (loaded into GUS with the GBParser)
> 	>>>>>>>>>>>>for NCBI Blast (Blastx, Blastp and TBlastX); the same sequences will
> 	>>>>>>>>>>>>also be used for Interpro analyses. Results of both (Blast and Interpro)
> 	>>>>>>>>>>>>will be loaded into GUS. We will parse specific things from the Blast
> 	>>>>>>>>>>>>results, I would say:
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>`Gi` `Accession` `Description` `E_value` `Score` `Length`
> 	>>>>>>>>>>>>`Frame_Query` `Frame_Hit` `Identical` `Hsp_Frac_Identical`
> 	>>>>>>>>>>>>`Conserved` `Hsp_Frac_Conserved` `Query_Start`
> 	>>>>>>>>>>>>`Query_End` `Hit_Start` `Hit_End` `Hsp_Align` `database_letters`
> 	>>>>>>>>>>>>`database_entries`
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>We already have a Bioperl parser for that (specific to another
> 	>>>>>>>>>>>>system: GARSA) that could be adapted to GUS; the problem is that we
> 	>>>>>>>>>>>>are not sure which tables should be used to store those data in GUS.
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>Cheers, Alberto
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>On Fri, 2005-02-11 at 10:06 -0500, Steve Fischer wrote:
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>
> 	>>>>>>>>>>>>>what are you planning on blasting?
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>steve
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>Alberto Davila wrote:
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>Hi Steve,
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>poliana-
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>oops, the usage statement for LoadBlastSimFast is out of date.
> 	>>>>>>>>>>>>>>>it should instruct you to use the blastSimilarity command.
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>LoadBlastSimFast makes a big assumption, that the subject and
> 	>>>>>>>>>>>>>>>query sequences are in GUS, and their def. lines have GUS primary
> 	>>>>>>>>>>>>>>>keys.
> 	>>>>>>>>>>>>>>>Are your sequences already loaded into GUS?
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>They are not; are there any howtos/tips for that plugin? We will
> 	>>>>>>>>>>>>>>certainly need a plugin to load "Interpro" and "ORF finding" results
> 	>>>>>>>>>>>>>>into GUS... If they are not available, then maybe we will have to
> 	>>>>>>>>>>>>>>write them ...
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>Cheers, Alberto
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>steve
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>Poliana Mateus wrote:
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>>Hello all,
> 	>>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>>Where can I find the script parseBlastFilesForSimilarity.pl??
> 	>>>>>>>>>>>>>>>>I'm trying to run LoadBlastSimFast...
> 	>>>>>>>>>>>>>>>>
> 	>>>>>>>>>>>>>>>>Poliana
> 	>>>>>>>>>>>>>>>>
--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  am...@pc...
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA  19104-6017     fax:    215-746-6697