RE: RES: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl

Hey Ed,

Great, I look forward to it. Poliana has just started looking at the
code, since we are in a rush to meet some deadlines. In any case, she
will contact you by Friday to check on your progress with the document ;-)

We are learning about genomic databases little by little (not too bad),
and I then "hope" to motivate my colleagues (the real DB experts, not
beginners like me) at the Federal University of Rio de Janeiro and IME
to offer a course on "Genomic Databases" as part of the graduate
programme for the second half of 2005. The GUS and Chado schemas should
(hopefully) be a topic.

Alberto

-----Original Message-----
From:	Ed Robinson [mailto:ero...@ug...]
Sent:	Mon 2/14/2005 11:46 AM
To:	davila; Steve Fischer
Cc:	Y. Thomas Gan; Poliana Mateus; gus...@li...
Subject:	Re: RES: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl
Alberto,

Poliana may be interested in a GUS developers guide I am
trying to write this week for the course.  I just went through
the nightmare of learning how to correctly write GUS plugins
for a completely undocumented API and little help or pointers
to where that API can be found in the source code.  There is a
plugin description on the WIKI, but absolutely NO API for the
GUS Model.  I should have a document written for this with an
API for the Plugin Class and a general API written for the GUS
Model by Friday.  It will also include other points for
debugging GUS and some best practices I have collected in my
notes.

-Ed

---- Original message ----
>Date: Sun, 13 Feb 2005 18:01:22 -0300
>From: "davila" <da...@io...>
>Subject: RES: [Gusdev-gusdev] parseBlastFilesForSimilarity.pl
>To: "Steve Fischer" <sfi...@pc...>
>Cc: "Y. Thomas Gan" <yon...@pc...>, "Poliana Mateus" <pol...@gm...>,
> <gus...@li...>
>
>Steve,
>* see below
>
>Alberto Davila wrote:
>
>>We are doing this for Garsa (another system). Basically, we have a
>>bioperl parser (Bio::SearchIO) that reads the Blast results file and
>>extracts all the needed info (into the "Blast_Hit" table). It also loads
>>into a given table (eg: External_DB) all the sequences (in fasta format)
>>showing similarity with the queries. In the end, "Blast_Hit" and
>>"External_DB" are both populated by the same script.
>>
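A minimal sketch of the single-pass idea described above, written in Python rather than BioPerl and not taken from the Garsa code. It assumes 12-column tabular BLAST output; `BlastHit`, `parse_tabular_hits` and `collect` are hypothetical names, and the sample rows are made up. One pass yields both the rows for a "Blast_Hit"-style table and the set of subject ids whose sequences would go into an "External_DB"-style table.

```python
import csv
import io
from typing import NamedTuple


class BlastHit(NamedTuple):
    """One row destined for a hypothetical Blast_Hit table."""
    query_id: str
    subject_id: str
    e_value: float
    bit_score: float
    query_start: int
    query_end: int


def parse_tabular_hits(handle):
    """Yield BlastHit rows from 12-column tabular BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        # Columns: 0 query, 1 subject, 6 q.start, 7 q.end, 10 e-value, 11 bit score
        yield BlastHit(row[0], row[1], float(row[10]), float(row[11]),
                       int(row[6]), int(row[7]))


def collect(handle):
    """Return (hit rows, subject ids) gathered from one pass over the file."""
    hits = list(parse_tabular_hits(handle))
    subject_ids = {hit.subject_id for hit in hits}
    return hits, subject_ids


# Made-up sample data, for illustration only.
sample = io.StringIO(
    "q1\tgi|111\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t380\n"
    "q1\tgi|222\t80.0\t150\t30\t0\t10\t159\t1\t150\t2e-10\t95.1\n"
)
hits, subject_ids = collect(sample)
```

The subject id set is what lets the same script decide which fasta records to load, without a second pass over the Blast output.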
>wow, great.  could you make a gus plugin from that?
>
>
>Should not be a big problem, I will ask Poliana to do that. She may
>occasionally contact you to ask for some details. In the end we will put
>the things we debug/develop at www.biowebdb.org and also provide them to
>anyone interested. In an ideal world, nobody should suffer twice with the
>same "bug" ;-)
>
>>Regarding Interpro and Glimmer, the main problem is knowing in which
>>tables we should load the parsed results.
>>
>* describe the info you want to store.
>
>Basically this:
>
>Frame_Hit, Method, Method_Accession, Accession, Hit_Status,
>Query_Start, Query_End, Description, E_value
>
>Again, I am asking Poliana to take care of that.
>
>* steve
>
>Cheers, Alberto
>
>
>Alberto M. R. Dávila, PhD
>Kinetoplastid Biology and Disease (Biomed Central)
>http://www.kinetoplastids.com
>http://www.darwin.fiocruz.br
>DBBM / Instituto Oswaldo Cruz / FIOCRUZ
>Av. Brasil 4365
>Rio de Janeiro, RJ, Brasil
>CEP 21045-900
>Email: da...@fi...
>          amr...@ya...
>Phone: 55-21-3865-8229 / 3865-8206
>Fax: 55-21-2590-3495
>-------------------------------------------------
>The BiowebDB consortium: http://www.biowebdb.org
>
>>Alberto
>>
>>On Fri, 2005-02-11 at 13:21 -0500, Y. Thomas Gan wrote:
>>
>>>I was going to give the same answer steve gave for interpro and gene
>>>finding results.
>>>
>>>For loading sequences into GUS, the dilemma with option 2 is: how do you
>>>know which sequences to load at load time (which is before you actually
>>>have the similarity results)? One solution would be to initially load
>>>the complete dataset(s), then delete the sequences without similarity
>>>after loading the similarity results.
>>>
>>>-Thomas
>>>
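The load-then-prune approach suggested above comes down to a set difference. A minimal sketch in Python, assuming the loaded subject ids and the (query, subject) similarity pairs are already available in memory; `ids_to_delete` and the sample ids are hypothetical, and no GUS API is used:

```python
def ids_to_delete(loaded_ids, similarity_pairs):
    """Return the loaded subject ids that appear in no similarity result."""
    with_hits = {subject_id for _query_id, subject_id in similarity_pairs}
    return set(loaded_ids) - with_hits


# Made-up ids: four subject sequences were loaded, two of them have hits.
loaded = ["s1", "s2", "s3", "s4"]
pairs = [("q1", "s2"), ("q2", "s2"), ("q2", "s4")]
stale = ids_to_delete(loaded, pairs)
print(sorted(stale))  # → ['s1', 's3']
```

Only the subject dataset is pruned here; query sequences are assumed to stay in the database regardless of whether they produced hits.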
>>>On Fri, 11 Feb 2005, Steve Fischer wrote:
>>>
>>>>alberto-
>>>>
>>>>we've never loaded interpro, so there isn't a plugin.
>>>>i believe plasmodb has loaded glimmer results, though i'm not sure.
>>>>i have asked a plasmodb developer to answer that question.
>>>>
>>>>steve
>>>>
>>>>Alberto Davila wrote:
>>>>
>>>>>Hey Steve, Thomas,
>>>>>
>>>>>Thanks a lot for the tips, really helpful. Now, a few more questions:
>>>>>
>>>>>>ok.  NR = NRDB
>>>>>>
>>>>>>the way we have used gus with similarities is that both the query and
>>>>>>subject are loaded into gus.  As thomas explained, the similarity table
>>>>>>captures similarity between sequences that are in gus.
>>>>>>our approach has always been to just load (warehouse) the entire subject
>>>>>>database (NR, EST) that we are blasting against.
>>>>>>
>>>>>>the current plugins and blastSimilarity are set up for this.
>>>>>>
>>>>>>obviously, this takes a lot of disk space.  two major efficiencies that we
>>>>>>don't currently have plugins for would be:
>>>>>> 1. to only store in gus a *reference* to the external sequence (ie, don't
>>>>>>store the actgs).
>>>>>> 2. only store in gus the sequences that actually have similarities
>>>>>>
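Option 2 above could be approximated outside GUS by filtering the subject fasta file before loading. A rough sketch in Python, assuming the set of subject ids with similarities is already known (for example from parsed Blast output) and that the id is the first whitespace-delimited token of each def line; `record_id`, `filter_fasta` and the sample records are hypothetical:

```python
import io


def record_id(header):
    """Take the first whitespace-delimited token of the def line as the id."""
    return header.split(None, 1)[0] if header.strip() else ""


def filter_fasta(handle, wanted_ids):
    """Yield (header, sequence) for fasta records whose id is in wanted_ids."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new def line closes the previous record, if there was one.
            if header is not None and record_id(header) in wanted_ids:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the last record in the file.
    if header is not None and record_id(header) in wanted_ids:
        yield header, "".join(chunks)


# Made-up records, for illustration only.
fasta = io.StringIO(">s1 first\nACGT\nAC\n>s2 second\nGGGG\n>s3 third\nTTTT\n")
kept = list(filter_fasta(fasta, {"s1", "s3"}))
```

Because the filter streams the file line by line, it does not need to hold a multi-gigabyte database in memory; only the id set has to fit.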
>>>>>Option 2 sounds better for us, since we will be blasting against several
>>>>>databases (> 10GB databases).
>>>>>
>>>>>What about plugins to load Interpro and "gene finder" (glimmer, etc)
>>>>>results? Are there any at all?
>>>>>
>>>>>Cheers, Alberto
>>>>>
>>>>>
>>>>>>steve
>>>>>>
>>>>>>Alberto Davila wrote:
>>>>>>
>>>>>>
>>>>>>>All the blastable databases I mentioned are standard databases from NCBI
>>>>>>>(ftp://ftp.ncbi.nlm.nih.gov/blast/db/blastdb.txt):
>>>>>>>
>>>>>>>NT = nucleotides
>>>>>>>
>>>>>>>~30000 entries from genbank (genbank format) are loaded into GUS now.
>>>>>>>
>>>>>>>Not sure about your "NRDB". I know NR from NCBI, which is a collection of
>>>>>>>amino acid entries; could it be the same?
>>>>>>>
>>>>>>>Alberto
>>>>>>>
>>>>>>>On Fri, 2005-02-11 at 10:43 -0500, Steve Fischer wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>(what is NT?)
>>>>>>>>
>>>>>>>>which of these (genbank, your fasta, NRDB, NT, EST)
have you loaded into
>>>>>>>>gus?
>>>>>>>>
>>>>>>>>steve
>>>>>>>>
>>>>>>>>Alberto Davila wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>Query:
>>>>>>>>>
>>>>>>>>>Either sequences from genbank (genbank format) or sequences generated
>>>>>>>>>in the lab (fasta format)
>>>>>>>>>
>>>>>>>>>Blastable databases (all are formatted databases from NCBI):
>>>>>>>>>
>>>>>>>>>NR
>>>>>>>>>NT
>>>>>>>>>EST
>>>>>>>>>
>>>>>>>>>Alberto
>>>>>>>>>
>>>>>>>>>On Fri, 2005-02-11 at 10:34 -0500, Steve Fischer wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>for the blast, what are the query sequences and what are the blastable
>>>>>>>>>>databases?
>>>>>>>>>>
>>>>>>>>>>steve
>>>>>>>>>>
>>>>>>>>>>Alberto Davila wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>Basically we will use sequences (loaded into GUS with the GBParser)
>>>>>>>>>>>for NCBI Blast (Blastx, Blastp and TBlastX); the same sequences will
>>>>>>>>>>>also be used for Interpro analyses. Results of both (Blast and
>>>>>>>>>>>Interpro) will be loaded into GUS. We will parse specific things from
>>>>>>>>>>>the Blast results, I would say:
>>>>>>>>>>>
>>>>>>>>>>>`Gi` `Accession` `Description` `E_value` `Score` `Length`
>>>>>>>>>>>`Frame_Query` `Frame_Hit` `Identical` `Hsp_Frac_Identical`
>>>>>>>>>>>`Conserved` `Hsp_Frac_Conserved` `Query_Start` `Query_End`
>>>>>>>>>>>`Hit_Start` `Hit_End` `Hsp_Align` `database_letters`
>>>>>>>>>>>`database_entries`
>>>>>>>>>>>
>>>>>>>>>>>We already have a Bioperl parser for that (specific to another system,
>>>>>>>>>>>GARSA) that could be adapted to GUS. The problem is that we are not
>>>>>>>>>>>sure which tables should be used to store those data in GUS.
>>>>>>>>>>>
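To make the field list above more concrete, here is a hedged sketch in Python of a record holding a few of those fields. It is not the GARSA parser and not a GUS table definition. `HspSummary` is a hypothetical name, `align_length` is an assumed extra field (the list above does not say what `Length` measures), and the two fractions are assumed to be counts divided by alignment length. The sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HspSummary:
    """A few of the listed Blast fields, plus an assumed alignment length."""
    accession: str
    e_value: float
    identical: int     # number of identical positions in the HSP
    conserved: int     # number of conserved (positive) positions in the HSP
    align_length: int  # assumption: total alignment length of the HSP

    def __post_init__(self):
        if self.align_length <= 0:
            raise ValueError("align_length must be positive")

    @property
    def frac_identical(self) -> float:
        # Assumed definition: identical positions / alignment length.
        return self.identical / self.align_length

    @property
    def frac_conserved(self) -> float:
        # Assumed definition: conserved positions / alignment length.
        return self.conserved / self.align_length


# Made-up values, for illustration only.
hsp = HspSummary("ABC123", 1e-30, identical=90, conserved=120, align_length=150)
```

Storing the raw counts and deriving the fractions keeps the two from drifting apart if either is recomputed later.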
>>>>>>>>>>>Cheers, Alberto
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>On Fri, 2005-02-11 at 10:06 -0500, Steve Fischer wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>what are you planning on blasting?
>>>>>>>>>>>>
>>>>>>>>>>>>steve
>>>>>>>>>>>>
>>>>>>>>>>>>Alberto Davila wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>Hi Steve,
>>>>>>>>>>>>>
>>>>>>>>>>>>>On Fri, 2005-02-11 at 08:56 -0500, Steve Fischer wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>poliana-
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>oops, the usage statement for LoadBlastSimFast is out of date.
>>>>>>>>>>>>>>it should instruct you to use the blastSimilarity command.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>LoadBlastSimFast makes a big assumption, that the subject and
>>>>>>>>>>>>>>query sequences are in GUS, and their def. lines have GUS primary
>>>>>>>>>>>>>>keys.
>>>>>>>>>>>>>>Are your sequences already loaded into GUS?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>They are not. Are there any howtos/tips for that plugin? We will
>>>>>>>>>>>>>certainly need a plugin to load "Interpro" and "ORF finding" results
>>>>>>>>>>>>>into GUS... If they are not available, then maybe we will have to
>>>>>>>>>>>>>write them ...
>>>>>>>>>>>>>
>>>>>>>>>>>>>Cheers, Alberto
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>steve
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Poliana Mateus wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Hello all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Where can I find the script parseBlastFilesForSimilarity.pl?
>>>>>>>>>>>>>>>I'm trying to run LoadBlastSimFast...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Poliana
>