From: Don G. <gil...@bi...> - 2006-01-04 00:24:35
Tobias,

Here are some comments. I don't know the cause of the loading error you see; I can't reproduce it using bioperl 1.5.1, though my system's other software versions differ from yours.

> 1.)
> I'm setting up a new server system (gbrowse, mysql) and decided to
> re-install the gadfly database based on the GFF3 files available from
> flybase. with my former system (gbrowse 1.6.3 on linux with apache2 and
> bioperl-1.5.0 and mysql 4.1.13, database loaded from gadfly 4.2 GFF3
> files) everything was fine.

The flybase genome annotation data no longer come from a Gadfly database but from a GMOD Chado database. That may be confusing, because there is a GBrowse script 'process_gadfly.pl' which should *not* be used for this GFF v3 data from the Chado database.

> now I'm running into troubles in that gbrowse does not display the
> proper description for the annotations anymore. when I compare the old
> to the new mysql database, one immediate difference is the size of the
> fattribute table (old: 9 entries - new: more than a million).

You should have 8 or 9 entries in the GFF-MySQL fattribute table (I got cyto_range, Dbxref, dbxref_2nd, Name, Parent, species, gbunit, Alias), but some 5 million in the fattribute_to_feature table. At a guess, the GFF loader is either wrongly guessing that the input is GFF v2, or is otherwise not properly reading the attributes (column 9 of GFF 2/3).

[Side note: some features in the fly GFF have both Parent and ID attributes, and both are useful. The way the current Bioperl GFF v3 reader works, it treats Parent as the primary entity ID, turning it into the 'gname' field, which is fine, but then *drops* the ID, which isn't always desirable. It would be nice if the reader always kept the ID field and saved it as an attribute.]

...

> what is the problem with loading the gadfly files, is it the files as
> such, the loading script?

I tested over the weekend loading the D. melanogaster release 4.2 data set into MySQL-GFF, after updating to bioperl 1.5.1.
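To make the column 9 point above concrete, here is a minimal sketch of GFF v3 attribute parsing, showing why a feature can carry both Parent and ID, and why dropping ID loses information. This is illustrative Python, not Bioperl's actual reader, and the feature names are made up:

```python
# Hypothetical sketch of GFF v3 column-9 parsing (not Bioperl code).
# Attribute pairs are ';'-separated; multiple values within one
# attribute (e.g. an exon shared by several transcripts) are
# ','-separated per the GFF3 format.

def parse_attributes(column9):
    """Split 'ID=x;Parent=a,b;Name=n' into a dict of value lists."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value.split(",")
    return attrs

# Made-up exon with both an ID and two parent transcripts:
attrs = parse_attributes("ID=CG1234:2;Parent=CG1234-RA,CG1234-RB;Name=exon2")

# A loader that keys the feature on Parent (as 'gname') should still
# keep ID as an attribute rather than dropping it:
print(attrs["Parent"])  # ['CG1234-RA', 'CG1234-RB']
print(attrs["ID"])      # ['CG1234:2']
```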
I didn't see the problems you report; I used only bp_bulk_gff_load.pl, not the other GFF loaders. I have a Sun-Solaris-x64 system, presumably similar to yours; I've replaced the system perl with a hand-compiled one, to build all the needed native perl libs without a SunC compiler. Your tests seem to implicate the bioperl 1.5.1 installation.

There are bugs in some of the GFF v3 loaders prior to bioperl 1.5.1 (bp_bulk_gff_load.pl at least): when a subfeature (exon) had multiple parent features, these would not be recognized, which is something the fly GFF does have. This isn't as big a problem as what you are seeing, however.

Example servers with these data, and comparisons among other GBrowse adaptors (Chado-Pg, MySQL, BerkeleyDB), are here:

  http://server2.eugenes.org/cgi-bin/gbrowse/dmel_r42_mysql/ (Sun-Solaris-x64)
  http://server3.eugenes.org/cgi-bin/gbrowse/dmel_r42_mysql/ (Apple-MacOSX-ppc)

Use the GBrowse data set 'dmel_r42_mysql' here to get the same one you are working with. The data set 'dmel_r42_lucene' has the same data running via a Lucene adaptor instead of MySQL.

> 3.)
> as my lab plans to extend a lot on genomic profiling (ChIP on chip),
> performance will sooner or later be a big issue when adding all those
> profiles (up to 400.000 features per profile) into the database.
>
> what are the bottlenecks in the mysql-bioperl-gbrowse-apache pipeline
> that would benefit most from tuning?

The simplest way to speed up GBrowse with large data sets is to run it on the fastest computer you can: this year's model rather than one two years old will make a big difference. There are database-clustering techniques that might be used, but database access time generally isn't the biggest cost. If you keep the count of features-to-display small, even a large data set should not take long to draw as a map; the data adaptors fetch only the drawable subset. How many of the 400,000 features do you want to show in one display?
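The "fetch only the drawable subset" point can be sketched as a simple range query: with features kept sorted by start coordinate, a viewport lookup touches only the features overlapping the displayed range, not all 400,000. This is illustrative Python with made-up coordinates, not GBrowse's actual adaptor code:

```python
# Sketch of a viewport range query over a large, sorted feature set.
# Illustrative only; GBrowse's MySQL-GFF adaptor does this with indexed
# SQL range queries rather than in-memory lists.
import bisect

# 400,000 made-up (start, end) features, sorted by start coordinate.
features = [(i * 100, i * 100 + 80) for i in range(400_000)]
starts = [s for s, _ in features]
max_len = 80  # longest feature; bounds how far left of the view to scan

def overlapping(view_start, view_end):
    """Return only the features overlapping [view_start, view_end]."""
    lo = bisect.bisect_left(starts, view_start - max_len)
    hi = bisect.bisect_right(starts, view_end)
    return [(s, e) for s, e in features[lo:hi]
            if e >= view_start and s <= view_end]

# A 100 kb viewport pulls back roughly a thousand features, no matter
# how large the whole database is.
view = overlapping(1_000_000, 1_100_000)
print(len(view))
```

The drawing cost then scales with what is on screen, which is why keeping the displayed feature count small matters more than total database size.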
Data access accounts for maybe a third or less of the time to draw a GBrowse map, and the MySQL-GFF database is pretty well optimized for what it does. I've gotten Lucene to work a bit faster, but not by much (see http://sourceforge.net/mailarchive/message.php?msg_id=12858575).

Processing a large number of perl objects is one of the time costs. GBrowse creates Bio::DB objects and then Bio::Feature objects from these for drawing. If instead it went directly from the database to Bio::Feature objects, I think that would speed up the operation, but such a change would be expensive in developer time.

-- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gil...@in...--http://marmot.bio.indiana.edu/