From: Don G. <gil...@bi...> - 2006-01-04 00:24:35
Tobias,

Here are some comments. I don't know the cause of the loading error you see; I can't reproduce it using bioperl 1.5.1, though my system's other software versions differ from yours.

> 1.)
> I'm setting up a new server system (gbrowse, mysql) and decided to
> re-install the gadfly database based on the GFF3 files available from
> flybase. with my former system (gbrowse 1.6.3 on linux with apache2 and
> bioperl-1.5.0 and mysql 4.1.13, database loaded from gadfly 4.2 GFF3
> files) everything was fine.

The flybase genome annotation data no longer come from a Gadfly database but from a GMOD Chado database. That may be confusing, because there is a GBrowse script 'process_gadfly.pl' which should *not* be used for this GFF v3 data from the Chado database.

> now I'm running into troubles in that gbrowse does not display the
> proper description for the annotations anymore. when I compare the old
> to the new mysql database, one immediate difference is the size of the
> fattribute table (old: 9 entries - new: more than a million).

You should have 8 or 9 entries in the GFF-MySQL fattribute table (I got cyto_range, Dbxref, dbxref_2nd, Name, Parent, species, gbunit, Alias), but some 5 million in the fattribute_to_feature table. At a guess, the GFF loader is either wrongly guessing that the input is GFF v2, or is otherwise not properly reading the attributes (column 9 of GFF 2/3).

[Side note: some features in the fly GFF have both Parent and ID attributes, and both are useful. The way the current Bioperl GFF v3 reader works, it treats Parent as the primary entity ID, turning it into the 'gname' field, which is fine, but then *drops* the ID, which isn't always desirable. It would be nice if the reader always kept the ID field and saved it as an attribute.]

...

> what is the problem with loading the gadfly files, is it the files as
> such, the loading script?

I tested over the weekend loading the D. melanogaster release 4.2 data set into MySQL-GFF, after updating to bioperl 1.5.1.
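To make the column 9 point above concrete, here is a minimal sketch of GFF v3 attribute parsing, showing why a feature can carry both Parent and ID, and why dropping ID loses information. This is illustrative Python, not Bioperl's actual reader, and the feature names are made up:

```python
# Hypothetical sketch of GFF v3 column-9 parsing (not Bioperl code).
# Attribute pairs are ';'-separated; multiple values within one
# attribute (e.g. an exon shared by several transcripts) are
# ','-separated per the GFF3 format.

def parse_attributes(column9):
    """Split 'ID=x;Parent=a,b;Name=n' into a dict of value lists."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value.split(",")
    return attrs

# Made-up exon with both an ID and two parent transcripts:
attrs = parse_attributes("ID=CG1234:2;Parent=CG1234-RA,CG1234-RB;Name=exon2")

# A loader that keys the feature on Parent (as 'gname') should still
# keep ID as an attribute rather than dropping it:
print(attrs["Parent"])  # ['CG1234-RA', 'CG1234-RB']
print(attrs["ID"])      # ['CG1234:2']
```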
I didn't see the problems you report; I used only bp_bulk_gff_load.pl, not the other GFF loaders. I have a Sun-Solaris-x64 system, presumably similar to yours; I've replaced the system perl with a hand-compiled one, to build all the needed native perl libs without a SunC compiler. Your tests seem to implicate the bioperl 1.5.1 installation.

There are bugs in some of the GFF v3 loaders prior to bioperl 1.5.1 (bp_bulk_gff_load.pl at least): when a subfeature (exon) had multiple parent features, these would not be recognized, which is something the fly GFF does have. This isn't as big a problem as what you are seeing, however.

Example servers with these data, and comparisons among other GBrowse adaptors (Chado-Pg, MySQL, BerkeleyDB), are here:

  http://server2.eugenes.org/cgi-bin/gbrowse/dmel_r42_mysql/ (Sun-Solaris-x64)
  http://server3.eugenes.org/cgi-bin/gbrowse/dmel_r42_mysql/ (Apple-MacOSX-ppc)

Use the GBrowse data set 'dmel_r42_mysql' here to get the same one you are working with. The data set 'dmel_r42_lucene' has the same data running via a Lucene adaptor instead of MySQL.

> 3.)
> as my lab plans to extend a lot on genomic profiling (ChIP on chip),
> performance will sooner or later be a big issue when adding all those
> profiles (up to 400.000 features per profile) into the database.
>
> what are the bottlenecks in the mysql-bioperl-gbrowse-apache pipeline
> that would benefit most from tuning?

The simplest way to speed up GBrowse with large data sets is to run it on the fastest computer you can: this year's model rather than one two years old will make a big difference. There are database-clustering techniques that might be used, but database access time generally isn't the biggest cost. If you keep the count of features-to-display small, even a large data set should not take long to draw as a map; the data adaptors fetch only the drawable subset. How many of the 400,000 features do you want to show in one display?
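The "fetch only the drawable subset" point can be sketched as a simple range query: with features kept sorted by start coordinate, a viewport lookup touches only the features overlapping the displayed range, not all 400,000. This is illustrative Python with made-up coordinates, not GBrowse's actual adaptor code:

```python
# Sketch of a viewport range query over a large, sorted feature set.
# Illustrative only; GBrowse's MySQL-GFF adaptor does this with indexed
# SQL range queries rather than in-memory lists.
import bisect

# 400,000 made-up (start, end) features, sorted by start coordinate.
features = [(i * 100, i * 100 + 80) for i in range(400_000)]
starts = [s for s, _ in features]
max_len = 80  # longest feature; bounds how far left of the view to scan

def overlapping(view_start, view_end):
    """Return only the features overlapping [view_start, view_end]."""
    lo = bisect.bisect_left(starts, view_start - max_len)
    hi = bisect.bisect_right(starts, view_end)
    return [(s, e) for s, e in features[lo:hi]
            if e >= view_start and s <= view_end]

# A 100 kb viewport pulls back roughly a thousand features, no matter
# how large the whole database is.
view = overlapping(1_000_000, 1_100_000)
print(len(view))
```

The drawing cost then scales with what is on screen, which is why keeping the displayed feature count small matters more than total database size.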
Data access accounts for maybe a third or less of the time to draw a GBrowse map, and the MySQL-GFF database is pretty well optimized for what it does. I've gotten Lucene to work a bit faster, but not by much (see http://sourceforge.net/mailarchive/message.php?msg_id=12858575).

Processing a large number of perl objects is one of the time costs. GBrowse creates Bio::DB objects and then Bio::Feature objects from these for drawing. If instead it went directly from the database to Bio::Feature objects, I think that would speed up the operation, but such a change would be expensive in developer time.

-- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gil...@in...--http://marmot.bio.indiana.edu/