From: Ian H. <ih...@be...> - 2007-02-01 04:22:01
|
One of the future applications of the AJAX genome browser that I've been excited about for some time is the possibility of opening two parallel views of two related genomes: one view would be the "master" genome, and the other would display the region of the second genome that is syntenic to the segment currently shown in the master. As you slide the master around, the secondary view would follow. Until now, the only tool I knew of that did this was the Artemis Comparison Tool from Sanger.

Well, here I am in Brisbane talking with Aaron Darling, who worked on MAUVE and the associated MAUVE Java viewer: http://gel.ahabs.wisc.edu/mauve/alignments.php

I'll let Aaron (CC'd) describe what kind of data it's designed to display and how you navigate it. I mainly wanted to point this out as another tool in the genome browser ecosystem... I also think it's quite similar to the sort of synteny viewer I thought it would (eventually) be cool to build out of parts of our browser.

BTW, here's the ACT web page: http://www.sanger.ac.uk/Software/ACT/ |
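[Editor's sketch] The master/follower idea above amounts to an interval lookup: given the master viewport, find the synteny blocks it overlaps and interpolate their coordinates into the second genome. A toy Python sketch follows (the project code is Perl); the block table, tuple layout, and function name are all invented for illustration, not taken from any of the tools mentioned.

```python
# Hypothetical sketch: map a "master" viewport onto a second genome via a
# table of synteny blocks. A real browser would load the blocks from an
# alignment; these coordinates are made up.

def map_to_secondary(blocks, m_start, m_end):
    """Return (start, end) regions of the secondary genome that
    correspond to the master viewport [m_start, m_end)."""
    hits = []
    for ms, me, ss, se, strand in blocks:
        # (master_start, master_end, sec_start, sec_end, strand)
        if me <= m_start or ms >= m_end:
            continue                      # block misses the viewport
        # clip the block to the viewport, then interpolate linearly
        lo, hi = max(ms, m_start), min(me, m_end)
        scale = (se - ss) / float(me - ms)
        if strand == '+':
            hits.append((ss + int((lo - ms) * scale),
                         ss + int((hi - ms) * scale)))
        else:                             # inverted block: runs backwards
            hits.append((se - int((hi - ms) * scale),
                         se - int((lo - ms) * scale)))
    return hits

# Two invented blocks: a forward one and an inversion.
blocks = [(0, 1000, 5000, 6000, '+'), (1000, 2000, 8000, 10000, '-')]
```

As the master view slides, the client would re-run this lookup and scroll the secondary view to the returned regions.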
From: Aaron D. <a.d...@im...> - 2007-02-01 08:57:15
|
Thanks for the intro, Ian. Yes, Mauve displays synteny and genome rearrangement for two or more genomes, along with sequence annotation and sequence identity plots. Perhaps the best way to describe it is by example; there are a few screenshots here: http://gel.ahabs.wisc.edu/mauve-aligner/mauve-user-guide/mauve-screenshots.html

The viewer program is Java-based, relying heavily on BioJava for parsing and displaying sequence annotations. However, the core data set being displayed is a multiple genome alignment computed by progressiveMauve. The alignment provides a complete basis for the comparative display, since it contains both a synteny map and a full multiple sequence alignment for each syntenic region. By the way, we don't call them synteny blocks, but rather "Locally Collinear Blocks", or just LCBs.

The progressiveMauve command-line program dumps the alignment out to a text file in eXtended-Multi-FastA (XMFA) format, so it can be read and displayed in other genome browsers. Sockeye and M-GCAT are two others I know of that can read and display XMFA data.

A gbrowse/AJAX type of browser would be great for displaying synteny information, and would likely be much nicer to use than Mauve's interactive display in many situations. Standalone programs like Mauve don't deal well with large genome comparisons. For example, a human/mouse/rat genome alignment weighs in at around 12 GB, and that data set is far too massive for users to download to their computers. Storing the alignment on a server and accessing it via a gbrowse/AJAX type client is really the only tenable solution.

I really liked the demo at http://genome.biowiki.org/ Would it be possible for my friends at Univ. of Wisconsin to set up their own instance and load mammalian genome data into it? 
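[Editor's sketch] The XMFA format Aaron mentions is FASTA-like: each record header carries a sequence number, coordinates, and strand (e.g. `>1:100-200 + chr1`), and a line starting with `=` closes one alignment block (one LCB). A minimal Python reader, written from that general description rather than from the progressiveMauve source, might look like:

```python
# Minimal sketch of an XMFA (eXtended Multi-FastA) reader. Header details
# beyond "seqnum:start-end strand" vary between writers, so this only
# parses the common core.
import re

HEADER = re.compile(r'>\s*(\d+):(\d+)-(\d+)\s+([+-])')

def parse_xmfa(lines):
    """Yield alignment blocks; each block is a list of
    (seq_id, start, end, strand, aligned_sequence) tuples."""
    block, current, chunks = [], None, []

    def close_record():
        if current is not None:
            block.append(current + (''.join(chunks),))

    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            close_record()
            m = HEADER.match(line)
            current = (int(m.group(1)), int(m.group(2)),
                       int(m.group(3)), m.group(4))
            chunks = []
        elif line.startswith('='):        # end of one LCB
            close_record()
            current, chunks = None, []
            if block:
                yield block
                block = []
        elif line and not line.startswith('#'):
            chunks.append(line)           # sequence data, possibly wrapped
    if block:
        yield block
```

A server-side synteny browser could stream blocks this way and index each LCB's coordinate ranges rather than loading the whole 12 GB alignment into memory.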
By the way, the mauve home page is http://gel.ahabs.wisc.edu/mauve and we've got a sourceforge project for it: http://sourceforge.net/projects/mauve

The "alignments" web page that Ian linked to is pretty badly out-of-date and desperately needs to be fixed. I think the drosophila alignments were generated by a buggy version of the aligner back in 2004 (yikes!).

-Aaron |
From: Mitch S. <li...@ar...> - 2007-02-01 18:21:54
|
On Thu, 2007-02-01 at 18:56 +1000, Aaron Darling wrote:
> I really liked the demo at http://genome.biowiki.org/
> Would it be possible for my friends at Univ. of Wisconsin to set up their
> own instance and load mammalian genome data into it?

The code is in the gmod CVS on sourceforge in gmod/Generic-Genome-Browser/ajax. Beyond that, Andrew wrote up how to install the pre-rendering stuff here: http://biowiki.org/twiki/bin/view/GBrowse/InstallTileRendering

The main obstacle at this point would be that rendering a human or mouse genome would take a long time. I've been working on speeding it up, but I haven't checked that stuff into CVS yet, as we've still not made a decision about whether or not we need the graphics primitive database, which I think may be the wrong approach. The graphics primitive database stores all of the drawing commands (like line, rectangle, text) so they can be replayed later in chunks. I think (but haven't yet shown) that we can do it all by storing graphics primitives in memory, which is more than an order of magnitude faster. I've appended the full discussion below.

Maybe Aaron can help us parallelize the rendering (MPI-tilerendering?) :-)

I (and Andrew?) will be exploring the options and taking some measurements in the near future. Rendering some larger chromosomes is pretty high on my to-do list, and then we'll have a better idea of how to move forward. I do think being able to handle large chromosomes is important functionality to demo, right up there with search and community annotation (Ian, Andrew: agree or disagree?).

Mitch

Reasons for the primitive DB:

1. Break up the genome by pixel coordinates into smaller chunks for rendering -- takes less RAM per chunk and can be parallelized. On the other hand, we may be able to break up the genome accurately without precomputing and storing all of the primitives. The main concern here is that it's sometimes difficult to predict the pixel span of a feature from its genomic span (especially with text labels). One option is to do the pixel-level layout of the entire track in each rendering job but only store/render the primitives that overlap the part of the track that's currently being rendered. Having each chunk recompute the full layout is redundant work, but I'm pretty sure that recomputing the layout is faster than fetching all the primitives from the database. BTW: this is one of the reasons to explore client-side labels.

2. If a small part of a track changes, we may be able to save work by re-rendering only a small area. As far as I know we're not yet storing which feature each primitive is for, so it's difficult to invalidate the right primitives. Plus, adding a new feature to a track can potentially cause a large portion of the track to change (if it has to be laid out ("bumped") again). And at this point I'm not sure how much we need to support individual feature creation/editing.

3. Doing only the layout/database-fill work up front and then rendering tiles from the database on demand reduces the amount of time people have to wait between uploading their annotations and seeing them in the browser. On the other hand, we may be able to do a full in-memory pre-rendering in less time than it takes to fill the database.

Am I missing anything here? |
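[Editor's sketch] The core of the primitive-store idea is simple: record each drawing command once with its horizontal extent, then replay only the commands that overlap a given tile. A toy Python version (the real TiledImage.pm intercepts GD calls in Perl; class and method names here are invented):

```python
# Record drawing commands in memory with their x-extent, then replay the
# subset overlapping a tile. A real implementation would forward the
# replayed commands to GD; here render_tile just returns them.

class PrimitiveRecorder:
    def __init__(self):
        self.prims = []   # in-memory store; the alternative is a SQL table

    def line(self, x1, y1, x2, y2):
        # keep the command plus its x bounding range for tile queries
        self.prims.append(('line', min(x1, x2), max(x1, x2), (x1, y1, x2, y2)))

    def rectangle(self, x1, y1, x2, y2):
        self.prims.append(('rect', min(x1, x2), max(x1, x2), (x1, y1, x2, y2)))

    def render_tile(self, tile_x, tile_w):
        """Replay every primitive whose x-extent overlaps the tile."""
        lo, hi = tile_x, tile_x + tile_w
        return [p for p in self.prims if p[1] < hi and p[2] > lo]
```

The database variant stores the same (xmin, xmax, command) rows in SQL; the in-memory variant trades RAM for the query round-trips, which is where the order-of-magnitude speedup comes from.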
From: Ian H. <ih...@be...> - 2007-02-02 00:22:13
|
Mitch Skinner wrote:
> The main obstacle at this point would be that rendering a human or mouse
> genome would take a long time. I've been working on speeding it up, but
> I haven't checked that stuff into CVS yet, as we've still not made a
> decision about whether or not we need the graphics primitive database,
> which I think may be the wrong approach. The graphics primitive
> database stores all of the drawing commands (like line, rectangle, text)
> so they can be replayed later in chunks. I think (but haven't yet
> shown) that we can do it all by storing graphics primitives in memory,
> which is more than an order of magnitude faster.
>
> Maybe Aaron can help us parallelize the rendering
> (MPI-tilerendering?) :-)
>
> I (and Andrew?) will be exploring the options and taking some
> measurements in the near future. Rendering some larger chromosomes is
> pretty high on my to-do list, and then we'll have a better idea of how
> to move forward. I do think being able to handle large chromosomes is
> important functionality to demo, right up there with search and
> community annotation (Ian, Andrew: agree or disagree?).

I agree. Big chromosomes are the top priority.

The quick-and-dirty patch for this was the "render-on-demand" approach, where you populate the primitives database but only pull out the primitives and create a tile image file when that tile (or one of its near neighbors) is requested by a client somewhere. I had some vague idea that there was already an "in_memory" sort of switch somewhere in TiledImage.pm that would bypass the primitives database, but maybe not.

Actually, early on, I did try to store the primitives in an R-tree but decided an SQL database was just faster. We did try various other things too, like creating a big SVG (which crashed) and a big PNG (which ran out of memory).

> Reasons for the primitive DB:
> 1. break up genome by pixel coordinates into smaller chunks for
> rendering -- takes less RAM per chunk, can be parallelized. On the other
> hand, we may be able to break up the genome accurately without
> precomputing and storing all of the primitives. The main concern here is
> that it's sometimes difficult to predict the pixel span of a feature
> from its genomic span (especially with text labels). One option is to do
> the pixel-level layout of the entire track in each rendering job but
> only store/render the primitives that overlap the part of the track
> that's currently being rendered. Having each chunk recompute the full
> layout is redundant work, but I'm pretty sure that recomputing the
> layout is faster than fetching all the primitives from the database.

I think that the fastest/slimmest solution of all would be if we could do the layout once, store the minimal information required to reproduce the layout (either in memory if we're generating all the tiles in one go, or in a database if we're doing render-on-demand), and then use this information to position and render each glyph. The only reason we didn't do this at first was that it seemed to require a more intimate familiarity with the workings of Bio::Graphics::Panel than we had at the time, and writing a GD interception layer seemed reasonably straightforward (and sufficiently "generic" that it might conceivably continue to be useful for a while). We could definitely benefit from having a chat with Lincoln about how GBrowse layout works.

Another thing to bear in mind is that the method used to render the tracks doesn't *have* to rest entirely on existing GBrowse code. For example, we could quite easily write a light & fast C/C++ track renderer for uploaded features, similar to UCSC's. Having this extra AJAX layer does allow us some flexibility & modularity in the way we choose to do rendering on the server.

> BTW: this is one of the reasons to explore client-side labels.
>
> 2. if a small part of a track changes we may be able to save work by
> re-rendering only a small area. As far as I know we're not yet storing
> which feature each primitive is for, so it's difficult to invalidate the
> right primitives. Plus, adding a new feature to a track can potentially
> cause a large portion of the track to change (if it has to be laid out
> ("bumped") again). And at this point I'm not sure how much we need to
> support individual feature creation/editing.

I personally am starting to think we should not worry about feature creation/editing until we have to, because it raises too many speed bumps like this...

> 3. doing only the layout/database fill work up front and then rendering
> tiles from the database on demand reduces the amount of time people have
> to wait between uploading their annotations and seeing them in the
> browser. On the other hand, we may be able to do a full in-memory
> pre-rendering in less time than it takes to fill the database.
>
> Am I missing anything here?

No -- I think this is a pretty succinct & accurate statement of the situation, especially #3. We should probably not do away with database approaches entirely. At some point, someone will inevitably try to render some track with 3 billion features per base or something crazy like that...

Ian |
From: Mitch S. <mit...@be...> - 2007-02-02 04:09:24
|
Ian Holmes wrote:
> We should probably not do away with database approaches entirely. At
> some point, someone will inevitably try to render some track with 3
> billion features per base or something crazy like that...

I had a conversation with Andrew today where we pretty much agreed on the same thing. I'm currently working on making memory/database a run-time option. So far I've touched TiledImage.pm, and I think I'm also going to need to touch generate-tiles.pl. So this will probably affect the on-demand rendering code a little as well.

Mitch |
From: Mitch S. <mit...@be...> - 2007-02-03 03:04:45
|
Mitch Skinner wrote:
> I had a conversation with Andrew today where we pretty much agreed on
> the same thing. I'm currently working on making memory/database a
> run-time option. So far I've touched TiledImage.pm and I think I'm also
> going to need to touch generate-tiles.pl. So this will probably affect
> the on-demand rendering code a little as well.

I committed a change that implements this. There's a new generate-tiles.pl option:

  -primdb <dsn>
      optional parameter to use a database to store graphical primitives.
      Slower, but uses less memory and allows for rendering on demand.
      example: -primdb DBI:mysql:gdtile

With the in-memory version, this is how the profile looks now (on the S. cerevisiae chr. I named genes track):

  Total Elapsed Time = 11.91910 Seconds
    User+System Time = 11.90910 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c Name
  116.   13.82  15.221 349363   0.0000 0.0000 TiledImage::AUTOLOAD
  45.8   5.464   5.464   2303   0.0024 0.0024 GD::Image::png
  12.6   1.504   1.504 230322   0.0000 0.0000 TiledImagePanel::map_pt
  7.85   0.935   4.260     73   0.0128 0.0584 TiledImage::renderTile
  7.79   0.928   0.928 460682   0.0000 0.0000 GD::Image::line
  7.73   0.920  10.749      1   0.9198 10.748 BatchTiledImage::renderTileRange
  7.33   0.873   0.873 230702   0.0000 0.0000 TiledImage::__ANON__
  5.11   0.608   0.608 460764   0.0000 0.0000 TiledImage::min
  4.11   0.489   0.489    491   0.0010 0.0010 Bio::Graphics::Glyph::_collision_keys
  4.01   0.478   0.478 460764   0.0000 0.0000 TiledImage::max
  3.39   0.404   0.404   2377   0.0002 0.0002 GD::Image::_new

This is with my modified version of GD. Clearly, DProf is confused about some things (I'm pretty sure that TiledImage::AUTOLOAD isn't taking 116% of the total runtime), but I'm hoping that it's right about the general ranking. I've got some ideas and code for reducing the amount of time taken by TiledImage::AUTOLOAD and GD::Image::png, but those might wait until after Andrew has committed some of the things he's got in mind. In particular, his gridline idea should help a lot with memory usage.

Mitch |
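[Editor's sketch] The run-time memory/database switch Mitch describes amounts to two backends behind one interface, chosen at startup by whether a DSN was supplied. A Python illustration (sqlite stands in for the MySQL-via-DBI table; class names echo but do not reproduce the project's Perl modules):

```python
# Two interchangeable primitive stores behind the same add/fetch API,
# selected at run time -- the shape of the new -primdb option. Schema
# and names are invented for illustration.
import sqlite3

class MemoryPrimStorage:
    def __init__(self):
        self.prims = []
    def add(self, xmin, xmax, cmd):
        self.prims.append((xmin, xmax, cmd))
    def fetch(self, lo, hi):
        # every command whose x-extent overlaps [lo, hi)
        return [c for (a, b, c) in self.prims if a < hi and b > lo]

class DBPrimStorage:
    def __init__(self, dsn=':memory:'):
        self.db = sqlite3.connect(dsn)
        self.db.execute('CREATE TABLE IF NOT EXISTS prim '
                        '(xmin INT, xmax INT, cmd TEXT)')
    def add(self, xmin, xmax, cmd):
        self.db.execute('INSERT INTO prim VALUES (?,?,?)', (xmin, xmax, cmd))
    def fetch(self, lo, hi):
        cur = self.db.execute(
            'SELECT cmd FROM prim WHERE xmin < ? AND xmax > ?', (hi, lo))
        return [row[0] for row in cur]

def make_storage(primdb=None):
    """Pick the backend the way a -primdb flag would."""
    return DBPrimStorage(primdb) if primdb else MemoryPrimStorage()
```

Because the renderer only sees add/fetch, the slower-but-leaner database path and the fast in-memory path stay drop-in replacements for each other.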
From: Mitch S. <mit...@be...> - 2007-02-08 23:31:46
|
I wrote:
> Clearly, DProf is confused about some things (I'm pretty sure that
> TiledImage::AUTOLOAD isn't taking 116% of the total runtime), but I'm
> hoping that it's right about the general ranking.

I've committed a change that uses closures to generate subs that do most of the work that AUTOLOAD was doing. This makes S. cerevisiae chrom 1 (all tracks + zoom levels) run about 38% faster. Here's the current profile (yeast_chr1 named gene zoom 1):

  Total Elapsed Time = -1.21238 Seconds
    User+System Time = 0 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c Name
  0.00   5.304   5.304   2303   0.0023 0.0023 GD::Image::png
  0.00   1.380  10.355      1   1.3798 10.354 BatchTiledImage::renderTileRange
  0.00   1.234   1.234 230361   0.0000 0.0000 GD::Image::line
  0.00   1.194   1.194 230322   0.0000 0.0000 TiledImagePanel::map_pt
  0.00   0.583   3.346     73   0.0080 0.0458 TiledImage::renderTile
  0.00   0.558   0.558 460764   0.0000 0.0000 TiledImage::max
  0.00   0.489   0.489    491   0.0010 0.0010 Bio::Graphics::Glyph::_collision_keys
  0.00   0.413   0.413 115472   0.0000 0.0000 MemoryPrimStorage::__ANON__
  0.00   0.398   0.567    253   0.0016 0.0022 Bio::Graphics::Glyph::collides
  0.00   0.374   0.374   2377   0.0002 0.0002 GD::Image::_new
  0.00   0.308   0.308 460764   0.0000 0.0000 TiledImage::min

DProf is clearly still pretty confused. I think DProf tries to subtract out its own overhead, so maybe it's overestimating that and subtracting too much. But I'm still going to go on the assumption that the general ranking is about right.

I know we're generating a pretty fair number of blank tiles at the higher zoom levels, so next I'm going to look into generating a blank tile ahead of time and just hardlinking to that whenever we're about to generate another blank one. Hopefully that will cut down on the time spent in GD::Image::png. I'm still not sure what to do about the gridlines; I'm going to leave that on the back burner for a bit.

Mitch |