From: Martin A. H. <ma...@ma...> - 2009-10-19 12:50:16
I was trying to upload 2.3M features from a GFF file and Perl crashed out:

perl(1734) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!

So JBrowse doesn't handle tracks with a bit of data?

Martin
From: Mitch S. <mit...@be...> - 2009-10-19 23:15:51
Martin A. Hansen wrote:
> I was trying to upload 2.3M features from a GFF file and Perl crashed out:
>
> perl(1734) malloc: *** mmap(size=16777216) failed (error code=12)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> Out of memory!
>
> So Jbrowse doesn't handle tracks with a bit of data?

JBrowse uses BioPerl to parse GFF files. Because of the possibility of parent/child relationships between features, the BioPerl GFF loader (using the Bio::DB::SeqFeature::Store adapter) has to load the entire file into memory, and it stores those features in memory as relatively heavyweight objects.

The possibilities I see for dealing with this are:

* Use a database. Then the database takes care of the parent/child relationships, and it breaks the data down by refseq (you could also break your GFF down by refseq if you haven't already, which should help).
* Use a different file format. I've tested with BED, which is streamable (there are no parent/child relationships between rows in BED, which means that we don't have to load the entire file into memory). It should also be possible to use SAM/BAM with JBrowse using Bio::SamTools, but I haven't tested it yet.

Assuming that entire 2.3M is in one track, then even once you get all that data into JBrowse's JSON format, the client-side code in the "master" branch will barf on that much data. I'm working on that in the "lazyfeatures" branch; there are still some rough edges there, but you could try it out and see how well it works for you.

What kind of data is this? Is this short-read sequencing data, or SNP data, or something else?

Mitch
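[To make the streaming distinction above concrete: a rough Python sketch — the actual loader is BioPerl, the parsing here is deliberately simplified, and the sample data is made up. BED rows are independent, so they can be emitted one at a time; general GFF3 must be buffered to the end of the file because a Parent= attribute may name an ID that appears later.]

```python
import io

def stream_bed(fh):
    """BED rows are independent, so each one can be emitted immediately:
    memory use stays O(1) in the number of features."""
    for line in fh:
        if line.startswith(("track", "browser", "#")) or not line.strip():
            continue
        chrom, start, end, *rest = line.rstrip("\n").split("\t")
        yield {"seq": chrom, "start": int(start), "end": int(end),
               "name": rest[0] if rest else None}

def load_gff3(fh):
    """General GFF3 cannot be streamed: a child's Parent= may reference an
    ID defined later in the file, so every feature must be buffered."""
    by_id, features = {}, []
    for line in fh:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        feat = {"type": cols[2], "start": int(cols[3]), "end": int(cols[4]),
                "id": attrs.get("ID"), "parent": attrs.get("Parent"),
                "children": []}
        features.append(feat)
        if feat["id"]:
            by_id[feat["id"]] = feat
    # Only now, with the whole file in memory, can parents be resolved.
    for feat in features:
        if feat["parent"] and feat["parent"] in by_id:
            by_id[feat["parent"]]["children"].append(feat)
    return [f for f in features if not f["parent"]]

bed = io.StringIO("chr1\t100\t200\tf1\nchr1\t300\t400\tf2\n")
print([f["name"] for f in stream_bed(bed)])  # → ['f1', 'f2']

# Note: the child precedes its parent here, which is legal GFF3.
gff = io.StringIO(
    "ctg1\tsrc\texon\t100\t200\t.\t+\t.\tID=e1;Parent=g1\n"
    "ctg1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1\n")
roots = load_gff3(gff)
print(roots[0]["id"], len(roots[0]["children"]))  # → g1 1
```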
From: Caroline <Car...@kc...> - 2009-10-20 12:44:15
Attachments:
flatfile-to-json.diff
Hello,

On Tue, 2009-10-20 at 00:15 +0100, Mitch Skinner wrote:
> It should also be possible to use SAM/BAM with JBrowse using
> Bio::SamTools, but I haven't tested it yet.

I had a go at this and it is. Patch attached (first time I've done this with git, give me a shout if it's not the right format).

> Assuming that entire 2.3M is in one track, then even once you get all
> that data into JBrowse's JSON format, the client-side code in the
> "master" branch will barf on that much data. I'm working on that in the
> "lazyfeatures" branch; there are still some rough edges there but you
> could try it out and see how well it works for you.

Awesome! What's the plan for this? I'm trying to knock up a sequence server (Perl, Catalyst) that will hand over our ChIPseq data from BAM files bit by bit, and eventually do the same for remote BAM files, DAS servers and so on. You mentioned this on the mailing list ages ago, but this is the first chance I've had to get around to it.

Can you point me in the right direction for getting it to play nicely with JBrowse? What will the lazyfeatures track expect from the server? How does it deal with zooming - can the server just decide to only return a hist summary at some point? What about caching? Does the browser grab data in defined chunks? What else should I be worrying about?

Cheers,
Cass
From: Mitch S. <mit...@be...> - 2009-11-20 05:53:18
Sorry it took so long to get back to you. I've replied inline below -

Caroline wrote:
>> It should also be possible to use SAM/BAM with JBrowse using
>> Bio::SamTools, but I haven't tested it yet.
>
> I had a go at this and it is. Patch attached (first time I've done this
> with git, give me a shout if it's not the right format).

Awesome. The patch looks great; I've applied a slightly different version that makes Bio::DB::Sam optional:

http://github.com/jbrowse/jbrowse/commit/82f29deedb57051bff81f9dc311bfd80b554e8e9

It's currently just on the "lazyfeatures" branch because I don't think it's that useful without the other stuff on that branch. Longer term, I'd like to try doing pulls from people's git repos, which will (e.g.) preserve authorship information in the git metadata. But patches are also fine and I'm happy to get them. This one worked fine for me.

It still uses much more memory than it should; that can be addressed, but I think it'll take more work rearranging the interface between JBrowse's JsonGenerator and its clients.

> Awesome! What's the plan for this? I'm trying to knock up a sequence
> server (perl, Catalyst) that will hand over our ChIPseq data from BAM
> files bit by bit and eventually do the same for remote BAM files, DAS
> servers and so on.
>
> Can you point me in the right direction for getting it to play nicely
> with JBrowse? What will the lazyfeatures track expect from the server?
> How does it deal with zooming - can the server just decide to only
> return a hist summary at some point? What about caching? Does the
> browser grab data in defined chunks? What else should I be worrying
> about?

I wrote this to try and answer these questions:

http://biowiki.org/view/JBrowse/LazyFeatureLoading

The short version is: yes, there's a hist summary; currently the hist counts are generated at a zoom level that's hard-coded. That's a terrible hack, and doing something smarter is definitely on the list. The client does grab data in defined chunks, and caches those.

After I had implemented lazy loading in JBrowse, I found out about the lazy/partial loading work that Heng Li has done for BAM and that Jim Kent has done for his BigBed/BigWig format. There was a big thread on samtools-devel about it:

http://sourceforge.net/mailarchive/forum.php?thread_name=6dce9a0b0911150626o701e07baq2c97c4135e5ffda9%40mail.gmail.com&forum_name=samtools-devel

There are a few messages from me in there that try to compare the JBrowse approach to the BAM and BigBed approaches. In the end, each of us came up with something different: Heng Li is using binning, Jim Kent is using r-trees, and I'm using NCLists. I don't think we can directly adopt either of the other two solutions for JBrowse, because they're doing a lot of bit-twiddling that I think would be hard to do in a web browser (I'm happy to have someone prove me wrong, though, and I'd be happy to talk about it in more detail if people are interested).

So my next thought was to wonder if I (or someone) could write a proxy that could act as a BAM/BigBed client and then serve JSON to JBrowse. I think it could be done, but it's not 100% clear in my head how to do it. I'd be happy to talk about what I've been thinking so far if you're interested in tackling this.

Earlier this year, I said that I didn't want to make the JBrowse JSON format a public thing because I wanted to be able to change it at will. Thinking about it some more, there are some aspects of the format that I think are pretty solid, and some other parts that are pretty likely to change. It might be possible to split out the likely-to-change bits from the unlikely-to-change bits; earlier I was worried about splitting things up too much and ending up with too many server round-trips, but maybe not. I'll write up a description of what's in there now, and then we could talk about where to go from there.

Thanks for the patch,
Mitch
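[The NCList structure Mitch mentions can be sketched in miniature. This is a toy Python version, not JBrowse's actual JSON-plus-JavaScript implementation: the real structure binary-searches where this scans linearly, and the feature tuples are invented for illustration. The key property is that sorting by (start ascending, end descending) lets contained intervals nest inside their containers, so no interval at one level contains another.]

```python
def build_nclist(intervals):
    """Build a nested containment list: sort features by (start asc, end desc)
    so that any feature contained in an earlier one lands in its sublist."""
    top, stack = [], []
    for start, end, name in sorted(intervals, key=lambda f: (f[0], -f[1])):
        node = {"start": start, "end": end, "name": name, "sub": []}
        # pop intervals that end before this one - they can't contain it
        while stack and stack[-1]["end"] < end:
            stack.pop()
        (stack[-1]["sub"] if stack else top).append(node)
        stack.append(node)
    return top

def query(nclist, lo, hi, hits):
    """Collect features overlapping [lo, hi). Because no interval at one
    level contains another, both starts and ends are sorted within a level,
    which is what lets the real structure binary-search; a linear scan is
    used here for clarity."""
    for n in nclist:
        if n["start"] >= hi:   # everything after this starts too late
            break
        if n["end"] > lo:      # overlap - report it and descend
            hits.append(n["name"])
            query(n["sub"], lo, hi, hits)
    return hits

feats = [(0, 100, "gene1"), (10, 40, "exon1"), (60, 90, "exon2"), (200, 300, "gene2")]
print(query(build_nclist(feats), 50, 70, []))  # → ['gene1', 'exon2']
```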
From: Martin A. H. <ma...@ma...> - 2009-10-20 06:30:08
Hello Mitch,

I have studied the GFF3 format, and it is quite clear to me that it should be possible to parse GFF3 entries in a step-wise manner using very little memory - including parent/child relationships (there is an optional ### record separator that, if in place, should make this much easier). However, it is also quite clear to me why GFF3 doesn't have a big fan club - it's way too complex.

BED files are a decent tabular alternative - however, BED comes out of the UCSC Genome Browser and depends on chrom, chromStart and chromEnd, which does not make it a good generic format for contigs and scaffolds. There will always be conflicts between the original BED format and the different BED-like formats sprouting up (e.g. between Jim Kent's BED tools and BEDTools). Also, I really dislike chromEnd not being the exact position as in an array. Finally, the itemRgb field is never used. My resolve is to stay clear of BED files unless you work with the UCSC Genome Browser.

SAM/BAM appears to be a developing alternative - especially because it seems like consensus is developing to use this format. However, I find it unnecessarily difficult to parse (the bit field), and I dislike that it is without field restrictions.

Oh, I got ranting about formats :P - I should test JBrowse with a database, which I haven't tried yet. Does Bio::DB support MySQL?

The data I have got is Solexa reads mapped to a bacterial genome using Bowtie. I hacked the Bowtie output into GFF3-like format.

Regards,
Martin

On Tue, Oct 20, 2009 at 1:15 AM, Mitch Skinner <mit...@be...> wrote:
> Martin A. Hansen wrote:
>> I was trying to upload 2.3M features from a GFF file and Perl crashed out:
>> [...]
> JBrowse uses BioPerl to parse GFF files. Because of the possibility of
> parent/child relationships between features, the BioPerl GFF loader (using
> the Bio::DB::SeqFeature::Store adapter) has to load the entire file into
> memory, and it stores those features in memory as relatively heavyweight
> objects.
> [...]
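[The step-wise parsing Martin describes can be sketched as follows. This is an illustrative Python fragment, not JBrowse code; the data is made up. It leans on the GFF3 spec's optional ### directive, which asserts that all forward references up to that point are resolved, so each batch can be parsed and then discarded, keeping memory proportional to the largest gene model rather than the whole file.]

```python
import io

def gff3_chunks(fh):
    """Yield batches of GFF3 feature lines at '###' boundaries. Per the
    GFF3 spec, '###' promises no later line refers back into the batch,
    so each batch can be fully resolved and then freed."""
    batch = []
    for line in fh:
        if line.startswith("###"):
            if batch:
                yield batch
                batch = []
        elif line.strip() and not line.startswith("#"):
            batch.append(line.rstrip("\n"))
    if batch:
        yield batch

gff = io.StringIO(
    "ctg1\tsrc\tgene\t1\t500\t.\t+\t.\tID=g1\n"
    "ctg1\tsrc\texon\t1\t200\t.\t+\t.\tParent=g1\n"
    "###\n"
    "ctg1\tsrc\tgene\t600\t900\t.\t+\t.\tID=g2\n")
print([len(b) for b in gff3_chunks(gff)])  # → [2, 1]
```

Without the ### directives (which are optional, as Martin notes), this degenerates to one batch spanning the entire file.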
From: Mitch S. <mit...@be...> - 2009-10-21 03:36:44
Martin A. Hansen wrote:
> I have studied the GFF3 format, and it is quite clear to me that it
> should be possible to parse GFF3 entries in a step-wise manner using
> very little memory - including parent/childs

Not in the general case, no. Imagine a file with 2.3 million features, all of which are children of one parent feature. Or that those 2.3 million features are children of a bunch of parent features, with all the parents at the end (or at some unpredictable place within the file).

If you impose the right kind of (topological) ordering constraint on the GFF, you could improve your average-case memory usage, at the cost of making your GFF parser much less general. Or you could build the ordering into your parser, but then you'd have to do a fairly memory-hungry topological sort at the beginning.

> Does Bio::DB support MySQL?

The BioPerl documentation talks all about this.

Mitch
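[The memory-hungry sort Mitch mentions can be illustrated with a hypothetical Python helper - names and data are invented. Reordering features so parents precede children is a Kahn-style topological sort over the Parent references, and doing it for a whole file means holding every feature in memory at once, which is exactly his point.]

```python
from collections import defaultdict, deque

def parent_first_order(features):
    """Reorder (id, parent_or_None) pairs so every parent precedes its
    children (Kahn-style topological sort). The whole feature list must
    be resident before any output can be emitted."""
    children = defaultdict(list)
    roots = deque()
    for fid, parent in features:
        if parent is None:
            roots.append(fid)
        else:
            children[parent].append(fid)
    order = []
    while roots:
        fid = roots.popleft()
        order.append(fid)
        roots.extend(children[fid])
    return order

# A legal GFF3 ordering: the children appear before their parent.
print(parent_first_order([("exon1", "gene1"), ("exon2", "gene1"), ("gene1", None)]))
# → ['gene1', 'exon1', 'exon2']
```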