From: Jim K. <ke...@so...> - 2009-10-30 07:33:40
I agree with you on the precomputed coverage files, Lincoln. The 'zoomed out' summary data we store now in BigWig and BigBed is both too big and too small. We store the max, min, sample count, sum, and sum-squared in each bin. From this we have max and min directly, and we can calculate mean and standard deviation quickly. However, it is a lot of floating point numbers to store, so the zoomed data is larger than the primary data until you zoom out by a factor of 10 or so. In fact, for most purposes you just want the mean or the max. On the other hand, for other purposes you'd like to store the first quartile, median, and third quartile as well. David H. asked for this and I just threw up my hands, since the zoomed data was already too large.

Having separate summary views for different uses seems like a win. Whether having things live in separate files is better than all in one file is a more debatable point, but since SAM/BAM has already taken the separate-file approach, as I said before, it certainly makes more sense to continue in this vein.

There are enough nice things about bigBed and bigWig, and they are far enough along in the production pipeline, that we will almost certainly keep them rather than try to wedge everything good about them into SAM/BAM. The bigWig is ultimately a more specialized format than SAM/BAM, just being a general way to have a floating point value associated with a base in the genome. The bigBed is in some ways more general, capable of storing gene models and an arbitrary number of extension fields, each described with both a word and a sentence in built-in metadata. The R-tree index of the big family is, I think, higher performance over a wider range of scales than the binning scheme used in BAM.

Still, it's an excellent thing to see cross-fertilization between the formats. Perhaps it will eventually lead to a convergence as extreme as the 2.0 version of each format being identical.
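
For readers who want to see how the five per-bin values mentioned above (max, min, sample count, sum, sum-squared) yield the mean and standard deviation, here is a minimal Python sketch. The class and field names (`BinSummary`, `valid_count`, and so on) are hypothetical and are not the bigWig on-disk layout; the use of the sample (n - 1) variance is also an assumption made for illustration. It also shows that two adjacent bins can be combined into a coarser zoom level without revisiting the primary data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BinSummary:
    """Hypothetical per-bin record holding the five summary values."""
    valid_count: int    # number of samples in the bin
    min_val: float
    max_val: float
    sum_data: float     # sum of the samples
    sum_squares: float  # sum of the squared samples

    @classmethod
    def from_values(cls, values):
        """Build a summary from raw per-base values (must be non-empty)."""
        values = list(values)
        if not values:
            raise ValueError("a bin needs at least one value")
        return cls(
            valid_count=len(values),
            min_val=min(values),
            max_val=max(values),
            sum_data=sum(values),
            sum_squares=sum(v * v for v in values),
        )

    def mean(self):
        return self.sum_data / self.valid_count

    def std_dev(self):
        """Sample standard deviation from the sums (assumed n - 1 form)."""
        n = self.valid_count
        if n < 2:
            return 0.0
        variance = (self.sum_squares - self.sum_data ** 2 / n) / (n - 1)
        # Clamp tiny negative values caused by floating point rounding.
        return math.sqrt(max(variance, 0.0))

    def merge(self, other):
        """Combine two bins into one coarser bin (one zoom step)."""
        return BinSummary(
            valid_count=self.valid_count + other.valid_count,
            min_val=min(self.min_val, other.min_val),
            max_val=max(self.max_val, other.max_val),
            sum_data=self.sum_data + other.sum_data,
            sum_squares=self.sum_squares + other.sum_squares,
        )


left = BinSummary.from_values([1.0, 2.0, 3.0])
right = BinSummary.from_values([4.0, 5.0])
zoomed = left.merge(right)
print(zoomed.mean(), zoomed.std_dev())
```

All five values merge by simple addition or comparison, which is what makes them convenient for zoom levels. Quartiles and the median, by contrast, cannot be recovered exactly from these five values.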
Short of full convergence, it is still likely to improve the next version of both formats. We're certainly going to have a good look at the BGZF library, and I'm leaning more and more towards breaking the "big" monoliths into components inside of a directory. It certainly is a pain to _implement_ the monolith, with many more offsets to keep track of, and to make it truly extensible would require an internal directory that would effectively be reinventing a file system.

An advantage both SAM/BAM and bigBed/bigWig have over "web services" is that they typically don't require institutional system administrator involvement for a small to medium-scale group to share their "large data" over the web. System admins are much more relaxed about letting users put static files on their web servers than they are about letting users install CGI scripts. That said, the amount of data a well-structured web services CGI script needs to send out a socket is likely to be half of what needs to be pulled to get a region from a BAM or BigBed/BigWig, since the index need not be sent, and only the fields the client is actually interested in need be sent. As important as the reduced size is the reduced number of round trips, from not having to first request index blocks and then request data blocks.

I'm having some thoughts about good genomics web services too, but I'm not really sure I want to do it in the DAS framework. Part of this, frankly, is just that the feedback I've given on DAS over the years has had very little impact. Part of it is the fragmenting nature of the DAS spec. Part of it is the XML inflation of everything. Perhaps we could start fresh, and maybe start with a JSON (or, shudder to say, an ASN.1) framework rather than XML. Perhaps that's a topic for another message list, but I do think that a good web services server these days does need to be able to serve BAM!
I'm hoping it could also serve bigWig/bigBed, relational database tables (but not necessarily joins), and tab-separated files in many flavors as well.

All the best,
Jim

> I do monitor this list and will add my 2c.
>
> The existing remote BAM file access works well, but the client needs
> to cache the index file locally to get good interactive performance.
> Precomputed coverage files should probably be stored as separate
> index files using a predictable naming scheme and not be
> incorporated into the BAM file itself. In fact, I imagine that the
> coverage files are a specific instance of a BAM "view", which might
> include other types of precomputed data such as genome-wide quality
> scores, variation counts, etc. How would it work if there were a
> "{basename}.view" subdirectory at the same level as the BAM file.
> Within this subdirectory would be a series of precomputed statistics
> files, along with some sort of meta file that describes what each
> view is. This would allow for a view extension mechanism.
>
> Lincoln
>
> On Sun, Oct 25, 2009 at 2:33 AM, Richard Durbin <rd...@sa...> wrote:
>
> Hello samtools developers,
>
> I was talking to Jim Kent at the ASHG meeting. The UCSC browser people
> were about to add regional remote BAM file access (or in the process of
> doing this).
> I said that this already existed, at both C and Perl levels (see
> below). Could someone reply to this copying the Santa Cruz people who
> are on this, directing them to how to get at these features and any
> documentation for them.
>
> They also would like binned density information for zoomed-out views,
> ideally stored in some other section of the BAM file. This might well
> be useful for other people, but would probably require significant
> changes (version n+1 of BAM format). I suggested an auxiliary file,
> which Jim thought less attractive, but might be an intermediate.
> Maybe it would be good for Jim or Angie or someone at
> UCSC to share their experience of how to encode this information well.
> Maybe Lincoln also has experience/ideas - I don't know if he is
> actively watching the samtools list.
>
> Jim and Angie, maybe someone from UCSC should subscribe to the
> samtools-devel sourceforge mailing list?
>
> Richard
>
> Jim Kent wrote:
> > Hi - Richard Durbin just told me that SAM/BAM now can live on remote
> > sites and be transported over HTTP like BigBed. I'm not sure about
> > HTTPS/FTP, but at any rate it looks like we *don't* need to hack this
> > into the SAM/BAM code ourselves, it's already been done at Hinxton.
> > He was agreeable to making it include some zoomed out data too, though
> > this is not so critical to their primary users (variant folks) as it
> > is to people doing CHIP-seq and such, so I'm not sure if it will be
> > high enough priority for them to get to.
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>
> --
> Lincoln D. Stein
> Director, Informatics and Biocomputing Platform
> Ontario Institute for Cancer Research
> 101 College St., Suite 800
> Toronto, ON, Canada M5G0A3
> 416 673-8514
> Assistant: Renata Musa <Ren...@oi...>