Re: [Samtools-devel] SAM/BAM goes remote/distributed

I agree with you on the precomputed coverage files Lincoln.

The 'zoomed out' summary data we store now in BigWig and BigBed is  
both too big and too small.  We store the max, min, sample count, sum,  
and sum-squared in each bin.  From this we have max and min directly,  
and we can calculate mean and standard deviation quickly.  However it  
is a lot of floating point numbers to store, so that the zoomed data  
is larger than the primary data until you zoom out by a factor of 10  
or so.  In fact for most purposes you just want the mean or the max.    
On the other hand for other purposes you'd like to store the first  
quartile, median, and third quartile as well.   David H. asked for  
this and I just threw up my hands since the zoomed data was already  
too large.   Having separate summary views for different uses seems  
like a win.
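
To make the arithmetic concrete, here is a minimal Python sketch of how mean and standard deviation fall out of the five stored values, and why bins can be merged cheaply when zooming out further. The class and field names are hypothetical, not the actual bigWig layout, and the sketch assumes the population (not sample) standard deviation.

```python
import math
from dataclasses import dataclass


@dataclass
class BinSummary:
    """Per-bin summary; field names are illustrative assumptions."""
    min_val: float
    max_val: float
    count: int        # number of samples in the bin
    total: float      # sum of the values
    total_sq: float   # sum of the squared values

    def mean(self) -> float:
        return self.total / self.count

    def std_dev(self) -> float:
        # Population variance: E[x^2] - (E[x])^2, clamped against rounding error.
        variance = self.total_sq / self.count - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))


def summarize(values):
    """Build a summary for one bin from its raw values."""
    return BinSummary(
        min_val=min(values),
        max_val=max(values),
        count=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def merge(a, b):
    """Combine two adjacent bins into one coarser bin without the raw data."""
    return BinSummary(
        min_val=min(a.min_val, b.min_val),
        max_val=max(a.max_val, b.max_val),
        count=a.count + b.count,
        total=a.total + b.total,
        total_sq=a.total_sq + b.total_sq,
    )
```

All five values combine by min, max, or addition, which is what makes them convenient for zoom levels. Quartiles and medians do not merge this way, which is part of why storing them exactly costs more.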

Whether having things live in separate files is better than all in one  
file is a more debatable point, but since SAM/BAM already has taken  
the separate file approach, as I said before, it certainly makes more  
sense to continue in this vein.

There are enough nice things about bigBed and bigWig, and they are far
enough along in the production pipeline, that we will almost certainly keep
them rather than try to wedge everything good about them into SAM/BAM.
The bigWig is ultimately a more specialized format than SAM/BAM,
just being a general way to have a floating point value
associated with a base in the genome.   The bigBed is in some ways
more general, capable of storing gene models, and an arbitrary number  
of extension fields that are described with both a word and a sentence  
in built-in metadata.  The R-tree index of the big family is, I think,  
higher performance over a wider range of scales than the binning  
scheme used in BAM.
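
For readers who have not seen it, the BAM binning scheme assigns each interval to the smallest bin that fully contains it, in a fixed hierarchy of six levels (512 Mbp down to 16 kbp). The following is a Python sketch of that calculation as I understand it from the SAM specification, assuming zero-based, half-open coordinates; the function name mirrors the specification's C example.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the bin number for the zero-based, half-open interval [beg, end)."""
    end -= 1  # work with the last base actually covered
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)   # 128 kbp bins
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)    # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)     # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)     # 64 Mbp bins
    return 0                       # single 512 Mbp bin
```

Because the bin boundaries are fixed, an interval that happens to straddle a boundary is pushed up to a much larger bin regardless of its own length. An R-tree, by contrast, groups items by their actual extents.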

Still, it's an excellent thing to see cross-fertilization between the
formats.   Perhaps it will eventually lead to a convergence as extreme
as the 2.0 version of each format being identical.  Short of that, it's
likely to still improve the next version of both formats.  We're certainly
going to have a good look at the BGZF library, and I'm leaning more  
and more towards breaking the "big" monoliths into components inside  
of a directory.  It certainly is a pain to _implement_ the monolith,  
with many more offsets to keep track of, and to make it truly  
extensible would require an internal directory that would effectively  
be reinventing a file system.

An advantage both SAM/BAM and bigBed/bigWig have over "web
services" is that a small to medium-scale group typically doesn't
need institutional system administrator involvement
to share their "large data" over the web.  System admins are much more
relaxed about letting users put static files on their web servers than
they are about letting users install CGI scripts.   That said,
the amount of data a well-structured web services CGI script needs to
send out a socket is likely to be half of what needs to get
pulled to get a region from a BAM or BigBed/BigWig, since the index
need not be sent, and only the fields the client is actually  
interested in need be sent.  As important as the reduced size is the  
reduced number of round trips needed by not having to first request  
index blocks and then request data blocks.
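
As a rough illustration of the static-file side of this trade-off, the sketch below builds (but does not send) the two HTTP range requests a client would typically need: one for index bytes and one for data bytes. The URL, offsets, and block sizes are made up for the example, and the helper name is hypothetical; a real client works out the data offsets from the index it fetched first, which is where the extra round trip comes from.

```python
from urllib.request import Request


def range_request(url: str, offset: int, length: int) -> Request:
    """Build an HTTP GET for `length` bytes starting at byte `offset`."""
    last = offset + length - 1  # the end of an HTTP byte range is inclusive
    return Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Hypothetical file location and byte offsets, for illustration only.
url = "http://example.org/data/sample.bam"

# Round trip 1: fetch part of the separate index file.
index_req = range_request(url + ".bai", 0, 65536)

# Round trip 2: fetch the data block the index pointed to.
data_req = range_request(url, 1_000_000, 65536)
```

A web service would collapse these into a single request naming the region of interest, at the cost of needing code running on the server.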

I'm having some thoughts about good genomics web services too, but not  
really sure I want to do it in the DAS framework.  Part of this  
frankly is just that the feedback I've given on DAS over the years has  
had very little impact.  Part of it is the fragmenting nature of the  
DAS spec.  Part of it is the XML inflation of everything.   Perhaps we  
could start fresh, and maybe start with a JSON (or, shudder to say,
an ASN.1) framework rather than XML.   Perhaps that's a topic for another
message list,  but I do think that a good web services server does
need these days to be able to serve BAM!  I'm hoping it could serve
bigWig/bigBed, relational database tables (but not necessarily joins)
and tab-separated files in many flavors as well.

All the best,
     Jim

> I do monitor this list and will add my 2c.
>
> The existing remote BAM file access works well, but the client needs  
> to cache the index file locally to get good interactive performance.  
> Precomputed coverage files should probably be stored as separate  
> index files using a predictable naming scheme and not be  
> incorporated into the BAM file itself. In fact, I imagine that the  
> coverage files are a specific instance of a BAM "view", which might  
> include other types of precomputed data such as genome-wide quality  
> scores, variation counts, etc. How would it work if there were a  
> "{basename}.view" subdirectory at the same level as the BAM file.  
> Within this subdirectory would be a series of precomputed statistics  
> files, along with some sort of meta file that describes what each  
> view is. This would allow for a view extension mechanism.
>
> Lincoln
>
> On Sun, Oct 25, 2009 at 2:33 AM, Richard Durbin <rd...@sa...> wrote:
> Hello samtools developers,
>
> I was talking to Jim Kent at the ASHG meeting.  The UCSC browser people
> were about to add regional remote BAM file access (or in the process of
> doing this).
> I said that this already existed, at both C and Perl levels (see
> below).  Could someone reply to this copying the Santa Cruz people who
> are on this, directing them
> to how to get at these features and any documentation for them.
>
> They also would like binned density information for zoomed-out views,
> ideally stored in some other section of the BAM file.  This might well
> be useful for other
> people, but would probably require significant changes (version n+1 of
> BAM format).   I suggested an auxiliary file, which Jim thought less
> attractive, but might be
> an intermediate.  Maybe it would be good for Jim or Angie or someone at
> UCSC to share their experience of how to encode this information well.
> Maybe Lincoln also has experience/ideas - I don't know if he is actively
> watching the samtools list.
>
> Jim and Angie, maybe someone from UCSC should subscribe to the
> samtools-devel sourceforge mailing list?
>
> Richard
>
> Jim Kent wrote:
> > Hi - Richard Durbin just told me that SAM/BAM now can live on remote
> > sites and be transported over HTTP like BigBed.  I'm not sure about
> > HTTPS/FTP, but at any rate it looks like we *don't* need to hack this
> > into the SAM/BAM code ourselves, it's already been done at Hinxton.
> > He was agreeable to making it include some zoomed out data too, though
> > this is not so critical to their primary users (variant folks) as it
> > is to people doing CHIP-seq and such, so I'm not sure if it will be
> > high enough priority for them to get to.
>
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>
> -- 
> Lincoln D. Stein
> Director, Informatics and Biocomputing Platform
> Ontario Institute for Cancer Research
> 101 College St., Suite 800
> Toronto, ON, Canada M5G0A3
> 416 673-8514
> Assistant: Renata Musa <Ren...@oi...>