From: Willem v. S. <wi...@sc...> - 2012-12-21 17:12:14
Hi Andrew,

Just a thought -- and maybe it isn't possible -- but can't you make your own
format as a standard PNG with some additional private chunks? That would give
you the advantage of reusing software libraries, and you'll have no "reinvent
the wheel" issues. Data chunks larger than 2GB aren't a problem; you could
handle them the same way as the multiple IDAT chunks in a PNG file.

Willem

On 2012-Dec-21, at 7:45 AM, Andrew Dalke wrote:

> Hi all,
>
> I'm developing a container format for a subfield of cheminformatics (which
> is itself a subfield of computational chemistry). I want to model the format
> after the PNG format, or more generically, use a chunk model. I would like
> some feedback from the PNG developers as to the usefulness of some of its
> features.
>
> My format so far has about 6 chunk types. The core, corresponding to IDAT,
> contains N repeats of "fingerprint" fields of size 'B' bits (typically B=1024,
> and typically is a multiple of 64 bits). This can be several GB in length, so
> I have a 64 bit length field. This is meant to be mmap-able, with space for
> internal alignment so the fingerprint is suitably aligned.
>
> That's all normal stuff, and I have no questions about it. What I'm curious
> about are:
>
> 1) How useful has the CRC been in practice for PNG?
>
> The CRC check value is one of the more unusual aspects of the PNG format
> over other FourCC-like formats. I've prototyped it in my own code, and
> it's not so hard to implement in the writer.
>
> However, verifying the CRC for each block read takes time. A cat > /dev/null
> of a 3GB file takes 45 seconds on my machine, and CRC adds a few seconds
> to that time. One of the reasons for the mmap-able design of my format
> is to allow fast loads, like for simple command-line queries. The fields
> are internally organized (via another block) in such a way that most
> searches take less than a second to operate. CRC verification will be the
> dominant term if I enable validation by default.
>
> I can make the validation be optional, or done via a different tool,
> but I can bet that no one will ever use it. As Glenn <glennrp> commented -
> as a bit of dark humor, I believe - on 2009-10-08:
>
>     There is even a recommendation
>     for a page or two of blank lines so an editor has a place to put
>     new stuff without changing the length. I suppose they don't worry
>     much about ruining the CRC. Who checks those anyhow?
>
> and Greg Roelofs commented on 2011-06-16:
>
>     You don't have to verify [CRCs]; and again, the size impact is
>     irrelevant for the cases that matter.
>
> Then, who does check them?
>
> That got me wondering why the CRC was there in the first place. The
> two answers I've come up with are a) network quality is sometimes shaky,
> especially in the 1990s, what with all the people using dial-up, so more
> internal checks are better, and b) PNG promises lossless compression, and
> strict validation helps ensure that no bits are lost.
>
> The mailing list archives only go back to 1995. The choice of CRC was
> long decided by that point and I found no mention of the reasoning in
> the 1995 or 1996 archives.
>
> Was it one of these two reasons, or for other reasons? How often do CRC
> failures occur in the wild? That is, from real corruption, and not because
> someone forgot to put the CRC in, or didn't implement that code correctly?
>
> In researching this, I found that something like 1 in 16 million to 1 in
> 10 billion network packets are bad, and not detected by the hardware error
> checker, and according to Bram Cohen, BitTorrent detects transmission
> errors at a rate of about once per 10 TB of data transfer. Given PNG's
> popularity, it's certain to have happened. I'm trying to figure out if this
> happens often enough that I need to be worried about it.
>
> My current opinion is that I don't need a check value.
>
> 1b) If it has been useful, then should I stay with the CRC-32, or use
> CRC-64? With PNG, the chunks are <2GB but I'll have chunks that are ~5GB.
>
> Obviously the strength of 3 CRC-32 checks, for the three sub-2GB blocks
> needed to add up to 5GB, is greater than the single CRC-32 for the entire
> 5GB block.
>
> 2) Has the big-endian nature of PNG been either a benefit or a nuisance?
> I'm thinking of using little-endian because some huge amount of the
> expected use is on Intel-compatible little-endian hardware. I suspect
> the choice here for me doesn't make any difference, except in how I define
> the CRC.
>
> 3) Various PNG blocks use the NUL character as a separator. Why was that
> done, instead of using the NUL as a terminator? Was it only to save one
> byte of space? Perhaps there's some security issue I don't understand, like
> the ability to sneak in extra data after the final terminator?
>
> I am thinking of using a NUL terminator so that my mmap'ed C data
> structures can point directly into the mmap'ed file, rather than allocating
> their own space for a string.
>
> 4) Do the bit 5 meanings of the chunk tag give the right flexibility?
>
> Based on what I've read in the archives, it does, but again, I wonder if
> I've missed something.
>
> Cheers,
>
> Andrew
> da...@da...
>
> _______________________________________________
> png-mng-misc mailing list
> png...@li...
> https://lists.sourceforge.net/lists/listinfo/png-mng-misc
>

--
Willem van Schaik
wi...@sc...
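
A minimal sketch of the private-chunk idea, written in Python for brevity and covering only the chunk framing (length, type, data, CRC-32), not a complete valid PNG file. The chunk name "fpDt" and the helper names are hypothetical, chosen for illustration only; the lowercase/uppercase pattern of the name is what carries the bit 5 property flags. A large payload is split over several chunks of the same type and concatenated again on reading, in the same way a decoder treats consecutive IDAT chunks.

```python
import struct
import zlib


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Frame one PNG-style chunk: 4-byte big-endian length, 4-byte type,
    data, then a CRC-32 computed over type + data (not over the length)."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def split_into_chunks(chunk_type: bytes, payload: bytes, max_size: int) -> list:
    """Split a large payload over several chunks of the same type."""
    return [
        make_chunk(chunk_type, payload[i:i + max_size])
        for i in range(0, len(payload), max_size)
    ]


def read_chunks(buf: bytes):
    """Yield (type, data) for each chunk in buf, verifying every CRC."""
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        chunk_type = buf[pos + 4:pos + 8]
        data = buf[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack_from(">I", buf, pos + 8 + length)
        if zlib.crc32(chunk_type + data) & 0xFFFFFFFF != crc:
            raise ValueError("CRC mismatch in chunk %r" % chunk_type)
        yield chunk_type, data
        pos += 12 + length  # 4 length + 4 type + 4 CRC + data


def chunk_flags(chunk_type: bytes) -> dict:
    """Decode the bit 5 (0x20, i.e. lowercase) property flag of each letter."""
    return {
        "ancillary": bool(chunk_type[0] & 0x20),
        "private": bool(chunk_type[1] & 0x20),
        "reserved": bool(chunk_type[2] & 0x20),
        "safe_to_copy": bool(chunk_type[3] & 0x20),
    }


# Hypothetical private chunk type: ancillary, private, safe to copy.
FINGERPRINT_TYPE = b"fpDt"

payload = bytes(range(256)) * 10  # stand-in for fingerprint data
stream = b"".join(split_into_chunks(FINGERPRINT_TYPE, payload, 1000))
restored = b"".join(
    data for ctype, data in read_chunks(stream) if ctype == FINGERPRINT_TYPE
)
```

Because each piece carries its own CRC-32, a reader can verify or skip the pieces independently, which also touches on the question in 1b about one check over 5GB versus several checks over smaller blocks.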