From: Greg R. <ne...@po...> - 2012-12-21 19:30:54
Andrew Dalke wrote:

> 1) How useful has the CRC been in practice for PNG?

There are other checks at the packet or file level these days; I don't
think I've encountered a PNG CRC error (other than in files I'm messing
with) in more than a decade. Then again, the most common exposure to
virtually all PNGs is via the browser, and a browser tends to show a
broken image rather than report "CRC check failed" or whatever, so
there's not really any good way of knowing how frequent it is.

I will say that I've encountered a number of truncated JPEGs over the
past decade, so transmission problems apparently still exist somewhere.

> and Greg Roelofs commented on 2011-06-16:
>> You don't have to verify [CRCs]; and again, the size impact is
>> irrelevant for the cases that matter.
> Then who does check them?

libpng does, and therefore virtually every PNG-supporting application
does. But if performance is your primary concern, you can (and probably
should) write a decoder that skips them.

It gets a bit trickier on the encoder side (for PNG), since you can't
really share PNGs with blank CRCs outside your own application. But if
you're doing the whole thing internally (which is effectively your case,
with non-PNGs), then that's an option, too.

> That got me to wondering why the CRC was there in the first place. The
> two answers I've come up with are a) network quality was sometimes
> shaky, especially in the 1990s, what with all the people using dial-up,
> so more internal checks are better, and b) PNG promises lossless
> compression, and strict validation helps ensure that no bits are lost.

That mostly covers it. Other reasons included the fact that floppies go
bad (and even hard drives occasionally, and these days definitely flash
and various writable CD/DVD formats); and perhaps also that binaries
distributed on Usenet came in many parts, inside a fragile ASCII format,
so they could get reassembled in the wrong order or with extra or
missing bits in the middle.

> In researching this, I found that something like 1 in 16 million to
> 1 in 10 billion network packets are bad yet undetected by the hardware
> error checker, and according to Bram Cohen, BitTorrent detects
> transmission errors at a rate of about once per 10 TB of data
> transferred. Given PNG's popularity, it's certain to have happened.
> I'm trying to figure out if this happens often enough that I need to
> be worried about it.

You could probably use those statistics to make a good estimate, but the
other half of the equation is how critical it is when something does go
bad. No one cares _too_ much if part of a web page doesn't display
right, but for scientific purposes, perhaps the cost is higher.

> 1b) If it has been useful, then should I stay with the CRC-32, or use
> CRC-64? With PNG, the chunks are <2GB, but I'll have chunks that are
> ~5GB.

The chunk size isn't strongly relevant; you're not worried about
malicious corruption (i.e., modifications engineered to carry the same
check value) but about accidental corruption, and the chance of a 5 GB
chunk having identical CRC-32s in the uncorrupted and corrupted cases is
still quite remote.

You could also use an Adler-32 check, as zlib does for the uncompressed
data; it's faster than CRC-32 (3x? not sure) and provides a moderate
level of assurance, though it's not as good as a CRC (which in turn
isn't as good as an MD5, etc.).
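If you want to compare the two, zlib exposes both checks through the
same style of API, so it's easy to benchmark them against each other on
your own data. A rough sketch (untested; the stdin loop is just for
illustration, but crc32() and adler32() are zlib's actual entry points):

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        unsigned char buf[8192];
        size_t n;

        /* Both checks use the same seeding idiom and are updated
         * incrementally, one buffer at a time. */
        uLong crc   = crc32(0L, Z_NULL, 0);
        uLong adler = adler32(0L, Z_NULL, 0);

        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
            crc   = crc32(crc, buf, (uInt)n);
            adler = adler32(adler, buf, (uInt)n);
        }
        printf("crc32:   %08lx\nadler32: %08lx\n", crc, adler);
        return 0;
    }

Since both are incremental, either one drops into a streaming decoder
without having to buffer whole chunks first.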
> 2) Has the big-endian nature of PNG been either a benefit or a
> nuisance? I'm thinking of using little-endian because some huge amount
> of the expected use is on Intel-compatible little-endian hardware. I
> suspect the choice here for me doesn't make any difference, except in
> how I define the CRC.

Hard to say. It's nicer to look at in a hex editor, and it probably
avoided some stupid programming shortcuts that would have led to fatal
alignment faults on non-Intel architectures. Also, since the main image
stream is compressed essentially at the bit level, there's no real cost
for the bulk of PNG processing. For your case, I'd guess none of those
reasons really matters.

> 3) Various PNG chunks use the NUL character as a separator. Why was
> that done, instead of using the NUL as a terminator? Was it only to
> save one byte of space? Perhaps there's some security issue I don't
> understand, like the ability to sneak in extra data after the final
> terminator?

You either need a particular separator character--and NUL is rarely used
in textual contexts--or you need additional count values. (The overall
length is determined by the chunk size.) NUL-as-separator is more
robust, I'd argue; count values can be misinterpreted as text.

Or are you simply asking why there wasn't additionally a NUL terminator?
Yes, that was for space reasons; we tried to avoid unnecessary
redundancy.

> I am thinking of using a NUL terminator so that my mmap'ed C data
> structures can point directly into the mmap'ed file, rather than
> allocating their own space for each string.

Sounds fine to me.

> 4) Do the bit-5 meanings of the chunk tag give the right flexibility?
> Based on what I've read in the archives, they do, but again, I wonder
> if I've missed something.

I think so, but there haven't been all that many additional chunks
defined (even privately, AFAIK), so it's hard to be completely certain.

Greg
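P.S. In case it helps with your own tag design: the bit-5 convention is
just the ASCII case bit (0x20) of each of the four chunk-name bytes,
with a lowercase letter meaning the property bit is set. A rough sketch
of the tests in C (untested; the helper names are my own invention):

    #include <stdio.h>

    /* Each property is bit 5 (0x20) of one byte of the four-byte
     * chunk name; a lowercase letter means the property is set. */
    static int is_ancillary(const unsigned char n[4])
    {
        return (n[0] & 0x20) != 0;   /* clear = critical chunk */
    }

    static int is_private(const unsigned char n[4])
    {
        return (n[1] & 0x20) != 0;   /* clear = public chunk */
    }

    static int reserved_ok(const unsigned char n[4])
    {
        return (n[2] & 0x20) == 0;   /* must currently be uppercase */
    }

    static int is_safe_to_copy(const unsigned char n[4])
    {
        return (n[3] & 0x20) != 0;   /* editors may copy if unknown */
    }

    int main(void)
    {
        const unsigned char name[4] = { 't', 'E', 'X', 't' };  /* tEXt */

        printf("ancillary=%d private=%d reserved-ok=%d safe-to-copy=%d\n",
               is_ancillary(name), is_private(name),
               reserved_ok(name), is_safe_to_copy(name));
        return 0;
    }

The main consumers of these bits are generic tools: a decoder can skip
an unknown ancillary chunk safely, and an editor uses the safe-to-copy
bit to decide whether an unknown chunk survives a rewrite.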