From: Keith M. <kei...@us...> - 2010-02-15 21:04:03
I'd like to share an anomaly of which I myself became aware only this past weekend; my attention was drawn to it by this bug report: http://preview.tinyurl.com/yz6lb57

Of course, the bug in this case is that the poster failed to specify O_BINARY (or _O_BINARY, for the Microsoft purists among you) when he opened his binary data file for I/O. (That Microsoft require this is a portability issue, and it may be a nuisance, but it is not a bug.)

Now, I have long been aware of the behind-the-scenes conversion of LF bytes to CRLF byte-pairs when using MSVCRT's write() function on a data stream opened without O_BINARY in effect, and the poster's output experience is exactly what I would expect. However, I was surprised that the complementary read() request for the four bytes representing an int may be satisfied with an apparent return count of only three[*] bytes, when any byte-pair in the real input stream has the sequential values 13 and 10; in that case the 13 is discarded, and the following byte, with the value 10, is read in its stead, with only three effective bytes being read.

[*] Here, I would have expected *five* bytes to be consumed from the real input stream, discarding the initial byte of the CRLF sequence and leaving the same four bytes as written by the complementary write() request to be returned to the application. However, the attached transcript illustrates that the reported behaviour, with only *four* bytes consumed and three reported to the application, is observed in reality; this asymmetric behaviour of read() and write() seems quite anomalous, IMO.

In the transcript, the anomaly is illustrated by five case studies, each of which writes the four integer values 2573, 10, 13 and 2573, and subsequently reads them back into a buffer which has been initialised to 0xffffffff, with, respectively:

CASE 0: O_BINARY specified for *neither* writing nor reading.
CASE 1: O_BINARY specified for reading, but not for writing.
CASE 2: O_BINARY specified for *both* writing and reading.
CASE 3: O_BINARY specified for writing, but not for reading.
CASE 4: Same as CASE 0, but with only the three int values 10, 13 and 2573 initially written to the output stream[**].

The moral, of course, is always to specify O_BINARY when opening a data stream for reading or writing binary data; this is the only way in which predictably correct behaviour can be guaranteed, as is illustrated in CASE 2.

[**] CASE 4 exhibits really strange behaviour, which IMO can only be classified as a Microsoft bug; 14 bytes are written, and on reading them back, with four bytes requested each time:

1) the sequence \r \n \0 \0 is read as 3 bytes, discarding the \r and becoming an int value of 10 -- okay; it is odd because it has actually modified all four bytes in the input buffer, loading three from the CRLF-->LF converted data stream and zeroing the fourth, but I might be persuaded to accept this behaviour.

2) the sequence \0 \r \0 \0 is read as 4 bytes, becoming an int value of 3328 -- again okay; it's what I would expect.

3) the sequence \0 \r \r \n is read as 3 bytes, and becomes the int value 168430848 -- this has to be a bug, because the \r preceding the \n should be discarded, leaving \0 \r \n \?; if \? is zeroed (as it appears to be in the \r \n \0 \0 case), then the value should become 658688, or, if only three bytes of the input buffer are updated, leaving the byte in the position of \?
unchanged, then it should be -16118528. In fact, what appears to happen is that the four-byte input sequence is read into the input buffer as \0 \r \r \n, then the CRLF in the most significant two bytes is adjusted, by replacing the \r by \n, to convert CRLF-->LF, but it then fails to zero out the \n already in the most significant byte, and so the input buffer is left with the content \0 \r \n \n, which *does* represent the value 168430848 -- odd behaviour, and clearly wrong, (but then we expect such oddities from Microsoft, and they rarely disappoint us).

4) finally, the remaining two zero bytes are read, replacing the least significant pair in the input buffer but leaving the most significant pair unchanged, for a value of 0xffff0000, which is correctly represented as a decimal value of -65536; once again, this is what I would expect.

--
Regards, Keith.