#1396 Problem reading binary files with read()

Known_Feature
closed-invalid
nobody
gcc (462)
2014-08-15
2010-02-14
Bjorn S.
No

Hello,
I have a little problem with MinGW and binary I/O on Windows 7. It looks like a bug, and I hope that is the case; apologies if I disturb you unnecessarily. I have tried searching the mailing lists and the tracker without finding anything, but I am a beginner at using those. I am not sure which category to use.

The following little program writes 7 integer variables and one double.

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main( int argc, char** argv )
{
int fd, ns, ds, olap, polap, nc, ng, sl, nr, one=1;
double x=9.89;

fd = open( "testfile.bin", O_CREAT | O_TRUNC | O_WRONLY , 0660 );
sl = 8;
nr = write(fd,&sl,sizeof(int));
printf("sl write %i bytes \n",nr);
nr = write(fd,&x,sl);
printf("x write %i bytes \n",nr);
nr = write( fd, &one, sizeof(int) );
printf("one write %i bytes \n",nr);
ng = 3;
ds = 8;
nc = 5;
olap =10;
polap = 12;
nr = write( fd, &ng, sizeof(int) );
printf("ng write %i bytes \n",nr);
nr = write( fd, &ds, sizeof(int) );
printf("ds write %i bytes \n",nr);
nr = write( fd, &nc, sizeof(int) );
printf("nc write %i bytes \n",nr);
nr = write( fd, &olap, sizeof(int) );
printf("olap write %i bytes \n",nr);
nr = write( fd, &polap, sizeof(int) );
printf("polap write %i bytes \n",nr);
close(fd);
}

When run, it produces the following output:
-----------------------------------------------------------------------

sl write 4 bytes
x write 8 bytes
one write 4 bytes
ng write 4 bytes
ds write 4 bytes
nc write 4 bytes
olap write 4 bytes
polap write 4 bytes

which is ok, but it is odd that the file size is reported as 37 bytes instead
of the 36 that were written:

Directory of C:\Users\bjorn\Solvers\adpdis3d\tools

02/13/2010 04:51 PM <DIR> .
02/13/2010 04:51 PM <DIR> ..
02/13/2010 04:51 PM 1,153 #error-report.mai#
02/13/2010 04:51 PM 910 #wrichk.c#
02/13/2010 03:35 PM 21,601 a.exe
02/13/2010 03:35 PM 1,245 chkfile.c
02/12/2010 11:51 PM 860 chkfile.c~
02/13/2010 04:50 PM 37 testfile.bin
02/13/2010 04:50 PM 20,916 wr.exe
02/13/2010 03:30 PM 944 wrichk.c
02/13/2010 08:09 AM 259 wrichk.c~
9 File(s) 47,925 bytes
2 Dir(s) 439,667,093,504 bytes free

-----------------------------------------------------------------------
Next I wrote another little program to read the file:

#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main( int argc, char** argv )
{
int fd, slask, nr, nsols, ng, ds, nc, iolap, polap;
int readfile =1,i;
char buf[300];

if( argc > 1 )
fd = open( argv[1], O_RDONLY );
else
fd = open("../run-tmpmix/flowout.bin",O_RDONLY );
if( fd == -1 )
{
printf("Error opening %s\n",argv[1]);
exit(2);
}
if( readfile == 1 )
{
nr = read(fd,&slask,sizeof(int));
nr = lseek(fd,slask,SEEK_CUR);
nr = read(fd,&nsols,sizeof(int));
nr = read( fd, &ng, sizeof(int) );
nr = read( fd, &ds, sizeof(int) );
nr = read( fd, &nc, sizeof(int) );
nr = read( fd, &iolap, sizeof(int) );
printf("olap read %i bytes \n",nr);
nr = read( fd, &polap, sizeof(int) );
printf("polap read %i bytes \n",nr);
printf("slask = %i \n",slask);
printf(" nsols = %i ng = %i ds = %i nc = %i iolap = %i polap = %i \n",
nsols, ng, ds, nc, iolap, polap );
}
else
{
nr = read(fd,buf,300);
for( i=1 ; i <=nr ; i++ )
printf("byte %i = %x \n",i,buf[i-1]);
}
}
which produces the output:

....

olap read 3 bytes
polap read 4 bytes
slask = 8
nsols = 1 ng = 3 ds = 8 nc = 5 iolap = 10 polap = 3072

Here is what is annoying me. The second-to-last variable is read with only 3 bytes, and the variable after it gets an incorrect value. Shouldn't integers always have 4 bytes? I have a rather big program that relies on the open/read/write binary I/O, so I am not interested in workarounds that require a lot of changes in my code. Is this a bug? Or is it something I am doing wrong?

Note, reading the entire file as a sequence of bytes
char buf[300];

nr = read(fd,buf,300);
produces correctly nr=36 and the bytes look right.

For version information, see below

Bjorn

C:\Users\bjorn\Solvers\adpdis3d\tools>gcc -v
Using built-in specs.
Target: mingw32
Configured with: ../gcc-4.4.0/configure --enable-languages=c,ada,c++,fortran,jav
a,objc,obj-c++ --disable-sjlj-exceptions --enable-shared --enable-libgcj --enabl
e-libgomp --with-dwarf2 --disable-win32-registry --enable-libstdcxx-debug --enab
le-version-specific-runtime-libs --prefix=/mingw --with-gmp=/mingw/src/gmp/root
--with-mpfr=/mingw/src/mpfr/root --build=mingw32
Thread model: win32
gcc version 4.4.0 (GCC)

C:\Users\bjorn\Solvers\adpdis3d\tools>ld -v
GNU ld (GNU Binutils) 2.20

C:\Users\bjorn\Solvers\adpdis3d\tools>ver

Microsoft Windows [Version 6.1.7600]

Discussion

  • Keith Marshall

    Keith Marshall - 2010-02-14
    • milestone: --> Known_Feature
    • status: open --> closed-invalid
     
  • Keith Marshall

    Keith Marshall - 2010-02-14

    Yes, it is a bug, but in your code, not in MinGW.

    > fd = open( "testfile.bin", O_CREAT | O_TRUNC | O_WRONLY , 0660 );

    This doesn't open the stream for binary output; unless you explicitly specify O_BINARY among the open flags, MSVCRT's I/O subsystem (which is used by MinGW) will treat the data you write as text, regardless of its actual content. This means that, for every `\n' byte (decimal value 10) you write, the I/O service routine will transparently insert a preceding `\r' byte (decimal value 13). Thus, when you write the four-byte representation of `olap' (decimal byte sequence 10, 0, 0, 0), MSVCRT will actually write five bytes: 13, 10, 0, 0, 0. However, it will return a count of only four, excluding the extra byte it added behind the scenes; this explains the extra byte in the file, when compared to the actual byte count you wrote.

    Conversely, when you read the data back, when MSVCRT's read service encounters a `\r' byte in text mode input, it looks ahead one byte, and if the next byte is `\n' then the `\r' is discarded, uncounted, and `\n' is returned for a count of only one byte. This may explain the short count you observed on input, although I am surprised -- I haven't run your example code myself -- that when you requested four bytes you apparently got only three; however, any anomaly is in Microsoft's code, not MinGW's.

    BTW, using lseek() on text streams in MSVCRT dependent code, (as anything compiled with MinGW normally is), is almost guaranteed to cause heartache, (because of the handling of `\r' bytes).

     
  • Keith Marshall

    Keith Marshall - 2010-02-16

    While this bug is technically in your code, the short read you observed piqued my interest sufficiently to open a mailing list discussion at

    http://thread.gmane.org/gmane.comp.gnu.mingw.user/32120

    Do note that, besides your omission of O_BINARY, the ensuing discussion highlights a further bug in *your* code: read() is not required to return the full number of bytes requested, (even on POSIX systems), and when it returns a short count, it is *your* responsibility to take appropriate follow up action, which in your example, you don't.

    While the behaviour of MSVCRT's read() function may seem strange, and its inconsistent behaviour might be described as buggy, this is one of those cases where the bug affects only code which ventures into the realms of undefined behaviour; thus it may be accurately described as a feature, so it merits no effort to fix it. In other words, their bug is only exposed when a more serious bug exists in *your* code; thus their bug is technically benign, and no advantage accrues from fixing it.

     
  • Bjorn S.

    Bjorn S. - 2010-02-16

    Thank you for the comments, they were very helpful. Apologies for the mistaken bug report. I am porting code from Linux, where the O_BINARY flag is not used. This Linux-Windows incompatibility is the main issue (although not a bug). The three byte read is not a bug if it was actually three bytes that were read, but one wonders what would have happened if the integer was large enough to require four bytes to represent.

     
  • Keith Marshall

    Keith Marshall - 2010-02-17

    > I am porting code from Linux, where the
    > O_BINARY flag is not used.

    Not only is it not used; it is not even *defined* in standard headers. This is why I said (in the mailing-list thread) that Microsoft requiring it is a nuisance. I usually handle it by *always* including it in the flags argument to my open() calls, having first added something like:

    #ifndef O_BINARY
    # ifdef _O_BINARY
    # define O_BINARY _O_BINARY
    # else
    # define O_BINARY 0
    # endif
    #endif

    either in any file which needs it, immediately after the last #include, or in a project "platform.h" header, included by the last #include in the source file; this makes it a valid identifier having no effect, for the Linux/Unix/POSIX case.

    > The three byte read is not a bug if it was actually
    > three bytes that were read, but one wonders what
    > would have happened if the integer was large
    > enough to require four bytes to represent.

    You seem to be missing the point here. On a 32-bit platform, an int *always* needs four bytes! However, it is a bug in your code to blindly assume that

    read( fd, &value, sizeof( int ) );

    will automatically return four bytes. You *must* read four bytes, and you *must* check the return value from read(), to ensure that you got them all. You need something like:

    int value, n;
    size_t count;
    .
    .
    count = 0;
    while( count < sizeof( int ) )
    {
      n = read( fd, ((char *)(&value) + count), sizeof( int ) - count );
      if( n <= 0 )
        break;   /* 0 is end of file, -1 is an error */
      count += n;
    }

    if( count < sizeof( int ) )
    /* Input error: handle it! */
    .
    .
    This isn't specific to MS-Windows; you may also encounter a similar failure condition on Linux/Unix/POSIX, so you *should* use such defensive coding *everywhere*.