In MS-DOS v1.x, all disk files were stored as a collection of fixed length blocks, (of 128 bytes each, IIRC). Since there was only a 1 in 128 probability that any file would exactly fill out the last block allocated, that block was padded with random garbage, after the last logical byte in the file, and the true logical end of file was marked by placing a 0x1A byte after it, (or maybe all the padding bytes were set to 0x1A; it was a long time ago, and my recollection is hazy).
All tools designed to read text files were coded to recognize the first 0x1A byte read on input as being beyond the logical end of file, and would immediately return EOF, reading no more data; that convention is still honoured today, in all versions of Windoze, for reading from files opened in text mode. Being a text mode tool, `sed' will see the first 0x1A byte encountered as a hard end of file marker, and will not read either it, or anything beyond it. (I'm not aware of any command line option to cause `sed' to open its input stream in binary mode, and can see nothing appropriate in the `info' manual; not surprising really, for there is no text/binary distinction on *nix, whence `sed' originates).
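To picture what a text-mode reader does with that byte, here is a rough emulation for a *nix shell, where no such truncation happens natively. It is only a sketch: `text_mode_view` is a made-up helper name, and it simply uses awk with the 0x1A byte as record separator to print whatever precedes the first 0x1A. It is not a description of how the Windows C runtime is actually implemented.

```shell
# Hypothetical helper: show only the part of the input that a DOS/Windows
# text-mode reader would deliver, i.e. everything before the first 0x1A byte.
text_mode_view() {
    # Use the 0x1A byte itself as awk's record separator, print the first
    # record without adding a newline, then stop reading.
    awk -v RS="$(printf '\032')" 'NR == 1 { printf "%s", $0; exit }'
}

# Sample input (invented): a 0x1A byte sits in the middle of the second line.
printf 'one\ntwo\032three\nfour\n' | text_mode_view | od -c
```

Everything from the 0x1A byte onwards, including the later lines, never reaches the consumer; that is the effect described above.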
FWIW, this is why Windoze command line tools use `Ctrl-Z', which generates a 0x1A byte, as the EOF signal on standard input, where *nix uses `Ctrl-D', (0x04).
If all you are interested in doing is removing special characters, `tr' is a better choice than `sed' in any case; even to do a one for one transliteration, `tr' is still the better choice. If you need some extra capability, which `sed' can provide but `tr' can't, then you will have to do something like:
cat infile | tr -d "\032" | sed ...
to filter out the 0x1A bytes (032 in octal), before passing the residual input stream to `sed'; (of course, this means your `sed' script will never see the 0x1A bytes).
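As a quick check of that pipeline on a small sample, something like the following could be used. The sample text and the sed expression are invented for illustration; only the `tr -d` stage comes from the suggestion above.

```shell
# Sample input (invented for this demo): the second line carries a 0x1A byte.
sample() { printf 'alpha\nbe\032ta\ngamma\n'; }

# Delete every 0x1A byte first, then let sed do the real work.
# The sed expression is only a placeholder that upper-cases "gamma".
sample | tr -d '\032' | sed 's/gamma/GAMMA/'
```

Because the byte is gone before sed starts reading, sed has nothing to mistake for an end of file marker and processes all three lines.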
BTW, the MSYS implementation of `sed' for Windoze, (see www.mingw.org), doesn't have this limitation; it has been patched to treat the input stream as *binary* data, even though it is normally text. Unfortunately, in the current MSYS-1.0.10 release, the `sed' provided is a rather dated version; it doesn't understand the '\xnn' notation, (nor even the '\0nn' octal notation), for special characters, but then, POSIX `sed' doesn't require this, so it is not a portable construct anyway.
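For a `sed' that lacks the '\xnn' notation, one portable workaround is to let the shell's `printf' produce the byte and splice it into the script. A minimal sketch, assuming a POSIX shell; the variable name `sub` and the sample input are made up for illustration:

```shell
# Hold the raw 0x1A byte in a shell variable; printf's octal escape is POSIX.
sub=$(printf '\032')

# Splice the byte into the sed script instead of writing \x1A.
printf 'a\032b\n' | sed "s/${sub}/ /g"
```

This only helps where `sed' actually gets to read the byte, as with a build that treats its input as binary; it does nothing about the text-mode truncation itself.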
HTH,
Keith.
Just wanted to add my experience with this character and a solution as to how I got around it.
I've been bitten by this End of File 0x1A character problem with Sed when working on some scripts parsing WINS name registration records that seem to contain all kinds of garbage and illegal characters, such as \r, \f and \t, and one record containing \x1A.
Now the strange thing is that on my Windows XP SP2 workstation, where I write my scripts, this problem doesn't appear: Sed runs through the entire input file of ~105,000 records and outputs all the formatted records. But when I remotely execute my script on a Windows 2003 Server SP0, Sed stops processing the input file right when it encounters the line with the End Of File 0x1A character, at about ~69,000 records.
I've tested this behavior by creating a small test file containing some text and the special 0x1A character; parsing only that small file shows the same behavior.
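A small test file of that kind could be built along these lines. This is only a sketch: the file name `sub_test.txt` and the record text are arbitrary, and a POSIX shell with `printf` is assumed rather than the Windows command prompt.

```shell
# Build a small test file (the name is arbitrary) with a 0x1A byte
# on its second line.
printf 'record one\nrecord \032 two\nrecord three\n' > sub_test.txt

# Print the number of lines sed gets to see. A sed that stops at the
# 0x1A byte will report fewer than the 3 lines the file really has.
sed -n '$=' sub_test.txt
```

Running the same two commands on both machines gives a quick way to compare them without the full ~105,000 record input.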
I'm using GnuWin32 Sed.exe 4.1.5 with LibIConv2.dll 1.9.2.1747 and LibIntl3.dll 0.14.4.1952 on both systems and these 3 files are copied to the remote server for execution.
That -B option for binary mode is not available in version 4.1.5 of GnuWin32 Sed.exe, but -T (text mode) is, so I am guessing that binary mode is now the default for opening files, and that is good.
When checking the DLL dependencies MSVCRT.DLL comes up with a lot of references to the basic functions for opening and dealing with files. My guess is that this is the file that contains a bug in the SP0 on server that has been fixed in the SP2 on the XP workstation.
Microsoft Windows 2003 Server SP0 - 7.0.3790.0 shp 327,168 03-25-2003 msvcrt.dll
Microsoft Windows XP Professional SP2 - 7.0.2600.2180 shp 343,040 07-20-2006 msvcrt.dll
Unfortunately, because that DLL is used by so many applications it is already loaded in memory, so I cannot force Sed.exe to re-load it by placing a copy in the same folder as the executable.
I basically resolved this problem by putting a Tr.exe replacement of 0x1A (octal 032) in front of my Sed commands, before sending the file to Sed.
type "file.txt" | tr.exe "\032" " " | sed.exe ...
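The difference between replacing the byte and deleting it can be seen on a one-line sample (invented here, and run in a POSIX shell rather than the Windows command prompt). Replacing with a space keeps every following character in its original column, which may matter for fixed-width records, while `tr -d` shifts them left by one.

```shell
# Invented sample: a 0x1A byte between "AB" and "CD".
sample() { printf 'AB\032CD\n'; }

replaced=$(sample | tr '\032' ' ')   # 0x1A becomes a space: same length as before
deleted=$(sample | tr -d '\032')     # 0x1A is dropped: one character shorter

printf '[%s]\n[%s]\n' "$replaced" "$deleted"
```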
I have a sed script to replace special characters with a space:
s/\x00/ /g
but when I run into the hex 1A character, the sed program ends...
s/\x1A/ /g
I don't seem to have problems with any other characters, even those outside the ASCII range.
This is an artifact from the days of MS-DOS v1.x.
The latest release of sed (4.1.4) has the option -B for reading and writing in binary mode; see sed --help. This is a special GnuWin32 option.