From: Rony G. F. <Ron...@wu...> - 2010-12-18 10:52:13
On 18.12.2010 08:12, Mike Cowlishaw wrote:
>>>> I have a Rexx program that merges several small files onto one large
>>>> one. As it turned out a few of the small files were prefixed with a
>>>> UTF8 BOM, |0xEFBBBF|. Should the BOM have been recognized and
>>>> discarded?
>>>
>>> How could Rexx (or any other processor) decide that some particular
>>> prefix/content/suffix of a file is worthless and should be discarded?
>>> ("darn it, this file ends in 'ILY'; delete that!").
>>
>> It would handle it as any other text processor. Open the
>> file, read the first three or four bytes. If no BOM is
>> present reposition to the beginning, else position to the
>> first char after the BOM.
>>
>> I realize that Rexx can not handle wide characters and use of
>> the UTF8 BOM is discouraged, and at least on *ix systems can
>> lead to problems with some apps.
>> But the use of UTF8 is not forbidden. So when processing text
>> files, it seems to me that a BOM should be checked for, even
>> if it is ignored. Or a error issued for an unsupported
>> encoding. For UTF8 I would ignore it and process the file as ASCII.
>
> I wasn't clear -- sorry. I meant: what if the program *wants* to be able to
> read the BOM and then process the file appropriately? Such a program would be
> broken if the standard file read discarded the BOM.
>
> Perhaps what you need is a wrapper function/class that will do exactly that
> ('readUTF8' ... etc.).

This kind of situation, processing non-ASCII text files, is becoming more and more common on every operating system. The upcoming "HTML5" standard even suggests using UTF-8 encodings, which will probably cause the number of BOM'ed text files to explode on every platform.

It is interesting to note that "linein" was used, which means that the author expected, and indicated, that a text file was being processed.
Now, an LF or CR-LF sequence is never part of the data read with "linein" (and it is appended automatically by lineout/say when writing textual data). So Rexx has always processed text files differently from binary files (for which charin/charout are available), which makes text files easy for the programmer to process. I can see that Rexx programmers might therefore expect automatic BOM handling on non-ASCII text files when they use "linein" or "lineout/say".

---rony