Re: [toolbox] Extended Attributes
From: Maarten S. <maa...@xs...> - 2005-07-06 20:42:13
On 5 Jul 2005, at 23:33, Maxwell, Adam R wrote:

> I think XeTeX accepts either UTF-16 or UTF-8.

I doubt it, I think it only eats UTF-8. Besides, UTF-16 is very recognisable when read as a sequence of bytes (a lot of zeros are interspersed in the sequence).

> Also, trying to read Latin 1 files in as UTF-8 will cause an error
> in Cocoa, and you get nothing back

Well, read it as a sequence of unsigned bytes, discard all values > 127, and start searching for the marker that indicates the encoding. If you encounter a lot of zeros, you probably have UTF-16. Once you've found the real encoding, close the file and re-open it with the now-known encoding.

> (although this is arguably better than the corruption you get when
> reading MacOSRoman as UTF-8 or something).

Again, just open the file after you've figured out the real encoding.

For reference, I looked up how XML solves the problem. After all, they store the encoding information in the file as well. I used XML in a Nutshell, third edition, by Harold and Means, published by O'Reilly. It says:

... Many web servers omit the charset parameter from the MIME media type. In this case, if the MIME media type is text/xml, then the document is assumed to be in the US-ASCII encoding. If the MIME type is application/xml, then the parser attempts to guess the character set by reading the first few bytes of the document.

(Explanation of why they consider MIME types in the first place omitted. However, guesswork seems to be part of the standard. Ouch.)

Every XML document should have an encoding declaration as part of its XML declaration. The encoding declaration tells the parser in which character set the document is written. It's used only when other metadata from outside the file is not available. For example, this XML declaration says that the document uses the Latin-1 character set, with the official name ISO-8859-1:

<?xml version="1.0" encoding="ISO-8859-1"?>

(end quote)

I think we could do worse than adapt a similar strategy. However, since TeX itself has no direct way of specifying processing instructions, this backup metadata should be encoded in a comment. The XML declaration starts right at the start of the document. Since all 8-bit encodings share the lower 7 bits with US-ASCII, and UTF-8 does so as well, this method actually works for figuring out the real encoding.

[massive snip]

> XeTeX might be an exception to this, as previously noted, so
> storing encoding in the file might be practicable for a shorter
> time than we realize.

I don't think you'll have trouble recognising UTF-16 from the file itself. And for the other 8-bit encodings it is possible to extract the encoding by starting from US-ASCII.

Maarten
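
As a rough sketch of the probing strategy described above (read the raw bytes, treat a run of zero bytes as UTF-16, otherwise scan the 7-bit-safe bytes for an encoding comment, then re-open the file with whatever was found), something like the following Python would do. The "%!TEX encoding =" marker name, the sample size, and the zero-byte threshold are illustrative assumptions, not anything settled in this thread.

import re

# Illustrative marker syntax only; the actual comment format was still under discussion.
MARKER = re.compile(rb"^%\s*!TEX\s+encoding\s*=\s*(\S+)", re.IGNORECASE | re.MULTILINE)

def sniff_encoding(path, default="utf-8"):
    """Guess the file's encoding from its first few bytes."""
    with open(path, "rb") as probe:
        head = probe.read(4096)

    # Lots of interspersed zero bytes are a strong hint of UTF-16.
    if head and head.count(0) > len(head) // 4:
        return "utf-16"

    # Discard everything above 127 and search what is left (the 7-bit,
    # US-ASCII-compatible part shared by Latin-1, MacOSRoman and UTF-8)
    # for the encoding comment.
    ascii_only = bytes(b for b in head if b < 128)
    found = MARKER.search(ascii_only)
    if found:
        return found.group(1).decode("ascii")

    return default

def read_tex(path):
    # Close the byte-level probe, then re-open with the now-known encoding.
    with open(path, "r", encoding=sniff_encoding(path)) as f:
        return f.read()

With a document that starts with "% !TEX encoding = ISO-8859-1", sniff_encoding returns "ISO-8859-1" and read_tex decodes the file as Latin-1; a file with no marker and no zero bytes falls back to the default, which mirrors the "start from US-ASCII, then commit to the real encoding" idea in the post.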