Re: [toolbox] Extended Attributes
From: Maarten S. <maa...@xs...> - 2005-07-06 20:42:13
On 5 Jul 2005, at 23:33, Maxwell, Adam R wrote:

> I think XeTeX accepts either UTF-16 or UTF-8.

I doubt it, I think it only eats UTF-8. Besides, UTF-16 is very recognisable when read as a sequence of bytes (a lot of zeros are interspersed in the sequence).

> Also, trying to read Latin 1 files in as UTF-8 will cause an error
> in Cocoa, and you get nothing back

Well, read it as a sequence of unsigned bytes, discard all values > 127, and start searching for the marker that indicates the encoding. If you encounter a lot of zeros, you probably have UTF-16. Once you've found the real encoding, close the file and re-open it with the now-known encoding.

> (although this is arguably better than the corruption you get when
> reading MacOSRoman as UTF-8 or something).

Again, just open the file after you've figured out the real encoding.

For reference, I looked up how XML solves the problem. After all, they store the encoding information in the file as well. I used XML in a Nutshell, third edition, by Harold and Means, published by O'Reilly. It says:

... Many web servers omit the charset parameter from the MIME media type. In this case, if the MIME media type is text/xml, then the document is assumed to be in the US-ASCII encoding. If the MIME type is application/xml, then the parser attempts to guess the character set by reading the first few bytes of the document.

(Explanation of why they consider MIME types in the first place omitted. However, guesswork seems to be part of the standard. Ouch.)

Every XML document should have an encoding declaration as part of its XML declaration. The encoding declaration tells the parser in which character set the document is written. It's used only when other metadata from outside the file is not available. For example, this XML declaration says that the document uses the Latin-1 character set, with the official name ISO-8859-1:

<?xml version="1.0" encoding="ISO-8859-1"?>

(end quote)

I think we could do worse than adapt a similar strategy. However, since TeX itself has no direct way of specifying processing instructions, this backup metadata should be encoded in a comment. The XML declaration starts right at the start of the document. Since all 8-bit encodings share the lower 7 bits with US-ASCII, and UTF-8 does so as well, this method actually works for figuring out the real encoding.

[massive snip]

> XeTeX might be an exception to this, as previously noted, so
> storing encoding in the file might be practicable for a shorter
> time than we realize.

I don't think you'll have trouble recognising UTF-16 from the file itself. And for the other 8-bit encodings it is possible to extract the encoding by starting from US-ASCII.

Maarten
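
As a rough sketch of the probing strategy described above (read the raw bytes, treat a run of zero bytes as UTF-16, otherwise scan the 7-bit-safe bytes for an encoding comment, then re-open the file with whatever was found), something like the following Python would do. The "%!TEX encoding =" marker name, the sample size, and the zero-byte threshold are illustrative assumptions, not anything settled in this thread.

import re

# Illustrative marker syntax only; the actual comment format was still under discussion.
MARKER = re.compile(rb"^%\s*!TEX\s+encoding\s*=\s*(\S+)", re.IGNORECASE | re.MULTILINE)

def sniff_encoding(path, default="utf-8"):
    """Guess the file's encoding from its first few bytes."""
    with open(path, "rb") as probe:
        head = probe.read(4096)

    # Lots of interspersed zero bytes are a strong hint of UTF-16.
    if head and head.count(0) > len(head) // 4:
        return "utf-16"

    # Discard everything above 127 and search what is left (the 7-bit,
    # US-ASCII-compatible part shared by Latin-1, MacOSRoman and UTF-8)
    # for the encoding comment.
    ascii_only = bytes(b for b in head if b < 128)
    found = MARKER.search(ascii_only)
    if found:
        return found.group(1).decode("ascii")

    return default

def read_tex(path):
    # Close the byte-level probe, then re-open with the now-known encoding.
    with open(path, "r", encoding=sniff_encoding(path)) as f:
        return f.read()

With a document that starts with "% !TEX encoding = ISO-8859-1", sniff_encoding returns "ISO-8859-1" and read_tex decodes the file as Latin-1; a file with no marker and no zero bytes falls back to the default, which mirrors the "start from US-ASCII, then commit to the real encoding" idea in the post.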