From: Colin P. A. <co...@co...> - 2007-07-26 15:21:39
>>>>> "Emmanuel" == Emmanuel Stapf [ES] <Emmanuel> writes:

    Emmanuel> Looks strange to me that the BOM is part of the
    Emmanuel> string. In my opinion it should not.

The Unicode Consortium decide these matters, not us.

    Emmanuel> well, you want to have a smart `make_from_utf16' that
    Emmanuel> uses the BOM to identify whether it is little endian or
    Emmanuel> big endian?

No, we already have that.

    Emmanuel> You are better off having `make_from_utf16
    Emmanuel> (a_content: STRING; a_bom: STRING)' (assuming your
    Emmanuel> make_from_utf16 takes a string).

No, we are not better off. There are three situations:

1) The byte-string that we receive from wherever is in Unicode
   Encoding Scheme UTF-16. In this case it may start with a BOM, or it
   may not. If the latter, then big-endian is assumed, as per the
   Unicode specification. If the former, then the BOM is used to
   determine the endianness, and then discarded.

   This is creation procedure make_from_utf16.

2) The byte-string that we receive from wherever is in Unicode
   Encoding Scheme UTF-16BE. In this case no BOM is permitted. If the
   first two bytes indicate a zero-width non-breaking space, then that
   will be the first character in the string - it is not treated as a
   BOM and is not discarded.

   This is creation procedure make_from_utf16be.

3) The byte-string that we receive from wherever is in Unicode
   Encoding Scheme UTF-16LE. In this case no BOM is permitted. If the
   first two bytes indicate a zero-width non-breaking space, then that
   will be the first character in the string - it is not treated as a
   BOM and is not discarded.

   This is creation procedure make_from_utf16le.

-- 
Colin Adams
Preston Lancashire
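
The three situations can be illustrated with a minimal sketch. It is written in Python rather than Eiffel so that it can be run as-is; the function names only mirror the creation procedures named above, and the bodies are an assumption about the behaviour described, not the actual library code. Python's generic "utf-16" codec is deliberately avoided, because without a BOM it assumes the platform's native byte order rather than big-endian.

```python
import codecs


def make_from_utf16(data: bytes) -> str:
    """Encoding scheme UTF-16: an optional BOM selects the byte order
    and is then discarded; with no BOM, big-endian is assumed."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")


def make_from_utf16be(data: bytes) -> str:
    """Encoding scheme UTF-16BE: no BOM is permitted, so a leading
    FE FF is kept as the character U+FEFF."""
    return data.decode("utf-16-be")


def make_from_utf16le(data: bytes) -> str:
    """Encoding scheme UTF-16LE: no BOM is permitted, so a leading
    FF FE is kept as the character U+FEFF."""
    return data.decode("utf-16-le")


if __name__ == "__main__":
    be_with_bom = bytes([0xFE, 0xFF, 0x00, 0x41])  # BOM + "A", big-endian
    le_with_bom = bytes([0xFF, 0xFE, 0x41, 0x00])  # BOM + "A", little-endian

    # Situation 1: the BOM is consumed, so only "A" remains.
    print(len(make_from_utf16(be_with_bom)))
    print(len(make_from_utf16(le_with_bom)))

    # Situations 2 and 3: the same bytes yield two characters,
    # U+FEFF followed by "A".
    print(len(make_from_utf16be(be_with_bom)))
    print(len(make_from_utf16le(le_with_bom)))
```

The same four bytes therefore give a one-character string through make_from_utf16 and a two-character string through make_from_utf16be or make_from_utf16le, which is why a separate a_bom argument is not needed.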