From: Gregorio L. <g.l...@gm...> - 2024-12-17 23:44:21
Ok, I think the fundamental issue is that I didn't properly understand how the BOM actually works. I was assuming it behaved similarly to magic numbers/4CC or such (and thus I was under the impression that they were agreed-on but arbitrary sequences). If I now understand it correctly, it's always a single sequence that is not a visible character in either endianness, but the result of trying to interpret it one way or the other can fail in a specific and expected way, such that catching that provides a hint for which endianness to use, yes?

Is 0xEFBBBF the direct translation of 0xFEFF?

I just got the answer to my question from Wikipedia, it seems: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 says, "...or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there..."

Thanks again for helping me clear this up!

Gregorio Litenstein Goldzweig
Médico Cirujano • Fono: +56 9 96343643 • E-Mail: g.l...@gm...

On 17 Dec 2024 at 19:49 -0300, Steven R. Loomis <srl...@un...>, wrote:

> Thanks, Markus,
>
> I actually wonder if the issue is the other way… Gregorio, perhaps you are choosing a converter type that does not recognize the BOM, but you have input data that has a BOM. In that case ICU won’t automatically detect and strip it.
>
> Steven
>
> On Tue, Dec 17, 2024 at 4:31 PM, Markus Scherer <mar...@gm...> wrote:
>
> > On Tue, Dec 17, 2024 at 7:32 AM Gregorio Litenstein <g.l...@gm...> wrote:
> >
> > > While debugging some stuff related to text conversion, I have noticed that converting from UTF16 to UTF8 (via an intermediary `UnicodeString` and `toUTF8String`) results in a UTF8 string that starts with \xEF\xBB\xBF. Why is this BOM being appended to my string, and why does it only seem to happen when converting from UTF16?
> >
> > UnicodeString::toUTF8String() does not prepend the BOM.
> >
> > I just tried this:
> >
> >     UnicodeString s16(u"abcçカ🚴");
> >     std::string s8;
> >     s16.toUTF8String(s8);
> >     printf("s8.length=%d [%2x %2x %2x %2x ...] \"%s\"\n",
> >            (int)s8.length(),
> >            (uint8_t)s8[0], (uint8_t)s8[1], (uint8_t)s8[2], (uint8_t)s8[3],
> >            s8.c_str());
> >
> > As expected, this outputs 12 bytes, starting with 0x61 for 'a':
> >
> >     s8.length=12 [61 62 63 c3 ...] "abcçカ🚴"
> >
> > If you get a BOM (which in UTF-8 is really merely a "signature byte sequence") in the output from UnicodeString::toUTF8String(), then the UnicodeString starts with U+FEFF.
> >
> > Best regards,
> > markus
> >
> > --
> > You received this message because you are subscribed to the Google Groups "icu-support" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to icu...@un....
> > To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-support/CAN49p6pEGnnjWcZE1NHGCucAjG5zYh%3DAVtC2m9UWMzBGTtf9Jw%40mail.gmail.com.
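For anyone finding this thread later, the points above can be sketched in plain C++ (standard library only, no ICU). This is a rough illustration and not how ICU does it internally. The helper names `encodeUtf8`, `sniffUtf16Order` and `stripUtf8Signature` are made up for this example and are not ICU APIs. The sketch assumes the input code point is a valid scalar value, and that the caller already knows the byte stream is meant to be UTF-16.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8.
// Assumes cp is valid (not a surrogate, not above U+10FFFF).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

enum class Utf16Order { BigEndian, LittleEndian, Unknown };

// Hypothetical helper: inspect the first two bytes of a UTF-16 byte stream.
// U+FEFF serialized big-endian is FE FF; little-endian it is FF FE.
// Read with the wrong byte order, the unit would be U+FFFE, a noncharacter,
// which is why the mark doubles as a byte-order hint.
// Note: FF FE is also how a UTF-32LE BOM (FF FE 00 00) begins; this sketch
// does not try to tell the two apart.
Utf16Order sniffUtf16Order(const std::string& bytes) {
    if (bytes.size() < 2) return Utf16Order::Unknown;
    const auto b0 = static_cast<unsigned char>(bytes[0]);
    const auto b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::BigEndian;
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::LittleEndian;
    return Utf16Order::Unknown;
}

// Hypothetical helper: drop a leading UTF-8 signature (EF BB BF) if present.
// Returns true when something was removed.
bool stripUtf8Signature(std::string& s) {
    if (s.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        s.erase(0, 3);
        return true;
    }
    return false;
}
```

Here `encodeUtf8(0xFEFF)` produces the three bytes EF BB BF, so the UTF-8 "BOM" is just U+FEFF encoded like any other code point. That matches Markus's point: if those bytes show up in the output, U+FEFF was already at the start of the string. Whether to strip it afterwards is an application decision; `stripUtf8Signature` only shows one way it could be done.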