Re: [icu-support] UnicodeString::toUTF8String, why BOM?

On Tue, Dec 17, 2024 at 7:32 AM Gregorio Litenstein <g.l...@gm...>
wrote:

> While debugging some stuff related to text conversion, I have noticed that
> converting from UTF16 to UTF8 (via an intermediary `UnicodeString` and
> `toUTF8String`) results in a UTF8 string that starts with \xEF\xBB\xBF. Why
> is this BOM being appended to my string, and why does it only seem to
> happen when converting from UTF16?

UnicodeString::toUTF8String() does not prepend the BOM.

I just tried this:
    UnicodeString s16(u"abcçカ🚴");
    std::string s8;
    s16.toUTF8String(s8);
    printf("s8.length=%d [%2x %2x %2x %2x ...] \"%s\"\n",
           (int)s8.length(),
           (uint8_t)s8[0], (uint8_t)s8[1], (uint8_t)s8[2], (uint8_t)s8[3],
           s8.c_str());

As expected, this outputs 12 bytes (3 for "abc", 2 for ç, 3 for カ, 4 for
🚴), starting with 0x61 for 'a':
s8.length=12 [61 62 63 c3 ...] "abcçカ🚴"

If you get a BOM (which in UTF-8 is really merely a "signature byte
sequence") in the output from UnicodeString::toUTF8String(), then the
UnicodeString starts with U+FEFF.
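
One way to confirm this is to inspect the input before conversion. The following is a minimal sketch using only the C++ standard library, so it does not depend on ICU. The helper names (startsWithBom, stripLeadingBom, hasUtf8Signature) are hypothetical and are not ICU API. It assumes the UTF-16 text is available as a std::u16string before it is wrapped in a UnicodeString.

```cpp
#include <string>

// Hypothetical helper (not part of ICU): true if the UTF-16 text begins
// with U+FEFF, the code point that UTF-8 encodes as EF BB BF.
inline bool startsWithBom(const std::u16string& s) {
    return !s.empty() && s[0] == 0xFEFF;
}

// Hypothetical helper: returns a copy without a single leading U+FEFF.
// Text that does not start with U+FEFF is returned unchanged.
inline std::u16string stripLeadingBom(const std::u16string& s) {
    return startsWithBom(s) ? s.substr(1) : s;
}

// Hypothetical helper: true if the byte string begins with the UTF-8
// signature byte sequence EF BB BF.
inline bool hasUtf8Signature(const std::string& s) {
    return s.size() >= 3 &&
           static_cast<unsigned char>(s[0]) == 0xEF &&
           static_cast<unsigned char>(s[1]) == 0xBB &&
           static_cast<unsigned char>(s[2]) == 0xBF;
}
```

If startsWithBom() reports true for the source text, the U+FEFF was already present in the UTF-16 input, and the conversion simply carried it over. Stripping it first (or checking the first code unit of the UnicodeString itself) should give UTF-8 output without the signature.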

Best regards,
markus
