Re: [icu-support] UnicodeString::toUTF8String, why BOM?


Minor code feedback --

On Thu, Dec 19, 2024 at 5:35 PM Gregorio Litenstein <g.l...@gm...>
wrote:

> icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
> std::scoped_lock l(m_lock);
> icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()),
> m_converter.get(), m_error);
> if (m_error.isFailure()) throw std::runtime_error("Couldn't convert
> string: " + std::string(sv) + " to UTF-8. Error: " +
> std::to_string(m_error.get()) + ": " + m_error.errorName());
> return ret;
> }
>

Misnomer: This converts to UTF-16, not to UTF-8.

> std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string
> _filename, CaseMapping toCase, bool assumeUTF8) {
> icu::UnicodeString ustring;
> std::string charset;
> if (assumeUTF8) charset = "UTF-8";
> else charset = UnicodeUtil::getCharset(str);
> if (charset != "UTF-8") {
> if (!_filename.empty()) {
> SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8.
> Detected encoding={}", _filename, charset);
> }
> ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
> }
> else { ustring = icu::UnicodeString::fromUTF8(str.data()); }
>

This line does not work if str contains NUL bytes. Just remove .data() -->
else { ustring = icu::UnicodeString::fromUTF8(str); }
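
A minimal standard-library sketch of why the .data() version loses everything after the first NUL. It assumes the receiving side has to measure a bare const char* the way strlen does; std::string_view stands in here for the string-piece parameter of fromUTF8, and both helper names are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Length seen when only the raw pointer is handed over: the receiver must
// scan for a terminator, so it stops at the first NUL byte. (It would also
// read past the end if the original view were not NUL-terminated.)
inline std::size_t lengthViaPointer(std::string_view sv) {
    return std::string_view(sv.data()).size();
}

// Length seen when the view itself is handed over: the explicit size
// travels with the pointer, so embedded NUL bytes are preserved.
inline std::size_t lengthViaView(std::string_view sv) {
    return sv.size();
}
```

For a five-byte input such as std::string_view("ab\0cd", 5), the first helper reports 2 and the second reports 5.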

> switch(toCase) {
> case CaseMapping::UPPER:
> ustring.toUpper();
> break;
>
...

Note that case mappings are language-sensitive. Calling these functions
without a Locale parameter / locale ID string will use the machine's
default locale. You will get different results if your machine is set to
Turkish, Dutch, Greek, ...

> It seems that the BOM only gets appended for a UTF16 source (I tried
> converting from ANSI as well as Shift-JIS).
>

As discussed, UnicodeString::toUTF8String() does not add a BOM. If the
string already contains one (U+FEFF), it is converted to UTF-8 like any
other character.

> Considering for UTF8 the BOM is not encouraged, I would expect ICU to just
> remove the UTF16 BOM and not add a new one.
>

On input, the "UTF-16" converter will detect and remove the BOM. The
"UTF-16LE" and "UTF-16BE" converters will not remove the BOM. All according
to the standard.
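
The difference between the three charset names can be sketched without ICU. The decoder below is a hand-rolled illustration, not ICU's converter code: the enum and function names are hypothetical, and the big-endian fallback when no BOM is present is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins for the charset names "UTF-16", "UTF-16LE" and "UTF-16BE".
enum class Utf16Variant { Detect, LittleEndian, BigEndian };

// Decode raw bytes into UTF-16 code units. Only the Detect variant looks
// for a BOM and consumes it; the explicit-endian variants keep a leading
// U+FEFF as an ordinary character. A trailing odd byte is ignored.
inline std::u16string decodeUtf16(const std::vector<std::uint8_t>& bytes,
                                  Utf16Variant variant) {
    std::size_t pos = 0;
    bool littleEndian = (variant == Utf16Variant::LittleEndian);
    if (variant == Utf16Variant::Detect && bytes.size() >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            littleEndian = true;   // BOM FF FE: little-endian, skip it
            pos = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            littleEndian = false;  // BOM FE FF: big-endian, skip it
            pos = 2;
        }
        // No BOM: this sketch falls back to big-endian (an assumption).
    }
    std::u16string out;
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned lo = littleEndian ? bytes[pos] : bytes[pos + 1];
        const unsigned hi = littleEndian ? bytes[pos + 1] : bytes[pos];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Feeding the bytes FF FE 41 00 through Detect yields just "A", while LittleEndian yields U+FEFF followed by "A". A U+FEFF that survives this way is later written to UTF-8 as the bytes EF BB BF, which is what looks like an "added" BOM in the output.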

Best regards,
markus

-- 
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-support/CAN49p6p-PbnwnY9z%2BdKaw9tWaFBpRNHrJioiep_NUT2VH5jbcA%40mail.gmail.com.
