From: Gregorio L. <g.l...@gm...> - 2024-12-27 16:54:58
Gregorio Litenstein Goldzweig
Médico Cirujano
• Fono: +56 9 96343643
• E-Mail: g.l...@gm...
On 19 Dec 2024 at 23:04 -0300, Markus Scherer <mar...@gm...>, wrote:
> Minor code feedback --
>
> On Thu, Dec 19, 2024 at 5:35 PM Gregorio Litenstein <g.l...@gm...> wrote:
> > icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
> >     std::scoped_lock l(m_lock);
> >     icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
> >     if (m_error.isFailure())
> >         throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
> >     return ret;
> > }
>
> Misnomer: This converts to UTF-16, not to UTF-8.
You're right of course. I guess I didn't think about it when I wrote that, but I'm loath to change it now after several years. I'll probably add a comment though.
>
> > std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
> >     icu::UnicodeString ustring;
> >     std::string charset;
> >     if (assumeUTF8) charset = "UTF-8";
> >     else charset = UnicodeUtil::getCharset(str);
> >     if (charset != "UTF-8") {
> >         if (!_filename.empty()) {
> >             SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
> >         }
> >         ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
> >     }
> >     else {
> >         ustring = icu::UnicodeString::fromUTF8(str.data());
> >     }
>
> This line does not work if str contains NUL bytes. Just remove .data() -->
> else { ustring = icu::UnicodeString::fromUTF8(str); }
>
That was a deliberate (if possibly misguided) decision, made at a time when some of the platforms we were working with had versions of ICU before 65. Thanks for pointing it out though.
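
For anyone reading along, here is a minimal sketch of the truncation being described. It uses only the standard library rather than ICU, and the two helper names are made up for illustration. As I understand it, passing `str.data()` leaves the callee with just a `const char*`, so it has to find the length by scanning for the first NUL, much like `strlen()` does here:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper: how many bytes a callee can see when it is handed
// only a `const char*`. It must stop at the first NUL byte.
// Assumes the underlying buffer is NUL-terminated (true for std::string,
// not guaranteed for an arbitrary std::string_view).
std::size_t bytesSeenViaPointer(std::string_view sv) {
    return std::strlen(sv.data());
}

// Hypothetical helper: how many bytes a callee can see when it is handed
// pointer plus length, i.e. the whole view, embedded NULs included.
std::size_t bytesSeenViaView(std::string_view sv) {
    return sv.size();
}
```

With a five-byte input such as `std::string("ab\0cd", 5)`, the pointer-only path sees two bytes while the view path sees all five.
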
> > switch(toCase) { case CaseMapping::UPPER: ustring.toUpper(); break;
> ...
>
> Note that case mappings are language-sensitive. Calling these functions without a Locale parameter / locale ID string will use the machine's default locale. You will get different results if your machine is set to Turkish, Dutch, Greek, ...
>
This is very unlikely to actually come up in usage of our app, but thanks for pointing it out; it should be easy to correct.
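
To make the language sensitivity concrete, here is a toy sketch of the best-known case, the Turkish/Azerbaijani dotted i. This is not ICU code: the function name and the language-tag parameter are hypothetical, and it only handles ASCII letters. It is just meant to show why the same input can uppercase differently depending on the language in effect:

```cpp
#include <string_view>

// Toy, hypothetical illustration (not ICU): uppercase one ASCII letter
// under a given language tag. In Turkish ("tr") and Azerbaijani ("az"),
// 'i' uppercases to U+0130 (capital I with dot above) instead of 'I'.
char32_t upperAsciiForLanguage(char32_t c, std::string_view lang) {
    const bool turkic = (lang == "tr" || lang == "az");
    if (c == U'i' && turkic) {
        return char32_t{0x0130};
    }
    if (c >= U'a' && c <= U'z') {
        return c - (U'a' - U'A');
    }
    return c;  // everything else is left untouched in this sketch
}
```

So `upperAsciiForLanguage(U'i', "en")` gives `U'I'`, while the same call with `"tr"` gives U+0130. Pinning the language explicitly instead of relying on the machine default is what avoids the surprise.
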
> > It seems that the BOM only gets appended for a UTF16 source (I tried converting from ANSI as well as Shift-JIS).
>
> As discussed, UnicodeString::toUTF8String() does not add a BOM. It will convert one if there is one.
>
> > Considering for UTF8 the BOM is not encouraged, I would expect ICU to just remove the UTF16 BOM and not add a new one.
>
> On input, the "UTF-16" converter will detect and remove the BOM. The "UTF-16LE" and "UTF-16BE" converters will not remove the BOM. All according to the standard.
>
> Best regards,
> markus