Re: [Gtkpod-devel] Faulty UTF-8 to UTF-16 conversion

On Wed, 15-08-2007 at 22:17 +0200, Christophe Fergeau wrote:
> 2007/8/15, Javier Kohen <jk...@us...>:
> > There is a bug in DB artwork writer, where it assumes that for a given
> > string S, with S8 being its UTF-8 representation and S16 its UTF-16LE
> > representation, 2 * len(S8) = len(S16). Unfortunately this very nice
> > property only holds for those characters that can be represented with
> > one byte in UTF-8, that is, the ASCII range.
>
> It holds for all characters in the BMP which contains characters for
> almost all modern languages, but yeah, this property is not true for
> all UTF-16 characters so the code shouldn't have made that assumption.

Indeed? Then either Python got it wrong or I speak an ancient language
(i.e. Spanish):
>>> s = u'mamá'; (len(s.encode('utf-8')), len(s.encode('utf-16le')))
(5, 8)
>>> s = u'á'; (len(s.encode('utf-8')), len(s.encode('utf-16le')))
(2, 2)

Really, UTF-16 is not just a UTF-8 string with NUL bytes interleaved;
it's actually a different encoding (based on the same ideas, but
different nonetheless).
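
To make the point concrete, here is a small Python 3 sketch that compares the two encoded lengths for a few kinds of text. The helper name `encoded_lengths` and the sample strings are just illustrations, not anything taken from libgpod.

```python
def encoded_lengths(text):
    """Return (UTF-8 byte count, UTF-16LE byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Sample strings chosen for illustration only.
samples = {
    "ASCII": "mama",              # 1 byte per char in UTF-8, 2 in UTF-16
    "accented Latin": "mam\u00e1",  # the accented letter takes 2 bytes in both
    "CJK (BMP)": "\u65e5\u672c",  # 3 bytes per char in UTF-8, 2 in UTF-16
    "outside BMP": "\U0001D11E",  # 4 bytes in UTF-8, a 4-byte surrogate pair in UTF-16
}

for label, text in samples.items():
    u8, u16 = encoded_lengths(text)
    print(f"{label}: utf-8={u8} utf-16le={u16} doubling holds: {2 * u8 == u16}")
```

Only the pure-ASCII sample satisfies 2 * len(S8) = len(S16); every other sample breaks it.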

> > This patch fixes the code to do the proper thing, i.e., use the output
> > size returned by the conversion routine. It also avoids copying all the
> > data twice: memcpy first, then sptr = swap(sptr)?! I reconstructed my
> > whole artwork database with no problem after applying this patch. I
> > found this thanks to Valgrind.
>
> I think the G_BIG_ENDIAN case is missing the len = strlen (string);
> you removed (in libgpod-mhod-utf16.diff)

You are absolutely right, sorry about that omission! I wonder why GCC
won't issue a warning, even if I specify a high optimization setting.
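
The idea behind the fix, taking the length from the conversion output instead of deriving it from the UTF-8 length, can be sketched in Python 3 as follows. The record layout (a 4-byte little-endian byte count followed by the data) and the name `pack_utf16le_string` are hypothetical and only serve the illustration; this is not the actual MHOD layout or the libgpod code.

```python
import struct


def pack_utf16le_string(text):
    """Build a hypothetical record: 4-byte little-endian byte count, then UTF-16LE data."""
    data = text.encode("utf-16-le")
    # The length comes from the converted data itself,
    # never from 2 * len(text.encode("utf-8")).
    return struct.pack("<I", len(data)) + data


record = pack_utf16le_string("mam\u00e1")
wrong_guess = 2 * len("mam\u00e1".encode("utf-8"))
print(len(record) - 4, wrong_guess)
```

For this string the real payload is 8 bytes, while the doubled UTF-8 length would claim 10, so a writer using the guess would read or write past the end of the converted buffer.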

Any comments about the padding never being 0, but sometimes being 4? Is
that expected? I see that's a common idiom in this library.
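
One way a padding of 4 but never 0 can arise is sketched below in Python 3. This is purely an assumption about the shape of the idiom, not checked against the libgpod source; both function names are made up for the example.

```python
def padding_never_zero(length, align=4):
    # Hypothetical form of the idiom: gives `align` (not 0)
    # when `length` is already a multiple of `align`.
    return align - (length % align)


def padding_minimal(length, align=4):
    # Variant that gives 0 when `length` is already aligned.
    return (align - (length % align)) % align


for n in (5, 6, 7, 8):
    print(n, padding_never_zero(n), padding_minimal(n))
```

If the library uses something like the first form, a padding of 4 on already-aligned lengths would be the expected outcome rather than a bug, although it costs four extra bytes per record.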

Cheers,
-- 
Javier Kohen <jk...@us...>
ICQ: blashyrkh #2361802
Jabber: jk...@ja...