From: Javier K. <jk...@us...> - 2007-08-15 20:33:27
On Wed, 15-08-2007 at 22:17 +0200, Christophe Fergeau wrote:
> 2007/8/15, Javier Kohen <jk...@us...>:
> > There is a bug in DB artwork writer, where it assumes that for a given
> > string S, with S8 being its UTF-8 representation and S16 its UTF-16LE
> > representation, 2 * len(S8) = len(S16). Unfortunately this very nice
> > property only holds for those characters that can be represented with
> > one byte in UTF-8, that is, the ASCII range.
>
> It holds for all characters in the BMP which contains characters for
> almost all modern languages, but yeah, this property is not true for
> all UTF-16 characters so the code shouldn't have made that assumption.

Indeed? Then either Python got it wrong or I speak an ancient language
(i.e. Spanish):

    >>> s = u'mamá'; (len(s.encode('utf-8')), len(s.encode('utf-16le')))
    (5, 8)
    >>> s = u'á'; (len(s.encode('utf-8')), len(s.encode('utf-16le')))
    (2, 2)

Really, UTF-16 is not just a UTF-8 string with NUL bytes interleaved; it
is a different encoding (based on the same ideas, but different
nonetheless).

> > This patch fixes the code to do the proper thing, i.e., use the output
> > size returned by the conversion routine. It also avoids copying all the
> > data twice: memcpy first, then sptr = swap(sptr)?! I reconstructed my
> > whole artwork database with no problem after applying this patch. I
> > found this thanks to Valgrind.
>
> I think the G_BIG_ENDIAN case is missing the len = strlen (string);
> you removed (in libgpod-mhod-utf16.diff)

You are absolutely right, sorry about that omission! I wonder why GCC
doesn't issue a warning, even when I specify a high optimization setting.

Any comments about the padding never being 0, but sometimes being 4? Is
that expected? I see that's a common idiom in this library.

Cheers,
-- 
Javier Kohen <jk...@us...>
ICQ: blashyrkh #2361802
Jabber: jk...@ja...
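
P.S. For anyone who wants to repeat the length comparison from the message above on more than two strings, here is a minimal Python 3 sketch. The helper name `encoded_lengths` and the sample strings are only illustrative and have nothing to do with the library's code. It just prints the UTF-8 and UTF-16LE byte counts side by side and whether the "UTF-16 is twice UTF-8" assumption happens to hold for each sample.

```python
def encoded_lengths(s):
    """Return (UTF-8 byte count, UTF-16LE byte count) for the string s."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


# Illustrative samples: plain ASCII, an accented Latin letter,
# two CJK characters from the BMP, and one character outside the BMP.
samples = {
    "ASCII only": "mama",
    "accented Latin": "mam\u00e1",
    "CJK (BMP)": "\u65e5\u672c",
    "outside the BMP": "\U0001d11e",
}

for label, text in samples.items():
    u8, u16 = encoded_lengths(text)
    # The doubling rule holds only when every character is one UTF-8 byte.
    print(f"{label}: utf-8={u8} utf-16le={u16} doubled={2 * u8 == u16}")
```

Only the ASCII sample should report `doubled=True`, which is why the output size should come from the conversion routine rather than from the input length.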