On 11/13/2012 02:10 AM, Javier Kohen wrote:

On Tue, Nov 13, 2012 at 2:44 AM, Joel Smith <jsf-lists.gtkpod@jk1.net> wrote:
On 11/11/2012 03:09 AM, Daniele Forsi wrote:
> I don't know that code but in this commit
> https://gitorious.org/~joelsmith/gtkpod/joelsmiths-sandbox/commit/6157a8c10646ff25c98651755c6dbbafa63170ba
> you are using g_strdup() while existing code uses charset_to_utf8()
> exisitng code:
> 182        // MP4 Title
> 183           value = find_atom_value(TITLE);
> 184           if (value) {
> 185               track->title = charset_to_utf8(value);
> 186               free(value);
> 187           }
> new code:
> 383        // MP4 Sort Title
> 384           value = find_atom_value(SORT_TITLE);
> 385           if (value) {
> 386               track->sort_title = g_strdup(value);
> 387               free(value);
> 388           }
> if that's ok can you please explain why with a comment in the code?

I think the existing code is wrong.  MP4 tags are already stored as
UTF-8 (or at least they're supposed to be).  And AP doesn't do any
further conversions before returning the strings to us, so we shouldn't
be trying to convert an already UTF-8 string from local charset to
UTF-8.  If run with a non-UTF-8 charset locale, that would corrupt the
data.  For example, with ISO-8859-1, all but the lower 128 char values
would get corrupted.  Funny that I didn't even notice the other ones
doing it wrong. Thanks for reviewing the patch so we could catch the
other bug. I'll commit a fix. I'm no expert on i18n, charsets, etc, so
maybe we'll end up reverting my change right away, but that's how I see it.

I'm no expert in i18n and Unicode either, but I have some experience. I can review the change if you'd like.

The more eyes, the the better.  Here's the patch:

My reasoning is that find_atom_value() calls APar_ExtractDataAtom() which for text types, simply seeks to an offset in the MP4 file and reads the atom value data into the return buffer, without doing any kind of charset conversion.  Since MP4 atom text values are supposed to be stored as UTF-8, we should just copy these UTF-8 strings directly, rather than calling charset_to_utf8() on them.