
Incorrect display of UTF-8 encoded ID3 tags

Help
2012-01-17
2016-04-21
  • agkistradon

    agkistradon - 2012-01-17

    I'm sorry if this is covered elsewhere; I did not find another thread about it after what I felt was pretty thorough searching.

    I'm running a ReadyNAS NV+ with minidlna version 1.0.22.  I have been storing MP3 files on this device.  I have created the files using UTF-8 encoding for the ID3 tags.  The tags look correct when I view the file properties in Gnome's Nautilus file browser (version 3.2.1).  They look fine in the music application on my Motorola Atrix running Gingerbread.  They look fine within Gnome's Rhythmbox music player.

    When served using minidlna, they do not display correctly.  For example, I have the album "Mer de Noms" by "A Perfect Circle".  Track 11 on that album is called "Breña" (note the UTF-8 character).  When viewed through the DLNA browser on the Atrix, on Rhythmbox's DLNA connection, or browsing using UPnP Inspector 0.2.2, the song is listed as "BreÃ±a".  Examining the MP3 file itself yields the correct two bytes for the special character.

    Is anyone else seeing this?  If it's something I can fix, I am happy to follow directions.  If it's a known problem with minidlna, I can look into fixing it, if necessary.  Just looking for a second set of eyes here.

    Thank you for your time.

     
  • Anonymous

    Anonymous - 2012-01-28

    A lot of programs (VLC for one) seem to assume that ID3 tags are either ASCII or UTF16LE. ISTR that this may even be some sort of Windows-driven standard.

    It's a bit of a pain, but I convert all the tags from UTF8 to UTF16LE when I make MP3s from FLACs, and every player I've used has displayed all the non-ASCII characters correctly.
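
    For the text conversion itself, here is a minimal sketch in C, assuming POSIX iconv is available and accepts the encoding names "UTF-8" and "UTF-16LE" (as glibc does).  The helper name utf8_to_utf16le is made up for illustration, and writing the converted text back into the ID3 frame is not shown.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert UTF-8 text to UTF-16LE with POSIX iconv.
   Returns the number of bytes written to out, or (size_t)-1 on failure
   (unsupported encoding, invalid input, or output buffer too small). */
static size_t utf8_to_utf16le(const char *in, size_t in_len,
                              char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;   /* iconv takes char **, the input is only read */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

    Each character in the Basic Multilingual Plane takes two bytes in UTF-16LE, so an output buffer of twice the input length is enough for such text.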

     
  • agkistradon

    agkistradon - 2012-01-29

    Thanks for the reply.

    I'm less worried about the player displaying the text incorrectly than I am about it being served incorrectly.  As it turns out, further research is pointing the problem back at the id3tag library (or possibly usage of said).  Allow me to explain…

    Checking out the very beginning of this particular MP3 file in a hex editor yields the following:

    49 44 33 03 00 00 00 00 08 49 54 49 54 32 00 00 00 07 00 00 00 42 72 65 C3 B1 61

    And the translation:

    ID3......ITIT2.......Bre..a

    So, pretty clearly, the tag is utf-8 encoded (B = 42, r = 72, e = 65, ñ = C3 B1, a = 61).  If you have a utf-8 capable terminal, you can do: echo -e "\xC3\xB1"  if you want to see that this is accurate.
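
    The layout of those bytes can be sketched in C.  This is only an illustration, assuming the ID3v2.3 layout (10-byte tag header, 10-byte frame header, then one text-encoding byte) and no extended header; the struct and function names are made up and are not part of minidlna or id3tag.  In ID3v2.3 the encoding byte is 0x00 for ISO-8859-1 and 0x01 for UTF-16 with a BOM, which may matter for how a tag library interprets the text bytes that follow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for illustration only. */
struct first_frame {
    unsigned       version_major; /* 3 for ID3v2.3 */
    uint32_t       tag_size;      /* synchsafe integer: 7 bits per byte */
    char           frame_id[5];   /* e.g. "TIT2", NUL-terminated here */
    uint32_t       frame_size;    /* encoding byte plus text bytes */
    uint8_t        encoding;      /* v2.3: 0x00 = ISO-8859-1, 0x01 = UTF-16 */
    const uint8_t *text;          /* raw text bytes, not NUL-terminated */
    size_t         text_len;
};

/* Parse the tag header and the first frame of an ID3v2.3 tag.
   Assumes no extended header. Returns 0 on success, -1 on error. */
static int parse_first_frame(const uint8_t *buf, size_t len,
                             struct first_frame *out)
{
    if (len < 21 || memcmp(buf, "ID3", 3) != 0)
        return -1;

    out->version_major = buf[3];
    out->tag_size = ((uint32_t)(buf[6] & 0x7f) << 21) |
                    ((uint32_t)(buf[7] & 0x7f) << 14) |
                    ((uint32_t)(buf[8] & 0x7f) << 7)  |
                     (uint32_t)(buf[9] & 0x7f);

    memcpy(out->frame_id, buf + 10, 4);
    out->frame_id[4] = '\0';
    out->frame_size = ((uint32_t)buf[14] << 24) | ((uint32_t)buf[15] << 16) |
                      ((uint32_t)buf[16] << 8)  |  (uint32_t)buf[17];

    /* Frame body starts at offset 20, after the two frame flag bytes. */
    if (out->frame_size < 1 || len - 20 < out->frame_size)
        return -1;

    out->encoding = buf[20];
    out->text     = buf + 21;
    out->text_len = out->frame_size - 1;
    return 0;
}
```

    Fed the 27 bytes above, this reports frame "TIT2" with a frame size of 7, an encoding byte of 0x00, and the six text bytes 42 72 65 C3 B1 61.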

    Anyway, so that pretty much rules out the MP3 file in my mind (it wasn't much of a question to begin with, but worth ruling out decisively).  Next thing to check is id3tag, which minidlna uses to extract ID3 tags.  At first, this was giving me fits, and I had to learn quite a lot more about how utf-8 is represented and how it's different from, say, wide characters (it is, quite).  The long and the short of it is that I had kind of a breakthrough this morning and wrote a bit of a test program to see what was going on.

    Within minidlna (tagutils/tagutils-mp3.c), there's a block that determines how the raw data exposed by id3tag will be converted into something usable.

    if(lang_index >= 0) 
        utf8_text = _get_utf8_text(native_text); // through iconv 
    else 
        utf8_text = (unsigned char*)id3_ucs4_utf8duplicate(native_text);
    

    lang_index is always ending up -2 in my environment, so that means the code will always roll over to id3_ucs4_utf8duplicate.  I performed a bunch of machinations on this data until I stepped back and realized I should just see what the 'utf-8' characters this thing was spitting out were.  Using some code like the following…

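    #include <stdio.h>
    #include <id3tag.h>

    /* Print each byte of a NUL-terminated id3tag UTF-8 string in hex. */
    void dump_utf8_text( const id3_utf8_t *pid3_utf8 ) { 
        unsigned int i = 0; 
        while ( pid3_utf8[i] ) { 
            /* No cast here, so bytes >= 0x80 print sign-extended
               (e.g. ffffffc3) when the element type is signed. */
            fprintf( stdout, "%u: %x\n", i, pid3_utf8[i] ); 
            ++i; 
        } 
    }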
    

    I got output like…

    0: 42
    1: 72
    2: 65
    3: ffffffc3
    4: ffffff83
    5: ffffffc2
    6: ffffffb1
    7: 61
    

    Clearly, that is wrong.  It suggests that the original string contains two multi-byte characters instead of one.  In fact, the two characters are the utf-8 encodings of the two bytes that make up the original multi-byte character, each treated as a character in its own right: C3 becomes C3 83, and B1 becomes C2 B1.  Confused yet?  I was, and understandably so.  It *looks* like the id3tag library is double-encoding the utf-8 string.
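
    The same byte pattern can be reproduced without id3tag.  A minimal sketch, assuming each input byte is read as a Latin-1 code point (U+0000 to U+00FF) and then encoded as utf-8; latin1_to_utf8 is a made-up helper for illustration, not a function from id3tag or minidlna.

```c
#include <stddef.h>

/* Hypothetical helper: treat every input byte as a Latin-1 code point
   and write its UTF-8 encoding. out needs room for 2 * len + 1 bytes.
   Returns the number of bytes written, not counting the terminator. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (in[i] < 0x80) {
            /* ASCII range: one byte, unchanged. */
            out[n++] = in[i];
        } else {
            /* U+0080..U+00FF: two bytes, 110000xx 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

    Running this over the six tag bytes 42 72 65 C3 B1 61 gives the eight bytes 42 72 65 C3 83 C2 B1 61, which is the sequence in the dump above.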

    I haven't had a chance to dig into other ways of dealing with the native id3tag data (ucs4), but I'll do that when time provides (unless someone else gets to it first).  For now, baby is screaming.

     
  • agkistradon

    agkistradon - 2012-01-29

    Without going into too much explanation, id3tag's id3_ucs4_utf8duplicate function does not appear to do the right thing here.  Switching to id3_ucs4_latin1duplicate preserves the utf-8 encoded string.  Alternatively, just copying each value of the ucs4 string into a char* would work as well.

    // The ucs4 characters
    0: 42
    1: 72
    2: 65
    3: c3
    4: b1
    5: 61

    // Converted to 'latin1' (which really isn't converting to latin1, it's converting to a char* string, which isn't inherently latin1 or anything else)
    0: 42
    1: 72
    2: 65
    3: ffffffc3
    4: ffffffb1
    5: 61
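
    The narrowing shown in the two dumps above can be sketched as follows.  This is a stand-in written for illustration, not the id3tag implementation: the ucs4_t typedef and the ucs4_narrow_dup name are hypothetical, and replacing values above 0xFF with '?' is an assumption of this sketch.

```c
#include <stdlib.h>

typedef unsigned long ucs4_t;   /* stand-in type for illustration */

/* Copy a 0-terminated UCS-4 string into a newly allocated byte string,
   one byte per value. Values above 0xFF cannot be represented and are
   replaced with '?'. The caller frees the result. Returns NULL if the
   allocation fails. */
static unsigned char *ucs4_narrow_dup(const ucs4_t *src)
{
    size_t len = 0;
    while (src[len])
        ++len;

    unsigned char *dst = malloc(len + 1);
    if (!dst)
        return NULL;

    for (size_t i = 0; i < len; ++i)
        dst[i] = src[i] <= 0xFF ? (unsigned char)src[i] : '?';
    dst[len] = '\0';
    return dst;
}
```

    Because each value is copied through unchanged, the C3 B1 pair survives as the original utf-8 sequence instead of being encoded a second time.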

    The short lesson is this: char* is not utf-8 or latin1.  It's just an array of bytes.  If someone is interested in allowing a string to be utf-8 in C, they should leave it as a char* and not do any interpretation of it.  As soon as something tries to decode the utf-8 string, then there's a risk of needing to reencode it.  Easier to just pass it around as char* and spit the raw chars at your clients, letting them know that it's utf-8 encoded.  This is why a lot of the C library functions just natively work with utf-8.  They don't know anything about encoding or display, only about char*s.

    Anyway, my work here is done.  It tests out correctly on the DLNA client on my phone, in Rhythmbox, and in the UPnP browser.  As soon as I figure out how to cross-compile for the ReadyNAS, this is a solved problem on my end.

     
  • raulfg3

    raulfg3 - 2016-04-19

    Same problem here.  Any news?

     
  • Shrimpkin

    Shrimpkin - 2016-04-21

    Are you trying to use minidlna version 1.0.22? Current version is 1.1.5

     
