problem with UTF-8 encoding in ID3-tags

Help
airflow
2010-05-26
2013-05-30
  • airflow
    2010-05-26

    Hi all!

    Since creating my own import.js (which I'm happy with), I've noticed problems with some special characters in tags.

    I use UTF-8 wherever possible, including in the ID3 tags. The config.xml of mediatomb lets you specify this encoding, so I did that. In most cases everything works, but for some files some characters are just missing.

    First I suspected a problem with taglib (because when I open the file with VLC the tags are shown correctly), but when I tried investigating the issue deeper, I found that if I change the virtual-layout-type from "js" back to "builtin", mediatomb extracts the correct strings again - for the same file! So I assume it can't be an issue of taglib any more, as "builtin" also makes use of tag-information.

    Either it's a bug in mediatomb, or a bug in my import.js script. Is the latter possible? I just assign variables. Do I have to take special care of UTF-8 strings in Javascript?

    My import.js is here: http://fp.ath.cx/dls/import.js
    A sample MP3 which has the problem is here: http://fp.ath.cx/dls/Sigur%20R%c3%b3s%20%5bHlemmur%5d%2006%20-%20Hlemmur%202.mp3

    Kind Regards,
    airflow

     
  • Jin
    2010-05-27

    Well, it's possible that we have a bug somewhere, normally you do not have to do anything special, just make sure that filesystem and metadata charsets are set correctly.

    I'll try to reproduce it with your test files, thanks for uploading (got them).

    Kind regards,
    Jin

     
  • airflow
    2010-06-16

    Hi, have you had a chance to take a look at the issue yet? If there's anything I can do (test something), let me know.

    Kind Regards,
    airflow

     
  • Jin
    2010-06-19

    No sorry, did not have time to do anything on MT lately.

     

  • Anonymous
    2010-08-18

    Same problem here

    In "js" mode, if I change anything about the object title, it gets truncated at the first special character.

    I tried using a function to re-encode the string, like this:
    obj.title = encode_utf8('Ep ' + paddedEpisode + leftover);

    but without any more success. The function is:

    function encode_utf8( s )
    {
      return unescape( encodeURIComponent( s ) );
    }

    According to the log, everything should be in UTF-8, the same as my locale:
    INFO: Setting filesystem import charset to UTF-8
    INFO: Setting metadata import charset to UTF-8
    INFO: Setting playlist charset to UTF-8
    INFO: Configuration check succeeded.

    For example, take the file: Pokémon 1x01 Le Depart.avi
    I want to create a Series/Season/Episode layout.
    For this file the log shows:
    ERROR: iconv: /Séries/PokXmon/S01 could not be converted
    2010-08-18 08:10:07   ERROR: iconv: Chaîne multi-octets ou étendue de caractère

    So in my browser I just see Pok/S01/Pokémon 1x01 Le Depart.avi
    The final file name is kept untouched (with the accent) if I don't use obj.title = 'Ep ' + paddedEpisode + leftover;

    To sum up:
    I can create chains from hardcoded strings with accents, like "Séries".
    I can't create any object from an existing file with an accent in its name.

     

  • Anonymous
    2010-08-18

    Solved…
    I was using a function to fix the case of the title, and that was what messed everything up:

    function toProperCase(s)
    {
      return s.toLowerCase().replace(/^(.)|\s(.)/g,
              function($1) { return $1.toUpperCase(); });
    }

    I just removed it and now it's fine!
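
A possible explanation, offered as an assumption and not confirmed anywhere in this thread: if the script engine hands the title to JavaScript as raw UTF-8 bytes, one character per byte, then toLowerCase() sees the lead byte 0xC3 of "é" as the letter "Ã" and turns it into "ã" (0xE3), which no longer starts a valid two-byte sequence. The sketch below models that in a standalone runtime such as Node.js; asByteString and isValidUtf8ByteString are hypothetical helpers, not MediaTomb functions.

```javascript
// Standalone sketch (e.g. Node.js), not part of import.js.

// Hypothetical helper: represent a string as its UTF-8 bytes,
// one character per byte (the ASSUMED in-engine representation).
function asByteString(s) {
  return unescape(encodeURIComponent(s));
}

// Hypothetical helper: true if the byte string decodes as valid UTF-8.
function isValidUtf8ByteString(s) {
  try {
    decodeURIComponent(escape(s));
    return true;
  } catch (e) {
    return false;
  }
}

// The function from the post, unchanged in behaviour.
function toProperCase(s) {
  return s.toLowerCase().replace(/^(.)|\s(.)/g,
    function (m) { return m.toUpperCase(); });
}

var title = asByteString("Pok\u00e9mon");   // bytes 50 6f 6b c3 a9 6d 6f 6e
var cased = toProperCase(title);            // 0xc3 is lowercased to 0xe3

console.log(isValidUtf8ByteString(title));  // true
console.log(isValidUtf8ByteString(cased));  // false
```

If that assumption holds, it would fit the iconv error in the earlier post, and it would explain why hardcoded strings that never pass through the case function stay intact.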

     
  • airflow
    2010-11-17

    2010-08-18 17:11:07 CEST
    Solved… I used a function to fix the case of the title and that was it which mess up every thing :

    I don't quite get it. Why is it necessary to fix the case of the title? How can it mess up anything? You write, you "just remove it" - what are you removing, the special character?

    Jin, does this help you in locating the real problem? Why is it not possible to have accents/umlauts when using "js"-layoutmode?

    I will incorporate any workaround in my import-script when necessary. I just don't get the problem/solution yet.

    Thanks,
    airflow

     
  • Jin
    2010-11-21

    I also do not see how changing the case should change anything. You could try using the *2i functions, maybe that will help? http://mediatomb.cc/pages/scripting#id2809739

     
  • airflow
    2010-12-02

    OK, I spent some time on this problem over the last few days.

    I could narrow the problem down to the following facts:

    I don't think it has anything to do with "js" vs. "builtin". It's rather a problem of using the "native" tags (already gathered by mediatomb automatically) vs. the additional tags you have to request by setting the aux-data parameters.

    Example:
    In config.xml, I set to gather the following tags:

            <auxdata>
              <add-data tag="TPE1"/>
              <add-data tag="TPE2"/>
            </auxdata>
    

    One of these is the track artist, the other the album artist. I used some MP3s for testing which have those tags filled with the value "Björk". I used different versions with different tag formats (ID3v2.4 UTF-16, ID3v2.4 UTF-8, ID3v2.3 UTF-16).

    I added a debugging-section in the import.js which looks like this:

    print ('DEBUGOUTPUT (M_ARTIST):' + obj.meta[M_ARTIST]);
    print ('DEBUGOUTPUT (TPE1):' + obj.aux['TPE1']);
    print ('DEBUGOUTPUT (TPE2):' + obj.aux['TPE2']);
    

    The output looks like this:

    2010-12-02 15:44:17      JS: DEBUGOUTPUT (M_ARTIST):Björk
    2010-12-02 15:44:17      JS: DEBUGOUTPUT (TPE1):Björ
    2010-12-02 15:44:17      JS: DEBUGOUTPUT (TPE2):Björ
    

    So there seems to be some problem when gathering the metadata via the "auxiliary data" feature of mediatomb. I tried virtually all combinations of tag encodings and encoding settings in mediatomb (my goal is to use UTF-8 everywhere). The output is always the same, as if the setting were ignored. I double-checked the encoding in the tags with hex editors. The tags are set correctly and consistently.

    Any input is welcome about how to troubleshoot this problem further.
    Kind Regards,
    airflow
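
One way to narrow this down further, sketched under the assumption that print() keeps working in import.js as in the debugging section above: dump the character codes of each value instead of the string itself. charCodes is a hypothetical helper name.

```javascript
// Hypothetical helper: list the code of every character in hex.
function charCodes(s) {
  var out = [];
  for (var i = 0; i < s.length; i++) {
    out.push(s.charCodeAt(i).toString(16));
  }
  return out.join(' ');
}

// Standalone usage; inside import.js one would pass e.g. obj.aux['TPE1']
// to charCodes() and print() the result.
console.log(charCodes("Bj\u00f6rk"));   // 42 6a f6 72 6b
```

Comparing the dump for obj.meta[M_ARTIST] with the one for obj.aux['TPE1'] should show whether the aux value arrives with a different representation of "ö" or simply with its last character missing.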

     
  • airflow
    2010-12-05

    I continued playing around with mediatomb and the way the different encoding-options work.

    1) As I wrote in the previous post, auxiliary data is truncated if it contains special characters (although the encoding and the corresponding setting in config.xml is right).

    I want to add something to that. I found that the truncation depends on how many special characters are contained in the string.

    E.g. "Björk" -> "Björ" (1 character truncated)
    but "José Gonzáles" -> "José Gonzál" (2 characters truncated)

    UTF-8 uses an additional byte for a "special" character like an umlaut (two bytes instead of one). To me it looks like there is a bug in the code which mixes up the number of "human-readable" characters in a string (5 for "Björk") and the number of bytes actually needed to store it in UTF-8 (6 in that example).

    For the record: I used the newest SVN-version in combination with taglib.

    Kind Regards,
    airflow
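
The hypothesis above can be checked against the two examples with a small standalone sketch. It assumes a modern JavaScript runtime such as Node.js (TextEncoder and TextDecoder are not assumed to exist in mediatomb's script engine), and truncateToCharCount is a hypothetical model of the suspected bug, not mediatomb code.

```javascript
// Standalone sketch, not meant to be pasted into import.js.

// Number of bytes needed to store s in UTF-8.
function utf8ByteLength(s) {
  return new TextEncoder().encode(s).length;
}

// Hypothetical model of the suspected bug: the character count is used
// as the byte count, so the tail of the UTF-8 data is cut off.
// (s.length counts UTF-16 code units, which equals the character count
// for the strings used here.)
function truncateToCharCount(s) {
  var bytes = new TextEncoder().encode(s);
  return new TextDecoder().decode(bytes.slice(0, s.length));
}

console.log("Bj\u00f6rk".length, utf8ByteLength("Bj\u00f6rk"));  // 5 6
console.log(truncateToCharCount("Bj\u00f6rk"));                   // Björ
console.log(truncateToCharCount("Jos\u00e9 Gonz\u00e1les"));      // José Gonzál
```

Both results match the truncated values reported above, which is consistent with a length mix-up but does not prove where in the code it happens.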

     
  • John Glasson
    2011-12-05

    I'm new to MediaTomb (having just ditched Twonky Server because it can't handle multi-value tags - I can achieve this with MediaTomb scripting).

    But I use the TCOM tag, which requires me to use auxdata, and I find that composers with non-ASCII characters, e.g. Dvorák, get truncated.

    Has anyone come up with a fix or work-around?  I see that the last post was a year ago.

    regards,

    John Glasson

     
  • John Glasson
    2011-12-08

    I've now got around the character encoding problem by rewriting the problem tags using HTML entity values. So for example, Dvorák becomes Dvor&aacute;k

    John Glasson

     
  • John Glasson
    2011-12-09

    A further refinement: I've added a function to my import.js which replaces the HTML entity values (see previous posting) with the corresponding two-byte UTF-8 sequences. For this to work correctly, I now write the HTML entity values in the form &#nnn; the function identifies this pattern and substitutes the correct bytes. Code as follows:

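    function HTMLcodes(str) {
      // Matches a three-digit numeric entity such as &#225;
      var s = /&#(\d{3});/;
      var results = str.match(s);

      while (results != null) {
        // results[0] is the whole entity, results[1] its three digits
        var code = parseInt(results[1], 10);
        // Two-byte UTF-8 sequence for code points 192-255:
        // lead byte 195 (0xC3), then code - 195 + 131 (= code - 64)
        var newchr = String.fromCharCode(195) + String.fromCharCode(code - 195 + 131);
        str = str.replace(results[0], newchr);
        results = str.match(s);
      }
      return str;
    }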
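
The arithmetic in HTMLcodes only fits code points 192 to 255, where the UTF-8 lead byte is always 195. A possible generalisation, sketched under the same assumption that the result should hold the UTF-8 bytes one string character per byte, reuses the unescape(encodeURIComponent(...)) trick from the encode_utf8 function earlier in this thread. decodeNumericEntities is a hypothetical name, and the sketch is only meant for code points in the Basic Multilingual Plane outside the surrogate range.

```javascript
// Sketch of a generalisation; not tested inside mediatomb.
function decodeNumericEntities(str) {
  return str.replace(/&#(\d+);/g, function (match, digits) {
    var ch = String.fromCharCode(parseInt(digits, 10));
    // encodeURIComponent yields %XX escapes of the UTF-8 bytes;
    // unescape turns each %XX into a single character.
    return unescape(encodeURIComponent(ch));
  });
}

// Same result as HTMLcodes for &#225; (characters 195 and 161):
var out = decodeNumericEntities("Dvor&#225;k");
console.log(out.charCodeAt(4), out.charCodeAt(5));   // 195 161
```

With this variant a tag could also carry entities outside that range, for example &#345; for "ř", which HTMLcodes would convert incorrectly.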

     
  • gamicoulas
    2011-12-31

    It might be worth checking the following bug report:

    Artifact ID: 3127955
    Summary: extraction of UTF-8 encoded aux-data leads to truncation