MediaTomb / Discussion / Help: problem with UTF-8 encoding in ID3-tags

airflow - 2010-05-26

Hi all!

Since I've created my own import.js (besides being happy) I also noticed that I had some problems with some special characters in tags.

I use UTF-8 wherever possible, also in the ID3-tags. The config.xml of mediatomb offers to specify this encoding, so I did that. In most cases everything works, but for some files some characters are just missing.

First I suspected a problem with taglib (because when I open the file with VLC the tags are shown correctly), but when I tried investigating the issue deeper, I found that if I change the virtual-layout-type from "js" back to "builtin", mediatomb extracts the correct strings again - for the same file! So I assume it can't be an issue of taglib any more, as "builtin" also makes use of tag-information.

Either it's a bug in mediatomb, or a bug in my import.js script. Is the latter possible? I just assign variables. Do I have to take special care of UTF-8 strings in Javascript?

My import.js is here: http://fp.ath.cx/dls/import.js
A sample MP3 which has the problem is here: http://fp.ath.cx/dls/Sigur%20R%c3%b3s%20%5bHlemmur%5d%2006%20-%20Hlemmur%202.mp3

Kind Regards,
airflow


Jin - 2010-05-27

Well, it's possible that we have a bug somewhere. Normally you do not have to do anything special; just make sure that the filesystem and metadata charsets are set correctly.

I'll try to reproduce it with your test files, thanks for uploading (got them).

Kind regards,
Jin


airflow - 2010-06-16

Hi, did you have a chance yet to take a look at the issue? If there's anything I can do (e.g. test something), let me know.

Kind Regards,
airflow


Jin - 2010-06-19

No sorry, did not have time to do anything on MT lately.


Anonymous - 2010-08-18

Same problem here

In "js" mode, if I change anything about the object title, it gets truncated at the first special character.

I tried to use a function to re-encode the string, like:
obj.title = encode_utf8('Ep ' + paddedEpisode + leftover);

but with no more success.

function encode_utf8(s)
{
    return unescape(encodeURIComponent(s));
}

According to the log, everything should be in UTF-8, like my locale:
INFO: Setting filesystem import charset to UTF-8
INFO: Setting metadata import charset to UTF-8
INFO: Setting playlist charset to UTF-8
INFO: Configuration check succeeded.

For example, take the file: Pokémon 1x01 Le Depart.avi
I want to create a Series/Season/Episode layout.
For this file the log shows:
ERROR: iconv: /Séries/PokXmon/S01 could not be converted
2010-08-18 08:10:07 ERROR: iconv: Chaîne multi-octets ou étendue de caractère

So in my browser I just see Pok/S01/Pokémon 1x01 Le Depart.avi
The final file name is kept untouched (with the accent) if I don't set obj.title = 'Ep ' + paddedEpisode + leftover;

To sum up:
I can create chains from hardcoded strings with accents, like "Séries".
I can't create any object from an existing file name with accents.


Anonymous - 2010-08-18

Solved…
I used a function to fix the case of the title, and that was what messed everything up:

function toProperCase(s)
{
    return s.toLowerCase().replace(/^(.)|\s(.)/g,
        function($1) { return $1.toUpperCase(); });
}

I just removed it and now it's fine!
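
One possible explanation, which nobody in this thread confirmed: if the script engine hands file names to import.js as raw UTF-8 bytes, one character per byte, then toLowerCase() sees the lead byte 0xC3 of "é" as the letter "Ã" and turns it into "ã" (0xE3), which no longer starts a valid two-byte UTF-8 sequence. The sketch below only simulates that assumption in plain JavaScript (run outside MediaTomb, so it uses console.log instead of print); bytesAsChars and isValidUtf8 are hypothetical helper names, not MediaTomb functions.

```javascript
// Assumption: the engine exposes each UTF-8 byte as one character.
// bytesAsChars() simulates that view of a string.
function bytesAsChars(s) {
    return unescape(encodeURIComponent(s));
}

// Returns true if the byte-per-character string is still valid UTF-8.
function isValidUtf8(byteString) {
    try {
        decodeURIComponent(escape(byteString));
        return true;
    } catch (e) {
        return false; // URIError: malformed UTF-8 sequence
    }
}

var raw = bytesAsChars("Pok\u00e9mon"); // "é" becomes the bytes 0xC3 0xA9
var lowered = raw.toLowerCase();        // 0xC3 ("Ã") becomes 0xE3 ("ã")

console.log("valid before toLowerCase: " + isValidUtf8(raw));
console.log("valid after toLowerCase:  " + isValidUtf8(lowered));
```

If this is what happens, it would also fit the iconv error in the log above, since the lowered string is no longer convertible from UTF-8.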


airflow - 2010-11-17

2010-08-18 17:11:07 CEST
Solved… I used a function to fix the case of the title, and that was what messed everything up:

I don't quite get it. Why is it necessary to fix the case of the title? How can it mess up anything? You write, you "just remove it" - what are you removing, the special character?

Jin, does this help you in locating the real problem? Why is it not possible to have accents/umlauts when using "js"-layoutmode?

I will incorporate any workaround in my import-script when necessary. I just don't get the problem/solution yet.

Thanks,
airflow


Jin - 2010-11-21

I also do not see how changing the case should change anything. You could try using the *2i functions, maybe that will help? http://mediatomb.cc/pages/scripting#id2809739


airflow - 2010-12-02

OK, I spent some time on this problem over the last few days.

I could narrow the problem down to the following facts:

I don't think it has anything to do with "js" vs. "builtin". It's rather a problem of using the "native" tags (already gathered by mediatomb automatically) vs. the additional tags you have to request by setting the aux-data parameters.

Example:
In config.xml, I set to gather the following tags:

<auxdata>
    <add-data tag="TPE1"/>
    <add-data tag="TPE2"/>
</auxdata>

One of these is the track-artist, the other the album-artist. I used some MP3s for testing, which have those tags filled with the value "Björk". I used different versions with different tag-formats (id3 2.4 UTF-16, id3 2.4 UTF-8, id3 2.3 - UTF-16).

I added a debugging-section in the import.js which looks like this:

print('DEBUGOUTPUT (M_ARTIST):' + obj.meta[M_ARTIST]);
print('DEBUGOUTPUT (TPE1):' + obj.aux['TPE1']);
print('DEBUGOUTPUT (TPE2):' + obj.aux['TPE2']);

The output looks like this:

2010-12-02 15:44:17 JS: DEBUGOUTPUT (M_ARTIST):Björk
2010-12-02 15:44:17 JS: DEBUGOUTPUT (TPE1):Björ
2010-12-02 15:44:17 JS: DEBUGOUTPUT (TPE2):Björ

So there seems to be a problem when gathering the metadata via the "auxiliary data" feature of mediatomb. I tried virtually all combinations of tag encodings and encoding settings in mediatomb (my goal is to use UTF-8 everywhere). The output is always the same, as if the setting were ignored. I double-checked the encoding of the tags with hex editors. The tags are set correctly and consistently.

Any input is welcome about how to troubleshoot this problem further.
Kind Regards,
airflow

airflow - 2010-12-05

I continued playing around with mediatomb and the way the different encoding-options work.

1) As I wrote in the previous post, auxiliary data is truncated if it contains special characters (although the encoding and the corresponding setting in config.xml is right).

I want to add something to that. I found that the truncation depends on how many special characters are contained in the string.

E.g. "Björk" -> "Björ" (1 character truncated)
but "José Gonzáles" -> "José Gonzál" (2 characters truncated)

UTF-8 uses an additional byte for a "special" character like an umlaut. To me it looks like there is a bug in the code that confuses the number of human-readable characters in a string (5 for "Björk") with the number of bytes actually needed to store it in UTF-8 (6 in that example).
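
To illustrate the suspected mix-up, here is a minimal sketch in plain JavaScript. It assumes a runtime that provides TextEncoder and TextDecoder (for example a current Node.js, not MediaTomb's embedded engine), and truncateToCharCount is a hypothetical name. It does not reproduce MediaTomb's actual code; it only cuts the UTF-8 byte buffer at the character count, which is the kind of error described above.

```javascript
// Simulates the suspected bug: the UTF-8 buffer is cut at the number of
// characters instead of the number of bytes.
function truncateToCharCount(s) {
    var bytes = new TextEncoder().encode(s);   // UTF-8 bytes of the string
    var cut = bytes.slice(0, s.length);        // wrong length: characters, not bytes
    return new TextDecoder("utf-8").decode(cut);
}

var bjork = "Bj\u00f6rk";              // Björk: 5 characters, 6 bytes
var jose = "Jos\u00e9 Gonz\u00e1les";  // José Gonzáles: 13 characters, 15 bytes

console.log(truncateToCharCount(bjork)); // Björ
console.log(truncateToCharCount(jose));  // José Gonzál
```

Each two-byte character costs one character at the end of the string, which matches the truncation seen in both examples.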

For the record: I used the newest SVN-version in combination with taglib.

Kind Regards,
airflow


John Glasson - 2011-12-05

I'm new to MediaTomb (having just ditched Twonky Server because it can't handle multi-value tags - I can achieve this with MediaTomb scripting).

But I use the TCOM tag which requires me to use auxdata, and I find that composers with non-ascii characters - eg Dvorák - get truncated.

Has anyone come up with a fix or work-around? I see that the last post was a year ago.

regards,

John Glasson


John Glasson - 2011-12-08

I've now got around the character encoding problem by rewriting the problem tags using HTML entity values. So for example, Dvorák becomes Dvorák

John Glasson


John Glasson - 2011-12-09

A further refinement: I've added a function to my import.js which replaces the HTML entity values (see previous posting) with the 2-byte Unicode (UTF-8) values. For this to work correctly, I now add the HTML entity values in the form &#nnn; this pattern is identified by the function, and the correct byte pair is substituted. Code as follows:

function HTMLcodes(str) {
    // Match a three-digit numeric entity such as &#225;
    var s = /&#(\d{3});/;
    var results = str.match(s);

    while (results != null) {
        // results[0] is the whole entity, results[1] the three digits
        var code = parseInt(results[1], 10);
        // Emit the two UTF-8 bytes (195, code - 64) as separate characters.
        // This only covers code points 192 to 255.
        var newchr = String.fromCharCode(195) + String.fromCharCode(code - 195 + 131);
        str = str.replace(results[0], newchr);
        results = str.match(s);
    }
    return str;
}
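
A slightly more general variant of the same idea, as a sketch only: it assumes, like the function above, that the target engine treats each character of the result as one raw byte, and it handles any decimal entity whose code point fits into two UTF-8 bytes (128 to 2047). The name entitiesToUtf8Bytes is hypothetical and not part of MediaTomb.

```javascript
// Hypothetical helper: replace every decimal entity &#N; for code points
// 128..2047 with its two UTF-8 bytes, each stored as one character.
function entitiesToUtf8Bytes(str) {
    return str.replace(/&#(\d+);/g, function (match, digits) {
        var code = parseInt(digits, 10);
        if (code < 128 || code > 2047) {
            return match; // outside the two-byte range: leave untouched
        }
        var lead = 0xC0 | (code >> 6);    // 110xxxxx
        var trail = 0x80 | (code & 0x3F); // 10xxxxxx
        return String.fromCharCode(lead) + String.fromCharCode(trail);
    });
}

// &#225; is "á"; its UTF-8 bytes are 0xC3 0xA1.
var converted = entitiesToUtf8Bytes("Dvor&#225;k");
console.log(converted.charCodeAt(4).toString(16), converted.charCodeAt(5).toString(16));
```

For code points 192 to 255 this produces the same byte pair as the 195 / (code - 64) arithmetic in HTMLcodes.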


gamicoulas - 2011-12-31

It might be worth checking the following bug report:

Artifact ID: 3127955
Summary: extraction of UTF-8 encoded aux-data leads to truncation


David - 2016-04-21

Just want to add that this does seem to fix the issue. I had the same problem with Björk, but the fix at https://sourceforge.net/p/mediatomb/bugs/78/ seems to have fixed it, i.e. obj.aux['TPE2'], obj.aux['TIT2'], etc. are now returning the full value.
  
