Reading Unicode ID3v2 tags in Windows
Please help, I can't read Unicode tags from MP3 files that have ID3v2 Unicode tags. When I call the field->Get() function, id3lib truncates the Unicode values to UTF-8.
Look:
ID3_Tag myTag(file);
ID3_Frame* myFrame = myTag.Find(ID3FID_TITLE);
ID3_Field* myField = myFrame->GetField(ID3FN_TEXT);
wchar_t wc[256];
myField->Get(wc, 255);
// workaround: swap the two bytes of every character read back
for (UINT x = 0; x < wcslen(wc); x++)
{
  char car1 = (char)wc[x];
  char car2 = (char)(wc[x] / 256);
  wc[x] = (car1 * 256 + car2);
}
I am using Windows and VC6. The wc string comes back composed of characters that are all less than 256; none of them are Unicode. Even when I try the GetRawUnicodeText() function, the string is not Unicode.
How can I read the Unicode tag?
Logged In: YES
user_id=564388
Could you please give me your email address, or log in, so I can contact you about this if needed?
Logged In: YES
user_id=604743
I am also having problems reading Unicode tags. If you update a tag using the properties window in WinXP Windows Explorer, it stores Unicode strings in the V2 tag; it also updates the V1 tag with regular strings. id3lib (I am using ID3COM) returns a string of ?????? instead of the Unicode strings.
Logged In: YES
user_id=564388
Do you have the same problem with the CVS version of id3lib-devel?
Logged In: YES
user_id=681516
Unicode support does not work, for both reading and writing, on both Windows and Linux. I would expect the Unicode interface to behave like this (a short sketch of the expected usage follows the list below):
- Using the unicode_t methods, setting a unicode_t array element to e.g. 'A' shall result in the Unicode representation of 'A' (i.e. 0x0041), and setting it to a character in another page (e.g. 0x03b1) shall also be possible. Of course, reading back the characters shall yield the same codes.
- Using the char* methods and then changing the encoding to Unicode shall store the string as a correct Unicode string.
- Unicode text must be stored in the correct format to be read by MP3 players and taggers that support it. In binary representation this is e.g. ff fe 54 00 65 00 73 00 74 00 for "Test", i.e. little-endian 16-bit encoding is used.
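For illustration, here is a minimal sketch of what that expected usage would look like (this is not working code today; it assumes the ID3_Field calls of id3lib 3.8.x, i.e. ID3FN_TEXTENC, ID3FN_TEXT, ID3TE_UTF16 and Set(const unicode_t*), and "test.mp3" is just a placeholder file name):

#include <id3/tag.h>

void setUnicodeTitle()
{
  ID3_Tag tag("test.mp3");
  ID3_Frame frame(ID3FID_TITLE);
  // "Test" followed by a Greek alpha (U+03B1): plain code points,
  // no manual byte swapping should be necessary.
  unicode_t title[] = { 0x0054, 0x0065, 0x0073, 0x0074, 0x03b1, 0x0000 };
  frame.GetField(ID3FN_TEXTENC)->Set(ID3TE_UTF16);
  ID3_Field* text = frame.GetField(ID3FN_TEXT);
  text->SetEncoding(ID3TE_UTF16);
  text->Set(title);   // expected on disk: ff fe 54 00 65 00 73 00 74 00 b1 03
  tag.AddFrame(frame);
  tag.Update();
}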
However, the following bugs are present (a workaround sketch follows the list):
- To create correct Unicode characters using the unicode_t methods, the unicode_t values have to be byte-swapped (LSB and MSB exchanged), e.g. to set 'A', one has to assign 0x4100 to a unicode_t array element. In the other direction, you have to swap the bytes after reading unicode_t values.
- unicode_t values with an LSB >= 0x80 (e.g. 0x03b1, which has to be swapped to 0xb103) cannot be written, because sign extension overwrites the MSB with 0xff (e.g. you read back 0xb1ff if you have set 0xb103).
- Using the char* methods and then changing the encoding to Unicode stores the string in incorrect big-endian order, but with a little-endian BOM followed by a second, big-endian BOM (on Linux, using iconv). On Windows, using oldconv(), the text may be stored differently, but it is still incompatible with the unicode_t interface.
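As a stopgap until this is fixed inside the library, an application can do the byte swapping itself. A minimal sketch (swap_unicode is a hypothetical helper; text is an ID3_Field* obtained as in the first post, and Get(unicode_t*, size_t) is the call described above); note that it still cannot round-trip code points whose low byte is 0x80 or higher, because of the sign-extension bug:

// Swap LSB and MSB of every element of a null-terminated unicode_t array.
static void swap_unicode(unicode_t* s)
{
  for (; *s != 0; ++s)
    *s = (unicode_t)(((*s & 0x00ff) << 8) | ((*s & 0xff00) >> 8));
}

unicode_t buf[256] = { 0 };
text->Get(buf, 255);   // values come back with LSB and MSB reversed
swap_unicode(buf);     // restore the real code points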
I have posted a patch, tested on Linux, to the mailing list. The iconv vs. oldconv issue is not handled by that patch; it should be solved too, including a decision on whether using iconv is worthwhile at all: at the moment it is not possible to create encodings other than isolatin (ISO-8859-1) and UTF16LE because of checks in the method interface, so the additional flexibility of iconv is not used. I repeat my posting and the patch below.
----
While trying to support Unicode in my ID3 tagger (kid3.sourceforge.net, maybe you could add it to the list of applications using id3lib), I ran into the same problems as discussed on this list. After examining the problem, I realized that there is no way to work around it in an application, as some of the bugs simply make it impossible to get certain Unicode characters through.
So I had a look at the sources and I think the problem is in io_helpers.cpp:
// The string data can be generated from iconv (SetEncoding() -> convert())
// or from an unicode_t array (Set(const unicode_t* data)). iconv generates
// "fffe 5400 6500 7300 7400" for "Test" in UTF16, so this format has to
// be used (Note: oldconvert() does not seem to be compatible as it picks out
// only odd bytes, this should be fixed if oldconvert() is used).
size_t io::writeUnicodeText(ID3_Writer& writer, String data, bool bom)
{
  ID3_Writer::pos_type beg = writer.getCur();
  size_t size = (data.size() / 2) * 2;
  if (size == 0)
  {
    return 0;
  }
  if (bom)
  {
    // The BOM should not be written if there is already a BOM
    // in the data (as done by iconv).
    // On little endian architectures, the BOM is little endian,
    // but the data big endian (see below).
    // Write the BOM: 0xFEFF
    unicode_t BOM = 0xFEFF;
    writer.writeChars((const unsigned char*) &BOM, 2);
    // this loop should be outside of if (bom)
    for (size_t i = 0; i < size; i += 2)
    {
      // if data[i+1] >= 0x80, it is sign extended and kills data[i],
      // e.g. 03b1 => data={03 ffffffb1} -> ch=ffb1
      // this assumes big endian data, but UTF16 normally is little endian!
      // ID3_FieldImpl::Set(const unicode_t* data) constructs the string
      // with the raw unicode_t array => we get it back in the same way,
      // i.e. here little endian; the code below swaps the bytes, i.e.
      // changes the endianness to big endian => different endianness for
      // BOM and data!
      unicode_t ch = (data[i] << 8) | data[i+1];
      writer.writeChars((const unsigned char*) &ch, 2);
    }
  }
  return writer.getCur() - beg;
}
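To make the sign extension concrete, here is a small standalone snippet (not id3lib code; unicode_t16 stands in for id3lib's unicode_t, and char is assumed to be signed, as on x86) that reproduces what happens to 0x03b1 in the loop above and shows the masking needed to avoid it:

#include <cstdio>
#include <string>

typedef unsigned short unicode_t16;  // stand-in for id3lib's unicode_t

int main()
{
  std::string data("\x03\xb1", 2);  // U+03B1 stored as two bytes, big endian
  // data[1] is a plain (signed) char, so 0xb1 is sign-extended to 0xffffffb1
  // and the OR wipes out the high byte:
  unicode_t16 bad = (data[0] << 8) | data[1];                         // 0xffb1
  // masking both bytes keeps the value intact:
  unicode_t16 good = ((data[0] << 8) & 0xff00) | (data[1] & 0x00ff);  // 0x03b1
  std::printf("bad=%04x good=%04x\n", bad, good);
  return 0;
}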
...
// The unicode data is accessed in GetRawUnicodeText() by casting
// the string data to a unicode_t array. Here the bytes are swapped
// in the little endian case (bom == -1), so on little endian
// architectures, the unicode_t array will have swapped bytes!
// This code should be independent of the endianness.
String io::readUnicodeString(ID3_Reader& reader)
{
  String unicode;
  ID3_Reader::char_type ch1, ch2;
  if (!readTwoChars(reader, ch1, ch2) || isNull(ch1, ch2))
  {
    return unicode;
  }
  int bom = isBOM(ch1, ch2);
  if (!bom)
  {
    unicode += static_cast<char>(ch1);
    unicode += static_cast<char>(ch2);
  }
  while (!reader.atEnd())
  {
    if (!readTwoChars(reader, ch1, ch2) || isNull(ch1, ch2))
    {
      break;
    }
    if (bom == -1)
    {
      unicode += static_cast<char>(ch2);
      unicode += static_cast<char>(ch1);
    }
    else
    {
      unicode += static_cast<char>(ch1);
      unicode += static_cast<char>(ch2);
    }
  }
  return unicode;
}
// The same as above
String io::readUnicodeText(ID3_Reader& reader, size_t len)
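The byte-order consequence for GetRawUnicodeText() can be shown with another standalone snippet (again not id3lib code; unicode_t16 stands in for id3lib's unicode_t): reinterpreting the stored bytes as a unicode_t array only yields the right values if they are in the machine's native byte order, which is what the patch below ensures:

#include <cstdio>
#include <cstring>

typedef unsigned short unicode_t16;  // stand-in for id3lib's unicode_t

int main()
{
  const char native[2]    = { 0x54, 0x00 };  // 'T' in little-endian (x86) order
  const char bigendian[2] = { 0x00, 0x54 };  // 'T' as the old code stored it

  unicode_t16 from_native, from_big;
  std::memcpy(&from_native, native, 2);
  std::memcpy(&from_big, bigendian, 2);
  // On a little-endian machine: from_native == 0x0054, from_big == 0x5400.
  std::printf("native=%04x big=%04x\n", from_native, from_big);
  return 0;
}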
Below you will find a patch which fixes these three functions. I tested the patch on Linux (Intel); the Unicode tags work with players and taggers that support Unicode.
diff -ru id3lib-3.8.3.orig/src/io_helpers.cpp id3lib-3.8.3/src/io_helpers.cpp
--- id3lib-3.8.3.orig/src/io_helpers.cpp 2003-10-08 12:27:09.000000000 +0200
+++ id3lib-3.8.3/src/io_helpers.cpp 2003-10-08 12:27:27.000000000 +0200
@@ -124,8 +124,15 @@
String io::readUnicodeString(ID3_Reader& reader)
{
+ // The unicode data is accessed in GetRawUnicodeText() by casting
+ // the string data to an unicode_t array, so the bytes read are
+ // first used to calculate a unicode_t (respecting the endianness)
+ // and then put into the string so that accessing the unicode_t
+ // array will give the correct values.
String unicode;
ID3_Reader::char_type ch1, ch2;
+ unicode_t uc;
+ const char *ucbytes = (const char *)&uc;
if (!readTwoChars(reader, ch1, ch2) || isNull(ch1, ch2))
{
return unicode;
@@ -133,8 +140,10 @@
int bom = isBOM(ch1, ch2);
if (!bom)
{
- unicode += static_cast<char>(ch1);
- unicode += static_cast<char>(ch2);
+ // big endian
+ uc = ((ch1 << 8) & 0xff00) | (ch2 & 0xff);
+ unicode += ucbytes[0];
+ unicode += ucbytes[1];
}
while (!reader.atEnd())
{
@@ -144,14 +153,16 @@
}
if (bom == -1)
{
- unicode += static_cast<char>(ch2);
- unicode += static_cast<char>(ch1);
+ // little endian
+ uc = ((ch2 << 8) & 0xff00) | (ch1 & 0x00ff);
}
else
{
- unicode += static_cast<char>(ch1);
- unicode += static_cast<char>(ch2);
+ // big endian
+ uc = ((ch1 << 8) & 0xff00) | (ch2 & 0x00ff);
}
+ unicode += ucbytes[0];
+ unicode += ucbytes[1];
}
return unicode;
}
@@ -160,6 +171,8 @@
{
String unicode;
ID3_Reader::char_type ch1, ch2;
+ unicode_t uc;
+ const char *ucbytes = (const char *)&uc;
if (!readTwoChars(reader, ch1, ch2))
{
return unicode;
@@ -168,8 +181,10 @@
int bom = isBOM(ch1, ch2);
if (!bom)
{
- unicode += ch1;
- unicode += ch2;
+ // big endian
+ uc = ((ch1 << 8) & 0xff00) | (ch2 & 0x00ff);
+ unicode += ucbytes[0];
+ unicode += ucbytes[1];
unicode += readText(reader, len);
}
else if (bom == 1)
@@ -184,8 +199,10 @@
{
break;
}
- unicode += ch2;
- unicode += ch1;
+ // little endian
+ uc = ((ch2 << 8) & 0xff00) | (ch1 & 0x00ff);
+ unicode += ucbytes[0];
+ unicode += ucbytes[1];
}
}
return unicode;
@@ -358,16 +375,23 @@
{
return 0;
}
- if (bom)
+ // The string data can be generated from iconv (SetEncoding() -> convert())
+ // or from an unicode_t array (Set(const unicode_t* data)). iconv generates
+ // "fffe 5400 6500 7300 7400" for "Test" in UTF16, so this format has to
+ // be used (Note: oldconvert() does not seem to be compatible as it picks out
+ // only odd bytes).
+ if (bom && !isBOM(data[0], data[1]))
{
+ // Only write the BOM if there is not already one in the data.
+ // iconv already writes a BOM.
// Write the BOM: 0xFEFF
unicode_t BOM = 0xFEFF;
writer.writeChars((const unsigned char*) &BOM, 2);
- for (size_t i = 0; i < size; i += 2)
- {
- unicode_t ch = (data[i] << 8) | data[i+1];
- writer.writeChars((const unsigned char*) &ch, 2);
- }
+ }
+ for (size_t i = 0; i < size; i++)
+ {
+ char ch = data[i];
+ writer.writeChars(&ch, 1);
}
return writer.getCur() - beg;
}
Regards,
Urs Fleisch
----