#90 reading unicode id3v2 tags in windows

open
None
5
2002-07-27
2002-07-27
Anonymous
No

Please help, I can't read Unicode tags from MP3 files
that have ID3v2 Unicode tags. When I call the
field->Get() function, id3lib truncates the Unicode values to
UTF-8.

Look:
ID3_Tag myTag(file);
ID3_Frame* myFrame=myTag.Find(ID3FID_TITLE);
ID3_Field* myField=myFrame->GetField(ID3FN_TEXT);

wchar_t wc[256];
myField->Get(wc,255);

for (UINT x=0; x<wcslen(wc); x++)
{
char car1= (char)wc[x];
char car2= (char)(wc[x]/256);
wc[x]=(car1*256+car2);
}

I am using Windows and VC6.

The wc string is composed of characters that are all
less than 256. None of the characters are unicode.

Even when I try the GetRawUnicodeText() function, the
string is not Unicode.

How can I read the unicode tag?

Discussion

  • T.H.F. Klok

    T.H.F. Klok - 2002-07-27

    Logged In: YES
    user_id=564388

    Could you please give me your email, or can you log in so I can
    contact you about this if needed?

     
  • T.H.F. Klok

    T.H.F. Klok - 2002-07-27
    • assigned_to: nobody --> t1mpy
     
  • Russ Tiller

    Russ Tiller - 2002-09-04

    Logged In: YES
    user_id=604743

    I am also having problems reading Unicode tags. If you update
    a tag using the properties window of Windows Explorer on
    WinXP, it stores Unicode strings in the V2 tag. It
    also updates the V1 tag with regular strings. ID3LIB (I am
    using ID3COM) returns a string of ?????? instead of the
    Unicode strings.

     
  • T.H.F. Klok

    T.H.F. Klok - 2002-09-04

    Logged In: YES
    user_id=564388

    Do you have the same problem with the CVS version of
    id3lib-devel?

     
  • Urs Fleisch

    Urs Fleisch - 2003-10-20

    Logged In: YES
    user_id=681516

    Unicode support does not work, for both reading and writing, on both
    Windows and Linux. I would expect the Unicode interface to be specified
    as follows:
    - Using the unicode_t methods, setting a unicode_t array element to e.g. 'A'
    shall result in the Unicode representation of 'A' (i.e. 0x0041), and setting it to a
    character in another page (e.g. 0x03b1) shall also be possible. Of course,
    reading back the characters shall yield the same code.
    - Using the char * methods and then changing the encoding to Unicode shall
    store the string as a correct Unicode string.
    - Unicode text must be stored in the correct format to be read by MP3 players
    and taggers supporting it. In binary representation this is e.g. ff fe 54 00 65
    00 73 00 74 00 for "Test", i.e. little endian 16 bit encoding is used.
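
    The byte layout expected in the last point can be illustrated with a
    minimal C++11 sketch. It does not use id3lib at all; the helper name
    encodeUtf16LeWithBom is hypothetical and only shows how code units map
    to the bytes ff fe 54 00 65 00 73 00 74 00 for "Test".

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of id3lib): serialize UTF-16 code units as
// little endian bytes, preceded by the little endian BOM "ff fe".
std::vector<std::uint8_t> encodeUtf16LeWithBom(const std::u16string& text)
{
    std::vector<std::uint8_t> out;
    out.reserve(2 + text.size() * 2);
    out.push_back(0xff);  // BOM, low byte of 0xFEFF first
    out.push_back(0xfe);
    for (char16_t unit : text)
    {
        // Least significant byte first, then most significant byte.
        out.push_back(static_cast<std::uint8_t>(unit & 0xff));
        out.push_back(static_cast<std::uint8_t>((unit >> 8) & 0xff));
    }
    return out;
}
```

    Calling encodeUtf16LeWithBom(u"Test") yields the ten bytes listed above,
    independent of the endianness of the machine running the code, because
    the bytes are produced by shifting and masking rather than by casting
    the array.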

    However the following bugs are present:
    - To create correct Unicode characters using the unicode_t methods, the
    unicode_t values have to be swapped (LSB and MSB), e.g. to set 'A', one
    has to assign 0x4100 to a unicode_t array element. In the other direction,
    you have to swap the bytes after reading unicode_t values.
    - unicode_t values with LSB >= 0x80 (e.g. 0x03b1, which has to be swapped
    to 0xb103) cannot be written, as the sign-extension will overwrite the MSB
    with 0xff (e.g. you read back 0xb1ff if you have set 0xb103).
    - Using the char * methods and then changing the encoding to Unicode
    stores the string in incorrect big endian order but with a little endian BOM
    and then a second big endian BOM (on Linux, using iconv). On Windows,
    using oldconv(), the text may be stored differently, but still incompatible with
    the unicode_t interface.
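
    The sign-extension problem from the second point can be reproduced in
    isolation. The sketch below is not id3lib code; the function names are
    hypothetical, and the byte type is forced to signed char because the
    signedness of plain char depends on the compiler.

```cpp
#include <cstdint>

// Combines two bytes the way the buggy expression does, with signed bytes.
// The shift is written as a multiplication to avoid shifting a negative value.
std::uint16_t combineSigned(signed char hi, signed char lo)
{
    // lo is promoted to int with sign extension: the bit pattern 0xb1
    // becomes 0xffffffb1, and its high bits overwrite the hi byte.
    return static_cast<std::uint16_t>((hi * 256) | lo);
}

// Same combination, but each byte is converted to unsigned char first,
// so no sign extension can take place.
std::uint16_t combineUnsigned(signed char hi, signed char lo)
{
    return static_cast<std::uint16_t>(
        (static_cast<unsigned char>(hi) << 8) | static_cast<unsigned char>(lo));
}
```

    With hi = 0x03 and lo = -79 (the signed char whose bit pattern is 0xb1),
    combineSigned gives 0xffb1, which matches the corrupted value described
    above, while combineUnsigned gives the intended 0x03b1.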

    I have posted a patch, which was tested on Linux, to the mailing list. The
    iconv - oldconv issue is not handled by that patch. That problem should be
    solved too, and it should be decided whether using iconv is useful at all,
    because at the moment it is not possible to create encodings other than
    isolatin and UTF16LE due to checks in the method interface. Thus the additional
    flexibility of iconv is not used. I repeat my posting and the patch below.

    ----
    While trying to support Unicode with my ID3 tagger (kid3.sourceforge.net,
    maybe you could add it to the list of applications using id3lib), I ran into
    the same problems as discussed on this list. After examining the problem, I
    realized that there is no way to work around it in an application, as some
    bugs simply make it impossible to get certain Unicode characters through.

    So I had a look at the sources and I think the problem is in io_helpers.cpp:

    // The string data can be generated from iconv (SetEncoding() -> convert())
    // or from an unicode_t array (Set(const unicode_t* data)). iconv generates
    // "fffe 5400 6500 7300 7400" for "Test" in UTF16, so this format has to
    // be used (Note: oldconvert() does not seem to be compatible as it picks out
    // only odd bytes, this should be fixed if oldconvert() is used).
    size_t io::writeUnicodeText(ID3_Writer& writer, String data, bool bom)
    {
    ID3_Writer::pos_type beg = writer.getCur();
    size_t size = (data.size() / 2) * 2;
    if (size == 0)
    {
    return 0;
    }
    if (bom)
    {
    // The BOM should not be written if there is already a BOM
    // in the data (as done by iconv)
    // on little endian architectures, the BOM is little endian,
    // but the data big endian (see below)
    // Write the BOM: 0xFEFF
    unicode_t BOM = 0xFEFF;
    writer.writeChars((const unsigned char*) &BOM, 2);
    // this loop should be outside of if (bom)
    for (size_t i = 0; i < size; i += 2)
    {
    // if data[i+1] >= 0x80, it is sign extended and kills data[i]
    // e.g. 03b1 => data={03 ffffffb1} -> ch=ffb1
    // this assumes big endian data, but UTF16 normally is little endian!
    // ID3_FieldImpl::Set(const unicode_t* data) constructs the string
    // with the raw unicode_t array => we get it back in the same way
    // i.e. here little endian, the code below swaps the bytes, i.e.
    // changes the endianness to big endian => different endianness for
    // BOM and data!
    unicode_t ch = (data[i] << 8) | data[i+1];
    writer.writeChars((const unsigned char*) &ch, 2);
    }
    }
    return writer.getCur() - beg;
    }

    ...

    // The unicode data is accessed in GetRawUnicodeText() by casting
    // the string data to a unicode_t array. Here the bytes are swapped
    // in the little endian case (bom == -1), so on little endian
    // architectures, the unicode_t array will have swapped bytes!
    // This code should be independent of the endianness.
    String io::readUnicodeString(ID3_Reader& reader)
    {
    String unicode;
    ID3_Reader::char_type ch1, ch2;
    if (!readTwoChars(reader, ch1, ch2) || isNull(ch1, ch2))
    {
    return unicode;
    }
    int bom = isBOM(ch1, ch2);
    if (!bom)
    {
    unicode += static_cast<char>(ch1);
    unicode += static_cast<char>(ch2);
    }
    while (!reader.atEnd())
    {
    if (!readTwoChars(reader, ch1, ch2) || isNull(ch1, ch2))
    {
    break;
    }
    if (bom == -1)
    {
    unicode += static_cast<char>(ch2);
    unicode += static_cast<char>(ch1);
    }
    else
    {
    unicode += static_cast<char>(ch1);
    unicode += static_cast<char>(ch2);
    }
    }
    return unicode;
    }

    // The same as above
    String io::readUnicodeText(ID3_Reader& reader, size_t len)
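
    The endianness-independent reading asked for in the comments above can
    be sketched outside of id3lib as follows. The helper decodeUtf16 is
    hypothetical, and treating data without a BOM as big endian is an
    assumption taken from the behaviour of the code above, not something
    checked against the ID3v2 specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of id3lib): decode UTF-16 bytes into code
// units using arithmetic only, so the result does not depend on the host
// byte order. Decoding stops at a null code unit or at the end of the data.
std::vector<std::uint16_t> decodeUtf16(const std::vector<std::uint8_t>& bytes)
{
    std::vector<std::uint16_t> units;
    std::size_t pos = 0;
    bool littleEndian = false;  // assumption: no BOM means big endian
    if (bytes.size() >= 2)
    {
        if (bytes[0] == 0xff && bytes[1] == 0xfe)
        {
            littleEndian = true;  // BOM ff fe: little endian data follows
            pos = 2;
        }
        else if (bytes[0] == 0xfe && bytes[1] == 0xff)
        {
            pos = 2;              // BOM fe ff: big endian data follows
        }
    }
    for (; pos + 1 < bytes.size(); pos += 2)
    {
        const unsigned first = bytes[pos];
        const unsigned second = bytes[pos + 1];
        const std::uint16_t unit = static_cast<std::uint16_t>(
            littleEndian ? (second << 8) | first : (first << 8) | second);
        if (unit == 0)
        {
            break;  // null terminator
        }
        units.push_back(unit);
    }
    return units;
}
```

    Because the bytes are unsigned and combined by shifting, a value such as
    0x03b1 survives in both byte orders, and the caller never has to swap
    bytes after reading.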

    Below you find a patch which fixes these three functions.
    I tested the patch on Linux (Intel), the Unicode tags work with players and
    taggers which support Unicode.

    diff -ru id3lib-3.8.3.orig/src/io_helpers.cpp id3lib-3.8.3/src/io_helpers.cpp
    --- id3lib-3.8.3.orig/src/io_helpers.cpp 2003-10-08 12:27:09.000000000 +0200
    +++ id3lib-3.8.3/src/io_helpers.cpp 2003-10-08 12:27:27.000000000 +0200
    @@ -124,8 +124,15 @@

    String io::readUnicodeString(ID3_Reader& reader)
    {
    + // The unicode data is accessed in GetRawUnicodeText() by casting
    + // the string data to an unicode_t array, so the bytes read are
    + // first used to calculate a unicode_t (respecting the endianness)
    + // and then put into the string so that the accessing the unicode_t
    + // array will give the correct values.
    String unicode;
    ID3_Reader::char_type ch1, ch2;
    + unicode_t uc;
    + const char *ucbytes = (const char *)&uc;
    if (!readTwoChars(reader, ch1, ch2) || isNull(ch1, ch2))
    {
    return unicode;
    @@ -133,8 +140,10 @@
    int bom = isBOM(ch1, ch2);
    if (!bom)
    {
    - unicode += static_cast<char>(ch1);
    - unicode += static_cast<char>(ch2);
    + // big endian
    + uc = ((ch1 << 8) & 0xff00) | (ch2 & 0xff);
    + unicode += ucbytes[0];
    + unicode += ucbytes[1];
    }
    while (!reader.atEnd())
    {
    @@ -144,14 +153,16 @@
    }
    if (bom == -1)
    {
    - unicode += static_cast<char>(ch2);
    - unicode += static_cast<char>(ch1);
    + // little endian
    + uc = ((ch2 << 8) & 0xff00) | (ch1 & 0x00ff);
    }
    else
    {
    - unicode += static_cast<char>(ch1);
    - unicode += static_cast<char>(ch2);
    + // big endian
    + uc = ((ch1 << 8) & 0xff00) | (ch2 & 0x00ff);
    }
    + unicode += ucbytes[0];
    + unicode += ucbytes[1];
    }
    return unicode;
    }
    @@ -160,6 +171,8 @@
    {
    String unicode;
    ID3_Reader::char_type ch1, ch2;
    + unicode_t uc;
    + const char *ucbytes = (const char *)&uc;
    if (!readTwoChars(reader, ch1, ch2))
    {
    return unicode;
    @@ -168,8 +181,10 @@
    int bom = isBOM(ch1, ch2);
    if (!bom)
    {
    - unicode += ch1;
    - unicode += ch2;
    + // big endian
    + uc = ((ch1 << 8) & 0xff00) | (ch2 & 0x00ff);
    + unicode += ucbytes[0];
    + unicode += ucbytes[1];
    unicode += readText(reader, len);
    }
    else if (bom == 1)
    @@ -184,8 +199,10 @@
    {
    break;
    }
    - unicode += ch2;
    - unicode += ch1;
    + // little endian
    + uc = ((ch2 << 8) & 0xff00) | (ch1 & 0x00ff);
    + unicode += ucbytes[0];
    + unicode += ucbytes[1];
    }
    }
    return unicode;
    @@ -358,16 +375,23 @@
    {
    return 0;
    }
    - if (bom)
    + // The string data can be generated from iconv (SetEncoding() -> convert())
    + // or from an unicode_t array (Set(const unicode_t* data)). iconv generates
    + // "fffe 5400 6500 7300 7400" for "Test" in UTF16, so this format has to
    + // be used (Note: oldconvert() does not seem to be compatible as it picks out
    + // only odd bytes).
    + if (bom && !isBOM(data[0], data[1]))
    {
    + // Only write the BOM if there is not already one in the data.
    + // iconv already writes a BOM.
    // Write the BOM: 0xFEFF
    unicode_t BOM = 0xFEFF;
    writer.writeChars((const unsigned char*) &BOM, 2);
    - for (size_t i = 0; i < size; i += 2)
    - {
    - unicode_t ch = (data[i] << 8) | data[i+1];
    - writer.writeChars((const unsigned char*) &ch, 2);
    - }
    + }
    + for (size_t i = 0; i < size; i++)
    + {
    + char ch = data[i];
    + writer.writeChars(&ch, 1);
    }
    return writer.getCur() - beg;
    }

    Regards,
    Urs Fleisch

    ----

     
