#2 XML Encoding

Status: closed-accepted
Labels: None
Priority: 5
Updated: 2008-03-22
Created: 2007-10-23
Private: No

Since we are German, characters like ü, ä, ö can appear in our COLLADA documents. The current revision of the libxml plug-in doesn't set the encoding attribute in the XML header. I changed that, but then the file was corrupted because the strings are interpreted incorrectly by the function xmlTextWriterWriteString. To fix that, I added a conversion via isolat1ToUTF8 when the preprocessor symbol WIN32 is defined.

I don't know if this is the right way to handle this, but for Windows it works.
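
A minimal sketch of the change described above, assuming a small helper around the write call (the helper name and buffer handling are illustrative, not the exact patch):

-- code --
#include <libxml/xmlwriter.h>
#include <libxml/encoding.h>
#include <cstring>
#include <vector>

// Write 'text' through xmlTextWriter. libxml expects UTF-8 input, so on
// Windows (where the DOM's strings are assumed to be Latin-1) convert first.
static int writeText(xmlTextWriterPtr writer, const char* text)
{
#ifdef WIN32
    int inLen = (int)strlen(text);
    // UTF-8 needs at most 2 bytes per Latin-1 character.
    std::vector<unsigned char> utf8(inLen * 2 + 1);
    int outLen = (int)utf8.size();
    if (isolat1ToUTF8(&utf8[0], &outLen, (const unsigned char*)text, &inLen) < 0)
        return -1;
    utf8[outLen] = '\0';
    return xmlTextWriterWriteString(writer, &utf8[0]);
#else
    return xmlTextWriterWriteString(writer, (const xmlChar*)text);
#endif
}
-- code --

The encoding attribute itself is set when starting the document, e.g. with xmlTextWriterStartDocument(writer, NULL, "UTF-8", NULL).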

Discussion

  • Anonymous - 2007-10-23

    A patch for daeLIBXMLPlugin.cpp

     
  • Anonymous - 2007-10-24
    • assigned_to: nobody --> steve314
     
  • Steven Thomas

    Steven Thomas - 2007-10-24

    Hi Kai, thanks for the patch! I don't have time to test it quite yet, but I do have a few questions for you:

    (a) We recently stripped iconv from the version of libxml that we ship with the DOM, due to license concerns. Will the isolat1ToUTF8 function still work without iconv?
    (b) Do we need to call an analogous function when loading a document?
    (c) The DOM also runs on Mac and Linux, so the WIN32 check shouldn't be necessary. I'll test it on Linux/Mac to make sure it works alright.

    I should be able to test your patch sometime in the next week or so. Thanks for the contribution!

    Steve

     
  • Steven Thomas

    Steven Thomas - 2007-10-25

    Kai, can you post a sample Collada file using the ü,ä,ö characters so I can test?

     
  • Steven Thomas

    Steven Thomas - 2007-10-25

    Never mind about the test file.

    I did a little testing, and this is what I've noticed.

    (1) If I use the DOM in its current form and round-trip a file, characters like ü,ä,ö get round-tripped correctly.
    (2) If I add your code that uses isolat1ToUTF8, those characters no longer get round-tripped correctly; they come out double-encoded (e.g. "Ã¼ Ã¤ Ã¶").

    I'm using the DOM on Linux with the built-in libxml, which includes iconv support. Which version of libxml are you using? If you're using the version that comes packaged with the DOM code in svn, that one doesn't include iconv support, so that could explain the difference.

    I'm pretty ignorant on issues with Unicode and UTF-8, so I'm not fully aware of what's going on. We need to get these issues worked out though.

     
  • Anonymous - 2007-10-25

    I think the function isolat1ToUTF8 should only be used on systems that encode strings as ISO-8859-1. Another possible approach would be to set the encoding to ISO-8859-1 if the host system is Windows. Most Linux distributions use UTF-8 to encode strings (though it is configurable). That's why you get correct output on Linux without the conversion function.

    We use libxml2 with iconv support.

    >We recently stripped iconv from the version of libxml that we ship
    >with the DOM, due to license concerns. Will the isolat1ToUTF8 function
    >still work without iconv?

    I don't know. The libxml API documentation does not specify this exactly. See http://xmlsoft.org/html/libxml-encoding.html

    > Do we need to call an analogous function when loading a document?

    Oops, I didn't check that. But I guess we need an analogous mechanism while loading.

    > The DOM also runs on Mac and Linux, so the WIN32 check shouldn't be
    > necessary. I'll test it on Linux/Mac to make sure it works alright.

    The WIN32 check is there so that isolat1ToUTF8 is only called when the code is compiled on Windows. On Linux, how to convert the string before passing it to libxml depends on the configured encoding.

     
  • Anonymous - 2007-10-25

    Okay, I dug into the problem and had a look at the writer example
    http://xmlsoft.org/examples/testWriter.c:

    -- code --
    /* Write a comment as child of EXAMPLE.
     * Please observe, that the input to the xmlTextWriter functions
     * HAS to be in UTF-8, even if the output XML is encoded
     * in iso-8859-1 */
    tmp = ConvertInput("This is a comment with special chars: <äöü>",
                       MY_ENCODING);
    rc = xmlTextWriterWriteComment(writer, tmp);
    -- code --

    The comment in the code block above describes our problem: libxml
    always needs UTF-8 strings, which means every string must be converted
    to and from UTF-8.

    Okay, so far this works for writing on Windows because of my additions.
    But looking at the function
    xmlChar * ConvertInput(const char *in, const char *encoding)
    it may be a little too simplistic.

    I had a look at daeLIBXMLPlugin.cpp to see where to adjust the reading
    of input data. The function
    void packageCurrentAttributes(xmlTextReaderPtr reader,
    std::vector<std::pair<daeString, daeString> >& attributes)
    reads attributes, and their values can include special characters. On
    Windows I would have to convert the value string to ISO-8859-1, but if
    I use UTF8Toisolat1 I have to allocate a char array of appropriate size,
    and I can't delete it after the conversion because the output vector
    stores just the pointer.

    Any idea how to solve this issue? Maybe we should open a bug for this.
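
    One way to address the ownership problem described above is to keep the
    converted strings in a pool that outlives the attribute vector. A minimal
    sketch, assuming a simple global pool (the pool type and helper name are
    illustrative, not the DOM's actual solution):

    -- code --
    #include <libxml/encoding.h>
    #include <cstring>
    #include <list>
    #include <string>

    // Converted strings live here until the import finishes, so the raw
    // pointers stored in the attribute vector stay valid.
    static std::list<std::string> conversionPool;

    static const char* toLatin1(const char* utf8)
    {
        int inLen = (int)strlen(utf8);
        // Latin-1 output never needs more bytes than the UTF-8 input.
        std::string buf(inLen + 1, '\0');
        int outLen = (int)buf.size();
        if (UTF8Toisolat1((unsigned char*)&buf[0], &outLen,
                          (const unsigned char*)utf8, &inLen) < 0)
            return utf8; // conversion failed; fall back to the raw string
        buf.resize(outLen);
        conversionPool.push_back(buf);
        return conversionPool.back().c_str();
    }
    -- code --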

     
  • Anonymous - 2007-10-25

    I attached a new version of the patch which converts attribute values and text nodes while reading too.
    File Added: xml_encoding2.patch

     
  • Anonymous - 2007-10-25

    A second patch for daeLIBXMLPlugin.cpp

     
  • Steven Thomas

    Steven Thomas - 2007-10-26

    Ok, so it sounds like we only need the conversion on Windows.

    I also made some changes to the code, and we're stomping on each other's changes with these patches. To make things easier I've created a new branch called "charEncoding" and given you write access. You can checkout the branch with this command: "svn co https://collada-dom.svn.sourceforge.net/svnroot/collada-dom/branches/charEncoding charEncoding"

    It's updated to include your latest code. I'll commit my changes tomorrow. Feel free to commit whatever you please. Once we think the code is good enough and we've tested it on all the platforms we can merge it into trunk.

    Here's what still has to be answered:
    (1) What happens when iconv isn't present?
    (2) What are the performance implications? Is the encoding conversion something we can keep turned on by default, or should we only turn it on if iconv is present?
    (3) Do Linux and Mac behave properly by default?

    Steve

     
  • Anonymous - 2007-10-29

    > To make things easier I've created a new branch called "charEncoding" and
    > given you write access.

    Good idea, but will you merge all fixes from trunk into the branch? That would be very important, so please don't forget it. I can't promise how much time I can spend working on this issue over the next few weeks.

    > What are the performance implications? Is the encoding conversion
    > something we can keep turned on by default, or should we only turn it on if
    > iconv is present?

    Of course there will be performance implications. The current implementation needs to allocate, convert, and deallocate character arrays, which slows down the import and export of dae files. IMHO the conversion should be on by default, because this is an international project and there are a lot of special characters (not only the German ones). But if a user of the DOM decides their documents will never contain special characters, there should be a preprocessor definition to enable or disable the default conversion.

    Did you run some tests already?

     
  • Steven Thomas

    Steven Thomas - 2007-10-29

    "Good idea but do you merge all fixes in the trunk into the branch?"

    No, the branch is only going to exist for a short while. I'm going to make a few tweaks to the code, test non-ASCII character handling on Mac/Linux to make sure it already works, and measure the performance hit on Windows. Then I'm going to merge the code back to trunk. This branch will only be around for a week or two.

    "Of course there will be performance implications. The current impl needs to allocate, convert and deallocate character arrays."

    Right. I just want to measure the hit so we know what we're dealing with. I agree 100% that we should handle non-ASCII characters by default. If the performance hit is small enough, I'm not even going to include an explicit option to disable it. I just need some hard numbers to look at.

     
  • Steven Thomas

    Steven Thomas - 2008-02-14

    I've been working on some other things and just came back to this problem today. Sorry for the delay. Here are a few observations:

    (1) The isolat1ToUTF8 function isn't the right way to approach this in the general case. Windows has many character encodings other than Windows-1252 (see http://msdn2.microsoft.com/en-us/library/ms776446(VS.85).aspx), so just passing all data through isolat1ToUTF8 whenever we detect that we're on Windows isn't right. To handle this fully, you need to detect the system's current character encoding (not sure how to do that, but I'm sure there's a way) and use iconv to convert to UTF-8; a sketch of one approach appears after this comment.

    (2) Are we absolutely certain that the DOM's character encoding should default to the system's current encoding? Obviously libxml doesn't do that; libxml is UTF-8 based, and it's the user's job to do any necessary conversions. Would this be an unreasonable approach for the DOM? Does it place too high a burden on the DOM client? Can libxml only get away with this because it's mostly used on UTF-8 based systems (like Linux) rather than Windows? How do other multi-platform libraries deal with character encoding issues?

    This problem seems pretty complicated.

    Steve
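
    On Windows, the active code page can be queried with GetACP(); a minimal
    sketch of mapping it to an encoding name iconv/libxml understands (the
    mapping table below is illustrative and far from complete):

    -- code --
    #ifdef WIN32
    #include <windows.h>
    #include <cstdio>

    // Return an encoding name for the current ANSI code page. GNU iconv
    // accepts "CPxxxx" aliases for many Windows code pages.
    static const char* systemEncodingName()
    {
        switch (GetACP()) {
            case 1252:  return "windows-1252";
            case 65001: return "UTF-8";
            default: {
                static char name[16];
                std::sprintf(name, "CP%u", GetACP());
                return name;
            }
        }
    }
    #endif
    -- code --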

     
  • Anonymous - 2008-02-14

    >Are we absolutely certain that the DOM's character encoding should
    >default to the system's current encoding?

    I don't think the default encoding has to be the system's current encoding, but if the DOM deals with UTF-8 only, IMHO the API should handle different encodings accordingly.

    >Would this be an unreasonable approach for the DOM? Does it
    >place too high a burden on the DOM client?

    As a user of the DOM API I don't want to think about this issue when I set a char*. If the user is supposed to handle it, the API must be much clearer and force the user to pass UTF-8 strings only. That means the public API should never use char* or wchar_t*. We need something like a UTFString class, and voilà, we have the place where we can put the conversions.

     
  • Steven Thomas

    Steven Thomas - 2008-02-14

    > If the user is supposed to handle it, the API must be much clearer
    > and force the user to pass UTF-8 strings only. That means the public
    > API should never use char* or wchar_t*. We need something like a
    > UTFString class, and voilà, we have the place where we can put the
    > conversions.

    Hmm, that'd probably be a lot of work. If this weren't implemented with full backward compatibility, it'd inconvenience a lot of people currently using the DOM who are fine with the current behavior.

    Another idea: we provide a parameter in the DOM to control the character encoding that the DOM uses. By default it would be set to UTF-8, and in that case no conversion would take place before passing the data through to libxml. However, you could set the encoding string to anything supported by libxml's xmlFindCharEncodingHandler function, which I assume uses iconv (if it's present) to determine the supported encodings. So in your case you would just call something like DAE::setCharEncoding("ISO-8859-1") and then everything should work as you expect (see the sketch after this comment).

    I think that would be fairly clean and simple, and perfectly backward-compatible with old code. Let me know what you think.

    Steve
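
    A sketch of what such a conversion layer might look like, using libxml's
    generic xmlCharEncInFunc front-end (which works whether the handler is
    table-based or iconv-based); the helper name is illustrative:

    -- code --
    #include <libxml/encoding.h>
    #include <libxml/tree.h>
    #include <cstring>
    #include <string>

    // Convert 'in' (in the user-selected encoding) to UTF-8 before handing
    // it to libxml.
    static std::string toUTF8(const char* in, const char* encoding)
    {
        xmlCharEncodingHandlerPtr handler = xmlFindCharEncodingHandler(encoding);
        if (!handler)
            return in; // unknown encoding: pass through unchanged
        xmlBufferPtr src = xmlBufferCreate();
        xmlBufferPtr dst = xmlBufferCreate();
        xmlBufferAdd(src, (const xmlChar*)in, (int)strlen(in));
        std::string result = in;
        if (xmlCharEncInFunc(handler, dst, src) >= 0)
            result.assign((const char*)xmlBufferContent(dst),
                          xmlBufferLength(dst));
        xmlBufferFree(src);
        xmlBufferFree(dst);
        xmlCharEncCloseFunc(handler); // frees iconv-based handlers only
        return result;
    }
    -- code --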

     
  • Anonymous - 2008-02-15

    >So in your case you would just call something like
    >DAE::setCharEncoding("ISO-8859-1") and then
    >everything should work as you expect.

    This could be a solution, but the API documentation must point out this issue to make users aware of encoding problems.

    >it'd inconvenience a lot of people currently
    >using the DOM who are fine with the current behavior.

    BTW do you have any idea who is using the DOM?

     
  • Nobody/Anonymous

    > BTW do you have any idea who is using the DOM?

    Not really, other than a few people at various companies that I know personally.

     
  • Steven Thomas

    Steven Thomas - 2008-03-20

    I merged the charEncoding branch to trunk in revision 483. Now if you want the character conversion to take place, call either DAE::setGlobalCharEncoding(DAE::Latin1) (global setting, applies to all DAE objects) or dae.setCharEncoding(DAE::Latin1) (local setting, applies only to that DAE).
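
    Usage of the calls named above, in a short sketch (the header name and
    surrounding scaffolding are assumptions, not part of the commit):

    -- code --
    #include <dae.h>

    void configureEncoding()
    {
        // Global setting: applies to all DAE objects.
        DAE::setGlobalCharEncoding(DAE::Latin1);

        // Local setting: applies only to this DAE instance.
        DAE dae;
        dae.setCharEncoding(DAE::Latin1);
    }
    -- code --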

    This isn't the ideal way to do it, of course. Ideally we'd detect the system encoding automatically and do the appropriate conversions, or maybe use the more Windows-y UTF-16 for proper Unicode support on Windows. I have to get DOM 2.0 released though, and this will have to do for now.

    Also, the Latin1 character set still isn't fully supported, even with this patch. Element and attribute names don't get converted. Although none of the standard Collada elements or attributes have names with non-ASCII characters, <extra> data might. This probably wouldn't be too hard to fix though.

    Steve

     
  • Steven Thomas

    Steven Thomas - 2008-03-22
    • status: open --> closed-accepted
     
