
#4 modtoxml output UTF8

open
ModPacker (3)
5
2005-05-15
2005-05-15
cinderbdt
No

In the readme, you mention editing the XML by hand and
rolling it back into a binary GFF and back into the MOD.

I'm looking at some other workflow possibilities, where
the XML could be inserted into a database and then
queried with questions like "how many X with property Y
do we have in our module?".

I'm running into some roadblocks with the available
tools for dealing with XML, because they all expect the
XML to be in UTF8 encoding.

Right now, I'm messing around with GNU Recode to see if
I can turn the output of modtoxml into UTF8.

I understand that the game probably doesn't handle UTF8
very well, so for purposes of putting the files back
into the module, it would be another step to convert it
back?

Anyway, a command line switch or a default behavior to
use UTF8 and to put a real XML header with an encoding
attribute would be a real help.

Thanks for reading.

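As a stopgap until the tool offers such a switch, the conversion requested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: that the modtoxml output is in the Windows "ANSI" code page (cp1252), and that the helper name `to_utf8_xml` is hypothetical and not part of ModPacker.

```python
import re

# Assumption: modtoxml writes Windows "ANSI" (cp1252). Adjust if your files differ.
ASSUMED_SOURCE_ENCODING = "cp1252"

XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def to_utf8_xml(raw: bytes, source_encoding: str = ASSUMED_SOURCE_ENCODING) -> bytes:
    """Re-encode XML bytes as UTF-8 and prepend a UTF-8 XML declaration."""
    text = raw.decode(source_encoding)
    # Remove an existing XML declaration, if any, so the result has exactly one.
    text = re.sub(r"^\s*<\?xml\s[^>]*\?>\s*", "", text, count=1)
    return (XML_HEADER + text).encode("utf-8")
```

Going back to the original encoding before repacking would be the mirror image: decode as UTF-8, then encode with the original code page. Whether the repacking step needs that is not something this sketch can confirm.
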
Discussion

  • cinderbdt

    cinderbdt - 2005-05-15
    • assigned_to: nobody --> pspeed
     
  • Paul Speed

    Paul Speed - 2005-05-15

    Logged In: YES
    user_id=652870

    Interesting because Java likes to do everything in UTF8 as
    it is. It should be a simple enough change to make sure
    that it is and I think it's legal for XML to be in UTF8
    (though I'll double check to be sure). If so then I may not
    even make it an option.

    Unfortunately, none of my files use any characters that
    would show up as anything but regular characters even in
    UTF8... do you have some foreign language files or something
    that do this that I could test with... ie: an example of a
    file that shows the problem.

    Thanks,
    -Paul

     
  • cinderbdt

    cinderbdt - 2005-05-15

    Logged In: YES
    user_id=704004

    I don't know of any special characters in the module. All
    the XML files come out just fine if I use ModToXML to
    generate them. On either Windows or GNU/Linux, they come
    out as ANSI, and they look OK. It was while I was looking
    for how to process this XML for decision support purposes
    that I found the information about "well-formed" XML being
    encoded as UTF-8.

    Eyes-glaze-over-background-info:
    I can query the individual XML files, for example with
    python Amara, and it works. But when it comes to inserting
    into a database, the only way I've found so far is to create
    a DTD, then use the DTD to write SQL create table
    statements. I found relaxer to create a DTD, but it expects
    the XML it reads to be encoded in UTF8. GNU Recode worked.
    I'm still trying to understand this part, so I'm not sure
    I'm on the right track with the autogenerated DTD. I ended
    up with a database structure I found had lost all the
    context I wanted, because it was based on the wrong level of
    representation -- the XML structure, not the BioWare GFF
    format structure. It may be that I should give up on
    inserting this into a database and go back to poking with
    Python and shell scripts. I'm still googling. I realize that
    this is a bit far afield from NWN.

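One way to get XML into a database without a generated DTD is to flatten every element into generic rows and query those. The sketch below uses only the Python standard library. It assumes the input is ASCII or UTF-8, and the table layout and function names (`load_elements`, `count_matches`) are hypothetical. It still works at the XML level, not the GFF level, so it does not by itself restore the lost context described above.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_elements(conn, xml_bytes, source):
    """Store one row per attribute and per non-empty text node, keyed by element path."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS element "
        "(source TEXT, path TEXT, attr TEXT, value TEXT)"
    )
    rows = []

    def walk(node, path):
        here = f"{path}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            # Text content is stored with attr = NULL.
            rows.append((source, here, None, text))
        for name, value in node.attrib.items():
            rows.append((source, here, name, value))
        for child in node:
            walk(child, here)

    walk(ET.fromstring(xml_bytes), "")
    conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def count_matches(conn, attr, value):
    """Count rows where the given attribute has the given value."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM element WHERE attr = ? AND value = ?", (attr, value)
    )
    return cur.fetchone()[0]
```

With a connection from `sqlite3.connect(":memory:")`, each XML file can be loaded once and then queried with ordinary SQL.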
     
  • Paul Speed

    Paul Speed - 2005-05-15

    Logged In: YES
    user_id=652870

    Just to help point you further in the right direction, UTF8
    looks exactly like regular ASCII for any characters that
    fall within the standard 7-bit ASCII set. Plain English
    text can be represented entirely in ASCII characters and
    therefore is byte-for-byte identical in UTF8.

    At least that's been my experience to date. In another life
    I encoded some Asian text that looked very different in UTF8. :)

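The point about ASCII text being identical in UTF8 can be checked directly. The following Python sketch compares the bytes produced by several encodings. The helper name `encodings_agree` is hypothetical, and cp1252 stands in for "ANSI" as an assumption.

```python
def encodings_agree(text, encodings=("ascii", "cp1252", "utf-8")):
    """Return True if text encodes to exactly the same bytes in every encoding."""
    encoded = set()
    for name in encodings:
        try:
            encoded.add(text.encode(name))
        except UnicodeEncodeError:
            # The text cannot be represented in this encoding at all.
            return False
    return len(encoded) == 1
```

Plain ASCII text passes this check, while text with an accented character such as "café" fails it even when only cp1252 and UTF-8 are compared.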
     
  • cinderbdt

    cinderbdt - 2005-05-16

    Logged In: YES
    user_id=704004

    It looks the same to a human, but the ms-ansi / ASCII is one
    byte to represent a character, while UTF8 is two bytes.
    That's why it doesn't work to just stick an encoding header
    on the top of the output of modtoxml in notepad, without
    also saving it as Unicode UTF8 so that it becomes twice as
    large on disk.

     
  • Paul Speed

    Paul Speed - 2005-05-17

    Logged In: YES
    user_id=652870

    UTF-8 only uses more than one byte to represent characters
    that are not standard 7-bit ASCII. See the RFC here:
    http://ietf.org/rfc/rfc2279.txt

    Quote:
    "Character values from 0000 0000 to 0000 007F (US-ASCII
    repertoire) correspond to octets 00 to 7F (7 bit US-ASCII
    values). A direct consequence is that a plain ASCII string
    is also a valid UTF-8 string."

    So something is a bit odd here. I'll do some tests at some
    point to explicitly write out UTF8 to see if the files
    actually change, but I suspect not.

    -Paul

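The kind of test described above might look like the following Python sketch. It reports how many bytes UTF-8 uses per character and checks whether a file's bytes are all 7-bit ASCII, in which case writing it out as UTF-8 would leave it unchanged. Both function names are hypothetical.

```python
def utf8_byte_counts(text):
    """Pair each character with the number of bytes UTF-8 uses for it."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is below 0x80, so the data is already valid UTF-8."""
    return all(byte < 0x80 for byte in data)
```

Reading a generated XML file in binary mode and passing its contents to `is_pure_ascii` shows whether an explicit UTF-8 writer could change that file at all.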
     
