modtoxml output UTF8

Brought to you by: pspeed

#4 modtoxml output UTF8

Status: open

Owner: Paul Speed

Labels: ModPacker (3)

Priority: 5

Updated: 2005-05-15

Created: 2005-05-15

Creator: cinderbdt

Private: No

In the readme, you mention editing the XML by hand and
rolling it back into a binary GFF and back into the MOD.

I'm looking at some other workflow possibilities, where
the XML could be used to insert into a database and
then ask questions like "how many X with property Y do
we have in our module?".

I'm running into some roadblocks with the available
tools for dealing with XML, because they all expect the
XML to be in UTF8 encoding.

Right now, I'm messing around with GNU Recode to see if
I can turn the output of modtoxml into UTF8.

I understand that the game probably doesn't handle UTF8
very well, so for purposes of putting the files back
into the module, it would be another step to convert it
back?

Anyway, a command line switch or a default behavior to
use UTF8 and to put a real XML header with an encoding
attribute would be a real help.
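
As a rough illustration of the conversion being asked for, here is a minimal Python sketch that re-encodes a file's bytes as UTF-8 and puts an XML declaration with an encoding attribute on top. It assumes the modtoxml output is Windows-1252 (what Windows calls "ANSI"), which nothing in this request confirms, and the function name `to_utf8_xml` is made up for the example.

```python
def to_utf8_xml(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from source_encoding and return UTF-8 XML bytes.

    The cp1252 default is an assumption about what modtoxml writes.
    """
    text = data.decode(source_encoding)
    # Drop an existing XML declaration, if any, so it is not duplicated.
    if text.lstrip().startswith("<?xml"):
        text = text[text.index("?>") + 2:].lstrip()
    declaration = '<?xml version="1.0" encoding="UTF-8"?>\n'
    return (declaration + text).encode("utf-8")
```

Going the other way before packing the files back into the module would be the mirror image: decode as UTF-8, then encode with whatever encoding the game actually expects, which is also an open question here.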

Thanks for reading.

Discussion

cinderbdt - 2005-05-15

assigned_to: nobody --> pspeed

Paul Speed - 2005-05-15

Logged In: YES
user_id=652870

Interesting because Java likes to do everything in UTF8 as
it is. It should be a simple enough change to make sure
that it is and I think it's legal for XML to be in UTF8
(though I'll double check to be sure). If so then I may not
even make it an option.

Unfortunately, none of my files use any characters that
would show up as anything but regular characters even in
UTF8. Do you have some foreign-language files, or anything
else that does this, that I could test with? That is, an
example of a file that shows the problem.

Thanks,
-Paul

cinderbdt - 2005-05-15

Logged In: YES
user_id=704004

I don't know of any special characters in the module. All
the XML files come out just fine if I use ModToXML to
generate them. On either Windows or GNU/Linux, they come
out as ANSI, and they look OK. It was while I was looking
for how to process this XML for decision support purposes
that I found the information about "well-formed" XML being
encoded as UTF-8.

Eyes-glaze-over-background-info:
I can query the individual XML files, for example with
Python Amara, and it works. But when it comes to inserting
into a database, the only way I've found so far is to create
a DTD, then use the DTD to write SQL create table
statements. I found relaxer to create a DTD, but it expects
the XML it reads to be encoded in UTF8. GNU Recode worked.
I'm still trying to understand this part, so I'm not sure
I'm on the right track with the autogenerated DTD. I ended
up with a database structure I found had lost all the
context I wanted, because it was based on the wrong level of
representation -- the XML structure, not the BioWare GFF
format structure. It may be that I should give up on
inserting this into a database and go back to poking with
Python and shell scripts. I'm still googling. I realize that
is a bit far afield from NWN.
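
One way to keep the context that a DTD-derived schema throws away is to store every element together with its parent and its full path, instead of one table per element type. The following is only a sketch under assumptions: the sample XML and its element names (`module`, `creature`, `level`, `item`) are hypothetical and not what ModToXML actually emits, and `load_xml` and `count_by_path` are names invented for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; real ModToXML output uses different element names.
SAMPLE = """<module>
  <creature name="goblin"><level>3</level></creature>
  <creature name="ogre"><level>7</level></creature>
  <item name="sword"/>
</module>"""


def load_xml(conn, xml_text):
    """Store every element as a row that remembers its parent and full path."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent_id INTEGER, path TEXT, tag TEXT, text TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS attr (node_id INTEGER, name TEXT, value TEXT)"
    )

    def walk(elem, parent_id, parent_path):
        path = parent_path + "/" + elem.tag
        cur = conn.execute(
            "INSERT INTO node (parent_id, path, tag, text) VALUES (?, ?, ?, ?)",
            (parent_id, path, elem.tag, (elem.text or "").strip()),
        )
        node_id = cur.lastrowid
        # Attributes go in their own table, keyed by the owning element.
        for name, value in elem.attrib.items():
            conn.execute("INSERT INTO attr VALUES (?, ?, ?)", (node_id, name, value))
        for child in elem:
            walk(child, node_id, path)

    walk(ET.fromstring(xml_text), None, "")
    conn.commit()


def count_by_path(conn, path):
    """Answer 'how many elements sit at this path?'."""
    row = conn.execute("SELECT COUNT(*) FROM node WHERE path = ?", (path,)).fetchone()
    return row[0]
```

With `conn = sqlite3.connect(":memory:")` and `load_xml(conn, SAMPLE)`, a call such as `count_by_path(conn, "/module/creature")` counts the creature elements, and joins against `attr` or on `parent_id` recover the surrounding context.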

Paul Speed - 2005-05-15

Logged In: YES
user_id=652870

Just to help point you further in the right direction, UTF8
looks exactly like regular ASCII for any characters that
fall within the normal 7-bit ASCII set. English text made up
only of ASCII characters will therefore be byte-for-byte
identical as UTF8.

At least that's been my experience to date. In another life
I encoded some Asian text that looked very different in UTF8. :)
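
The claim about ASCII and UTF8 producing the same bytes is easy to check. This small Python sketch encodes the same text both ways; Windows-1252 stands in for "ANSI" here as an assumption, and `encoded_forms` is just a helper name for the example.

```python
def encoded_forms(text):
    """Return the text's bytes in Windows-1252 and in UTF-8."""
    return text.encode("cp1252"), text.encode("utf-8")


# Pure ASCII text: both encodings give the same bytes.
ascii_cp1252, ascii_utf8 = encoded_forms("Hello, module")

# One accented character (e with acute): the encodings diverge.
accented_cp1252, accented_utf8 = encoded_forms("caf\u00e9")
```

For the ASCII string the two byte strings compare equal. For the accented one, Windows-1252 uses the single byte 0xE9 while UTF-8 uses the two bytes 0xC3 0xA9.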

cinderbdt - 2005-05-16

Logged In: YES
user_id=704004

It looks the same to a human, but the ms-ansi / ASCII is one
byte to represent a character, while UTF8 is two bytes.
That's why it doesn't work to just stick an encoding header
on the top of the output of modtoxml in notepad, without
also saving it as Unicode UTF8 so that it becomes twice as
large on disk.

Paul Speed - 2005-05-17

Logged In: YES
user_id=652870

UTF-8 only uses two or more bytes to represent characters
that are not standard 7-bit ASCII. See the RFC here:
http://ietf.org/rfc/rfc2279.txt

Quote:
"Character values from 0000 0000 to 0000 007F (US-ASCII
repertoire) correspond to octets 00 to 7F (7 bit US-ASCII
values). A direct consequence is that a plain ASCII string
is also a valid UTF-8 string."

So something is a bit odd here. I'll do some tests at some
point to explicitly write out UTF8 to see if the files
actually change, but I suspect not.
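
A quick way to predict the result of such a test, sketched in Python: if a file contains no byte above 0x7F it is already valid UTF-8, so writing it out explicitly as UTF8 should not change it. The helper name `first_non_ascii` is invented for this example.

```python
from typing import Optional


def first_non_ascii(data: bytes) -> Optional[int]:
    """Return the offset of the first byte above 0x7F, or None if all are ASCII."""
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            return offset
    return None
```

Reading a file in binary mode (`open(path, "rb").read()`) and passing the bytes in gives either None, meaning the file is plain ASCII, or the position of the first byte whose meaning depends on the encoding.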

-Paul
