I have some feedback that the switch to UTF-8 "breaking" working XML parsing is going to be an issue. I'm undecided on how to proceed, but I wanted to publicly look at the issue.
One approach (moving a discussion from an email thread):
The changes to switch between UTF-8 and the codepage are pretty minimal. I think autodetection is important, rather than having the user tell the parser whether to use UTF-8. The biggest issue is that many programmers don't know what UTF-8 is.
I'd suggest:
- if the Microsoft UTF-8 lead bytes are present, use UTF-8
- if the "encoding" is specified as UTF-8, use UTF-8
- if the "encoding" is specified and is NOT UTF-8, use the codepage (not correct, but the best it can do)
- if the encoding is not specified, use UTF-8
The last point is debatable. But at least old content could be "fixed" by adding the encoding tag.
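For concreteness, the four rules above could be sketched roughly like this (a hypothetical illustration only; `detectEncoding` and the enum names are not TinyXml's actual API):

```cpp
#include <cstddef>
#include <cstring>

enum Encoding { ENCODING_UTF8, ENCODING_LEGACY };

// Decide the encoding from the first bytes of the stream and the (already
// extracted) value of the declaration's encoding attribute, if any.
// declaredEncoding is null when the document has no encoding attribute.
Encoding detectEncoding( const unsigned char* data, size_t len,
                         const char* declaredEncoding )
{
    // Rule 1: the UTF-8 lead bytes 0xEF 0xBB 0xBF force UTF-8.
    if ( len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF )
        return ENCODING_UTF8;

    // Rules 2 and 3: an explicit encoding attribute decides.
    if ( declaredEncoding )
        return ( strcmp( declaredEncoding, "UTF-8" ) == 0 ) ? ENCODING_UTF8
                                                            : ENCODING_LEGACY;

    // Rule 4 (the debatable one): no declaration means UTF-8.
    return ENCODING_UTF8;
}
```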
Thoughts?
lee
On the last point (encoding not specified), I *think* the generally accepted thing is to assume UTF-8. The W3C spec (see below) seems to imply this, though it's not 100% clear on it.
But given the impression I've got of the TinyXml "mission", it makes sense to keep things "tiny". In my project, I know I'll only need plain ASCII chars, and I don't want to compile in any UTF-8-handling logic.
Ideally, there would be either a compile-time #define to specify the default encoding (and perhaps even remove UTF-8 handling from a build), or a run-time way to specify it (eg parser->setDefaultEncoding( UTF8 )).
Some links:
http://www.w3.org/TR/REC-xml/#NT-EncodingDecl
http://www.w3.org/TR/REC-xml/#sec-guessing
Ellers
Ellers --
This is what scares me about the UTF-8 thing - you've captured what I expect to be a general concern. ASCII is a pure subset of UTF-8, so you don't have to do anything. Seamless. Nada. No change.
But I think you capture the concern of many developers who feel they are going to have to do something. Good docs and calling it "supporting UTF-8" will help, but it's still going to create churn.
lee
<rant>
On the W3C spec: when XML first came out I read about it and thought it was pretty cool. Then my boss at the time slapped this 1000-page "how to use XML" book on my desk, and I thought "shouldn't that be a pamphlet or something?". I mean, there just isn't that much to it. But somewhere, somehow, it all got really complicated. And the incredible detail, options, combinations, and just plain stuff in the W3C spec, about something that is fundamentally *very simple*, is part of the problem.
</rant>
Anonymous - 2004-06-03
UTF-8 switch may be a good way!
Lee,
I think it really makes sense to spend some effort on the encoding issue, even if it makes tinyxml a little bit less tiny. There are just too many users who cannot stick to pure US-ASCII.
The former way tinyxml handled encoding was both tiny and sufficient for us (Germans), but it would be completely OK if we had to specify the encoding, since that's the standard way to do it (even if it would break some of our existing unit test cases).
So I fully support your suggestions. Regarding the last of them (encoding not specified), I don't have a clear opinion.
Okay, I went with the more correct rules. Hopefully the correct choice! The latest version in CVS will:
- If the UTF-8 byte order mark (the lead bytes 0xEF 0xBB 0xBF) begins the file or data stream, TinyXml will read it as UTF-8.
- If the declaration tag is read, and it has encoding="UTF-8", then TinyXml will read it as UTF-8.
- If the declaration tag is read, and it has no encoding specified, then TinyXml will read it as UTF-8.
- If the declaration tag is read, and it has encoding="something else", then TinyXml will read it in Legacy Mode. In Legacy Mode, TinyXml works as it did before. It's not clear what that mode does exactly, but old content should keep working.
Sound good?
lee
I just read your response to my earlier reply and I realised I didn't word it very clearly.
I know ASCII is a subset of UTF-8. It is also a subset of ISO8859-1. This is handy, but it is not the point. The point is that I know in my project I will only ever deal with ASCII. If I can avoid it, I do not want any extra code that deals with UTF-8.
I accept this is an outside and low priority desire (not even a "requirement"), but bear with me as there is a point in here.
You said that with UTF-8 "you don't have to do anything. Seamless. Nada. No change [over ASCII]". This is not entirely true.
If I have an XML document where all chars are <= 0x7F, then you are correct.
But as soon as there is any char from outside the ASCII range, ASCII-only string processing code will have to change.
For example, the German guys will be needing the u-umlaut character (ü). This is U+00FC in Unicode, which maps to UTF-8 as 2 bytes:
11000011 10111100 = 0xC3 0xBC
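The two-byte arithmetic can be checked with a small sketch (`encodeUtf8` is an illustrative helper, not part of TinyXml): the top five bits of the code point go into a 110xxxxx lead byte, and the low six bits into a 10xxxxxx continuation byte.

```cpp
#include <cstddef>

// Encode a code point below U+0800 into UTF-8; returns the number of
// bytes written (1 or 2). A sketch of the two-byte form only.
size_t encodeUtf8( unsigned int cp, unsigned char* out )
{
    if ( cp < 0x80 ) {                  // plain ASCII: one byte, unchanged
        out[0] = (unsigned char) cp;
        return 1;
    }
    // Two-byte form: 110xxxxx 10xxxxxx
    out[0] = (unsigned char)( 0xC0 | ( cp >> 6 ) );    // top 5 bits
    out[1] = (unsigned char)( 0x80 | ( cp & 0x3F ) );  // low 6 bits
    return 2;
}
```

For U+00FC this produces 0xC3 0xBC, matching the bit pattern above.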
Assuming the TinyXml parser produces a 1-byte-per-char string (a traditional C string), then we have a question:
Does the parser convert this one char (2 bytes) into some other charset (like the Windows char, one byte = 0xFC), or leave it as 2 bytes (thereby making the string effectively a UTF-8 string), or something else?
If the parser leaves the two bytes there, then the simplest change to the ASCII-only logic is to skip all non-ASCII bytes, eg:
while ( *pChar & 0x80 ) pChar++; // skip the bytes of a non-ASCII char
This charset stuff gets tricky because there are so many different ways of handling different scenarios.
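To illustrate why the byte-level test is enough: every byte of a multi-byte UTF-8 sequence (lead and continuation alike) has its high bit set, so ASCII-only code can filter them out without understanding UTF-8 at all. A hypothetical sketch (`stripNonAscii` is illustrative, not a proposed TinyXml function):

```cpp
#include <cstring>

// Copy just the ASCII bytes of a UTF-8 string into out, which must be
// at least as large as in. Multi-byte sequences are dropped entirely,
// because all of their bytes have the high bit set.
void stripNonAscii( const char* in, char* out )
{
    while ( *in ) {
        if ( !( (unsigned char)*in & 0x80 ) )   // high bit clear: plain ASCII
            *out++ = *in;
        ++in;                                   // non-ASCII bytes are skipped
    }
    *out = 0;
}
```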
The ultimate long term solution (and I emphasise *long term*, not "right now") to this would be:
- TinyXml ships with various classes for handling documents in different charsets, eg DecoderUTF8, DecoderASCIIOnly, DecoderUCS2 etc. Each has #ifdef blocks around it.
- TinyXml ships with the #defines set so that only UTF-8 is used.
- A developer can change the #define settings so their TinyXml build supports only the exact charset (or charsets) they want.
- If a document is parsed whose encoding declaration is "ISO8859-1" and that charset isn't compiled into TinyXml, the library fails with an "unsupported charset" error.
Don't get me wrong, I know this is a long term thing. It would be accomplished by pulling the code that iterates through chars and decodes them out into separate classes, eg DecoderUTF8 etc.
For now, supporting UTF-8 is far and away the best solution.
But as a long term direction, the above keeps the library tight, it keeps lots of users happy, and it may even improve the modularity of the code because the charset handling is separated from the main parser (a good design IMO).
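To make the idea concrete, here is a rough sketch of what such decoder classes might look like. All of the names (TIXML_USE_UTF8, Decoder, DecoderUTF8, DecoderASCIIOnly) are hypothetical, and the shipped default would enable only UTF-8:

```cpp
// Shipped default: only the UTF-8 decoder is compiled in.
#ifndef TIXML_USE_UTF8
#define TIXML_USE_UTF8
#endif

// Common interface: the parser main asks the decoder how long the
// character at p is, and never touches encoding details itself.
class Decoder {
public:
    virtual ~Decoder() {}
    // Return the byte length of the character at p, or 0 if p does not
    // start a valid character in this charset.
    virtual int charLength( const unsigned char* p ) const = 0;
};

#ifdef TIXML_USE_UTF8
class DecoderUTF8 : public Decoder {
public:
    virtual int charLength( const unsigned char* p ) const {
        if ( *p < 0x80 )             return 1;  // ASCII
        if ( ( *p & 0xE0 ) == 0xC0 ) return 2;  // 110xxxxx lead byte
        if ( ( *p & 0xF0 ) == 0xE0 ) return 3;  // 1110xxxx lead byte
        if ( ( *p & 0xF8 ) == 0xF0 ) return 4;  // 11110xxx lead byte
        return 0;                               // invalid lead byte
    }
};
#endif

#ifdef TIXML_USE_ASCII
class DecoderASCIIOnly : public Decoder {
public:
    virtual int charLength( const unsigned char* p ) const {
        return ( *p < 0x80 ) ? 1 : 0;           // reject anything non-ASCII
    }
};
#endif
```

A build that only needs ASCII would then define TIXML_USE_ASCII instead, and a document declaring an uncompiled charset would fail with the "unsupported charset" error.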
What do you think?
Lee,
Your post from 06-05 07:40 sounds fine to me.