I have encountered three bugs when parsing documents with an internal DTD. The attached patch contains fixes for all three. The bugs are:
When an entity was declared multiple times, i.e.
<!ENTITY sameName "test1">
<!ENTITY sameName "test2">
the second and following declarations did not cause an entity declaration handler call but multiple calls to the default handler with the character data inside of the declaration. The first two hunks of the patch fix this by removing a few lines.
On an element declaration with a choice of three or more elements, i.e.
<!ELEMENT elm1 (elm2|elm3|elm4)>
each pipe character (starting at the second one) caused one default handler call before the element declaration handler call, containing only the pipe character. So the above example caused a default handler call with "|" as its data followed by the expected element declaration handler call. Choices with more elements caused more default handler calls. handleDefault was not properly set to false in this case. Hunk 3 fixes this.
If the xml file had CR or CRLF as line delimiters, then Expat failed to convert them into LF if they occurred before the root xml tag. That is what hunk 4 fixes.
Number 1) is not a bug. From the spec:
<quote>
If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.
</quote>
Have to investigate number 2)
Number 3) is not clear to me. Do you mean this happens for anything before the root tag, even the DTD?
1) True, but the spec does not say not to pass duplicate element declations at all, and pass nonsense to the application instead. Just to clarify: In the above example, I receive an elementdecl handler call for the first declaration and two default handler calls for the second declaration, one with sameName and the other with "test2" as data, which is pretty useless. Maybe a configuration option might be suitable to let the application decide whether it only wants to receive the first or all entity declarations.
3) I mean any line breaks before the root element, i.e. if you use Expat to read and then recreate the following XML doc:
you will end up with this:
Meanwhile I found two missing free()s in xmlparse.c through Valgrind. An updated patch is attached.
I re-read the Expat docs, this part about the default handler is relevant:
This means that in all 3 reported cases Expat works as designed:
Note: the default handler is an Expat specific extension not necessary under the spec, and not subject to the spec.
Karl
OK, I finally got around to running some code myself.