#88 Patch for three bugs in DTDs

Feature Request
open
nobody
None
5
2014-02-07
2014-01-30
John Doe
No

I have encountered three bugs when parsing documents with an internal DTD. The attached patch contains fixes for all three. The bugs are:

  1. When an entity was declared multiple times, i.e.
    <!ENTITY sameName "test1">
    <!ENTITY sameName "test2">
    the second and following declarations did not cause an entity declaration handler call but multiple calls to the default handler with the character data inside of the declaration. The first two hunks of the patch fix this by removing a few lines.

  2. On an element declaration with a choice of three or more elements, i.e.
    <!ELEMENT elm1 (elm2|elm3|elm4)>
    each pipe character (starting at the second one) caused one default handler call before the element declaration handler call, containing only the pipe character. So the above example caused a default handler call with "|" as its data followed by the expected element declaration handler call. Choices with more elements caused more default handler calls. handleDefault was not properly set to false in this case. Hunk 3 fixes this.

  3. If the xml file had CR or CRLF as line delimiters, then Expat failed to convert them into LF if they occurred before the root xml tag. That is what hunk 4 fixes.

1 Attachments

Discussion

  • Karl Waclawek
    Karl Waclawek
    2014-01-31

    Number 1) is not a bug. From the spec:

    If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.

    Have to investigate number 2)

    Number 3) is not clear to me. Do you mean this happens for anything before the root tag, even the DTD?

     
  • John Doe
    John Doe
    2014-02-04

    1) True, but the spec does not say not to pass duplicate element declations at all, and pass nonsense to the application instead. Just to clarify: In the above example, I receive an elementdecl handler call for the first declaration and two default handler calls for the second declaration, one with sameName and the other with "test2" as data, which is pretty useless. Maybe a configuration option might be suitable to let the application decide whether it only wants to receive the first or all entity declarations.

    3) I mean any line breaks before the root element, i.e. if you use Expat to read and then recreate the following XML doc:

        <?xml version="1.0"?>\r\n
        <!--some comment-->\r\n
        <!DOCTYPE root [\r\n
          ...\r\n
        ]>\r\n
        <root>\r\n
        some content\r\n
        </root>
    

    you will end up with this:

        <?xml version="1.0"?>\r\n
        <!--some comment-->\r\n
        <!DOCTYPE root [\r\n
          ...\r\n
        ]>\r\n
        <root>\n
        some content\n
        </root>
    

    Meanwhile I found two missing free()s in xmlparse.c through Valgrind. An updated patch is attached.

     
    Attachments
  • Karl Waclawek
    Karl Waclawek
    2014-02-04

    I re-read the Expat docs, this part about the default handler is relevant:

    Sets a handler for any characters in the document which wouldn't otherwise be handled. This includes both data for which no handlers can be set (like some kinds of DTD declarations) and data which could be reported but which currently has no handler set. The characters are passed exactly as they were present in the XML document except that they will be encoded in UTF-8 or UTF-16. Line boundaries are not normalized. Note that a byte order mark character is not passed to the default handler. There are no guarantees about how characters are divided between calls to the default handler: for example, a comment might be split between multiple calls. Setting the handler with this call has the side effect of turning off expansion of references to internally defined general entities. Instead these references are passed to the default handler.

    This means that in all 3 reported cases Expat works as designed:

    1. The default handler reports those declarations that should otherwise be ignored. Reporting them as actual parts of the DTD would be an error.
    2. The pipe characters would not be reported otherwise, so the default handler reports them.
    3. The line breaks in the DTD part would not be reported otherwise as they are not document content, so the default handler reports them.

    Note: the default handler is an Expat specific extension not necessary under the spec, and not subject to the spec.

    Karl

     
  • John Doe
    John Doe
    2014-02-05

    1. It does not even do that as described earlier.
    2. Of course they are reported otherwise: As part of the call to the ElementDeclHandler.
    3. Mmmh, why would someone want to receive a mix of different line delimiters, default handler or not?
     
  • Karl Waclawek
    Karl Waclawek
    2014-02-05

    OK, I finally got around to running some code myself.

    1. Clarification: the spec says that only the first of multiple identical entity declarations is binding, not element declarations. Duplicate element declarations are not a well-formed-ness violation. We have to agree to disagree here. However, there is one known issue with the default handler, explained in the comments to patch #55: https://sourceforge.net/p/expat/patches/55/
    2. Yes, you are correct, they should not be reported.
    3. The default handler is intended to report what is in the document without any XML processing.
    4. The two missing frees are intentionally missing - to allow the application to keep the data structures around without copying them. There is a documented API called XML_FreeContentModel that the application must use.
     
  • John Doe
    John Doe
    2014-02-07

    1. That is exactly the bug I'm seeing. It causes a loss of information (in my use case) since you have no chance to figure out that the two strings presented by the default handler belong to a duplicate entity declaration. Reporting them through the entity decl handler would solve this issue, but this might be a problem for existing applications that don't expect duplicates being reported. That is why it might be suitable to add a new configuration option where you can set whether duplicates are reported or not. (I meant to write "entity" not "element" in my second post, sorry.)
    2. Right, nonetheless it is somewhat weird that some line breaks are not normalized, even though it is documented.
    3. Alright, my bad. It has been a while since I read that part of the docs.