Tracker: Bugs

5 can not distinguish the external DTD subset. - ID: 909349
Last Update: Settings changed ( maartenc )

Problem:

dom4j can not distinguish the external DTD subset when
using the aelfred parser.

Background:

The external DTD subset consists of those element,
attribute, general entity and parameter entity
declarations that appear in a remote DTD resource rather
than inline within the <!DOCTYPE foo [...]>.

The aelfred XmlParser class does not make the
distinction between the internal and external DTD
subsets available via the SAXDriver, so the application
can not observe these DTD subsets distinctly.

Since dom4j can not distinguish the external DTD subset
when using the aelfred parser, ALL processed
declarations show up as part of the internal DTD
subset. Therefore, if you specify
includeInternalDTDDeclarations and
includeExternalDTDDeclarations, the internal DTD subset
will contain BOTH the internal and external declarations
and the external DTD subset will be null.

The problem may (or may not) be that SAX itself does
not provide for this distinction. That is a separate
question and SHOULD be directed to the SAX folks.

Workaround:

There are a couple of ways that you can try to work
around this problem. All are kludges. The basic pattern
is to parse the document twice: once asking only for the
internal DTD subset, and once asking only for the
external DTD subset. On the second parse, the _external_
DTD subset will be reported as the _internal_ DTD
subset.

Recommendation:

Either (a) drop the includeExternalDTDDeclarations flag or (b)
modify the Aelfred XmlParser in a SAX compliant manner
to report the SAX2 LexicalHandler events in two
sections: (1) the internal DTD subset; and (2) the
external DTD subset.

In the org.dom4j.io.aelfred package, see:

XmlParser#parseDoctypedecl(), which first parses the
internal DTD subset and then the external DTD subset,
but does NOT report these two DTD subsets distinctly.

SAXDriver#deliverDTDEvents(), which delivers the
appropriate SAX events for the parsed DTD information.
However, since aelfred does not differentiate the
internal and external DTD subsets, this method is unable
to deliver any SAX events that might allow a SAX event
consumer to identify those declarations that are properly
part of the internal DTD subset vs those that are
properly part of the external DTD subset.

In the org.dom4j.io package, see:

SAXContentHandler#startEntity( String name )

and

SAXContentHandler#endEntity( String name )

These methods attempt to create a distinction between
the internal and external DTD subsets. I can't judge
offhand if this logic is correct and the aelfred parser is
broken, or if this logic is incorrect per the SAX2
LexicalHandler contract.
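
For reference, the SAX2 LexicalHandler documentation states that the
external DTD subset is reported through startEntity/endEntity with the
pseudo-name "[dtd]". The sketch below only illustrates that contract. It
runs against the JDK's bundled SAX parser, not aelfred, and it is not the
dom4j SAXContentHandler code. The class name DtdSubsetDemo, the sample
document, and the in-memory "foo.dtd" are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; not part of dom4j or aelfred.
class DtdSubsetDemo extends DefaultHandler2 {
    final List<String> internal = new ArrayList<>();
    final List<String> external = new ArrayList<>();
    private boolean inExternalSubset;

    // LexicalHandler: "[dtd]" is the pseudo-name for the external DTD subset.
    @Override public void startEntity(String name) {
        if ("[dtd]".equals(name)) inExternalSubset = true;
    }

    @Override public void endEntity(String name) {
        if ("[dtd]".equals(name)) inExternalSubset = false;
    }

    // DeclHandler: file each declaration under the subset currently open.
    @Override public void elementDecl(String name, String model) {
        record("element " + name);
    }

    @Override public void internalEntityDecl(String name, String value) {
        record("entity " + name);
    }

    private void record(String decl) {
        (inExternalSubset ? external : internal).add(decl);
    }

    // Serve a made-up external subset from memory so no file is needed.
    @Override public InputSource resolveEntity(String name, String publicId,
                                               String baseURI, String systemId) {
        return new InputSource(new StringReader(
            "<!ELEMENT foo (#PCDATA)>\n<!ENTITY ext \"from external subset\">"));
    }

    // Parses the XML text and returns the handler holding both lists.
    static DtdSubsetDemo parse(String xml) {
        try {
            DtdSubsetDemo handler = new DtdSubsetDemo();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setEntityResolver(handler);
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.setProperty(
                "http://xml.org/sax/properties/declaration-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (Exception e) {
            // Wrapped only to keep the demo call site short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DtdSubsetDemo d = parse("<!DOCTYPE foo SYSTEM \"foo.dtd\" ["
            + "<!ENTITY int \"from internal subset\">]><foo/>");
        System.out.println("internal: " + d.internal);
        System.out.println("external: " + d.external);
    }
}
```

If a parser delivers the "[dtd]" entity events this way, a consumer can
keep the two subsets apart in a single parse. A parser that never emits
them would leave every declaration in the internal list, which matches
the symptom described above.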


On a related note, I have written a set of tests that
demonstrate this problem and which I will submit as a
patch to the appropriate tracker. This patch will also
bundle some minor bug fixes for the serialization of DTD
declarations, e.g., parameter entities were not
correctly serialized.


Submitted: Bryan Thompson ( thompsonbry ) - 2004-03-03 21:46
Priority: 5
Status: Closed
Resolution: Fixed
Assigned To: Maarten Coene
Category: None
Group: None
Visibility: Public


Comment ( 1 )

Date: 2004-03-19 20:31
Sender: maartenc (Project Admin)

Logged In: YES
user_id=178745

I've added a new version of the aelfred parser (you can find
it in the org.dom4j.aelfred2 package) which supports
internal/external DTD subsets.

Thanks!


Comments have been closed for this artifact.

Attached File

No Files Currently Attached

Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2004-03-19 20:31  maartenc
resolution_id  None       2004-03-19 20:31  maartenc
close_date     -          2004-03-19 20:31  maartenc
assigned_to    nobody     2004-03-19 13:38  maartenc