Tag Balancer fails to handle the following HTML:
The problem is the following:
- When the <TR> tag is processed, the balancer detects that there is no <TBODY> and tries to create it.
- The creation succeeds, but the caller does not detect this and ignores the <TR> tag.
- When <TD> is processed, it cannot find its parent <TR> (because it was ignored), so the balancer again forces the parent's creation.
- That creation also succeeds, but the caller again fails to detect the successful node creation and ignores the <TD> tag.
- When <TABLE> is processed, it does not need a parent, so it is inserted directly into the document, creating the following structure:
which is invalid, and browsers ignore the inner <table> tag.
The successful creation goes undetected because of the QName handling. When creation of the parent <TBODY> node is forced, the QName passed in has no URI and no prefix. The TBODY tag is pushed onto the element stack and the rest of the filters are called. One of them is the namespace binder, which populates the URI and prefix fields of the original QName instance, the one that was passed to the forceStartElement() method. However, the QName pushed onto the element stack is a copy, so that instance still has no URI or prefix. When forceStartElement() checks whether an element was created, it compares the two QName instances, which now differ in their URI and prefix.
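The copy-versus-original mismatch described above can be sketched in isolation. This is a minimal illustration, not the balancer's actual code: the `QName` class, `bindNamespace`, and both `forceStartElement...` methods are hypothetical stand-ins, and the XHTML namespace URI is an assumed value. The raw-name comparison in the second method is just one possible way to make the check robust, not necessarily the fix the patch will use.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;

class QNameCopyDemo {
    /** Hypothetical stand-in for a mutable qualified name. */
    static final class QName {
        String prefix, localpart, rawname, uri;

        QName(String prefix, String localpart, String rawname, String uri) {
            this.prefix = prefix;
            this.localpart = localpart;
            this.rawname = rawname;
            this.uri = uri;
        }

        /** Copy constructor: snapshots the fields as they are right now. */
        QName(QName other) {
            this(other.prefix, other.localpart, other.rawname, other.uri);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof QName)) return false;
            QName q = (QName) o;
            return Objects.equals(prefix, q.prefix)
                && Objects.equals(localpart, q.localpart)
                && Objects.equals(rawname, q.rawname)
                && Objects.equals(uri, q.uri);
        }

        @Override
        public int hashCode() {
            return Objects.hash(prefix, localpart, rawname, uri);
        }
    }

    static final Deque<QName> elementStack = new ArrayDeque<>();

    /** Stand-in for the namespace binder: mutates the caller's instance. */
    static void bindNamespace(QName q) {
        q.uri = "http://www.w3.org/1999/xhtml"; // assumed value for illustration
    }

    /** Mirrors the reported behaviour: a full comparison after the binder ran. */
    static boolean forceStartElementBuggy(QName elem) {
        elementStack.push(new QName(elem)); // the stack holds a copy without a URI
        bindNamespace(elem);                // a downstream filter fills in the original
        return elem.equals(elementStack.peek());
    }

    /** One possible alternative: compare only the raw name, which the binder leaves alone. */
    static boolean forceStartElementByRawName(QName elem) {
        elementStack.push(new QName(elem));
        bindNamespace(elem);
        return Objects.equals(elem.rawname, elementStack.peek().rawname);
    }

    public static void main(String[] args) {
        System.out.println(forceStartElementBuggy(new QName(null, "TBODY", "TBODY", null)));     // false
        System.out.println(forceStartElementByRawName(new QName(null, "TBODY", "TBODY", null))); // true
    }
}
```

The first call reports failure even though the element is on the stack, which is the symptom described in the steps above.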
The workaround is to disable the http://xml.org/sax/features/namespaces feature.
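As a rough sketch of the workaround, the feature can be switched off through the standard SAX `XMLReader.setFeature` call. The helper name `disableNamespaces` is hypothetical, and the `main` method uses the JDK's built-in SAX reader only to keep the example self-contained; with the HTML parser itself, the same feature URI would be passed to whatever feature-setting method that parser exposes.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

class NamespaceWorkaround {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    /** Hypothetical helper: turns namespace processing off on any SAX reader. */
    static void disableNamespaces(XMLReader reader) throws SAXException {
        reader.setFeature(NAMESPACES, false);
    }

    public static void main(String[] args) throws Exception {
        // JDK reader used purely for demonstration; substitute the HTML parser in practice.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        disableNamespaces(reader);
        System.out.println(reader.getFeature(NAMESPACES));
    }
}
```

With namespace processing off, the namespace binder has nothing to fill in, so the original QName and its copy on the element stack stay identical.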
If I find time, I'll update the test case with a patch.