Thread: [Htmlparser-developer] future directions

The htmlparser project is quite successful, with many, many users (over 
15,000 downloads) and now 17 developers.
It's time to consider where it goes from here. Here are some thoughts.

<Restructuring>

To initiate discussion, I propose a restructuring. One model of language 
processing identifies three levels so I'll briefly define these in terms 
htmlparser people can relate to:

lexical level
- low level, identifies character encoding, whitespace, tokens, lines
- currently implemented in Parser, NodeReader and Tag

syntactic level
- mid level, identifies tags, nesting, attributes, text
- this is the raison d'être of the htmlparser, currently implemented in 
scanners, especially TagScanner, AttributeParser, 
CompositeTagScannerHelper, StringParser and TagParser

semantic level
- high level, identifies meaning
- currently not implemented, but this is where all the 'action' is, 
regarding page layout, script, JSP, tables and other actual content

So far, htmlparser has not been in the semantic business, but I would 
think that the best parsing can be done when higher level knowledge is 
brought to bear. This borders on AI and applies when missing end tags 
and malformed tags or attributes are encountered.  It seems logical to 
create three packages to contain the three levels and extract the 
corresponding functionality out of the many places it currently is and 
put it into the appropriate centralized location.

Here are some broad specifics to use as a basis for design documents:

- move all things character and stream related to a new lexer package
  - would return a contiguous stream of 'Token' objects that contain 
    only absolute character offsets
  - would answer questions about line number and be able to extract 
    portions of the stream
  - would implement 'cursors' to save and restore machine state
  - operates at the character level
- repackage the parserHelper package to be what it really is, the syntax 
  package
  - would return a contiguous stream of Tag objects
  - would implement 'bookmarks' to save and restore machine state
  - operates at the string level
- create a semantics package that contains the high level knowledge 
  about HTML
  - this level would return a contiguous stream of (possibly nested) 
    nodes
  - operates at the document type definition level
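
To make the lexer bullet points concrete, here is a rough sketch of 
what such a class might look like. The names Token, SketchLexer, 
saveCursor and restoreCursor are made up for illustration and are not 
part of the existing htmlparser code; the tokenizing rule (split on '<' 
and '>') is deliberately naive.

```java
/** Hypothetical token: holds only absolute character offsets into the page. */
final class Token {
    final int start; // inclusive absolute offset
    final int end;   // exclusive absolute offset

    Token(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical lexer sketch; not part of the existing htmlparser API. */
final class SketchLexer {
    private final String page;
    private int position; // the whole machine state, so a cursor is just an int

    SketchLexer(String page) {
        this.page = page;
    }

    /** Returns the next token, or null at end of input. Tokens are contiguous. */
    Token nextToken() {
        if (position >= page.length()) {
            return null;
        }
        int start = position;
        if (page.charAt(start) == '<') {
            // Tag-like token: runs to the closing '>' or to end of input.
            int close = page.indexOf('>', start);
            position = (close < 0) ? page.length() : close + 1;
        } else {
            // Text token: runs up to the next '<' or to end of input.
            int open = page.indexOf('<', start);
            position = (open < 0) ? page.length() : open;
        }
        return new Token(start, position);
    }

    /** Saves the machine state. */
    int saveCursor() {
        return position;
    }

    /** Restores a previously saved machine state. */
    void restoreCursor(int cursor) {
        position = cursor;
    }

    /** Extracts the portion of the stream covered by a token. */
    String text(Token token) {
        return page.substring(token.start, token.end);
    }

    /** 1-based line number of an absolute offset, counting '\n' characters. */
    int lineOf(int offset) {
        int line = 1;
        for (int i = 0; i < offset && i < page.length(); i++) {
            if (page.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }
}
```

Because tokens carry offsets rather than strings, a higher level can 
always ask for the text or the line number later, and saving a cursor 
costs nothing more than remembering one integer.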

</Restructuring>

<Error Handling>

Dirty HTML and the nature of parsing means error handling is integral to 
operation, so a review of error handling is in order with an eye to a 
comprehensive policy, the tenets of which might include:

1) The level (class) that discovers the error makes its best effort 
to fix it. This is already mostly in place. It just needs systematic 
thought and robust utility methods to handle pathological cases, e.g. 
end of file, no expected token, quotes in odd places, etc. Exceptions 
should be handled locally as much as possible. Anything not generated by 
parser code, i.e. not a ParserException, is handled gracefully by the 
code closest to the throw. This applies to IOExceptions and other 
explicit (checked) exceptions, but also to runtime exceptions and 
errors such as ArrayIndexOutOfBoundsException, OutOfMemoryError and 
IllegalArgumentException.
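
As an illustration of the kind of utility method meant in 1), here is a 
small sketch that reads a quoted attribute value and repairs a missing 
closing quote locally instead of throwing. The class AttributeRepair 
and the method readQuotedValue are hypothetical, and the fallback rule 
(stop at the end of the tag, else at end of file) is just one 
assumption about what "best effort" could mean.

```java
/** Hypothetical utility; not part of the existing htmlparser API. */
final class AttributeRepair {
    /**
     * Reads a quoted attribute value whose opening quote is at index 'open'.
     * If the closing quote is missing, falls back to the end of the tag ('>')
     * or, failing that, to the end of input, rather than throwing.
     */
    static String readQuotedValue(String input, int open) {
        char quote = input.charAt(open);
        int close = input.indexOf(quote, open + 1);
        if (close < 0) {
            // Pathological case: no matching quote anywhere in the input.
            int tagEnd = input.indexOf('>', open + 1);
            close = (tagEnd < 0) ? input.length() : tagEnd;
        }
        return input.substring(open + 1, close);
    }
}
```

Note that this only covers the case where no matching quote exists at 
all; a stray quote further down the page would still be taken as the 
closing one, which is exactly where the higher levels would have to 
step in.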

2) An error that can't be fixed is taken to mean that the wrong higher 
level (semantic or syntactic) assumptions have been made, and the 
parser backs up to a point where an alternate interpretation is 
possible. This means being able to restore the state of the machine to 
a prior 'known good state'.

3) Errors should be couched in absolute character location terms so 
intelligent choices can be made about syntactic trees. If a supervisor 
routine is used, it can explore tree depths and stream consumption. The 
rule might be "choose the interpretation that extracts the most nodes" 
or "choose the interpretation that advances furthest into the file 
before discovering an error".
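
A supervisor along these lines might be sketched as follows, using the 
"advances furthest into the file" rule. Interpretation, Attempt and 
Supervisor are invented names, and the assumption that each 
interpretation can simply report the absolute offset it reached is a 
simplification.

```java
import java.util.List;

/** Hypothetical: one way of interpreting the input from a known good state. */
interface Interpretation {
    String name();

    /**
     * Parses from the absolute offset 'start' and returns the absolute offset
     * reached before the first error (the input length if no error occurs).
     */
    int parse(String input, int start);
}

/** Hypothetical record of the winning attempt. */
final class Attempt {
    final String name;
    final int errorOffset; // absolute character location of the first error

    Attempt(String name, int errorOffset) {
        this.name = name;
        this.errorOffset = errorOffset;
    }
}

/** Hypothetical supervisor; not part of the existing htmlparser API. */
final class Supervisor {
    /**
     * Tries every candidate from the same known good offset and keeps the one
     * that advances furthest. Returns null if there are no candidates; on a
     * tie the earlier candidate in the list wins.
     */
    static Attempt chooseFurthest(String input, int knownGoodOffset,
                                  List<Interpretation> candidates) {
        Attempt best = null;
        for (Interpretation candidate : candidates) {
            // Each attempt restarts from the known good state.
            int reached = candidate.parse(input, knownGoodOffset);
            if (best == null || reached > best.errorOffset) {
                best = new Attempt(candidate.name(), reached);
            }
        }
        return best;
    }
}
```

The "extracts the most nodes" rule would look the same, with a node 
count returned in place of the offset.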

</Error Handling>

<Testing>

The existing ad hoc test cases and examples of dirty HTML are good for 
development, but perhaps a more thorough suite of tests needs to be 
created for real world use.

Based on the strict HTML document type definition at 
http://www.w3.org/TR/html4/strict.dtd, a test case generator could be 
constructed that would read the DTD and create a tree of possible HTML 
syntax. Starting from that correct HTML and a few dozen permutation 
possibilities (e.g. inserting string content, dropping a node, 
injecting a bogus node, injecting a comment, altering a node by a 
character, truncating input, starting midway through a construct, 
etc.), it would throw tens of thousands of tests at the parser and 
record the test HTML and any unexpected failure mode, e.g. 
"<head><body>Hello world</bod" - failed to replace </bod with </body>.
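
The permutation half of such a generator could start out as simply as 
the sketch below. It does not read the DTD at all; it takes one piece 
of known-correct HTML and applies three of the permutations listed 
above (truncating input, altering by a character, injecting a comment 
or bogus node). The class name HtmlMutator and the injected strings are 
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical test input generator; not part of the existing htmlparser API. */
final class HtmlMutator {
    /** Returns mutated variants of a known-correct HTML string. */
    static List<String> mutate(String html) {
        List<String> variants = new ArrayList<String>();

        // Truncating input: every proper, non-empty prefix.
        for (int i = 1; i < html.length(); i++) {
            variants.add(html.substring(0, i));
        }

        // Altering by a character: drop each character in turn.
        for (int i = 0; i < html.length(); i++) {
            variants.add(html.substring(0, i) + html.substring(i + 1));
        }

        // Injecting a comment and a bogus node in front of every '<'.
        for (int i = html.indexOf('<'); i >= 0; i = html.indexOf('<', i + 1)) {
            variants.add(html.substring(0, i) + "<!-- injected -->" + html.substring(i));
            variants.add(html.substring(0, i) + "<bogus>" + html.substring(i));
        }
        return variants;
    }
}
```

Each variant would then be fed to the parser, with the input and any 
unexpected failure recorded. Even this naive version produces a number 
of cases roughly proportional to twice the input length, so a DTD 
driven tree of correct inputs would reach tens of thousands quickly.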

</Testing>
