[Htmlparser-developer] lexer integration
From: Derrick O. <Der...@Ro...> - 2003-10-25 16:05:23
Made all test suites self executable by moving the mainline into
ParserTestCase.
Handle some pathological remark nodes (Netscape handles way more, like
everything starting with <! so it seems).
Handle some broken end tags. TAG_ENDERS and END_TAG_ENDERS should be
revisited for all scanners.
Passes 512 of 522 tests.
TODO
=====
Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the
CompositeTagScannerHelper. It's close, it just needs some more untangling.
Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode
should look up the name of the node (from the attribute list provided),
and create specific types of tags (FormTag, TableTag etc.) by cloning
empty tags from a Hashtable of possible tag types (possibly called
mBlastocyst in reference to undifferentiated stem cells).
This would provide a concrete implementation of createTag in
CompositeTagScanner, removing a lot of near duplicate code from the
scanners, and allow end users to plug in their own tags via a call like
setTagFor ("BODY", new myBodyTag())
on the Parser. The end user wouldn't have to create or replace a scanner
to get their own tags out. Getting rid of the data package cleared up a
lot of questions regarding the interaction scanners have with tags. In
general, the scanner now creates the tag in a very straightforward
bean-like manner:
ret = new Div ();
ret.setPage (page);
ret.setStartPosition (start);
ret.setEndPosition (end);
ret.setAttributesEx (attributes);
ret.setStartTag (startTag);
ret.setEndTag (endTag);
ret.setChildren (children);
This is nearly always the same in every scanner; only the tag name is
different. The oddball cases have been highlighted with a
// special step here...
comment in the code. These special steps mostly revolve around
meta-information available in scanners only (i.e. base href), or
handling of nesting with a stack construct. It shouldn't be too much
trouble to make these all go away.
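To make the factory idea concrete, here is a rough sketch of the prototype
lookup described above. It is not code from the tree: SketchTag,
SketchBodyTag and PrototypeTagFactory are hypothetical stand-ins, the
createTagNode signature is simplified to a name and two positions, and a
HashMap is used where the text above says Hashtable. Only the setTagFor
name and the setter style come from this message.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a tag node; not the real htmlparser Tag. */
class SketchTag implements Cloneable {
    private String name;
    private int start;
    private int end;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getStartPosition() { return start; }
    public void setStartPosition(int start) { this.start = start; }
    public int getEndPosition() { return end; }
    public void setEndPosition(int end) { this.end = end; }

    /** Object.clone keeps the runtime class, so subclasses clone as themselves. */
    @Override
    public SketchTag clone() {
        try {
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: we implement Cloneable
        }
    }
}

/** Hypothetical user-supplied tag type, playing the role of myBodyTag. */
class SketchBodyTag extends SketchTag { }

/** Hypothetical factory: maps upper-cased tag names to empty prototype tags. */
class PrototypeTagFactory {
    // The "blastocyst": undifferentiated prototypes waiting to be cloned.
    private final Map<String, SketchTag> blastocyst = new HashMap<>();
    // Fallback prototype for names nobody has registered.
    private final SketchTag generic = new SketchTag();

    /** Register (or replace) the prototype used for the given tag name. */
    public void setTagFor(String name, SketchTag prototype) {
        blastocyst.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    /** Clone the prototype for this name and fill it in, bean-style. */
    public SketchTag createTagNode(String name, int start, int end) {
        String key = name.toUpperCase(Locale.ROOT);
        SketchTag ret = blastocyst.getOrDefault(key, generic).clone();
        ret.setName(key);
        ret.setStartPosition(start);
        ret.setEndPosition(end);
        return ret;
    }
}
```

With something like this, a call such as
factory.setTagFor ("BODY", new SketchBodyTag ()) would be enough for every
later createTagNode ("body", ...) to hand back a fresh SketchBodyTag, with
no scanner created or replaced. The 'special step' cases would still need
a home somewhere.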
Scanners
--------
The script scanner has been replaced. It can be considered as a first
pass at what needs to be done to replace the generic
CompositeTagScanner. The use of the underlying lexer makes these
specialty scanners much easier.
Unit Tests
----------
The remaining failing unit tests show up the changed functionality.
Each needs to be examined, a decision on the 'correct' behaviour made,
and the code or test altered accordingly.
Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than
providing any helpful advice. This needs to be reworked completely.
As you can see there's lots of work to do, so anyone with a death wish
can jump in. I'll be working my way from top to bottom of the JUnit
errors list and committing and notifying the developer list after each of
them. So go ahead and do a take from CVS and jump in the middle with
anything that appeals. Keep the list posted and update your CVS tree
often (or subscribe to the htmlparser-cvs mailing list for interrupt
driven notification rather than polled notification).