[Htmlparser-developer] lexer integration
From: Derrick O. <Der...@Ro...> - 2003-11-08 22:41:14
To replace the string filtering based on constants in the scanner
classes, I've implemented generic node filtering based on a NodeFilter
interface. Some example filters have been added to the new filter
package to give everyone an idea of how it can be used. This may be
pushed down to the lexer level if only a restricted subset of filters
is allowed.
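
For example, a one-off filter is just a class implementing the single
accept() method; a minimal sketch (the import locations may shift
while the filter package settles, and the site prefix is purely
illustrative):

    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.tags.LinkTag;

    // A sketch of a one-off filter: accept only links into a
    // particular site.
    public class LocalLinkFilter implements NodeFilter
    {
        public boolean accept (Node node)
        {
            if (!(node instanceof LinkTag))
                return (false);
            String link = ((LinkTag)node).getLink ();
            return ((null != link) && link.startsWith ("http://sourceforge.net"));
        }
    }
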
Tag-specific scanners are now only used to set up the tags in the
prototype list; except for ScriptTag, the tags all use one of two
common scanners, either a TagScanner or a CompositeTagScanner, that
are statically allocated by the tag base classes.
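
In outline, the arrangement is something like this sketch (class,
package and accessor names approximate the current tree; don't take
them as gospel):

    import org.htmlparser.scanners.CompositeTagScanner;
    import org.htmlparser.scanners.Scanner;
    import org.htmlparser.tags.CompositeTag;

    // Sketch: a composite tag hands out the one statically allocated
    // scanner instead of owning a scanner subclass of its own.
    public class MyTag extends CompositeTag
    {
        protected static final Scanner mScanner = new CompositeTagScanner ();

        public Scanner getThisScanner ()
        {
            return (mScanner);
        }
    }
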
I got rid of the node lookahead in the parser. It was used to
determine the character set for reading the stream before handing out
any nodes that might turn out to be erroneous, but with some sleight
of hand at the stream/source level we can still hide most of that from
the user by performing the character set change in the
doSemanticAction() method of the META tag. This means the META tag
should always be registered (without it, character sets may be handled
erroneously when the HTTP header is incorrect, just as with the raw
Lexer). This change makes the IteratorImpl class much simpler. The old
IteratorImpl has moved to PeekingIteratorImpl but is deprecated, as is
the PeekingIterator interface.
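
In rough outline the META tag's semantic action is the sketch below;
the getPage().getCharset()/setEncoding() plumbing named here is my
shorthand for the stream/source trick, not a committed API:

    import org.htmlparser.tags.MetaTag;
    import org.htmlparser.util.ParserException;

    // Sketch: when a META tag declaring the content type is scanned,
    // reset the encoding on the underlying page, which rewinds the
    // source to reread with the new character set.
    public class CharsetMetaTag extends MetaTag
    {
        public void doSemanticAction () throws ParserException
        {
            String equiv = getAttribute ("HTTP-EQUIV");
            String content = getAttribute ("CONTENT");
            if ("Content-Type".equalsIgnoreCase (equiv) && (null != content))
                getPage ().setEncoding (getPage ().getCharset (content));
        }
    }
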
Some side effects:
The mainline of the parser now looks different. Instead of -i, -l,
etc. switches, the user specifies the node name directly, i.e.:

    java -jar htmlparser.jar org.htmlparser.Parser IMG

and it really works now.
In the past, the parser avoided treating tags like "<a
name=target>yadda</a>" as links because they have no HREF attribute.
However, this is valid HTML for a destination anchor referenced from
some other location, i.e. <a href="#target">see yadda</a>. This
special logic in the LinkScanner is no longer used and will go away
when the LinkScanner does. This means there is no longer any need to
check the evaluate() method before scanning tags (at least there's no
reason for it at this time), so it can probably be removed. But,
caveat emptor, the parser can now return LinkTags where
linktag.getLink() should (and eventually will) return null.
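
So client code that walks links needs a null guard now, something
like (nesting ignored for brevity):

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.ParserException;

    // List link URLs, skipping destination anchors (<a name=target>)
    // that come back as LinkTags without an HREF.
    public class LinkLister
    {
        public static void main (String[] args) throws ParserException
        {
            Parser parser = new Parser (args[0]);
            for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
            {
                Node node = e.nextNode ();
                if (node instanceof LinkTag)
                {
                    String link = ((LinkTag)node).getLink ();
                    if (null != link) // null (eventually) for destination anchors
                        System.out.println (link);
                }
            }
        }
    }
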
p.s. Is any of this stuff I'm spewing useful? There's very little
feedback from anybody.
TODO
====
Remove Scanners
---------------
Finish off obviating the scanners. Think of a good way to group tags
so that adding one tag to the list of tags returned by the parser also
adds its buddies, i.e. the way the Form scanner now adds the Input,
TextArea, Selection and Option scanners behind the scenes for you.
Then replace the add, remove, get, etc. scanner methods on the parser
with comparable tag-based ones, alter all the test cases to use the
new methods, move the unique scanner test cases into tag test cases,
and then delete most of the scannersTests package.
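
The target is something like this hypothetical registration API
(every name here is invented for discussion; none of it exists yet):

    import org.htmlparser.Parser;
    import org.htmlparser.tags.FormTag;

    // Hypothetical sketch of tag-based registration replacing the
    // addScanner()/removeScanner() family; registerTag() and the
    // buddy-group behaviour are inventions for illustration.
    public class Registration
    {
        public static void main (String[] args) throws Exception
        {
            Parser parser = new Parser (args[0]);
            // registering FORM would pull in its buddies -- INPUT,
            // TEXTAREA, SELECT and OPTION -- the way the Form scanner
            // adds the dependent scanners behind the scenes today
            parser.registerTag (new FormTag ());
        }
    }
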
Filters
-------
Implement the new filtering mechanism for NodeList.searchFor().
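
The mechanism itself is just a linear pass handing each node to a
NodeFilter; a sketch, shown as a static helper rather than the
eventual NodeList method:

    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.util.NodeList;

    // Sketch of the filtering to be folded into NodeList.searchFor():
    // keep exactly those nodes the filter accepts.
    public class Filtering
    {
        public static NodeList filter (NodeList list, NodeFilter predicate)
        {
            NodeList ret = new NodeList ();
            for (int i = 0; i < list.size (); i++)
                if (predicate.accept (list.elementAt (i)))
                    ret.add (list.elementAt (i));
            return (ret);
        }
    }
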
Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than
providing any helpful advice. This needs to be reworked completely.
Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle
JSP constructs and also whitespace on either side of attribute equals
signs. Currently the latter is handled by a kludgy fixAttributes()
method applied after a tag is parsed, but it would be better handled
in the state machine initially. The former isn't handled at all, and
would involve all nodes possibly having children (a remark or string
node can have embedded JSP, i.e. <!-- this remark, created on <%@
date() %>, needs to be handled -->). So some design work needs to be
done to analyze the state transitions and gating characters.
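
For the whitespace case, the fix amounts to a couple of extra states
so the machine can idle on whitespace around the equals sign instead
of mis-splitting the attribute. A much-simplified illustration
(quoted values omitted, and in no way the Lexer's actual tables):

    // Much-simplified sketch of the extra attribute states needed;
    // quoted values are omitted.
    public class AttributeStates
    {
        static final int BETWEEN = 0;      // between attributes
        static final int IN_NAME = 1;      // accumulating a name
        static final int AFTER_NAME = 2;   // whitespace after the name
        static final int AFTER_EQUALS = 3; // '=' seen, maybe more whitespace
        static final int IN_VALUE = 4;     // accumulating an unquoted value

        static int next (int state, char ch)
        {
            boolean space = Character.isWhitespace (ch);
            switch (state)
            {
                case BETWEEN:
                    return (space ? BETWEEN : IN_NAME);
                case IN_NAME:
                    return (('=' == ch) ? AFTER_EQUALS : space ? AFTER_NAME : IN_NAME);
                case AFTER_NAME: // an '=' here still binds to the last name
                    return (('=' == ch) ? AFTER_EQUALS : space ? AFTER_NAME : IN_NAME);
                case AFTER_EQUALS:
                    return (space ? AFTER_EQUALS : IN_VALUE);
                default: // IN_VALUE
                    return (space ? BETWEEN : IN_VALUE);
            }
        }
    }
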
toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to
regurgitate the original HTML via the toHtml() method, so the original
page is unmodified except for any explicit user edits, i.e. link URL
edits. But the parser fixes broken HTML without asking, so you can't get
back an unadulterated page from toHtml(). A lot of test cases assume
fixed HTML. Either a parameter on toHtml() or another method would be
needed to provide the choice of the original HTML or the fixed HTML.
There's some initial work on eliminating the added virtual end tags
commented out in TagNode, but it will also require a way to remember
broken tags, like ...<title>The Title</title</head><body>...
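
i.e. something along the lines of this hypothetical overload (the
parameter and interface names are invented for discussion):

    import org.htmlparser.Node;

    // Hypothetical: the verbatim/fixed choice as an overload.
    public interface VerbatimNode extends Node
    {
        // verbatim true: the original characters from the page, with
        // no synthesized virtual end tags or other repairs;
        // verbatim false: the fixed-up HTML, as toHtml() emits now
        String toHtml (boolean verbatim);
    }
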
GUI Parser Tool
---------------
A GUI-based parser application showing the HTML parse tree in one
panel and the HTML text in another, with the selected tree node
highlighted in the text (or the text cursor setting the selected tree
node), would be really good. A filter builder tool to graphically
construct a program to extract a snippet from an HTML page would blow
people away.
Applications
------------
Rework all the applications for a better 'out of the box' experience
for new and novice users. Fix all the scripts in /bin (for Unix and
Windows) and add any that don't already exist.
As you can see, there's lots of work to do, so anyone with a death
wish can jump in. Go ahead and grab the source from CVS and dive into
the middle of anything that appeals. Keep the list posted, and update
your CVS tree often (or subscribe to the htmlparser-cvs mailing list
for interrupt-driven notification rather than polled notification).