Thread: [Htmlparser-developer] future directions
|
From: Derrick O. <Der...@ro...> - 2003-04-18 15:49:26
|
The htmlparser project is quite successful, with many, many users (over
15,000 downloads) and now 17 developers. It's time to consider where it
goes from here. Here are some thoughts.

<Restructuring>

To initiate discussion, I propose a restructuring. One model of language
processing identifies three levels, so I'll briefly define these in terms
htmlparser people can relate to:

lexical level
 - low level, identifies character encoding, whitespace, tokens, lines
 - currently implemented in Parser, NodeReader and Tag

syntactic level
 - mid level, identifies tags, nesting, attributes, text
 - this is the raison d'être of the htmlparser, currently implemented in
   scanners, especially TagScanner, AttributeParser,
   CompositeTagScannerHelper, StringParser and TagParser

semantic level
 - high level, identifies meaning
 - currently not implemented, but this is where all the 'action' is,
   regarding page layout, script, JSP, tables and other actual content

So far, htmlparser has not been in the semantic business, but I would
think that the best parsing can be done when higher level knowledge is
brought to bear. This borders on AI and applies when missing end tags and
malformed tags or attributes are encountered.

It seems logical to create three packages to contain the three levels,
extract the corresponding functionality from the many places it currently
lives, and put it into the appropriate centralized location.
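To make the lexical level concrete, here is a minimal sketch of a lexer that emits a contiguous stream of tokens holding only absolute character offsets, and that can save and restore its position. It is an illustration only: the names Token, OffsetLexer, saveCursor, restoreCursor, text and lineOf are hypothetical and are not part of the htmlparser API, and the tokenizing rule (split on '<' and '>') is a deliberate simplification.

```java
/** Hypothetical token: records only absolute offsets into the source. */
final class Token {
    final int start;   // inclusive absolute offset
    final int end;     // exclusive absolute offset
    final boolean tag; // true when the span begins with '<'

    Token(int start, int end, boolean tag) {
        this.start = start;
        this.end = end;
        this.tag = tag;
    }
}

/** Hypothetical lexer sketch; every character belongs to exactly one token. */
final class OffsetLexer {
    private final String source;
    private int position;

    OffsetLexer(String source) {
        this.source = source;
    }

    /** In this sketch a 'cursor' is simply the saved read position. */
    int saveCursor() {
        return position;
    }

    void restoreCursor(int cursor) {
        position = cursor;
    }

    /** Returns the next token, or null at end of input. */
    Token next() {
        if (position >= source.length()) {
            return null;
        }
        int start = position;
        boolean tag = source.charAt(start) == '<';
        if (tag) {
            int close = source.indexOf('>', start);
            // An unterminated tag runs to end of input rather than throwing.
            position = (close < 0) ? source.length() : close + 1;
        } else {
            int open = source.indexOf('<', start);
            position = (open < 0) ? source.length() : open;
        }
        return new Token(start, position, tag);
    }

    /** Extracts the portion of the stream covered by a token. */
    String text(Token token) {
        return source.substring(token.start, token.end);
    }

    /** 1-based line number of an absolute offset, counting '\n' only. */
    int lineOf(int offset) {
        int line = 1;
        for (int i = 0; i < offset && i < source.length(); i++) {
            if (source.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }
}
```

Because tokens carry offsets rather than copied strings, a higher level can back up by restoring a cursor and re-reading the same characters under a different interpretation.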
Here are some broad specifics to use as a basis for design documents:

move all things character and stream related to a new lexer package
 - would return a contiguous stream of 'Token' objects that contain only
   absolute character offsets
 - would answer questions about line number and be able to extract
   portions of the stream
 - would implement 'cursors' to save and restore machine state
 - operates at the character level

repackage the parserHelper package to be what it really is, the syntax
package
 - would return a contiguous stream of Tag objects
 - would implement 'bookmarks' to save and restore machine state
 - operates at the string level

create a semantics package that contains the high level knowledge about
HTML
 - this level would return a contiguous stream of (possibly nested) nodes
 - operates at the document type definition level

</Restructuring>

<Error Handling>

Dirty HTML and the nature of parsing mean error handling is integral to
operation, so a review of error handling is in order with an eye to a
comprehensive policy, the tenets of which might include:

1) The level (class) that discovers the error makes its best effort to
fix it. This is already mostly in place. It just needs systematic thought
and robust utility methods to handle pathological cases, e.g. end of
file, no expected token, quotes in odd places, etc. Exceptions should be
handled locally as much as possible. Anything not generated by parser
code, i.e. not a ParserException, is handled gracefully by the code
closest to the throw. This applies to IOExceptions and other explicit
exceptions, but also to runtime exceptions and errors such as
ArrayIndexOutOfBoundsException, OutOfMemoryError,
IllegalArgumentException, etc.

2) When an error can't be fixed, the parser assumes that the wrong higher
level (semantic or syntactic) assumptions have been made and backs up to
a point where an alternate interpretation is possible. This means being
able to restore the state of the machine to a prior 'known good state'.
3) Errors should be couched in absolute character location terms so
intelligent choices can be made about syntactic trees. If a supervisor
routine is used, it can explore tree depths and stream consumption. The
rule might be "choose the interpretation that extracts the most nodes" or
"choose the interpretation that advances furthest into the file before
discovering an error".

</Error Handling>

<Testing>

The existing ad hoc test cases and examples of dirty HTML are good for
development, but perhaps a more thorough suite of tests needs to be
created for real world use.

Based on the strict HTML document type definition at
http://www.w3.org/TR/html4/strict.dtd, a test case generator could be
constructed that would read the dtd, create a tree of possible HTML
syntax, and then based on that correct HTML and a few dozen permutation
possibilities (e.g. inserting string content, dropping a node, injecting
a bogus node, injecting a comment, altering a node by a character,
truncating input, starting midway through a construct etc.) throw tens
of thousands of tests at the parser and record the test HTML and any
unexpected failure mode, e.g. "<head><body>Hello world</bod" - failed to
replace </bod with </body>.

</Testing>
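The permutation step of such a generator could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a design: the class name HtmlMutator and its mutate method are hypothetical, the seed document is supplied by hand instead of being derived from the DTD, and only a handful of the permutations listed above are covered (truncating input, starting midway, altering by one character, injecting a comment, injecting a bogus node).

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical mutation step of a test case generator (illustration only). */
final class HtmlMutator {

    /** Returns the distinct mutated variants of a seed document. */
    static Set<String> mutate(String seed) {
        Set<String> cases = new LinkedHashSet<>();

        for (int i = 1; i < seed.length(); i++) {
            // Truncate the input after i characters.
            cases.add(seed.substring(0, i));
            // Start midway through a construct, at offset i.
            cases.add(seed.substring(i));
            // Alter by one character: delete the character at offset i.
            cases.add(seed.substring(0, i) + seed.substring(i + 1));
        }

        // Inject a comment and a bogus node in front of every '<'.
        for (int i = seed.indexOf('<'); i >= 0; i = seed.indexOf('<', i + 1)) {
            String head = seed.substring(0, i);
            String tail = seed.substring(i);
            cases.add(head + "<!-- injected -->" + tail);
            cases.add(head + "<bogus>" + tail);
        }
        return cases;
    }
}
```

Fed the seed "<head><body>Hello world</body>", the truncation loop produces, among others, the "<head><body>Hello world</bod" case quoted above. Each variant would then be handed to the parser and the input recorded alongside any unexpected failure.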
|
From: Somik R. <so...@ya...> - 2003-04-20 03:20:46
|
Hi Derrick,
I am glad you brought this up.
> The htmlparser project is quite successful, with many, many users (over
> 15,000 downloads) and now 17 developers.
We actually ended up in pos 18 when I last checked. All of the credit goes
to all the developers (and a significant part to you) and the user community
who have given us feedback.
I agree with most of your ideas. I shall therefore only write about
where I don't agree.
<Testing>
>
> The existing ad hoc test cases and examples of dirty HTML are good for
> development, but perhaps a more thorough suite of tests needs to be
> created for real world use.
>
> Based on the strict HTML document type definition at
> http://www.w3.org/TR/html4/strict.dtd, a test case generator could be
> constructed that would read the dtd, create a tree of possible HTML
> syntax, and then based on that correct HTML and a few dozen permutation
> possibilities (i.e. insert string content, dropping a node, injecting a
> bogus node, injecting a comment, altering a node by a character,
> truncating input, starting midway through a construct etc.) throw tens
> of thousands of tests at the parser and record the test HTML and any
> unexpected failure mode, i.e. "<head><body>Hello world</bod" - failed to
> replace </bod with </body>.
>
</Testing>
The testcases in the parser are anything but ad-hoc. I have specifically
avoided going down the path of the DTD. Lots of parsers exist which follow
the DTD to the letter and fail miserably on real-world html. It is not
difficult to construct a "correct" htmlparser using a grammar and javacc.
What differentiates us is that we have a massive testcase database from
real-world html which took almost 3 years to accumulate. It is this alone
that gives us the confidence to totally change the design at any given time
and still know that we're doing it right. This is what allowed me to
redesign the CompositeTagScanner, and not break anything new. We are at a
position where we can go ahead and redesign whatever we wish.
With this release, I have tried to incorporate the Fit framework
(http://fit.c2.com) which makes writing tests a breeze. Now, to write new
tests, you only need an HTML editor like Netscape Composer or M$ Word, and
you can run them! I have made some tests for the AttributeParser (and I
already found a bug).
I think we can think of making release 1.3 very soon - and probably decide
to seal it from now. My recommended to-do list for 1.3 is:
[1] Redesign StringParser
[2] Redesign AttributeParser
[3] Redesign NodeReader
These three look particularly scary right now.
On to project management issues, I am planning to step down from the role of
the project lead. Your contribution has been substantial and has added so
much value to the parser. You also have a solid vision to take this project
to v1.4 - I really like your ideas about semantic capabilities and have been
thinking on the same lines. I hope you will volunteer for this
responsibility. There is not much extra work involved - the build scripts
have been there for a while - all you have to do is make releases whenever
you feel (one week is usually good). I will of course be there to help in
whatever way I can.
Regards,
Somik
|
|
From: Derrick O. <Der...@ro...> - 2003-04-20 19:24:16
|
Somik,

I guess I'm flattered that you think highly of my contributions, but I
stand on the shoulders of giants.

You're probably feeling burnt out after two years and want to get your
life back.

I suggest this. If no one else comes forward, grant me the Project
Manager role and forget about it for the summer (if you can). I'll try to
be the "Somik" by doing the builds, answering email, performing triage
and fixing bugs, contributing to discussions, and so on, but it's
unlikely that the quality will be as good. I'm not naive enough to think
I'm the one with the vision to move it forward. If it's a rest you want,
I can only promise a short respite.

We'll both consider it again in September, and if you still feel burnt
out, you can drop your Project Manager role altogether, in favour of me
or somebody else. Or, if I haven't completely misread you, you can pick
it up again after your sabbatical.

P.S. I didn't mean to imply that the existing testcases be discarded. No,
they are excellent. But do they have good code coverage? Do they
represent a significant portion of valid HTML constructs? Can they be
said to probe the invalid HTML space in a meaningful way? The only way to
do this is via automation. Perhaps the fit framework can be used as the
front end for it, but nobody is going to hand-craft thousands of tests.

Derrick
|
From: Somik R. <so...@ya...> - 2003-04-24 15:08:12
|
Hi Derrick,
Sorry for not replying earlier, I have been out of town, and the last
few days have been really busy.
> I didn't mean to imply that the existing testcases be discarded. No,
> they are excellent.
> But do they have good code coverage? Do they represent a significant
> portion of valid HTML constructs? Can they be said to probe the invalid
> HTML space in a meaningful way? The only way to do this is via
> automation. Perhaps the fit framework can be used as the front end for
> it, but nobody is going to hand-craft thousands of tests.
I agree with you. The FIT framework can reduce the coding burden if it ends
up as a frontend for most of our acceptance tests.
I need a break, yes, and I do think that you can lead the project far beyond
my ideas. I am not attached to the work already done, and in fact, don't
want to leave you with some of the poor design that I did a long time back.
So, I will focus on cleaning up the core areas, which will be useful in the
long run.
It will be great if you try making the releases from this week, so we can
make a smooth transition. Before running the script, I usually run the
WikiGrabber (first delete the docs directory) which brings in the latest
docs from our Wiki.
Regards,
Somik
|
|
From: Derrick O. <Der...@ro...> - 2003-04-25 22:44:30
|
Somik,

For the release this weekend, which Johan Naudts is keen to get, you'll
need to change my Project Role to Project Manager in the Members section
on the Admin page, so I can access the File Release System.

I'll try the WikiGrabber tonight.

Derrick
|
From: Somik R. <so...@ya...> - 2003-04-26 05:25:53
|
Done. Sorry I didn't do it earlier.

Regards,
Somik