Thread: [Htmlparser-developer] lexer integration
From: Derrick O. <Der...@Ro...> - 2003-09-29 17:38:09

Fixed up the serializability.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if the Tag was more bean-like and had a zero args constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending NodeFactory) that has the signatures for creating all the possible types of tags there are, and then this needs to be used by all the scanners to create their specific tags.

Scanners
--------
The scanners may not be working; it's hard to tell without the unit tests running. I'm not sure that CompositeTagScanner is completely all right yet. It probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the TODO list, committing and notifying the developer list after each item. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
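
The serializability fix itself isn't shown in this message. As a rough illustration of the approach floated elsewhere in the thread (a transient field down on the Source rather than a transient Lexer in the Parser), here is a minimal sketch. The Source class below is a hypothetical stand-in, not the real org.htmlparser class, and its fields and methods are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Reader;
import java.io.Serializable;
import java.io.StringReader;

class TransientSourceSketch {

    // Hypothetical stand-in for the class that owns the live stream.
    static class Source implements Serializable {
        private static final long serialVersionUID = 1L;

        private final String buffered;   // characters read so far: serialized
        private transient Reader reader; // live stream: skipped by serialization

        Source(String text) {
            this.buffered = text;
            this.reader = new StringReader(text);
        }

        String getBuffered() { return buffered; }

        boolean hasLiveReader() { return reader != null; }
    }

    // Serializes and deserializes in memory; checked exceptions are wrapped.
    static Source roundTrip(Source source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(source);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Source) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Source copy = roundTrip(new Source("<html></html>"));
        System.out.println(copy.getBuffered());   // <html></html>
        System.out.println(copy.hasLiveReader()); // false
    }
}
```

Only the stream is dropped, so everything above the Source can stay serializable without special handling.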
From: Derrick O. <Der...@Ro...> - 2003-09-29 19:55:06

OK, it's started...

I've integrated the low level lexer code into the main parser code. Many things aren't working anymore. Of the 448 unit tests 213 of them fail and 14 show exception faults. But the upside is 211 of the tests pass. So I'm dropping my current snapshot, opening it up to those who may wish to assist. See the TODO section.

Big changes
===========

A lot of files have been removed
--------------------------------
htmlparser/NodeReader.java
    this is the primary class that's being replaced by Lexer; the method nextNode() replaces readElement()
htmlparser/RemarkNodeParser.java
    remark nodes are now parsed in the Lexer main loop
htmlparser/parserHelper/AttributeParser.java
    attributes are now parsed by the lexer before the tag is created, manipulated as a Vector of Attribute objects
htmlparser/parserHelper/StringParser.java
    string nodes are now parsed by the lexer
htmlparser/parserHelper/TagParser.java
    tags are now parsed by the lexer
htmlparser/tags/EndTag.java
    this class was replaced by a call to the new isEndTag() method on the Tag class

I labeled the repository with tag "PriorToLexerIntegration" just in case you want to retrieve a file that's no longer there.

Class Derivations
-----------------
The StringNode, RemarkNode and tags.Tag classes now derive from their lexeme counterparts in lexer.nodes instead of the other way around.

NodeFactory
-----------
The beginnings of a node factory interface are included. This was added so the lexer could return 'visitable' nodes to the parser. The parser acts as its own node factory, as does the Lexer.

NodeCount
---------
The node count for parsing goes up in most cases because every whitespace (e.g. newline) now counts as a StringNode. This has whacked out a lot of the tests that were expecting fewer nodes or a certain type of node at a particular index.

Attributes
----------
Attributes now maintain their order and case. The count of attributes also went up because whitespace is maintained within tags too. The storage in a Vector means the element 0 Attribute is actually the name of the tag, rather than having the $TAGNAME entry in a HashTable.

TODO
=====

visitEndTag()
-----------------
The visitEndNode() method on the visitor interface should be put back. I shouldn't have removed it when EndTag was removed. Instead the accept() in Tag should dispatch to visitTag() or visitEndTag() based on isEndTag().

Serializable
--------------
The Parser needs to be made serializable again. This involves a transient field down on the Source, I think, rather than having the whole Lexer transient in the Parser.

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if the Tag was more bean-like and had a zero args constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending NodeFactory) that has the signatures for creating all the possible types of tags there are, and then this needs to be used by all the scanners to create their specific tags.

Scanners
--------
The scanners may not be working; it's hard to tell without the unit tests running. I'm not sure that CompositeTagScanner is completely all right yet. It probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the TODO list, committing and notifying the developer list after each item. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).

Derrick
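
The visitEndTag() item in the TODO list above amounts to a small dispatch inside accept(). The following is a minimal sketch of that idea only; Tag, NodeVisitor and RecordingVisitor here are hypothetical, stripped-down stand-ins, not the real htmlparser classes, and the constructor signature is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class VisitorDispatchSketch {

    // Hypothetical visitor interface with the end-tag callback put back.
    interface NodeVisitor {
        void visitTag(Tag tag);
        void visitEndTag(Tag tag);
    }

    // One Tag class serves as both start tag and end tag.
    static class Tag {
        private final String name;
        private final boolean endTag;

        Tag(String name, boolean endTag) {
            this.name = name;
            this.endTag = endTag;
        }

        String getName() { return name; }

        boolean isEndTag() { return endTag; }

        // accept() picks the callback based on isEndTag().
        void accept(NodeVisitor visitor) {
            if (isEndTag()) {
                visitor.visitEndTag(this);
            } else {
                visitor.visitTag(this);
            }
        }
    }

    // Records the callbacks it receives, in order.
    static class RecordingVisitor implements NodeVisitor {
        final List<String> events = new ArrayList<>();

        public void visitTag(Tag tag) { events.add("start:" + tag.getName()); }

        public void visitEndTag(Tag tag) { events.add("end:" + tag.getName()); }
    }

    public static void main(String[] args) {
        RecordingVisitor visitor = new RecordingVisitor();
        new Tag("TITLE", false).accept(visitor);
        new Tag("TITLE", true).accept(visitor);
        System.out.println(visitor.events); // [start:TITLE, end:TITLE]
    }
}
```

Existing visitors that override the end-tag callback keep working this way, even though there is no separate EndTag class any more.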
From: Derrick O. <Der...@Ro...> - 2003-10-05 14:00:45

Made progress on nearly all the TODO items. The tasks aren't as separable as I thought. There are still 133 failing tests. I'll make a stab at the easy ones next.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if Tags were more bean-like and had zero args constructors with appropriate accessors.

Helpers
-------
I desperately want to get rid of the two remaining 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
Many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list, committing and notifying the developer list after each fix. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
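
The Node Factory proposal above is essentially a prototype registry: empty tags are stored by name and cloned on demand. Here is a minimal sketch of how that might look. The names mBlastocyst, setTagFor and createTagNode come from the message; everything else is assumed for illustration, including the simplified createTagNode(String) signature (the proposal takes the name from the attribute list), the upper-casing of keys, and the fallback to a generic Tag.

```java
import java.util.Hashtable;
import java.util.Locale;

class PrototypeTagFactorySketch {

    // Minimal bean-like tag: zero-args constructor plus accessors.
    static class Tag implements Cloneable {
        private String name;

        String getName() { return name; }

        void setName(String name) { this.name = name; }

        @Override
        public Tag clone() {
            try {
                return (Tag) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: Tag is Cloneable
            }
        }
    }

    static class FormTag extends Tag { }

    static class MyBodyTag extends Tag { } // an end user's own tag type

    static class Parser {
        // Undifferentiated prototype tags, keyed by upper-cased tag name.
        private final Hashtable<String, Tag> mBlastocyst = new Hashtable<>();

        Parser() {
            setTagFor("FORM", new FormTag());
        }

        void setTagFor(String name, Tag prototype) {
            mBlastocyst.put(name.toUpperCase(Locale.ROOT), prototype);
        }

        // Clone the registered prototype, or fall back to a generic Tag.
        Tag createTagNode(String name) {
            Tag prototype = mBlastocyst.get(name.toUpperCase(Locale.ROOT));
            Tag tag = (prototype == null) ? new Tag() : prototype.clone();
            tag.setName(name);
            return tag;
        }
    }

    public static void main(String[] args) {
        Parser parser = new Parser();
        parser.setTagFor("BODY", new MyBodyTag());
        System.out.println(parser.createTagNode("body").getClass().getSimpleName()); // MyBodyTag
        System.out.println(parser.createTagNode("form").getClass().getSimpleName()); // FormTag
        System.out.println(parser.createTagNode("p").getClass().getSimpleName());    // Tag
    }
}
```

Because the lookup lives in one place, plugging in a new tag type needs only a setTagFor call and no new scanner.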
From: Derrick O. <Der...@Ro...> - 2003-10-06 02:11:44

I've fixed the easily fixed tests now. The remaining 40 or so indicate changed functionality that needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if Tags were more bean-like and had zero args constructors with appropriate accessors.

Helpers
-------
I desperately want to get rid of the two remaining 'helper' classes. They are just obfuscating the code. The CompositeTagScannerHelper is close to being folded back into the CompositeTagScanner. It just needs some more untangling.

AbstractNode
------------
Drop org.htmlparser.lexer.nodes.AbstractNode, fold functionality into org.htmlparser.AbstractNode.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Examples:
    testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
    testInvertedCommas - <tag attribute = whatever> used to be acceptable
    testEmptyComment - <!--> was considered a valid remark node
Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list, committing and notifying the developer list after each fix. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
From: Derrick O. <Der...@Ro...> - 2003-10-13 22:09:27

It now passes 499 tests out of 521. The remaining 22 failures indicate changed functionality that needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

I've eliminated the ParserHelper static class, so there's only one 'helper' left to go.

I reinstated the tests in the test.temporaryFailures package. They're really lexer package tests, so I put them in there. The good news is, they all pass now. In other words, the reason for them being relegated to the temporaryFailures package no longer exists. The bad news is, I've given up on the JSP test cases for the nonce.

These tests pointed out that the old attribute parser was handling some pretty awful attributes, so I added a fixAttributes() on Lexer to handle these bad tags. Chiefly the changes were to provide for whitespace either side of an equals sign (between the attribute name and the value) in Attribute and then recognize where it was needed in parseTag by calling fixAttributes.

I also fell back to providing unquoted values in the special getAttributes() hashtable. This is to provide backwards compatibility. Reverting it messed up a lot of tests that I had 'fixed' already.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if Tags were more bean-like and had zero args constructors with appropriate accessors.

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close, it just needs some more untangling.

AbstractNode
------------
Drop org.htmlparser.lexer.nodes.AbstractNode, fold functionality into org.htmlparser.AbstractNode.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Examples:
    testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
    testEmptyComment - <!--> was considered a valid remark node
Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list, committing and notifying the developer list after each fix. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
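
The whitespace handling described for fixAttributes() in this message can be pictured as a pass over the raw pieces of a tag that glues "name", "=", and "value" back together. The sketch below is not the actual Lexer.fixAttributes() code. It assumes, for illustration only, that the pieces arrive as plain strings (the real code works on Attribute objects) and that every lone "=" should be merged, whereas the message says parseTag recognizes where the fix is needed.

```java
import java.util.ArrayList;
import java.util.List;

class FixAttributesSketch {

    static boolean isBlank(String s) {
        return s.trim().isEmpty();
    }

    // Merges [name, ws, "=", ws, value] runs into a single "name=value" piece.
    static List<String> fixAttributes(List<String> pieces) {
        List<String> fixed = new ArrayList<>();
        int i = 0;
        while (i < pieces.size()) {
            String piece = pieces.get(i);
            if (!piece.equals("=")) {
                fixed.add(piece);
                i++;
                continue;
            }
            // Drop whitespace between the name and the equals sign.
            while (!fixed.isEmpty() && isBlank(fixed.get(fixed.size() - 1))) {
                fixed.remove(fixed.size() - 1);
            }
            String name = fixed.isEmpty() ? "" : fixed.remove(fixed.size() - 1);
            // Skip whitespace between the equals sign and the value.
            int j = i + 1;
            while (j < pieces.size() && isBlank(pieces.get(j))) {
                j++;
            }
            String value = (j < pieces.size()) ? pieces.get(j) : "";
            fixed.add(name + "=" + value);
            i = j + 1;
        }
        return fixed;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("a", " ", "href", " ", "=", " ", "\"x.html\"");
        System.out.println(String.join("", fixAttributes(raw))); // a href="x.html"
    }
}
```

Element 0 is the tag name, matching the Vector layout described earlier in the thread, and the whitespace between the tag name and the first attribute is left alone.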
From: Derrick O. <Der...@Ro...> - 2003-10-20 01:56:28

Removed the data package from the parser level tags. Out went TagData, CompositeTagData, LinkData and FormData. This means the createTag call is now bloated with arguments, but this too shall pass.

Moved a lot of the functionality from the scanners to the tags. Whereas before, the scanner would extract all sorts of stuff and pass it to special tag constructors and the tag would just hold it, the tag now performs these tasks when asked. I also removed a lot of member variables so the tags get and set attribute values directly, which means it comes out in the toHtml() call without any special work.

Removed lexer level AbstractNode, so there is a Page property on the org.htmlparser.AbstractNode now.

Separated tag creation from recursion in NodeFactory interface, so people who want to create their own tags won't need to worry about the scanning recursion.

It passes 508 of 522 unit tests.

TODO
=====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close, it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out.

Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straightforward bean-like manner:

    ret = new Div ();
    ret.setPage (page);
    ret.setStartPosition (start);
    ret.setEndPosition (end);
    ret.setAttributesEx (attributes);
    ret.setStartTag (startTag);
    ret.setEndTag (endTag);
    ret.setChildren (children);

This is nearly always the same in every scanner, only the tag name is different. The oddball cases have been highlighted with a

    // special step here...

comment in the code. These special steps mostly revolve around meta-information available in scanners only (e.g. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Examples:
    testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
    testEmptyComment - <!--> was considered a valid remark node
Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list, committing and notifying the developer list after each fix. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
From: Joshua K. <jo...@in...> - 2003-10-22 01:26:55

Derrick,

Is it me or are there duplicates of the StringNode, RemarkNode, etc. between the org.htmlparser package and the org.htmlparser.lexer.nodes package?

I also noticed that the NodeFactory's creation methods take the lexer as an argument, yet *all* of those methods and the methods they call rely on lexer.getPage(). Have you considered simply passing in a page instance rather than a lexer instance? That will work well for some further refactoring I have in mind.

--jk
From: Derrick O. <Der...@Ro...> - 2003-10-22 03:31:04

Joshua,

I think the duplication is because the lexer.nodes package nodes don't use the NodeVisitor pattern and the htmlparser package nodes do. The lexer is shipped as a separate jar so it needs nodes that don't drag in the composite node stuff, which happens if the NodeVisitor signature is included. This may be factored out if we get rid of visitLinkTag, visitorImageTag and visitorTitleTag from that interface. These may best be handled by direct examination of the node name in the various visitor classes.

The composite tag recursion happens on the scanTagNode method which does need a lexer, so the create calls can take just a Page, like you say.

Derrick

Joshua Kerievsky wrote:

> Derrick,
>
> Is it me or are there duplicates of the StringNode, RemarkNode, etc
> between the org.htmlparser package and the org.htmlparser.lexer.nodes
> package?
> I also noticed that the NodeFactory's creation methods take the lexer
> as an argument, yet *all* of those methods and the methods they call
> rely on lexer.getPage(). Have you considered simply passing in a
> page instance rather than a lexer instance? That will work well for
> some further refactoring I have in mind.
> --jk
From: Joshua K. <jo...@in...> - 2003-10-22 22:50:26

Derrick Oswald wrote:

> I think the duplication is because the lexer.nodes package nodes don't
> use the NodeVisitor pattern and the htmlparser package nodes do. The
> lexer is shipped as a separate jar so it needs nodes that don't drag
> in the composite node stuff, which happens if the NodeVisitor
> signature is included. This may be factored out if we get rid of
> visitLinkTag, visitorImageTag and visitorTitleTag from that
> interface. These may best be handled by direct examination of the
> node name in the various visitor classes.

Yeah, as I said in another thread, the NodeVisitor ought not to be dependent on scanners (or, in the future, on which prototypable tags are present in some collection). That is, it shouldn't have methods on it that visit types which may not be available. So I'm in favor of a simple, narrow NodeVisitor interface that just lets one visit the basic types.

> The composite tag recursion happens on the scanTagNode method which
> does need a lexer, so the create calls can take just a Page, like you
> say.

Sounds good.

regards
jk
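
A narrow visitor of the kind discussed in this exchange might look roughly like the sketch below, with link handling done by examining the node name inside a visitor class instead of through a dedicated visitLinkTag() method. All types here are hypothetical, simplified stand-ins (string and remark nodes are reduced to plain strings), not the real htmlparser interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NarrowVisitorSketch {

    // Only the basic node kinds appear on the interface.
    interface NodeVisitor {
        void visitTag(Tag tag);
        void visitStringNode(String text);
        void visitRemarkNode(String text);
    }

    static class Tag {
        private final String name;
        private final Map<String, String> attributes = new HashMap<>();

        Tag(String name) { this.name = name; }

        String getName() { return name; }

        String getAttribute(String key) { return attributes.get(key); }

        Tag setAttribute(String key, String value) {
            attributes.put(key, value);
            return this;
        }

        void accept(NodeVisitor visitor) { visitor.visitTag(this); }
    }

    // Finds links by examining the tag name, so no visitLinkTag() is needed.
    static class LinkCollector implements NodeVisitor {
        final List<String> links = new ArrayList<>();

        public void visitTag(Tag tag) {
            String href = tag.getAttribute("href");
            if ("A".equalsIgnoreCase(tag.getName()) && href != null) {
                links.add(href);
            }
        }

        public void visitStringNode(String text) { }

        public void visitRemarkNode(String text) { }
    }

    public static void main(String[] args) {
        LinkCollector collector = new LinkCollector();
        new Tag("a").setAttribute("href", "index.html").accept(collector);
        new Tag("img").setAttribute("src", "logo.png").accept(collector);
        System.out.println(collector.links); // [index.html]
    }
}
```

With this shape the interface never mentions a tag type that might not be registered, which is the dependency the thread wants to avoid.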
From: Derrick O. <Der...@Ro...> - 2003-10-25 16:05:23

Made all test suites self executable by moving the mainline into ParserTestCase.

Handle some pathological remark nodes (Netscape handles way more, like everything starting with <! so it seems).

Handle some broken end tags. TAG_ENDERS and END_TAG_ENDERS should be revisited for all scanners.

Passes 512 of 522 tests.

TODO
=====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close, it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out.

Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straightforward bean-like manner:

    ret = new Div ();
    ret.setPage (page);
    ret.setStartPosition (start);
    ret.setEndPosition (end);
    ret.setAttributesEx (attributes);
    ret.setStartTag (startTag);
    ret.setEndTag (endTag);
    ret.setChildren (children);

This is nearly always the same in every scanner, only the tag name is different. The oddball cases have been highlighted with a

    // special step here...

comment in the code. These special steps mostly revolve around meta-information available in scanners only (e.g. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list, committing and notifying the developer list after each fix. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
From: Derrick O. <Der...@Ro...> - 2003-10-26 04:29:16

Fixed or avoided the remaining failing unit tests. It's a green bar now, 522 of 522 passing. I shut up all the excess verbiage from the tests, so they're silent too.

TODO
=====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close, it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out.

Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straightforward bean-like manner:

    ret = new Div ();
    ret.setPage (page);
    ret.setStartPosition (start);
    ret.setEndPosition (end);
    ret.setAttributesEx (attributes);
    ret.setStartTag (startTag);
    ret.setEndTag (endTag);
    ret.setChildren (children);

This is nearly always the same in every scanner, only the tag name is different. The oddball cases have been highlighted with a

    // special step here...

comment in the code. These special steps mostly revolve around meta-information available in scanners only (e.g. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

Case Sensitive TestCase
-------------------------------
Currently all string comparisons via the ParserTestCase.assertStringsEqual() are case insensitive. This should be turned off by setting ParserTestCase.mCaseInsensitiveComparisons to false, and the tests fixed to accommodate.

toHtml(verbatim/fixed)
-----------------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, e.g. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------------
Some GUI based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
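
For the toHtml(verbatim/fixed) item, one possible shape is a flag on toHtml() combined with a per-node record of whether the end tag was really in the source. The sketch below is only an illustration of that option. The TagNode class, its constructor, and the endTagInSource field are assumptions, and it covers only virtual end tags, not the harder case of remembering broken tags such as </title</head>.

```java
class ToHtmlModesSketch {

    // Hypothetical node that remembers whether its end tag was in the page.
    static class TagNode {
        private final String name;
        private final String text;
        private final boolean endTagInSource;

        TagNode(String name, String text, boolean endTagInSource) {
            this.name = name;
            this.text = text;
            this.endTagInSource = endTagInSource;
        }

        // verbatim = true reproduces the page as read;
        // verbatim = false includes end tags the parser added.
        String toHtml(boolean verbatim) {
            StringBuilder html = new StringBuilder();
            html.append('<').append(name).append('>').append(text);
            if (endTagInSource || !verbatim) {
                html.append("</").append(name).append('>');
            }
            return html.toString();
        }

        // Existing callers keep getting the fixed HTML.
        String toHtml() {
            return toHtml(false);
        }
    }

    public static void main(String[] args) {
        TagNode unclosed = new TagNode("p", "no end tag", false);
        System.out.println(unclosed.toHtml(true));  // <p>no end tag
        System.out.println(unclosed.toHtml(false)); // <p>no end tag</p>
    }
}
```

Keeping the zero-argument toHtml() on the fixed output would leave the many test cases that assume fixed HTML untouched.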
From: Derrick O. <Der...@Ro...> - 2003-10-26 16:08:56
|
Got rid of CompositeTagScannerHelper. Yeaahh! TODO ===== Node Factory ------------ The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out. Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straight forward bean-like manner: ret = new Div (); ret.setPage (page); ret.setStartPosition (start); ret.setEndPosition (end); ret.setAttributesEx (attributes); ret.setStartTag (startTag); ret.setEndTag (endTag); ret.setChildren (children); This is nearly always the same in every scanner, only the tag name is different. The oddball cases have been highlighted with a // special step here... comment in the code. These special steps mostly revolve around meta-information available in scanners only (i.e. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away. Documentation ------------- As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely. Augment Lexer State Machines ---------------------------------------- There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace either side of attribute equals signs. 
Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine in the first place. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

Case Sensitive TestCase
-------------------------------
Currently all string comparisons via ParserTestCase.assertStringsEqual() are case insensitive. This should be turned off by setting ParserTestCase.mCaseInsensitiveComparisons to false, and the tests fixed to accommodate.

toHtml(verbatim/fixed)
-----------------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, e.g. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals.
Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
|
From: Derrick O. <Der...@Ro...> - 2003-11-06 04:06:41
|
OK, almost ready to get rid of most of the scanner package that shadows the tag package. There remains the 'filter' concept to handle, and then all but TagScanner, CompositeTagScanner and ScriptScanner are obsolete.

The tags now own their 'ids', 'enders' and 'end tag enders' lists, and the isTagToBeEndedFor() logic now uses information from the tags, not the scanners.

Nodes are created by cloning from a list of prototypes in the Parser (NodeFactory), so the scanners no longer create the tags (but they still create the prototypical ones).

Now, the startTag() *is* the CompositeTag, and the CompositeTagScanner just adds children to an already differentiated tag.

The scanners have no special actions on behalf of tags anymore. Things like the LinkProcessor and form ACTION determination have been moved out of the scanners and into either the Page object or the appropriate tags.

Other changes:

Made visitor 'node visiting order' the same order as on the page.

Fixed StringBean, which was still looking for end tags with names starting with a slash, e.g. "/SCRIPT".

Added some debugging support to the lexer, so you can easily base a breakpoint on a line number in an HTML page.

Fixed all the tests failing if case sensitivity was turned on. Now ParserTestCase does case sensitive comparisons.

Converted native characters in tests to Unicode. Mostly this was the division sign (\u00f7) used in tests of character entity reference translation.

Removed deprecated method calls: elementBegin() is now getStartPosition() and elementEnd() is now getEndPosition().

Also fixed the NodeFactory signatures to have a Page rather than a Lexer.
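For anyone trying to picture the prototype cloning described above, here is a minimal sketch. It is not the code in CVS: SketchTag, SketchLinkTag and PrototypeFactory are made-up stand-ins, the method names setTagFor and createTagNode are borrowed from the earlier TODO item only, and it uses generics for brevity. The idea is just that the factory keeps one empty tag per name and hands out copies.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a tag; NOT the real org.htmlparser tag class. */
class SketchTag implements Cloneable {
    private String name = "";
    private int start;
    private int end;

    String getName() { return name; }
    void setName(String name) { this.name = name; }
    int getStartPosition() { return start; }
    void setStartPosition(int start) { this.start = start; }
    int getEndPosition() { return end; }
    void setEndPosition(int end) { this.end = end; }

    @Override
    public SketchTag clone() {
        try {
            // Object.clone() preserves the runtime class, so subclasses copy correctly.
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable: the class is Cloneable
        }
    }
}

/** Hypothetical specific tag type, registered under the name "A". */
class SketchLinkTag extends SketchTag { }

/** Hypothetical factory: one undifferentiated prototype per tag name. */
class PrototypeFactory {
    private final Map<String, SketchTag> prototypes = new HashMap<>();

    /** Register (or replace) the prototype used for a tag name. */
    void setTagFor(String name, SketchTag prototype) {
        prototypes.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    /** Clone the registered prototype, or fall back to a generic tag. */
    SketchTag createTagNode(String name, int start, int end) {
        String key = name.toUpperCase(Locale.ROOT);
        SketchTag prototype = prototypes.get(key);
        SketchTag tag = (prototype == null) ? new SketchTag() : prototype.clone();
        tag.setName(key);
        tag.setStartPosition(start);
        tag.setEndPosition(end);
        return tag;
    }

    public static void main(String[] args) {
        PrototypeFactory factory = new PrototypeFactory();
        factory.setTagFor("A", new SketchLinkTag());
        SketchTag tag = factory.createTagNode("a", 10, 42);
        System.out.println(tag.getClass().getSimpleName() + " " + tag.getName());
    }
}
```

With this shape a user plugs in a custom tag by registering a prototype rather than writing a scanner, which is the point of the change.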
TODO
=====

Filters
-------
Replace the String to String comparison of the 'filter' concept with a TagFilter interface:

    boolean accept (Tag tag);

and allow users to perform something like:

    NodeList list = parser.extractAllNodesThatAre (
        new NodeFilter ()
        {
            public boolean accept (Tag tag)
            {
                return (tag.getClass() == LinkTag.class);
            }
        });

And similarly for:

    tag.collectInto (NodeList collectionList, NodeFilter filter);
    nodelist.searchFor (NodeFilter filter);
    parser.parse (NodeFilter filter)

etc.

Remove Scanners
---------------
Finish off obviating the scanners. Think of a good way to group tags so adding one tag to the list of tags to be returned by the parser would add its buddies, e.g. the Form scanner now adds Input, TextArea, Selection and Option scanners behind the scenes for you. Then replace the add, remove, get, etc. scanner methods on the parser with the comparable tag based ones. Alter all the test cases to use the new methods, and move all the unique scanner test cases into tag test cases, then delete most of the scannersTests package.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine in the first place. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.
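To make the attribute half of the item above concrete, here is a rough sketch of a state machine that tolerates whitespace on either side of the equals sign while scanning, so no fixAttributes() style pass is needed afterwards. It is an illustration only: AttributeStateMachine and its states are invented for this example and are not the lexer's actual states, and it ignores JSP and other corner cases.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical attribute scanner; NOT the real lexer state machine. */
class AttributeStateMachine {
    private enum State { BEFORE_NAME, IN_NAME, AFTER_NAME, BEFORE_VALUE, IN_QUOTED, IN_UNQUOTED }

    /** Store the current name/value pair and reset both buffers. */
    private static void emit(Map<String, String> out, StringBuilder name, StringBuilder value, boolean hasValue) {
        out.put(name.toString(), hasValue ? value.toString() : null);
        name.setLength(0);
        value.setLength(0);
    }

    /** Parse the attribute portion of a tag, e.g. {@code href = "x" width=10 checked}. */
    static Map<String, String> parse(String text) {
        Map<String, String> attributes = new LinkedHashMap<>();
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        State state = State.BEFORE_NAME;
        char quote = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean space = Character.isWhitespace(ch);
            switch (state) {
                case BEFORE_NAME:
                    if (!space) { name.append(ch); state = State.IN_NAME; }
                    break;
                case IN_NAME:
                    if (ch == '=') state = State.BEFORE_VALUE;
                    else if (space) state = State.AFTER_NAME;   // whitespace before '=' is allowed
                    else name.append(ch);
                    break;
                case AFTER_NAME:
                    if (ch == '=') state = State.BEFORE_VALUE;
                    else if (!space) {
                        // No '=' followed: the previous attribute was valueless.
                        emit(attributes, name, value, false);
                        name.append(ch);
                        state = State.IN_NAME;
                    }
                    break;
                case BEFORE_VALUE:
                    if (ch == '"' || ch == '\'') { quote = ch; state = State.IN_QUOTED; }
                    else if (!space) { value.append(ch); state = State.IN_UNQUOTED; } // whitespace after '=' is skipped
                    break;
                case IN_QUOTED:
                    if (ch == quote) { emit(attributes, name, value, true); state = State.BEFORE_NAME; }
                    else value.append(ch);
                    break;
                case IN_UNQUOTED:
                    if (space) { emit(attributes, name, value, true); state = State.BEFORE_NAME; }
                    else value.append(ch);
                    break;
            }
        }
        // Flush whatever was in progress when the input ended.
        if (state == State.IN_NAME || state == State.AFTER_NAME) emit(attributes, name, value, false);
        else if (state != State.BEFORE_NAME) emit(attributes, name, value, true);
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(parse("href = \"x.html\" width= 10 checked"));
    }
}
```

The two extra states (AFTER_NAME and BEFORE_VALUE) are where the whitespace gets absorbed; the gating characters are the equals sign, the quotes and whitespace.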
toHtml(verbatim/fixed)
-----------------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, e.g. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good.

Applications
-----------
Rework all the applications for a better 'out of the box' experience for new and novice users. Fix all the scripts in /bin (for unix and windows) and add any others that don't exist already.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
|
From: Derrick O. <Der...@Ro...> - 2003-11-08 22:41:14
|
To replace the string filtering based on constants in the scanner classes I've implemented generic node filtering, based on a NodeFilter interface. Some example filters have been added to the new filter package to give everyone an idea of how it can be used. This may be pushed down to the lexer level if only a restricted subset of filters is allowed.

Tag specific scanners are now only used to set up the tags in the prototype list and, except for ScriptTag, the tags now all use one of two common scanners, either a TagScanner or a CompositeTagScanner, that are statically allocated by the tag base classes.

I got rid of the node lookahead in the parser. This was used to determine the character set to use for reading the stream before handing out any erroneous nodes, but with some sleight of hand at the stream/source level we can still hide most of that from the user by performing the character set change in the doSemanticAction() method of the META tag. This means the META tag should always be registered (without it being registered, character sets may be handled erroneously if the HTTP header is incorrect, just as with the Lexer). This change makes the IteratorImpl class much simpler. The old IteratorImpl is moved to PeekingIteratorImpl but deprecated, as is the PeekingIterator interface.

Some side effects:

The mainline of the parser now looks different. Instead of -i, -l etc. switches, the user specifies the node name directly, e.g.:

    java -jar htmlparser.jar org.htmlparser.Parser IMG

and it really works now.

In the past, the parser avoided handling tags like "<a name=target>yadda</a>" because it didn't have an HREF attribute. However, this is valid HTML for a destination anchor from some other location, e.g. <a href="#target">see yadda</a>. This special logic in the LinkScanner is no longer used and will be destroyed when the LinkScanner goes away.
This means there is no longer any need for the evaluate() method to be checked before scanning tags (at least there's no reason for it at this time), so it can probably be removed. But, caveat emptor, the parser can now return LinkTags where linktag.getLink() should (and eventually will) return null.

p.s. Is any of this stuff I'm spewing useful? There's very little feedback from anybody.

TODO
=====

Remove Scanners
---------------
Finish off obviating the scanners. Think of a good way to group tags so adding one tag to the list of tags to be returned by the parser would add its buddies, e.g. the Form scanner now adds Input, TextArea, Selection and Option scanners behind the scenes for you. Then replace the add, remove, get, etc. scanner methods on the parser with the comparable tag based ones. Alter all the test cases to use the new methods, and move all the unique scanner test cases into tag test cases, then delete most of the scannersTests package.

Filters
-------
Implement the new filtering mechanism for NodeList.searchFor ().

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine in the first place. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.
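Going back to the node filtering described at the top of this message and the Filters item in the TODO list, the general shape can be sketched as below. This is a toy model, not the classes in the new filter package: SketchNode, SketchNodeFilter and FilterDemo are hypothetical names, a plain List stands in for NodeList, and only collectInto and extractAllNodesThatAre echo method names already mentioned on this list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter interface; the real one may differ in its argument type. */
interface SketchNodeFilter {
    boolean accept(SketchNode node);
}

/** Hypothetical tree node carrying just a name and its children. */
class SketchNode {
    private final String name;
    private final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name) { this.name = name; }
    String getName() { return name; }

    /** Append a child and return this node so trees can be built fluently. */
    SketchNode add(SketchNode child) { children.add(child); return this; }

    /** Depth-first walk in document order, keeping nodes the filter accepts. */
    void collectInto(List<SketchNode> list, SketchNodeFilter filter) {
        if (filter.accept(this)) list.add(this);
        for (SketchNode child : children) child.collectInto(list, filter);
    }
}

class FilterDemo {
    /** Collect every node under root (inclusive) that the filter accepts. */
    static List<SketchNode> extractAllNodesThatAre(SketchNode root, SketchNodeFilter filter) {
        List<SketchNode> result = new ArrayList<>();
        root.collectInto(result, filter);
        return result;
    }

    public static void main(String[] args) {
        SketchNode body = new SketchNode("BODY")
            .add(new SketchNode("A"))
            .add(new SketchNode("P").add(new SketchNode("A")));
        // Anonymous filter, in the style of the earlier TODO example.
        List<SketchNode> links = extractAllNodesThatAre(body, new SketchNodeFilter() {
            public boolean accept(SketchNode node) {
                return node.getName().equals("A");
            }
        });
        System.out.println(links.size() + " link nodes found");
    }
}
```

Because the filter is an object rather than a string constant, the same instance can be handed to a list search, a tag's collectInto or the parser itself.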
toHtml(verbatim/fixed)
-----------------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, e.g. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good. A filter builder tool to graphically construct a program to extract a snippet from an HTML page would blow people away.

Applications
-----------
Rework all the applications for a better 'out of the box' experience for new and novice users. Fix all the scripts in /bin (for unix and windows) and add any others that don't exist already.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
|