Thread: [Htmlparser-developer] ParserFeedback mechanism
From: <dha...@or...> - 2003-05-02 08:18:45
Attachments:
BDY.RTF
|
Hi guys,

I remember we had a discussion about the feedback mechanism earlier. I just wanted to restart it by suggesting use of the Logging Wrapper from Jakarta.

I have noticed that anyone who wants to use the ParserFeedback to log will mostly need to extend the DefaultParserFeedback class and override the methods appropriately. If we can map the ParserFeedback class to the Logging Wrapper, applications can easily use the Feedback mechanism to log to Log4j and JDK 1.4 without having to do a thing. Most users, according to me, would be using one of these systems. I believe the argument then was coupling with a third-party library. But I believe the flexibility it offers outstrips the coupling drawback.

Furthermore, imagine an application which is using some other logging tool. They have coded their entire logging framework using the Logging Wrapper and have used an adapter to log to their logging tool. If they use the parser and want to log its output as well, they will have to write one more adapter. If the parser instead provides a mechanism for using the Logging Wrapper, they would not need to do anything.

We have actually had requests wherein different clients have asked for different logging tools to be used! Hence the request.

We could simply extend DefaultParserFeedback to create LogWrapperFeedback and make it implement the commons logging interface.

Do let me know your thoughts/opinions/suggestions on the same.

Regards,
Dhaval
|
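A minimal sketch of the adapter idea above, under stated assumptions: `DefaultParserFeedback` here is a simplified stand-in written for this example (the real class and its method signatures may differ), and `java.util.logging` (the JDK 1.4 logger mentioned in the message) is used as the backend instead of the Jakarta wrapper so that the example has no external dependencies.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified stand-in for the parser's default feedback class: it just prints.
// Assumption: the real class and its signatures are not reproduced here.
class DefaultParserFeedback {
    public void info(String message) {
        System.out.println("INFO: " + message);
    }

    public void warning(String message) {
        System.out.println("WARNING: " + message);
    }

    public void error(String message, Exception e) {
        System.out.println("ERROR: " + message + " (" + e + ")");
    }
}

// Overrides each callback so messages go to a logger instead of the console.
class LogWrapperFeedback extends DefaultParserFeedback {
    private final Logger log;

    LogWrapperFeedback(Logger log) {
        this.log = log;
    }

    @Override
    public void info(String message) {
        log.log(Level.INFO, message);
    }

    @Override
    public void warning(String message) {
        log.log(Level.WARNING, message);
    }

    @Override
    public void error(String message, Exception e) {
        // The exception travels with the record so the backend can print a trace.
        log.log(Level.SEVERE, message, e);
    }
}
```

An application would build one `LogWrapperFeedback` around its own logger and hand it to the parser wherever a feedback object is accepted; swapping the logging backend then needs no further parser-specific code.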
From: Derrick O. <Der...@ro...> - 2003-05-02 10:59:57
|
I looked at this earlier too, regarding the org.apache.commons.logging package:
http://jakarta.apache.org/commons/logging.html
which provides a thin, logging-system-agnostic interface.

There didn't appear to be anything in the licence (http://jakarta.apache.org/commons/license.html) that precludes repackaging the code into the htmlparser tree. It isn't very large. We would have to say

"This product includes software developed by the Apache Software Foundation (http://www.apache.org/)."

somewhere in the documentation.

There is, however, the task of re-working every file in the source tree to use the logging wrapper mechanism, which is non-trivial (190 files, 83 of which are tests).
I would suggest this be undertaken when version 1.3 is finished. I think we can arbitrarily set a cut-off point for 1.3 next week, unless a major show-stopper is discovered.
Of course, for backwards compatibility, we should just deprecate the ParserFeedback (et al) and provide an implementation for it in terms of the new logging code.

Derrick

dha...@or... wrote:
<snip>
|
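One way the backwards-compatibility point might play out, sketched under assumptions: the parser writes to a small logging facade, and callers who still supply an old-style feedback object get it wrapped so they keep receiving messages. `Log` below is a hypothetical minimal facade whose three method names mirror commons-logging's `Log` interface, `ParserFeedback` is a simplified stand-in for the existing interface, and `FeedbackBackedLog` is an illustrative name; none of this is taken from the htmlparser source.

```java
// Hypothetical minimal facade; method names mirror commons-logging's Log.
interface Log {
    void info(Object message);
    void warn(Object message);
    void error(Object message, Throwable t);
}

// Simplified stand-in for the existing feedback interface, kept for old callers.
@Deprecated
interface ParserFeedback {
    void info(String message);
    void warning(String message);
    void error(String message, Exception e);
}

// Lets code written against the facade keep notifying an old-style feedback object.
class FeedbackBackedLog implements Log {
    private final ParserFeedback feedback;

    FeedbackBackedLog(ParserFeedback feedback) {
        this.feedback = feedback;
    }

    public void info(Object message) {
        feedback.info(String.valueOf(message));
    }

    public void warn(Object message) {
        feedback.warning(String.valueOf(message));
    }

    public void error(Object message, Throwable t) {
        // The old callback takes an Exception, so wrap anything that is not one.
        Exception e = (t instanceof Exception) ? (Exception) t : new Exception(t);
        feedback.error(String.valueOf(message), e);
    }
}
```

With a bridge like this the 190-file rework could proceed against the facade alone, while existing feedback subclasses keep working until the deprecated interface is finally removed.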
From: Somik R. <so...@ya...> - 2003-05-02 14:46:01
|
Hi Derrick, Dhaval, Everyone,

I've been thinking about this for a while too, and I agree with you - we shouldn't decide before 1.3 is out. Some issues that have been on my mind -

[1] I find the parser's differentiating factor is its size - time and time again the feedback I've received is that folks love its being below 100K. Size almost directly maps on to simplicity. And that impacts the other important area - performance.

[2] I hate to pay for what I don't need - when folks get tons of stuff that they don't need, they are paying for the needs of a few.

At the same time, I think it is a challenge to be able to accommodate new requests and still keep the parser light. I see a natural layering forming:

,----------------------------------------.
|        Sample Applications, GUI        |
|   ,'''''''''''''''''''''''''''''''`.   |
|   |       Logging Mechanism        |   |
|   |  ,''''''''''''''''''''''''''|  |   |
|   |  |          Beans           |  |   |
|   |  |  +--------------------b  |  |   |
|   |  |  |      Scanners      |  |  |   |
|   |  |  | ,---------------Y  |  |  |   |
|   |  |  | |  Core Parser  |  |  |  |   |
|   |  |  | `.............../  |  |  |   |
|   |  |  L____________________|  |  |   |
|   |  |                          |  |   |
|   |  '`''''''''''''''''''''''''''  |   |
|   |     default, log4j, jdk1.4     |   |
|   `................................/   |
|________________________________________|

If we can perform this separation in the design and the packaging, it might allow people to choose what they need. We don't have to follow the "one size fits all" policy.

What are your thoughts? I am not sure how we'd achieve this separation or whether it really makes sense - so please jump in with your two cents.

Regards,
Somik

----- Original Message -----
From: "Derrick Oswald" <Der...@ro...>
To: <htm...@li...>
Sent: Friday, May 02, 2003 4:13 AM
Subject: Re: [Htmlparser-developer] ParserFeedback mechanism
<snip>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
|
From: Derrick O. <Der...@ro...> - 2003-05-03 12:03:32
|
Since it's a library incorporated within other applications, size is always an issue.
There are two aspects though: disk footprint (jar size) and memory usage.
Usually, there is a speed/memory usage trade-off to be made, which is only sometimes reflected in the disk footprint size.
With current desktop hardware, people usually trade off memory for speed.
It's only with embedded or mobile applications that you concentrate on disk size and memory consumption.

Regarding your picture, the layers won't necessarily follow the current package structure.
For example, logging is integral to the core parser to report problems, and the beans layer removes all HTML tags so it can't be used by upper layers. In order to decide the breakdown in layers, a poll of users regarding typical use-cases might be in order.

Let's say there are two major groupings:

1) extraction of all or part of the information on a page to be consumed by another application.
2) rewriting URLs, content, specific tags, clean-up, reformatting or pretty printing HTML text

This would suggest three configuration items (jars):

parser_applications.jar - Sample applications, GUI tools, beans, tests
parser_edit.jar - Rewriting tools, DOM type hierarchical editing, visitors, smart tags
parser_core.jar - Read-only core parser, stream of undifferentiated tags

If a program's parser usage involves extraction, it need only use the parser_core.jar and pass through the data in a stream-like fashion. But if rewriting is in order, it uses both parser_core.jar and parser_edit.jar, and the parser presents the full HTML document as a hierarchy of tag-specific nodes. All else goes into parser_applications.

We could probably get parser_core.jar below 25KB, or in that range.

Derrick

Somik Raha wrote:
<snip>
|
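The extraction use case above can be sketched as a single pass over a flat stream, assuming the read-only core exposes its output as an iterator. `Node` and `TextExtractor` are hypothetical names invented for this illustration and are not the parser's actual classes.

```java
import java.util.Iterator;

// Hypothetical flat node as a read-only core might emit it:
// raw text plus a flag saying whether it is markup.
final class Node {
    final String text;
    final boolean isTag;

    Node(String text, boolean isTag) {
        this.text = text;
        this.isTag = isTag;
    }
}

// Extraction use case: walk the stream once, keep only the non-tag text,
// and never build a document tree in memory.
final class TextExtractor {
    static String extract(Iterator<Node> nodes) {
        StringBuilder out = new StringBuilder();
        while (nodes.hasNext()) {
            Node n = nodes.next();
            if (!n.isTag) {
                out.append(n.text);
            }
        }
        return out.toString();
    }
}
```

A rewriting tool, by contrast, would need to hold the nodes in a hierarchy so it can edit them in place, which is the extra weight the proposed parser_edit.jar would carry.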
From: Derrick O. <Der...@ro...> - 2003-05-03 12:12:22
|
Oops, I should have said that parser_core.jar outputs a stream of undifferentiated *nodes*, not tags.

Derrick Oswald wrote:
<snip>
|