Re: [Htmlparser-developer] configuration items
From: Derrick O. <Der...@ro...> - 2003-05-03 12:12:22
Oops, I should have said that the parser_core.jar outputs a stream of
undifferentiated *nodes*.

Derrick Oswald wrote:

> Since it's a library incorporated within other applications, size is
> always an issue.
> There are two aspects though, disk footprint (jar size) and memory usage.
> Usually, there is a speed/memory usage trade-off to be made, which is
> only sometimes reflected in the disk footprint size.
> With current desktop hardware, people usually trade off memory for speed.
> It's only with embedded or mobile applications that you concentrate on
> disk size and memory consumption.
>
> Regarding your picture, the layers won't necessarily follow the
> current package structure.
> For example, logging is integral to the core parser to report
> problems, and the beans layer removes all HTML tags so it can't be
> used by upper layers. In order to decide the breakdown in layers, a
> poll of users regarding typical use-cases might be in order.
>
> Let's say there are two major groupings:
>
> 1) extraction of all or part of the information on a page to be
>    consumed by another application.
> 2) rewriting URLs, content, specific tags, clean-up, reformatting or
>    pretty printing HTML text
>
> This would suggest three configuration items (jars):
>
> parser_applications.jar - Sample applications, GUI tools, beans, tests
> parser_edit.jar         - Rewriting tools, DOM type hierarchical editing,
>                           visitors, smart tags
> parser_core.jar         - Read-only core parser, stream of
>                           undifferentiated tags
>
> If a program's parser usage involves extraction, it need only use the
> parser_core.jar and pass through the data in a stream-like fashion.
> But if rewriting is in order, they use both parser_core.jar and
> parser_edit.jar and the parser presents the full HTML document as a
> hierarchy of tag specific nodes. All else goes into parser_applications.
>
> We could probably get parser_core.jar below 25KB, or in that range.
>
> Derrick
>
> Somik Raha wrote:
> <snip>
>
>> [1] I find the parser's differentiating factor is its size - time and
>> time again the feedback I've received is that folks love its being
>> below 100K. Size almost directly maps on to simplicity. And that
>> impacts the other important area - performance.
>>
>> [2] I hate to pay for what I don't need - when folks get tons of
>> stuff that they don't need, they are paying for the needs of a few.
>>
>> At the same time, I think it is a challenge to be able to accommodate
>> new requests and still keep the parser light. I see a natural layer
>> forming:
>>
>> ,----------------------------------------.
>> |        Sample Applications, GUI        |
>> |   ,'''''''''''''''''''''''''''''''`.   |
>> |   |       Logging Mechanism        |   |
>> |   |  ,''''''''''''''''''''''''''|  |   |
>> |   |  |          Beans           |  |   |
>> |   |  |  +--------------------b  |  |   |
>> |   |  |  |      Scanners      |  |  |   |
>> |   |  |  |  ,---------------Y |  |  |   |
>> |   |  |  |  |  Core Parser  | |  |  |   |
>> |   |  |  |  `.............../ |  |  |   |
>> |   |  |  L____________________|  |  |   |
>> |   |  |                          |  |   |
>> |   |  '`''''''''''''''''''''''''''  |   |
>> |   |     default, log4j, jdk1.4     |   |
>> |   `................................/   |
>> |________________________________________|
>>
>> If we can perform this separation in the design and the packaging, it
>> might allow people to choose what they need. We don't have to follow
>> the "one size fits all" policy.
>>
>> What are your thoughts? I am not sure how we'd achieve this
>> separation or whether it really makes sense - so please jump in with
>> your two cents.
>>
>> Regards,
>> Somik
>
> <snip>