Thread: [Htmlparser-developer] RE: [Htmlparser-user] version 1.5
From: Marc N. <ma...@ke...> - 2004-02-17 18:16:17
I'm a big fan of server-side transforms. That is, scanning an HTML document and transforming parts of it into custom markup and/or DHTML. I do this using a servlet filter in Tomcat.

I'm currently using an older version of the library (from 08/24/2003) -- before the major code changes were made, mostly because I've been too busy working on other things to port my code to the new APIs. I hope to get to it eventually! :)

However, if you're looking for feedback, then here's what I would find useful in the library. It may or may not already do the following to certain degrees. But if anything in this list can be made easy (or easier), then I'm all for it:

- scan an HTML page for "custom" XML/HTML tags embedded within the HTML
- maintain both the original HTML and the location of the XML "islands" within it
- provide mechanisms to parse different kinds of custom tags, including the following:
  - very simple tags (like <br>)
  - value-only tags (like <a>value</a>)
  - composite tags (like <ul>)
  - tags that contain "anything", which the parser simply skips over (similar to <script>, but even dumber so that all it looks for is the closing tag)
- APIs that allow the definition of the custom tags (above) without having to create a custom scanner and tag class for each one

For illustrative purposes, here's an example of what some of my custom tags look like:

<html>
<body>
<h2>Here is the chart</h2>
<Component name="myChart" incorporates="Chart">
  <String name="backgroundColor" value="white"/>
  <String name="foregroundColor" value="black"/>
  <Number name="width" value="200"/>
  <Number name="height" value="400"/>
  <Reference name="data" value="dataModel"/>
  <Method name="changeSize">
    <Param name="width"/>
    <Param name="height"/>
    <Impl>
      // This is javascript code
      this.width.set(width);
      this.height.set(height);
      this.render();
    </Impl>
  </Method>
</Component>
<hr>
blah blah .... (more HTML) ....
</body>
</html>

Hope this helps!
Marc

-----Original Message-----
From: Derrick Oswald [mailto:Der...@Ro...]
Sent: Tuesday, February 17, 2004 4:40 AM
To: htm...@li...; htm...@li...
Subject: [Htmlparser-user] version 1.5

Now that version 1.4 is nearly put to bed, it's time to look forward into the future to visualize or 'blue sky' the features that could be incorporated in the next version of the parser. There are a small number of feature requests that have accumulated over the last few months that can serve as a starting point:
http://sourceforge.net/tracker/?group_id=24399&atid=381402

But what is really required are some real use-cases that aren't addressed by the current parser, which will lead to real requirements, which lead to real features that can be added to the parser for the next version. What does everyone do with the htmlparser that could be built into it? Or more to the point, what capabilities are lacking that cause a developer to *not* use htmlparser and do it themselves some other way? Does anybody have any ideas? Does anybody have some applications they would like to add to the htmlparser codebase so that 'out-of-the-box' it does what they want? In general, what directions should development take, e.g. HTML correction or editing, XML, robots, server side transforms etc.? Has anybody got some pet peeves they want cleared up? Come on, give it up. Now's the time.

Derrick

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
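The "islands" use case in Marc's list (find custom tags, keep the original HTML, remember where each island sits) can be sketched in plain Java without touching HTMLParser at all. The class and method names below (IslandFinder, Island, find) are hypothetical and are not part of the library. The sketch assumes case-sensitive tag names, no self-closing form of the island tag, and no nesting of the tag inside itself. It also takes the "dumb" approach Marc describes for opaque tags: after the opening tag, it only looks for the closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HTMLParser): locates custom-tag "islands"
// in a page and records their offsets, leaving the original HTML untouched.
class IslandFinder {

    // One island: the tag name plus its character offsets in the original page.
    static final class Island {
        final String tagName;
        final int start; // index of '<' in the opening tag
        final int end;   // index just past '>' of the closing tag

        Island(String tagName, int start, int end) {
            this.tagName = tagName;
            this.start = start;
            this.end = end;
        }
    }

    // Finds each <tagName ...> ... </tagName> span. Everything between the
    // opening and closing tag is skipped without being parsed.
    // Assumes case-sensitive names, no self-closing form, and no nesting of
    // the tag inside itself.
    static List<Island> find(String html, String tagName) {
        List<Island> islands = new ArrayList<>();
        Pattern open = Pattern.compile("<" + Pattern.quote(tagName) + "(\\s[^>]*)?>");
        String close = "</" + tagName + ">";
        Matcher m = open.matcher(html);
        int from = 0;
        while (m.find(from)) {
            int closeAt = html.indexOf(close, m.end());
            if (closeAt < 0) {
                break; // unterminated island: stop rather than guess
            }
            int end = closeAt + close.length();
            islands.add(new Island(tagName, m.start(), end));
            from = end; // resume the search after this island
        }
        return islands;
    }

    public static void main(String[] args) {
        String page = "<html><body><h2>Here is the chart</h2>"
                + "<Component name=\"myChart\"><Impl>if (a < b) render();</Impl></Component>"
                + "<hr></body></html>";
        for (Island island : find(page, "Component")) {
            System.out.println(island.tagName + " at [" + island.start + ", " + island.end + ")");
            System.out.println(page.substring(island.start, island.end));
        }
    }
}
```

Because only offsets are stored, the original page string stays intact, and a servlet filter could splice transformed output in place of each island. A real parser would be needed for nested or malformed markup.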
From: Marc N. <ma...@ke...> - 2004-02-17 18:27:06
Just to clarify -- the library already does most of the things I listed in my previous message (i.e. I've already implemented them using a semi-current version of HTMLParser). However, I'm listing them here so they may be considered as one of the many use cases for the library.

I also want to commend Derrick for all the work he's put into the project!

Marc
From: Alberto N. <alb...@ti...> - 2004-04-21 17:08:14
I have already implemented all of the following suggestions myself. I'd like to read your comments and any ideas for improvement. While waiting for your advice, I will continue testing. I hope these improvements will speed up the process of changing strings after the parser has processed the URL stream.

-------------------------------------------------------------------------------------
package org.htmlparser.util;
Class NodeList

I suggest adding two more methods:

1- public void keepLeaves ()
   Filters out all nodes except leaf nodes. For example, keepLeaves() applied to "<DIV>In The Middle<DIV>Hello World!</DIV></DIV>" removes the top node (containing "In The Middle<DIV>Hello World!</DIV>"), so the only element left in the list is the leaf node (containing "Hello World!").

2- public void keepTopLevel ()
   Filters out all nodes except top-level nodes. For example, keepTopLevel() applied to "<DIV>In The Middle<DIV>Hello World!</DIV></DIV>" removes the leaf node (containing "Hello World!"), so the only element left in the list is the top node (containing "In The Middle<DIV>Hello World!</DIV>").

-------------------------------------------------------------------------------------
package org.htmlparser.util;
Class ParserUtils

I suggest adding methods that perform trim and split operations on a string passed in as a parameter. There would be several variants of each, differing in what they treat as trimming and splitting delimiters: spaces and tabs, digits, tags, or simple characters. These functions could be used to good effect to refine the text inside or outside tags.

-------------------------------------------------------------------------------------
Another interesting improvement is to add the following method:

public static Parser createParserParsingAnInputString (String input) throws ParserException, UnsupportedEncodingException

This method would create a Parser object from an input string. The input string is NOT the href of a file or URL; it is the content stream itself. For example, a meaningful input string could be: "<DIV>Hello World!</DIV>". This method could be added to the Parser class, or to both the Parser and ParserUtils classes.

Hope you like it,
Alberto Nacher
User ID: 892989
Login Name (User Name): anul
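One way to read the keepLeaves() and keepTopLevel() proposal is as a filter over a flat list in which some nodes are nested inside others. The sketch below illustrates that reading with a stand-in SimpleNode class. It does not use the library's NodeList or Node types, the names NodeListSketch and SimpleNode are hypothetical, and unlike the proposed void methods it returns new lists rather than modifying the input. The assumed rule is that a "leaf" is a listed node that contains no other listed node, and a "top-level" node is one that sits inside no other listed node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the proposed filters; not HTMLParser code.
class NodeListSketch {

    // Minimal stand-in for a parsed tag: its text and a link to its parent.
    static final class SimpleNode {
        final String text;
        final SimpleNode parent; // null for a root node

        SimpleNode(String text, SimpleNode parent) {
            this.text = text;
            this.parent = parent;
        }

        // True if 'other' appears somewhere on this node's chain of parents.
        boolean isDescendantOf(SimpleNode other) {
            for (SimpleNode p = parent; p != null; p = p.parent) {
                if (p == other) {
                    return true;
                }
            }
            return false;
        }
    }

    // Keeps only nodes that do not contain any other node in the list.
    static List<SimpleNode> keepLeaves(List<SimpleNode> nodes) {
        List<SimpleNode> kept = new ArrayList<>();
        for (SimpleNode candidate : nodes) {
            boolean containsAnother = false;
            for (SimpleNode other : nodes) {
                if (other.isDescendantOf(candidate)) {
                    containsAnother = true;
                    break;
                }
            }
            if (!containsAnother) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Keeps only nodes that are not nested inside any other node in the list.
    static List<SimpleNode> keepTopLevel(List<SimpleNode> nodes) {
        List<SimpleNode> kept = new ArrayList<>();
        for (SimpleNode candidate : nodes) {
            boolean insideAnother = false;
            for (SimpleNode other : nodes) {
                if (candidate.isDescendantOf(other)) {
                    insideAnother = true;
                    break;
                }
            }
            if (!insideAnother) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Mirrors the example: <DIV>In The Middle<DIV>Hello World!</DIV></DIV>
        SimpleNode outer = new SimpleNode("In The Middle<DIV>Hello World!</DIV>", null);
        SimpleNode inner = new SimpleNode("Hello World!", outer);
        List<SimpleNode> divs = Arrays.asList(outer, inner);

        for (SimpleNode n : keepLeaves(divs)) {
            System.out.println("leaf: " + n.text);
        }
        for (SimpleNode n : keepTopLevel(divs)) {
            System.out.println("top:  " + n.text);
        }
    }
}
```

Both filters compare every pair of nodes, which is fine for the short lists a tag filter typically returns. A node that is neither nested nor containing anything is kept by both.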
From: John M. <jo...@rt...> - 2004-02-17 18:25:38
Custom tags with namespaces would also be a nice feature, a la <rte:body></rte:body>. We use those for marking the text that our Lucene search engine should index. At the moment I am using a simple substring method to parse out the text between these tags, but having htmlparser support them out of the box would make things a lot more efficient for more complex pages with multiple tags.

John

--
John Moylan
----------------------
ePublishing
Radio Telefis Eireann, Montrose House, Donnybrook, Dublin 4, Eire
t:+353 1 2083564
e:joh...@rt...
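For pages with several <rte:body> sections, the substring approach John describes can be generalised with a regular expression that collects every occurrence. This is a minimal sketch in plain Java, not an HTMLParser feature. The class name NamespacedTagText and the method extract are hypothetical, and the sketch assumes case-sensitive tag names and no nesting of the tag inside itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HTMLParser): pulls out the text between
// every pair of namespaced tags such as <rte:body> ... </rte:body>.
class NamespacedTagText {

    // Returns the content of each <qualifiedName ...>...</qualifiedName> pair,
    // in document order. Assumes the tag is never nested inside itself.
    static List<String> extract(String html, String qualifiedName) {
        String q = Pattern.quote(qualifiedName);
        // DOTALL lets the content span several lines; .*? stops at the
        // first closing tag instead of the last one.
        Pattern p = Pattern.compile(
                "<" + q + "(?:\\s[^>]*)?>(.*?)</" + q + "\\s*>",
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        List<String> sections = new ArrayList<>();
        while (m.find()) {
            sections.add(m.group(1));
        }
        return sections;
    }

    public static void main(String[] args) {
        String page = "<html><body><div>menu</div>"
                + "<rte:body>First story</rte:body>"
                + "<div>advert</div>"
                + "<rte:body>Second story</rte:body>"
                + "</body></html>";
        for (String section : extract(page, "rte:body")) {
            System.out.println(section);
        }
    }
}
```

The returned strings could then be handed to the indexer one by one. Markup inside a section is returned as-is, so any inner tags would still need stripping before indexing.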