htmlparser-developer Mailing List for HTML Parser (Page 6)
Brought to you by: derrickoswald
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | | | | | | | | | | (4) | (1) | (4) |
| 2002 | (12) | | (7) | (27) | (14) | (16) | (27) | (74) | (1) | (23) | (12) | (119) |
| 2003 | (31) | (23) | (28) | (59) | (119) | (10) | (3) | (17) | (8) | (38) | (6) | (1) |
| 2004 | (4) | (4) | (1) | (2) | | (7) | (6) | (1) | | | | |
| 2005 | | (1) | | (8) | | | | (2) | (10) | (4) | (15) | |
| 2006 | | (1) | | (4) | (11) | | | | (2) | | | |
| 2007 | (3) | (2) | | (2) | | | (1) | | | | | |
| 2008 | | (1) | | | | | | | (5) | (1) | | |
| 2009 | | (1) | | (2) | | (4) | | (1) | | | | (2) |
| 2010 | (1) | | | (8) | | | | | (6) | | (1) | |
| 2011 | | | | | (3) | | | | | | | |
| 2012 | | | | | (1) | | | | | | | |
| 2014 | | | | | (1) | | | | | | | |
| 2015 | | | | (1) | | (1) | | | | | (2) | (1) |
| 2016 | | | | | | | (2) | | | | (2) | (2) |
From: Marc N. <ma...@ke...> - 2004-02-17 18:27:06
Just to clarify -- the library already does most of the things I list below (i.e. I've already implemented them using a semi-current version of HTMLParser). However, I'm listing them here so they may be considered as one of the many use cases for the library. I also want to commend Derrick for all the work he's put into the project!

Marc
From: John M. <jo...@rt...> - 2004-02-17 18:25:38
Custom tags with namespaces would also be a nice feature, a la <rte:body></rte:body>. We use those for marking the text that our Lucene search engine should index. At the moment I am using a simple substring method to parse out the text between these tags, but having htmlparser support them out of the box would make things a lot more efficient for more complex pages with multiple tags.

John

--
John Moylan
ePublishing, Radio Telefis Eireann, Montrose House, Donnybrook, Dublin 4, Eire
t: +353 1 2083564
e: joh...@rt...
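A minimal sketch, not from the list post, of the substring-free version of what John describes: pulling the text between custom namespaced markers out of a page with the lexer's linear node stream. The rte:body tag name is from his example; the Lexer and node class names follow the lexer package discussed elsewhere on this page and should be treated as assumptions.

```java
import org.htmlparser.Node;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.nodes.StringNode;
import org.htmlparser.lexer.nodes.TagNode;

public class RteBodyExtractor
{
    // Collect the text between <rte:body> and </rte:body>.
    public static String extract (String html) throws Exception
    {
        Lexer lexer = new Lexer (html);
        StringBuffer buffer = new StringBuffer ();
        boolean inside = false;
        for (Node node = lexer.nextNode (); null != node; node = lexer.nextNode ())
            if (node instanceof TagNode)
            {
                TagNode tag = (TagNode)node;
                if ("RTE:BODY".equalsIgnoreCase (tag.getTagName ()))
                    inside = !tag.isEndTag (); // entering on the start tag, leaving on the end tag
            }
            else if (inside && (node instanceof StringNode))
                buffer.append (node.getText ());
        return (buffer.toString ());
    }
}
```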
From: Marc N. <ma...@ke...> - 2004-02-17 18:16:17
I'm a big fan of server-side transforms. That is, scanning an HTML document and transforming parts of it into custom markup and/or DHTML. I do this using a servlet filter in Tomcat.

I'm currently using an older version of the library (from 08/24/2003) -- before the major code changes were made, mostly because I've been too busy working on other things to port my code to the new APIs. I hope to get to it eventually! :)

However, if you're looking for feedback, then here's what I would find useful in the library. It may or may not already do the following to certain degrees, but if anything in this list can be made easy(ier) then I'm all for it:

- scan an HTML page for "custom" XML/HTML tags embedded within the HTML
- maintain both the original HTML and the location of the XML "islands" within it
- provide mechanisms to parse different kinds of custom tags, including the following:
  - very simple tags (like <br>)
  - value-only tags (like <a>value</a>)
  - composite tags (like <ul>)
  - tags that contain "anything", which the parser simply skips over (similar to <script>, but even dumber, so that all it looks for is the closing tag)
- APIs that allow the definition of the custom tags (above) without having to create a custom scanner and tag class for each one

For illustrative purposes, here's an example of what some of my custom tags look like:

    <html>
    <body>
    <h2>Here is the chart</h2>
    <Component name="myChart" incorporates="Chart">
      <String name="backgroundColor" value="white"/>
      <String name="foregroundColor" value="black"/>
      <Number name="width" value="200"/>
      <Number name="height" value="400"/>
      <Reference name="data" value="dataModel"/>
      <Method name="changeSize">
        <Param name="width"/>
        <Param name="height"/>
        <Impl>
          // This is javascript code
          this.width.set(width);
          this.height.set(height);
          this.render();
        </Impl>
      </Method>
    </Component>
    <hr>
    blah blah .... (more HTML) ....
    </body>
    </html>

Hope this helps!
Marc
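Much of this wish list was eventually addressed by the tag-registration scheme described in the posts further down this page. A hedged sketch of how Marc's <Component> tag might be declared without a dedicated scanner; the ComponentTag class is hypothetical, while PrototypicalNodeFactory and CompositeTag are the classes named elsewhere on this page.

```java
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.CompositeTag;

public class ComponentTag extends CompositeTag
{
    // The tag names this class should be created for.
    private static final String[] mIds = new String[] {"COMPONENT"};

    public String[] getIds ()
    {
        return (mIds);
    }
}

// Usage, per the registration idiom shown in the later posts:
//   Parser parser = new Parser ("http://example.com/chart.html");
//   PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
//   factory.registerTag (new ComponentTag ());
//   parser.setNodeFactory (factory);
```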
From: Derrick O. <Der...@Ro...> - 2004-02-17 12:44:14
Now that version 1.4 is nearly put to bed, it's time to look forward into the future to visualize or 'blue sky' the features that could be incorporated in the next version of the parser. There are a small number of feature requests that have accumulated over the last few months that can serve as a starting point:
http://sourceforge.net/tracker/?group_id=24399&atid=381402

But what is really required are some real use cases that aren't addressed by the current parser, which will lead to real requirements, which lead to real features that can be added to the parser for the next version. What does everyone do with the htmlparser that could be built into it? Or, more to the point, what capabilities are lacking that cause a developer to *not* use htmlparser and do it themselves some other way? Does anybody have any ideas? Does anybody have some applications they would like to add to the htmlparser codebase so that 'out-of-the-box' it does what they want? In general, what directions should development take, i.e. HTML correction or editing, XML, robots, server-side transforms etc.? Has anybody got some pet peeves they want cleared up? Come on, give it up. Now's the time.

Derrick
From: Derrick O. <Der...@Ro...> - 2004-01-27 14:13:46
Ayhan,

I think it's a good idea for the parser to always use an English locale when converting to upper case for tag names and attribute names. Thanks for pointing that overload out. I've identified about a dozen places where String.toUpperCase() is performed that should use a locale. I will implement this when the CVS system comes back online. You can track this with bug #883664, "toUpperCase on tag names and attributes depends on locale".

I'm surprised, though, that your experiment didn't also find the end tag, which uses the same mechanism. Can you provide a URL that would illustrate the problem? My suspicion is the title tags are (incorrectly) built with Turkish characters. Perhaps you need to overload the TitleTag to recognize various combinations of English and Turkish characters, i.e. Title, TITLE, T\u0130TLE and T\u0131tle. These would convert to uppercase differently, based on the locale used. So you might define:

    public class TurkishTitleTag extends TitleTag
    {
        /**
         * The set of names handled by this tag.
         */
        private static final String[] mIds = new String[] {"TITLE", "T\u0130TLE", "T\u0131TLE"};

        /**
         * The set of tag names that indicate the end of this tag.
         */
        private static final String[] mEnders = new String[] {"TITLE", "T\u0130TLE", "T\u0131TLE", "BODY"};

        /**
         * Return the set of names handled by this tag.
         * @return The names to be matched that create tags of this type.
         */
        public String[] getIds ()
        {
            return (mIds);
        }

        /**
         * Return the set of tag names that cause this tag to finish.
         * @return The names of following tags that stop further scanning.
         */
        public String[] getEnders ()
        {
            return (mEnders);
        }
    }

Then you would substitute this tag for the normal TitleTag using something like:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
    factory.registerTag (new TurkishTitleTag ());
    parser.setNodeFactory (factory);

That should find all combinations of title, even without the toUpperCase(locale) change -- unless, of course, there are other special Turkish characters that need to be handled.

Derrick

p.s. Can you also provide a URL for the style problem?

Ayhan Peker wrote:
> Derrick,
> I have tried 1.4. It is the same, so I am unable to extract titles if they are lowercase. I have visited the link again regarding the style problem. It is the same too; I am still getting style content with the text. :(
>
> Ayhan
>
> Derrick Oswald <Der...@Ro...> wrote:
>
> Ayhan,
>
> First off, you should probably switch to the 1.4 version, since much has changed since 1.3. You should be able to use any locale when running the parser; if not, I'll try to fix it.
>
> It's likely that the page with the Turkish title should have used <html lang="tr"> (or whatever lang name it is), or the title tag should follow the META tag that sets the character encoding.
>
> Derrick
>
> Ayhan Peker wrote:
>
> > Derrick,
> > Thanks for the quick response.
> >
> > I think I know where the problem is: I set my linux to the Turkish locale. In Turkish there are two 'i's, one with a dot on it and one without a dot, so TITLE IN TURKISH DOES NOT EQUAL TITLE IN ENGLISH. TitleScanner was returning false when "title".toUpperCase().equals("TITLE").
> >
> > I set the default locale at the beginning of the thread to English. It solved the problem. However, I need the Turkish locale for the text (for the database). I tried to modify TitleScanner by adding a Locale ("title".toUpperCase(new Locale("en")).equals("TITLE")). It finds the title but fails to detect the end tag.
> >
> > Is there a way I can get the title fields without setting the default locale to English?
> >
> > Regarding style content coming with the text, have a look at the url: http://www.metu.edu.tr/about/admins.php
> >
> > Thanks
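The dotless-i behaviour Derrick describes is easy to reproduce with nothing but the JDK. A minimal standalone sketch (not from the thread) showing why "title".toUpperCase() fails to match "TITLE" under a Turkish default locale:

```java
import java.util.Locale;

public class TurkishUppercaseDemo
{
    public static void main (String[] args)
    {
        // Turkish has a dotted capital I (U+0130), so 'i' does not
        // uppercase to the ASCII 'I' under the Turkish locale.
        System.out.println ("title".toUpperCase (new Locale ("tr")).equals ("TITLE")); // false
        // Forcing an English locale restores the ASCII behaviour.
        System.out.println ("title".toUpperCase (Locale.ENGLISH).equals ("TITLE"));    // true
    }
}
```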
From: Derrick O. <Der...@Ro...> - 2004-01-26 11:35:18
Although the statistics are really hard to gather from the SourceForge site, I think the news announcement on the front page of SourceForge about the 1.4-20040104 integration release has nearly doubled the usual rate of htmlparser downloads, and shifted the 1.4 version downloads from 30% of total downloads (~200/700) to 60% of total downloads (~800/1300). Surprisingly, 2.5% of users are still pulling versions 1.1 and 1.2.
From: Somik R. <so...@ya...> - 2004-01-07 15:49:13
Hi Derrick,

Great going! Maybe at some point you ought to talk to the SourceForge people to check if this could be the project of the month, and give an interview.

Cheers,
Somik

----- Original Message -----
From: "Derrick Oswald" <Der...@Ro...>
To: <htm...@li...>; <htm...@li...>
Sent: Tuesday, January 06, 2004 5:30 PM
Subject: [general] [Htmlparser-user] HTML Parser makes page 1

> The latest release of HTML Parser (1.4-20040104) has made the front page
> of sourceforge ... http://sourceforge.net/
> ... that might bump up the download stats.
>
> Derrick
From: Derrick O. <Der...@Ro...> - 2004-01-07 01:30:38
The latest release of HTML Parser (1.4-20040104) has made the front page of sourceforge ... http://sourceforge.net/ ... that might bump up the download stats.

Derrick
From: Derrick O. <Der...@Ro...> - 2003-12-08 00:17:01
The 'lexer integration' subject line is wearing a little thin, since it's been a while since the lexer integration issues were complete, so from now on I'll try to label it appropriately.

I've removed the scanners that didn't do anything anymore, leaving the script and jsp scanners. Instead of registering a scanner to enable returning a specific tag, you now add a tag to a new class called PrototypicalNodeFactory. These 'prototype' tags are cloned as needed to be returned from the parser. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour, so you will need to recurse into returned nodes to get at what you want; or, if you want to return only some of the derived tags while keeping most as generic tags and a flatter structure, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

I've changed the operation of toString() for CompositeTags. It now returns an indented listing of children, so the mainline from the Parser looks better.

TODO
====

1.3.1
-----
It looks like there are enough bugs and requests to warrant another 1.3 point release with some patched files. I hate to work on a branch, but it may be the only way to get everyone off my back.

Filters
-------
Implement the new filtering mechanism for NodeList.searchFor().

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace on either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, i.e. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, i.e. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good. A filter builder tool to graphically construct a program to extract a snippet from an HTML page would blow people away.

Applications
------------
Rework all the applications for a better 'out of the box' experience for new and novice users. Fix all the scripts in /bin (for unix and windows) and add any others that don't exist already.

Clean Up
--------
The integration process needs to be revamped to use the $Name: CVS substitution, so a checkin isn't required every integration.

Block/Inline
------------
The tag-enders and end-tag-enders lists are only a partial solution to the HTML specification for block and inline tags. By ensuring block tags don't overlap, a better parsing job could be done, i.e.

    <FORM> .... <TABLE> ... </FORM></TABLE>

would be rearranged as

    <FORM> .... <TABLE> ... </TABLE></FORM>

This needs some design work.
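Since all known tags are now registered by default and fully nested, client code has to walk into children rather than scan a flat list. A hedged sketch of one way to do that with a visitor; the class and method names follow the 1.4-era API mentioned in these posts, so treat the exact signatures and package layout as assumptions.

```java
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.visitors.NodeVisitor;

public class LinkVisitor extends NodeVisitor
{
    // visitTag is called for every tag, at any nesting depth,
    // so no manual recursion into children is needed.
    public void visitTag (Tag tag)
    {
        if (tag instanceof LinkTag)
            System.out.println (((LinkTag)tag).getLink ());
    }

    public static void main (String[] args) throws Exception
    {
        Parser parser = new Parser (args[0]); // URL to scan
        parser.visitAllNodesWith (new LinkVisitor ());
    }
}
```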
From: Derrick O. <Der...@Ro...> - 2003-11-12 03:08:47
Two questions for users and developers...

As part of the refactoring going on, the scanners are being obviated. This means that except for the TagScanner, CompositeTagScanner and ScriptScanner, the scanners package will be empty. Instead of registering scanners, programmers will register tags. These will be cloned as needed to be returned as parsed nodes.

I'm in a position now to remove the registerScanners() method, and I'm wondering if the state of a new Parser shouldn't be preloaded with tags it recognizes. This is directly opposite to the current implementation, where one needs to do a two-phase setup:

    parser = new Parser ();
    parser.registerScanners ();

I've looked at all the code I have available, and in every case (except for unit test cases) the new Parser call is immediately followed by registerScanners.

Question: Should a new parser be already configured and ready to rock with tags registered?

Of course there will be ways to get the original behaviour. After much discussion with Joshua, I've broken out the NodeFactory as a class, so currently this might look like (my unsubmitted codebase):

    parser.setNodeFactory (new PrototypicalNodeFactory (true));

where the boolean indicates the node factory should be constructed empty. Of course, you can add (or remove) whatever specific tags (even your own custom ones) you want to receive:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory (true);
    factory.registerTag (new LinkTag ());
    factory.registerTag (new ImageTag ());
    parser.setNodeFactory (factory);

which could also be written:

    parser.setNodeFactory (new PrototypicalNodeFactory (new Tag[] {new LinkTag (), new ImageTag ()}));

An empty node factory generates undifferentiated tag, string and remark nodes, just like the Lexer.

Then -- I need to know how far I should go.

Question: Should the node factory, and hence the Parser, have *all* the possible tags it knows about registered by default?

This would be the equivalent of the current registerDomScanners() method call, which adds <HTML>, <HEAD> and <BODY> recognition. This may be slightly more problematical, since I can find very few (none?) instances of its use. Realistically, if your program isn't handling recursing into node children now, you are probably doing it wrong, and adding one more level to the node tree won't cause a problem.

My preference is to load it up completely, as it makes for a cleaner design, but if somebody can provide a compelling reason not to, I'll listen. In the absence of responses, I will take the answers as an emphatic Yes and Yes.

Derrick
From: Derrick O. <Der...@Ro...> - 2003-11-08 22:58:08
Please welcome Dereck Carrea to the HTML Parser project. He is a 4th-year computer science student at LYIT in Ireland. Over the last 3 years he has programmed in C, C++, VB, and ASP, and recently began coding in Java. His main software development interests are in creating web applications running both on the desktop and online, particularly web search applications.

Welcome Dereck (even though you spell your name incorrectly).
From: Derrick O. <Der...@Ro...> - 2003-11-08 22:41:14
To replace the string filtering based on constants in the scanner classes, I've implemented generic node filtering, based on a NodeFilter interface. Some example filters have been added to the new filter package to give everyone an idea of how it can be used. This may be pushed down to the lexer level if only a restricted subset of filters is allowed.

Tag-specific scanners are now only used to set up the tags in the prototype list and, except for ScriptTag, the tags now all use one of two common scanners, either a TagScanner or a CompositeTagScanner, that are statically allocated by the tag base classes.

I got rid of the node lookahead in the parser. This was used to determine the character set to use for reading the stream before handing out any erroneous nodes, but with some sleight of hand at the stream/source level we can still hide most of that from the user by performing the character set change in the doSemanticAction() method of the META tag. This means the META tag should always be registered (without it being registered, character sets may be handled erroneously if the HTTP header is incorrect, just as with the Lexer). This change makes the IteratorImpl class much simpler. The old IteratorImpl is moved to PeekingIteratorImpl but deprecated, as is the PeekingIterator interface.

Some side effects:

The mainline of the parser now looks different. Instead of -i, -l etc. switches, the user specifies the node name directly, i.e.:

    java -jar htmlparser.jar org.htmlparser.Parser IMG

and it really works now.

In the past, the parser avoided handling tags like "<a name=target>yadda</a>" because it didn't have an HREF attribute. However, this is valid HTML for a destination anchor from some other location, i.e. <a href="#target">see yadda</a>. This special logic in the LinkScanner is no longer used and will be destroyed when the LinkScanner goes away. This means there is no longer any need for the evaluate() method to be checked before scanning tags (at least there's no reason for it at this time), so it can probably be removed. But, caveat emptor, the parser can now return LinkTags where linktag.getLink() should (and eventually will) return null.

p.s. Is any of this stuff I'm spewing useful? There's very little feedback from anybody.

TODO
====

Remove Scanners
---------------
Finish off obviating the scanners. Think of a good way to group tags so adding one tag to the list of tags to be returned by the parser would add its buddies, i.e. the Form scanner now adds Input, TextArea, Selection and Option scanners behind the scenes for you. Then replace the add, remove, get, etc. scanner methods on the parser with the comparable tag-based ones. Alter all the test cases to use the new methods, move all the unique scanner test cases into tag test cases, and then delete most of the scannersTests package.

Filters
-------
Implement the new filtering mechanism for NodeList.searchFor().

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace on either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, i.e. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, i.e. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good. A filter builder tool to graphically construct a program to extract a snippet from an HTML page would blow people away.

Applications
------------
Rework all the applications for a better 'out of the box' experience for new and novice users. Fix all the scripts in /bin (for unix and windows) and add any others that don't exist already.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted, and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
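For comparison, here is roughly what using the filter mechanism looks like in the API as it eventually shipped. This is a sketch only: TagNameFilter and extractAllNodesThatMatch are from the released filters package and may differ from the in-progress code described above.

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public class ImageExtractor
{
    public static void main (String[] args) throws Exception
    {
        Parser parser = new Parser (args[0]); // URL to scan
        // Collect every IMG tag, at any depth, via a NodeFilter.
        NodeList images = parser.extractAllNodesThatMatch (new TagNameFilter ("IMG"));
        for (int i = 0; i < images.size (); i++)
            System.out.println (images.elementAt (i).toHtml ());
    }
}
```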
From: Derrick O. <Der...@Ro...> - 2003-11-06 04:06:41
OK, almost ready to get rid of most of the scanner package that shadows the tag package. There remains the 'filter' concept to handle, and then all but TagScanner, CompositeTagScanner and ScriptScanner are obsolete.

The tags now own their 'ids', 'enders' and 'end tag enders' lists, and the isTagToBeEndedFor() logic now uses information from the tags, not the scanners. Nodes are created by cloning from a list of prototypes in the Parser (NodeFactory), so the scanners no longer create the tags (but they still create the prototypical ones). Now the startTag() *is* the CompositeTag, and the CompositeTagScanner just adds children to an already differentiated tag. The scanners have no special actions on behalf of tags anymore. Things like the LinkProcessor and form ACTION determination have been moved out of the scanners and into either the Page object or the appropriate tags.

Other changes:

- Made visitor 'node visiting order' the same order as on the page.
- Fixed StringBean, which was still looking for end tags with names starting with a slash, i.e. "/SCRIPT".
- Added some debugging support to the lexer, so you can easily base a breakpoint on a line number in an HTML page.
- Fixed all the tests failing if case sensitivity was turned on. Now ParserTestCase does case-sensitive comparisons.
- Converted native characters in tests to unicode. Mostly this was the division sign (\u00f7) used in tests of character entity reference translation.
- Removed deprecated method calls: elementBegin() is now getStartPosition() and elementEnd() is now getEndPosition(). Also fixed the NodeFactory signatures to have a Page rather than a Lexer.

TODO
====

Filters
-------
Replace the String-to-String comparison of the 'filter' concept with a TagFilter interface:

    boolean accept (Tag tag);

and allow users to perform something like:

    NodeList list = parser.extractAllNodesThatAre (
        new NodeFilter ()
        {
            public boolean accept (Tag tag)
            {
                return (tag.getClass () == LinkTag.class);
            }
        });

And similarly for:

    tag.collectInto (NodeList collectionList, NodeFilter filter);
    nodelist.searchFor (NodeFilter filter);
    parser.parse (NodeFilter filter);

etc.

Remove Scanners
---------------
Finish off obviating the scanners. Think of a good way to group tags so adding one tag to the list of tags to be returned by the parser would add its buddies, i.e. the Form scanner now adds Input, TextArea, Selection and Option scanners behind the scenes for you. Then replace the add, remove, get, etc. scanner methods on the parser with the comparable tag-based ones. Alter all the test cases to use the new methods, move all the unique scanner test cases into tag test cases, and then delete most of the scannersTests package.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace on either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, i.e. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, i.e. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good.

Applications
------------
Rework all the applications for a better 'out of the box' experience for new and novice users. Fix all the scripts in /bin (for unix and windows) and add any others that don't exist already.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted, and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
From: Derrick O. <Der...@Ro...> - 2003-11-03 22:57:10
The parent field points to the enclosing composite tag -- composite tags are *not* returned by the lexer. The lexer produces a linear stream of simple lexemes, without composite structure. You would need to use a parser. That is, in the example <A href="yadda"><IMG href="baffa"></A>, the image tag has the link tag as the parent only for nodes produced by the parser (this would be one node with one child). You could use the same logic as below, but you would need to dig recursively into each node returned to do your checking. If it's always in a table, you need only register the table scanner, so there would be less digging to do, since all other non-table nodes would be just simple nodes (again with no children).

Derrick

du du wrote:

> Hello everyone:
>
> I'd like to locate a specific string in an html page and then process information around it. The whole scenario is:
>
>     <html> <head>...</head>
>     <body><table>
>     <tr><td><p class=tablehead><b>Closing Time</b> </p></td></tr>
>     <tr>.....</tr>
>     </table>
>     </body></html>
>
> In fact, I can locate "Closing Time", as well as its lexer node, and thus I could further locate its parent node or children nodes. But when I use aNode.getParentNode() it always throws a null pointer error. Part of the code:
>
>     Node aNode = lexer.nextNode();
>     Node bNode;
>     while (aNode != null) {
>         if (aNode.getText().indexOf("Closing Time") != -1) {
>             bNode = aNode.getParent();
>             System.out.println("current node=" + bNode.getText());
>         }
>         aNode = lexer.nextNode();
>     }
>
> I'll be very appreciative if somebody could give me help.
>
> henry
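A sketch of the recursive digging Derrick suggests; the helper is hypothetical, not from the post, and getChildren() returning null for leaf nodes is an assumption based on the API of the time.

```java
import org.htmlparser.Node;
import org.htmlparser.util.NodeList;

public class NodeSearch
{
    // Depth-first search of parser output for a node containing the needle.
    static Node findText (NodeList nodes, String needle)
    {
        for (int i = 0; i < nodes.size (); i++)
        {
            Node node = nodes.elementAt (i);
            if ((null != node.getText ()) && (-1 != node.getText ().indexOf (needle)))
                return (node);
            NodeList children = node.getChildren ();
            if (null != children)
            {
                Node hit = findText (children, needle);
                if (null != hit)
                    return (hit); // getParent() works on nodes found this way
            }
        }
        return (null);
    }
}
```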
From: du du <tel...@ya...> - 2003-11-03 19:47:36
Hello everyone:

I'd like to locate a specific string in an html page and then process information around it. The whole scenario is:

    <html> <head>...</head>
    <body><table>
    <tr><td><p class=tablehead><b>Closing Time</b> </p></td></tr>
    <tr>.....</tr>
    </table>
    </body></html>

In fact, I can locate "Closing Time", as well as its lexer node, and thus I could further locate its parent node or children nodes. But when I use aNode.getParentNode() it always throws a null pointer error. Part of the code looks like:

    Node aNode = lexer.nextNode();
    Node bNode;
    while (aNode != null) {
        if (aNode.getText().indexOf("Closing Time") != -1) {
            bNode = aNode.getParent();
            System.out.println("current node=" + bNode.getText());
        }
        aNode = lexer.nextNode();
    }

I'll be very appreciative if somebody could give me help.

henry
From: Derrick O. <Der...@Ro...> - 2003-10-26 16:08:56
Got rid of CompositeTagScannerHelper. Yeaahh!

TODO
====

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst, in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near-duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out.

Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straightforward bean-like manner:

    ret = new Div ();
    ret.setPage (page);
    ret.setStartPosition (start);
    ret.setEndPosition (end);
    ret.setAttributesEx (attributes);
    ret.setStartTag (startTag);
    ret.setEndTag (endTag);
    ret.setChildren (children);

This is nearly always the same in every scanner; only the tag name is different. The oddball cases have been highlighted with a "// special step here..." comment in the code. These special steps mostly revolve around meta-information available in scanners only (i.e. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace on either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, i.e. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

Case Sensitive TestCase
-----------------------
Currently all string comparisons via the ParserTestCase.assertStringsEqual() are case insensitive. This should be turned off by setting ParserTestCase.mCaseInsensitiveComparisons to false, and the tests fixed to accommodate.

toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, i.e. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted, and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
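The prototype-cloning factory described above boils down to something like the following sketch. The Tag stand-in class and method bodies are illustrative only; mBlastocyst is the author's own tongue-in-cheek name for the prototype table.

```java
import java.util.Hashtable;

// Stand-in for the parser's tag type; a real tag would carry its attributes.
class Tag implements Cloneable
{
    public Object clone () throws CloneNotSupportedException
    {
        return (super.clone ());
    }
}

public class PrototypeFactory
{
    // Undifferentiated prototype tags keyed by tag name (the 'mBlastocyst').
    private final Hashtable mBlastocyst = new Hashtable ();

    // What setTagFor ("BODY", new myBodyTag ()) would do on the Parser.
    public void setTagFor (String name, Tag prototype)
    {
        mBlastocyst.put (name.toUpperCase (), prototype);
    }

    // createTagNode clones a registered prototype, falling back to a generic tag.
    public Tag createTagNode (String name) throws CloneNotSupportedException
    {
        Tag prototype = (Tag)mBlastocyst.get (name.toUpperCase ());
        return ((null == prototype) ? new Tag () : (Tag)prototype.clone ());
    }
}
```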
From: Derrick O. <Der...@Ro...> - 2003-10-26 04:29:16
Fixed or avoided the remaining failing unit tests. It's a green bar now, 522 of 522 passing. I shut up all the excess verbiage from the tests, so they're silent too.

TODO
====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close; it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst, in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near-duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out.

Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straightforward bean-like manner:

    ret = new Div ();
    ret.setPage (page);
    ret.setStartPosition (start);
    ret.setEndPosition (end);
    ret.setAttributesEx (attributes);
    ret.setStartTag (startTag);
    ret.setEndTag (endTag);
    ret.setChildren (children);

This is nearly always the same in every scanner; only the tag name is different. The oddball cases have been highlighted with a "// special step here..." comment in the code. These special steps mostly revolve around meta-information available in scanners only (i.e. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace on either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, i.e. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.

Case Sensitive TestCase
-----------------------
Currently all string comparisons via the ParserTestCase.assertStringsEqual() are case insensitive. This should be turned off by setting ParserTestCase.mCaseInsensitiveComparisons to false, and the tests fixed to accommodate.

toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, i.e. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------
Some GUI-based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good.

As you can see there's lots of work to do, so anyone with a death wish can jump in. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted, and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
From: Derrick O. <Der...@Ro...> - 2003-10-25 16:05:23
Made all test suites self-executable by moving the mainline into ParserTestCase. Handled some pathological remark nodes (Netscape handles way more, like everything starting with <!, so it seems). Handled some broken end tags. TAG_ENDERS and END_TAG_ENDERS should be revisited for all scanners. Passes 512 of 522 tests.

TODO
====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close; it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst, in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near-duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out.

Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straightforward bean-like manner:

    ret = new Div ();
    ret.setPage (page);
    ret.setStartPosition (start);
    ret.setEndPosition (end);
    ret.setAttributesEx (attributes);
    ret.setStartTag (startTag);
    ret.setEndTag (endTag);
    ret.setChildren (children);

This is nearly always the same in every scanner; only the tag name is different. The oddball cases have been highlighted with a "// special step here..." comment in the code. These special steps mostly revolve around meta-information available in scanners only (i.e. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list, and committing and notifying the developer list after each of them. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted, and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).
From: Derrick O. <Der...@Ro...> - 2003-10-24 18:18:34
Henry,

Those methods are marked protected. They are used internally by the nextNode() method. They can be made public if there is a compelling reason to.

Derrick

du du wrote:

> I read the Javadoc of Lexer. There are parseRemark(), parseString() & parseTag(), but these 3 methods are not included in the Lexer jar package (htmlparser1_4_20030921.zip). Does anyone else have the same experience?
>
> henry
From: Derrick O. <Der...@Ro...> - 2003-10-24 12:07:02
The Page you have hasn't had any characters read from it yet (its character offset is zero), hence it hasn't asked the source for any characters, hence the source isn't ready yet and has nothing available yet. The ready() method is to check if the next read() may block, so it can't check if there are characters in the stream or URL provided. The getText() method with that signature returns all the text read so far -- in your case, nothing. Try:

    Cursor c = new Cursor (parser_page, 0);
    do
    {
    }
    while (0 != parser_page.getCharacter (c));
    System.out.println ("parser_page.getText()=" + parser_page.getText ());

This is all explained in the javadocs.

Derrick
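Putting Derrick's advice together, a runnable sketch of forcing a Page to buffer its source before calling getText(). The Page(String) constructor and getCharacter(Cursor) signatures are per the lexer javadocs referenced in this thread; treat the exact signatures as assumptions.

```java
import org.htmlparser.lexer.Cursor;
import org.htmlparser.lexer.Page;

public class PageTextDump
{
    public static void main (String[] args) throws Exception
    {
        Page page = new Page ("<html><body>Hello</body></html>");
        Cursor cursor = new Cursor (page, 0);
        // Consume characters so the page buffers them; getText()
        // only returns what has been read so far.
        while (0 != page.getCharacter (cursor))
            ;
        System.out.println ("page.getText()=" + page.getText ());
    }
}
```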
From: du du <tel...@ya...> - 2003-10-24 04:27:56
I read the Javadoc of Lexer. There are parseRemark(), parseString() & parseTag(), but these 3 methods are not included in the Lexer jar package (htmlparser1_4_20030921.zip, http://prdownloads.sourceforge.net/htmlparser/htmlparser1_4_20030921.zip?download). Does anyone else have the same experience?

henry
From: du du <tel...@ya...> - 2003-10-23 06:56:33
Urgent help:

When constructing Page with either

    Page parser_page = new Page (String);

or

    Page parser_page = new Page (URLConnection);

or

    Page parser_page = new Page (InputStream, charset);

I can't get any output from:

    System.out.println ("parser_page.getText()=" + parser_page.getText ());

The result always shows:

    parser_page.getText()=

But when I added:

    Source source = parser_page.getSource ();
    System.out.println ("source.ready()=" + source.ready () + " source.available()=" + source.available ());

the output shows:

    source.ready()=false source.available()=0

Does this mean the Page object is not ready to use? Why?

Thanks a lot,
henry
From: du du <tel...@ya...> - 2003-10-23 05:28:52
Hello,

When constructing Page with either

    Page parser_page = new Page (String);

or

    Page parser_page = new Page (URLConnection);

or

    Page parser_page = new Page (InputStream, charset);

I can't get any output in:

    System.out.println ("parser_page.getText()=" + parser_page.getText ());

The result always shows:

    parser_page.getText()=

But when I added:

    Source source = parser_page.getSource ();
    System.out.println ("source.ready()=" + source.ready () + " source.available()=" + source.available ());

the output shows:

    source.ready()=false source.available()=0

Could anybody give me some hints?

Thanks a lot,
henry
From: Joshua K. <jo...@in...> - 2003-10-22 22:50:26
Derrick Oswald wrote:

> I think the duplication is because the lexer.nodes package nodes don't use the NodeVisitor pattern and the htmlparser package nodes do. The lexer is shipped as a separate jar, so it needs nodes that don't drag in the composite node stuff, which happens if the NodeVisitor signature is included. This may be factored out if we get rid of visitLinkTag, visitImageTag and visitTitleTag from that interface. These may best be handled by direct examination of the node name in the various visitor classes.

Yeah, as I said in another thread, the NodeVisitor ought not to be dependent on scanners (or, in the future, what prototypable tags are present in some collection). That is, it shouldn't have methods on it that visit types which may not be available. So I'm in favor of a simple, narrow NodeVisitor interface -- just letting one visit the basic types.

> The composite tag recursion happens on the scanTagNode method, which does need a lexer, so the create calls can take just a Page, like you say.

Sounds good.

regards
jk
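What Joshua argues for, expressed as an interface, would look something like this sketch. It is illustrative only, not the interface that shipped; the lexer-level node types are per the org.htmlparser.lexer.nodes package mentioned in this thread.

```java
import org.htmlparser.lexer.nodes.RemarkNode;
import org.htmlparser.lexer.nodes.StringNode;
import org.htmlparser.lexer.nodes.TagNode;

// A narrow visitor: only the basic lexeme types, with no methods for
// derived tags like LinkTag or ImageTag that may not be registered.
public interface NarrowNodeVisitor
{
    void visitTag (TagNode tag);
    void visitStringNode (StringNode string);
    void visitRemarkNode (RemarkNode remark);
}
```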
From: Derrick O. <Der...@Ro...> - 2003-10-22 03:31:04
Joshua,

I think the duplication is because the lexer.nodes package nodes don't use the NodeVisitor pattern and the htmlparser package nodes do. The lexer is shipped as a separate jar, so it needs nodes that don't drag in the composite node stuff, which happens if the NodeVisitor signature is included. This may be factored out if we get rid of visitLinkTag, visitImageTag and visitTitleTag from that interface. These may best be handled by direct examination of the node name in the various visitor classes.

The composite tag recursion happens on the scanTagNode method, which does need a lexer, so the create calls can take just a Page, like you say.

Derrick

Joshua Kerievsky wrote:

> Derrick,
>
> Is it me or are there duplicates of the StringNode, RemarkNode, etc. between the org.htmlparser package and the org.htmlparser.lexer.nodes package? I also noticed that the NodeFactory's creation methods take the lexer as an argument, yet *all* of those methods and the methods they call rely on lexer.getPage(). Have you considered simply passing in a page instance rather than a lexer instance? That will work well for some further refactoring I have in mind.
>
> --jk