htmlparser-user Mailing List for HTML Parser (Page 90)
Brought to you by:
derrickoswald
From: Navid H.L. <na...@ya...> - 2002-11-09 19:09:39
|
Hi, I am very new here. How can I use htmlparser? How should I set up the library so that the Java import works? I cannot even compile and run the sample programs. I have the JDK on my computer. Can someone give me basic instructions for doing this? I greatly appreciate your help.

Nav |
From: Somik R. <so...@ya...> - 2002-11-09 18:45:14
|
Hi Folks,

Candidate Release 2 is out. Changes are:
[1] Updated javadoc
[2] Added support for multiple calls to elements() [sequentially, not in parallel]

The latter means you can complete one round of parsing and then make another call to HTMLParser.elements() to begin another round, without needing to recreate the parser object.

You can get it from http://htmlparser.sourceforge.net. Your feedback is awaited.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-11-09 05:14:13
|
Hi Steve,

> I had expected the 2nd call to htmlEnum to give me a new enumeration
> which I could step through, but the method htmlEnum.hasMoreNodes()
> returns FALSE.

This is because the I/O connections are established when you create the HTMLParser object, and running through the elements also advances the internal cursor. A quick solution for you is to re-create the parser object.

However, it would be good to support multiple (sequential) iterators. Do you have time to look into the source? You will only need to work with HTMLParser.java. We would be glad to put this in for the production release.

Regards,
Somik |
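The cursor behaviour described above can be illustrated without the parser itself. The following is only a minimal sketch: OneShotSourceSketch and remainingLines() are hypothetical names, not part of the HTMLParser API, and a StringReader stands in for the network connection. It shows why a second pass over the same object comes back empty and why building a fresh object gives a full second pass.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser that reads from a one-way stream.
// It is NOT the HTMLParser class; it only mimics the cursor behaviour.
class OneShotSourceSketch {
    private final BufferedReader reader;

    OneShotSourceSketch(String document) {
        // The "connection" is opened once, in the constructor.
        this.reader = new BufferedReader(new StringReader(document));
    }

    // Returns whatever is left on the stream; the cursor is never rewound.
    List<String> remainingLines() {
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        String doc = "<html>\n<body>\n</body>\n</html>";
        OneShotSourceSketch source = new OneShotSourceSketch(doc);
        System.out.println("first pass:  " + source.remainingLines().size());
        System.out.println("second pass: " + source.remainingLines().size());
        // Workaround from the reply above: build a fresh object for the second pass.
        System.out.println("fresh object: " + new OneShotSourceSketch(doc).remainingLines().size());
    }
}
```

With the real library the equivalent workaround would be to construct a second HTMLParser for the second pass, as suggested in the reply.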
From: Stephen H. <Ste...@tr...> - 2002-11-08 22:46:19
|
I hope this isn't a stupid question, but here it goes anyway.....

I have some code where I have to parse through an entire document, then based on what I find, parse through it again. Here is some code:

    parser = new HTMLParser(url2parse, new DefaultHTMLParserFeedback());
    parser.registerScanners();
    htmlEnum = parser.elements();
    while (htmlEnum.hasMoreNodes()) {
        // insert processing code here
    }
    htmlEnum = parser.elements();
    while (htmlEnum.hasMoreNodes()) {
        // insert secondary processing here
    }

I had expected the 2nd call to htmlEnum to give me a new enumeration which I could step through, but the method htmlEnum.hasMoreNodes() returns FALSE.

What am I missing?

Thanks,
--stephen harrington |
From: Somik R. <so...@ya...> - 2002-10-31 12:26:09
|
Hi Folks,

HTMLParser 20021031 (C1) is out. This is candidate release 1. If there are no issues, then this will become a production release.

There are bug fixes in this release, and some improvements. The most important improvement is allowing renderers to be plugged in, so that the behavior of toHTML() can be customized. Check the javadoc of com.kizna.html.HTMLNode. Being able to modify the output of toHTML() has been a recurring request, especially from designers of web crawlers who want to change a link before saving it.

Thanks to Kaarle Kaila for the bug fix in HTMLParameterParser. Thanks to Domenico Lordi for improvements to HTMLLinkScanner and HTMLLinkTag.

Here is the change log:

Integration Build 1.2 - 20021031
[1] Changed string creation to static strings in HTMLTagParser
[2] HTMLLinkProcessor can handle urls beginning with file:// (bug fix - 629601)
[3] All scanners get the feedback object initialized from HTMLParser
[4] Fixed bug 624045 (in HTMLParameterParser) - erroneous space key removed
[5] Added HTMLRenderer and external rendering support in HTMLNode.
[6] Line no and details incorporated for feedback and exceptions
[7] HTMLLinkProcessor: "javascript:" recognition
[8] HTMLLinkScanner: added flags for javascript, ftp, http, https
[9] HTMLLinkTag: constructor for new flags, methods isJavascriptLink, setJavascriptLink, etc...

Please visit http://htmlparser.sourceforge.net to download this release.

<<Next step>>
As far as architecture is concerned, I think this is it. The feedback mechanism has been more or less integrated, though we're not using the info method at all.
Claude -- your help in doing a review on this issue would be highly appreciated.
Dhaval -- Have all the issues that you raised been fixed?
Annette -- Can you give your feedback on the HTMLRenderer and whether it is useful for your project?
If anyone has any issues, please raise them now, or forever hold your peace..

<<Need Help>>
In order that this may be a truly professional product, it would be highly appreciated if the members of the user and developer lists contributed a small portion of their time to finalizing this production release.

These are the areas where you can help:
[1] Test the release and please report bugs WITH your names (please sign in at sourceforge before you file your bug reports)
[2] Check the javadocs - quality control - if anything is missing, please update, and check in.
[3] Write articles - based on applications you have written, which we can put up for others to read. Articles could cover design areas, performance, scalability, etc..
[4] Be active on the htmlparser-user mailing list to help others in the community
[5] Send a testimonial which we can put up to show that open source software really can achieve professional targets (send this to so...@ya...)

Of course, please do any of the above only if you have benefitted from this project in any way.

Thank you very much, and awaiting your feedback.

Regards,
Somik |
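The pluggable-renderer idea in the announcement above can be sketched as follows. This is an illustration only: LinkNode, setHrefRenderer and the use of UnaryOperator are assumptions made for the example, not the actual HTMLRenderer interface, whose real signature is documented in the javadoc of com.kizna.html.HTMLNode. The sketch shows the crawler use case mentioned above, where a link is rewritten at the moment toHTML() is called.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of "plug in a renderer to customize toHTML()".
// None of these names come from the HTMLParser library.
class RendererSketch {

    // A link node that delegates the rendering of its href to a pluggable function.
    static final class LinkNode {
        private final String href;
        private final String text;
        // Default renderer leaves the link untouched.
        private UnaryOperator<String> hrefRenderer = UnaryOperator.identity();

        LinkNode(String href, String text) {
            this.href = href;
            this.text = text;
        }

        void setHrefRenderer(UnaryOperator<String> renderer) {
            this.hrefRenderer = renderer;
        }

        String toHTML() {
            return "<A HREF=\"" + hrefRenderer.apply(href) + "\">" + text + "</A>";
        }
    }

    public static void main(String[] args) {
        LinkNode link = new LinkNode("http://www.example.com/page.html", "Example");
        System.out.println(link.toHTML());
        // A crawler saving pages locally might rewrite the link before output.
        link.setHrefRenderer(href -> href.replace("http://www.example.com/", "mirror/"));
        System.out.println(link.toHTML());
    }
}
```

The design point is that the node keeps its parsed data unchanged and only the output step is customized, so the same node can be rendered differently by different callers.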
From: Somik R. <so...@ya...> - 2002-10-16 10:59:16
|
Hi Folks,

Integration release 1.2-20021016 is out. You can get it from http://htmlparser.sourceforge.net

Here's the change log:

Integration Build 1.2 - 20021016
--------------------------------
[1] Fixed bug 621117 - JSP tags not recognized if within string node
[2] Fixed bug 617228 - Links with > symbol in query strings were not being recognized.
[3] build.xml completely automatic - no manual changes needed before running
[4] build.xml included in release package, inside src.zip
[5] Refactored HTMLTag - design modified, introduced HTMLTagParser helper class
[6] Optimized scanning process - 20% faster now

There have been some refactorings and optimizations in this release. Most notably, the scanners are not enumerated sequentially anymore. Instead, they are stored inside hashtables, and are identified by the first word that occurs in a tag (in uppercase). We now have a default implementation of evaluate() which returns true, and most of the scanners don't override it if their evaluation is simply based on matching the first word. However, if the matching logic is complex, then evaluate() should be overridden. An additional method has been introduced in HTMLTagScanner which all scanners have to override: getID(). It is used to register the scanner in the hashtable (called only once) inside addScanner(). In addition, feedback is being incorporated - you will see feedback if you run the testcases.

The performance improvement is substantial. Running com.kizna.htmlTests.PerformanceTest.java, I could see a reduction of 500 ms (with all scanners registered), from 2500 ms to 2000 ms (run on the MySQL installation guide page).

For developers (or folks who want to join) - the build script has been included in the distribution (it is a whole lot more powerful now - autodetects code version, etc..). Making your package ready for distribution is exceedingly simple now - so do go ahead and explore.

Regards,
Somik |
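The lookup scheme described in this release note can be sketched roughly as below. This is a self-contained illustration under stated assumptions, not the library's code: ScannerRegistrySketch, TagScanner and find() are hypothetical names, and a HashMap is used where the note mentions hashtables. Only the ideas of getID(), a default evaluate() that returns true, and registration via addScanner() are taken from the message.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of "scanners keyed by the first word of the tag".
class ScannerRegistrySketch {

    interface TagScanner {
        // Key under which the scanner is registered, e.g. "A" or "IMG".
        String getID();

        // Default: matching the first word is enough. Override for complex logic.
        default boolean evaluate(String tagContents) {
            return true;
        }
    }

    private final Map<String, TagScanner> scanners = new HashMap<>();

    // getID() is consulted once, at registration time.
    void addScanner(TagScanner scanner) {
        scanners.put(scanner.getID().toUpperCase(Locale.ROOT), scanner);
    }

    // Constant-time lookup by first word, then optional confirmation via evaluate().
    TagScanner find(String tagContents) {
        String trimmed = tagContents.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String firstWord = trimmed.split("\\s+", 2)[0].toUpperCase(Locale.ROOT);
        TagScanner candidate = scanners.get(firstWord);
        return (candidate != null && candidate.evaluate(tagContents)) ? candidate : null;
    }

    public static void main(String[] args) {
        ScannerRegistrySketch registry = new ScannerRegistrySketch();
        registry.addScanner(() -> "A"); // relies on the default evaluate()
        System.out.println(registry.find("a href=\"index.html\"") != null);
        System.out.println(registry.find("table border=1") != null);
    }
}
```

Compared with asking every registered scanner in turn, the map lookup does not slow down as more scanners are added, which is the O(n) to O(1) change discussed elsewhere in this thread.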
From: Somik R. <so...@ya...> - 2002-10-16 02:36:50
|
Oh :) I was referring to the bug report on asp tags. Almost done now. I am working on more powerful build scripts, which should be a part of the next release as well.

By the way, I was thinking about one last design optimization: replacing the scan-evaluate mechanism with a hashtable, so as to reduce the scanner search from O(n) to O(1). I was thinking of doing a basic match initially to trigger a call to the relevant scan for confirmation. What do you think?

Regards,
Somik

----- Original Message -----
From: "Claude Duguay" <CD...@ar...>
To: <htm...@li...>
Sent: Tuesday, October 15, 2002 7:59 PM
Subject: RE: [Htmlparser-user] Question

> Actually, I did not report the bug, other than by email. I presume you are referring to the 'nobody' bug? Of course, if you fixed it I'm grateful, as always. Thanks.
>
> -----Original Message-----
> From: Somik Raha [mailto:so...@ya...]
> Sent: Tue 10/15/2002 2:53 AM
> To: htm...@li...
> Cc:
> Subject: Re: [Htmlparser-user] Question
>
> Hi Claude,
> So that bug report on the site was yours...
> The bug has been reproduced and fixed (asp tags inside string nodes not
> being detected).
> Should be out in the next release.
>
> Regards,
> Somik
> ----- Original Message -----
> From: "Claude Duguay" <CD...@ar...>
> To: <htm...@li...>
> Sent: Monday, September 30, 2002 10:17 PM
> Subject: [Htmlparser-user] Question
>
> I've seen the parser throw an exception on ASP pages. Strictly speaking,
> it isn't a requirement that JSP or ASP be parsable, but I think the only
> hangup is really a failure to recognize the "<% ... %>" pattern. Is this
> functionality that should be there or would be easy to add? |
From: Claude D. <CD...@ar...> - 2002-10-15 14:29:34
|
Actually, I did not report the bug, other than by email. I presume you are referring to the 'nobody' bug? Of course, if you fixed it I'm grateful, as always. Thanks.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Tue 10/15/2002 2:53 AM
To: htm...@li...
Cc:
Subject: Re: [Htmlparser-user] Question

Hi Claude,
So that bug report on the site was yours...
The bug has been reproduced and fixed (asp tags inside string nodes not being detected).
Should be out in the next release.

Regards,
Somik
----- Original Message -----
From: "Claude Duguay" <CD...@ar...>
To: <htm...@li...>
Sent: Monday, September 30, 2002 10:17 PM
Subject: [Htmlparser-user] Question

I've see the parser throw an exception on ASP pages. Strictly speaking, it isn't a requirement that JSP or ASP be parsable, but I think the only hangup is really a failure to recognize the "<% ... %>" pattern. Is this functionality that should be there or would be easy to add? |
From: Somik R. <so...@ya...> - 2002-10-15 09:56:03
|
Hi Claude,

So that bug report on the site was yours... The bug has been reproduced and fixed (asp tags inside string nodes not being detected). Should be out in the next release.

Regards,
Somik

----- Original Message -----
From: "Claude Duguay" <CD...@ar...>
To: <htm...@li...>
Sent: Monday, September 30, 2002 10:17 PM
Subject: [Htmlparser-user] Question

I've seen the parser throw an exception on ASP pages. Strictly speaking, it isn't a requirement that JSP or ASP be parsable, but I think the only hangup is really a failure to recognize the "<% ... %>" pattern. Is this functionality that should be there or would be easy to add? |
From: Somik R. <so...@ya...> - 2002-10-15 08:09:32
|
Hi,

This is in response to the bug report (by nobody):

"If there are some special characters(we found a problem with <) within HTML comments then all lines upto that line(on which the charcter is present) gets deleted when you reprint the tag(using toHTML()). I have been using Node.toHTML() and I am assuming that the tag will get parsed as a HTMLRemarkNode and its toHTML() will get called. Whatever the case the output is distinctly different from the input. Even the starting HTML comments i.e. <!-- get deleted."
** End of Report

I couldn't reproduce this. Make sure you are using the latest Integration release. HTMLRemarkNodeTest.testTagWithinRemarkNode() checks for this bug, and it is passing. Also, please don't submit reports as nobody, because then the response won't reach you.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-10-02 03:37:46
|
Hi Folks,

I decided to make an early release as I am going to hit the road yet again from tomorrow. Here's the change log for you.

Integration Build 1.2 - 20021002
--------------------------------
[1] Bug fixed in HTMLStyleScanner, allowing attributes to show in style tag
[2] HTMLFormTag bug fixed - allows changing of internal form attributes, and is reflected in toHTML()
[3] HTMLFormScanner included in lineup of standard scanners
[4] Added default parser feedback object constructor in HTMLParser
[5] Only single LinkProcessor object created now for link and image scanners - leading to better performance
[6] Added Base Ref Scanner - relative links to this can be handled for both standard links and image locations

As you can see, all of this is based on community feedback, so please keep the feedback coming. I'm sorry for the delay in releases. I was swamped: I just resigned from my job, and am on vacation for a while. I should be able to continue work on the parser from the 12th. In the meanwhile, if anyone else wishes to work on adding stuff, please feel free.

Areas of work left for production release 1.2:
[1] Checking testcases on Linux - Can someone just run com.kizna.html.AllTests and give us a report of all the tests that failed?
[2] Checking bug reports from Joe Ryburn on the user group "A linkScanner issue" and Stephen Harrington "Problem Parsing a link"
[3] Adding functionality for the parser feedback object - we've got to give some feedback for the logs, etc..

These are simple changes - I hope we can have some help.

Regards,
Somik |
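Item [6] above concerns resolving relative links and image locations against a page's base reference. How the Base Ref Scanner does this internally is not stated in the message; the sketch below only shows the general resolution step using the standard java.net.URI class. BaseHrefSketch and the example.com addresses are made up for illustration.

```java
import java.net.URI;

// Sketch of resolving a link against a <BASE HREF="..."> value.
// Uses only java.net.URI; it does not reflect the library's own implementation.
class BaseHrefSketch {

    // Returns the absolute form of 'link' relative to 'baseHref'.
    static String resolve(String baseHref, String link) {
        return URI.create(baseHref).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/docs/index.html";
        System.out.println(resolve(base, "images/logo.gif"));
        System.out.println(resolve(base, "../about.html"));
        // An already absolute link is returned unchanged.
        System.out.println(resolve(base, "http://other.example.org/x.html"));
    }
}
```

Note that URI.create rejects strings containing illegal characters such as spaces or a raw '>', so real-world links may need cleaning before this step.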
From: Somik R. <so...@ya...> - 2002-10-01 07:56:09
|
Hi Stephen,

Could you open a bug report on the site? We'll look into this.

Regards,
Somik

----- Original Message -----
From: "Stephen Harrington" <Ste...@tr...>
To: <htm...@li...>
Sent: Monday, September 30, 2002 9:51 PM
Subject: [Htmlparser-user] Not my link....

> Dhaval posted:
>
> Hi Stephen,
>
> Using '>' in your link is incorrect since it is a special character in
> HTML. What I am saying is that your code is incorrect. Your query string
> must be escaped before sending to the server. You must use the hex
> representation of > in your link, i.e. %3E.
>
> This should not only make your HTML correct but will solve your parsing
> problem as well.
>
> Regards,
>
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-8290019 Extn. 1457
>
> ---------------------------------------------------------------------
>
> Unfortunately I am not the author of the page I am parsing. BTW, this
> was NOT a problem in the last production build. It only arose for me
> when I went to the latest integration build.
>
> So if I understand your comment, if the link were URL encoded before
> getting sent to the parser this would not be a problem?
>
> --stephen |
From: Somik R. <so...@ya...> - 2002-10-01 07:55:28
|
Claude Duguay wrote : >You might consider having constructor variants that use the >DefaultHTMLParserFeedback (by default ;-) so that users don't get >confused. Good idea :). Will do for the next integration release. Cheers, Somik |
From: <dha...@or...> - 2002-10-01 04:32:54
|
Yes Stephen, that is true. Any URL being sent from HTML should be encoded as good programming practice. In your case the particular link itself is incorrect.

-----Original Message-----
From: Stephen.Harrington [mailto:Ste...@tr...]
Sent: Monday, September 30, 2002 9:52 PM
To: htmlparser-user
Subject: [Htmlparser-user] Not my link....

Dhaval posted:

Hi Stephen,

Using '>' in your link is incorrect since it is a special character in HTML. What I am saying is that your code is incorrect. Your query string must be escaped before sending to the server. You must use the hex representation of > in your link, i.e. %3E.

This should not only make your HTML correct but will solve your parsing problem as well.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

---------------------------------------------------------------------

Unfortunately I am not the author of the page I am parsing. BTW, this was NOT a problem in the last production build. It only arose for me when I went to the latest integration build.

So if I understand your comment, if the link were URL encoded before getting sent to the parser this would not be a problem?

--stephen |
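The encoding advice above (writing '>' as %3E in a query string) can be tried with the standard java.net.URLEncoder class. This is a small sketch: QueryEncodeSketch and encodeQueryValue are names made up here, and the sample value is taken from the link discussed in this thread. URLEncoder applies form encoding, so it is meant for individual query values rather than whole URLs, and it turns spaces into '+'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: percent-encoding a single query-string value before placing it in a link.
class QueryEncodeSketch {

    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "postdate>20020701";
        // Only the value is encoded; the "query_text=" part and any '&' separators stay as they are.
        System.out.println("/cgi-bin/view_search?query_text=" + encodeQueryValue(raw));
    }
}
```

This only helps when you control the page being generated; as noted in the thread, the page here came from a third party.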
From: Somik R. <so...@ya...> - 2002-10-01 02:05:18
|
--- Claude Duguay <CD...@ar...> wrote:
> I've seen the parser throw an exception on ASP pages.
> Strictly speaking, it isn't a requirement that JSP or ASP be parsable,
> but I think the only hangup is really a failure to recognize the
> "<% ... %>" pattern. Is this functionality that should be there or
> would be easy to add?

That's strange. We do support ASP/JSP tags. Can you file a bug report?

Regards,
Somik |
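For readers unfamiliar with the "<% ... %>" pattern under discussion, the sketch below locates such server-side blocks in a page with a regular expression. It is an illustration only and makes no claim about how the parser's own ASP/JSP support works; ServerTagSketch and findServerTags are hypothetical names. A plain regex like this would be fooled by a "%>" that appears inside a string literal in the script, which is one reason a real scanner needs more care.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: find "<% ... %>" blocks (ASP/JSP style) in page text.
class ServerTagSketch {

    // Reluctant match so each block ends at the first "%>"; DOTALL lets blocks span lines.
    private static final Pattern SERVER_TAG = Pattern.compile("<%.*?%>", Pattern.DOTALL);

    static List<String> findServerTags(String page) {
        List<String> found = new ArrayList<>();
        Matcher matcher = SERVER_TAG.matcher(page);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<p><% x = 1 %></p>\n<b><%= y %></b>";
        for (String tag : findServerTags(page)) {
            System.out.println(tag);
        }
    }
}
```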
From: Claude D. <CD...@ar...> - 2002-09-30 16:47:19
|
I've seen the parser throw an exception on ASP pages. Strictly speaking, it isn't a requirement that JSP or ASP be parsable, but I think the only hangup is really a failure to recognize the "<% ... %>" pattern. Is this functionality that should be there or would be easy to add? |
From: Stephen H. <Ste...@tr...> - 2002-09-30 16:25:14
|
Dhaval posted:

Hi Stephen,

Using '>' in your link is incorrect since it is a special character in HTML. What I am saying is that your code is incorrect. Your query string must be escaped before sending to the server. You must use the hex representation of > in your link, i.e. %3E.

This should not only make your HTML correct but will solve your parsing problem as well.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

---------------------------------------------------------------------

Unfortunately I am not the author of the page I am parsing. BTW, this was NOT a problem in the last production build. It only arose for me when I went to the latest integration build.

So if I understand your comment, if the link were URL encoded before getting sent to the parser this would not be a problem?

--stephen |
From: Claude D. <CD...@ar...> - 2002-09-30 15:43:59
|
You might consider having constructor variants that use the DefaultHTMLParserFeedback (by default ;-) so that users don't get confused.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Monday, September 30, 2002 3:17 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Hi All

Hi Drew,

> I have a doubt...I am trying to extract only the text from the
> html pages..But i just could not get it..I have seen the
> HTMLStringFilter.java..But I could not add it to the existing
> ones and run..because in that the HTML parser has only one argument
> passed whereas others even have the feedback...and also if it
> should work what feedback do we give...I mean a (T or i or s or l)
> And i guess the jar file does not have code for extracting the
> text..

Sorry about that - the web page hasn't been updated for a while. You will need to create a feedback object. If you don't need feedback from the parser, use the default one that we've provided in the com.kizna.html.util package.

Try this:

    /*
     * Below is some sample code to parse Yahoo.com and print only the text
     * information. This scanning will run faster, as there are no scanners
     * registered here.
     */
    HTMLParser parser = new HTMLParser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
    // In this example, none of the scanners need to be registered,
    // as a string node is not a tag to be scanned for.
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        HTMLNode node = (HTMLNode) e.nextElement();
        if (node instanceof HTMLStringNode) {
            HTMLStringNode stringNode = (HTMLStringNode) node;
            System.out.println(stringNode.getText());
        }
    }

Let us know if you still face problems.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-09-30 10:17:28
|
Hi Drew,

> I have a doubt...I am trying to extract only the text from the
> html pages..But i just could not get it..I have seen the
> HTMLStringFilter.java..But I could not add it to the existing
> ones and run..because in that the HTML parser has only one argument
> passed whereas others even have the feedback...and also if it
> should work what feedback do we give...I mean a (T or i or s or l)
> And i guess the jar file does not have code for extracting the
> text..

Sorry about that - the web page hasn't been updated for a while. You will need to create a feedback object. If you don't need feedback from the parser, use the default one that we've provided in the com.kizna.html.util package.

Try this:

    /*
     * Below is some sample code to parse Yahoo.com and print only the text
     * information. This scanning will run faster, as there are no scanners
     * registered here.
     */
    HTMLParser parser = new HTMLParser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
    // In this example, none of the scanners need to be registered,
    // as a string node is not a tag to be scanned for.
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        HTMLNode node = (HTMLNode) e.nextElement();
        if (node instanceof HTMLStringNode) {
            HTMLStringNode stringNode = (HTMLStringNode) node;
            System.out.println(stringNode.getText());
        }
    }

Let us know if you still face problems.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-09-30 09:56:57
|
Hi Joe,

This is fixed. You should be able to do this now. HTMLFormScanner is now enabled - it is in the standard list of scanners, so you no longer need to add it specifically. You can check out the latest code from CVS. This will be out in the next integration release (coming Sunday).

Regards,
Somik

----- Original Message -----
From: Joe Ryburn
To: htm...@li...
Sent: Thursday, September 19, 2002 9:55 PM
Subject: [Htmlparser-user] Modify Form Action?

Is there any way to modify the formAction tag? I tried passing a modified URL to formTag.setFormLocation() but this new location isn't being output in the toHTML() conversion.

Regards,
Joe Ryburn
Technical Director
Lead Router LLC
Office 501-221-8865
Mobile 501-249-5015 |
From: Somik R. <so...@ya...> - 2002-09-30 09:02:00
|
Hi Dhaval,

This is fixed now. Will be in the next integration release. By the way, do put in bug reports on the site.

Regards,
Somik

----- Original Message -----
From: <dha...@or...>
To: <htm...@li...>
Sent: Saturday, September 28, 2002 2:38 PM
Subject: [Htmlparser-user] Parsing STYLE tag

The following tag:

    <STYLE type="text/css"> <!-- {somethign....something} --> </STLYE>

when parsed through the HTML parser and printed once again using toHTML() results in the following:

    <STYLE> <!-- {something...something} --> </STYLE>

As seen, the type attribute is disappearing. I think some change on that front is required.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

-----Original Message-----
From: Udani, Dhaval H.
Sent: Wednesday, September 18, 2002 1:25 PM
To: htmlparser-user
Cc: Udani, Dhaval H.
Subject: RE: [Htmlparser-user] Parsing 'Base' tag ....

Hi Somik,

Just to update the below-mentioned list with a bug I had reported earlier:

If there are some special characters (we found a problem with <) within HTML comments, then all lines up to the line on which the character is present get deleted when you reprint the tag (using toHTML()). I have been using Node.toHTML() and I am assuming that the tag will get parsed as a HTMLRemarkNode and its toHTML() will get called. Whatever the case, the output is distinctly different from the input. Even the starting HTML comment marker, i.e. <!--, gets deleted.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Wednesday, September 18, 2002 11:55 AM
To: htmlparser-user
Cc: somik
Subject: Re: [Htmlparser-user] Parsing 'Base' tag ....

Hi Joe,

Thanks for bringing this up - it's been on my mind for a while. We should handle this before making the production release.

As of now, we have a couple of issues to sort out:
[1] Ensure all testcases pass on Linux
[2] Look into Dhaval's reports of modification of representation of tags
[3] Handle Base Tags as a special case of the link tag
[4] Add functionality to the feedback API

I might be able to spend some time on these from next week. But any help in any of these is highly appreciated.

Regards,
Somik |
From: Nancy D. <nan...@re...> - 2002-09-29 20:44:42
|
Hi all,

I am new to this mailing list... I have a doubt. I am trying to extract only the text from the html pages, but I just could not get it. I have seen HTMLStringFilter.java, but I could not add it to the existing ones and run it, because in that file the HTML parser has only one argument passed, whereas the others also have the feedback. Also, if it should work, what feedback do we give? I mean a (T or i or s or l). And I guess the jar file does not have code for extracting the text.

It would be of great help to me if someone could let me know.

Thanks,
Drew |
From: <dha...@or...> - 2002-09-28 09:08:42
|
The following tag:

    <STYLE type="text/css"> <!-- {somethign....something} --> </STLYE>

when parsed through the HTML parser and printed once again using toHTML() results in the following:

    <STYLE> <!-- {something...something} --> </STYLE>

As seen, the type attribute is disappearing. I think some change on that front is required.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

-----Original Message-----
From: Udani, Dhaval H.
Sent: Wednesday, September 18, 2002 1:25 PM
To: htmlparser-user
Cc: Udani, Dhaval H.
Subject: RE: [Htmlparser-user] Parsing 'Base' tag ....

Hi Somik,

Just to update the below-mentioned list with a bug I had reported earlier:

If there are some special characters (we found a problem with <) within HTML comments, then all lines up to the line on which the character is present get deleted when you reprint the tag (using toHTML()). I have been using Node.toHTML() and I am assuming that the tag will get parsed as a HTMLRemarkNode and its toHTML() will get called. Whatever the case, the output is distinctly different from the input. Even the starting HTML comment marker, i.e. <!--, gets deleted.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Wednesday, September 18, 2002 11:55 AM
To: htmlparser-user
Cc: somik
Subject: Re: [Htmlparser-user] Parsing 'Base' tag ....

Hi Joe,

Thanks for bringing this up - it's been on my mind for a while. We should handle this before making the production release.

As of now, we have a couple of issues to sort out:
[1] Ensure all testcases pass on Linux
[2] Look into Dhaval's reports of modification of representation of tags
[3] Handle Base Tags as a special case of the link tag
[4] Add functionality to the feedback API

I might be able to spend some time on these from next week. But any help in any of these is highly appreciated.

Regards,
Somik |
From: <dha...@or...> - 2002-09-28 06:00:21
|
Hi Stephen,

Using '>' in your link is incorrect, since it is a special character in HTML. What I am saying is that your code is incorrect: your query string must be escaped before it is sent to the server. You must use the hex representation of > in your link, i.e. %3E. This should not only make your HTML correct but also solve your parsing problem.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

-----Original Message-----
From: Stephen.Harrington [mailto:Ste...@tr...]
Sent: Friday, September 27, 2002 11:08 PM
To: Stephen.Harrington; htmlparser-user
Subject: [Htmlparser-user] Problem parsing a link

I have a simple document which I am trying to parse a link out of. Here is the code:

<html>
<body>
<DL>
<DT>YOUR QUERY WAS:
</DL>
Select one of the following documents to retrieve.
<P>
<HR>
<P><DL>
<DT><B>1:</B> <!-- hit --><A HREF="/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report 1.html">20020702 Report 1</A>
<DD><font size="-1">Score: 1000, Size: 7.4 kbytes, Type: URL file</font>
</DL>
</body>
</html>

The parser is getting confused by the '>' after the postdate. Instead of returning the whole link:

http://localhost/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report 1.html

only a portion of the link is returned:

http://localhost/cgi-bin/view_search?query_text

If the 'postdate>' is replaced by 'postdate=' then it functions properly. Seems like the parser is not looking at the double quotes. I am using the latest integration build (1.2-2002_08_31).

Before digging into the source code and trying to fix the problem, I thought maybe someone might have run into this problem before.

Thanks,
--stephen

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
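The escaping suggested in the reply above can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name QueryEscaper and the method escapeQueryValue are made up for this example and are not part of HTML Parser. java.net.URLEncoder is the standard JDK class; it applies form-encoding rules, so '>' becomes %3E (and a space would become '+').

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of HTML Parser.
final class QueryEscaper {

    // Percent-encodes a single query-string value using the JDK's
    // application/x-www-form-urlencoded rules.
    static String escapeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Encode only the value, not the whole URL, so that the '?', '='
        // and '&' separators keep their meaning.
        String href = "/cgi-bin/view_search?query_text="
                + escapeQueryValue("postdate>20020701")
                + "&txt_clr=White&bg_clr=Red";
        System.out.println(href);
    }
}
```

With the value encoded this way, the link no longer contains a literal '>' that a tag scanner could mistake for the end of the tag.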
From: Stephen H. <Ste...@tr...> - 2002-09-27 17:41:23
|
I have a simple document which I am trying to parse a link out of. Here is the code:

<html>
<body>
<DL>
<DT>YOUR QUERY WAS:
</DL>
Select one of the following documents to retrieve.
<P>
<HR>
<P><DL>
<DT><B>1:</B> <!-- hit --><A HREF="/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report 1.html">20020702 Report 1</A>
<DD><font size="-1">Score: 1000, Size: 7.4 kbytes, Type: URL file</font>
</DL>
</body>
</html>

The parser is getting confused by the '>' after the postdate. Instead of returning the whole link:

http://localhost/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report 1.html

only a portion of the link is returned:

http://localhost/cgi-bin/view_search?query_text

If the 'postdate>' is replaced by 'postdate=' then it functions properly. Seems like the parser is not looking at the double quotes. I am using the latest integration build (1.2-2002_08_31).

Before digging into the source code and trying to fix the problem, I thought maybe someone might have run into this problem before.

Thanks,
--stephen |
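For illustration, here is a minimal Java sketch of a tag scanner that does look at the quotes, so that a '>' inside a quoted attribute value does not end the tag. It is not HTML Parser's actual code, and any fix made in the library may look quite different; the names TagScanner and findTagEnd are hypothetical. The sketch also ignores comments and other special cases.

```java
// Hypothetical sketch, not taken from HTML Parser.
final class TagScanner {

    // Returns the index of the '>' that closes the tag starting at 'start'
    // (which is expected to point at '<'), or -1 if the tag is never closed.
    // A '>' inside a single- or double-quoted attribute value is skipped.
    static int findTagEnd(String html, int start) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = start + 1; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // matching closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (c == '>') {
                return i; // unquoted '>' ends the tag
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"/cgi-bin/view_search?query_text=postdate>20020701"
                + "&txt_clr=White\">20020702 Report 1</A>";
        int end = findTagEnd(html, 0);
        // Prints the complete opening tag, including the quoted '>'.
        System.out.println(html.substring(0, end + 1));
    }
}
```

Tracking the opening quote character, rather than a simple in-quotes flag, lets a double-quoted value contain an apostrophe and vice versa.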