htmlparser-user Mailing List for HTML Parser (Page 94)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-07-10 00:56:20
Hi Chris,

If you want to print the HTML code of the link tag, just use linkTag.toHTML(). So your code becomes:

    if (node instanceof HTMLLinkTag) {
        _pwOut.print(node.toHTML());
    }

Note: You don't even have to downcast, because toHTML() is in the HTMLNode interface.

If you want to print all the contents yourself (for whatever reason), then you will have to enumerate through the data inside the link tag by picking up an enumeration:

    for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
        HTMLNode node = (HTMLNode) e.nextElement();
        // You are now enumerating through the list of nodes inside the link tag.
        // These could be images, strings, etc.
    }

Regards,
Somik
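A minimal sketch of why printing the whole node keeps the nested image while printing only the link text drops it. The types below (SketchNode, SketchImage, SketchLink) are hypothetical stand-ins written for this illustration; they are not the com.kizna.html classes, and only the method names toHTML() and linkData() are borrowed from the message above.

```java
import java.util.Enumeration;
import java.util.Vector;

// Hypothetical stand-in for a parsed node; NOT the library's HTMLNode.
interface SketchNode {
    String toHTML();
}

// Hypothetical stand-in for an image tag.
final class SketchImage implements SketchNode {
    private final String src;

    SketchImage(String src) {
        this.src = src;
    }

    public String toHTML() {
        return "<img src=\"" + src + "\">";
    }
}

// Hypothetical stand-in for a link tag that holds child nodes.
final class SketchLink implements SketchNode {
    private final String href;
    private final Vector<SketchNode> children = new Vector<>();

    SketchLink(String href) {
        this.href = href;
    }

    SketchLink add(SketchNode child) {
        children.add(child);
        return this;
    }

    // Enumeration over the nodes nested inside the link.
    Enumeration<SketchNode> linkData() {
        return children.elements();
    }

    // Rebuilds the opening tag, every nested node, then the closing tag.
    public String toHTML() {
        StringBuilder sb = new StringBuilder("<a href=\"").append(href).append("\">");
        for (Enumeration<SketchNode> e = linkData(); e.hasMoreElements();) {
            sb.append(e.nextElement().toHTML());
        }
        return sb.append("</a>").toString();
    }
}

final class LinkChildrenDemo {
    public static void main(String[] args) {
        SketchLink link = new SketchLink("link").add(new SketchImage("link"));
        System.out.println(link.toHTML()); // <a href="link"><img src="link"></a>
    }
}
```

The point of the design is that the container delegates to each child's toHTML(), so the caller never needs to know what kinds of nodes sit inside the link.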
From: Chris <ch...@su...> - 2002-07-09 18:28:29
Somik,

I have updated to v1.2 and it is doing well.

I have this problem: an IMG within an A tag does not show up at all.

    <a href="link"><img src="link"></a>

comes through the parser as:

    <a href="link"></a>

Code snippet:

--------------- snip ---------------
    if (node instanceof HTMLLinkTag) {
        // Link tag
        HTMLLinkTag htmlTag = (HTMLLinkTag) node;

        _pwOut.print("<a href=\"" + htmlTag.getText() + "\">");
        _pwOut.println(htmlTag.getLinkText() + "</a>");
    }
--------------- snip ---------------

-chris carey
sublimespot.com
From: Claude D. <CD...@ar...> - 2002-07-08 22:08:34
Here are the latest numbers from runs we've done using the Trek data set. These are mostly small HTML documents used in IR (Information Retrieval) as baselines (slightly cleaned up for HTML processing):

Total number of documents: 642,077
Total original document size (in bytes): 2,596,104,858

Comparison (times include local socket transmission of output documents, possibly as much as 20-25% of the total time spent):

Swing parser: 4715 minutes total time, average number of documents per second: 2.269625309
HTMLParser 1.1: 5065 minutes total time, average number of documents per second: 2.112790392
HTMLParser 1.2 (pre optimizations): 5026 minutes total time, average number of documents per second: 2.129184905

Previous reports that the 1.2 version was slower changed as more data was processed. It was, in fact, only slightly slower than 1.1. If Somik's recent changes improve performance as much as we expect, subsequent numbers should be even better. I thought it would be nice to share these numbers. I will post numbers from a run with the latest optimizations within a few days.

Note that the HTMLTitleScanner, HTMLMetaTagScanner and HTMLScriptScanner are being used in this set of tests, and each element is being tested with "instanceof" to catch key tag information of relevance to our application. The HTMLScriptScanner is there only to make sure we skip over any scripts.
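The documents-per-second figures follow directly from the document count and the total run time. As a small worked check, assuming the average is simply documents divided by elapsed seconds (the class and method names here are made up for the illustration):

```java
// Recomputes the averages quoted in the message from its own totals.
final class Throughput {
    // documents / (minutes * 60 seconds)
    static double docsPerSecond(long documents, long minutes) {
        return documents / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long documents = 642_077L;
        System.out.printf("Swing parser:   %.6f docs/s%n", docsPerSecond(documents, 4715));
        System.out.printf("HTMLParser 1.1: %.6f docs/s%n", docsPerSecond(documents, 5065));
        System.out.printf("HTMLParser 1.2: %.6f docs/s%n", docsPerSecond(documents, 5026));
    }
}
```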
From: Somik R. <so...@ya...> - 2002-07-08 01:07:24
Hi Folks,

The latest integration release (2002-07-07) is out, and has major improvements:

[1] 50% speed improvement over v1.1. The previous 1.2 versions had a slowdown bug due to which they were slower by 20% than v1.1.
[2] Fixed a bug in HTMLScriptScanner, which would break on incorrect HTML inside the script code.
[3] Removed HTMLFormScanner from the standard registered scanners, as it has a bug: it cannot parse non-ended forms (it goes into an infinite loop).

Thanks to Claude Duguay for the scalability reports. It is recommended that all v1.2 users upgrade to the latest release for these fixes.

Regards,
Somik

**********************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo,
Shibuya-ku, Tokyo, 150-0012, JAPAN
Phone : +81-3-5475-2646
Fax : +81-3-3445-9089
Web : http://www.kizna.com
Mail : so...@ki...
**********************************
From: Somik R. <so...@ya...> - 2002-07-04 00:34:22
Hi Chris,

You can try this code:

    HTMLParser parser = new HTMLParser("http://...");
    parser.registerScanners();
    HTMLNode node;
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        if (node instanceof HTMLLinkTag) {
            HTMLLinkTag linkTag = (HTMLLinkTag) node;
            String link = linkTag.getLink();
            // Now that you have the absolute link, change it the way you want
            String modifiedLink = modifyLink(link);
            // Output the link tag
        } else if (node instanceof HTMLImageTag) {
            HTMLImageTag imageTag = (HTMLImageTag) node;
            String loc = imageTag.getImageLocation();
            // Now that you have the absolute link, change it the way you want
            String modifiedImageLoc = modifyImageLoc(loc);
            // Output the image tag
        } else {
            // This prints the HTML reconstruction of the node
            System.out.println(node.toHTML());
        }
    }

Note: When you are outputting the link and image tags, you will have to keep a few things in mind.

[1] You will need to run through the params table in order to accurately reconstruct the rest of the HTML. This is easy: the parameters in the tags are in a hashtable that can be retrieved by HTMLTag.getParsed() (all tags derive from HTMLTag).

[2] When you are outputting the link tag, remember that links can contain other HTML elements within them. Getting all the nodes contained in them is easy: you can get an enumeration of link elements with HTMLLinkTag.linkData().

[3] You might want to consider a second approach for uniform rendering of all data. Since you have all the source code and are fairly sure how you want to render it, modify the toHTML methods of HTMLLinkTag and HTMLImageTag yourself to change them the way you want. Then your application code becomes:

    HTMLParser parser = new HTMLParser("http://...");
    parser.registerScanners();
    HTMLNode node;
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        System.out.println(node.toHTML());
    }

[4] I am strongly considering allowing folks to add static rendering handlers for the link and image tags, so that you can replace the default toHTML with your own code without touching the original source. But you will have to wait for a later release...

Cheers,
Somik
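Point [1] above, rebuilding a tag from its attribute table after changing one value, can be sketched as follows. This is an illustration under stated assumptions: TagRebuilder, rebuild and modifyLink are hypothetical names, the attribute table is an ordinary java.util map rather than the table the parser hands back, and the sample URL is invented. A LinkedHashMap is used so that attributes come out in insertion order; a plain Hashtable gives no ordering guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rebuilds an opening tag from a table of attribute name/value pairs.
final class TagRebuilder {
    // Hypothetical link rewrite: switch the scheme to https.
    static String modifyLink(String link) {
        return link.replace("http://", "https://");
    }

    // Emits <tagName key="value" ...> for every entry, in iteration order.
    static String rebuild(String tagName, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder("<").append(tagName);
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            sb.append(' ').append(attr.getKey())
              .append("=\"").append(attr.getValue()).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("href", "http://example.org/page.html");
        attrs.put("class", "nav");

        // Replace only the href; every other attribute is carried over untouched.
        attrs.put("href", modifyLink(attrs.get("href")));
        System.out.println(rebuild("a", attrs));
        // <a href="https://example.org/page.html" class="nav">
    }
}
```

Going through the full table, rather than writing out only the href, is what keeps the rest of the page "fairly untouched" as the original question asked.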
From: Chris C. <ne...@su...> - 2002-07-03 18:12:53
I was looking for the new and improved way to do the following:

a) Read in a page from disk or URL
b) Modify every <A href=""> in the page
c) Output the page to disk or to screen

For example, I would just like to modify all of the <A> links or <IMG> links in a particular manner, but leave *most* of the page fairly untouched.
From: Somik R. <so...@ya...> - 2002-07-03 04:14:43
Hi Cheng,

> I set the int var . It works now.

Instead of setting the int, you might want to consider using the String. This is because the string URL that you provide as the param forms the base URL for resolving relative links. If you don't provide it, getLink() will not give you absolute links when the page actually contains a relative link.

Regards,
Somik
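To illustrate what a base URL does for relative links, here is a small sketch using the standard java.net.URI class. It shows the general resolution rule only; it is not a claim about how the parser resolves links internally, and the example.org addresses and the RelativeLinks class are invented for the example.

```java
import java.net.URI;

// Resolves links found in a page against the URL the page came from.
final class RelativeLinks {
    static String toAbsolute(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.org/index.html";

        // Relative to the directory of the base document.
        System.out.println(toAbsolute(base, "studying/"));
        // http://www.example.org/studying/

        // Relative to the server root.
        System.out.println(toAbsolute(base, "/misc/depts.html"));
        // http://www.example.org/misc/depts.html

        // Already absolute: returned unchanged.
        System.out.println(toAbsolute(base, "http://www.lib.example.org/"));
        // http://www.lib.example.org/
    }
}
```

Without a base there is nothing to resolve "studying/" against, which is why a page parsed from an in-memory string needs the URL supplied separately.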
From: Cheng J. <c....@sm...> - 2002-07-03 03:20:17
Dear Somik Raha,

Thank you. I set the int var to 1. It works now.

> I am not sure I understand.. As I mentioned earlier, the linkTag.getLink() provides you the absolute link (check http://htmlparser.sourceforge.net/javadoc/com/kizna/html/tags/HTMLLinkTag.html)

Sorry, getLink() sometimes gives the relative link.

Best wishes,
Cheng Jun
2002-07-03
From: Somik R. <so...@ya...> - 2002-07-03 03:14:17
Hi Cheng,

> Would you please explain the meaning of next variable??? THX.

Sorry, I forgot about that. The second variable is a string which denotes a default URL. This is the URL that will be used for resolving relative links. You can ignore the int param.

Regards,
Somik
From: Cheng J. <c....@sm...> - 2002-07-03 03:05:40
Dear Somik Raha,

Thanks for your reply. I tried your suggestion at once, but got a problem. You gave me the code below.

    StringReader sr = new StringReader(sb.toString());
    HTMLReader reader = new HTMLReader(new BufferedReader(sr));
    HTMLParser parser = new HTMLParser(reader);

The second line should be HTMLReader(BufferedReader, int p1 / String p1). Would you please explain the meaning of the second variable? Thanks.

Best wishes,
Cheng Jun
2002-07-03
From: Somik R. <so...@ya...> - 2002-07-03 02:33:05
Hi Cheng,

> This is just a little problem. In my project I need the absolute link. You may write another method to give the absolute link output.

I am not sure I understand. As I mentioned earlier, linkTag.getLink() provides you the absolute link (check http://htmlparser.sourceforge.net/javadoc/com/kizna/html/tags/HTMLLinkTag.html).

> Another suggestion. I write my own page downloader. The problem is your parser only parses pages available on the HD. Could you give us another interface to parse the page content stored in a String or StringBuffer or something else.

Oh, you can parse pages off a StringBuffer :). In fact, this is the basis of all test cases written for the parser. Check the source and see any of the scanner test cases. Assuming you had a string buffer sb, this is what you could do:

    StringReader sr = new StringReader(sb.toString());
    HTMLReader reader = new HTMLReader(new BufferedReader(sr));
    HTMLParser parser = new HTMLParser(reader);

That's it! You can start parsing now. This is how we have over 100 tests, giving specific inputs to the parser.

HTH.

Cheers,
Somik
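The reader chain in that reply uses only standard java.io classes up to the point where the library's own reader takes over. A minimal sketch of that first half, with an invented InMemoryPage class and sample markup, showing that a page held in a StringBuffer can be consumed exactly like a file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Wraps in-memory page content in the same reader types used for files.
final class InMemoryPage {
    static BufferedReader readerFor(StringBuffer page) {
        return new BufferedReader(new StringReader(page.toString()));
    }

    // Reads the buffer line by line, as a line-oriented parser would.
    static int countLines(StringBuffer page) {
        try (BufferedReader reader = readerFor(page)) {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();
        sb.append("<html>\n");
        sb.append("<a href=\"studying/\">Prospective Students</a>\n");
        sb.append("</html>");
        System.out.println(countLines(sb)); // 3
    }
}
```

Because the consumer only sees a BufferedReader, the same code path serves downloaded pages, test fixtures and files on disk.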
From: Cheng J. <c....@sm...> - 2002-07-03 02:19:57
Dear Somik Raha,

Thanks for your reply.

This is just a little problem. In my project I need the absolute link. You may write another method to give the absolute link output.

Another suggestion. I wrote my own page downloader. The problem is that your parser only parses pages available on the hard disk. Could you give us another interface to parse page content stored in a String or StringBuffer or something else?

Best wishes,
Cheng Jun
2002-07-03
From: Somik R. <so...@ya...> - 2002-07-01 11:44:58
Hi Cheng,

Thanks for the kind words. Regarding the bug, I would call it a feature :)

When you parse a link within a URL, if the link is relative, it gets processed appropriately. If you want to get the absolute link, you should call linkTag.getLink(). The toHTML() method, however, tries to reconstruct the HTML as it appeared (so relative links show up as relative, and absolute links show up as absolute). There might be a controversy regarding the purpose of toHTML() itself: do you think toHTML() should not do an accurate rendition in the case of the HTMLTag? I am open to opinions from everyone on this.

For your purposes, you will need to modify the code of toHTML() in HTMLLinkTag.java.

Original code:

    public String toHTML() {
        StringBuffer sb = new StringBuffer();
        sb.append("<");
        sb.append(tagContents.toString());
        sb.append(">");
        HTMLNode node;
        for (Enumeration e = linkData(); e.hasMoreElements();) {
            node = (HTMLNode) e.nextElement();
            sb.append(node.toHTML());
        }
        sb.append("</A>");
        return sb.toString();
    }

Modified code:

    public String toHTML() {
        StringBuffer sb = new StringBuffer();
        // Modification occurs here: write the absolute link as the href.
        // Other attributes of the original tag are not carried over.
        sb.append("<a href=\"");
        sb.append(getLink());
        sb.append("\">");
        HTMLNode node;
        for (Enumeration e = linkData(); e.hasMoreElements();) {
            node = (HTMLNode) e.nextElement();
            sb.append(node.toHTML());
        }
        sb.append("</A>");
        return sb.toString();
    }

Let me know if I have misunderstood the problem, or if this does not fix it.

Cheers,
Somik

(Note: If you check out the code from CVS, you will get the ant build script. This will make it really simple for you to just get the htmlparser.jar and use it in your app.)
From: Cheng J. <c....@sm...> - 2002-07-01 01:51:22
Firstly I have to say thank you to Somik Raha. You really did a good job giving us a new integration.

I am writing a program to parse web pages and retrieve the links in the pages. I have tried the latest version (6/30) and found there may be a bug. The following is part of the code and output.

    System.out.println("Starting parsing...... ");
    com.kizna.html.HTMLParser Parser = new com.kizna.html.HTMLParser("E://My paper/EdCrawler/page.htm");
    Parser.registerScanners();
    //Parser.parse(null);
    // Parse the HTML file by tag types
    Enumeration e = Parser.elements();
    while (HasMore) {
        try {
            HasMore = e.hasMoreElements(); // HasMore is a boolean var
        } catch (Exception e2) {
            System.out.println(e2.toString());
            HasMore = false; // have to stop parsing this HTML file
        }
        if (HasMore) {
            com.kizna.html.HTMLNode node = (com.kizna.html.HTMLNode) e.nextElement();
            // HTML DoctypeTag
            if (node instanceof com.kizna.html.tags.HTMLDoctypeTag) {
                com.kizna.html.tags.HTMLDoctypeTag DoctypeNode = (com.kizna.html.tags.HTMLDoctypeTag) node;
                System.out.println("Doctype: " + DoctypeNode.toPlainTextString());
            }
            // title
            if (node instanceof com.kizna.html.tags.HTMLTitleTag) {
                com.kizna.html.tags.HTMLTitleTag TitleNode = (com.kizna.html.tags.HTMLTitleTag) node;
                System.out.println("Title: " + TitleNode.toPlainTextString());
            }
            // MATA
            if (node instanceof com.kizna.html.tags.HTMLMetaTag) {
                com.kizna.html.tags.HTMLMetaTag MataNode = (com.kizna.html.tags.HTMLMetaTag) node;
                System.out.println("MATA HTTP-EQUIV: " + MataNode.getHttpEquiv() + " MATA name: " + MataNode.getMetaTagName() + " CONTENT :" + MataNode.getMetaTagContents());
            }
            // Links
            if (node instanceof HTMLLinkTag) {
                HTMLLinkTag LinkNode = (HTMLLinkTag) node;
                // Retrieve the data from the object and print it
                System.out.println("LINK: " + LinkNode.toPlainTextString() + " " + " toHTML " + LinkNode.toHTML());
            }
            // Parser end
        } // if (HasMore)
    } // while
    System.out.println("Parising END.");

======================================================================
Part of the output:

    Doctype:
    Title: The University of Edinburgh
    MATA HTTP-EQUIV: null MATA name: Description CONTENT :The University of Edinburgh, promoting excellence in teaching and research.
    MATA HTTP-EQUIV: null MATA name: keywords CONTENT :edinburgh ,university ,degree ,study, studying, research, Scotland, uk, alumni, graduate, postgraduate, PhD, masters, grad ,post ,edinboro, college, school
    MATA HTTP-EQUIV: null MATA name: publisher CONTENT :The University of Edinburgh
    MATA HTTP-EQUIV: null MATA name: author CONTENT :University Web Editor
    LINK: Prospective Students toHTML <a href="studying/">Prospective Students</A>
    LINK: News & Events toHTML <a href="news/">News & Events</A>
    LINK: Faculties & Departments toHTML <a href="/misc/depts.html">Faculties & Departments</A>
    LINK: Present Students toHTML <a href="/presentstudents/">Present Students</A>
    LINK: Research toHTML <a href="research/">Research</A>
    LINK: Support Services toHTML <a href="/misc/support.html">Support Services</A>
    LINK: Staff toHTML <a href="staff/">Staff</A>
    LINK: Lifelong Learning toHTML <a href="http://www.lifelong.ed.ac.uk/">Lifelong Learning</A>
    LINK: The Library toHTML <a href="http://www.lib.ed.ac.uk/">The Library</A>

Now we can see that links within the same domain are displayed only as relative links, that is, as only part of the link itself. So please check the toHTML() method.

Cheng Jun
c....@sm...
2002-07-01 02:38:51
From: Somik R. <so...@ya...> - 2002-06-30 12:26:03
Hi Folks,

This week's integration release is out. You can get it from http://htmlparser.sourceforge.net.

All test cases are passing. A couple of bugs were fixed, including some interesting bugs reported by Cedric Rosa (thanks Cedric).

A major refactoring has come in, which makes writing a scanner easier. You don't have to worry about associating scanners with the tags they create; it's done through a template method internally. Also, you don't need to worry about whether your tag has made a call to parseParameters. It is done automatically.

(Kaarle: by now, almost all the scanners are using parseParameters. Thanks a lot for contributing that.)

For the next release, Claude Duguay will be collaborating closely with us. He has given some great suggestions and will be providing some code. He is going to help us get this parser into the professional league, and is currently doing scalability analysis on the parser (those on the user list will have already seen the analysis comparing Swing and v1.1). We should have results on v1.2 from him soon. We should probably get 1.2 out in stable form after this collaboration. (Claude, thanks a ton.)

You can look forward to some exciting improvements in the coming weeks.

Cheers,
Somik
From: Somik R. <so...@ya...> - 2002-06-29 00:43:28
|
Hi Cedric,

This is actually already present in the parser. Every tag has a hashtable in it, which already contains all the fields parsed. The API-specific methods are provided only as a convenience. So, if you want to get the value of a field that is not in the API, simply call yourTag.getParameter("param-name").

Regards,
Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Friday, June 28, 2002 8:27 PM
Subject: [Htmlparser-user] Missing Meta's treatments: maybe useful

Hello,

I've just read the W3C recommendation for HTML and found that META seems to have other fields, such as lang, dir and scheme:
http://www.w3.org/TR/html4/struct/global.html#h-7.4.4

<!ELEMENT META - O EMPTY -- generic metainformation -->
<!ATTLIST META
  %i18n; -- lang, dir, for use with content --
  http-equiv NAME #IMPLIED -- HTTP response header name --
  name NAME #IMPLIED -- metainformation name --
  content CDATA #REQUIRED -- associated information --
  scheme CDATA #IMPLIED -- select form of content --
  >

The best solution for patching is to modify:

HTMLMetaTag:
------------
public HTMLMetaTag(int tagBegin, int tagEnd, String tagContents, String httpEquiv, String metaTagName, String metaTagContents, String tagLine)
=>
public HTMLMetaTag(int tagBegin, int tagEnd, String tagContents, Hashtable table, String tagLine)

HTMLMetaTagScanner:
-------------------
HTMLMetaTag metaTag = new HTMLMetaTag(tag.elementBegin(), tag.elementEnd(), tag.getText(), httpEquiv, metaTagName, metaTagContents, currLine);
=>
HTMLMetaTag metaTag = new HTMLMetaTag(tag.elementBegin(), tag.elementEnd(), tag.getText(), table, currLine);

After that, you are ready to get and set the fields in HTMLMetaTag.

Somik, do you think this would be a good solution? If it is convenient, I can patch these files and mail them to you :)

Regards,
Cedric. |
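The reply above says that each tag keeps a hashtable of everything that was parsed, and that the named getters are only a convenience on top of it. A minimal sketch of that idea follows. SketchTag and its methods are hypothetical stand-ins written for illustration; only the getParameter("param-name") call is named in the thread, and the real tag class may behave differently (for example in how it treats attribute case).

```java
import java.util.Hashtable;
import java.util.Locale;

// Hypothetical stand-in for a parsed tag; this is NOT the real HTMLParser class.
class SketchTag {
    // Assumption: names are stored in upper case so lookups ignore case.
    private final Hashtable<String, String> attributes = new Hashtable<>();

    void setParameter(String name, String value) {
        attributes.put(name.toUpperCase(Locale.ROOT), value);
    }

    // Generic accessor: works for any attribute, including ones with no dedicated getter.
    String getParameter(String name) {
        return attributes.get(name.toUpperCase(Locale.ROOT));
    }

    // Convenience accessor layered on the generic one.
    String getHttpEquiv() {
        return getParameter("http-equiv");
    }
}

class SketchTagDemo {
    public static void main(String[] args) {
        SketchTag meta = new SketchTag();
        meta.setParameter("name", "date");
        meta.setParameter("content", "2002-06-28");
        meta.setParameter("scheme", "ISO8601");

        // "scheme" has no dedicated getter, yet it is still reachable.
        System.out.println(meta.getParameter("scheme"));
        // An attribute that was never parsed simply comes back as null.
        System.out.println(meta.getParameter("lang"));
    }
}
```

With this shape, supporting lang, dir or scheme on META would not need a constructor change: the values are already in the table, and a dedicated getter is optional sugar.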
From: R. <ced...@fr...> - 2002-06-28 11:26:18
|
Hello,

I've just read the W3C recommendation for HTML and found that META seems to have other fields, such as lang, dir and scheme:
http://www.w3.org/TR/html4/struct/global.html#h-7.4.4

<!ELEMENT META - O EMPTY -- generic metainformation -->
<!ATTLIST META
  %i18n; -- lang, dir, for use with content --
  http-equiv NAME #IMPLIED -- HTTP response header name --
  name NAME #IMPLIED -- metainformation name --
  content CDATA #REQUIRED -- associated information --
  scheme CDATA #IMPLIED -- select form of content --
  >

The best solution for patching is to modify:

HTMLMetaTag:
------------
public HTMLMetaTag(int tagBegin, int tagEnd, String tagContents, String httpEquiv, String metaTagName, String metaTagContents, String tagLine)
=>
public HTMLMetaTag(int tagBegin, int tagEnd, String tagContents, Hashtable table, String tagLine)

HTMLMetaTagScanner:
-------------------
HTMLMetaTag metaTag = new HTMLMetaTag(tag.elementBegin(), tag.elementEnd(), tag.getText(), httpEquiv, metaTagName, metaTagContents, currLine);
=>
HTMLMetaTag metaTag = new HTMLMetaTag(tag.elementBegin(), tag.elementEnd(), tag.getText(), table, currLine);

After that, you are ready to get and set the fields in HTMLMetaTag.

Somik, do you think this would be a good solution? If it is convenient, I can patch these files and mail them to you :)

Regards,
Cedric. |
From: R. <ced...@fr...> - 2002-06-27 11:36:09
|
Hello,

A little fix: HTMLStyleScanner and HTMLTitleScanner were linked to the same option character, so it was impossible to extract only the title or only the style. I've just replaced "t" with "T" for the title:

addScanner(new HTMLStyleScanner("-t"));
addScanner(new HTMLTitleScanner("-T"));

Cedric. |
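The collision described above (two scanners sharing one filter string) is the kind of thing a small registry check can surface early. The sketch below is only an illustration under assumptions: FilterRegistry and its methods are hypothetical names, not part of HTMLParser, and scanners are represented by plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry; the names are illustrative, not the HTMLParser API.
class FilterRegistry {
    // filter string -> names of the scanners registered under it
    private final Map<String, List<String>> byFilter = new LinkedHashMap<>();

    void addScanner(String scannerName, String filter) {
        byFilter.computeIfAbsent(filter, k -> new ArrayList<>()).add(scannerName);
    }

    // Returns every filter string claimed by more than one scanner.
    List<String> collisions() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : byFilter.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}

class FilterRegistryDemo {
    public static void main(String[] args) {
        FilterRegistry before = new FilterRegistry();
        before.addScanner("HTMLStyleScanner", "-t");
        before.addScanner("HTMLTitleScanner", "-t");
        System.out.println("before fix: " + before.collisions()); // before fix: [-t]

        FilterRegistry after = new FilterRegistry();
        after.addScanner("HTMLStyleScanner", "-t");
        after.addScanner("HTMLTitleScanner", "-T");
        System.out.println("after fix: " + after.collisions()); // after fix: []
    }
}
```

Because the map keys are compared case-sensitively, "-t" and "-T" count as different filters, which is what the fix relies on.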
From: Somik R. <so...@ya...> - 2002-06-27 10:57:49
|
> Oh yes, excuse me, it's another test :) I attach the new document. I've trimmed another line of the page :)

Oh boy, r u torture testing or what :) This bug has been fixed. The parser should work fine now. You can get the latest code from CVS.

> I don't run the software with options; I use the classes from my own program, but it was just for testing :) I could make a fix when I find some time too :)

Cool.

> I'll register on SourceForge asap and I'll try to understand how CVS works :) For the moment I can send my fixes to the developer mailing list.

Looking forward to seeing you on the dev list. By the way, my suggestion is to try Eclipse: its integration with CVS on SourceForge is a breeze, and Eclipse is open source :) (www.eclipse.org)
If you want to go the hard way, check http://cdx.sourceforge.net/win-HOWTO.htm

Cheers,
Somik |
From: R. <ced...@fr...> - 2002-06-27 10:16:47
|
Oh yes, excuse me, it's another test :) I attach the new document. I've trimmed another line of the page :)

I don't run the software with options; I use the classes from my own program, but it was just for testing :) I could make a fix when I find some time too :)

I'll register on SourceForge asap and I'll try to understand how CVS works :) For the moment I can send my fixes to the developer mailing list.

Regards,
Cedric.

At 18:43 27/06/2002 +0900, you wrote:
>Hi Cedric,
>
>>Thanks for this fix. But when I download the CVS version of HTMLParser and try to parse the page again I get this error:
>>"java.lang.OutOfMemoryError
>>        <<no stack trace available>>
>>Exception in thread "main" "
>>
>>Is it normal? Should I catch this error and write my own code around it?
>
>It's highly abnormal... It should not happen. Are you trying it with the same piece of HTML? Send me the data you are trying it on. If it's the same page, it works perfectly on my end. I am running HTMLParser (main) with no params except the file name.
>
>>Another question: I can't run the software with two options. Is it normal? Why don't you put the options before the name of the file to parse?
>
>Yes, this is normal (a feature, not a bug). This is because the options are intended only as a demo, and I didn't think they'd really be of use to people. Are you actually using it this way? Also, another thing is that I am not full time on this, so I'd be grateful if you could join up as a developer and make this fix.
>
>All code received from developers is acknowledged both in the code and on the Contributors page that goes out with each release. You can send me your SourceForge id and I can add you as a developer.
>
>>It can be used like this:
>>public HTMLStringNode(String text, int textBegin, int textEnd)
>>{
>>    NormalizeHtmlCode normalizer = new NormalizeHtmlCode();
>>    this.text = normalizer.html2text(text);
>>    this.textBegin = textBegin;
>>    this.textEnd = textEnd;
>>}
>>You can implement it with the meta-tags, ...
>
>This is cool. I think it will be useful in the toPlainString() method, where we can get the actual meaningful text out. I'd be glad to include this as soon as I find some time. Or Tarik can also join as a developer and I can give him CVS access to do it.
>
>Thanks a lot for your participation.
>
>Cheers,
>Somik |
From: Somik R. <so...@ya...> - 2002-06-27 09:49:06
|
Hi Cedric,

> Thanks for this fix. But when I download the CVS version of HTMLParser and try to parse the page again I get this error:
> "java.lang.OutOfMemoryError
>         <<no stack trace available>>
> Exception in thread "main" "
>
> Is it normal? Should I catch this error and write my own code around it?

It's highly abnormal... It should not happen. Are you trying it with the same piece of HTML? Send me the data you are trying it on. If it's the same page, it works perfectly on my end. I am running HTMLParser (main) with no params except the file name.

> Another question: I can't run the software with two options. Is it normal? Why don't you put the options before the name of the file to parse?

Yes, this is normal (a feature, not a bug). This is because the options are intended only as a demo, and I didn't think they'd really be of use to people. Are you actually using it this way? Also, another thing is that I am not full time on this, so I'd be grateful if you could join up as a developer and make this fix.

All code received from developers is acknowledged both in the code and on the Contributors page that goes out with each release. You can send me your SourceForge id and I can add you as a developer.

> It can be used like this:
> public HTMLStringNode(String text, int textBegin, int textEnd)
> {
>     NormalizeHtmlCode normalizer = new NormalizeHtmlCode();
>     this.text = normalizer.html2text(text);
>     this.textBegin = textBegin;
>     this.textEnd = textEnd;
> }
> You can implement it with the meta-tags, ...

This is cool. I think it will be useful in the toPlainString() method, where we can get the actual meaningful text out. I'd be glad to include this as soon as I find some time. Or Tarik can also join as a developer and I can give him CVS access to do it.

Thanks a lot for your participation.

Cheers,
Somik |
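The thread does not show what NormalizeHtmlCode.html2text() does internally, so the following is only a guess at the general technique: replacing "&...;" character references with the characters they stand for. EntityDecoder, its small entity table and its method names are all assumptions made for this sketch; a real normalizer would cover the full HTML 4 entity list.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative decoder; not the NormalizeHtmlCode class mentioned in the thread.
class EntityDecoder {
    // Deliberately tiny table; assumption: enough to demonstrate the idea.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("eacute", "\u00E9");
    }

    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            // Only look for a terminating ';' when we are standing on '&'.
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            String replacement = (semi > i + 1) ? lookup(text.substring(i + 1, semi)) : null;
            if (replacement != null) {
                out.append(replacement);
                i = semi + 1;
            } else {
                // Unknown or malformed reference: keep the character unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Resolves "amp", "#233" or "#xE9" style names; returns null when unknown.
    private static String lookup(String name) {
        if (name.startsWith("#")) {
            try {
                int codePoint = (name.startsWith("#x") || name.startsWith("#X"))
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1), 10);
                return Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : null;
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return NAMED.get(name);
    }
}

class EntityDecoderDemo {
    public static void main(String[] args) {
        System.out.println(EntityDecoder.decode("Caf&eacute; &amp; bar &#60;b&#x3E;"));
        System.out.println(EntityDecoder.decode("AT&T; stays as it is"));
    }
}
```

Leaving unknown references untouched, rather than dropping them, keeps the decoder safe to run on text that merely contains an ampersand.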
From: R. <ced...@fr...> - 2002-06-27 09:39:16
|
Hello Somik,

Thanks for this fix. But when I download the CVS version of HTMLParser and try to parse the page again, I get this error:
"java.lang.OutOfMemoryError
        <<no stack trace available>>
Exception in thread "main" "

Is it normal? Should I catch this error and write my own code around it?

Another question: I can't run the software with two options. Is it normal? Why don't you put the options before the name of the file to parse?

Last, a friend (Tarik Mokhtari) wrote a "little" normalizer to convert "&*" character entities. Maybe it would be a good idea to add it to the project?
It can be used like this:
public HTMLStringNode(String text, int textBegin, int textEnd)
{
    NormalizeHtmlCode normalizer = new NormalizeHtmlCode();
    this.text = normalizer.html2text(text);
    this.textBegin = textBegin;
    this.textEnd = textEnd;
}
You can implement it with the meta-tags, ...

Regards,
Cedric.

At 08:23 27/06/2002 +0200, you wrote:
>----- Original Message -----
>From: Somik Raha
>To: htm...@li...
>Cc: htmlparser-developer@lists.sourceforge.net
>Sent: Thursday, June 27, 2002 4:11 AM
>Subject: Re: [Htmlparser-user] Bad formed web page
>
>Hi Cedric,
>Thanks for the bug report. This has been reproduced in HTMLTagTest.testBrokenTag(), and has been fixed. The parser now runs without failing on the same HTML file you provided. This fix will make it into the next integration release.
>
>Regarding your earlier bug report: although the bug has been fixed, I am thinking I should introduce a template method, so that new scanner writers don't have to bother about registering the tags with their respective scanners.
>
>Hopefully this refactoring will be in soon, enabling scanners to be written safely. I also need to get cracking on Claude's refactoring suggestions.
>
>Regards,
>Somik |
From: Somik R. <so...@ya...> - 2002-06-27 02:17:08
|
Hi Cedric,

Thanks for the bug report. This has been reproduced in HTMLTagTest.testBrokenTag(), and has been fixed. The parser now runs without failing on the same HTML file you provided. This fix will make it into the next integration release.

Regarding your earlier bug report: although the bug has been fixed, I am thinking I should introduce a template method, so that new scanner writers don't have to bother about registering the tags with their respective scanners.

Hopefully this refactoring will be in soon, enabling scanners to be written safely. I also need to get cracking on Claude's refactoring suggestions.

Regards,
Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Thursday, June 27, 2002 12:48 AM
Subject: [Htmlparser-user] Bad formed web page

Re Somik,

First, thanks for your patch; I'll download it as soon as possible.

I've just tested your program with a web page which contains errors. I'm programming a search engine, and some pages may contain errors. I attached a copy of a bad page example: the problem is that the page is truncated before its end (because of a download error, for example). It is missing a ">" ("<br"), which causes the program to crash with a null pointer exception...
Can you fix this problem, or tell me where (in the sources) I can look to patch it?

Thanks in advance for your good support.

Cedric. |
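The template method mentioned in the message above can be pictured roughly as follows. This is one possible reading, not the refactoring that was actually made: it assumes "registering the tags with their scanners" means that each node created by a scanner should remember which scanner produced it. Every class and field name here is hypothetical.

```java
// Hypothetical classes for illustration only; none of these names come from HTMLParser.
class SketchNode {
    final String text;
    SketchScanner scanner; // filled in by SketchScanner.scan(), never by subclasses

    SketchNode(String text) {
        this.text = text;
    }
}

abstract class SketchScanner {
    final String filter;

    protected SketchScanner(String filter) {
        this.filter = filter;
    }

    // Template method: declared final so every scanner goes through the same steps.
    final SketchNode scan(String tagText) {
        SketchNode node = createNode(tagText); // variable step, supplied by the subclass
        node.scanner = this;                   // fixed step, done once for all scanners
        return node;
    }

    // The only thing a new scanner has to write.
    protected abstract SketchNode createNode(String tagText);
}

class SketchMetaScanner extends SketchScanner {
    SketchMetaScanner() {
        super("-m");
    }

    @Override
    protected SketchNode createNode(String tagText) {
        return new SketchNode(tagText);
    }
}

class TemplateMethodDemo {
    public static void main(String[] args) {
        SketchNode node = new SketchMetaScanner().scan("META name=\"author\"");
        // A filter such as "-m" can now be matched against node.scanner.filter.
        System.out.println(node.scanner.filter);
    }
}
```

Because the association happens in the base class, a scanner author cannot forget it, which is the class of bug the meta and title scanners ran into.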
From: R. <ced...@fr...> - 2002-06-26 15:48:45
|
Re Somik,

First, thanks for your patch; I'll download it as soon as possible.

I've just tested your program with a web page which contains errors. I'm programming a search engine, and some pages may contain errors. I attached a copy of a bad page example: the problem is that the page is truncated before its end (because of a download error, for example). It is missing a ">" ("<br"), which causes the program to crash with a null pointer exception...
Can you fix this problem, or tell me where (in the sources) I can look to patch it?

Thanks in advance for your good support.

Cedric.

At 20:28 26/06/2002 +0900, you wrote:
>Hi Cedric,
>This has been fixed. These two scanners (meta and title tag scanners) were not being associated with their tags. Reproduced with a test case and fixed. Code on CVS has been updated. This bug fix will make it into the next integration release (hopefully this weekend).
>Thanks for the bug report.
>Cheers,
>Somik |
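The truncated-page case reported above (input ending in "<br" with no closing ">") comes down to handling end of input while still inside a tag. The sketch below shows one defensive way to do that; TagReader is a hypothetical helper, and the actual fix in the parser is not shown in the thread.

```java
// Illustrative helper; not the parser's real tag-reading code.
class TagReader {
    // Returns the text between '<' at index start and the next '>'.
    // Assumption: if the input ends before '>' is found, the rest of the
    // input is treated as the tag contents instead of failing.
    static String readTag(String text, int start) {
        if (start < 0 || start >= text.length() || text.charAt(start) != '<') {
            throw new IllegalArgumentException("no tag starts at index " + start);
        }
        int close = text.indexOf('>', start + 1);
        int end = (close == -1) ? text.length() : close;
        return text.substring(start + 1, end);
    }
}

class TagReaderDemo {
    public static void main(String[] args) {
        String truncated = "<p>text<br";
        System.out.println(TagReader.readTag(truncated, 0)); // p
        System.out.println(TagReader.readTag(truncated, 7)); // br
    }
}
```

For a crawler, accepting the partial tag is usually preferable to aborting the whole page, although a stricter caller could check whether a '>' was actually found.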
From: Somik R. <so...@ya...> - 2002-06-26 11:33:50
|
Hi Cedric,

This has been fixed. These two scanners (meta and title tag scanners) were not being associated with their tags. Reproduced with a test case and fixed. Code on CVS has been updated. This bug fix will make it into the next integration release (hopefully this weekend).
Thanks for the bug report.

Cheers,
Somik

----- Original Message -----
From: Somik Raha
To: htm...@li...
Sent: Wednesday, June 26, 2002 8:13 PM
Subject: Re: [Htmlparser-user] -m option doesn't work ?

It does look like a bug. You could probably open a BugZilla report (from http://htmlparser.sourceforge.net) and describe your fix. I will also try to take a deeper look as soon as I find some time.

Regards,
Somik

----- Original Message -----
From: Cédric Rosa
To: htm...@li...
Sent: Wednesday, June 26, 2002 8:14 PM
Subject: Re: [Htmlparser-user] -m option doesn't work ?

I've tried with many URLs and it's the same problem, but you can check with:
"http://www.cybergeo.presse.fr/actualit/nouvparu/crendus/irstcr3.htm"

I've just modified the source code to make it work (and now it works fine)... so maybe it's a bug?

Thanks for your help.

Cedric.

At 20:02 26/06/2002 +0900, you wrote:
>Hi Cedric,
>Can you give us the URL, or send the page over?
>
>Regards
>Somik
>>----- Original Message -----
>>From: Cédric Rosa
>>To: htm...@li...
>>Sent: Wednesday, June 26, 2002 5:40 PM
>>Subject: [Htmlparser-user] -m option doesn't work ?
>>
>>Hello,
>>
>>When I try to parse a web page with htmlparser using this code:
>>
>>HTMLParser parser = new HTMLParser("foo.html");
>>parser.registerScanners();
>>parser.parse(null);
>>
>>everything is OK, but when I try to parse the page with:
>>
>>parser.parse("-m");
>>or
>>parser.parse("-t");
>>
>>I get no output from the software, even if the page contains a meta tag or a title.
>>
>>What's wrong?
>>
>>Thanks in advance for your answers.
>>
>>Cedric. |