Thread: [Htmlparser-developer] htmlparser 1.0
From: Kaarle K. <kaa...@ik...> - 2002-01-07 22:06:18
I tried the example applications using the bat files with htmlparser 1.0, without much success.

1)
runCrawler http://www.google.com 1
This gives a list of the links on that page, I assume.

2) (Finnish broadcasting company)
runCrawler http://www.yle.fi 1
This throws:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27

3) (Finnish commercial TV station)
runCrawler http://www.mtv3.fi 1
This throws:
Exception in thread "main" java.lang.OutOfMemoryError
<<no stack trace available>>

4) My own simple homepage
After a rather long time it throws:
Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
at java.lang.String.substring(Unknown Source)
........

I don't think I have such Microsoft links on my page. Probably it has something to do with activeisp.com, which provides me with disk space?

I get a similar result from my software page at www.kk-software.fi

--------------------
After these experiments I still do not understand what the Robot tries to do. Is there any explanation for this?

regards
Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Somik R. <so...@ya...> - 2002-01-08 15:16:32
Hi Kaarle,

Thanks for pointing this out. It's not a bug in the crawler but in the parser itself, in HTMLStyleScanner. I am trying to fix it ASAP.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-01-08 15:46:16
Hi Kaarle,

To answer your basic question: the crawler crawls through a URL (like websnake and similar robot crawlers). It picks up the links on a page, visits those links, and so on recursively, down to the depth you define.

The bugs you see are not because of the crawler code, but because of some parser bugs. The scanner bugs came in when I tried to fix the case where the style tags are in one big line with other stuff. Obviously, not enough test cases. Thankfully, you are htmlparser's best tester :)

Your site and http://www.yle.fi are working fine now. mtv3 is giving the weird out-of-memory exception, and I am fixing that now. As soon as that's done, maintenance release 1.01 will be out.

Cheers,
Somik
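The depth-limited crawl described in this reply can be sketched roughly as follows. This is a minimal illustration, not the htmlparser crawler itself: the class name CrawlSketch, the linksOf function and the in-memory "web" map are all hypothetical stand-ins, and the assumption that a depth of 1 means "the start page plus the pages it links to" is inferred from the "crawlDepth = 0" output earlier in the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlSketch {
    /** Visits startUrl, then follows links while depth remains; returns pages in visit order. */
    static List<String> crawl(String startUrl, int depth, Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        crawl(startUrl, depth, linksOf, visited);
        return new ArrayList<>(visited);
    }

    private static void crawl(String url, int depth,
                              Function<String, List<String>> linksOf, Set<String> visited) {
        if (!visited.add(url)) {
            return; // already seen: avoids looping on pages that link to each other
        }
        if (depth <= 0) {
            return; // depth exhausted: record the page but do not follow its links
        }
        for (String link : linksOf.apply(url)) {
            crawl(link, depth - 1, linksOf, visited);
        }
    }

    public static void main(String[] args) {
        // Hypothetical link graph standing in for real pages; a real crawler
        // would fetch each URL and extract its links with the parser.
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"),
            "c", List.of("a"),
            "d", List.of());
        Function<String, List<String>> linksOf = url -> web.getOrDefault(url, List.of());
        System.out.println(crawl("a", 1, linksOf));
        System.out.println(crawl("a", 2, linksOf));
    }
}
```

Passing the link extraction in as a function keeps the recursion separate from fetching and parsing, which is where the bugs in this thread were actually located.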
From: Somik R. <so...@ya...> - 2002-01-09 11:50:17
Dear Kaarle,

Thank you very much! You are quite right. I forgot I was using Shift-JIS for Japanese encoding support, and SJIS is a Microsoft-specific standard, not Unicode. If I use a Unicode encoding, it should be fine. I will try with UTF-8, and will need your help to coordinate some more tests.

Meanwhile this style thing is proving to be a headache; I just got a report that it's crashing on Google. Need to add more test cases.

Regards,
Somik

----- Original Message -----
From: Kaarle Kaila
To: Somik Raha
Sent: Wednesday, January 09, 2002 2:40 AM
Subject: Re: [Htmlparser-developer] htmlparser 1.0 (Issue with mtv3 is that of internationalization)

At 22:37 8.1.2002 +0530, Somik Raha wrote:
> Hi Kaarle,
> I found the reason for the last problem: the site http://www.mtv3.fi has a link in Finnish. That link is not being interpreted correctly by the parser. The link is:
> <a href="/ks/ks_20020701b.shtml">Palveluun pääset tästä</a>

hi Somik,

HTMLParser reads lines from the net. It initiates the contact with the command

    reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"SJIS")),resourceLocn);

I don't know what SJIS stands for. The Java API does not list it, but lists ISO-8859-1 among others. Check the InputStreamReader constructor. With ISO-8859-1 it does not hang like it did with SJIS! SJIS seems to make everything 7-bit ASCII.

    reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"ISO-8859-1")),resourceLocn);

With this setting at least Finnish characters come through correctly.

I also downloaded from CVS the two files you had changed, and I could read www.mtv3.fi. It even reads my web page (rather strange output, though).

In Japan I would expect internationalization to be an issue? Wouldn't Unicode be required there?

regards
Kaarle

> What's happening is that the last < is being corrupted. I haven't faced a problem with internationalization till now, and I am kind of stuck with this one. Maybe you'd be in a better position to solve it than me. I will make the release with the other bug fixed, and I'd be grateful if you can proceed from there.
>
> Regards,
> Somik
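The effect of the charset argument discussed above can be reproduced without a network connection. The sketch below is illustrative only: the class CharsetDemo and its readLine helper are hypothetical, and it assumes the page bytes are ISO-8859-1, as Kaarle's test suggests. It feeds the Finnish link through an InputStreamReader with different charsets. In Shift-JIS, bytes in the range 0xE0 to 0xEF start a two-byte character, so an ä (0xE4 in ISO-8859-1) can swallow the byte after it, which would be consistent with the report that the last < was corrupted.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // The link from the thread; the a-umlaut is written as a Unicode escape
    // so the encoding of this source file does not matter.
    static final String LINK =
        "<a href=\"/ks/ks_20020701b.shtml\">Palveluun p\u00e4\u00e4set t\u00e4st\u00e4</a>";

    /** Decodes the raw bytes with the given charset and returns the first line. */
    static String readLine(byte[] raw, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the server sends ISO-8859-1 bytes.
        byte[] raw = LINK.getBytes(StandardCharsets.ISO_8859_1);

        // Matching charset: the text survives intact.
        System.out.println(readLine(raw, StandardCharsets.ISO_8859_1));

        // Mismatched charset: the a-umlaut bytes are not valid UTF-8 here.
        System.out.println(readLine(raw, StandardCharsets.UTF_8));

        // Shift_JIS is an optional charset, so check before using it.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println(readLine(raw, Charset.forName("Shift_JIS")));
        }
    }
}
```

Note that switching the reader to UTF-8 would only help for pages that are actually served as UTF-8; the decoding charset has to match what the server sends.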