Thread: Re: Fw: [Htmlparser-user] Bad formed web page
From: R. <ced...@fr...> - 2002-06-27 09:39:16
Attachments:
NormalizeHtmlCode.java
Hello Somik,

Thanks for this fix. But when I download the CVS version of HTMLParser and
try to parse the page again, I get this error:

"java.lang.OutOfMemoryError
        <<no stack trace available>>
Exception in thread "main" "

Is this normal? Should I catch this error and write my own code around it?

Another question: I can't run the software with two options. Is this normal?
Why don't you put the options before the name of the file to parse?

Last, a friend (Tarik Mokhtari) wrote a "little" normalizer to convert
"&*". Maybe it would be a good idea to add it to the project?
It can be used like this:

public HTMLStringNode(String text, int textBegin, int textEnd)
{
    NormalizeHtmlCode normalizer = new NormalizeHtmlCode();
    this.text = normalizer.html2text(text);
    this.textBegin = textBegin;
    this.textEnd = textEnd;
}

You can implement it with the meta-tags, ...

Regards,
Cedric.

At 08:23 27/06/2002 +0200, you wrote:
> ----- Original Message -----
> From: Somik Raha <so...@ya...>
> To: htm...@li...
> Cc: htmlparser-developer@lists.sourceforge.net
> Sent: Thursday, June 27, 2002 4:11 AM
> Subject: Re: [Htmlparser-user] Bad formed web page
>
> Hi Cedric,
> Thanks for the bug report. This has been reproduced in
> HTMLTagTest.testBrokenTag(), and has been fixed. The parser now runs
> without failing on the same html file provided.
> This fix will make it in the next integration release.
>
> Regarding your earlier bug report, although the bug has been fixed, I
> am thinking I should introduce a template method, so that new scanner
> writers don't have to bother about registering the tags with their
> respective scanners.
>
> Hopefully this refactoring will be in soon, enabling scanners to be
> written safely. Also need to get cracking at Claude's refactoring suggestions.
>
> Regards,
> Somik
>
> ----- Original Message -----
> From: Cédric Rosa <ced...@fr...>
> To: htm...@li...
> Sent: Thursday, June 27, 2002 12:48 AM
> Subject: [Htmlparser-user] Bad formed web page
>
> Re Somik,
>
> First, thanks for your patch; I'll download it as soon as possible.
>
> I've just tested your program with a web page which contains errors. I'm
> programming a search engine and some pages may contain errors.
> I attached a copy of a bad page example: the problem is that the page is
> truncated before its end (a download error, for example).
> It is missing a ">" ("<br"), which causes the program to crash with a null
> pointer exception ...
> Can you fix this problem or tell me where (in the sources) I can look to
> patch it?
>
> Thanks in advance for your good support.
>
> Cedric.
>
> At 20:28 26/06/2002 +0900, you wrote:
> > Hi Cedric,
> > This has been fixed. These two scanners (meta and title tag scanners)
> > were not being associated with their tags. Reproduced with a test case
> > and fixed. Code on CVS has been updated. This bug fix will make it in the
> > next integration release (hopefully this weekend).
> > Thanks for the bug report.
> > Cheers,
> > Somik
> > > ----- Original Message -----
> > > From: Somik Raha <so...@ya...>
> > > To: htm...@li...
> > > Sent: Wednesday, June 26, 2002 8:13 PM
> > > Subject: Re: [Htmlparser-user] -m option doesn't work ?
> > >
> > > It does look like a bug - you could probably open a BugZilla report (from
> > > http://htmlparser.sourceforge.net), and describe your fix. I will also
> > > try to take a deeper look as soon as I find some time.
> > >
> > > Regards,
> > > Somik
> > > > ----- Original Message -----
> > > > From: Cédric Rosa <ced...@fr...>
> > > > To: htm...@li...
> > > > Sent: Wednesday, June 26, 2002 8:14 PM
> > > > Subject: Re: [Htmlparser-user] -m option doesn't work ?
> > > >
> > > > I've tried with many urls, it's the same problem, but you can check with:
> > > > "http://www.cybergeo.presse.fr/actualit/nouvparu/crendus/irstcr3.htm"
> > > >
> > > > I've just modified the source code to make it work (and now it works fine)
> > > > ... so maybe it's a bug?
> > > >
> > > > Thanks for your help.
> > > >
> > > > Cedric.
> > > >
> > > > At 20:02 26/06/2002 +0900, you wrote:
> > > > > Hi Cedric,
> > > > > Can you give us the url, or send the page over?
> > > > >
> > > > > Regards
> > > > > Somik
> > > > > > ----- Original Message -----
> > > > > > From: Cédric Rosa <ced...@fr...>
> > > > > > To: htm...@li...
> > > > > > Sent: Wednesday, June 26, 2002 5:40 PM
> > > > > > Subject: [Htmlparser-user] -m option doesn't work ?
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > When I'm trying to parse a web page with htmlparser with this code:
> > > > > >
> > > > > > HTMLParser parser = new HTMLParser("foo.html");
> > > > > > parser.registerScanners();
> > > > > > parser.parse(null);
> > > > > >
> > > > > > everything is OK, but when I tried to parse the page with:
> > > > > >
> > > > > > parser.parse("-m");
> > > > > > or
> > > > > > parser.parse("-t");
> > > > > >
> > > > > > I received no answer from the software even if the page contains a
> > > > > > meta tag or title.
> > > > > >
> > > > > > What's wrong?
> > > > > >
> > > > > > Thanks in advance for your answers.
> > > > > >
> > > > > > Cedric.
> > > > > >
> > > > > > -------------------------------------------------------
> > > > > > This sf.net email is sponsored by: Jabber Inc.
> > > > > > Don't miss the IM event of the season | Special offer for OSDN members!
> > > > > > JabConf 2002, Aug. 20-22, Keystone, CO
> > > > > > http://www.jabberconf.com/osdn
> > > > > > _______________________________________________
> > > > > > Htmlparser-user mailing list
> > > > > > Htm...@li...
> > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
From: Somik R. <so...@ya...> - 2002-06-27 09:49:06
Hi Cedric,

> Thanks for this fix. But when I download the CVS version of HTMLParser and
> try to parse the page again, I get this error:
> "java.lang.OutOfMemoryError
>         <<no stack trace available>>
> Exception in thread "main" "
>
> Is this normal? Should I catch this error and write my own code around it?

It's highly abnormal... It should not happen - are you trying it with the
same piece of html? Send me the data you are trying it on. If it's the same
page, it works perfectly on my end. I am running HTMLParser (main) with
no params except the file name.

> Another question: I can't run the software with two options. Is this normal?
> Why don't you put the options before the name of the file to parse?

Yes, this is normal (a feature, not a bug). This is because the options are
intended only as a demo, and I didn't think they'd really be of use to
people. Are you actually using it this way? Also, I am not full time on this,
so I'd be grateful if you could join up as a developer and make this fix.

All code received from developers is acknowledged both in the code and on
the Contributors page that goes out with each release. You can send me
your sourceforge id and I can add you as a developer.

> It can be used like this:
> public HTMLStringNode(String text, int textBegin, int textEnd)
> {
>     NormalizeHtmlCode normalizer = new NormalizeHtmlCode();
>     this.text = normalizer.html2text(text);
>     this.textBegin = textBegin;
>     this.textEnd = textEnd;
> }
> You can implement it with the meta-tags, ...

This is cool. I think it will be useful in the toPlainString() method,
where we can get the actual meaningful text out. I'd be glad to include
this as soon as I find some time. Or Tarik can also join as a developer
and I can give him CVS access to do it.

Thanks a lot for your participation.

Cheers,
Somik
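Note for readers of the archive: the NormalizeHtmlCode.java attachment is not reproduced in this thread, so what it really does is unknown. As a rough idea of what an entity normalizer of the kind discussed above might do, here is a minimal sketch. The class name EntityNormalizerSketch, the tiny entity table, and the choice to copy unknown entities through unchanged are all assumptions; only the method name html2text comes from the messages above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the attached NormalizeHtmlCode; not the real class. */
final class EntityNormalizerSketch {
    // Small illustrative table; a real normalizer would cover the full entity list.
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("eacute", '\u00e9');
    }

    /** Replaces known "&name;", "&#NNN;" and "&#xHH;" references with their characters. */
    static String html2text(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int semi = (c == '&') ? html.indexOf(';', i + 1) : -1;
            // Only treat "&...;" as a candidate entity when the ';' is close by.
            if (semi > i + 1 && semi - i <= 10) {
                Character decoded = decode(html.substring(i + 1, semi));
                if (decoded != null) {
                    out.append(decoded.charValue());
                    i = semi + 1;
                    continue;
                }
            }
            out.append(c); // not an entity, or an unknown one: copy through unchanged
            i++;
        }
        return out.toString();
    }

    /** Returns the decoded character, or null if the reference is not recognised. */
    private static Character decode(String name) {
        if (name.startsWith("#")) {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1));
                return (code > 0 && code <= 0xFFFF) ? Character.valueOf((char) code) : null;
            } catch (NumberFormatException e) {
                return null; // malformed numeric reference
            }
        }
        return NAMED.get(name);
    }
}
```

Called as EntityNormalizerSketch.html2text(text), this would slot into a constructor like the HTMLStringNode one quoted above. Leaving unrecognised references untouched is a deliberate choice here, so that a stray "&" in ordinary text is never lost.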
From: R. <ced...@fr...> - 2002-06-27 10:16:47
Attachments:
index2.html
Oh yes, excuse me, it's another test :)
I attach the new document. I've trimmed another line of the page :)

I don't run the software with options; I use the classes from my own program.
It was just for testing :) I could make a fix when I find some time too :)

I'll register on SourceForge asap and I'll try to understand how CVS works :)
For the moment I can send my fixes to the developer mailing list.

Regards,
Cedric.
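Note for readers of the archive: the fix under discussion, letting the demo accept more than one option and in any position relative to the file name, might look roughly like the sketch below. DemoArgsSketch and its fields are hypothetical names, not HTMLParser code; only the -m and -t flags appear in the thread, and how the real main method reads its arguments is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical argument holder; not part of HTMLParser. */
final class DemoArgsSketch {
    final List<String> options = new ArrayList<>();
    String fileName; // stays null if no file name was given

    /** Accepts any number of "-x" flags in any position, plus exactly one file name. */
    static DemoArgsSketch parse(String[] args) {
        DemoArgsSketch parsed = new DemoArgsSketch();
        for (String arg : args) {
            if (arg.startsWith("-") && arg.length() > 1) {
                parsed.options.add(arg);          // e.g. "-m" or "-t"
            } else if (parsed.fileName == null) {
                parsed.fileName = arg;            // first non-flag argument
            } else {
                throw new IllegalArgumentException("unexpected extra argument: " + arg);
            }
        }
        return parsed;
    }
}
```

With this shape, "-m -t foo.html" and "foo.html -m" are both accepted, which covers the two cases raised in the thread.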
From: Somik R. <so...@ya...> - 2002-06-27 10:57:49
> Oh yes, excuse me, it's another test :)
> I attach the new document. I've trimmed another line of the page :)

Oh boy, are you torture testing or what :)
This bug has been fixed. The parser should work fine now. You can get
the latest code from CVS.

> I don't run the software with options; I use the classes from my own program.
> It was just for testing :) I could make a fix when I find some time too :)

Cool.

> I'll register on SourceForge asap and I'll try to understand how CVS works :)
> For the moment I can send my fixes to the developer mailing list.

Looking forward to seeing you on the dev list. By the way, my suggestion is
to try Eclipse - integration with CVS on SourceForge is a breeze, and
Eclipse is open source :) (www.eclipse.org)
If you want to go the hard way, check
http://cdx.sourceforge.net/win-HOWTO.htm

Cheers,
Somik
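Note for readers of the archive: the thread does not show how the truncated-page bugs were actually fixed in HTMLParser. As an illustration of the general idea only, a scanner can treat a tag that never reaches its ">" (such as a "<br" at the very end of a cut-off page) as a final token, instead of failing or looping. TruncatedTagSketch and tokens are hypothetical names, not part of HTMLParser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; HTMLParser's real scanner is not shown in this thread. */
final class TruncatedTagSketch {
    /** Splits html into text and tag tokens; an unterminated tag at EOF is kept as-is. */
    static List<String> tokens(String html) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int next;
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                // No '>' before end of input: take the rest as one truncated tag
                // rather than relying on a result that is not there.
                next = (close < 0) ? html.length() : close + 1;
            } else {
                int open = html.indexOf('<', i);
                next = (open < 0) ? html.length() : open;
            }
            out.add(html.substring(i, next));
            i = next; // next is always greater than i, so the loop terminates
        }
        return out;
    }
}
```

The point of the sketch is the two guards on indexOf returning -1 and the guarantee that the index always advances; a loop that fails to advance on malformed input is one plausible way to end up with an OutOfMemoryError like the one reported earlier, though the thread does not confirm that this was the cause.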