Re: Fw: [Htmlparser-user] Bad formed web page
From: R. <ced...@fr...> - 2002-06-27 09:39:16
Hello Somik,

Thanks for this fix. But when I download the CVS version of HTMLParser and
try to parse the page again, I get this error:

"java.lang.OutOfMemoryError
<<no stack trace available>>
Exception in thread "main" "

Is this normal? Should I catch this error and write my own code around it?

Another question: I can't run the software with two options. Is this normal?
Why don't you put the options before the name of the file to parse?

Lastly, a friend (Tarik Mokhtari) wrote a "little" normalizer to convert
"&*". Maybe it would be a good idea to add it to the project? It can be
used like this:

    public HTMLStringNode(String text, int textBegin, int textEnd) {
        NormalizeHtmlCode normalizer = new NormalizeHtmlCode();
        this.text = normalizer.html2text(text);
        this.textBegin = textBegin;
        this.textEnd = textEnd;
    }

You can implement it with the meta tags, ...

Regards,
Cedric.

At 08:23 27/06/2002 +0200, you wrote:
>
>----- Original Message -----
>From: Somik Raha <so...@ya...>
>To: htm...@li...urceforge.net
>Cc: htmlparser-developer@lists.sourceforge.net
>Sent: Thursday, June 27, 2002 4:11 AM
>Subject: Re: [Htmlparser-user] Bad formed web page
>
>Hi Cedric,
> Thanks for the bug report. This has been reproduced in
> HTMLTagTest.testBrokenTag(), and has been fixed. The parser now runs
> without failing on the same html file provided.
> This fix will make it in the next integration release.
>
> Regarding your earlier bug report, although the bug has been fixed, I
> am thinking I should introduce a template method, so that new scanner
> writers don't have to bother about registering the tags with their
> respective scanners.
>
> Hopefully this refactoring will be in soon enabling scanners to be
> written safely. Also need to get cracking at Claude's refactoring suggestions.
>
>Regards,
>Somik
>----- Original Message -----
>From: Cédric Rosa <ced...@fr...>
>To: htm...@li...urceforge.net
>Sent: Thursday, June 27, 2002 12:48 AM
>Subject: [Htmlparser-user] Bad formed web page
>
>Re Somik,
>
>First, thanks for your patch; I'll download it as soon as possible.
>
>I've just tested your program with a web page which contains errors. I'm
>programming a search engine and some pages may contain errors.
>I attached a copy of a bad page example: the problem is that the page is
>truncated before its end (a download error, for example).
>It is missing a ">" ("<br"), which causes the program to crash with a null
>pointer exception ...
>Can you fix this problem or tell me where (in the sources) I can look in
>order to patch it?
>
>Thanks in advance for your good support.
>
>Cedric.
>
>At 20:28 26/06/2002 +0900, you wrote:
> >Hi Cedric,
> > This has been fixed. These two scanners (meta and title tag scanners)
> > were not being associated with their tags. Reproduced with a test case
> > and fixed. Code on CVS has been updated. This bug fix will make it in the
> > next integration release (hopefully this weekend).
> > Thanks for the bug report.
> >Cheers,
> >Somik
> >>----- Original Message -----
> >>From: Somik Raha <so...@ya...>
> >>To: htm...@li...urceforge.net
> >>Sent: Wednesday, June 26, 2002 8:13 PM
> >>Subject: Re: [Htmlparser-user] -m option doesn't work ?
> >>
> >>It does look like a bug - you could probably open a BugZilla report (from
> >>http://htmlparser.sourceforge.net),
> >>and describe your fix. I will also try to take a deeper look as soon as I
> >>find some time.
> >>
> >>Regards,
> >>Somik
> >>>----- Original Message -----
> >>>From: Cédric Rosa <ced...@fr...>
> >>>To: htm...@li...urceforge.net
> >>>Sent: Wednesday, June 26, 2002 8:14 PM
> >>>Subject: Re: [Htmlparser-user] -m option doesn't work ?
> >>>
> >>>I've tried with many urls and it's the same problem, but you can check with:
> >>>"http://www.cybergeo.presse.fr/actualit/nouvparu/crendus/irstcr3.htm"
> >>>
> >>>I've just modified the source code to make it work (and now it works fine)
> >>>... so maybe it's a bug?
> >>>
> >>>Thanks for your help.
> >>>
> >>>Cedric.
> >>>
> >>>At 20:02 26/06/2002 +0900, you wrote:
> >>> >Hi Cedric,
> >>> > Can you give us the url, or send the page over?
> >>> >
> >>> >Regards
> >>> >Somik
> >>> >>----- Original Message -----
> >>> >>From: Cédric Rosa <ced...@fr...>
> >>> >>To: htmlparser-user@lists.sourceforge.net
> >>> >>Sent: Wednesday, June 26, 2002 5:40 PM
> >>> >>Subject: [Htmlparser-user] -m option doesn't work ?
> >>> >>
> >>> >>Hello,
> >>> >>
> >>> >>When I try to parse a web page with htmlparser with this code:
> >>> >>
> >>> >>HTMLParser parser = new HTMLParser("foo.html");
> >>> >>parser.registerScanners();
> >>> >>parser.parse(null);
> >>> >>
> >>> >>everything is OK, but when I tried to parse the page with:
> >>> >>
> >>> >>parser.parse("-m");
> >>> >>or
> >>> >>parser.parse("-t");
> >>> >>
> >>> >>I received no answer from the software even if the page contains a meta
> >>> >>tag or title.
> >>> >>
> >>> >>What's wrong?
> >>> >>
> >>> >>Thanks in advance for your answers.
> >>> >>
> >>> >>Cedric.
> >>> >>
> >>> >>-------------------------------------------------------
> >>> >>This sf.net email is sponsored by: Jabber Inc.
> >>> >>Don't miss the IM event of the season | Special offer for OSDN members!
> >>> >>JabConf 2002, Aug. 20-22, Keystone, CO
> >>> >>http://www.jabberconf.com/osdn
> >>> >>_______________________________________________
> >>> >>Htmlparser-user mailing list
> >>> >>Htmlparser-user@lists.sourceforge.net
> >>> >>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
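
The thread mentions a normalizer with an html2text(String) method but does not include its source. For readers wondering what such a helper might do, here is a minimal sketch, assuming it decodes HTML character references ("&amp;", "&#233;", "&#xE9;") into plain text. The class name EntityNormalizer and the tiny entity table are hypothetical and chosen only for illustration; this is not Tarik Mokhtari's NormalizeHtmlCode and not part of HTMLParser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the normalizer mentioned in the thread.
// It decodes a few named entities plus decimal and hexadecimal
// character references, and leaves anything it does not recognise untouched.
final class EntityNormalizer {

    // Deliberately small table; a real normalizer would cover the full HTML entity set.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("eacute", "\u00E9");
    }

    static String html2text(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            // Only look for a terminating ';' when we are sitting on '&'.
            int semi = (c == '&') ? html.indexOf(';', i + 1) : -1;
            // Require a non-empty, reasonably short reference body.
            String decoded = (semi > i + 1 && semi - i <= 10)
                    ? decode(html.substring(i + 1, semi))
                    : null;
            if (decoded != null) {
                out.append(decoded);
                i = semi + 1;
            } else {
                out.append(c); // not a reference we know: copy the character as-is
                i++;
            }
        }
        return out.toString();
    }

    // Returns the decoded text for a reference body such as "amp", "#233" or "#xE9",
    // or null when the body is not recognised.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.length() > 1
                    && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1), 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // malformed numeric reference, keep the original text
        }
    }

    public static void main(String[] args) {
        System.out.println(html2text("Tom &amp; Jerry &lt;br&gt;"));
    }
}
```

Unknown or unterminated references (for example "&bogus;" or a bare "&" in "AT&T") are copied through unchanged, which seems the safer choice for the malformed pages discussed earlier in this thread.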