Re: Fw: [Htmlparser-user] Microsoft's ugly web page generation and parsing
From: Somik R. <so...@ya...> - 2002-07-17 01:57:49
Hi Cedric,

I've fixed this bug (in HTMLStringNode.java). The fix should be in the next integration release. If you are in a hurry, you can check out from CVS and build.

I've also added StringExtractor under parserapplications - it seems to be a common app for a lot of people.

Regards,
Somik

----- Original Message -----
From: Cédric Rosa
To: so...@ya...
Sent: Tuesday, July 16, 2002 5:16 PM
Subject: Re: Fw: [Htmlparser-user] Microsoft's ugly web page generation and parsing

Hi Somik,

First, excuse me for the big files included in my mail.

"ref6.htm" is the document I obtain when crawling with wget. I tried crawling with your software and it works better. wget may include some newlines or other characters in the saved file.

As you can see in the file "logpb.log", when I directly parse the file, your parser is almost perfect but <![endif]> tags are still there.

"logpb2.log" contains the log when parsing with "ref6.htm" which is on the disk.

Thanks a ton for your excellent support,

Regards,
Cédric.

At 08:56 16/07/2002 +0200, you wrote:

>----- Original Message -----
>From: Somik Raha <so...@ya...>
>To: htm...@li...urceforge.net
>Sent: Tuesday, July 16, 2002 2:12 AM
>Subject: Re: [Htmlparser-user] Microsoft's ugly web page generation and parsing
>
>Hi Cedric
>
>I couldn't figure out your bug report. On parsing the pages, the output
>seemed (prima facie) to be correct.
>Can you specifically give the input that we should try with, and what the
>actual output should be, and also post what you are getting.
>Alternatively, tell me which lines in the page are not being parsed correctly.
>Thanks.
>
>Regards,
>Somik
>
>----- Original Message -----
>From: Cédric Rosa <ced...@fr...>
>To: htm...@li...urceforge.net
>Sent: Monday, July 15, 2002 11:41 PM
>Subject: [Htmlparser-user] Microsoft's ugly web page generation and parsing
>
>Hello,
>
>Simply try to parse this ugly document for example:
>www.cevipof.msh-paris.fr\moment\ref6.htm
>(2.6 MB!!!!!)
>
>The text which results from the parse contains lines like:
>"" <![endif]--><!--[if supportFields]>"
>"v\:* {behavior:url(#default#VML);}"
>"mso-font-pitch:variable;"
>
>I think there is a problem in text detection when several tags are nested.
>The solution may be to skip all text after "<!--" until "-->".
>
>I don't have time to patch the code. If someone can fix this problem, it
>will be fantastic.
>
>Thanks in advance,
>
>Cedric Rosa.
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Htmlparser-user mailing list
>htm...@li...urceforge.net
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
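
For readers following the thread, the workaround Cedric suggests (skip everything from "<!--" up to "-->") can be sketched roughly as below. This is only an illustration under stated assumptions, not the fix that went into HTMLStringNode.java: the class CommentSkipper and the method stripComments are hypothetical names and are not part of htmlparser. The sketch also drops standalone "<![ ... ]>" markers such as <![endif]>, which Cedric reports are still left in the output.

```java
// Hypothetical helper, not part of the htmlparser library.
final class CommentSkipper {
    private CommentSkipper() {}

    /**
     * Returns the input with every "<!-- ... -->" span and every
     * "<![ ... ]>" marker removed. Unterminated spans are dropped
     * up to the end of the input.
     */
    static String stripComments(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (pos < html.length()) {
            if (html.startsWith("<!--", pos)) {
                // Skip to the first "-->" after the opening delimiter.
                int end = html.indexOf("-->", pos + 4);
                if (end < 0) {
                    break; // no closing "-->": discard the rest
                }
                pos = end + 3;
            } else if (html.startsWith("<![", pos)) {
                // Skip markers such as <![endif]> or <![if !vml]>.
                int end = html.indexOf("]>", pos + 3);
                if (end < 0) {
                    break; // no closing "]>": discard the rest
                }
                pos = end + 2;
            } else {
                out.append(html.charAt(pos));
                pos++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample =
            "before<!--[if supportFields]>hidden<![endif]-->after<![endif]>";
        System.out.println(stripComments(sample));
    }
}
```

A plain scan like this does not understand <style> or <script> blocks, so CSS such as "mso-font-pitch:variable;" is only removed when it happens to sit inside a comment. It would also strip CDATA sections, since they start with "<![" as well.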