Hi Cedric,
I've fixed this bug (in HTMLStringNode.java). The fix should be in the
next integration release. If you are in a hurry, you can check out from
CVS and build.
I've also added StringExtractor under parserapplications - it seems
to be a common app for a lot of people.
Regards
Somik
----- Original Message -----
From: Cédric Rosa
To: so...@ya...
Sent: Tuesday, July 16, 2002 5:16 PM
Subject: Re: Fw: [Htmlparser-user] Microsoft's ugly web page generation and parsing
Hi Somik, first, sorry for the big files included in this mail.
"ref6.htm" is the document I obtain when crawling with wget. I tried
crawling with your software and it works better. wget may insert some
newlines or other characters in the saved file.
As you can see in the file "logpb.log", when I parse the page directly,
your parser is almost perfect, but <![endif]> tags are still present.
"logpb2.log" contains the log from parsing the copy of "ref6.htm" that is
on disk.
Thanks a ton for your excellent support,
Regards,
Cédric.
At 08:56 16/07/2002 +0200, you wrote:
>
>----- Original Message -----
>From: <mailto:so...@ya...>Somik Raha
>To: <mailto:htm...@li...>htm...@li...urceforge.net
>
>Sent: Tuesday, July 16, 2002 2:12 AM
>Subject: Re: [Htmlparser-user] Microsoft's ugly web page generation and parsing
>
>Hi Cedric
> I couldn't figure out your bug report. On parsing the pages, the
> output seemed (prima facie) to be correct.
> Can you specifically give the input that we should try with, and what
> the actual output should be, and also post what you are getting?
> Alternatively, tell me which lines in the page are not being parsed correctly.
> Thanks.
>
>Regards,
>Somik
>
>----- Original Message -----
>From: <mailto:ced...@fr...>Cédric Rosa
>To: <mailto:htm...@li...>htm...@li...urceforge.net
>
>Sent: Monday, July 15, 2002 11:41 PM
>Subject: [Htmlparser-user] Microsoft's ugly web page generation and parsing
>
>Hello,
>
>Simply try parsing this ugly document, for example:
><http://www.cevipof.msh-paris.fr\moment\ref6.htm>www.cevipof.msh-paris.fr\moment\ref6.htm
>(2.6 MB!)
>
>The text resulting from the parse contains lines like:
>"" <![endif]--><!--[if supportFields]>"
>"v\:* {behavior:url(#default#VML);}"
>"mso-font-pitch:variable;"
>
>I think there is a problem in text detection when several tags are
>nested.
>A possible solution would be to skip all text after "<!--" until "-->".
>
>I don't have time to patch the code. If someone can fix this problem, it
>would be fantastic.
>
>Thanks in advance,
>
>Cedric Rosa.
>
>_______________________________________________
>Htmlparser-user mailing list
><mailto:Htm...@li...>Htm...@li...urceforge.net
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
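
The approach suggested in the original report (skip all text after "<!--" until "-->") can be sketched roughly as below. This is only an illustration of that idea, not the actual change made in HTMLStringNode.java: the class name CommentSkipper and the method stripComments are hypothetical and are not part of htmlparser. It also assumes that an unterminated comment should swallow the rest of the input.

```java
// Hypothetical helper for illustration only; not part of htmlparser.
final class CommentSkipper {
    private static final String OPEN = "<!--";
    private static final String CLOSE = "-->";

    // Returns the input with every "<!-- ... -->" span removed.
    // Assumption: an unterminated comment discards the rest of the input.
    static String stripComments(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (pos < html.length()) {
            int start = html.indexOf(OPEN, pos);
            if (start < 0) {
                // No further comment: keep the remaining text.
                out.append(html, pos, html.length());
                break;
            }
            // Keep the text before the comment opener.
            out.append(html, pos, start);
            int end = html.indexOf(CLOSE, start + OPEN.length());
            if (end < 0) {
                break; // unterminated comment: drop the rest
            }
            pos = end + CLOSE.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample =
            "before<!--[if supportFields]>mso-font-pitch:variable;<![endif]-->after";
        System.out.println(stripComments(sample)); // prints "beforeafter"
    }
}
```

Because the Microsoft conditional blocks quoted in the report are wrapped as "<!--[if ...]> ... <![endif]-->", treating the whole span as one comment also removes the "<![endif]" marker and the style text inside it.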