Thread: Re: Fw: [Htmlparser-user] Microsoft's ugly web page generation and parsing


Hi Cedric,
    I've fixed this bug (in HTMLStringNode.java). The fix should be in the
next integration release. If you are in a hurry, you can check out from
CVS and build.
    I've also added StringExtractor under parserapplications, since it seems
to be a common application for a lot of people.

Regards
Somik
  ----- Original Message -----
  From: Cédric Rosa
  To: so...@ya...
  Sent: Tuesday, July 16, 2002 5:16 PM
  Subject: Re: Fw: [Htmlparser-user] Microsoft's ugly web page generation and parsing

  Hi Somik, first, apologies for the large files included in this mail.

  "ref6.htm" is the document I obtain when crawling with wget. I tried
  crawling with your software and it works better. wget may insert some
  newlines or other characters into the saved file.

  As you can see in the file "logpb.log", when I parse the file directly,
  your parser is almost perfect, but <![endif]> tags are still present.

  "logpb2.log" contains the log from parsing the copy of "ref6.htm" that is
  on disk.

  Thanks a ton for your excellent support,

  Regards,

  Cédric.

  At 08:56 16/07/2002 +0200, you wrote:
  >
  >----- Original Message -----
  >From: <mailto:so...@ya...>Somik Raha
  >To: <mailto:htm...@li...>htm...@li...urceforge.net
  >Sent: Tuesday, July 16, 2002 2:12 AM
  >Subject: Re: [Htmlparser-user] Microsoft's ugly web page generation and parsing
  >
  >Hi Cedric
  >     I couldn't figure out your bug report. On parsing the pages, the
  > output seemed (prima facie) to be correct.
  >     Can you give the specific input that we should try, say what
  > the actual output should be, and also post what you are getting?
  > Alternatively, tell me which lines in the page are not being parsed correctly.
  >    Thanks.
  >
  >Regards,
  >Somik
  >
  >----- Original Message -----
  >From: <mailto:ced...@fr...>Cédric Rosa
  >To: <mailto:htm...@li...>htm...@li...urceforge.net
  >Sent: Monday, July 15, 2002 11:41 PM
  >Subject: [Htmlparser-user] Microsoft's ugly web page generation and parsing
  >
  >Hello,
  >
  >Simply try to parse this ugly document, for example:
  ><http://www.cevipof.msh-paris.fr\moment\ref6.htm>
  >(2.6 MB!)
  >
  >The text that results from the parse contains lines like:
  >"&quot; <![endif]--><!--[if supportFields]>"
  >"v\:* {behavior:url(#default#VML);}"
  >"mso-font-pitch:variable;"
  >
  >I think there is a problem in text detection when several tags are
  >nested.
  >A possible solution would be to skip all text after "<!--" until "-->".
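
The suggestion above (skip everything from "<!--" up to the matching "-->") can be sketched in a few lines of Java. This is only an illustration of the idea and not the fix that went into HTMLStringNode.java. The class name CommentSkipper and the method stripComments are hypothetical and are not part of the htmlparser API. The sketch assumes comments do not nest and treats an unterminated comment as running to the end of the input.

```java
public class CommentSkipper {

    /**
     * Returns a copy of html with every "<!--" ... "-->" span removed.
     * Hypothetical helper for illustration; not part of htmlparser.
     */
    public static String stripComments(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (pos < html.length()) {
            int open = html.indexOf("<!--", pos);
            if (open < 0) {
                // No further comment: keep the remaining text as-is.
                out.append(html, pos, html.length());
                break;
            }
            // Keep the text that precedes the comment.
            out.append(html, pos, open);
            int close = html.indexOf("-->", open + 4);
            if (close < 0) {
                // Unterminated comment: drop the rest of the input.
                break;
            }
            // Resume scanning just past the closing "-->".
            pos = close + 3;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "before<!--[if supportFields]>hidden<![endif]-->after";
        System.out.println(stripComments(sample)); // prints "beforeafter"
    }
}
```

Because Microsoft's conditional markup such as <!--[if supportFields]> ... <![endif]--> is an ordinary comment at the syntax level, a pass like this drops the whole block, including style text of the "mso-font-pitch:variable;" kind when it sits inside such a comment.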
  >
  >I don't have time to patch the code. If someone can fix this problem, it
  >would be fantastic.
  >
  >Thanks in advance,
  >
  >Cedric Rosa.
  >
  >_______________________________________________
  >Htmlparser-user mailing list
  ><mailto:Htm...@li...>Htm...@li...urceforge.net
  >
  >https://lists.sourceforge.net/lists/listinfo/htmlparser-user