Thread: [Htmlparser-developer] Re: [Htmlparser-user] Hints on how to change image tag locations andwriteoutd
From: Somik R. <so...@ya...> - 2002-05-12 09:07:49
Hi Raghav,
I went through yahoo.txt, and just like your previous one, this one too
had very dirty html. The reason you got the OutOfMemoryError was that
this kind of html sent the parser into an infinite loop (in
HTMLLinkScanner).
The tag which did this was:
<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap>
<a href=s/7509><b>Yahoo! Movies</b></a>
As you can see, the first link tag does not have an end tag. I checked
the actual yahoo page, and there this link appears correctly, with its
end tag. Looking closely at your supplied file, I also noticed the
</img> tag, which is highly unusual in normal html.
So I am guessing that this file was generated by a program and not by a
human. You would definitely want to check the program that's doing it;
it's surely buggy.
However, my yardstick for the robustness of this parser is Internet
Explorer. If the stuff works in IE, then it's got to work here. And when
I tried this particularly bad piece of html, I found IE does not crash.
Hence, I had to go about empowering the parser to parse these erroneous
tags <sigh> Took hours!! </sigh>
The good news is, it's done. We can parse these tags, and the correct
end tag is inserted just before td. Of course, I have made only a
minimal adjustment for your purpose. As time goes on, robustness ought
to increase further. All test cases are passing. The framework for
handling dirty html is also slightly modified.
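For illustration only, here is a minimal sketch of the kind of repair
described above: when a link is still open and a td tag (or another link
start) shows up, a closing </a> is emitted first. This is not the actual
HTMLLinkScanner code. The class name EndTagRepair, the regex-based
tokenizing, and the shortened image URL are all assumptions made up for
this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the htmlparser library.
public class EndTagRepair {
    // Matches one tag; group 1 is an optional leading slash, group 2 the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns html with a missing </a> inserted before the next td tag or link start. */
    public static String repair(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        boolean linkOpen = false;
        int pos = 0;
        // Every find() call moves forward in the input, so the loop always terminates.
        while (m.find()) {
            out.append(html, pos, m.start());          // copy text between tags
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();
            if (name.equals("a")) {
                if (!closing && linkOpen) {
                    out.append("</a>");                // new link starts: close the old one
                }
                linkOpen = !closing;
            } else if (name.equals("td") && linkOpen) {
                out.append("</a>");                    // close the dangling link before td
                linkOpen = false;
            }
            out.append(m.group());
            pos = m.end();
        }
        out.append(html, pos, html.length());          // copy trailing text
        if (linkOpen) {
            out.append("</a>");                        // link still open at end of input
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened version of the problem tag; the URL is abbreviated for the example.
        String dirty = "<a href=s/8741><img src=\"mov_popc.gif\"></img></td><td nowrap>"
                + "<a href=s/7509><b>Yahoo! Movies</b></a>";
        System.out.println(repair(dirty));
    }
}
```

The sketch only tracks a single open link and treats td as the recovery
point, which mirrors the minimal adjustment mentioned above; a real
scanner would need to handle more recovery points than this.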
An integration release has been made (2002-05-12), and is under the
integration builds package. You can download it from
http://htmlparser.sourceforge.net.

The parser should not crash on your html now.
Regards,
Somik
----- Original Message -----
From: Raghavender Srimantula
To: htm...@li...
Sent: Saturday, May 11, 2002 4:32 AM
Subject: Re: [Htmlparser-user] Hints on how to change image tag locations andwriteoutdocument
Hi Somik,
I mentioned the out of memory error problem earlier. Last time, on every
iteration of the for loop I was adding the whole page to my string
buffer, so it was giving me the out of memory error. I have removed that
now. It was working fine till yesterday. Now I get that error again.
This time it has nothing to do with the string buffer, and it looks like
a real problem. I can send you the main class and the yahoo.txt I have.
Try running it.
Thanks,
Raghav
>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations andwriteoutdocument
>Date: Fri, 10 May 2002 00:43:19 +0900
>
>Hi Raghav,
> On analyzing yahoo.txt, I found that you have incorrect html. There is
>a script tag that has not been closed. So naturally the script scanner
>goes bonkers. Rename the extension to .html, and open this file in IE,
>and you will find that IE also can't handle this.
> I verified from www.yahoo.com, and found that they do have the correct
></script> tag provided. So I guess your yahoo.txt file is faulty.
>
>Regards,
>Somik
> ----- Original Message -----
> From: Raghavender Srimantula
> To: htm...@li...
> Sent: Thursday, May 09, 2002 4:53 AM
> Subject: Re: [Htmlparser-user] Hints on how to change image tag locations andwriteoutdocument
>
>
> Hi Somik,
> I was using the 1.1 version of htmlparser. I saved the www.yahoo.com
> content in a flat file, yahoo.txt, and ran the parser against it. It
> throws a NullPointerException in HTMLScriptScanner. This seems to be a
> new addition in 1.1. I will send the stacktrace, the main program and
> the yahoo.txt. Actually, I cannot send the stacktrace: I made some
> changes and the line numbers don't match. But if you run this program
> you will see the NullPointerException.
> Thanks,
> Raghav
>
>
> >From: "Somik Raha" <so...@ya...>
> >Reply-To: htm...@li...
> >To: <htm...@li...>
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and writeoutdocument
> >Date: Mon, 6 May 2002 13:59:11 +0900
> >
> >Hi Raghav,
> > I sent another mail to you some time back -
> >
> >"HTMLLinkTag.linkData() - this gives you an enumeration - and in the
> >enumeration will be your HTMLImageTag."
> >HTMLNode node;
> >HTMLImageTag imageTag;
> >// Walk the nodes nested inside the link and pick out the image tags.
> >for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
> >    node = (HTMLNode)e.nextElement();
> >    if (node instanceof HTMLImageTag) {
> >        imageTag = (HTMLImageTag)node;
> >        // your code here
> >    }
> >}
> >
> >Regards,
> >Somik
> >----- Original Message -----
> >From: "Raghavender Srimantula" <kin...@ho...>
> >To: <htm...@li...>
> >Sent: Monday, May 06, 2002 10:43 AM
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and writeoutdocument
> >
> >
> > > Hi Somik,
> > > This question is regarding "not all images are being retrieved". I
> > > mean the images under the <a tag. I did try to open the attachment
> > > you sent me, but I could not find anything. From the previous mails
> > > I gather that it is not a bug. But if I still want to retrieve all
> > > the images, how do I do it?
> > > Thanks,
> > > Raghav
> > >
> > >
> > > >From: "Somik Raha" <so...@ya...>
> > > >Reply-To: htm...@li...
> > > >To: <htm...@li...>
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write outdocument
> > > >Date: Tue, 30 Apr 2002 11:37:26 +0900
> > > >
> > > >Hi Raghav,
> > > > Ah - this was a question by Annette Doyle (titled "Not all image
> > > >tags are returned"). I am attaching my reply.
> > > >
> > > >Regards
> > > >Somik
> > > >
> > > >----- Original Message -----
> > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > >To: <htm...@li...>
> > > >Sent: Tuesday, April 30, 2002 11:16 AM
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write outdocument
> > > >
> > > >
> > > > > Hi Somik,
> > > > > I found one more interesting thing here. When I am trying to get
> > > > > all the images, the image scanner gives me images like
> > > > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif"
> > > > > width=296 height=27 border=0 usemap=#tm>
> > > > > so if I do an imagetag.getImageLocation(), I get
> > > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> > > > >
> > > > > But if the html content is like this
> > > > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif
> > > > > border=0 width=70 height=22></a>
> > > > > which starts with <a and ends with </a>, then the image scanner
> > > > > will not give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif
> > > > > when I do an imagetag.getImageLocation(). This is not even
> > > > > classified as an ImageTag; it is classified as a LinkTag. How do
> > > > > I get this image?
> > > > >
> > > > > The above content is from www.yahoo.com. In the Netscape browser,
> > > > > if you go to view-->pageinfo, you will see a bunch of images,
> > > > > but when you run the htmlparser you get only one image.
> > > > >
> > > > > Thanks,
> > > > > Raghav
> > > > >
> > > > >
> > > > > >From: "Somik Raha" <so...@ya...>
> > > > > >Reply-To: htm...@li...
> > > > > >To: <htm...@li...>
> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write outdocument
> > > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900
> > > > > >
> > > > > >Can you describe your application? Was it parsing a single page
> > > > > >when the problem occurred?
> > > > > >
> > > > > >Regards,
> > > > > >Somik
> > > > > >----- Original Message -----
> > > > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > > > >To: <htm...@li...>
> > > > > >Cc: <htm...@li...>
> > > > > >Sent: Tuesday, April 30, 2002 8:36 AM
> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write outdocument
> > > > > >
> > > > > >
> > > > > > > Hi Somik,
> > > > > > > I encountered a strange problem today. While I was running
> > > > > > > htmlparser, I got a java.lang.OutOfMemoryError. It seems that
> > > > > > > a lot of objects are being allocated. Where exactly is this
> > > > > > > happening? Could you give me an idea of where, or in which
> > > > > > > file, the potential problem could be?
> > > > > > > Raghav
> > > > > > >
> > > > > > >
> > > > > > > >From: "Somik Raha" <so...@ya...>
> > > > > > > >Reply-To: htm...@li...
> > > > > > > >To: <htm...@li...>
> > > > > > > >CC: <htm...@li...>
> > > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > > > > >
> > > > > > > >Hi Annette,
> > > > > > > > Please find attached a program to get you started. This
> > > > > > > >program will do what you want - you will need to modify the
> > > > > > > >construct that checks for the image tag, and replace it with
> > > > > > > >the location of your choice.
> > > > > > > > Also - I found one bug thanks to this requirement: image
> > > > > > > >tag params were not being correctly put in. Though it needs
> > > > > > > >a deeper look, I have done a quick fix for now, and all test
> > > > > > > >cases are passing (with one test case in HTMLImageScannerTest
> > > > > > > >trapping this bug).
> > > > > > > > Please check out the latest html parser source code from
> > > > > > > >CVS.
> > > > > > > >
> > > > > > > >Regards,
> > > > > > > >Somik
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Doyle, Annette
> > > > > > > > To: htm...@li...
> > > > > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > > > > Subject: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > >
> > > > > > > >
> > > > > > > > Could you please give me some hints on how to change only
> > > > > > > > image tag locations and then (or at the same time) write out
> > > > > > > > the html document to file (with the new image tag locations)?
> > > > > > > >
> > > > > > > > Thanks-
> > > > > > > >
> > > > > > > > Annette Doyle
> > > > > > > >
> > > > > > > ><< ImageTagRetriever.java >>
> > > > > > > _______________________________________________
> > > > > > > Htmlparser-user mailing list
> > > > > > > Htm...@li...
> > > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> > > ><< [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>