Thread: [Htmlparser-developer] Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
From: Somik R. <so...@ya...> - 2002-05-12 09:07:49
Hi Raghav,

I went through the yahoo.txt, and just like your previous one, this one too had very dirty html. The reason you got the OutOfMemoryError was that this kind of html sent the parser into an infinite loop (in HTMLLinkScanner). The tag which did this was:

<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap> <a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, the first link tag does not have an end tag. I verified with the actual yahoo page, and there this link occurs quite decently, with the correct end tag. After looking closely at your supplied file, I also noticed the </img> tag, which is highly unusual in normal html.

So - I am guessing that this file is generated by a program and not by a human. You would definitely want to check the program that's doing it - it's surely buggy.

However, my yardstick for the robustness of this parser is Internet Explorer. If the stuff works in IE, then it's got to work here. And as I tried this particularly bad piece of html, I found IE does not crash. Hence, I had to go about empowering the parser to parse these erroneous tags <sigh> Took hours!! </sigh>

The good news is, it's done. We can parse these tags, and the correct end tag is inserted just before td. Of course, I have done a minimal adjustment for your purpose. As time goes on, robustness ought to increase further. All test cases are passing. The framework for handling dirty html is also slightly modified.

An integration release has been made (2002-05-12), and is under the integration builds package. You can download it from http://htmlparser.sourceforge.net.

The parser should not crash on your html now.

Regards,
Somik

----- Original Message -----
From: Raghavender Srimantula
To: htm...@li...
Sent: Saturday, May 11, 2002 4:32 AM
Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document

Hi Somik,
I have mentioned about the out of memory error problem earlier. last time for every iteration of for loop I was adding the whole page to my string buffer. so it was giving me the out of memory error. I removed that now. it was working fine till yesterday. now I find that error again. this time nothing to do with string buffer...and it looks like a real problem. I can send you the main class and the yahoo.txt I have. try running it.
Thanks,
Raghav

>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>Date: Fri, 10 May 2002 00:43:19 +0900
>
>Hi Raghav,
> On analyzing yahoo.txt, I found that you have incorrect html. There is a script tag that has not been closed. So naturally the script scanner goes bonkers. Rename the extension to .html, and open this file in IE, and you will find that IE also cant handle this.
> I verified from www.yahoo.com, and found that they do have the correct </script> tag provided. So I guess your yahoo.txt file is faulty.
>
>Regards,
>Somik
> ----- Original Message -----
> From: Raghavender Srimantula
> To: htm...@li...
> Sent: Thursday, May 09, 2002 4:53 AM
> Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>
> Hi Somik,
> I was using the 1.1 version of htmlparser. I save the www.yahoo.com content in a flat file yahoo.txt. and I run the parser against this. throws a nullpointerexception in HTMLScriptScanner. this seems to be a new addition for 1.1. I will send the stacktrace, the main program and the yahoo.txt. actually I cannot send the stacktrace. I made some changes and the line numbers dont match. but if you run this program you would see the nullpointerexception.
> Thanks,
> Raghav
>
> >From: "Somik Raha" <so...@ya...>
> >Reply-To: htm...@li...
> >To: <htm...@li...>
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> >Date: Mon, 6 May 2002 13:59:11 +0900
> >
> >Hi Raghav,
> > I sent another mail sometime back to you -
> >
> >"HTMLLinkTag.linkData() - this gives you an enumeration - and in the enumeration will be your HTMLImageTag."
> >HTMLNode node;
> >HTMLImageTag imageTag;
> >for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
> >    node = (HTMLNode)e.nextElement();
> >    if (node instanceof HTMLImageTag) {
> >        imageTag = (HTMLImageTag)node;
> >        // your code here
> >    }
> >}
> >
> >Regards,
> >Somik
> >----- Original Message -----
> >From: "Raghavender Srimantula" <kin...@ho...>
> >To: <htm...@li...>
> >Sent: Monday, May 06, 2002 10:43 AM
> >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> >
> > > Hi Somik,
> > > this question is regarding "not all images are being retrieved". I mean the images under <a tag. I did try to open the attachment you sent me. I could not find anything. but seeing the previous mails I could read that it is not a bug. but still if I do want to retrieve all the images how do I do it.
> > > Thanks,
> > > Raghav
> > >
> > > >From: "Somik Raha" <so...@ya...>
> > > >Reply-To: htm...@li...
> > > >To: <htm...@li...>
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > >Date: Tue, 30 Apr 2002 11:37:26 +0900
> > > >
> > > >Hi Raghav,
> > > > Ah - this was a question by Annette Doyle (titled "Not all image tags are returned"). I am attaching my reply.
> > > >
> > > >Regards
> > > >Somik
> > > >
> > > >----- Original Message -----
> > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > >To: <htm...@li...>
> > > >Sent: Tuesday, April 30, 2002 11:16 AM
> > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > >
> > > > > hi Somik,
> > > > > I found one more interesting thing here. when I am trying to get all the images the image scanner would give me images
> > > > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif" width=296 height=27 border=0 usemap=#tm>
> > > > > so if I do a imagetag.getImageLocation(), I would get
> > > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> > > > >
> > > > > but is the html content is like this
> > > > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif border=0 width=70 height=22></a>
> > > > > which starts with <a and ends with </a>, then the image scanner will not give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I do a imagetag.getImageLocation(). this is not even classified as an ImageTag. this is classified as LinkTag. how to get this image.
> > > > >
> > > > > the above content is from www.yahoo.com. on the netscape browser if you goto view-->pageinfo, you will see a bunch of images.
> > > > > but when you run the htmlparser you can get only one image.
> > > > >
> > > > > Thanks,
> > > > > Raghav
> > > > >
> > > > > >From: "Somik Raha" <so...@ya...>
> > > > > >Reply-To: htm...@li...
> > > > > >To: <htm...@li...>
> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900
> > > > > >
> > > > > >Can you describe your application ? Was it parsing a single page when the problem occurred ?
> > > > > >
> > > > > >Regards,
> > > > > >Somik
> > > > > >----- Original Message -----
> > > > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > > > >To: <htm...@li...>
> > > > > >Cc: <htm...@li...>
> > > > > >Sent: Tuesday, April 30, 2002 8:36 AM
> > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > >
> > > > > > > Hi Somik,
> > > > > > > I encountered a strange problem today. while I was running htmlparser...I got a java.lang.OutOfMemoryError. seems that lot of objects are being allocated. where exactly is this happening. I mean could you give me an idea where or in which file the potential problem could be.
> > > > > > > Raghav
> > > > > > >
> > > > > > > >From: "Somik Raha" <so...@ya...>
> > > > > > > >Reply-To: htm...@li...
> > > > > > > >To: <htm...@li...>
> > > > > > > >CC: <htm...@li...>
> > > > > > > >Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > > > > >
> > > > > > > >Hi Annette,
> > > > > > > > Pls find attached a program to get you started. This program will do what you want - you will need to modify the construct that checks for the image tag - and replace it with the location of your choice.
> > > > > > > > Also - I found one bug thanks to this requirement - image tags params were not being correctly put in. Though it needs a deeper look, I have done a quick fix for now, and all test cases are passing (with one test case in HTMLImageScannerTest trapping this bug).
> > > > > > > > Please check out the latest html parser source code from CVS.
> > > > > > > >
> > > > > > > >Regards,
> > > > > > > >Somik
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Doyle, Annette
> > > > > > > > To: htm...@li...
> > > > > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > > > > Subject: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > > >
> > > > > > > > Could you please give me some hints as how to change only image tag locations and then, (or at the same time) write out the html document to file (with new image tag locations)?
> > > > > > > >
> > > > > > > > Thanks-
> > > > > > > > Annette Doyle
> > > > > > > >
> > > > > > > ><< ImageTagRetriever.java >>
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Htmlparser-user mailing list
> > > > > > > Htm...@li...
> > > > > > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> > > >
> > > ><< [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>
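
For readers following the fix described in the top message (a link tag left open, with the missing end tag inserted just before td), here is a minimal Java sketch of that idea. It is not the code in HTMLLinkScanner and does not use the htmlparser API at all: the class name DanglingLinkRepair, the method repair, and the regular expression are hypothetical, chosen only for illustration. It scans the text for <a>, </a>, <td> and </td>, and whenever a new link or a table-cell boundary appears while a link is still open, it writes </a> first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of htmlparser: a string-level illustration only.
final class DanglingLinkRepair {
    // Matches <a ...>, </a>, <td ...> and </td>; \b keeps <address> etc. from matching.
    private static final Pattern TAG =
            Pattern.compile("<(/?)(a|td)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String repair(String html) {
        StringBuilder out = new StringBuilder();
        boolean linkOpen = false;   // true while inside an <a> with no </a> seen yet
        int last = 0;               // end of the previously copied region
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            out.append(html, last, m.start());          // copy text before this tag
            boolean closing = !m.group(1).isEmpty();
            boolean isLink = m.group(2).equalsIgnoreCase("a");
            if (isLink && closing) {
                linkOpen = false;                       // link was closed properly
            } else {
                if (linkOpen) {
                    out.append("</a>");                 // close the dangling link first
                    linkOpen = false;
                }
                if (isLink) {
                    linkOpen = true;                    // a new link starts here
                }
            }
            out.append(m.group());                      // copy the tag itself
            last = m.end();
        }
        out.append(html, last, html.length());
        if (linkOpen) {
            out.append("</a>");                         // link still open at end of input
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<a href=s/8741><img src=\"http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif\""
                + " height=16 width=16 border=0></img></td><td nowrap> "
                + "<a href=s/7509><b>Yahoo! Movies</b></a>";
        System.out.println(repair(dirty));
    }
}
```

Run on the fragment quoted in the top message, the sketch places </a> between </img> and </td> and leaves the second, well-formed link untouched. A real parser works on tokens rather than regular expressions, so treat this only as a picture of the recovery rule, not of how the scanner implements it.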