Thread: [Htmlparser-developer] Re: [Htmlparser-user] Not all image tags are returned
From: Somik R. <so...@ya...> - 2002-04-26 03:28:17
Hi Annette,

Thanks for the report. I wrote a functional test case that does a raw check for IMG tags and a check with the parser, and I could reproduce the bug. I don't think it's a problem with the image scanner code, because the unit tests are passing with the same Yahoo tags.

Here's a quick solution for you: don't use registerScanners() for now. Since your app specifically needs to check only image tags, replace the line

    parser.registerScanners();

with

    parser.addScanner(new HTMLImageScanner("-i"));

I checked that all the Yahoo image tags come through fine with this change. The functional test has been checked into CVS (FunctionalTests.java), and the one with registerScanners() fails. The corresponding unit test in HTMLImageScanner passes.

Meanwhile, I am trying to find out which scanner is messing up.

Thanks again for your report.

Cheers,
Somik

----- Original Message -----
From: Doyle, Annette
To: htm...@li...
Sent: Friday, April 26, 2002 1:32 AM
Subject: [Htmlparser-user] Not all image tags are returned

Is there any known problem about not all image tags being returned? I wrote the following code:

    HTMLParser parser = new HTMLParser(htmlOriginalFileLoc);
    // Registering all the common scanners
    parser.registerScanners();
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        HTMLNode node = (HTMLNode)e.nextElement();
        if (node instanceof HTMLImageTag) {
            System.out.println();
            System.out.println(((HTMLImageTag)node).getTagLine());
            System.out.println();
            //imageTagsUrl.addElement(((HTMLImageTag)node).getImageLocation());
        }
    }

I was testing with another HTML parser and it found all the image tags. Attached is the source from www.yahoo.com when I ran the code above.
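
For readers who want to run a similar "raw check" for IMG tags against their own page, here is a minimal sketch using only the JDK. It is not the code in FunctionalTests.java; the class name RawImgCounter and the method countImgTags are hypothetical, and the regular expression is a crude approximation (it would also count IMG tags inside comments or scripts). The idea is only to get a number to compare against what the parser reports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the HTML parser library.
class RawImgCounter {
    // Matches the start of an IMG tag in any letter case:
    // "<img" followed by whitespace, "/" or ">".
    private static final Pattern IMG_START =
            Pattern.compile("<img(?=[\\s/>])", Pattern.CASE_INSENSITIVE);

    // Counts IMG tag openings in the raw HTML text, without building a node tree.
    static int countImgTags(String html) {
        Matcher matcher = IMG_START.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"a.html\"><IMG SRC=\"a.gif\"></A>"
                + "<img src=\"b.gif\"><p>no image here</p>";
        System.out.println(countImgTags(html));
    }
}
```

If the raw count is higher than the number of image tags the parser loop prints, some IMG tags are not being returned as top-level image nodes.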
From: Somik R. <so...@ya...> - 2002-04-26 03:43:51
Hi Annette,

I just figured out what is happening. Sorry for the previous mail: this is not a bug in the parser. The tags that weren't getting reported as image tags were sandwiched between link tags, <A HREF="..."><IMG ..></A>. Hence, in your application, you will also need to watch out for link tags and pick up the images from within, should there be any.

If this causes you additional headaches, then don't register all the scanners; the link scanner will then not interfere, and you will only get the image tags.

To prove that this analysis is correct, I added one more test case to HTMLImageScannerTest.java: testImageTagsFromYahooWithAllScannersRegistered(). This test case extracts the link and checks that the image is found within. The number of tags found is also verified. You can check out this code from CVS; it might help if you are interested in getting image tags out of link tags. Correspondingly, there is also testImageTagsFromYahoo(), which passes (with only the HTML image scanner registered).

Let me know if you need further help.

Regards,
Somik
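
The behaviour described above can be illustrated with a small self-contained sketch. It does not use the parser's classes: Node, ImageNode, LinkNode and ImageCollector are hypothetical stand-ins, written only to show why a loop that checks top-level nodes misses images wrapped in links, and how looking inside link nodes recovers them. How the real link tag exposes its contents should be taken from the test case in CVS, not from this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for parser nodes; these are NOT the library's classes.
interface Node {}

final class ImageNode implements Node {
    final String location;
    ImageNode(String location) { this.location = location; }
}

final class LinkNode implements Node {
    final String href;
    final List<Node> children; // nodes found between <A ...> and </A>
    LinkNode(String href, List<Node> children) {
        this.href = href;
        this.children = children;
    }
}

class ImageCollector {
    // Mirrors the loop in the original report: only top-level image nodes are seen.
    static List<String> topLevelOnly(List<Node> nodes) {
        List<String> found = new ArrayList<String>();
        for (Node node : nodes) {
            if (node instanceof ImageNode) {
                found.add(((ImageNode) node).location);
            }
        }
        return found;
    }

    // Also descends into link nodes and picks up the images inside them.
    static List<String> includingLinks(List<Node> nodes) {
        List<String> found = new ArrayList<String>();
        for (Node node : nodes) {
            if (node instanceof ImageNode) {
                found.add(((ImageNode) node).location);
            } else if (node instanceof LinkNode) {
                found.addAll(includingLinks(((LinkNode) node).children));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.<Node>asList(
                new ImageNode("top.gif"),
                new LinkNode("a.html", Arrays.<Node>asList(new ImageNode("inner.gif"))));
        System.out.println(topLevelOnly(nodes));
        System.out.println(includingLinks(nodes));
    }
}
```

In this model, registering only the image scanner corresponds to never producing LinkNode at all, so every image shows up at the top level and the first loop is enough.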