htmlparser-user Mailing List for HTML Parser (Page 96)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-06-17 06:32:11
Dear Rob,

From your first mail:

> Am I correct in thinking that in both an HTMLTag and an HTMLImageTag object
> are created for each image tag encountered when using HTMLImageScanner? If
> so, does the HTMLTag object get populated with the usual data?

Functionally, you only get one tag object. If you haven't registered the relevant scanner (HTMLImageScanner in this case), you will get an HTMLTag object. If you have, you will get an HTMLImageTag object. Technically, internally, an HTMLTag object gets created first. Then control passes to the registered scanners to see if this tag can be upgraded. If so, the new subclassed tag object (HTMLImageTag, for example) gets created and returned in place of the original HTMLTag.

> I didn't give it a
> filter because I don't understand what the filter does.

A filter is not required. It is only for use from the command line, where it lets us check parse results easily and dump them to a file. You can ignore it for your app; the following will work:

    parser.addScanner(new HTMLImageScanner(""));

> HTMLLinkProcessor linkProcessor = new HTMLLinkProcessor();

Why are you declaring a linkProcessor?

> HTMLImageTag imgtag = (HTMLImageTag) node;
> String imgsrc = imgtag.getImageLocation();
> if(imgsrc.indexOf("http://") == -1){
>     //relative src
>     imgsrc = base.toString() + imgsrc;
> }

This is not necessary. The base URL that you specify in the parser will automatically be used to resolve relative links. Check out the test cases testRelativeImageScan, testRelativeImageScan2 and testRelativeImageScan3 in com.kizna.htmlTests.scannerTests.HTMLImageScannerTest.

I can also see that you are trying to reconstruct the HTML tag without changing its contents. You can do this with imageTag.toRawString() if you are using HTMLParser v1.2 upwards. However, this will give you the relative link (not the resolved absolute link). Perhaps, if you need it, we can modify the toRawString() method to return absolute links?

> 1) is this the only way to get all the attributes in the img tag?

No. There's a much easier way; just do:

    imageTag.getParameter("alt");

If you want to get the keys, I think this should work:

    imageTag.getParsed().keys()

[Maybe the name of this method should be changed to be easier to figure out.]

> I need to do this or can I just omit the Content-length field and avoid
> using the StringBuffer?

Hmm. It's not mandatory to send the Content-Length, but some servers expect it.

To make life easier, you should use toRawString() to get the HTML tags out uniformly. Since this applies to any node, you don't have to write code for different types of nodes. So sb.append(node.toRawString()) is (perhaps) good enough for all nodes. The only one where there might be an issue is HTMLImageTag, for the reasons I mentioned above. You can probably rewrite the toRawString() method in HTMLImageTag for your purposes, and that should solve your problem neatly.

Feel free to post any further questions that you have.

Regards,
Somik
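Editorial sketch on the relative-link point in this reply: the code below is not HTMLParser code. It only illustrates, using the standard java.net.URI class, how a relative src is resolved against a base URL, and why that differs from appending the src to the base string. The class name ImageSrcResolver and the example.com URLs are assumptions made up for this illustration.

```java
import java.net.URI;

final class ImageSrcResolver {

    /**
     * Resolves a possibly relative src attribute against the document's base URL.
     * Absolute URLs come back unchanged. URI.create throws IllegalArgumentException
     * for strings that are not valid URIs (for example, ones containing spaces).
     */
    static String resolve(String baseUrl, String src) {
        return URI.create(baseUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/news/index.html";

        // Relative to the directory of the base document.
        System.out.println(resolve(base, "images/logo.gif"));

        // Relative to the server root.
        System.out.println(resolve(base, "/images/logo.gif"));

        // Already absolute, so it is left alone.
        System.out.println(resolve(base, "http://other.example.org/a.gif"));
    }
}
```

Plain concatenation of the base and the src would produce "http://www.example.com/news/index.htmlimages/logo.gif" for the first case, which is why resolution is better left to the parser or to a URI library.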
From: Rob S. <bob...@ho...> - 2002-06-16 22:05:30
Hi,

I managed to get something working like this. sb is a StringBuffer and base is a URL (the source of the document). I'm using just a single scanner, HTMLImageScanner. I didn't give it a filter because I don't understand what the filter does.

    HTMLNode node;
    // Run through an enumeration of html elements
    HTMLLinkProcessor linkProcessor = new HTMLLinkProcessor();
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement(); // Cast the element to HTMLNode
        if (node instanceof HTMLStringNode) {
            HTMLStringNode stringNode = (HTMLStringNode) node;
            sb.append(stringNode.getText());
        } else if (node instanceof HTMLTag) {
            HTMLTag tag = (HTMLTag) node;
            if (node instanceof HTMLImageTag) {
                HTMLImageTag imgtag = (HTMLImageTag) node;
                String imgsrc = imgtag.getImageLocation();
                if (imgsrc.indexOf("http://") == -1) {
                    // relative src
                    imgsrc = base.toString() + imgsrc;
                }
                sb.append("<img src=\"" + imgsrc + "\"");
                Hashtable h = imgtag.parseParameters();
                for (Enumeration e2 = h.keys(); e2.hasMoreElements();) {
                    String key = (String) e2.nextElement();
                    sb.append(" " + key + "=\"" + h.get(key) + "\"");
                }
                sb.append(">");
            } else {
                sb.append("<" + tag.getText() + ">");
            }
        } else if (node instanceof HTMLEndTag) {
            HTMLEndTag tag = (HTMLEndTag) node;
            sb.append("</" + tag.getContents() + ">");
        }
    }

Just a couple of questions if you don't mind.

1) Is this the only way to get all the attributes in the img tag?

2) Can you see any problems or suggest improvements?

3) (HTTP question) I'm adding all the output to a StringBuffer so that I can convert it to a byte array using sb.toString().getBytes(). I need to do this so that I can get the length of the byte array for use in the Content-length HTTP header field (the output is sent back to a browser). Do I need to do this or can I just omit the Content-length field and avoid using the StringBuffer?

Another thing: I was testing the app on google.com and I noticed it has a strange image tag:

    < img width=1 height=1 alt="" >

(no SRC attribute). Although the parser recognised it as an image tag, it didn't seem to pick up on the attributes. Is this a bug?

>
>Hi all,
>
>I'm new to the list today after following the thread 'Hints on how to
>change image tag locations and write out document' in the archives. I'm
>trying to make an application that changes all relative img src attributes
>to absolute before writing out the entire document. I'd be very interested
>to see some of the code from the attachments from Somik Raha if somebody
>could post them. The archives don't seem to keep attachments.
>
>I just started using HTMLParser today and I'm currently stuck trying figure
>out how to get the complete IMG tag string when using an HTMLImageScanner.
>Am I correct in thinking that in both an HTMLTag and an HTMLImageTag object
>are created for each image tag encountered when using HTMLImageScanner? If
>so, does the HTMLTag object get populated with the usual data?
>
>
>Thanks and regards,
>Rob Shields
>

_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp.
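Editorial sketch for question 3 in this message: if a Content-Length header is sent, it has to count bytes of the encoded body, not characters in the buffer. The example below uses only the Java standard library. The class and method names are hypothetical, and UTF-8 is an assumed encoding chosen for the illustration (the original code calls getBytes() with the platform default).

```java
import java.nio.charset.StandardCharsets;

final class ContentLengthSketch {

    /** Encodes the page once; these same bytes are what should be written to the response. */
    static byte[] encodeBody(CharSequence html) {
        return html.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Builds the header line from the byte count, not the character count. */
    static String contentLengthHeader(byte[] body) {
        return "Content-Length: " + body.length;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        // 11 characters, but the accented e needs two bytes in UTF-8.
        sb.append("<p>caf\u00e9</p>");

        byte[] body = encodeBody(sb);
        System.out.println("characters: " + sb.length());
        System.out.println(contentLengthHeader(body));
    }
}
```

Buffering the whole page is only needed when the length must be known up front; the byte array produced here can be written to the output stream directly after the headers.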
From: Rob S. <bob...@ho...> - 2002-06-16 20:45:32
Hi all,

I'm new to the list today after following the thread 'Hints on how to change image tag locations and write out document' in the archives. I'm trying to make an application that changes all relative img src attributes to absolute before writing out the entire document. I'd be very interested to see some of the code from the attachments from Somik Raha if somebody could post them. The archives don't seem to keep attachments.

I just started using HTMLParser today and I'm currently stuck trying to figure out how to get the complete IMG tag string when using an HTMLImageScanner. Am I correct in thinking that both an HTMLTag and an HTMLImageTag object are created for each image tag encountered when using HTMLImageScanner? If so, does the HTMLTag object get populated with the usual data?

Thanks and regards,
Rob Shields
From: Somik R. <so...@ya...> - 2002-06-16 09:15:40
Hi Folks,

A new integration build is out.

Major changes:
[1] HTMLStringNode now gives string blocks, all in one string node object, instead of several string node objects for continuous lines. This is based on a bug report by Gordon Deudney. This will improve the scalability of the parser.
[2] HTMLScriptScanner's scan method has been refactored. For folks writing new scanners, take a look at this method to see how simple it is to make your own scanners. There's a substantial reduction in the code size and complexity.

To do:
[1] Integrate Raghavendra Srimantula's scanners (Form and Frame) as soon as the test cases are available.
[2] Write a guide for writing your own scanners.

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-06-12 00:04:16
Hi Claude,

> PS: I've found the design and implementation to be quite nice as I use it,
> very simple to apply in practice. If the download bundle included source I
> would probably have just taken a look. I'm not averse to using CVS but the
> setup time is sometimes prohibitive. Having a source bundle for download
> might be useful in future distributions.

There is a source bundle in the distribution. When you unzip the downloaded file, you should be able to see src.zip in the main htmlparser directory.

> The application extracts text, the title and some metadata (author,
> description, keywords - if present) from HTML documents for indexing
> purposes. I have successfully written code to access the content, title,
> and meta information but now need to put it in context. To do this, I
> would like to recognize the BODY tag's start and end. If I understand the
> architecture correctly, HTMLParser should allow me to register a simple
> HTMLTagScanner, but since this is an abstract class and the existing
> scanners don't suit my purpose, I presume I need to implement a subclass.

Yes, you can write your own scanner as a subclass of HTMLTagScanner. Check the scanners package for all the existing scanner code. They usually follow the same pattern. Also check the docs at http://htmlparser.sourceforge.net/design/scanner.html (it's also in the download bundle, in the docs directory).

You would typically want to create a tag specific to your needs, which is created by the scan() Factory Method / Template Method. Your tag will derive from HTMLTag, or implement HTMLNode if it is not a tag. For your purposes, I'd imagine a body tag class holding a vector of HTMLNode elements.

> Can someone show me how to subclass HTMLTagScanner to watch for a specific
> tag?

It's very easy.

    public class MyScanner extends HTMLTagScanner {
        // This method is called to check if your scanner should be used.
        // Here's where you have to check if the scanner should start.
        public boolean evaluate(String s) {
            // Check if s starts with the word body.
            if (s.toUpperCase().indexOf("BODY") == 0) return true;
            else return false;
        }

        // This method is automatically called to ask your scanner to do the
        // creation. Remember, the onus to do the scanning and take the
        // scanner to the next correct location for scanning is on you.
        public HTMLNode scan(...) {
            // .... your logic to create the return object (perhaps HTMLBodyTag)
            return bodyTag;
        }
    }

To register the scanner, when you create HTMLParser, you will need to do this:

    HTMLParser parser = new HTMLParser("...");
    parser.registerScanners(); // To register the standard scanners
    parser.addScanner(new MyScanner());

That's all; it gets registered and used.

Since you are tapping into low-level parsing, it is imperative that you write test cases. The parserTests.scannersTests package contains sample test code, which you can copy as a template to set up your test cases. It's very easy: you can create dummy HTML code like <BODY><STRONG>HELLO WORLD</STRONG></BODY>, and register your body scanner to see if it is extracting data as you would expect.

Also, it is very important that you run parserTests.AllTests, which will run the 100+ test cases in the existing parser to check if you broke anything. These tests are what ensure this parser is bug free and usable, and make programming it manageable.

One tip: when you are writing the scanner, although you are tapping into low-level parsing, you don't have to write low-level code. You can reuse code that might be in the other scanners. For an example of this, see HTMLTitleScanner. I'd expect all scanners to be written like this, but the other scanners are currently a bit archaic. Maybe I will get around to refactoring all the scanners to be as elegant as the HTMLTitleScanner.

Feel free to post any further questions. Good luck with your coding!

Regards,
Somik

****************************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo,
Shibuya-ku, Tokyo, 150-0012, JAPAN
Tel : +81-3-54752646
Fax : +81-3-5449-4870
Website : www.kizna.com
Mail : so...@ki...
****************************************
C makes it easy to shoot yourself in the foot. C++ makes it harder, but
when you do, it blows away your whole leg.
- Bjarne Stroustrup
****************************************
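Editorial sketch of the scanner mechanism described in this reply (a generic tag is created first, then registered scanners get a chance to upgrade it). This is a self-contained toy model, not the HTMLParser source: every class and method name below (ScannerRegistrySketch, Tag, BodyTag, TagScanner, BodyScanner, createTag) is hypothetical. The evaluate check here is also stricter than a plain prefix test, so that a tag such as "bodyguard" would not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ScannerRegistrySketch {

    /** Generic tag: holds the text found between '<' and '>'. */
    static class Tag {
        final String contents;
        Tag(String contents) { this.contents = contents; }
    }

    /** Specialised tag returned in place of the generic one. */
    static class BodyTag extends Tag {
        BodyTag(String contents) { super(contents); }
    }

    /** Hypothetical scanner contract: decide, then build the upgraded tag. */
    interface TagScanner {
        boolean evaluate(String tagContents);
        Tag scan(Tag generic);
    }

    static class BodyScanner implements TagScanner {
        @Override
        public boolean evaluate(String tagContents) {
            String t = tagContents.trim().toUpperCase(Locale.ROOT);
            // Match the tag name BODY itself, optionally followed by attributes.
            return t.equals("BODY") || t.startsWith("BODY ");
        }

        @Override
        public Tag scan(Tag generic) {
            return new BodyTag(generic.contents);
        }
    }

    private final List<TagScanner> scanners = new ArrayList<>();

    void addScanner(TagScanner scanner) {
        scanners.add(scanner);
    }

    /** Creates the generic tag, then lets the first matching scanner replace it. */
    Tag createTag(String tagContents) {
        Tag generic = new Tag(tagContents);
        for (TagScanner scanner : scanners) {
            if (scanner.evaluate(tagContents)) {
                return scanner.scan(generic);
            }
        }
        return generic;
    }

    public static void main(String[] args) {
        ScannerRegistrySketch registry = new ScannerRegistrySketch();
        registry.addScanner(new BodyScanner());
        System.out.println(registry.createTag("BODY bgcolor=white").getClass().getSimpleName());
        System.out.println(registry.createTag("STRONG").getClass().getSimpleName());
    }
}
```

With no scanner registered, createTag always returns the generic Tag, which mirrors the behaviour described for unregistered scanners elsewhere in this thread.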
From: Claude D. <CD...@ar...> - 2002-06-11 16:05:38
Greetings.

I have started developing a solution with the HTMLParser and wanted to ask about a few specifics. The application extracts text, the title and some metadata (author, description, keywords - if present) from HTML documents for indexing purposes. I have successfully written code to access the content, title, and meta information but now need to put it in context. To do this, I would like to recognize the BODY tag's start and end.

If I understand the architecture correctly, HTMLParser should allow me to register a simple HTMLTagScanner, but since this is an abstract class and the existing scanners don't suit my purpose, I presume I need to implement a subclass. Can someone show me how to subclass HTMLTagScanner to watch for a specific tag?

PS: I've found the design and implementation to be quite nice as I use it, very simple to apply in practice. If the download bundle included source I would probably have just taken a look. I'm not averse to using CVS but the setup time is sometimes prohibitive. Having a source bundle for download might be useful in future distributions.

Thanks.
From: Somik R. <so...@ya...> - 2002-06-07 05:22:25
Hi Folks,

An integration build is out, incorporating 2 bug fixes in the script scanner, contributed by Wolfgang Germund.

Wolfgang --> Thanks a lot for the nice test cases that you submitted, and of course, the fixes. Incorporated and released in the latest package.

Release 1.2 is still some way off... need test cases for the form and frame scanners...

Regards,
Somik
From: Somik R. <so...@ya...> - 2002-05-12 09:07:49
Hi Raghav,

I went through the yahoo.txt, and just like your previous one, this one too had very dirty HTML. The reason you got the OutOfMemoryError was that this kind of HTML sent the parser into an infinite loop (in HTMLLinkScanner). The tag which did this was:

    <a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap> <a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, the first link tag does not have an end tag. I verified with the actual Yahoo page, and there this link occurs quite decently, with the correct end tag. After looking closely at your supplied file, I also noticed the </img> tag, which is highly unusual in normal HTML.

So I am guessing that this file is generated by a program and not by a human. You would definitely want to check the program that's doing it; it's surely buggy.

However, my yardstick for the robustness of this parser is Internet Explorer. If the stuff works in IE, then it's got to work here. And as I tried this particularly bad piece of HTML, I found IE does not crash. Hence, I had to go about empowering the parser to parse these erroneous tags <sigh> Took hours!! </sigh>

The good news is, it's done. We can parse these tags, and the correct end tag is inserted just before td. Of course, I have done a minimal adjustment for your purpose. As time goes on, robustness ought to increase further. All test cases are passing. The framework for handling dirty HTML is also slightly modified.

An integration release has been made (2002-05-12), and is under the integration builds package. You can download it from http://htmlparser.sourceforge.net.

The parser should not crash on your HTML now.
Regards,
Somik
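Editorial sketch of the kind of repair described in this message, where a missing link end tag is inserted just before a td tag. This is a simplified, standalone illustration and not the actual fix made in HTMLLinkScanner; the class DirtyLinkRepair and its methods are hypothetical. It assumes that '>' never appears inside an attribute value, and it does not close a link that is still open at the end of the input.

```java
import java.util.Locale;

final class DirtyLinkRepair {

    /** Inserts "</a>" before any td start or end tag reached while a link is still open. */
    static String closeOpenLinks(String html) {
        StringBuilder out = new StringBuilder();
        boolean linkOpen = false;
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            int gt = (lt < 0) ? -1 : html.indexOf('>', lt);
            if (lt < 0 || gt < 0) {
                // No further complete tag: copy the remainder unchanged.
                out.append(html.substring(pos));
                break;
            }
            out.append(html, pos, lt); // text before the tag
            String tag = html.substring(lt, gt + 1);
            String name = tagName(tag);
            if (name.equals("a")) {
                linkOpen = true;
            } else if (name.equals("/a")) {
                linkOpen = false;
            } else if (linkOpen && (name.equals("td") || name.equals("/td"))) {
                out.append("</a>"); // supply the end tag the page forgot
                linkOpen = false;
            }
            out.append(tag);
            pos = gt + 1;
        }
        return out.toString();
    }

    /** Returns the lower-cased tag name, with a leading '/' for end tags. */
    static String tagName(String tag) {
        String inner = tag.substring(1, tag.length() - 1).trim().toLowerCase(Locale.ROOT);
        int end = 0;
        while (end < inner.length() && !Character.isWhitespace(inner.charAt(end))) {
            end++;
        }
        return inner.substring(0, end);
    }

    public static void main(String[] args) {
        String dirty = "<a href=s/8741><img src=\"x.gif\"></img></td><td nowrap>"
                + "<a href=s/7509><b>Yahoo! Movies</b></a>";
        System.out.println(closeOpenLinks(dirty));
    }
}
```

Well-formed input passes through unchanged, so a repair of this sort can sit in front of stricter parsing without affecting clean pages.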
From: Raghavender S. <kin...@ho...> - 2002-05-10 19:32:47
|
Hi Somik,

I mentioned the out-of-memory error earlier. At that time I was appending the whole page to my string buffer on every iteration of the for loop, and that was causing the error. I have removed that, and it worked fine until yesterday. Now I am seeing the error again. This time it has nothing to do with the string buffer, and it looks like a real problem. I can send you the main class and the yahoo.txt I have, so you can try running it.

Thanks,
Raghav

>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>Date: Fri, 10 May 2002 00:43:19 +0900
>
>Hi Raghav,
>On analyzing yahoo.txt, I found that you have incorrect HTML. There is a script tag that has not been closed, so naturally the script scanner goes bonkers. Rename the extension to .html and open this file in IE, and you will find that IE can't handle it either.
>I verified from www.yahoo.com and found that they do have the correct </script> tag provided. So I guess your yahoo.txt file is faulty.
>
>Regards,
>Somik |
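The cause Raghav describes for the first out-of-memory error (appending the whole page to a buffer on every loop iteration) can be sketched as below. This is an illustration only and not code from the thread: `BufferGrowthDemo`, its method names and the sample tags are hypothetical, and `StringBuilder` stands in for the string buffer he mentions.

```java
import java.util.Arrays;
import java.util.List;

class BufferGrowthDemo {
    // Problem pattern: the whole page is appended once per node, so the
    // buffer grows to (number of nodes) x (page length) characters.
    static int appendWholePagePerNode(String page, List<String> nodes) {
        StringBuilder buffer = new StringBuilder();
        for (String node : nodes) {
            buffer.append(page); // ignores 'node' and re-adds everything
        }
        return buffer.length();
    }

    // Fixed pattern: only the current node's text is appended, so the
    // buffer never grows beyond the page length.
    static int appendNodeOnly(List<String> nodes) {
        StringBuilder buffer = new StringBuilder();
        for (String node : nodes) {
            buffer.append(node);
        }
        return buffer.length();
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("<a href=s/6006>", "<img src=hjys.gif>", "</a>");
        String page = String.join("", nodes);
        System.out.println("whole page per node: " + appendWholePagePerNode(page, nodes));
        System.out.println("node only:           " + appendNodeOnly(nodes));
    }
}
```

With a real page of many thousands of nodes the first pattern grows quadratically, which is consistent with the error going away once that append was removed. It says nothing about the second error reported in this message.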
From: Somik R. <so...@ya...> - 2002-05-09 15:43:29
|
Hi Raghav,

On analyzing yahoo.txt, I found that you have incorrect HTML. There is a script tag that has not been closed, so naturally the script scanner goes bonkers. Rename the extension to .html and open this file in IE, and you will find that IE can't handle it either.

I verified from www.yahoo.com and found that they do have the correct </script> tag provided. So I guess your yahoo.txt file is faulty.

Regards,
Somik

----- Original Message -----
From: Raghavender Srimantula
To: htm...@li...
Sent: Thursday, May 09, 2002 4:53 AM
Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document

Hi Somik,
I was using version 1.1 of htmlparser. I saved the www.yahoo.com content in a flat file, yahoo.txt, and ran the parser against it. It throws a NullPointerException in HTMLScriptScanner, which seems to be a new addition in 1.1. I will send the main program and yahoo.txt. I cannot send the stack trace, because I made some changes and the line numbers don't match, but if you run this program you will see the NullPointerException.
Thanks,
Raghav |
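A saved page can be checked for the problem Somik describes (a script tag that is never closed) before it is handed to a parser. The sketch below is a crude pre-check under stated assumptions: it is not part of htmlparser, the class and method names are hypothetical, and it simply counts opening and closing script tags with regular expressions, so script text that itself contains "<script" would be miscounted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptTagCheck {
    // "<script" followed by a word boundary; does not match "</script".
    private static final Pattern OPEN =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLOSE =
            Pattern.compile("</script\\s*>", Pattern.CASE_INSENSITIVE);

    private static int count(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Number of opening script tags minus closing ones; 0 means balanced.
    static int unclosedScripts(String html) {
        return count(OPEN, html) - count(CLOSE, html);
    }

    public static void main(String[] args) {
        String ok = "<html><script language=javascript>var a = 1;</script></html>";
        String broken = "<html><script language=javascript>var a = 1;</html>";
        System.out.println("ok:     " + unclosedScripts(ok));
        System.out.println("broken: " + unclosedScripts(broken));
    }
}
```

A non-zero result for a file such as yahoo.txt would suggest that the saved copy was truncated or altered, which matches the diagnosis in this message.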
From: Raghavender S. <kin...@ho...> - 2002-05-08 19:54:09
|
Hi Somik,

I was using version 1.1 of htmlparser. I saved the www.yahoo.com content in a flat file, yahoo.txt, and ran the parser against it. It throws a NullPointerException in HTMLScriptScanner, which seems to be a new addition in 1.1. I will send the main program and yahoo.txt. I cannot send the stack trace, because I made some changes and the line numbers don't match, but if you run this program you will see the NullPointerException.

Thanks,
Raghav

>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Hints on how to change image tag locations and write out document
>Date: Mon, 6 May 2002 13:59:11 +0900
>
>Hi Raghav,
>I sent another mail to you some time back:
>
>"HTMLLinkTag.linkData() - this gives you an enumeration - and in the enumeration will be your HTMLImageTag."
>
>HTMLNode node;
>HTMLImageTag imageTag;
>for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
>    node = (HTMLNode)e.nextElement();
>    if (node instanceof HTMLImageTag) {
>        imageTag = (HTMLImageTag)node;
>        // your code here
>    }
>}
>
>Regards,
>Somik
>
>----- Original Message -----
>From: "Raghavender Srimantula" <kin...@ho...>
>To: <htm...@li...>
>Sent: Monday, May 06, 2002 10:43 AM
>
> > Hi Somik,
> > This question is about "not all images are being retrieved", I mean the images inside an <a> tag. I tried to open the attachment you sent me, but I could not find anything in it. From the previous mails I understand that it is not a bug, but if I do want to retrieve all the images, how do I do it?
> > Thanks,
> > Raghav
> >
> > >From: "Somik Raha" <so...@ya...>
> > >Date: Tue, 30 Apr 2002 11:37:26 +0900
> > >
> > >Hi Raghav,
> > >Ah - this was a question by Annette Doyle (titled "Not all image tags are returned"). I am attaching my reply.
> > >
> > >Regards
> > >Somik
> > >
> > ><< [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>
> > >
> > >----- Original Message -----
> > >From: "Raghavender Srimantula" <kin...@ho...>
> > >Sent: Tuesday, April 30, 2002 11:16 AM
> > >
> > > > Hi Somik,
> > > > I found one more interesting thing here. When I try to get all the images, the image scanner gives me images such as
> > > > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif" width=296 height=27 border=0 usemap=#tm>
> > > > so if I call imagetag.getImageLocation(), I get
> > > > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> > > >
> > > > But if the HTML content is like this
> > > > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif border=0 width=70 height=22></a>
> > > > which starts with <a and ends with </a>, then the image scanner does not give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I call imagetag.getImageLocation(). It is not even classified as an ImageTag; it is classified as a LinkTag. How do I get this image?
> > > >
> > > > The above content is from www.yahoo.com. In the Netscape browser, if you go to View --> Page Info, you will see a bunch of images, but when you run htmlparser you get only one image.
> > > >
> > > > Thanks,
> > > > Raghav
> > > >
> > > > >From: "Somik Raha" <so...@ya...>
> > > > >Date: Tue, 30 Apr 2002 09:15:38 +0900
> > > > >
> > > > >Can you describe your application? Was it parsing a single page when the problem occurred?
> > > > >
> > > > >Regards,
> > > > >Somik
> > > > >
> > > > >----- Original Message -----
> > > > >From: "Raghavender Srimantula" <kin...@ho...>
> > > > >Sent: Tuesday, April 30, 2002 8:36 AM
> > > > >
> > > > > > Hi Somik,
> > > > > > I encountered a strange problem today. While I was running htmlparser I got a java.lang.OutOfMemoryError. It seems that a lot of objects are being allocated. Where exactly is this happening? Could you give me an idea of where, or in which file, the potential problem could be?
> > > > > > Raghav
> > > > > >
> > > > > > >From: "Somik Raha" <so...@ya...>
> > > > > > >Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > > > >
> > > > > > >Hi Annette,
> > > > > > >Please find attached a program to get you started. This program will do what you want - you will need to modify the construct that checks for the image tag and replace it with the location of your choice.
> > > > > > >Also, I found one bug thanks to this requirement: image tag params were not being correctly put in. Though it needs a deeper look, I have done a quick fix for now, and all test cases are passing (with one test case in HTMLImageScannerTest trapping this bug).
> > > > > > >Please check out the latest html parser source code from CVS.
> > > > > > >
> > > > > > >Regards,
> > > > > > >Somik
> > > > > > >
> > > > > > ><< ImageTagRetriever.java >>
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: Doyle, Annette
> > > > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > > > Subject: [Htmlparser-user] Hints on how to change image tag locations and write out document
> > > > > > >
> > > > > > > Could you please give me some hints as to how to change only image tag locations and then (or at the same time) write out the html document to file (with new image tag locations)?
> > > > > > >
> > > > > > > Thanks-
> > > > > > > Annette Doyle |
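The point of Somik's snippet in this thread is that an image inside `<a>...</a>` is reported as a child of the link node rather than as a top-level image node, so a collector has to look one level down. The sketch below shows that idea in a self-contained form. `Node`, `ImageNode` and `LinkNode` are hypothetical stand-ins written for this example; they are not the htmlparser classes, and the real API is the `HTMLLinkTag.linkData()` enumeration quoted in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NestedImageDemo {
    // Hypothetical stand-ins for parser node types (not htmlparser classes).
    interface Node {}

    static final class ImageNode implements Node {
        final String location;
        ImageNode(String location) { this.location = location; }
    }

    static final class LinkNode implements Node {
        final List<Node> children;
        LinkNode(Node... children) { this.children = Arrays.asList(children); }
    }

    // Collects image locations from top-level image nodes and from
    // image nodes nested directly inside link nodes.
    static List<String> imageLocations(List<Node> topLevel) {
        List<String> found = new ArrayList<>();
        for (Node node : topLevel) {
            if (node instanceof ImageNode) {
                found.add(((ImageNode) node).location);
            } else if (node instanceof LinkNode) {
                for (Node child : ((LinkNode) node).children) {
                    if (child instanceof ImageNode) {
                        found.add(((ImageNode) child).location);
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Node> page = Arrays.asList(
                new ImageNode("http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif"),
                new LinkNode(new ImageNode("http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif")));
        for (String location : imageLocations(page)) {
            System.out.println(location);
        }
    }
}
```

Checking only the top-level nodes would return the first image alone, which is the behaviour Raghav reports for the Yahoo page.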
From: Somik R. <so...@ya...> - 2002-05-08 10:16:49
|
Hi Craig,

I actually replied to you on htmlparser-developer; your earlier mails went there. Are you on that list? I am attaching the relevant mails to this mail and hope it goes through.

Regards
Somik

----- Original Message -----
From: "Craig Raw" <cr...@qu...>
To: <htm...@li...>
Cc: <so...@ya...>
Sent: Wednesday, May 08, 2002 6:49 PM
Subject: [Htmlparser-user] Swing integration

> Posted this earlier, seems to have got lost....
> ----
>
> Hi Somik,
>
> I'm looking into the HTMLParser-Swing integration again, and I have two questions:
>
> 1. The HTMLEditorKit.ParserCallback takes a position with most of its callback functions. Can this position be extracted from the HTMLTag's elementBegin()?
>
> 2. There is a need to differentiate between a callback to handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when iterating through the HTMLTag elements Enumeration. How?
>
> You mentioned you have started an implementation - if you have a framework going, I'd be happy to continue with the donkey work. I really think this could make Swing's HTML rendering a lot more stable.
>
> Regards,
> Craig
>
> -----Original Message-----
> From: Somik Raha [mailto:so...@ya...]
> Sent: 16 April 2002 04:57 AM
> To: htm...@li...
> Cc: Craig Raw
> Subject: Re: [Htmlparser-user] Swing integration
>
> Hi Craig, Asgher
> I finally had the time to check Swing integration. Boy - the parser design in Swing sucks!! Theoretically it's possible to do it - and I got started, but just realized that in order to be compatible with Swing objects that do compile-time type checking with a particular tag, I have to actually have 73 if statements to give the right tag to the callback.
> I have more important things to do at the moment, but probably will get back to this donkey work. *sigh*
>
> I am thinking we should make release 1.1 and then try this. Any suggestions?
>
> Regards,
> Somik
> ----- Original Message -----
> From: "Somik Raha" <so...@ya...>
> To: <htm...@li...>
> Sent: Thursday, April 04, 2002 11:20 AM
> Subject: Re: [Htmlparser-user] Swing integration
>
> > Hi Craig,
> > Thanks a lot for the post. Please go ahead with your analysis. I will try to catch up this weekend.
> > Regards,
> > Somik
> > ----- Original Message -----
> > From: "Craig Raw" <cr...@qu...>
> > To: "'Somik Raha'" <so...@ya...>
> > Sent: Tuesday, April 02, 2002 3:32 PM
> > Subject: RE: [Htmlparser-user] Swing integration
> >
> > > Hi Somik,
> > >
> > > A quick excerpt from the javax.swing.text.html.HTMLEditorKit javadoc - which is the driver behind JEditorPane's HTML reading and writing capabilities.
> > >
> > > ---
> > > Extendable/Scalable
> > >
> > > To maximize the usefulness of this kit, a great deal of effort has gone into making it extendable. These are some of the features.
> > > The parser is replaceable. The default parser is the Hot Java parser which is DTD based. A different DTD can be used, or an entirely different parser can be used. To change the parser, reimplement the getParser method. The default parser is dynamically loaded when first asked for, so the class files will never be loaded if an alternative parser is used. The default parser is in a separate package called parser below this package.
> > >
> > > The parser drives the ParserCallback, which is provided by HTMLDocument. To change the callback, subclass HTMLDocument and reimplement the createDefaultDocument method to return document that produces a different reader. The reader controls how the document is structured. Although the Document provides HTML support by default, there is nothing preventing support of non-HTML tags that result in alternative element structures.
> > > ---
> > >
> > > I may find some time to look into this as well, although I am not sure how much it would fix JEditorPane's somewhat buggy HTML rendering capabilities....
> > >
> > > -craig
> > >
> > > -----Original Message-----
> > > From: htm...@li... [mailto:htm...@li...] On Behalf Of Somik Raha
> > > Sent: 01 April 2002 05:28 PM
> > > To: HTMLParser User List
> > > Cc: HTMLParser Developer List
> > > Subject: Re: [Htmlparser-user] Swing integration
> > >
> > > Hi Craig
> > > Wow! That's a great question.
> > > Actually, I doubt if I could replace Sun Microsystems' code with mine. I don't think Java is that open (or is it?)
> > > However, we could think of writing our own adapter for the html parser that might plug in in some way...
> > > I have never used Sun's html parser (if I had, I might not have started this project).
> > > I will need to study Sun's parser before I can answer your question. But there do seem to be some interesting possibilities.
> > >
> > > Regards
> > > Somik
> > > ----- Original Message -----
> > > From: "Craig Raw" <cr...@qu...>
> > > To: <htm...@li...>
> > > Sent: Monday, April 01, 2002 10:20 PM
> > > Subject: [Htmlparser-user] Swing integration
> > >
> > > > Has the HTML Parser been integrated into Swing's HTMLEditorKit to provide a better implementation of JEditorPane's HTML viewing capabilities? HTML Parser would need to replace javax.swing.text.html.parser.Parser, which is currently somewhat buggy.
> > > > Anyone tried this?
> > > >
> > > > -craig |
From: Craig R. <cr...@qu...> - 2002-05-08 09:50:22
|
Posted this earlier; it seems to have got lost....

----

Hi Somik,

I'm looking into the HTMLParser-Swing integration again, and I have two
questions:

1. The HTMLEditorKit.ParserCallback takes a position with most of its
   callback functions. Can this position be extracted from the HTMLTag's
   elementBegin()?

2. There is a need to differentiate between a callback to
   handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) and
   handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) when
   iterating through the HTMLTag elements Enumeration. How?

You mentioned you have started an implementation - if you have a framework
going, I'd be happy to continue with the donkey work. I really think this
could make Swing's HTML rendering a lot more stable.

Regards,
Craig

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: 16 April 2002 04:57 AM
To: htm...@li...
Cc: Craig Raw
Subject: Re: [Htmlparser-user] Swing integration

Hi Craig, Asgher

I finally had the time to check Swing integration. Boy - the parser design
in Swing sucks!! Theoretically it's possible to do it - and I got started,
but just realized that in order to be compatible with Swing objects that do
compile time type checking with a particular tag, I have to actually have 73
if statements to give the right tag to the callback. I have more important
things to do at the moment, but probably will get back to this donkey work.
*sigh*

I am thinking we should make release 1.1 and then try this. Any suggestions ?

Regards,
Somik
|
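For readers following the second question above, here is a minimal sketch of how Swing's own parser tells the two callbacks apart. It uses only the JDK classes named in the thread (HTMLEditorKit.ParserCallback, driven here by javax.swing.text.html.parser.ParserDelegator); the class name CallbackSketch and the sample markup are made up for illustration. It does not touch HTML Parser itself, so it only shows what an adapter would presumably have to reproduce, not how such an adapter was or should be written.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: records which callback Swing's parser uses per tag.
public class CallbackSketch {
    /** Tags reported through handleStartTag (elements that can hold content). */
    public static final List<String> startTags = new ArrayList<String>();
    /** Tags reported through handleSimpleTag (empty elements such as img, br). */
    public static final List<String> simpleTags = new ArrayList<String>();

    public static void parse(String html) {
        startTags.clear();
        simpleTags.clear();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // pos is the offset the parser reports for this tag
                startTags.add(t.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                simpleTags.add(t.toString());
            }
        };
        try {
            // ignoreCharSet = true: do not abort on a charset declaration
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        parse("<p>Hi<br><img src=\"a.gif\"></p>");
        System.out.println("start tags:  " + startTags);
        System.out.println("simple tags: " + simpleTags);
    }
}
```

In this sketch the empty elements (br, img) arrive through handleSimpleTag while p arrives through handleStartTag, which suggests an adapter needs to know, per tag, whether the element can have content.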
From: Somik R. <so...@ya...> - 2002-05-07 06:29:11
|
Hi Folks,

Following some nice suggestions from Sam Joseph, I have just completed some
design modifications to the basic HTMLNode API. The modifications are:

[1] HTMLNode is no longer an interface, but an abstract class. There were two
reasons for this change. Firstly, I couldn't think of a scenario where an
object would be an html tag AND something else. Secondly, I wanted to enforce
the implementation of toString(), which is usually not done if you implement
from the interface (as Object has a default toString()).

[2] abstract toString() method - children have to implement this.

[3] print() and print(PrintWriter) - final methods. They make a call to
toString(), and print to standard output and the print writer respectively.

[4] toPlainTextString() - this method provides a string representation of a
tag, if there is such a representation. If not, a blank string is returned.
This has implications - our program to extract all strings from an html page
is simplified to:

    HTMLNode node;
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode)e.nextElement();
        // or whatever processing you want to do with the string
        System.out.println(node.toPlainTextString());
    }

[5] toRawString() - this method provides the complete html element (a
reconstruction), thus allowing ripping programs to be really simple. So if
you want to rip the html page to your local hard disk, your program would
look like:

    HTMLNode node;
    PrintWriter pw = new PrintWriter(new FileWriter("..."));
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode)e.nextElement();
        pw.println(node.toRawString());
    }
    pw.close();

[6] Lots of bug fixes done - HTMLImageScanner had a bug, HTMLStyleScanner
also had one - all caught with more testcases.

We have 100 testcases as of now, all of them passing.

To-do list for Release 1.2
------------------------------------
[1] Integration of Raghavender Srimantula's contribution - HTMLFrameScanner
and HTMLFormScanner - into the parser. This will be integrated as soon as I
get the testcases from Raghav.
[2] Adding an HTML Ripping program in the parserApplications package.
[3] Improving the Robot Crawler (??)
[4] Fixes for any bugs that get reported in this period.

You can check out the latest code from CVS. Or you can go to
http://htmlparser.sourceforge.net, click on the download link, and choose
htmlparser1_2_20020507.zip

Feedback is welcome.

Regards,
Somik
|
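The design in points [1] to [5] can be illustrated with a small stand-alone sketch. Everything below is hypothetical: Node, TextNode and ImageNode are stand-ins invented for this example and are not the library's HTMLNode classes; only the method names (toString, print, toPlainTextString, toRawString) are taken from the announcement.

```java
import java.io.PrintWriter;

// Hypothetical stand-ins; not the actual HTMLNode classes from the library.
public class NodeSketch {
    public abstract static class Node {
        // Redeclaring toString() as abstract forces every subclass to write one,
        // which an interface could not enforce because Object already has it.
        @Override
        public abstract String toString();

        /** Text content of the node, or an empty string if it has none. */
        public abstract String toPlainTextString();

        /** Reconstructed html for the node. */
        public abstract String toRawString();

        // Final: subclasses customise toString(), not the printing.
        public final void print() {
            print(new PrintWriter(System.out, true));
        }

        public final void print(PrintWriter pw) {
            pw.println(toString());
            pw.flush();
        }
    }

    public static final class TextNode extends Node {
        private final String text;
        public TextNode(String text) { this.text = text; }
        @Override public String toString() { return "Text: " + text; }
        @Override public String toPlainTextString() { return text; }
        @Override public String toRawString() { return text; }
    }

    public static final class ImageNode extends Node {
        private final String src;
        public ImageNode(String src) { this.src = src; }
        @Override public String toString() { return "Image: " + src; }
        @Override public String toPlainTextString() { return ""; }
        @Override public String toRawString() { return "<img src=\"" + src + "\">"; }
    }

    /** Concatenates the plain text of all nodes (the "extract strings" use case). */
    public static String plainText(Node[] nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) sb.append(n.toPlainTextString());
        return sb.toString();
    }

    /** Concatenates the raw html of all nodes (the "ripping" use case). */
    public static String raw(Node[] nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) sb.append(n.toRawString());
        return sb.toString();
    }

    public static void main(String[] args) {
        Node[] nodes = { new TextNode("Hello "), new ImageNode("a.gif"), new TextNode("world") };
        for (Node n : nodes) n.print();
        System.out.println(plainText(nodes));
        System.out.println(raw(nodes));
    }
}
```

The point of the sketch is that callers loop over nodes without any instanceof checks: a node with no text simply contributes an empty string.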
From: Somik R. <so...@ya...> - 2002-05-06 04:59:10
|
Hi Raghav,

I sent another mail to you some time back - "HTMLLinkTag.linkData() - this
gives you an enumeration - and in the enumeration will be your HTMLImageTag."

    HTMLNode node;
    HTMLImageTag imageTag;
    for (Enumeration e = linkTag.linkData(); e.hasMoreElements();) {
        node = (HTMLNode)e.nextElement();
        if (node instanceof HTMLImageTag) {
            imageTag = (HTMLImageTag)node;
            // your code here
        }
    }

Regards,
Somik
|
From: Raghavender S. <kin...@ho...> - 2002-05-06 01:44:05
|
Hi Somik,

This question is regarding "not all images are being retrieved" - I mean the
images under an <a> tag. I did try to open the attachment you sent me, but I
could not find anything. From the previous mails I gather that it is not a
bug, but if I do want to retrieve all the images, how do I do it?

Thanks,
Raghav

> From: "Somik Raha" <so...@ya...>
> Reply-To: htm...@li...
> To: <htm...@li...>
> Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
> and write out document
> Date: Tue, 30 Apr 2002 11:37:26 +0900
>
> Hi Raghav,
> Ah - this was a question by Annette Doyle (titled "Not all image tags
> are returned"). I am attaching my reply.
>
> Regards
> Somik
>
> ----- Original Message -----
> From: "Raghavender Srimantula" <kin...@ho...>
> To: <htm...@li...>
> Sent: Tuesday, April 30, 2002 11:16 AM
> Subject: Re: [Htmlparser-user] Hints on how to change image tag locations
> and write out document
>
> > hi Somik,
> > I found one more interesting thing here. When I am trying to get all the
> > images, the image scanner would give me images such as
> > <img src="http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif"
> > width=296 height=27 border=0 usemap=#tm>
> > so if I do an imagetag.getImageLocation(), I would get
> > http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/mom02/title4.gif
> >
> > but if the html content is like this
> > <a href=s/6006><img src=http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif
> > border=0 width=70 height=22></a>
> > which starts with <a and ends with </a>, then the image scanner will not
> > give me http://us.i1.yimg.com/us.yimg.com/i/us/hj/hjys.gif when I do an
> > imagetag.getImageLocation(). This is not even classified as an ImageTag;
> > it is classified as a LinkTag. How do I get this image?
> >
> > The above content is from www.yahoo.com. In the Netscape browser, if you
> > go to view-->pageinfo, you will see a bunch of images,
> > but when you run the htmlparser you can get only one image.
> >
> > Thanks,
> > Raghav
> >
> > > From: "Somik Raha" <so...@ya...>
> > > Reply-To: htm...@li...
> > > To: <htm...@li...>
> > > Subject: Re: [Htmlparser-user] Hints on how to change image tag
> > > locations and write out document
> > > Date: Tue, 30 Apr 2002 09:15:38 +0900
> > >
> > > Can you describe your application ? Was it parsing a single page when
> > > the problem occurred ?
> > >
> > > Regards,
> > > Somik
> > > ----- Original Message -----
> > > From: "Raghavender Srimantula" <kin...@ho...>
> > > To: <htm...@li...>
> > > Cc: <htm...@li...>
> > > Sent: Tuesday, April 30, 2002 8:36 AM
> > > Subject: Re: [Htmlparser-user] Hints on how to change image tag
> > > locations and write out document
> > >
> > > > Hi Somik,
> > > > I encountered a strange problem today. While I was running
> > > > htmlparser... I got a java.lang.OutOfMemoryError. It seems that a lot
> > > > of objects are being allocated. Where exactly is this happening? I
> > > > mean, could you give me an idea where or in which file the potential
> > > > problem could be?
> > > > Raghav
> > > >
> > > > > From: "Somik Raha" <so...@ya...>
> > > > > Reply-To: htm...@li...
> > > > > To: <htm...@li...>
> > > > > CC: <htm...@li...>
> > > > > Subject: Re: [Htmlparser-user] Hints on how to change image tag
> > > > > locations and write out document
> > > > > Date: Sat, 27 Apr 2002 18:22:34 +0900
> > > > >
> > > > > Hi Annette,
> > > > > Pls find attached a program to get you started. This program will
> > > > > do what you want - you will need to modify the construct that checks
> > > > > for the image tag - and replace it with the location of your choice.
> > > > > Also - I found one bug thanks to this requirement - image tag
> > > > > params were not being correctly put in. Though it needs a deeper
> > > > > look, I have done a quick fix for now, and all test cases are passing
> > > > > (with one test case in HTMLImageScannerTest trapping this bug).
> > > > > Please check out the latest html parser source code from CVS.
> > > > >
> > > > > Regards,
> > > > > Somik
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Doyle, Annette
> > > > > To: htm...@li...
> > > > > Sent: Friday, April 26, 2002 10:08 PM
> > > > > Subject: [Htmlparser-user] Hints on how to change image tag
> > > > > locations and write out document
> > > > >
> > > > > Could you please give me some hints as to how to change only image
> > > > > tag locations and then (or at the same time) write out the html
> > > > > document to file (with new image tag locations)?
> > > > >
> > > > > Thanks-
> > > > >
> > > > > Annette Doyle
> > > > >
> > > > > << ImageTagRetriever.java >>
>
> << [Htmlparser-developer]Re_[Htmlparser-user]Notallimagetagsarereturned[NotaBug].eml >>
|
From: Somik R. <so...@ya...> - 2002-05-03 09:25:52
|
Hi Folks,

A testing build is out - you can download it from
http://htmlparser.sourceforge.net (choose the download link). This is a
testing build with important bug fixes.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2002-05-03 08:35:27
|
Hi Annette,

I went through the first problem you reported again, and I realized the
mistake in my testcase - this tag has two newlines instead of one for each
line. I could reproduce the bug after that. I have applied your fix and
updated CVS. Thanks a lot.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2002-05-03 08:15:23
|
Hi Folks,

We seem to have a heroic parser now... You can check out the latest code
from CVS.

Here's the fix. As you know - if we have an additional erroneous inverted
comma in a tag, the parser cannot judge whether to treat this as erroneous or
valid. Now the parser has some amount of intelligence - if it encounters an
inverted comma and a close tag character, then it does a check to see whether
it should treat this as an error or a valid character.

This decision making process is facilitated with a strictVector - which
holds the tags for which it should not make allowances. Currently, there is
only one - "INPUT" (Should we have any more? ). If the tag being parsed is not
a strict tag like INPUT, then it is assumed that this is an erroneous tag and
needs to be corrected.

The correction process occurs (and is validated with some testcases in
HTMLTag - particularly testStrictParsing). If you go through that testcase -
you will see that the attributes are also correctly retrieved.

This solution doesn't break anything else - we have 82 testcases, all
passing.

I'd be grateful if folks can test this version and let me know if this
solution is acceptable.

Also - a general question - would you prefer something like nightly drop
packages for downloading, or is a request to check out from CVS fine ?

Thanks and Regards,
Somik
|
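As a rough illustration of the strict-tag idea described above, the sketch below applies a lenient fallback only to tags that are not in a strict set. This is an assumption-laden reading, not the code in HTMLTag: the class name, the method names and the particular recovery rule (fall back to the first '>' when the quotes do not balance) are all invented for this example; only the notion of a strict list containing "INPUT" comes from the message.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the real HTMLTag correction logic may differ.
public class StrictTagSketch {
    // Tags for which no allowance is made (the message names only INPUT).
    private static final Set<String> STRICT_TAGS =
            new HashSet<String>(Arrays.asList("INPUT"));

    /** Name of a tag string that starts with '<', upper-cased. */
    static String tagName(String tag) {
        int i = 1; // skip '<'
        while (i < tag.length()
                && !Character.isWhitespace(tag.charAt(i))
                && tag.charAt(i) != '>') {
            i++;
        }
        return tag.substring(1, i).toUpperCase(Locale.ROOT);
    }

    /** Index of the first '>' outside double quotes, or -1 if there is none. */
    static int quoteAwareEnd(String tag) {
        boolean inQuotes = false;
        for (int i = 0; i < tag.length(); i++) {
            char c = tag.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    /**
     * End of the tag. If the quote-aware scan fails (unbalanced quotes) and the
     * tag is not strict, assume a stray quote and fall back to the first '>'.
     */
    public static int findTagEnd(String tag) {
        if (!tag.startsWith("<")) {
            return -1;
        }
        int end = quoteAwareEnd(tag);
        if (end >= 0) {
            return end;
        }
        if (STRICT_TAGS.contains(tagName(tag))) {
            return -1; // strict tag: report the error instead of guessing
        }
        return tag.indexOf('>');
    }

    public static void main(String[] args) {
        System.out.println(findTagEnd("<font face=\"Arial,\"helvetica,\" size=\"2\">"));
        System.out.println(findTagEnd("<input value=\"a\"b\">"));
        System.out.println(findTagEnd("<a href=\"x>y\">"));
    }
}
```

Well-formed tags, including ones with '>' inside a quoted value, are unaffected by the fallback; only tags whose quotes never balance take the lenient path.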
From: Somik R. <so...@ya...> - 2002-05-02 03:30:50
|
Hi Folks,

Thanks to an interesting bug report by Roger Sollberger, a bug in
HTMLStringNode has been fixed. Links of the type:

    <a href="http://asgard.ch">[> ASGARD <]</a>

would get messed up because of the tag symbols, when they should really be a
part of HTMLStringNode.

This has been fixed (after the bug was reproduced in a testcase in
HTMLStringNodeTest). CVS code base updated.

Roger --> Thanks a lot for the report.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2002-05-02 03:11:27
|
Hi Folks,

If you've been following the latest exchange on htmlparser-user, Annette
has shown us a crazy example of dirty html, which works in the browser, but
crashes the parser. The site is http://www.cia.gov

Search for this string -

    <font face="Arial,"helvetica,"

and you will find it in the html. Now this erroneous inverted comma in front
of helvetica should not be there.

This has been captured in a test case in HTMLTagTest.java (you can get it
from CVS), and this test fails (testParsing()).

The problem is - the core parsing mechanism ignores anything within inverted
commas. This is critical so as to be able to accept angular brackets in
inverted commas. If we remove this feature from the parser, other tests will
break.

So I need some suggestions on how we might modify our parsing - how do we
intelligently understand that this is an error (how easy it is for us humans
to figure this out) ? Looks like linear approaches wouldn't work anymore...
Maybe we need to associate some intelligence - that if it's a font tag, then
this kind of stuff is most definitely an error. Whereas if it's a jsp tag, we
can get more strict with our parsing. This will probably cause a fundamental
shift in our core parsing technique.

Regards,
Somik
|
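The failure mode described above can be reproduced with a few lines. This is a simplified sketch of a quote-aware scan, assuming only what the message says (characters inside inverted commas are ignored when looking for the end of a tag); the class and method names are hypothetical and the real parser's state machine is more involved.

```java
// Hypothetical sketch of a quote-aware scan for the closing '>' of a tag.
public class QuoteScanSketch {
    /**
     * Returns the index of the first '>' that is not inside double quotes,
     * or -1 if the line ends while still inside a quoted section.
     */
    public static int findTagEnd(String line) {
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;   // every quote toggles the state
            } else if (c == '>' && !inQuotes) {
                return i;               // a real tag end
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Angular bracket inside a quoted attribute value: handled correctly.
        System.out.println(findTagEnd("<a href=\"x>y\">"));
        // Stray quote before helvetica: quoting is inverted from there on,
        // so the closing '>' looks like it sits inside a quoted value.
        System.out.println(findTagEnd("<font face=\"Arial,\"helvetica,\" size=\"2\">"));
    }
}
```

With the stray quote the scan never sees an unquoted '>' and returns -1, which mirrors the situation where the parser cannot find the end of the font tag even though a human reader spots the error at once.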
From: Somik R. <so...@ya...> - 2002-05-02 02:59:22
|
Hi Annette,

Regarding your second problem, the parsing error occurs because of this:

    <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font

In the above - font face="Arial,"helvetica," - note the erroneous extra " in
front of helvetica. Remove it and the parsing is fine. Now of course you
can't remove it, because this site is not yours :). So, we do have to support
this kind of dirty html. Thank you so much for bringing it to our notice. I
have written a test case to reproduce this bug, and am working to resolve
this.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2002-05-02 02:42:14
|
Hi Annette,

Regarding the first problem, I wrote a testcase, but was unable to reproduce
the error. Can you check out the latest code from CVS (HTMLImageScanner), and
take a look at the testcase testImageTagOnThreeLines()? This test case
passes. It ought to fail if there is a problem in the parsing.

Meanwhile I am taking a look at the second issue.

Regards,
Somik

----- Original Message -----
From: Doyle, Annette
To: htm...@li...
Sent: Thursday, May 02, 2002 5:06 AM
Subject: [Htmlparser-user] fixed previous problem - (however, new problem)

Fixed:

    <td rowspan=3><img height=49

    alt="Central Intelligence Agency, Director of Central Intelligence"

    src="graphics/images_home2/cia_banners_template3_01.gif"

    width=241></td>

by changing HTMLTag as follows:

    public static int incrementCounter(HTMLReader reader, int state, int i, HTMLTag tag)
    {
        String strLine = null;
        if ((state == TAG_BEGIN_PARSING_STATE || state == TAG_IGNORE_DATA_STATE)
                && i == tag.getTagLine().length() - 1)
        {
            // We need to continue parsing to the next line
            while ((strLine = reader.getNextLine()).length() == 0);
            //tag.setTagLine(reader.getNextLine());
            tag.setTagLine(strLine);
            // convert the end of line to a space
            // The following line masked by Somik Raha, 15 Apr 2002, to fix space bug in links
            tag.append('\n');
            i = -1;
        }
        return ++i;
    }

NEW PROBLEM in following:

    <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
    | <a href="/cia/notices.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Notices</font></a>
    | <a href="/cia/notices.html#priv" link="#000000" vlink="#000000"><font color="#FFFFFF">Privacy</font></a>
    | <a href="/cia/notices.html#sec" link="#000000" vlink="#000000"><font color="#FFFFFF">Security</font></a>
    | <a href="/cia/contact.htm" link="#000000" vlink="#000000"><font color="#FFFFFF">Contact Us</font></a>
    | <a href="/cia/sitemap.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Site Map</font></a>
    | <a href="/cia/siteindex.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Index</font></a>
    | <a href="/search" link="#000000" vlink="#000000"><font color="#FFFFFF">Search</font></a>
    </font></div>

Stops at

    TAG LINE FOUND <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
    LINE is <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
    POSITION IS 26 TAGLINE 197
    Process completed.

Annette Doyle
|
From: Doyle, A. <Ann...@au...> - 2002-05-01 20:07:05
|
Fixed:

<td rowspan=3><img height=49

alt="Central Intelligence Agency, Director of Central Intelligence"

src="graphics/images_home2/cia_banners_template3_01.gif"

width=241></td>

by changing HTMLTag as follows:

    public static int incrementCounter(HTMLReader reader, int state, int i, HTMLTag tag)
    {
        String strLine = null;
        if ((state == TAG_BEGIN_PARSING_STATE || state == TAG_IGNORE_DATA_STATE) && i == tag.getTagLine().length() - 1)
        {
            // We need to continue parsing to the next line
            while ((strLine = reader.getNextLine()).length() == 0);
            //tag.setTagLine(reader.getNextLine());
            tag.setTagLine(strLine);
            // convert the end of line to a space
            // The following line masked by Somik Raha, 15 Apr 2002, to fix space bug in links
            tag.append('\n');
            i = -1;
        }
        return ++i;
    }

NEW PROBLEM in following:

<div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
| <a href="/cia/notices.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Notices</font></a>
| <a href="/cia/notices.html#priv" link="#000000" vlink="#000000"><font color="#FFFFFF">Privacy</font></a>
| <a href="/cia/notices.html#sec" link="#000000" vlink="#000000"><font color="#FFFFFF">Security</font></a>
| <a href="/cia/contact.htm" link="#000000" vlink="#000000"><font color="#FFFFFF">Contact Us</font></a>
| <a href="/cia/sitemap.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Site Map</font></a>
| <a href="/cia/siteindex.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Index</font></a>
| <a href="/search" link="#000000" vlink="#000000"><font color="#FFFFFF">Search</font></a>
</font></div>

Stops at

TAG LINE FOUND <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
LINE is <div align="center"><font face="Arial,"helvetica," sans-serif="sans-serif" size="2" color="#FFFFFF"><a href="/index.html" link="#000000" vlink="#000000"><font color="#FFFFFF">Home</font></a>
POSITION IS 26
TAGLINE 197
Process completed.

Annette Doyle |
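The change to incrementCounter in this message skips empty lines while a tag is still open, so that an <img> tag whose attributes are separated by blank lines is read as one tag. A minimal standalone sketch of that idea follows. It is not the HTMLParser code: the class MultiLineTagSketch and the methods nextNonBlankLine and collectTag are hypothetical names, an Iterator<String> stands in for HTMLReader, and whitespace-only lines are assumed to count as blank. Unlike the posted loop, it also stops cleanly when the input ends before the tag is closed.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class MultiLineTagSketch {

    /**
     * Hypothetical stand-in for a reader's "get next line" call.
     * Returns the next line that is not empty or whitespace-only,
     * or null when the input is exhausted.
     */
    static String nextNonBlankLine(Iterator<String> lines) {
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.trim().length() > 0) {
                return line;
            }
        }
        return null;
    }

    /**
     * Joins the lines of one tag into a single string, skipping blank
     * lines and separating the pieces with a single space.
     * Simplification: the first '>' seen is treated as the end of the
     * tag, even if it sits inside a quoted attribute value.
     */
    static String collectTag(Iterator<String> lines) {
        String line = nextNonBlankLine(lines);
        if (line == null) {
            return "";
        }
        StringBuilder tag = new StringBuilder(line.trim());
        while (tag.indexOf(">") < 0) {
            line = nextNonBlankLine(lines);
            if (line == null) {
                break; // input ended before the tag was closed
            }
            tag.append(' ').append(line.trim());
        }
        return tag.toString();
    }

    public static void main(String[] args) {
        // The <img> tag from this thread, with blank lines between attributes.
        List<String> input = Arrays.asList(
            "<img height=49",
            "",
            "alt=\"Central Intelligence Agency, Director of Central Intelligence\"",
            " ",
            "src=\"graphics/images_home2/cia_banners_template3_01.gif\"",
            "",
            "width=241>");
        System.out.println(collectTag(input.iterator()));
    }
}
```

Returning null at end of input and checking for it is the main difference from the loop in the message, which calls length() on whatever getNextLine() returns and would presumably fail if that call can return null.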
From: Doyle, A. <Ann...@au...> - 2002-05-01 18:39:28
|
The following HTML is not parsed correctly. Try http://www.cia.gov.

<td rowspan=3><img height=49

alt="Central Intelligence Agency, Director of Central Intelligence"

src="graphics/images_home2/cia_banners_template3_01.gif"

width=241></td>

Annette Doyle |