Thread: [Htmlparser-user] RE: Hints on how to change image tag locations and write o
From: Rob S. <bob...@ho...> - 2002-06-16 22:05:30
Hi,

I managed to get something working like this. sb is a StringBuffer and base is a URL (the source of the document). I'm using just a single scanner, HTMLImageScanner. I didn't give it a filter because I don't understand what the filter does.

    HTMLNode node;
    // Run through an enumeration of html elements
    HTMLLinkProcessor linkProcessor = new HTMLLinkProcessor();
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement(); // Cast the element to HTMLNode
        if (node instanceof HTMLStringNode) {
            HTMLStringNode stringNode = (HTMLStringNode) node;
            sb.append(stringNode.getText());
        } else if (node instanceof HTMLTag) {
            HTMLTag tag = (HTMLTag) node;
            if (node instanceof HTMLImageTag) {
                HTMLImageTag imgtag = (HTMLImageTag) node;
                String imgsrc = imgtag.getImageLocation();
                if (imgsrc.indexOf("http://") == -1) {
                    // relative src
                    imgsrc = base.toString() + imgsrc;
                }
                sb.append("<img src=\"" + imgsrc + "\"");
                Hashtable h = imgtag.parseParameters();
                for (Enumeration e2 = h.keys(); e2.hasMoreElements();) {
                    String key = (String) e2.nextElement();
                    sb.append(" " + key + "=\"" + h.get(key) + "\"");
                }
                sb.append(">");
            } else {
                sb.append("<" + tag.getText() + ">");
            }
        } else if (node instanceof HTMLEndTag) {
            HTMLEndTag tag = (HTMLEndTag) node;
            sb.append("</" + tag.getContents() + ">");
        }
    }

Just a couple of questions, if you don't mind.

1) Is this the only way to get all the attributes in the img tag?

2) Can you see any problems or suggest improvements?

3) (HTTP question) I'm adding all the output to a StringBuffer so that I can convert it to a byte array using sb.toString().getBytes(). I need the length of the byte array for the Content-length HTTP header field (the output is sent back to a browser). Do I need to do this, or can I just omit the Content-length field and avoid using the StringBuffer?

Another thing: I was testing the app on google.com and noticed it has a strange image tag:

    < img width=1 height=1 alt="" >

(no SRC attribute). Although the parser recognised it as an image tag, it didn't seem to pick up the attributes. Is this a bug?

>
>Hi all,
>
>I'm new to the list today after following the thread 'Hints on how to
>change image tag locations and write out document' in the archives. I'm
>trying to make an application that changes all relative img src attributes
>to absolute before writing out the entire document. I'd be very interested
>to see some of the code from the attachments from Somik Raha if somebody
>could post them. The archives don't seem to keep attachments.
>
>I just started using HTMLParser today and I'm currently stuck trying to figure
>out how to get the complete IMG tag string when using an HTMLImageScanner.
>Am I correct in thinking that both an HTMLTag and an HTMLImageTag object
>are created for each image tag encountered when using HTMLImageScanner? If
>so, does the HTMLTag object get populated with the usual data?
>
>
>Thanks and regards,
>Rob Shields
>
From: Somik R. <so...@ya...> - 2002-06-17 06:32:11
Dear Rob,

From your first mail:

> Am I correct in thinking that both an HTMLTag and an HTMLImageTag object
> are created for each image tag encountered when using HTMLImageScanner? If
> so, does the HTMLTag object get populated with the usual data?

Functionally, you only get one tag object. If you haven't registered the relevant scanner (HTMLImageScanner in this case), you will get an HTMLTag object. If you have, you will get an HTMLImageTag object.

Technically, an HTMLTag object is created first. Control then passes to the registered scanners to see whether this tag can be upgraded. If so, the new subclassed tag object (HTMLImageTag, for example) is created and returned in place of the original HTMLTag.

> I didn't give it a
> filter because I don't understand what the filter does.

A filter is not required. It is only for use from the command line, where it lets us check parse results easily and dump them to a file. You can ignore it for your app; the following will work:

    parser.addScanner(new HTMLImageScanner(""));

> HTMLLinkProcessor linkProcessor = new HTMLLinkProcessor();

Why are you declaring a linkProcessor?

> HTMLImageTag imgtag = (HTMLImageTag) node;
> String imgsrc = imgtag.getImageLocation();
> if(imgsrc.indexOf("http://") == -1){
> //relative src
> imgsrc = base.toString() + imgsrc;
> }

This is not necessary. The base URL that you specify in the parser will automatically be used to resolve relative links. Check out the test cases testRelativeImageScan, testRelativeImageScan2 and testRelativeImageScan3 in com.kizna.htmlTests.scannerTests.HTMLImageScannerTest.

I can also see that you are trying to reconstruct the HTML tag without changing its contents. You can do this with imageTag.toRawString() if you are using HTMLParser v1.2 or later. However, this will give you the relative link (not the resolved absolute link). Perhaps, if you need it, we can modify the toRawString() method to return absolute links?
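
For comparison with the string concatenation in the quoted snippet (base.toString() + imgsrc), here is a minimal sketch of resolving a relative src against a base address using only the standard java.net.URI class. It does not use HTMLParser and does not show how the parser resolves links internally; the class name ImgSrcResolver and the example.com addresses are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of HTMLParser.
final class ImgSrcResolver {

    // Resolves src against base following the standard URI resolution rules.
    // An already-absolute src is returned unchanged.
    // Note: URI.create throws IllegalArgumentException for malformed input
    // (for example, a src containing unescaped spaces).
    static String resolve(String base, String src) {
        return URI.create(base).resolve(src).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/docs/index.html";
        // Relative to the document's directory
        System.out.println(resolve(base, "images/logo.gif"));
        // Relative to the server root
        System.out.println(resolve(base, "/images/logo.gif"));
        // Already absolute
        System.out.println(resolve(base, "http://other.example.org/a.gif"));
    }
}
```

Unlike plain concatenation, this drops the document name (index.html) from the base before appending a relative path, and it handles root-relative paths that start with a slash.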

> 1) is this the only way to get all the attributes in the img tag?

No. There's a much easier way; just do:

    imageTag.getParameter("alt");

If you want to get the keys, I think this should work:

    imageTag.getParsed().keys()

[Maybe the name of this method should be changed to be easier to figure out.]

> I need to do this or can I just omit the Content-length field and avoid
> using the StringBuffer?

Hmm. It's not mandatory to send the Content-length, but some servers expect it.

To make life easier, you should use toRawString() to get the HTML tags out uniformly. Since this applies to any node, you don't have to write code for different types of nodes. So

    sb.append(node.toRawString())

is good enough (perhaps) for all nodes. The only one where there might be an issue is the HTMLImageTag, for the reasons I mentioned above. You can probably rewrite the toRawString() method in HTMLImageTag for your purposes, and that should solve your problem neatly.

Feel free to post any further questions that you have.

Regards,
Somik
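
On the Content-length question, a small sketch of the buffering approach using only the standard library. The point it illustrates is that the header counts bytes, not characters, and that getBytes() with no argument uses the platform default charset, so naming the charset keeps the count consistent with what is actually sent. The class name, the method names and the choice of UTF-8 are assumptions made for this example; they do not come from the thread or from HTMLParser.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of HTMLParser.
final class ContentLengthDemo {

    // Encodes the assembled page with an explicit charset (UTF-8 assumed here).
    static byte[] encodeBody(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    // Builds the header line from the encoded body, so the value is a byte count.
    static String contentLengthHeader(byte[] body) {
        return "Content-Length: " + body.length;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();
        sb.append("<p>caf\u00e9</p>"); // 11 characters, one of them non-ASCII

        byte[] body = encodeBody(sb.toString());
        System.out.println("characters: " + sb.length());
        System.out.println(contentLengthHeader(body));
    }
}
```

The same byte array should then be written to the response, so that the declared length and the transmitted body cannot disagree.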