Re: [Htmlparser-user] RE: Hints on how to change image tag locations and write o

Dear Rob,

From your first mail :
> Am I correct in thinking that in both an HTMLTag and an HTMLImageTag object
> are created for each image tag encountered when using HTMLImageScanner? If
> so, does the HTMLTag object get populated with the usual data?

Functionally, you only get one tag object. If you haven't registered the
relevant scanner (HTMLImageScanner in this case), you get an HTMLTag
object. If you have, you get an HTMLImageTag object. Internally, an HTMLTag
object is created first. Control then passes to the registered scanners to
see whether the tag can be upgraded. If so, the new subclassed tag object
(HTMLImageTag, for example) is created and returned in place of the
original HTMLTag.
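
To make the "upgrade" idea concrete, here is a minimal, self-contained
sketch of the pattern. It is not the parser's actual source: every class
name below (SketchTag, SketchImageTag, SketchScanner, SketchImageScanner,
SketchParser) is a hypothetical stand-in, and the "starts with img" test is
only an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only, not the real HTMLParser classes.
class SketchTag {
    final String text;                      // raw tag contents, e.g. img src="a.gif"
    SketchTag(String text) { this.text = text; }
}

class SketchImageTag extends SketchTag {
    SketchImageTag(String text) { super(text); }
}

interface SketchScanner {
    boolean accepts(SketchTag tag);         // can this scanner handle the tag?
    SketchTag upgrade(SketchTag tag);       // build the specialised tag object
}

class SketchImageScanner implements SketchScanner {
    public boolean accepts(SketchTag tag) {
        // Case-insensitive check that the tag text begins with "img".
        return tag.text.regionMatches(true, 0, "img", 0, 3);
    }
    public SketchTag upgrade(SketchTag tag) {
        return new SketchImageTag(tag.text);
    }
}

class SketchParser {
    private final List<SketchScanner> scanners = new ArrayList<>();

    void addScanner(SketchScanner scanner) { scanners.add(scanner); }

    // A generic tag is created first; the first registered scanner that
    // accepts it returns a subclassed tag in its place.
    SketchTag nextTag(String text) {
        SketchTag tag = new SketchTag(text);
        for (SketchScanner scanner : scanners) {
            if (scanner.accepts(tag)) {
                return scanner.upgrade(tag);
            }
        }
        return tag;                         // no scanner matched: stay generic
    }
}
```

With no scanner registered, nextTag() hands back a plain SketchTag. After
addScanner(new SketchImageScanner()) the same input comes back as a
SketchImageTag, and the caller still only ever sees one object per tag.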

> I didn't give it a
> filter because I don't understand what the filter does.
>
A filter is not required. It is only used when running the parser from the
command line, where it lets us check parse results easily and dump them to a
file. You can ignore it for your app - the following will work:
parser.addScanner(new HTMLImageScanner(""));

>     HTMLLinkProcessor linkProcessor = new HTMLLinkProcessor();
Why are you declaring a linkProcessor? It does not seem to be needed here.

>           HTMLImageTag imgtag = (HTMLImageTag) node;
>           String imgsrc = imgtag.getImageLocation();
>           if(imgsrc.indexOf("http://") == -1){
>           //relative src
>           imgsrc = base.toString() + imgsrc;
>           }

This is not necessary. The base URL that you specify in the parser will
automatically be used to resolve relative links. Check out the test cases:
testRelativeImageScan,
testRelativeImageScan2,
testRelativeImageScan3 in
com.kizna.htmlTests.scannerTests.HTMLImageScannerTest
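
If you ever do need to resolve a relative image location by hand, plain
string concatenation of base + src goes wrong as soon as the base URL ends in
a file name or the src starts with a slash. The sketch below uses the
standard java.net.URI class instead. It is only an illustration of relative
resolution in general, not the code the parser runs internally; the class
name RelativeImageUrl and the example.com URLs are made up.

```java
import java.net.URI;

class RelativeImageUrl {
    // Resolve an image location against the base URL of the page.
    // Absolute locations are returned unchanged by URI.resolve().
    static String resolve(String baseUrl, String imageLocation) {
        return URI.create(baseUrl).resolve(imageLocation).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/docs/index.html";
        System.out.println(resolve(base, "images/logo.gif"));          // relative to /docs/
        System.out.println(resolve(base, "/images/logo.gif"));         // relative to the host root
        System.out.println(resolve(base, "http://other.example.org/a.gif")); // already absolute
    }
}
```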

I can also see that you are trying to reconstruct the HTML tag without
changing its contents - you can do this with imageTag.toRawString() if you
are using HTMLParser v1.2 upwards. However, this will give you the relative
link (not the resolved absolute link). Perhaps, if you need it, we can
modify the toRawString() method to return absolute links?

> 1) is this the only way to get all the attributes in the img tag?
No. There's a much easier way - just do :
imageTag.getParameter("alt");

If you want to get the keys, I think this should work :
imageTag.getParsed().keys()

[Maybe the name of this method should be changed to be easier to figure
out].
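
As a rough illustration of walking the keys: the sketch below assumes that
getParsed() hands back a java.util.Hashtable of attribute names to values
(the keys() call suggests this, but please check it against your version).
The table here is a hand-built stand-in, and AttributeKeysSketch and
sortedKeys are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Hashtable;
import java.util.List;

class AttributeKeysSketch {
    // Collect the attribute names from the table. Hashtable gives no ordering
    // guarantee, so the names are sorted to get a stable result.
    static List<String> sortedKeys(Hashtable<String, String> attributes) {
        List<String> names = new ArrayList<>(Collections.list(attributes.keys()));
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Stand-in for the table assumed to come from imageTag.getParsed().
        Hashtable<String, String> attributes = new Hashtable<>();
        attributes.put("src", "images/logo.gif");
        attributes.put("alt", "Logo");

        for (String name : sortedKeys(attributes)) {
            System.out.println(name + " = " + attributes.get(name));
        }
    }
}
```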

> I need to do this or can I just omit the Content-length field and avoid
> using the StringBuffer?

Hmm.. It's not mandatory to send the Content-Length, but some servers expect
it. To make life easier, you should use toRawString() to get the HTML tags
out uniformly. Since this applies to any node, you don't have to write code
for different types of nodes. So sb.append(node.toRawString()) is (perhaps)
good enough for all nodes. The only one where there might be an issue is the
HTMLImageTag, for the reasons that I mentioned above. You can probably
rewrite the toRawString() method in HTMLImageTag for your purposes, and that
should solve your problem neatly.
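
If you do keep the buffer in order to send a Content-Length, note that the
header counts bytes of the encoded body, not characters, so sb.length() is
only right for pure ASCII output. A minimal sketch, assuming the body is sent
as UTF-8 (that choice of charset is my assumption, as is the class name
ContentLengthSketch); the strings in rawNodes stand in for what
node.toRawString() would return:

```java
import java.nio.charset.StandardCharsets;

class ContentLengthSketch {
    // Content-Length is the number of bytes in the encoded body.
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Stand-ins for the pieces that node.toRawString() would supply.
        String[] rawNodes = {
            "<html><body>",
            "<img src=\"images/logo.gif\" alt=\"Logo\">",
            "</body></html>"
        };

        StringBuilder sb = new StringBuilder();
        for (String raw : rawNodes) {
            sb.append(raw);                 // same append for every node type
        }

        String body = sb.toString();
        System.out.println("Content-Length: " + contentLength(body));
    }
}
```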

Feel free to post any further questions that you have.

Regards,
Somik