Hey! Is it possible to rewrite the HTML code with JTidy so that all relative URLs end up as absolute URLs? Thanks a lot!
I'm not sure about JTidy, but it is possible with HTMLParser.
See the SiteCapturer application.
thank you very much!
I tried it like this:
SiteCapturer capturer = new SiteCapturer();
capturer.setTarget("localSite");
capturer.setSource("http://www.google.de");
capturer.capture();
That downloads the complete page. Can you give me a short overview of how I can transform just the source HTML into a new HTML string with the links replaced (relative -> absolute)?
Should I use a NodeFilter?
Thanks in advance!
You need an in-memory image of your web page; HTMLParser provides a NodeList for that.
Get the URL, file, or text string, pass it to the parser, and use a null filter to get everything:
Parser parser = new Parser ();
parser.setResource (…);
NodeList list = parser.parse (null);
Then you need to find your links:
NodeFilter filter = new NodeClassFilter (LinkTag.class);
NodeList links = list.extractAllNodesThatMatch (filter, true /* recursive */);
Now cycle through your list and fix each link:
for (int i = 0; i < links.size (); i++)
{
    LinkTag tag = (LinkTag) links.elementAt (i);
    … tag.getLink ();
    … tag.setLink (<new link>);
}
Then output the whole page:
System.out.println (list.toHtml ());
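For the <new link> placeholder in the loop above, one way to compute the absolute form is the JDK's java.net.URI. This is only a minimal sketch that does not use HTMLParser at all: the class name LinkResolver, the method toAbsolute, and the example.com URLs are made up for illustration, and it assumes the base URL and the link are syntactically valid URIs (URI.create throws IllegalArgumentException otherwise, e.g. for links containing spaces).

```java
import java.net.URI;

// Hypothetical helper, not part of HTMLParser.
final class LinkResolver {

    /** Resolves a possibly relative link against the URL of the page it came from. */
    static String toAbsolute(String baseUrl, String link) {
        URI base = URI.create(baseUrl);
        // A base such as "http://www.google.de" has an empty path; give it a
        // root path first so that relative links get a separating slash.
        // (Assumes such a base has no query string or fragment.)
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = URI.create(baseUrl + "/");
        }
        // Links that are already absolute come back unchanged.
        return base.resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/docs/index.html";
        System.out.println(toAbsolute(base, "images/logo.png"));
        System.out.println(toAbsolute(base, "../style.css"));
        System.out.println(toAbsolute(base, "/top.html"));
        System.out.println(toAbsolute(base, "http://other.example.org/a"));
    }
}
```

The result of toAbsolute could then be passed to tag.setLink inside the loop.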
thank you again!
This helps if I want to change all <a></a> tags, but I also want to change <img src="">, <script src=""> and <link> tags.
What they have in common is that the link sits in a "src" or "href" attribute.
Is there a better way to rewrite only these attributes instead of writing a filter for every kind of tag?
thanks again! :)
I tried it with CompositeTag, but it doesn't work because the <link ../> tag has no end tag.
How can I access the <link> tag?
Best regards
I used the Tag.class instead :) Thank you
You could try the visitor pattern.
Hello again!
I have some problems when I try to set the parser's resource to JavaScript links or PHP files:
Internal Server Error (500) - unknown protocol: javascript
Internal Server Error (500) - no protocol: index.php
I can't find the problem.
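The two messages above look like the ones java.net.URL produces when it is handed a string that is not an absolute URL with a protocol it knows: a "javascript:" link has a scheme with no registered handler, and a bare "index.php" has no scheme at all. Whether that is really what happens inside the parser here is an assumption. The sketch below only reproduces the messages with the plain JDK and shows one way to resolve relative names against the page URL while leaving unparseable links alone; the class LinkCheck, its methods, and the example.com URL are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of HTMLParser.
final class LinkCheck {

    /** Returns null if spec parses as an absolute URL, otherwise the JDK's error message. */
    @SuppressWarnings("deprecation") // the URL constructors are deprecated in recent JDKs
    static String parseError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    /** Resolves link against baseUrl; links that cannot be parsed are returned unchanged. */
    @SuppressWarnings("deprecation")
    static String resolveOrKeep(String baseUrl, String link) {
        try {
            return new URL(new URL(baseUrl), link).toString();
        } catch (MalformedURLException e) {
            return link; // e.g. "javascript:..." pseudo-links
        }
    }

    public static void main(String[] args) {
        System.out.println(parseError("javascript:void(0)"));
        System.out.println(parseError("index.php"));
        String page = "http://www.example.com/forum/viewtopic.php";
        System.out.println(resolveOrKeep(page, "index.php"));
        System.out.println(resolveOrKeep(page, "javascript:void(0)"));
    }
}
```

In other words, a relative name like index.php would need to be resolved against the page it came from before being used as a resource, and javascript: links are probably best skipped.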
I have another question:
How can I access the URLs in the following code:
<style media="all" type="text/css">
@import "./templates/subSilver/themes/resolution/standard.css";
@import "./templates/subSilver/themes/default/css/all.css";
</style>
I tried to create a CssSelectorNodeFilter("@import"), but this doesn't work.
Ah… my mistake. CssSelectorNodeFilter selects HTML elements using CSS selectors :)
Sorry.
Do you have any idea how I could access these @import URLs?
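One possible approach, sketched here with the plain JDK rather than any HTMLParser feature: once the text inside the <style> element is available as a String, the @import targets can be pulled out with a regular expression. The class CssImports and the method importUrls are made-up names, and the pattern is an assumption that only covers the common forms @import "x";, @import 'x'; and @import url(x); with no whitespace inside the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
final class CssImports {

    // "@import", whitespace, an optional "url(", an optional quote,
    // then the URL itself up to the next quote, ')', whitespace or ';'.
    private static final Pattern IMPORT = Pattern.compile(
            "@import\\s+(?:url\\(\\s*)?[\"']?([^\"')\\s;]+)[\"']?",
            Pattern.CASE_INSENSITIVE);

    /** Returns the URLs of all @import rules found in the given style sheet text. */
    static List<String> importUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMPORT.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String css =
                "@import \"./templates/subSilver/themes/resolution/standard.css\";\n"
              + "@import \"./templates/subSilver/themes/default/css/all.css\";\n";
        for (String url : importUrls(css)) {
            System.out.println(url);
        }
    }
}
```

The extracted paths are still relative, so they would need to be resolved against the page URL like any other link.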