rewrite relative url inside html code

Brought to you by: derrickoswald

Forum: Help

Creator: le_tmp

Created: 2010-01-09

Updated: 2013-04-27

le_tmp - 2010-01-09

Hey! Is it possible to rewrite HTML code with JTidy so that all relative URLs end up as absolute URLs? Thanks a lot!

Derrick Oswald - 2010-01-09

I'm not sure about JTidy, but it is possible with HTMLParser.
See the SiteCapturer application.

le_tmp - 2010-01-09

Thank you very much!

I tried it like this:
SiteCapturer capturer = new SiteCapturer();
capturer.setTarget("localSite");
capturer.setSource("http://www.google.de");
capturer.capture();

That downloads the complete page. Can you give me a short overview of how I can just transform the source HTML into a new HTML string with the links replaced (relative -> absolute)?

Should I use a NodeFilter?
Thanks in advance!

Derrick Oswald - 2010-01-09

You need an in-memory image of your web page; HTMLParser provides a NodeList for that.
Pass the URL, file, or text string to the parser and use a null filter to get everything:

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
Parser parser = new Parser ();
parser.setResource (…);
NodeList list = parser.parse (null);

Then you need to find your links:

NodeFilter filter = new NodeClassFilter (LinkTag.class);
NodeList links = list.extractAllNodesThatMatch (filter, true /* recursive */);

Now cycle through your list and fix the link:

for (int i = 0; i < links.size (); i++)
{
    LinkTag tag = (LinkTag) links.elementAt (i);
    … tag.getLink();
    … tag.setLink(<new link>);
}

Then output the whole page:
System.out.println (list.toHtml ());
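For the "<new link>" step, the relative-to-absolute conversion itself can be done with the JDK alone. The following is only a minimal sketch, not part of HTMLParser: the class name LinkResolver and the method toAbsolute are made up for illustration, and it assumes the link is a syntactically valid URI (java.net.URI.create throws IllegalArgumentException otherwise).

```java
import java.net.URI;

// Hypothetical helper, not part of HTMLParser: resolves a link against the page URL.
final class LinkResolver {

    /** Returns the absolute form of link, resolved against baseUrl. */
    static String toAbsolute(String baseUrl, String link) {
        URI base = URI.create(baseUrl);
        // java.net.URI mis-resolves against a base with an empty path
        // (e.g. "http://host"), so give such a base the root path first.
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        // An already absolute link is returned unchanged by resolve().
        return base.resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/forum/index.php";
        System.out.println(toAbsolute(page, "images/logo.png"));
        System.out.println(toAbsolute(page, "../style.css"));
        System.out.println(toAbsolute(page, "http://other.example.org/a.html"));
    }
}
```

Inside the loop, the result of such a helper would be what gets passed to tag.setLink(), with the page's own URL as the base.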

le_tmp - 2010-01-09

Thank you again!

This helps if I want to change all <a></a> tags, but I also want to change <img src="">, <script src=""> and <link> tags.
What they have in common is that the link is in a "src" or "href" attribute.
Is there a better way to rewrite only these attributes instead of writing a filter for every kind of tag?

thanks again! :)
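One way to picture "rewrite only the src and href attributes, whatever the tag" is the rough sketch below. It deliberately does not use HTMLParser: it is a plain regular-expression pass over the HTML string using only the JDK, so it is far less robust than walking parsed tags (for example, it also matches attributes such as data-src and ignores unquoted values). The class name AttributeRewriter and the method absolutize are hypothetical, and the base URL is assumed to have a non-empty path.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: regex-based attribute rewriting.
final class AttributeRewriter {

    // Matches src="..." or href="..." with single or double quotes.
    private static final Pattern URL_ATTR = Pattern.compile(
            "\\b(src|href)\\s*=\\s*([\"'])(.*?)\\2", Pattern.CASE_INSENSITIVE);

    /** Returns html with every quoted src/href value resolved against baseUrl. */
    static String absolutize(String html, String baseUrl) {
        URI base = URI.create(baseUrl); // assumed to have a non-empty path
        Matcher m = URL_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(3);
            String resolved;
            try {
                resolved = base.resolve(value.trim()).toString();
            } catch (IllegalArgumentException e) {
                resolved = value; // not a valid URI: leave the value as it was
            }
            String attr = m.group(1) + "=" + m.group(2) + resolved + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(attr));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a parsed page, the same idea would apply per tag: read the "src" or "href" attribute, resolve it, and write it back, instead of matching text.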

le_tmp - 2010-01-09

I tried it with CompositeTag, but it doesn't work because the <link ../> tag has no end tag.
How can I access the <link> tag?

Best regards

le_tmp - 2010-01-09

I used the Tag.class instead :) Thank you

Derrick Oswald - 2010-01-09

You could try the visitor pattern.

le_tmp - 2010-01-11

Hello again!

I have some problems when I try to set the parser's resource to JavaScript or PHP files:

Internal Server Error (500) - unknown protocol: javascript
Internal Server Error (500) - no protocol: index.php

I can't find the problem.
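Both messages read like the texts of java.net.MalformedURLException ("unknown protocol: …" and "no protocol: …"), which suggests the raw attribute values ("javascript:…" and "index.php") are being used as URLs directly. A possible guard, sketched here with only the JDK, is to resolve each link against the page URL first and skip anything that is not http or https. The class name FetchableLink and the method resolveForFetch are made up for this example.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not part of HTMLParser: decides whether a link can be fetched.
final class FetchableLink {

    /**
     * Resolves link against baseUrl and returns it only if the result is an
     * http or https URL; javascript:, mailto: and malformed links yield empty.
     */
    static Optional<String> resolveForFetch(String baseUrl, String link) {
        try {
            URI resolved = URI.create(baseUrl).resolve(link.trim());
            String scheme = resolved.getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Optional.of(resolved.toString());
            }
        } catch (IllegalArgumentException e) {
            // The link is not a syntactically valid URI; treat it as not fetchable.
        }
        return Optional.empty();
    }
}
```

Only links for which this returns a value would then be handed to the parser as a resource.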

le_tmp - 2010-01-11

I have another question:

How can I access the URLs in the following code:

<style media="all" type="text/css">
@import "./templates/subSilver/themes/resolution/standard.css";
@import "./templates/subSilver/themes/default/css/all.css";
</style>

I tried to create a CssSelectorNodeFilter("@import") but this doesn't work.

le_tmp - 2010-01-11

Ah… my mistake. CssSelectorNodeFilter selects HTML elements using CSS selectors; it does not look inside the style sheet text. :)
Sorry.

Do you have an idea how I could access these @import URLs?
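One possible approach, sketched below with only the JDK, is to take the text inside the <style> element as a plain string and pull the @import targets out with a regular expression. This is an approximation rather than a CSS parser: it covers the quoted form shown above and the url(...) form, and ignores comments and escapes. The class name CssImports and the method importUrls are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: finds @import targets in CSS text.
final class CssImports {

    // Covers: @import "x.css";  @import 'x.css';  @import url("x.css");  @import url(x.css);
    private static final Pattern IMPORT = Pattern.compile(
            "@import\\s+(?:url\\(\\s*)?[\"']?([^\"')\\s;]+)[\"']?",
            Pattern.CASE_INSENSITIVE);

    /** Returns the import targets in the order they appear in the CSS text. */
    static List<String> importUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMPORT.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

Each extracted path could then be resolved against the page URL in the same way as the src and href values, and written back into the style text.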
