It would be nice to have a boolean attribute for the <http> tag that would
allow bypassing the automatic URL encoding that this processor does.
My problem is a silly one. The site I'm trying to crawl has percent-encoded
characters in its URLs; an auth token might be ABC%3d, for example. When
this goes through Web Harvest's URL encoding, it is converted to ABC%3D
(note the capital D). You would expect this to be a completely
harmless change, but for some unfathomable reason it breaks the site in
question, which returns a 404 instead of the expected page. I know this is
ridiculous behavior on the site's part, but it would be great if there were a
simple way to tell Web Harvest not to re-encode these URLs.
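To illustrate the effect (this is not Web Harvest's actual code, just a small Python stand-in; the function name `reencode` is made up for the example): decoding a percent-escape and encoding it again will typically produce uppercase hex digits, so a lowercase escape does not survive the round trip even though both forms decode to the same character.

```python
from urllib.parse import quote, unquote


def reencode(value: str) -> str:
    """Decode percent-escapes, then encode again.

    Hypothetical stand-in for a client that re-encodes URLs it is given.
    """
    return quote(unquote(value), safe="")


token = "ABC%3d"
print(unquote(token))   # the decoded token
print(reencode(token))  # the token after a decode/encode round trip
```

Both ABC%3d and ABC%3D decode to the same string, so the change should be harmless, but a server that compares the raw URL text will see two different tokens.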
Please check this http://www.w3schools.com/xquery/xquery_syntax.asp
Sorry, that was a mistake - I posted to the wrong thread. That was an answer to
another question of yours.
Or not yours.... :) Damn, my brain is dead today... sorry for the mess
Regarding URL encoding:
Yes, I think the implementation of this thing needs to be revisited.
Personally I don't like this feature at all. It's more confusing than
useful. If a parameter is called "url", it means that you should pass a URL there,
and that URL must already conform to the standard for URLs.
I'll think about it.
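For what it's worth, one possible middle ground is to encode only the characters that are not legal in a URL and leave existing percent-escapes alone. The following is a rough Python sketch of that idea, not anything Web Harvest does today; `SAFE` and `encode_preserving_escapes` are names invented for the example.

```python
from urllib.parse import quote

# Characters left untouched: the URL delimiters plus "%",
# so escapes that are already present pass through unchanged.
SAFE = "%:/?#[]@!$&'()*+,;="


def encode_preserving_escapes(url: str) -> str:
    """Escape illegal characters (such as spaces) without re-encoding
    existing %XX sequences. Hypothetical helper for illustration."""
    return quote(url, safe=SAFE)


print(encode_preserving_escapes("http://example.com/page?token=ABC%3d"))
print(encode_preserving_escapes("http://example.com/some page"))
```

The trade-off is that a stray "%" that is not part of a valid escape is also passed through as-is, so the caller remains responsible for that case.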
Here is a related topic - https://sourceforge.net/projects/web-harvest/forums/forum/591299/topic/3966338
(Why can't I edit my posts here? That stinks!)