<http> URL encoding should be optional

2010-12-09
2012-09-04
  • Lord Samael

    Lord Samael - 2010-12-09

    It would be nice to have a boolean attribute for the <http> tag that would
    allow bypassing the automatic URL encoding that this processor does.

    My problem is very silly. The site I'm trying to crawl has percent-encoded
    characters in its URLs, say an auth token might be ABC%3d or something. When
    this goes through Web Harvest's URL encoding, this is converted to ABC%3D
    (note the capital D). Now certainly you would expect this is a completely
    harmless change, but for some unfathomable reason this breaks the site in
    question and it returns a 404 instead of the expected page. I know this is
    ridiculous behavior on this site's part, but it would be great if there was a
    simple way to tell Web Harvest not to re-encode these URLs.
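
To illustrate the case change described above, here is a minimal sketch in Python (not Web Harvest's actual code). It assumes the client decodes and then re-encodes the value, which is one common way to end up with this result. The `encode` flag on `prepare` is a hypothetical stand-in for the requested boolean attribute.

```python
from urllib.parse import quote, unquote


def reencode(value: str) -> str:
    """Decode percent-escapes, then encode again.

    This mimics what a re-encoding client effectively does; it is an
    assumption for illustration, not Web Harvest's implementation.
    """
    return quote(unquote(value), safe="")


def prepare(value: str, encode: bool = True) -> str:
    """Hypothetical opt-out: pass the value through untouched when encode is False."""
    return reencode(value) if encode else value


token = "ABC%3d"
print(reencode(token))               # hex digits come back upper-cased
print(prepare(token, encode=False))  # original spelling is preserved
```

Per RFC 3986 the two spellings are equivalent, so the re-encoding is legal; the flag only matters for servers that compare the raw string.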

     
  • Alex Wajda

    Alex Wajda - 2010-12-10
     
  • Alex Wajda

    Alex Wajda - 2010-12-10

    Sorry, that was a mistake - I posted to the wrong thread. That was an answer
    to another question of yours.

     
  • Alex Wajda

    Alex Wajda - 2010-12-10

    Or not yours... :) Damn, my brain is dead today... sorry for the mess.

     
  • Alex Wajda

    Alex Wajda - 2010-12-10

    Regarding URL encoding:

    Yes, I think the implementation of this feature needs to be revisited.
    Personally, I don't like it at all. It's more confusing than useful. If a
    parameter is called "url", it means you should pass a URL there, and that
    URL must already conform to the URL standard.

    I'll think about it.

     