Is there a way to pre-process urls?

Help
Anonymous
2014-04-13
2014-04-14
  • Anonymous

    Anonymous - 2014-04-13

    I'm using this amazing package to crawl our own sites - we have about 50 of them that have been developed over 10 or 15 years, so there are a lot of "interesting" urls. One simple example is where the url is: www.oursite.com/foo=bar&print=1. It's effectively the same as www.oursite.com/foo=bar, and I'd like to strip the "&print=1" before PHPCrawl does its processing. I do NOT want to ignore the link, but only strip out the "print=1". I have many other such instances where the url needs to be crawled, but certain items in the querystring need to be stripped.
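
    To make the "strip print=1 but still crawl the link" idea concrete, here is a minimal sketch in plain PHP (7 or later). The function name stripQueryParams() is hypothetical and is not part of PHPCrawl, and the sketch does not show where to call it from inside the crawler. It assumes the querystring is introduced by a "?"; the example url above is written without one, so the matching may need adjusting for urls like that.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: removes the named
// querystring parameters from a URL and leaves everything else alone.
function stripQueryParams(string $url, array $paramsToDrop): string
{
    // Split off a fragment (#...) first so it can be re-attached unchanged.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    $queryStart = strpos($url, '?');
    if ($queryStart === false) {
        return $url . $fragment; // no querystring, nothing to strip
    }

    $base  = substr($url, 0, $queryStart);
    $query = substr($url, $queryStart + 1);

    // Keep every "name=value" pair whose name is not in the drop list.
    $kept = [];
    foreach (explode('&', $query) as $pair) {
        if ($pair === '') {
            continue;
        }
        $name = urldecode(explode('=', $pair, 2)[0]);
        if (!in_array($name, $paramsToDrop, true)) {
            $kept[] = $pair;
        }
    }

    return $base . ($kept ? '?' . implode('&', $kept) : '') . $fragment;
}
```

    With this, stripQueryParams('http://www.oursite.com/?foo=bar&print=1', ['print']) gives back the url with only foo=bar left in the querystring, and a url whose only parameter is print=1 loses the "?" as well.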

    Another example, which may be specific to our code/Apache config, is a SEF/mod_rewrite issue that appends a trailing slash to all the urls on the page. This creates havoc because PHPCrawl thinks that the querystring wrapped in the slashes is a directory on the server, which puts the crawling into a loop of sorts:
    starting url: www.oursite.com
    finds/crawls www.oursite.com/about-us/ and all other links located
    finds/crawls www.oursite.com/about-us// and again, all other links located
    finds/crawls www.oursite.com/about-us/// and again, all other links located
    finds/crawls www.oursite.com/about-us//// and again, all other links located
    ...
    This goes on indefinitely. Stripping the trailing slash would solve the problem for me; I confirmed this by turning off our SEF/rewrite settings.
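
    The trailing-slash cleanup described above could be sketched like this in plain PHP (7 or later). The function name normalizePathSlashes() is hypothetical and is not part of PHPCrawl; it only shows the string handling, not where to hook it into the crawler. It collapses repeated slashes in the path and drops trailing ones, so every variant in the loop above ends up as the same url.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: collapses repeated slashes in
// the path and removes trailing ones, so "/about-us/", "/about-us//" and
// "/about-us///" all normalize to "/about-us".
function normalizePathSlashes(string $url): string
{
    // Separate everything from the first "?" or "#" so only the path is touched.
    $tail = '';
    $tailStart = strcspn($url, '?#');
    if ($tailStart < strlen($url)) {
        $tail = substr($url, $tailStart);
        $url  = substr($url, 0, $tailStart);
    }

    // Separate "scheme://" so its double slash is preserved.
    $prefix = '';
    $schemeEnd = strpos($url, '://');
    if ($schemeEnd !== false) {
        $prefix = substr($url, 0, $schemeEnd + 3);
        $url    = substr($url, $schemeEnd + 3);
    }

    $url = preg_replace('#/{2,}#', '/', $url); // "//" becomes "/"
    $url = rtrim($url, '/');                   // drop trailing slash(es)

    return $prefix . $url . $tail;
}
```

    Note that this also strips the single trailing slash from a bare host such as http://www.oursite.com/, which is harmless for comparing urls but is a choice made for this sketch.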

    Is there a function to do this that I'm not finding/understanding, or where would the best place be to add that pre-processing code before PHPCrawl adds it to the list of links to crawl?

    A great feature would be the ability to specify querystring variables to ignore, such as pagination values, but in my case that would not solve all my interesting urls, thus the need to actually strip out certain substrings within the querystring itself.

    Thanks again for a great piece of code!!!

     

    Last edit: Anonymous 2014-04-13
  • Anonymous

    Anonymous - 2014-04-14

    Hi!

    There is an open request for an overridable "prepareRequest()" method, see here:
    http://sourceforge.net/p/phpcrawl/feature-requests/16/

    I think this is just what you need, right?

     
  • Anonymous

    Anonymous - 2014-04-14

    That would be perfect!

     
