#27 Allow ignored parameters

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-11-25
Created: 2014-11-25
Creator: SpiderBro
Private: No

Some sites still pass the session ID in the URL rather than in a cookie, which results in parameters like jsessionid or PHPSESSID in every URL. These parameters could be excluded with a filter, but that would prevent accurate spidering of such sites: since every URL has the session ID appended, every URL would be filtered out and none would be returned after the first request.

The ideal solution would be to specify parameters that, if found, would be stripped and ignored by the crawler. You could also use this to (for example) ignore tracking parameters like utm_source.
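To make the request concrete, here is a minimal sketch of the stripping step in Python. It is not phpcrawl code: the function name strip_ignored_params and the IGNORED_PARAMS list are hypothetical, and it only handles parameters in the query string (a path-style ";jsessionid=..." suffix is not covered). Note also that rebuilding the query re-encodes the remaining values, which can change their exact escaping.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters to ignore; compared case-insensitively.
IGNORED_PARAMS = {"phpsessid", "jsessionid", "utm_source"}


def strip_ignored_params(url, ignored=IGNORED_PARAMS):
    """Return url with every query parameter named in `ignored` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in ignored
    ]
    # Rebuild the URL from the remaining parameters (order is preserved).
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_ignored_params("http://site.com/page.php?PHPSESSID=abc123&id=7"))
```

The crawler would apply such a function to each discovered link before queueing it, so the rest of the crawl only ever sees the stripped form.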

This change would speed up crawling and also provide better accuracy in reports.
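A rough illustration of where the speed-up would come from, again as a Python sketch rather than phpcrawl code: if a URL queue applies a rewrite step before its duplicate check, the same page reached under two different session IDs is queued only once. The UrlQueue class, its rewrite argument, and drop_session_id are all hypothetical names chosen for this example.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_session_id(url):
    """Hypothetical rewrite step: remove the PHPSESSID query parameter."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "PHPSESSID"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


class UrlQueue:
    """Toy URL queue that rewrites each URL before the duplicate check."""

    def __init__(self, rewrite=None):
        self._rewrite = rewrite or (lambda url: url)
        self._seen = set()
        self._pending = deque()

    def add(self, url):
        """Queue url unless its rewritten form was already seen."""
        url = self._rewrite(url)
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self):
        return self._pending.popleft()

    def __len__(self):
        return len(self._pending)


queue = UrlQueue(rewrite=drop_session_id)
queue.add("http://site.com/list.php?PHPSESSID=aaa&page=2")
queue.add("http://site.com/list.php?PHPSESSID=bbb&page=2")  # same page, new session ID
print(len(queue))
```

Without the rewrite step both links would be treated as distinct pages and requested separately, which is also what skews the reports.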

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2014-11-25

    Hi,

    thanks for your thoughts and the report!

    So, just to understand you right:
    If the crawler finds a URL like http://site.com/test.php?PHPSESSID=39j39f93jf&param=233, it should (optionally) strip the PHPSESSID part, put http://site.com/test.php?param=233 into the URL queue, and follow that one instead.

    So what you'll need here is a setting like "addURLRewriteRule()", right?

    There is already an open feature request that would let you "manipulate" the requests made by the crawler through a callback function, see here: https://sourceforge.net/p/phpcrawl/feature-requests/16/

    This would do the job too as far as I can see, what do you think?

     
  • SpiderBro

    SpiderBro - 2014-11-25

    If this would allow the URL to be manipulated prior to the request, then yes, I believe it would. A 'RewriteRule' would perhaps be simpler, but the callback approach could potentially handle this well, too. Incidentally, many of the links to the forum (as linked in the feature request you mentioned) seem to be broken, so I couldn't see the original discussion.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-11-25

    Yes, it's a pity. Since sf introduced a new forum and imported the threads from the old one, links to the old forum don't work anymore.

    But this is the discussion from the feature-request:
    https://sourceforge.net/p/phpcrawl/discussion/307696/thread/0901fae0/

     
