
Prevent phpcrawl from visiting a URL twice

Help
2011-05-20
2013-04-09
  • Nobody/Anonymous

    Hi there,

    this looks like a very comfortable and convenient class for web crawling - thank you for providing it!

    I have a question: What is the easiest way to prevent phpcrawl from visiting any URL twice during a crawl?

    The reason is that I would like to crawl a website with a pager: the data spans multiple pages, so there is a "1 2 3 … 21" link list at the bottom, as on basically any site with search results. The page links contain "P-1", "P-2", etc. in the URL to point to the appropriate page, and the Previous Page and Next Page links contain these absolute numbers as well.
    I want the crawler to follow the links in ascending order but never backwards (otherwise it would loop indefinitely), so a general filter regexp on the page links doesn't help here.
    Hence my idea of simply preventing the same page from being visited more than once. Is that possible, or maybe even the standard behavior?

    Thanks for any hint for a solution.
    Best regards
    George

     
  • Nobody/Anonymous

    Don't worry, I found out by looking at the source code: phpcrawl caches the normalized URLs and ensures that every URL is visited only once. Great!

    Sorry for posting before researching ;)

    Regards
    George
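
    For readers who want to see why a cache of normalized URLs makes the pager problem go away, here is a minimal sketch of the general idea. It is written in Python for illustration and is not phpcrawl's actual code; `normalize`, `crawl`, `get_links`, and the example.com URLs are all hypothetical names made up for this example, and the normalization rules shown are only one plausible choice.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a canonical form so equivalent spellings compare equal.

    Assumed rules (illustrative only): lowercase scheme and host, drop the
    default port, use "/" for an empty path, and discard the fragment.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def crawl(start_url, get_links):
    """Breadth-first crawl that visits each normalized URL at most once.

    get_links is a hypothetical callback that returns the hrefs on a page.
    """
    start = normalize(start_url)
    seen = {start}          # every URL that was ever queued
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            key = normalize(urljoin(url, href))
            if key not in seen:   # back-links to earlier pages are skipped
                seen.add(key)
                queue.append(key)
    return visited


# A made-up three-page pager where every page links forwards and backwards.
PAGES = {
    "http://example.com/list/P-1": ["P-2", "P-3"],
    "http://example.com/list/P-2": ["P-1", "P-3", "P-2#top"],
    "http://example.com/list/P-3": ["P-2", "http://EXAMPLE.com:80/list/P-1"],
}

order = crawl("http://example.com/list/P-1", lambda url: PAGES.get(url, []))
print(order)
```

    Because the "seen" set is checked before a link is queued, the Previous Page links and the differently spelled link back to P-1 are dropped, so the crawl ends after three requests instead of looping.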

     
