Prevent phpcrawl from visiting a URL twice

Status: Beta

Brought to you by: huni

Forum: Help

Creator: Nobody/Anonymous

Created: 2011-05-20

Updated: 2013-04-09

Nobody/Anonymous - 2011-05-20

Hi there,

this looks like a very comfortable and convenient class for web crawling - thank you for providing it!

I have a question: What is the easiest way to prevent phpcrawl from visiting any URL twice during a crawl?

The reason is that I would like to crawl a website that has a pager: the data spans multiple pages, so there is a "1 2 3 … 21" list of page links at the bottom, as on basically any site with search results. Those page links contain "P-1", "P-2", etc. in the URL to point to the appropriate page, and the Previous Page and Next Page links contain these absolute numbers as well.
I want the crawler to follow the links in ascending order, but never backwards (because it would loop indefinitely) - so a general filter regexp on the page links doesn't help here.
Hence my idea of just preventing the same page from being visited more than once. Is that possible or maybe even standard behavior?

Thanks for any hint for a solution.
Best regards
George
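
For illustration, the "ascending only" rule described above could be sketched as follows. This is a language-agnostic sketch written in Python, not phpcrawl code: the function names `page_number` and `should_follow` and the example.com URLs are hypothetical, and the only detail taken from the post is that pager URLs contain "P-1", "P-2", and so on. It shows why a static regexp is not enough: the decision depends on the page the crawler is currently on.

```python
import re

# Assumption: the page index appears in the URL as "P-<number>", as described
# in the post. A stricter pattern may be needed if "P-" can occur elsewhere.
PAGE_RE = re.compile(r"P-(\d+)")


def page_number(url):
    """Return the pager index found in the URL, or None if there is none."""
    match = PAGE_RE.search(url)
    return int(match.group(1)) if match else None


def should_follow(current_url, link_url):
    """Follow pager links only when they point to a higher page number."""
    current = page_number(current_url)
    target = page_number(link_url)
    if target is None or current is None:
        # Not a pager link, or we are entering the pager from a non-pager page.
        return True
    return target > current
```

A crawler would call `should_follow` for every link it finds, passing the URL of the page being parsed and the URL of the link. Whether phpcrawl offers a hook where such a check could be plugged in is not something this thread establishes.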


Nobody/Anonymous - 2011-05-20

Don't worry, I found out by looking at the source code - phpcrawl caches the normalized URLs and ensures every URL is only visited once. Great!

Sorry for posting before researching ;)

Regards
George
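
For readers who want to see the general idea behind "cache the normalized URLs and visit each one only once", here is a minimal sketch in Python. It is not phpcrawl's actual code: `normalize_url` and `VisitedUrls` are hypothetical names, the example.com URLs are made up, and the normalization rules shown (lowercase scheme and host, drop default ports and fragments) are assumptions about what a crawler might reasonably do.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target of the URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Reduce trivially different spellings of a URL to one canonical form.

    Note: this sketch ignores user:password parts and does not resolve
    relative paths such as "a/../b".
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class VisitedUrls:
    """Remembers normalized URLs so that each one is processed only once."""

    def __init__(self):
        self._seen = set()

    def mark_if_new(self, url):
        """Return True the first time a URL is seen, False on every repeat."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With a structure like this, the Previous Page link on page 3 normalizes to the same key as the page 2 URL that was already crawled, so it is skipped and the pager cannot cause a loop.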
