Hi there,
This looks like a very comfortable and convenient class for web crawling - thank you for providing it!
I have a question: What is the easiest way to prevent phpcrawl from visiting any URL twice during a crawl?
The reason is that I would like to crawl a website that uses a pager, as most sites with search results do: the data spans multiple pages, so there is a "1 2 3 … 21" list of page links at the bottom. Those page links contain "P-1", "P-2", etc. in the URL to point to the appropriate page, and the Previous Page and Next Page links contain these absolute numbers as well.
I want the crawler to follow the links in ascending order but never backwards (otherwise it would loop indefinitely), so a general filter regexp on the page links doesn't help here.
Hence my idea of just preventing the same page being visited more than once. Is that possible or maybe even standard behavior?
Thanks for any hint for a solution.
Best regards
George
Don't worry, I found out by looking at the source code - phpcrawl caches the normalized URLs and ensures that every URL is visited only once. Great!
Sorry for posting before researching ;)
Regards
George
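For readers wondering what "caching normalized URLs" can look like in practice, here is a minimal sketch of the general technique. It is written in Python purely for illustration and is not phpcrawl's actual code; the names normalize_url and VisitedFilter, the example.com URLs, and the specific normalization rules (lowercase host, drop default port, drop fragment) are all assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target of a URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivially different spellings compare equal.

    Hypothetical rules for this sketch: lowercase scheme and host, drop the
    default port, use "/" for an empty path, and drop the fragment.
    Any user:password part of the URL is also discarded.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is left out of the key.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class VisitedFilter:
    """Remembers normalized URLs and approves each one only the first time."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_visit(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    visited = VisitedFilter()
    # Forward pager links are accepted once; the "Previous Page" link back
    # to P-1 is rejected, so the crawl cannot loop.
    for link in [
        "http://example.com/results/P-1",
        "http://example.com/results/P-2",
        "http://example.com/results/P-1#top",
    ]:
        print(link, visited.should_visit(link))
```

With a filter like this, a pager's "Previous Page" link resolves to a URL that is already in the set, so it is skipped and the crawl only ever moves on to pages it has not seen yet.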