First, amazing project... thanks so much!
In a certain part of my project, I only need to scrape 1 page from a site, so I am using setPageLimit to set the limit to 1. However, I have noticed that if I attempt to scrape 1 page from a site that uses a redirect on that page, I receive an error even though setFollowRedirectsTillContent is set to true.
One such site I have come across is http://www.marketingpower.com/. Is there a workaround for this?
Hi!
Seems like it's the same problem as described here:
http://sourceforge.net/p/phpcrawl/bugs/55/, right?
The problem is that "setPageLimit" currently behaves more like a "setRequestLimit". So if a page uses a redirect and you set setPageLimit to 1, the crawler stops because it already made ONE request (the one that only returned the redirect, not the content).
In the next version there will be a REAL "setPageLimit" and a "setRequestLimit".
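Until then, a possible workaround is to skip setPageLimit() entirely and count received documents yourself, aborting the crawl from handleDocumentInfo(). This is just a minimal sketch, assuming the 0.8x API (the "received" flag on PHPCrawlerDocumentInfo, and the documented behavior that returning a negative value from handleDocumentInfo() stops the crawling-process); the class name OnePageCrawler is only an example:

<?php
include("libs/PHPCrawler.class.php");

class OnePageCrawler extends PHPCrawler
{
  private $pages_received = 0;

  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    // Redirect-responses carry no content, so they don't count here
    if ($DocInfo->received == true)
    {
      $this->pages_received++;
      echo "Received: ".$DocInfo->url."\n";
    }

    // Returning a negative value aborts the crawling-process
    if ($this->pages_received >= 1) return -1;
  }
}

$crawler = new OnePageCrawler();
$crawler->setURL("http://www.marketingpower.com/");
$crawler->setFollowRedirectsTillContent(true);
// No setPageLimit(1) here, it would count the redirect-request itself
$crawler->go();
?>

This way the redirect-request itself doesn't count against the limit; only a document whose content actually arrives does.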