Hello. First of all, thanks for your great work! I've read through 3 pages of the forum but couldn't find anything on this.
I'm trying to configure phpcrawl for my needs and am crawling different sites for testing. Everything is OK so far, except for the known bugs and a problem with URL normalizing that I've found on the following site. I don't know whether it is a phpcrawl problem or my own, but the log looks like this:
Page requested:
pokolenie-spb.ru/ventilyaciya/ventilyatory/shop002&cat_id=&f04=&f05=&producer_id=&ob=&asc_desc=&cost=&cost1=&page=12 (404)
Referer-page:
pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html
There are a lot of such 404s, so it's slowing down the process.
There is no such page on this site and no such link in the referrer page source. Here is the link code, which points to the website's base dir:
href="/?c=kondicionery&action=shop&cat_id=100"
It looks like the link gets cut after the second equals sign, and the remainder, "shop&cat_id=100", is then treated as relative to the current dir. All other links are normalized correctly.
I haven't edited any class code, only the handler function a little bit. I've also tested the buildURLFromLink function on its own and it works perfectly (the normalized URL is pokolenie-spb.ru/?c=kondicionery&action=shop&cat_id=100).
Any suggestions?
Thanks for your reply,
Alexandr
PS: One more question: is there a way to get the number of remaining URLs to index? Thanks.
Last edit: Anonymous 2013-12-24
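To make the symptom in the post above concrete: a link that starts with "/" should resolve against the site root, while a bare fragment such as "shop&cat_id=100" resolves against the directory of the referring page, which gives the kind of URL seen in the 404 log. The sketch below is only a minimal illustration. resolveLink() is a hypothetical helper (it is not phpcrawl's buildURLFromLink), it covers just these two cases, and the http:// scheme is an assumption because the post does not give one.

```php
<?php
// Hypothetical helper for illustration only; not part of phpcrawl.
// Handles two cases: root-relative links ("/...") and plain relative
// links, which are resolved against the directory of the base page.
function resolveLink($link, $baseUrl)
{
    $parts  = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Root-relative: attach directly to scheme + host.
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }

    // Relative: keep the base path up to and including its last "/".
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

// Referer page from the log; the scheme is assumed.
$referer = 'http://pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html';

// The full href resolves to the site root, as expected.
echo resolveLink('/?c=kondicionery&action=shop&cat_id=100', $referer), "\n";

// The truncated fragment lands in the referer's directory instead.
echo resolveLink('shop&cat_id=100', $referer), "\n";
```

The second call shows why a truncated link ends up under /ventilyaciya/ventilyatory/ and returns a 404.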
Hi!
Thanks for the report! It really looks like phpcrawl is not building the URL correctly. I'll test it soon and let you know what I've found.
I'll open a bug report for this.
And no, right now there is no way to get the number of remaining URLs to crawl in the cache, sorry.
Thanks for the report!
Thanks for your work once again!
Hi again,
I just wanted to test your report on the site you mentioned in your first post, but I'm afraid I can't find the link href="/?c=kondicionery&action=shop&cat_id=100" anywhere on the page pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html.
I found a similar link (href="/?c=ventilyaciya&action=shop002&cat_id=&f04=&f05=&producer_id=..."), but this one works fine: the crawler rebuilds the URL correctly and the request is OK too (200).
So I don't know how to reproduce your problem right now, sorry!
I initialize the class this way. Maybe this causes the problem?
$crawler = new MyCrawler();
$crawler->setURL($domain);
// Receive content only for these content types
$crawler->addContentTypeReceiveRule("#text/#");
$crawler->addContentTypeReceiveRule("#image/#");
$crawler->addContentTypeReceiveRule("#application/x-shockwave-flash#");
$crawler->addURLFilterRule("#(\()$# i"); // skip URLs ending in "(" (I'm filtering the "("-bug)
$crawler->setContentSizeLimit(10485760); // 10 MB
$crawler->setPageLimit(10000);
$crawler->enableAggressiveLinkSearch(FALSE);
// Attributes to extract links from
$crawler->setLinkExtractionTags(array("href", "src", "background", "action"));
$crawler->go();
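One thing that may be worth checking in the setup above: "action" is in the setLinkExtractionTags() list, and the problematic hrefs contain "action=" as a query parameter. If an attribute matcher does not require the attribute name to start at a boundary, it could pick up the text after "action=" as a separate relative link, which would match the "shop&cat_id=100" fragment reported earlier. This is only a guess and is not confirmed anywhere in this thread. findAttributeValues() and its pattern are hypothetical and are not phpcrawl's actual link-finder code.

```php
<?php
// Illustration only: a simplified attribute matcher, NOT phpcrawl's real regex.
// With $requireBoundary = false, an attribute name is matched anywhere in the
// markup, including inside another attribute's value.
function findAttributeValues($html, array $names, $requireBoundary)
{
    $found = array();
    foreach ($names as $name) {
        // Optionally require whitespace directly before the attribute name.
        $prefix  = $requireBoundary ? '(?<=\s)' : '';
        $pattern = '#' . $prefix . preg_quote($name, '#') . '\s*=\s*"?([^"\s>]+)#i';
        if (preg_match_all($pattern, $html, $matches)) {
            foreach ($matches[1] as $value) {
                $found[] = $value;
            }
        }
    }
    return $found;
}

// Same list as in the setLinkExtractionTags() call above.
$tags = array("href", "src", "background", "action");
$html = '<a href="/?c=kondicionery&action=shop&cat_id=100">link</a>';

// Naive matching also returns the text after "action=" inside the href.
print_r(findAttributeValues($html, $tags, false));

// Requiring a boundary returns only the real href value.
print_r(findAttributeValues($html, $tags, true));
```

If this turns out to be the cause, temporarily dropping "action" from the list would be a cheap way to test it.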