PHPCrawl / Forum / Help: URL normalizer problem


Anonymous - 2013-12-24

Hello. First of all, thanks for your great work! I've read through 3 pages of the forum but couldn't find anything on this.

I'm trying to configure phpcrawl for my needs and am crawling different sites for testing. Everything is OK so far except for the known bugs and a problem with URL normalization that I found on the following site. I don't know whether it is a phpcrawl problem or my own, but the log looks like this:

Page requested:
pokolenie-spb.ru/ventilyaciya/ventilyatory/shop002&cat_id=&f04=&f05=&producer_id=&ob=&asc_desc=&cost=&cost1=&page=12 (404)
Referer-page:
pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html

There are a lot of such 404s, so they slow down the process.

There is no such page on this site and no such link in the referrer page source. Here is the link code, which points to the website's base directory:

href="/?c=kondicionery&action=shop&cat_id=100"

It looks like the link is cut after the second equals sign, so the remainder "shop&cat_id=100" becomes relative to the current directory. All other links are normalized correctly.
I haven't edited the class code, only the handler function a little. I've also tested the buildURLFromLink function on its own and it works perfectly (the normalized URL is pokolenie-spb.ru/?c=kondicionery&action=shop&cat_id=100).
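
For reference, here is a minimal sketch of the link resolution described above. It assumes an http:// scheme (the log lines omit it) and uses a hypothetical helper, resolveLink(), which is not part of phpcrawl. It only illustrates that a root-relative link should keep its whole query string, while a truncated remainder such as "shop&cat_id=100" gets resolved against the referer's directory, giving a URL of the same shape as the ones in the 404 log.

```php
<?php
// Hypothetical helper for illustration only; not phpcrawl's own code.
// Resolves a link against the URL of the page it was found on.
// Simplified: does not handle "../" segments or fragment-only links.
function resolveLink(string $base, string $link): string
{
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Already absolute (scheme://...): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    // Protocol-relative (//host/...): reuse the base scheme.
    if (strpos($link, '//') === 0) {
        return $parts['scheme'] . ':' . $link;
    }
    // Root-relative (/...): attach to the origin, query string untouched.
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }
    // Path-relative: replace everything after the last slash of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

// Assumption: the scheme is http; the forum post shows the URLs without one.
$referer = 'http://pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html';

// The link as it appears in the page source (root-relative).
echo resolveLink($referer, '/?c=kondicionery&action=shop&cat_id=100'), "\n";

// The remainder left if the link were cut after the second "=".
echo resolveLink($referer, 'shop&cat_id=100'), "\n";
```

If the second result looks like the requested 404 URLs, the truncation probably happens before the link reaches the URL builder, which would fit with buildURLFromLink working correctly when called on its own.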

Any suggestions?

Thanks for your reply,
Alexandr

PS: One more question: is there a way to get the number of remaining URLs to index? Thanks.

Last edit: Anonymous 2013-12-24


Anonymous - 2014-01-02

Hi!

Thanks for the report! It really looks like phpcrawl is not building the URL correctly. I'll test it soon and let you know what I find.
I'll open a bug report for this.

And no, right now there is no way to get the number of URLs remaining in the cache to be crawled, sorry.

Thanks for the report!


Anonymous - 2014-01-02

Thanks for your work once again!


Anonymous - 2014-01-06

Hi again,

I just wanted to test your report on the site you mentioned in your first post, but I'm afraid I can't find the link href="/?c=kondicionery&action=shop&cat_id=100" anywhere on the page pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html.

I found a similar link (href="/?c=ventilyaciya&action=shop002&cat_id=&f04=&f05=&producer_id=..."), but this one works fine: the crawler rebuilds the URL correctly and the request is OK too (200).

So I don't know how to reproduce your problem right now, sorry!


Anonymous - 2014-01-23

I initialize the class this way. Maybe this causes the problem?

$crawler = new MyCrawler();
$crawler->setURL($domain);
$crawler->addContentTypeReceiveRule("#text/#");
$crawler->addContentTypeReceiveRule("#image/#");
$crawler->addContentTypeReceiveRule("#application/x-shockwave-flash#");
$crawler->addURLFilterRule("#(\()$# i"); //I'm filtering "("-bug
$crawler->setContentSizeLimit(10485760);
$crawler->setPageLimit(10000);
$crawler->enableAggressiveLinkSearch(FALSE);
$crawler->setLinkExtractionTags(array("href", "src", "background", "action"));
$crawler->go();

