URL normalizer problem

Help
Anonymous
2013-12-24
2014-01-23
  • Anonymous

    Anonymous - 2013-12-24

    Hello. First of all, thanks for your great work! I've read through three pages of the forum but couldn't find anything on this.

    I'm trying to configure phpcrawl for my needs and am crawling different sites for testing. Everything is OK so far, except for the known bugs and a problem with URL normalization that I found on the following site. I don't know whether it's a phpcrawl problem or my own, but the log looks like this:

    Page requested:
    pokolenie-spb.ru/ventilyaciya/ventilyatory/shop002&cat_id=&f04=&f05=&producer_id=&ob=&asc_desc=&cost=&cost1=&page=12 (404)
    Referer-page:
    pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html

    There are a lot of 404s like this, so they slow down the process.

    No such page exists on this site, and there is no such link in the referrer page's source. Here is the link code, which points to the website's base directory:

    href="/?c=kondicionery&action=shop&cat_id=100"

    It looks like the link is cut after the second equals sign, and the remainder, "shop&cat_id=100", is then treated as relative to the current directory. All other links are normalized correctly.
    I haven't edited any class code, only the handler function a little. I also tested the buildURLFromLink function on its own and it works perfectly (the normalized URL is pokolenie-spb.ru/?c=kondicionery&action=shop&cat_id=100).
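
To make the difference concrete, here is a minimal sketch of how the two forms of the link would resolve against the referer page. The resolveLink helper is hypothetical and is not part of phpcrawl; it only handles root-relative and path-relative links, and the http:// scheme is an assumption, since the log above omits it.

```php
<?php
// Hypothetical helper (not part of phpcrawl): resolves a link against a base
// URL for the two simple cases discussed here. Not a full RFC 3986 resolver.
function resolveLink(string $base, string $link): string
{
    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Root-relative link: replaces the whole path and query of the base.
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }

    // Path-relative link: appended to the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

// Assumed scheme; host and path are taken from the log.
$referer = 'http://pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html';

// The full link as it appears in the page source.
$full = resolveLink($referer, '/?c=kondicionery&action=shop&cat_id=100');

// The fragment left over if everything up to the second "=" is cut off.
$truncated = resolveLink($referer, 'shop&cat_id=100');

echo $full, PHP_EOL;
// http://pokolenie-spb.ru/?c=kondicionery&action=shop&cat_id=100
echo $truncated, PHP_EOL;
// http://pokolenie-spb.ru/ventilyaciya/ventilyatory/shop&cat_id=100
```

The second result has the same shape as the 404 URLs in the log, which is consistent with the link being truncated before it is resolved. It does not show where the truncation happens.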

    Any suggestions?

    Thanks for your reply,
    Alexandr

    PS: One more question. Is there a way to get the number of remaining URLs to index? Thanks.

     

    Last edit: Anonymous 2013-12-24
  • Anonymous

    Anonymous - 2014-01-02

    Hi!

    Thanks for the report! It really looks like phpcrawl is not building the URL correctly. I'll test it soon and let you know what I find.
    I'll open a bug report for this.

    And no, right now there is no way to get the number of URLs in the cache that remain to be crawled, sorry.

    Thanks for the report!

     
  • Anonymous

    Anonymous - 2014-01-02

    Thanks for your work once again!

     
  • Anonymous

    Anonymous - 2014-01-06

    Hi again,

    I just wanted to test your report on the site you mentioned in your first post, but I'm afraid I can't find the link href="/?c=kondicionery&action=shop&cat_id=100" anywhere on the page pokolenie-spb.ru/ventilyaciya/ventilyatory/page-11.html.

    I found a similar link (href="/?c=ventilyaciya&action=shop002&cat_id=&f04=&f05=&producer_id=..."), but this one works fine: the crawler rebuilds the URL correctly and the request is OK too (200).

    So I don't know how to reproduce your problem right now, sorry!

     
  • Anonymous

    Anonymous - 2014-01-23

    I initialize the class this way. Maybe this causes the problem?

    $crawler = new MyCrawler();
    $crawler->setURL($domain);
    $crawler->addContentTypeReceiveRule("#text/#");
    $crawler->addContentTypeReceiveRule("#image/#");
    $crawler->addContentTypeReceiveRule("#application/x-shockwave-flash#");
    $crawler->addURLFilterRule("#(\()$# i"); // filter out URLs ending in "(" (workaround for the "("-bug)
    $crawler->setContentSizeLimit(10485760);
    $crawler->setPageLimit(10000);
    $crawler->enableAggressiveLinkSearch(FALSE);
    $crawler->setLinkExtractionTags(array("href", "src", "background", "action"));
    $crawler->go();
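
One possible explanation, not confirmed anywhere in this thread, is that adding "action" to the link extraction tags lets the extractor match the action= query parameter inside an href value. The sketch below uses a deliberately naive, hypothetical matcher to show that failure mode. It is not phpcrawl's actual extraction pattern, and the extractAttributeValues function is invented for illustration.

```php
<?php
// Hypothetical, simplified attribute matcher. This is NOT phpcrawl's real
// extraction code; it only illustrates the assumed failure mode.
function extractAttributeValues(string $html, array $attributes): array
{
    $found = [];
    foreach ($attributes as $name) {
        // Naive pattern: the attribute name may match anywhere in the markup,
        // including inside another attribute's value.
        $pattern = '#' . preg_quote($name, '#') . '\s*=\s*["\']?([^"\'\s>]+)#i';
        if (preg_match_all($pattern, $html, $m)) {
            foreach ($m[1] as $value) {
                $found[] = [$name, $value];
            }
        }
    }
    return $found;
}

// The link quoted in the first post, wrapped in a minimal anchor tag.
$html = '<a href="/?c=kondicionery&action=shop&cat_id=100">shop</a>';
$found = extractAttributeValues($html, ['href', 'src', 'background', 'action']);

foreach ($found as [$name, $value]) {
    echo $name, ' => ', $value, PHP_EOL;
}
// href => /?c=kondicionery&action=shop&cat_id=100
// action => shop&cat_id=100
```

If something like this is happening, removing "action" from the setLinkExtractionTags() array and crawling again would be a quick way to check whether the 404s disappear.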

     
