
Crawls some domains, but not all

Help
Anonymous
2015-03-28
2015-03-31
  • Anonymous

    Anonymous - 2015-03-28

    Hello,

    For some reason, my PHPCrawl script crawls some domains fully but not others.

    With some domains it crawls only the first page and does not follow links to subpages, so it finds just 1 page. When I crawl the same domains with a desktop crawler (Screaming Frog), it finds many subpages.

    For example, if I crawl this page, it returns only 1 page and stops: http://chic.clinic.si/

    Any idea why?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-03-28

    Hi,

    Did you try changing the basic followMode to 1?

    By default the crawler only follows links to the same HOST, and the site you mentioned only seems to contain links to a different host.
    (The entry host is "chic.clinic.si", but the links lead to "www.chic.clinic.si".)

    http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_setFollowMode.htm

    You may also simply change the root URL to "www.chic.clinic.si" (same site) without changing the follow mode; the links then lead to the same host as the entry URL.
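
To make the host-versus-domain distinction above concrete, here is a minimal Python sketch of the decision a crawler has to make for each link. It is not PHPCrawl's implementation: the helper names (`host_of`, `naive_domain`, `should_follow`) are hypothetical, the mode numbers only mirror the meanings discussed in this thread (0 = follow everything, 1 = same domain including subdomains, 2 = same host, which is the default behaviour described above), and the "last two labels" domain rule is a simplifying assumption that breaks for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        # urlsplit only fills in .hostname when a scheme is present
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()


def naive_domain(host):
    """Assumption: the registrable domain is the last two labels of the host."""
    return ".".join(host.split(".")[-2:])


def should_follow(entry_url, link_url, follow_mode):
    """Decide whether a link found on the entry site should be followed."""
    entry_host = host_of(entry_url)
    link_host = host_of(link_url)
    if follow_mode == 0:
        # Follow everything, including external domains
        return True
    if follow_mode == 1:
        # Same domain, so subdomains such as "www." are allowed
        return naive_domain(entry_host) == naive_domain(link_host)
    if follow_mode == 2:
        # Exact host match only
        return entry_host == link_host
    raise ValueError("only modes 0, 1 and 2 are sketched here")


if __name__ == "__main__":
    entry = "http://chic.clinic.si/"
    link = "http://www.chic.clinic.si/page"
    for mode in (0, 1, 2):
        print(mode, should_follow(entry, link, mode))
```

With an exact host match, a link from "chic.clinic.si" to "www.chic.clinic.si" is rejected, which is why only one page is found; starting from the "www." host or comparing domains instead of hosts lets the link through.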

     
  • Anonymous

    Anonymous - 2015-03-28

    Hi,

    It turns out I have to set the followMode to 0 to make it crawl the site. Unfortunately, that also makes the crawler follow external domains.

    Any idea why followMode 1 is not working in this case?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-03-29

    Hi,

    OK, it seems you are hitting the bug reported here:
    https://sourceforge.net/p/phpcrawl/bugs/67/

    I haven't addressed this bug so far, but you may try the workaround posted
    by a user there.

    One question: if you set the root URL to "www.chic.clinic.si", does it not work
    either?

     
  • Anonymous

    Anonymous - 2015-03-31

    Hi,

    If I set the www.* host, then it works.
    My list of domains does not include the www. prefix, but now that I think about it, maybe I should add it, since I assume that is the preferred form under which websites are indexed.

    I will leave the followMode at 1, because I actually want to crawl subdomains.

    Thanks Uwe

     
