Get the content for the whole domain

Help
Anonymous
2013-08-13
2013-08-14
  • Anonymous

    Anonymous - 2013-08-13

    Hi,

    I need to extract the content of an entire domain. As I understand it, this means I have to read $DocInfo->links_found and then iterate through that array to get each page's $DocInfo->content.
    The problem is that I always get only the first page's content. If I call the go() method again, I get an error.
    Many thanks in advance.

     
  • Anonymous

    Anonymous - 2013-08-13

    Hi!

    There's no need to iterate over the links_found array.
    PHPCrawl automatically follows every link in the domain and returns its content, so you only have to handle it.

    Just make sure you set the follow mode to 1 (stay in domain) or 2 (stay in host, which is the default).
    http://cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_setFollowMode.htm

     
    • Anonymous

      Anonymous - 2013-08-14

      Hi!
      Thanks a lot for your answer, but I am a little confused. If I understand correctly, "PHPCrawlerDocumentInfo" is the class you get back when you execute the "go" method, and one of its properties is "content", but that is only the content of the URL you have set. In addition, "PHPCrawlerDocumentInfo" has an array with all the links the page contains (links_found).
      I need the content of the URL I have set and the contents of all the URLs the crawler returns in the "links_found" array.
      You say that "Phpcrawl automatically follows every link of the domain and returns it's content". My question is: where do I receive the content of the additional URLs found on the website?
      Thanks in advance!

       
  • Anonymous

    Anonymous - 2013-08-14

    The handleDocumentInfo method gets called SEVERAL times after you execute the "go" method once: one call for EVERY URL the crawler finds on its way.
    The crawler starts with the root URL, gets all links from it and follows all of them (depending on your settings), then follows all the URLs it finds on those pages, and so on. For each of these URLs/links it calls the handleDocumentInfo method.

    Just take a look at the example and execute it, then it will become clear.
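
    A minimal sketch of that pattern, in case it helps. The include path, the start URL www.example.com, the class name DomainCrawler and its $collectedPages property are assumptions made for illustration; only the PHPCrawler base class, handleDocumentInfo, setURL, setFollowMode, addContentTypeReceiveRule, go and the $DocInfo properties come from the library.

    ```php
    <?php
    // Assumed location of the PHPCrawl main class; adjust to your installation.
    include("libs/PHPCrawler.class.php");

    class DomainCrawler extends PHPCrawler
    {
        // Hypothetical helper property: page contents keyed by URL.
        public $collectedPages = array();

        // The crawler calls this once for every document it requests,
        // not just for the start URL.
        function handleDocumentInfo($DocInfo)
        {
            if ($DocInfo->received) {
                $this->collectedPages[$DocInfo->url] = $DocInfo->content;
            }
        }
    }

    $crawler = new DomainCrawler();
    $crawler->setURL("www.example.com");                 // placeholder start URL
    $crawler->setFollowMode(1);                          // 1 = stay in domain
    $crawler->addContentTypeReceiveRule("#text/html#");  // only receive HTML content
    $crawler->go();                                      // call once; it returns when the crawl ends

    // After go() returns, every crawled page is available.
    foreach ($crawler->collectedPages as $url => $content) {
        echo $url . ": " . strlen($content) . " bytes\n";
    }
    ```

    Keeping every page in an array is only reasonable for small sites. For a larger domain it is probably better to process or store each page directly inside handleDocumentInfo.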

    Hope I could help.

     
    • Anonymous

      Anonymous - 2013-08-14

      Hi, dear colleague,

      Your support has been very useful to me. Thanks a lot!

       
