Hi,
I need to extract the whole content of one domain. This means that I have to get $DocInfo->links_found and then iterate through this array to get every $DocInfo->content.
The problem is that I always get only the first page's content, and if I try to execute the go method again I get an error.
Many thanks in advance
Hi!
There's no need to iterate over the links_found-array.
PHPCrawl automatically follows every link of the domain and returns its content, so you only have to handle it.
Just make sure you set the follow mode to 1 (stay in domain) or 2 (stay in host, which is the default).
http://cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_setFollowMode.htm
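To illustrate, here is a minimal sketch of that setting in use (the include path and start URL are placeholders for your own setup; the crawler subclass follows the usual pattern of extending PHPCrawler):

<?php
// Path depends on where PHPCrawl is installed in your project.
include("libs/PHPCrawler.class.php");

// A minimal crawler that simply reports each document it receives.
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    echo $DocInfo->url . "\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com"); // placeholder start URL
$crawler->setFollowMode(1);          // 1 = stay in domain, 2 = stay in host (default)
$crawler->go();                      // call go() only once
?>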
Hi!!
Thanks a lot for your answer, but I am a little bit confused. If I understand correctly, "PHPCrawlerDocumentInfo" is a class that is returned when you execute the "go" method, and one of its properties is "content"; but that is the content of the URL that you have set. In addition, "PHPCrawlerDocumentInfo" has an array with all the links that the website contains (links_found).
I need to get the content of the URL that I have set as well as the contents of all the URLs that the crawler returns in the "links_found" array.
You say that "PHPCrawl automatically follows every link of the domain and returns its content". My question is: where do I receive the content of the additional URLs found on the website?
Thanks in advance!
The handleDocumentInfo method will get called SEVERAL times after you have executed the "go" method once: once for EVERY URL the crawler finds on its way.
The crawler starts with the root URL, gets all links from it and follows them (depending on your settings), then it again follows all URLs it finds on those pages, and so on. For each of these URLs/links it calls the handleDocumentInfo method.
Just take a look at the example and execute it; then it becomes clear.
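As a rough sketch of what that looks like in practice (the include path and start URL are again placeholders, and collecting every page's content into one string is just one possible way to handle it):

<?php
// Path depends on where PHPCrawl is installed.
include("libs/PHPCrawler.class.php");

class WholeDomainCrawler extends PHPCrawler
{
  public $whole_content = "";

  // This method is called once per URL/document the crawler receives,
  // NOT just once for the root URL.
  function handleDocumentInfo($DocInfo)
  {
    echo "Received: " . $DocInfo->url . " (HTTP " . $DocInfo->http_status_code . ")\n";

    // Append this page's content to the collected content of the domain.
    $this->whole_content .= $DocInfo->content;
  }
}

$crawler = new WholeDomainCrawler();
$crawler->setURL("www.example.com");                // placeholder start URL
$crawler->addContentTypeReceiveRule("#text/html#"); // only receive HTML documents
$crawler->setFollowMode(1);                         // stay within the domain
$crawler->go();                                     // one call; handleDocumentInfo fires per page

echo "Total collected content: " . strlen($crawler->whole_content) . " bytes\n";
?>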
Hope I could help.
Hi dear colleague,
Your support has been very useful to me. Thanks a lot!