Hi Huni,
it's me again. Been testing out the software, and when trying to index a whole site, I keep running into a 'The connection to the server was reset while the page was loading.' error, which occurs on the following statement:
preg_match_all("/<{0,}a{0,}(?<= |\n|\r)(?:".$match_part."){0,}={0,}{0,1}({0,}){0,}>((?:(?!<*\/a*>).)*)<*\/a*>/ is", $source, $regs);
(Line 202 in PHPCrawlerutils.class.php)
I am able to reproduce this problem in 6 hops using the following config changes to the base example.php file. The problem occurs in both IE and Firefox. Please excuse the lame celebrity website content (it was just a blog site I picked for my tests).
$crawler->setURL("http://www.wwtdd.com/page/1256/");
//SET MATCHES TO FOLLOW
$crawler->addFollowMatch("/page/");
$crawler->addFollowMatch("/{4}\/{2}/");
$crawler->addLinkPriority("/page/", 10);
I am not sure if this is a bug in PHP, or if there is a problem in the preg_match_all statement. Any ideas? I've researched all over and still have no clue what the problem is. As this is my first site I am indexing, I am worried that this might happen on other sites as well.
As always, any help or guidance would be greatly appreciated.
Thanks,
Ron
Hi Ron,
the crawler simply refuses to process celebrity-yellowpress-sites by nature ;)
No, but I'm sorry, I can't reproduce the error you're describing over here so far.
When using your crawler setup and running the crawler on the mentioned site, everything seems to work fine
(the crawler finds a lot of javascript-trash links in that site though, but that's not part of the problem).
Did you try to run your script from the command line (CLI) instead of in a browser environment?
Maybe that's the only problem.
In a browser environment it can happen that the browser or the webserver "thinks" that your script is not responding anymore, whereas the crawler is just still busy processing a site that takes a little longer.
In general it's strongly recommended to run a crawling process from the command line.
Besides, if that isn't the problem, you may get a deeper look into the cause if you run your script from the command line (e.g. if the script exits with a segmentation fault, you won't notice that when running it in a browser).
Just let me know if that didn't fix the problem.
Best regards,
huni.
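The advice above can be sketched as a small guard at the top of the crawl script. This is only an illustration: `is_cli()` and `prepare_long_run()` are hypothetical helper names, not part of PHPCrawl, and the crawler setup itself is left out.

```php
<?php
// Sketch only: the helper names below are illustrative, not part of PHPCrawl.

/** True when the script was started by the command-line PHP binary. */
function is_cli(): bool
{
    return PHP_SAPI === 'cli';
}

/** Prepare a long crawl: report every error and lift the execution time limit. */
function prepare_long_run(): void
{
    error_reporting(E_ALL);
    ini_set('display_errors', '1');
    set_time_limit(0); // no limit; set explicitly so the intent is visible
}

if (!is_cli()) {
    // Started through a webserver: a long crawl may be cut off by a timeout.
    echo "Please run this script from the command line: php example.php\n";
} else {
    prepare_long_run();
    // ... configure and start the crawler here, as in example.php ...
}
```

Started with `php example.php` from a shell, the script then prints warnings and fatal errors directly to the terminal instead of ending in a reset connection.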
Hi Huni,
Thanks so much for the tip. I didn't know much about running PHP in command-line mode. It makes total sense, but I've always kicked off PHP code in the browser. As you can probably tell, I don't come from a pure CS programming background. Running in command-line mode gave me error messages which were not displayed in web mode. I am not sure why my error started happening. I mean, I successfully crawled almost 1000 pages until I reached this point, but the error pointed out that the include file for my DB connection all of a sudden could not be found. It's baffling to me, but I'm sure there is some relative path problem that somehow occurs.
OK, back up and running. No more browser mode for me. It's all command-line mode from here. :)
Thanks again for the help.
Ron
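The thread does not say what actually caused the missing include. One common cause of this kind of error is that a relative `include` path is resolved against the current working directory, which can differ between a webserver and a shell. A minimal sketch of anchoring the path to the script's own directory instead, assuming a hypothetical include file named `db_connect.php` and a hypothetical helper `path_beside()`:

```php
<?php
// Sketch only: "db_connect.php" and path_beside() are hypothetical names.

/** Build a path for a file inside the given directory, avoiding doubled separators. */
function path_beside(string $dir, string $file): string
{
    return rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $file;
}

// __DIR__ is the directory of this script, independent of where it was started from.
$dbInclude = path_beside(__DIR__, 'db_connect.php');

if (is_file($dbInclude)) {
    require_once $dbInclude;
} else {
    // Report the full path so a wrong location is easy to spot.
    error_log("Missing include: $dbInclude");
}
```

Because the path no longer depends on the working directory, the script behaves the same whether it is started from its own folder, from another folder, or by a scheduler.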
Hey Ron,
Nice to hear!
Good luck!