
connection reset error on a preg_match_all

Help
RSP
2010-12-08
2013-04-09
  • RSP

    RSP - 2010-12-08

    Hi Huni,

    It's me again. I've been testing out the software, and when trying to index a whole site I keep running into a 'The connection to the server was reset while the page was loading.' error, which occurs on the following statement:

    preg_match_all("/<{0,}a{0,}(?<= |\n|\r)(?:".$match_part."){0,}={0,}{0,1}({0,}){0,}>((?:(?!<*\/a*>).)*)<*\/a*>/ is", $source, $regs);

    (Line 202 in PHPCrawlerutils.class.php)

    I am able to duplicate this problem in 6 hops using the following config changes to the base example.php file. The problem occurs in both IE and Firefox. Please excuse the lame celebrity website content (it was just a blog site I picked for my tests).
    $crawler->setURL("http://www.wwtdd.com/page/1256/");

    //SET MATCHES TO FOLLOW
    $crawler->addFollowMatch("/page/");
    $crawler->addFollowMatch("/{4}\/{2}/");

    $crawler->addLinkPriority("/page/", 10);

    I am not sure if this is a bug in PHP, or if there is a problem in the preg_match_all statement.   Any ideas?   I've researched all over and still have no clue what the problem is.   As this is my first site I am indexing, I am worried that this might happen on other sites as well. 
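A minimal sketch of one way to narrow this down, assuming the failure is a PCRE limit being hit rather than a crash: check `preg_last_error()` after the call. The helper name `safe_match_all` and the sample pattern are hypothetical and not part of PHPCrawl. This cannot catch a hard crash of the PHP process (e.g. a segmentation fault), which may still surface in the browser as a reset connection.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): runs preg_match_all and
// raises an exception on failure instead of silently returning false.
function safe_match_all(string $pattern, string $source): array
{
    $matches = [];
    // @ suppresses the warning; the failure is reported via the exception below.
    $result = @preg_match_all($pattern, $source, $matches);

    if ($result === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        $code = preg_last_error();
        $names = [
            PREG_INTERNAL_ERROR        => 'internal error',
            PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
            PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
            PREG_BAD_UTF8_ERROR        => 'malformed UTF-8',
        ];
        $reason = $names[$code] ?? "code $code";
        throw new RuntimeException("preg_match_all failed: $reason");
    }

    return $matches;
}

// Usage with a deliberately simple, made-up link pattern:
$html = '<a href="/page/1">one</a><a href="/page/2">two</a>';
$regs = safe_match_all('/<a href="([^"]*)"/', $html);
print_r($regs[1]);
```

If the exception reports an exhausted backtrack or recursion limit, the `pcre.backtrack_limit` and `pcre.recursion_limit` ini settings are the usual places to look.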

    As always, any help or guidance would be greatly appreciated.

    Thanks,

    Ron

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2010-12-08

    Hi Ron,

    the crawler simply refuses to process celebrity-yellowpress-sites by nature ;)

    No, seriously: I'm sorry, but I can't reproduce the error you're describing over here so far.
    When using your crawler setup and running the crawler on the mentioned site, everything seems to work fine
    (the crawler finds a lot of JavaScript trash links on that site, but that's not part of the problem).

    Did you try to run your script from the command line (CLI) instead of in a browser environment?

    Maybe that's the only problem.
    In a browser environment, the browser or the web server may "think" that your script is no longer responding when the crawler is simply still busy processing a site for a while longer.

    In general, it's strongly recommended to run a crawling process from the command line.

    Besides, if that isn't the problem, running your script from the command line may give you a deeper look into the cause (e.g. if the script exits with a segmentation fault, you won't notice that when running it in a browser).
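As a rough sketch of the advice above, a script can check which environment it is running in and make errors visible. The function name `running_from_cli` is made up for illustration. Note that `set_time_limit(0)` only lifts PHP's own execution limit; it does not change any timeout the web server or the browser applies.

```php
<?php
// Hypothetical helper: true when the script runs under the command-line SAPI.
function running_from_cli(): bool
{
    return PHP_SAPI === 'cli';
}

// Report every error and print it, so failures are not hidden.
error_reporting(E_ALL);
ini_set('display_errors', '1');

if (running_from_cli()) {
    echo "Running from the command line.\n";
} else {
    // In a web environment, lift PHP's execution time limit (0 = no limit).
    set_time_limit(0);
    echo "Running in a web environment; consider the command line for long crawls.\n";
}
```

From a shell, the script would then be started with something like `php example.php`.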

    Just let me know if that didn't fix the problem.

    Best regards,

    huni.

     
  • RSP

    RSP - 2010-12-09

    Hi Huni,

    Thanks so much for the tip. I didn't know much about running PHP in command-line mode. It makes total sense, but I've always kicked off PHP code in the browser. As you can probably tell, I don't come from a pure CS programming background. Running in command-line mode gave me error messages that were not displayed in web mode. I am not sure why my error started happening. I had successfully crawled almost 1000 pages before reaching this point, but the error showed that the include file for my DB connection suddenly could not be found. It's baffling to me, but I suspect some relative path problem is involved.
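If a relative path really is the cause (that is only a guess here), one common way to make an include independent of the current working directory is to anchor it to the script's own directory with `__DIR__`. A minimal sketch; the helper `path_from` and the file name `db_connection.php` are hypothetical stand-ins, not names from the thread.

```php
<?php
// Hypothetical helper: joins a base directory and a relative file name
// into one path, normalising the slashes at the seam.
function path_from(string $baseDir, string $relative): string
{
    return rtrim($baseDir, '/\\') . DIRECTORY_SEPARATOR . ltrim($relative, '/\\');
}

// 'db_connection.php' is a made-up name standing in for the DB include.
$dbInclude = path_from(__DIR__, 'db_connection.php');

if (is_file($dbInclude)) {
    require_once $dbInclude;
} else {
    // error_log() works in both web and command-line environments.
    error_log("Include not found: $dbInclude");
}
```

Because `__DIR__` always refers to the directory of the file it appears in, the resulting path does not depend on where the script was started from.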

    OK, back up and running. No more browser mode for me. It's all command-line mode from here. :)

    Thanks again for the help.

    Ron

     
  • Nobody/Anonymous

    Hey Ron,

    OK, back up and running

    Nice to hear!

    Good luck!

     
