I have successfully used PHPCrawl with multiple URLs, however when trying to use it with http://pastebin.com I get a Host unreachable error from $DocInfo->error_string:
"Error connecting to http://pastebin.com: Host unreachable (Connection timed out)"
Please let me know how I can get this to work and snippets of code if there is anything unusual.
Thank you!
Hi!
Did you try increasing the timeouts like this?
$crawler->setStreamTimeout(5); // defaults to 2 seconds
$crawler->setConnectionTimeout(10); // defaults to 5 seconds
Also see the first Q&A in the FAQ section:
http://cuab.de/faq.html
Hope this will help.
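For reference, here is a minimal, self-contained setup with the raised timeouts in place. This is only a sketch: the include path and the callback body are assumptions, so adjust them to your installation.

<?php
// Minimal PHPCrawl setup with raised timeouts.
// The include path below is an assumption; point it at your copy of the library.
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // Called once for every document the crawler processes.
    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . " -> " . $DocInfo->http_status_code . "\n";
        if ($DocInfo->error_occured)
            echo "Error: " . $DocInfo->error_string . "\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://pastebin.com");
$crawler->setStreamTimeout(30);     // default is 2 seconds
$crawler->setConnectionTimeout(30); // default is 5 seconds
$crawler->go();
?>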
Yes, I increased timeouts to 30 seconds.
I even tried it at 300 seconds each. It still doesn't work.
Hmm, do you use a proxy?
And are you able to retrieve pages from pastebin.com with wget (or something else) from the server running your script?
Maybe they blocked your IP, or they blocked the user agent "phpcrawl"
(try changing it with setUserAgentString()).
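For example (the user-agent string below is just an example value; any common browser string will do):

$crawler->setUserAgentString("Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0");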
I'll give it a try tomorrow and see what happens from here.
Hi again,
I did some tests with phpcrawl and http://pastebin.com, and you are right, it doesn't work.
It's a little confusing, but here is what happens:
As soon as you try to access pastebin.com with phpcrawl, the server stops answering and your IP gets blocked for about 10 minutes. During these 10 minutes the server won't accept connections from your IP at all, no matter what client you use (browser, wget, ping etc.); you always get the "Connection timed out" error you mentioned.
So there's something the server doesn't like about the request headers phpcrawl sends.
I don't know yet what it is, but it's not the user-agent string.
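If you want to test this by hand, a rough sketch (not from the original thread; the header values here are just examples, and your PHP must have allow_url_fopen enabled) is to replay a bare GET with hand-picked headers and see whether the server still answers:

<?php
// Send a plain GET with browser-like headers so PHPCrawl's own
// request headers are not involved. If the server still times out,
// your IP is probably in the ~10-minute block; wait before retrying.
$context = stream_context_create(array(
    "http" => array(
        "method"  => "GET",
        "header"  => "User-Agent: Mozilla/5.0 (test)\r\n" .
                     "Accept: text/html\r\n",
        "timeout" => 10,
    )
));
$html = @file_get_contents("http://pastebin.com/", false, $context);
echo ($html === false) ? "No answer (blocked?)\n"
                       : "Got " . strlen($html) . " bytes\n";
?>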
I'll let you know if I find a fix.
I'll open a bug report for this.
The problem should be fixed.
See this bug report https://sourceforge.net/p/phpcrawl/bugs/49/ and the fixed PHPCrawlerHTTPRequest class attached there.