I'm seeing characters that shouldn't be there in the domains of URLs extracted from a page. Example:
http://farm9.194fstaticflickr.com
Should be
http://farm9.staticflickr.com
This is contained in the url_rebuild variable of the PHPCrawlerDocumentInfo::links_found_url_descriptors array.
I could understand it if there were a formatting error on the actual HTML page the link was extracted from, but when I visit the page the malformed link is definitely not there. I am seeing this on numerous links being extracted from websites.
All manual tests I have performed using the link extraction code do not replicate this behaviour. Am I missing something?
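(A minimal way to reproduce this check, written against the phpcrawl 0.8x API named in this thread; the include path and start URL are assumptions, not from the original post:)

    <?php
    // Sketch: print every rebuilt URL phpcrawl extracts, straight
    // from links_found_url_descriptors. Include path and start URL
    // are placeholders.
    set_time_limit(0);
    include("libs/PHPCrawler.class.php");

    class LinkDumpCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            // Each descriptor carries the rebuilt (absolute) URL
            foreach ($DocInfo->links_found_url_descriptors as $UrlDescriptor)
            {
                echo $UrlDescriptor->url_rebuild."\n";
            }
        }
    }

    $crawler = new LinkDumpCrawler();
    $crawler->setURL("http://www.flickr.com/photos/64riv/10246410266/");
    $crawler->go();
    ?>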
Hi!
Strange.
Could you post the URL of the page that e.g. "http://farm9.194fstaticflickr.com" was extracted from?
Then I can take a look.
Sorry it's taken so long to reply.
Here is the offending URL:
http://www.flickr.com/photos/64riv/10246410266/
The domain I have posted is not in the source.
Here is another example:
URL: http://atlantaga.creativeloafing.com/employment/
Link Found: http://atlantaga.cre5aeativeloafing.com/DriverJobs/otr-driver-wanted/20599945
Link in Source: http://atlantaga.creativeloafing.com/DriverJobs/otr-driver-wanted/20599945
Where has the 5ae come from?
I am saving these URLs to a MySQL DB. Here are some others that contain invalid characters in the domain:
cre5aeativeloafing.com
cre5b4ativeloafing.com
cre759ativeloafing.com
crea11d1tiveloafing.com
crea5aftiveloafing.com
creativeb4aloafing.com
creativelb4aoafing.com
creativeloafif0dng.com
creativeloafingb4a.com
creativelob62afing.com
Thanks!
I'll take a look and let you know. Really strange; URLs from outer space.
Last edit: Anonymous 2013-11-25
Hi again,
I just tested both sites (http://www.flickr.com/photos/64riv/10246410266/ and http://atlantaga.creativeloafing.com/employment/) with phpcrawl 0.81, and I can't reproduce the problem you described.
The URLs found by the crawler look OK; the malformed URLs you mentioned are not in the PHPCrawlerDocumentInfo::links_found_url_descriptors array over here.
I'm attaching two files containing the URLs the crawler found (url_rebuild).
So, is it possible that your code manipulates the URLs in some way?
...
Sorry for the test posts; I get a 500 error when posting to this forum.
Last edit: Pan European 2013-11-26
Thanks
I have done the same thing and tested the link extraction on the same URLs. The results were correct, with no malformed URLs. The only difference is that on the live system the URLs go into a DB table.
I have extended the URL Descriptor class with the following
In the Page Handler class I have extended the process method to save the results to db tables. The page links found are in the following:
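(A sketch of what such a handler override might look like; the table name, columns, and connection details are hypothetical, not the poster's actual code:)

    <?php
    // Hypothetical sketch of a page-handler override that stores
    // each extracted URL (url_rebuild) in a MySQL table. Table,
    // column, and credential names are placeholders.
    include("libs/PHPCrawler.class.php");

    class DbStoringCrawler extends PHPCrawler
    {
        private $db;

        function __construct()
        {
            parent::__construct();
            $this->db = new mysqli("localhost", "user", "pass", "crawlerdb");
        }

        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            $url = "";
            $stmt = $this->db->prepare("INSERT INTO links_found (url) VALUES (?)");
            $stmt->bind_param("s", $url); // bound by reference, reused below

            // Store every rebuilt URL exactly as the crawler delivered it
            foreach ($DocInfo->links_found_url_descriptors as $UrlDescriptor)
            {
                $url = $UrlDescriptor->url_rebuild;
                $stmt->execute();
            }

            $stmt->close();
        }
    }
    ?>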
Hi!
Yes, I'm getting 500 errors on SourceForge too sometimes.
And please understand that I can't give support for user-changed core code or user projects in general. I think you just have to do some debugging of your code against the mentioned sites and see where it fails; that's your job, not mine ;)
As far as I can see, (unchanged) phpcrawl is working correctly here.
I ask for your understanding on this!
And I recommend not changing any phpcrawl code if possible; otherwise you won't be able to update it anymore, for example.
Best regards!
Hi
I have not changed any of the core code. All of my modifications are extended classes that override the core code.
I think what I may do is keep a cache of the crawled pages, so that for any funny-looking URLs I find I can see exactly what the crawler was looking at when it extracted them.
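(A sketch of such a page cache, using phpcrawl's documented $DocInfo->source property; the cache directory is a placeholder:)

    <?php
    // Sketch: write every crawled page to disk so a suspicious
    // extracted URL can be traced back to the exact HTML the
    // crawler saw. Cache directory is a placeholder.
    include("libs/PHPCrawler.class.php");

    class CachingCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            if (!is_dir("/tmp/crawlcache"))
            {
                mkdir("/tmp/crawlcache", 0777, true);
            }

            // $DocInfo->source holds the document content as received
            $cachefile = "/tmp/crawlcache/".md5($DocInfo->url).".html";
            file_put_contents($cachefile, $DocInfo->source);
        }
    }
    ?>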
Hi Uwe,
I am facing the exact same issues that the user "Pan European" was facing two years back. However, I may have a clue that might help figure this out.
I have the PhpCrawl code running on two servers: one is a dedicated server at Hetzner, Germany and the other is a Linode VPS in the US. Both have Ubuntu 14.04 and are set up to be identical. However, the issue only manifests itself on the Germany server at Hetzner and not on the Linode VPS in the US. I have tested this multiple times to confirm it. I therefore think that this might be an issue with the environment settings.
Some examples are below. As you can see, there are random characters added to the domain name, like "33f", "000033cc", or "216". These are domain names extracted by the crawler, each followed by the page it was extracted from:
www.in.g33fov - http://www.in.gov/cgi-bin/ai/parser.pl/0113/www.in.gov/judiciary/3470.htm
tagessp000033cciegel.de - http://www.tagesspiegel.de/suchergebnis/?search-ressort=2876&search-day=20151226
ladnyd216om.pl - http://zdrowie.gazeta.pl/Zdrowie/1,101459,15714188,Klirens_kreatyniny.html
www.rodaleinc.com13c9y - http://www.prevention.com/health/health-concerns/12-ways-to-prevent-osteoporosis-and-broken-bones/7-mind-your-meds?slide=8
I am using v0.83 - multithreaded with SQLite. I have not made any changes to the original code and I can confirm that the above issues are from the core code.
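(For context, a basic v0.8x multi-process setup with the SQLite URL cache looks roughly like this; the start URL and process count are placeholders:)

    <?php
    // Rough sketch of the multi-process + SQLite configuration
    // mentioned above (phpcrawl 0.8x; needs PDO/SQLite support and
    // a POSIX system). Start URL and process count are placeholders.
    include("libs/PHPCrawler.class.php");

    class MyCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            echo $DocInfo->url."\n";
        }
    }

    $crawler = new MyCrawler();
    $crawler->setURL("http://www.example.com/");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->goMultiProcessed(5);
    ?>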
Let me know if you have any thoughts on how I can go about fixing this.
Vinay
P.S. Love the software you've written, Uwe. Apart from this one strange issue, I've had absolutely no issues with it.
Hi Vinay,
thanks for your report!
So this is REALLY strange!
I have never encountered this problem, and I don't know of any user having it, except you and Pan European here.
The really strange thing: WHY the hell do these strange characters appear right in the middle of the hostname, without any exceptional characters (like umlauts or the like) in the original hostname? And what ARE these strange characters? Some sort of code representation??
This is a mystery for the X-Files ;)
OK, I'm just unable to fix this without being able to reproduce the behaviour.
BUT (this is just an idea):
Would it be possible to get (strictly limited) SSH access to your German Hetzner server, so that I can track down the problem and hopefully fix it?
That would be great!
I'm really getting curious about this now too!
Best regards,
Agent Mulder.
Just added a bug-report for this:
https://sourceforge.net/p/phpcrawl/bugs/97/
Sure, Agent Mulder. Should I send you the SSH credentials through SourceForge email?
Yes, or send them directly to phpcrawl@cuab.de if you want.
Emailed the credentials. Let me know if you need anything else.