I'm seeing characters that shouldn't be there in the domains of URLs extracted from a page. Example:
http://farm9.194fstaticflickr.com
Should be
http://farm9.staticflickr.com
This is contained in the url_rebuild variable of the PHPCrawlerDocumentInfo::links_found_url_descriptors array.
I could understand it if there were a formatting error on the actual HTML page the link was extracted from, but when I visit the page the malformed link is definitely not there. I am seeing this on numerous links being extracted from websites.
All manual tests I have performed using the link extraction code do not replicate this behaviour. Am I missing something?
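(A minimal way to reproduce this check, written against the phpcrawl 0.8x API named in this thread; the include path and start URL are assumptions, not from the original post:)

    <?php
    // Sketch: print every rebuilt URL phpcrawl extracts, straight
    // from links_found_url_descriptors. Include path and start URL
    // are placeholders.
    set_time_limit(0);
    include("libs/PHPCrawler.class.php");

    class LinkDumpCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            // Each descriptor carries the rebuilt (absolute) URL
            foreach ($DocInfo->links_found_url_descriptors as $UrlDescriptor)
            {
                echo $UrlDescriptor->url_rebuild."\n";
            }
        }
    }

    $crawler = new LinkDumpCrawler();
    $crawler->setURL("http://www.flickr.com/photos/64riv/10246410266/");
    $crawler->go();
    ?>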
Hi!
Strange.
Could you post the URL of the page that e.g. "http://farm9.194fstaticflickr.com" was extracted from?
Then I can take a look.
Sorry it's taken so long to reply.
Here is the offending URL:
http://www.flickr.com/photos/64riv/10246410266/
The domain I have posted is not in the source.
Here is another example:
URL: http://atlantaga.creativeloafing.com/employment/
Link Found: http://atlantaga.cre5aeativeloafing.com/DriverJobs/otr-driver-wanted/20599945
Link in Source: http://atlantaga.creativeloafing.com/DriverJobs/otr-driver-wanted/20599945
Where has the 5ae come from?
I am saving these URLs to a MySQL DB. Here are some others that contain invalid characters in the domain:
cre5aeativeloafing.com
cre5b4ativeloafing.com
cre759ativeloafing.com
crea11d1tiveloafing.com
crea5aftiveloafing.com
creativeb4aloafing.com
creativelb4aoafing.com
creativeloafif0dng.com
creativeloafingb4a.com
creativelob62afing.com
Thanks!
I'll take a look and let you know. Really strange; URLs from outer space.
Last edit: Anonymous 2013-11-25
Hi again,
I just tested both sites (http://www.flickr.com/photos/64riv/10246410266/ and http://atlantaga.creativeloafing.com/employment/) with phpcrawl 0.81, and I can't reproduce the problem you described.
The URLs found by the crawler look OK; the malformed URLs you mentioned are not in the PHPCrawlerDocumentInfo::links_found_url_descriptors array over here.
I'm attaching two files containing the URLs the crawler found (url_rebuild).
So, is it possible that your code manipulates the URLs in some way?
...
Sorry for the test posts; I get a 500 error when posting to this forum.
Last edit: Pan European 2013-11-26
Thanks
I have done the same thing and tested the link extraction on the same URLs. The results were correct, with no malformed URLs. The only difference is that on the live system the URLs go into a DB table.
I have extended the URL Descriptor class with the following
In the Page Handler class I have extended the process method to save the results to db tables. The page links found are in the following:
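(A sketch of what such a handler override might look like; the table name, columns, and connection details are hypothetical, not the poster's actual code:)

    <?php
    // Hypothetical sketch of a page-handler override that stores
    // each extracted URL (url_rebuild) in a MySQL table. Table,
    // column, and credential names are placeholders.
    include("libs/PHPCrawler.class.php");

    class DbStoringCrawler extends PHPCrawler
    {
        private $db;

        function __construct()
        {
            parent::__construct();
            $this->db = new mysqli("localhost", "user", "pass", "crawlerdb");
        }

        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            $url = "";
            $stmt = $this->db->prepare("INSERT INTO links_found (url) VALUES (?)");
            $stmt->bind_param("s", $url); // bound by reference, reused below

            // Store every rebuilt URL exactly as the crawler delivered it
            foreach ($DocInfo->links_found_url_descriptors as $UrlDescriptor)
            {
                $url = $UrlDescriptor->url_rebuild;
                $stmt->execute();
            }

            $stmt->close();
        }
    }
    ?>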
Hi!
Yes, I'm getting 500 errors on SourceForge too sometimes.
And please understand that I can't give support for user-changed core code or user projects in general. I think you just have to do some debugging of your code against the mentioned sites and see where it fails; that's your job, not mine ;)
As far as I can see, (unchanged) phpcrawl is working correctly here.
I ask for your understanding on this!
And I recommend not changing any phpcrawl code if possible; otherwise you won't be able to update it anymore, for example.
Best regards!
Hi
I have not changed any of the core code. All of my modifications are extended classes that override the core code.
I think what I may do is keep a cache of the crawled pages, so that for any funny-looking URLs I find I can see exactly what the crawler was looking at when it extracted them.
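(A sketch of such a page cache, using phpcrawl's documented $DocInfo->source property; the cache directory is a placeholder:)

    <?php
    // Sketch: write every crawled page to disk so a suspicious
    // extracted URL can be traced back to the exact HTML the
    // crawler saw. Cache directory is a placeholder.
    include("libs/PHPCrawler.class.php");

    class CachingCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            if (!is_dir("/tmp/crawlcache"))
            {
                mkdir("/tmp/crawlcache", 0777, true);
            }

            // $DocInfo->source holds the document content as received
            $cachefile = "/tmp/crawlcache/".md5($DocInfo->url).".html";
            file_put_contents($cachefile, $DocInfo->source);
        }
    }
    ?>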
Hi Uwe,
I am facing the exact same issues that the user "Pan European" was facing two years back. However, I may have a clue that might help figure this out.
I have the PhpCrawl code running on two servers: one is a dedicated server at Hetzner, Germany and the other is a Linode VPS in the US. Both have Ubuntu 14.04 and are set up to be identical. However, the issue only manifests itself on the Germany server at Hetzner and not on the Linode VPS in the US. I have tested this multiple times to confirm it. I therefore think that this might be an issue with the environment settings.
Some examples are below. As you can see, there are random characters added to the domain name, like "33f", "000033cc", or "216". These are domain names extracted by the crawler, each followed by the page it was extracted from:
www.in.g33fov - http://www.in.gov/cgi-bin/ai/parser.pl/0113/www.in.gov/judiciary/3470.htm
tagessp000033cciegel.de - http://www.tagesspiegel.de/suchergebnis/?search-ressort=2876&search-day=20151226
ladnyd216om.pl - http://zdrowie.gazeta.pl/Zdrowie/1,101459,15714188,Klirens_kreatyniny.html
www.rodaleinc.com13c9y - http://www.prevention.com/health/health-concerns/12-ways-to-prevent-osteoporosis-and-broken-bones/7-mind-your-meds?slide=8
I am using v0.83 - multithreaded with SQLite. I have not made any changes to the original code and I can confirm that the above issues are from the core code.
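(For context, a basic v0.8x multi-process setup with the SQLite URL cache looks roughly like this; the start URL and process count are placeholders:)

    <?php
    // Rough sketch of the multi-process + SQLite configuration
    // mentioned above (phpcrawl 0.8x; needs PDO/SQLite support and
    // a POSIX system). Start URL and process count are placeholders.
    include("libs/PHPCrawler.class.php");

    class MyCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            echo $DocInfo->url."\n";
        }
    }

    $crawler = new MyCrawler();
    $crawler->setURL("http://www.example.com/");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->goMultiProcessed(5);
    ?>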
Let me know if you have any thoughts on how I can go about fixing this.
Vinay
P.S. Love the software you've written, Uwe. Apart from this one strange issue, I've had absolutely no issues with it.
Hi Vinay,
thanks for your report!
So this is REALLY strange!
I have never encountered this problem, and I don't know of any user having it, except you and Pan European here.
The really strange thing: WHY the hell do these strange characters appear right in the middle of the hostname, without any exceptional characters (like umlauts or the like) in the original hostname? And what ARE these strange characters? Some sort of code representation??
This is a mystery for the X-Files ;)
OK, I'm just unable to fix this without being able to reproduce the behaviour.
BUT (this is just an idea):
Would it be possible to get (strictly limited) SSH access to your German Hetzner server, so that I can track down the problem and hopefully fix it?
That would be great!
I'm really getting curious about this now too!
Best regards,
Agent Mulder.
Just added a bug-report for this:
https://sourceforge.net/p/phpcrawl/bugs/97/
Sure, Agent Mulder. Should I send you the SSH credentials through SourceForge email?
Yes, or send them directly to phpcrawl@cuab.de if you want.
Emailed the credentials. Let me know if you need anything else.