Characters in domain

  • Pan European

    Pan European - 2013-11-22

    I'm seeing characters that shouldn't be there in the domains of URLs extracted from a page. Example:

    http://farm9.194fstaticflickr.com
    Should be
    http://farm9.staticflickr.com

    This is contained in the url_rebuild variable of the PHPCrawlerDocumentInfo::links_found_url_descriptors array.

    I could understand it if there were a formatting error on the actual HTML page the link was extracted from, but when I visit the page the malformed link is definitely not there. I am seeing this on numerous links extracted from various websites.

    All manual tests I have performed using the link extraction code do not replicate this behaviour. Am I missing something?
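
    For reference, this is roughly how I'm reading the URLs out (minimal sketch; my real code stores them in a database instead of printing, and the include path and page limit are just for this test):

    <?php
    require_once("libs/PHPCrawler.class.php");

    class LinkDumpCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            // links_found_url_descriptors holds one PHPCrawlerURLDescriptor per extracted link
            foreach ($DocInfo->links_found_url_descriptors as $descriptor)
            {
                echo $descriptor->url_rebuild . "\n";
            }
        }
    }

    $crawler = new LinkDumpCrawler();
    $crawler->setURL("http://www.flickr.com/photos/64riv/10246410266/");
    $crawler->setPageLimit(1); // only inspect the links extracted from the start page
    $crawler->go();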

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-11-22

    Hi!

    Strange.

    Could you post the URL of the page where e.g. "http://farm9.194fstaticflickr.com" was extracted from?

    Then I can take a look.

     
  • Pan European

    Pan European - 2013-11-25

    Test

     
  • Anonymous

    Anonymous - 2013-11-25

    Thanks!

    I'll take a look and let you know. Really strange, URLs from outer space.

     

    Last edit: Anonymous 2013-11-25
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-11-25

    Hi again,

    I just tested both sites (http://www.flickr.com/photos/64riv/10246410266/ and http://atlantaga.creativeloafing.com/employment/) with phpcrawl 0.81, and I can't reproduce the problem you described.

    The URLs found by the crawler look OK; the malformed URLs you mentioned are not in the PHPCrawlerDocumentInfo::links_found_url_descriptors array over here.

    I'm attaching two files containing the URLs the crawler found (url_rebuild).

    So, is it possible that your code manipulates the URLs in some way?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-11-25

    ...

     
  • Pan European

    Pan European - 2013-11-26

    Sorry I keep posting "Test". I get a 500 error when posting on this forum.

     

    Last edit: Pan European 2013-11-26
  • Pan European

    Pan European - 2013-11-26

    Thanks

    I have done the same thing and tested the link extraction on the same URLs. The results were correct, with no malformed URLs. The only difference is that on the live system the URLs go into a DB table.

    I have extended the URL descriptor class (PHPCrawlerURLDescriptor) with the following:

    public function __construct($url_rebuild, $link_raw = null, $linkcode = null, $linktext = null, $refering_url = null)
    {
        $this->url_rebuild = $url_rebuild;
    
        // Extract the hostname from the rebuilt URL. If parse_url() only
        // returns a path (e.g. a scheme-less URL), retry with a scheme prepended.
        $url_parsed = parse_url($this->url_rebuild);
        if(isset($url_parsed['host']))
        {
            $this->hostname = $url_parsed['host'];
        }
        elseif(isset($url_parsed['path']))
        {
            $url_parsed = parse_url('http://' . $url_parsed['path']);
            $this->hostname = isset($url_parsed['host']) ? $url_parsed['host'] : FALSE;
        }
        else
        {
            $this->hostname = FALSE;
        }
    
        if(!empty($link_raw)) 
        {
            $this->link_raw = $link_raw;
        }
        if(!empty($linkcode)) 
        {
            $this->linkcode = $linkcode;
            // Flag the link as nofollow if its tag carries rel="nofollow"
            // (double-quoted, single-quoted or unquoted)
            if (preg_match("#^<[^>]*rel\s*=\s*(?|\"\s*nofollow\s*\"|'\s*nofollow\s*'|\s*nofollow\s*)[^>]*>#", $this->linkcode))
            {
                $this->is_nofollow = TRUE;
            }
        }
        if(!empty($linktext)) 
        {
            $this->linktext = $linktext;
        }
    
        if(!empty($refering_url)) 
        {
            // The referring URL is the document this link was found on.
            // If the link's host appears in the referring URL, treat it as internal.
            $this->refering_url = $refering_url;
            $url_parts = PHPCrawlerUrlPartsDescriptor::fromURL($url_rebuild);
            if(!empty($url_parts->host))
            {
                if(strstr($refering_url, $url_parts->host))
                {
                    $this->is_internal = TRUE;
                }
                else 
                {
                    $this->is_external = TRUE;  
                }
            }
        }
        else 
        {
            // No referring URL, so treat the link as internal
            $this->is_internal = TRUE;
        }
    }
    

    In the page handler class I have extended the process method to save the results to DB tables. The links found for each page come from the following:

    $page_links = $page->links_found_url_descriptors;
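
    Roughly, the saving looks like this (simplified sketch; the table and column names here are made up and $pdo is the DB connection):

    foreach ($page->links_found_url_descriptors as $descriptor)
    {
        // Insert via bound parameters so the URL string is stored exactly as received
        $statement = $pdo->prepare(
            "INSERT INTO crawled_links (url_rebuild, refering_url, is_internal)
             VALUES (:url_rebuild, :refering_url, :is_internal)"
        );
        $statement->execute(array(
            ':url_rebuild'  => $descriptor->url_rebuild,
            ':refering_url' => $descriptor->refering_url,
            ':is_internal'  => (int) $descriptor->is_internal,
        ));
    }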
    
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-11-26

    Hi!

    Yes, I'm getting 500 errors on SourceForge too sometimes.

    And please understand that I can't give support for user-changed core code or user projects in general. I think you just have to do some debugging with your code on the mentioned sites and see where it fails; that's your job, not mine ;)

    As far as I can see, (unchanged) phpcrawl is working correctly here.

    I ask for your indulgence on this!

    And I recommend not changing any phpcrawl code if possible; otherwise you won't be able to update it later, for example.

    Best regards!

     
  • Pan European

    Pan European - 2013-11-26

    Hi

    I have not changed any of the core code. All of my modifications are extended classes that override the core code.

    I think what I may do is keep a cache of the crawled pages, so that for any funny-looking URLs I find I can see exactly what the crawler was looking at when it extracted them.
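
    Something like this in the page handler should do it, for example (sketch only; the cache path is made up, and as far as I can see $DocInfo->source holds the page content as the crawler received it):

    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        $cache_dir = "/tmp/crawl_cache";
        if (!is_dir($cache_dir)) mkdir($cache_dir, 0755, true);

        // Store the received HTML keyed by the md5 of the URL, plus an index file,
        // so a malformed extracted URL can be traced back to the exact source seen
        file_put_contents($cache_dir . "/" . md5($DocInfo->url) . ".html", $DocInfo->source);
        file_put_contents($cache_dir . "/index.txt", md5($DocInfo->url) . "  " . $DocInfo->url . "\n", FILE_APPEND);
    }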

     
  • Vinay

    Vinay - 2016-02-12

    Hi Uwe,

    I am facing the exact same issue that the user "Pan European" was facing two years back. However, I may have a clue that might help figure this out.

    I have the PhpCrawl code running on two servers - one is a dedicated server at Hetzner in Germany and the other is a Linode VPS in the US. Both run Ubuntu 14.04 and are set up to be identical. However, the issue only manifests itself on the Hetzner server in Germany and not on the Linode VPS in the US. I have tested this multiple times to confirm it. I therefore think this might be an issue with the environment settings.

    Some examples are below. As you can see, random characters like "33f", "000033cc" or "216" get added into the domain name. Each line shows a domain name extracted by the crawler and the page it was extracted from.

    www.in.g33fov - http://www.in.gov/cgi-bin/ai/parser.pl/0113/www.in.gov/judiciary/3470.htm
    tagessp000033cciegel.de - http://www.tagesspiegel.de/suchergebnis/?search-ressort=2876&search-day=20151226
    ladnyd216om.pl - http://zdrowie.gazeta.pl/Zdrowie/1,101459,15714188,Klirens_kreatyniny.html
    www.rodaleinc.com13c9y - http://www.prevention.com/health/health-concerns/12-ways-to-prevent-osteoporosis-and-broken-bones/7-mind-your-meds?slide=8

    I am using v0.83 - multithreaded with SQLite. I have not made any changes to the original code and I can confirm that the above issues are from the core code.
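
    For what it's worth, the setup is essentially the stock multi-process example, roughly like this (simplified; the start URL is just one of the sites above, and the log file is only there so I can trace malformed hostnames back to their source page):

    require_once("libs/PHPCrawler.class.php");

    class MyCrawler extends PHPCrawler
    {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
        {
            foreach ($DocInfo->links_found_url_descriptors as $descriptor)
            {
                // Log each rebuilt URL together with the page it was found on
                file_put_contents("found_urls.log",
                                  $descriptor->url_rebuild . "\t" . $descriptor->refering_url . "\n",
                                  FILE_APPEND);
            }
        }
    }

    $crawler = new MyCrawler();
    $crawler->setURL("http://www.tagesspiegel.de/");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE); // SQLite URL cache, needed for multi-process mode
    $crawler->goMultiProcessed(5);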

    Let me know if you have any thoughts on how I can go about fixing this.

    Vinay

    P.S. Love the software you've written, Uwe. Apart from this one strange issue, I've had absolutely no problems with it.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2016-02-14

    Hi Vinay,

    thanks for your report!

    So this is REALLY strange!

    I have never encountered this problem, and I don't know of any other user having it, except you and Pan European here.

    The really strange thing: WHY the hell do these strange characters appear right in the middle of the hostname, when there are no exceptional characters (like umlauts or anything of that kind) in the original hostname? And what ARE these strange characters? Some sort of code representations??

    This is a mystery for the X-files ;)

    OK, I'm just unable to fix this without being able to reproduce the behaviour.

    BUT (this is just an idea):
    Would it be possible to get (strictly limited) SSH access to your German Hetzner server, so that I can track down the problem and hopefully fix it?

    That would be great!

    I'm really getting curious about this now too!

    Best regards,

    Agent Mulder.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2016-02-14

    Just added a bug-report for this:
    https://sourceforge.net/p/phpcrawl/bugs/97/

     
  • Vinay

    Vinay - 2016-02-15

    Sure, Agent Mulder. Should I send you the SSH credentials through SourceForge email?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2016-02-16

    Yes, or send them directly to phpcrawl@cuab.de if you want.

     
  • Vinay

    Vinay - 2016-02-16

    Emailed the credentials. Let me know if you need anything else.

     
