Tag examples

Help
RSP
2010-11-24
2013-04-09
  • RSP

    RSP - 2010-11-24

    Hi there. I have a couple of questions about tags and how to extract them. I looked through the examples and the documentation, but wasn't sure whether this software can do this. Any help would be greatly appreciated.

    1.  Extracting META tags for keywords and description?

    2.  Extracting the date from this tag without extracting all span tags: <span class="date">11.23.2010</span>?

    Thanks,

    Ron

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2010-11-25

    Hi Ron,

    I'm sorry, but phpcrawl itself doesn't provide any functionality for extracting specific tags or other data
    from websites.

    phpcrawl is a pure crawler that spiders websites and passes the found pages/documents "as they are" to
    the user of the library.

    But some regular expressions should do the trick.

    For extracting keywords from meta tags, use something like this, for example:

    preg_match('#<\s*meta\s*name\s*=\s*"\s*keywords\s*"\s*content\s*=\s*"(.*)"#Ui', $source, $match);
    // $match[1] contains the keywords if found
    

    And for extracting the date from <span class="date">11.23.2010</span> try something like this:

    preg_match_all('#<\s*span\s*class\s*=\s*"date"\s*>(.*)</span>#Ui', $source, $matches);
    // $matches[1] now contains a numeric array with all found dates
    

    I didn't test the expressions above properly; they are just a starting point.
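
Regular expressions like the ones above depend on attribute order and exact spacing. As an alternative, here is a minimal sketch that uses PHP's bundled DOM extension instead. It assumes the page source is already available as a string; `extract_page_data()` is a hypothetical helper name and not part of phpcrawl.

```php
<?php
// Hypothetical helper (not part of phpcrawl): collects the meta keywords,
// the meta description and the text of every <span class="date"> element.
function extract_page_data(string $source): array
{
    $result = ['keywords' => null, 'description' => null, 'dates' => []];
    if (trim($source) === '') {
        return $result; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Meta tags: attribute order and spacing do not matter here.
    foreach ($xpath->query('//meta[@name and @content]') as $meta) {
        $name = strtolower(trim($meta->getAttribute('name')));
        if ($name === 'keywords' || $name === 'description') {
            $result[$name] = trim($meta->getAttribute('content'));
        }
    }

    // Only spans whose class attribute is exactly "date" are matched.
    foreach ($xpath->query('//span[@class="date"]') as $span) {
        $result['dates'][] = trim($span->textContent);
    }

    return $result;
}

$html = '<html><head>'
      . '<meta name="keywords" content="php, crawler">'
      . '<meta name="description" content="A test page">'
      . '</head><body><span class="date">11.23.2010</span>'
      . '<span class="other">skip me</span></body></html>';

$data = extract_page_data($html);
echo $data['keywords'], "\n";
echo $data['description'], "\n";
echo implode(', ', $data['dates']), "\n";
```

The XPath test on `@class` is an exact string comparison, so a span with several classes (for example `class="date small"`) would need a different expression.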

    Best regards,

    huni.

     
  • RSP

    RSP - 2010-12-01

    Thanks, huni, for the quick reply. I got it now. I am pretty new to object-oriented programming, so I often miss some pretty obvious things. The code looks great and I'm continuing on.

    One more question. I noticed that pages with 301 permanent redirects do not download any content, so I am unable to get their meta tags. I am able to go to these pages manually, and I am using the default settings of phpCrawl, which follows redirects. Looking at these pages so far, I probably don't care about indexing them, but I am just wondering why the 301 status acts the way it does. Thanks again for this great code.

    Ron

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2010-12-02

    Hi Ron again,

    I'm not sure if I understand your question.

    The typical purpose of sending redirects (301 or other) is to make a website reachable under different hostnames.

    Just as an example, the website "www.heise.de" is also reachable by using "heise.de" (without "www" at the beginning).
    The webserver's answer to a request for "heise.de" is simply a 301 redirect header to "www.heise.de".

    The redirect-header looks like this:

    HTTP/1.1 301 Moved Permanently
    Date: Wed, 01 Dec 2010 23:43:58 GMT
    Server: Apache
    Location: http://www.heise.de/
    Content-Length: 228
    Connection: close
    Content-Type: text/html; charset=iso-8859-1

    .. and the content (whose only purpose is to forward browsers or bots that don't support redirects by presenting
    them a link) is this:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>301 Moved Permanently</title>
    </head><body>
    <h1>Moved Permanently</h1>
    <p>The document has moved <a href="http://www.heise.de/">here</a>.</p>
    </body></html>

    And as you can see, there aren't any meta tags (or other useful information) in this content, so you
    can simply ignore it for your purposes, I guess.
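
To skip such redirect responses before running any tag extraction, the status line and the Location header can be checked first. The following is a minimal sketch that works on a raw header string like the one shown above; `parse_redirect()` is a hypothetical helper name and not part of phpcrawl.

```php
<?php
// Hypothetical helper (not part of phpcrawl): returns the Location target
// if the raw response header describes a 3xx redirect, otherwise null.
function parse_redirect(string $header): ?string
{
    $lines = preg_split('/\r\n|\n|\r/', trim($header));

    // Status line, e.g. "HTTP/1.1 301 Moved Permanently"
    if (!preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $lines[0], $m)) {
        return null;
    }

    $status = (int) $m[1];
    if ($status < 300 || $status >= 400) {
        return null; // not a redirect
    }

    // Header names are case-insensitive, hence the "i" modifier.
    foreach (array_slice($lines, 1) as $line) {
        if (preg_match('/^Location:\s*(.+)$/i', $line, $loc)) {
            return trim($loc[1]);
        }
    }

    return null; // 3xx without a Location header, e.g. 304 Not Modified
}

$header = "HTTP/1.1 301 Moved Permanently\r\n"
        . "Server: Apache\r\n"
        . "Location: http://www.heise.de/\r\n"
        . "Connection: close\r\n";

$target = parse_redirect($header);
if ($target !== null) {
    echo "Redirect to $target, skipping tag extraction\n";
}
```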

    Best regards,

    huni.

     
  • Nobody/Anonymous

    Hi Huni,

    Oh OK, I understand now why the 301 redirect page does not bring back any meta data. I guess my question is: does the redirected page also go into the queue of links to crawl?

    For your example, let's say we crawl heise.de, which is a redirect to http://www.heise.de. So for heise.de, I get a 301 redirect and no meta data, but does the crawler also put the redirected page into the crawling queue, so that I end up crawling http://www.heise.de and getting the meta data that exists on that page? I'm guessing that's what the option setFollowRedirects() is for?

    Just want to once again say how helpful this code is.  Thanks so much for putting this together.

    Ron

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2010-12-02

    Hey Ron,

    .. my question is does the redirected page also go into the links to crawl queue?

    Yes, it does by default (because setFollowRedirects() is set to TRUE by default).
    This option only exists for people who don't want the crawler to follow redirects (for whatever reason).

    I'm glad I could help!

     
