Hi there. I have a couple of questions on tags and how to extract them. I looked through the examples and the documentation, but wasn't sure whether this software can do this. Any help would be greatly appreciated.
1. Can it extract the META tags for keywords and description?
2. Can it extract the date inside this tag without extracting all span tags: <span class="date">11.23.2010</span>?
Thanks,
Ron
Hi Ron,
I'm sorry, but phpcrawl itself doesn't provide any functionality for extracting specific tags or other data
from websites.
phpcrawl is a pure crawler that spiders websites and passes the pages/documents it finds "as they are" to
the user of the library.
But a few regular expressions should do the trick.
For extracting keywords from meta tags, use something like this, for example:
preg_match('#<\s*meta\s+name\s*=\s*"\s*keywords\s*"\s+content\s*=\s*"(.*)"#Ui', $source, $match);
// $match[1] contains the keywords if the tag was found
// (this pattern assumes name comes before content and both use double quotes)
And for extracting the date from <span class="date">11.23.2010</span> try something like this:
preg_match_all('#<\s*span\s+class\s*=\s*"date"\s*>(.*)</span>#Ui', $source, $matches);
// $matches[1] now contains a numeric array with all dates found
I didn't test the expressions above properly; they are just a starting point.
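If the regular expressions turn out to be too brittle (attribute order, single quotes, extra attributes), a DOM parser may be a more forgiving route. The following is only a minimal sketch, assuming PHP's bundled DOM extension is enabled; `extractTags()` and the sample `$html` are hypothetical names made up for illustration and are not part of phpcrawl.

```php
<?php
// Sketch: pull meta keywords/description and <span class="date"> values
// out of an HTML string with DOMDocument/DOMXPath instead of regexes.
// extractTags() is a hypothetical helper, not a phpcrawl function.
function extractTags(string $source): array
{
    $result = ['keywords' => null, 'description' => null, 'dates' => []];
    if ($source === '') {
        return $result; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Meta tags: compare the name case-insensitively.
    foreach ($xpath->query('//meta[@name and @content]') as $meta) {
        $name = strtolower(trim($meta->getAttribute('name')));
        if ($name === 'keywords' || $name === 'description') {
            $result[$name] = trim($meta->getAttribute('content'));
        }
    }

    // Only <span> elements whose class attribute is exactly "date".
    foreach ($xpath->query('//span[@class="date"]') as $span) {
        $result['dates'][] = trim($span->textContent);
    }

    return $result;
}

// Sample document, invented for this example.
$html = '<html><head>'
      . '<meta name="keywords" content="php, crawler">'
      . '<meta name="description" content="A demo page">'
      . '</head><body>'
      . '<span class="date">11.23.2010</span><span class="other">x</span>'
      . '</body></html>';

$tags = extractTags($html);
```

In a crawler, `$html` would be replaced by whatever page source the library hands over; the other span in the sample is skipped because its class is not "date".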
Best regards,
huni.
Thanks, huni, for the quick reply. I got it now. I am pretty new to object-oriented programming, so I often miss some pretty obvious things. The code looks great and I'm continuing on.
One more question. I noticed that pages with 301 permanent redirects do not download any content, so I am unable to get their meta tags. I am able to go to these pages manually, and I am using the default settings of phpCrawl, which follows redirects. Looking at these pages so far, I probably don't care about indexing them, but I am just wondering why the 301 status acts the way it does. Thanks again for this great code.
Ron
Hi again, Ron,
I'm not sure I understand your question.
A typical purpose of redirects (301 or other) is to make a website reachable under different hostnames.
Just as an example, the website "www.heise.de" is also reachable as "heise.de" (without "www" at the beginning).
The webserver's answer to a request for "heise.de" is simply a 301 redirect header pointing to "www.heise.de".
The redirect header looks like this:
HTTP/1.1 301 Moved Permanently
Date: Wed, 01 Dec 2010 23:43:58 GMT
Server: Apache
Location: http://www.heise.de/
Content-Length: 228
Connection: close
Content-Type: text/html; charset=iso-8859-1
.. and the content (whose only purpose is to forward browsers or bots that don't support redirects by presenting a link
to them) is this:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.heise.de/">here</a>.</p>
</body></html>
As you can see, there aren't any meta tags (or other useful information) in this content, so you
can simply ignore it for your purposes, I guess.
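To tell such a redirect answer apart from a normal page in your own code, it should be enough to look at the status line and the Location header. A rough sketch, assuming you have the raw response header as a string; `parseRedirect()` is a hypothetical helper written for this example and not a phpcrawl function:

```php
<?php
// Sketch: read the status code and Location header from a raw HTTP
// response header. parseRedirect() is a hypothetical helper.
function parseRedirect(string $rawHeader): array
{
    $lines = preg_split('/\r\n|\n|\r/', trim($rawHeader));

    // Status line, e.g. "HTTP/1.1 301 Moved Permanently".
    $status = null;
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $lines[0], $m)) {
        $status = (int) $m[1];
    }

    // Header names are case-insensitive, hence the /i modifier.
    $location = null;
    foreach (array_slice($lines, 1) as $line) {
        if (preg_match('/^Location:\s*(.+)$/i', $line, $m)) {
            $location = trim($m[1]);
            break;
        }
    }

    // Treat it as a redirect only if it is a 3xx answer that names a target.
    $isRedirect = $status !== null
        && $status >= 300 && $status < 400
        && $location !== null;

    return ['status' => $status, 'location' => $location, 'is_redirect' => $isRedirect];
}

// The header from the heise.de example above.
$header = "HTTP/1.1 301 Moved Permanently\r\n"
        . "Date: Wed, 01 Dec 2010 23:43:58 GMT\r\n"
        . "Server: Apache\r\n"
        . "Location: http://www.heise.de/\r\n"
        . "Content-Length: 228\r\n"
        . "Connection: close\r\n"
        . "Content-Type: text/html; charset=iso-8859-1\r\n";

$info = parseRedirect($header);
```

With a check like this you could skip the meta tag extraction whenever `$info['is_redirect']` is true, since the body is only the short "document has moved" page.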
Best regards,
huni.
Hi Huni,
Oh OK, now I understand why the 301 redirect page does not bring back any meta data. I guess my question is: does the redirect target also go into the queue of links to crawl?
For your example, let's say we crawl heise.de, which redirects to http://www.heise.de. So for heise.de, I get a 301 redirect and no meta data, but does the crawler also put the redirect target into the crawling queue, so that I end up crawling http://www.heise.de and getting the meta data that exists on that page? I'm guessing that's what the option setFollowRedirects() is for?
Just want to once again say how helpful this code is. Thanks so much for putting this together.
Ron
Hey Ron,
".. my question is does the redirected page also go into the links to crawl queue?"
Yes, it does by default (because setFollowRedirects() is set to TRUE by default).
This option only exists for people who don't want the crawler to follow redirects (for whatever reason).
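To picture what that means for the queue, here is a toy model. It is explicitly not phpcrawl's implementation: `crawl()`, the `$site` array standing in for the web, and the log format are all assumptions made up for this sketch.

```php
<?php
// Toy model, NOT phpcrawl's code: shows how following a redirect puts
// the redirect target into the queue of URLs still to be crawled.
// $site is a hypothetical in-memory stand-in for real HTTP responses.
function crawl(array $site, string $startUrl, bool $followRedirects): array
{
    $queue = [$startUrl];
    $seen  = [$startUrl => true];
    $log   = [];

    while ($queue) {
        $url = array_shift($queue);
        if (!isset($site[$url])) {
            continue; // unknown URL in this toy world
        }
        $response = $site[$url];
        $log[] = $url . ' => ' . $response['status'];

        $isRedirect = $response['status'] >= 300
            && $response['status'] < 400
            && isset($response['location']);

        // With redirects enabled, the target is queued once.
        if ($isRedirect && $followRedirects && !isset($seen[$response['location']])) {
            $seen[$response['location']] = true;
            $queue[] = $response['location'];
        }
    }

    return $log;
}

$site = [
    'http://heise.de/'     => ['status' => 301, 'location' => 'http://www.heise.de/'],
    'http://www.heise.de/' => ['status' => 200],
];

$followed = crawl($site, 'http://heise.de/', true);
$ignored  = crawl($site, 'http://heise.de/', false);
```

With redirects followed, both URLs show up in the log, the second one being the page that actually carries the meta data; with redirects ignored, the crawl stops after the 301 answer.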
I'm glad I could help!