
Check with Google Labs gives code 403

Started by rick on 2010-03-25; last updated 2013-05-30.

  • rick

    rick - 2010-03-25

    Hi,
    When I use Google Labs to check Googlebot's access to my site, the test fails with HTTP error code 403.

    URL: http://mywebsite/
    Date: Thu Mar 25 11:59:53 PDT 2010
    Googlebot Type: Web
    HTTP/1.0 403 Forbidden
    Date: Thu, 25 Mar 2010 18:59:53 GMT
    Server: Apache
    X-Powered-By: me
    Content-Encoding: gzip
    Vary: Accept-Encoding
    Content-Length: 82
    Connection: close
    Content-Type: text/html

    Sorry, this page is not available for search engine bots.

    I do not have a clue why this happens. The server is running ISPadmin, and none of the other virtual websites has this problem.
    Can someone point me in a direction to look, maybe a spot I have not searched yet?

    Thanks in advance,

    Rick

     
  • Greg Roach

    Greg Roach - 2010-03-25

    <<When I use Google Labs to check Googlebot's access to my site, the test fails with HTTP error code 403.>>

    How would I run this test?  Do I need a google-labs account?  Is there a URL?

     
  • rick

    rick - 2010-03-25

    Yes, you need one. It is part of the Webmaster Tools menu at Google.
    The strange thing is that the sitemaps can be read by Google. But I am confused about whether it is Apache, ISPadmin, or phpgedview causing the 403.
    I remember seeing a topic about a list in phpgedview where bots were mentioned, but I cannot find it.

    Rick

     
  • roland_l

    roland_l - 2010-03-28

    Hello Rick -

    I am a noob to both PGV and website maintenance in general, but I am curious about the use of the Google Labs tools in both regards. I am hoping my notes on your question about "search engines" (aka 'bots') will help you resolve your issue(s) or add more insight to the PGV project. Though I assume you are trying to use the GL tool named "Fetch", please take it all with a grain of salt if none of this applies to your situation. At least it will be documented for future reference by others.

    The ability of "bots" to search your site is often controlled by the use of a simple text file named "robots.txt" which is usually located in the root directory of your web space - ex. http://mywebsite.com - and its usage is covered well in the PGV Wiki at http://wiki.phpgedview.net/en/index.php?title=Restrict_bot_access.
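
    As a purely illustrative example (these paths are made up; the Wiki page above has PGV-specific rules), a robots.txt can allow everything except one folder, and ban a misbehaving bot entirely:

    User-agent: *
    Disallow: /private/

    User-agent: BadBot
    Disallow: /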

    You are correct; there have been several past posts regarding "bots" in this forum. Most are related to "bad bots" which attempt to obtain info from our PGV installs that we would consider private and normally would not share. In this regard, I think the PGV developers have done a wonderful job of coding in many protections against these types of "attacks", and I applaud them for doing so, as a rookie like me would have overlooked the impact this whole process has on a site after it has been activated.

    While the robots.txt "standard" is but one method (as bot access can also be controlled more selectively with other techniques), not all search engines "play" by the rules you define. These more advanced methods generally fall into a category known as SEO (search engine optimization), and are commonly applied to monetized sites (think $$$ here), not at all similar to our "free" PGV implementations. Needless to say, bot access can also be specified directly in the META tag(s) of an HTML page or through the use of well-crafted ".htaccess" files. A simple web search of any of the above terms will bring up many more pages of insight into how professional web designers (not me) get their sites to the "top of the list" for all to see.
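
    For instance (illustrative only), a single page can opt out of indexing via a tag in its HTML head:

    <meta name="robots" content="noindex, nofollow">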

    But, back to your post - a "403 - Forbidden" status code (error message) is a response "…returned by the Apache web server when directory listings have been disabled." (see http://en.wikipedia.org/wiki/HTTP_403). I would suggest first looking for a "robots.txt" in the directory you are having this issue with. Just FYI here - you can have more than one robots.txt file on your site as it is "path specific". Also, Google permits a variation of the "standard" and will recognize an "Allow" tag in the file for its several bots, while others will ignore it. Again, more is explained at http://en.wikipedia.org/wiki/Robots.txt and here: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449.
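
    As a made-up example, Google's bots will honor a record like the following (crawl /phpgedview/ and nothing else), while bots that ignore "Allow" will only act on the Disallow line:

    User-agent: Googlebot
    Allow: /phpgedview/
    Disallow: /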

    Hoping this helps you (and others) in some small way. Good Luck!

    Roland_L

     
  • Greg Roach

    Greg Roach - 2010-03-28

    If PGV detects a bad bot, it responds with a 403.

    Without knowing things like the UA string involved, it is hard to say why this might be triggering it.

     
  • rick

    rick - 2010-03-28

    Dear Robert,
    I was traveling in Germany this weekend, so I could not respond. Thanks for your reply. The robots.txt at the server contains:

    User-agent: *
    Allow: /

    which should be fine, I think. Another strange thing I noticed at Google is that one sitemap is accepted and the other (I have two) is not. They are both in the same folder and have the same permissions. The one that is accepted is not indexed; I do not know the reason.
    I understand that it could be caused by many things. When I access these sitemaps by browsing directly to the files, the sitemap is displayed correctly, with no access issues. If I click a link that Google is complaining about, the site is displayed normally. But in that case I am not a bot but a normal user.
    Maybe it is a forwarding thing in ISPadmin, but I checked it and cannot find anything related to bots.
    The funny thing is that it always worked when I had a structure like /var/www/webx/web/phpgedview, but I recently moved it to /var/www/webx. Everything except Google works fine after the move. I am lost.

    What is a UA string? Can I provide you with that?

    Rick.

     
  • rick

    rick - 2010-03-28

    Sorry Roland! Got your name wrong!

     
  • roland_l

    roland_l - 2010-03-28

    Hello Rick -

    No harm with the "name change", it's a tough one and not too common. You should hear some of the other versions I get …

    Glad my previous post did not offend but gave you an opportunity to do some checking. Even with more information, I can only stress a few points I made before.

    1. Remember, you should only have one robots.txt file in any given path (or folder) of your web space. You can have others in different sub-domain folders, images folders, etc. or just detail all bot-access rules in one file kept in the root folder. Did you review the examples in the Wiki? While your code above should "Allow" all bots (and to all paths!), clearly some bots are best kept away. There are also tools on the WWW which you can use to test your robots.txt for effectiveness.

    Also, I am not sure how your webhost allows two files with the same name in one path anyway, but clearly this could be part of the confusion for Google. After all, they are pretty much running the "Search" show these days.

    2. Only guessing here, but most Web APIs (like Google Maps, for example) require you to specify the path (i.e. web address) it will (only) be used from. I don't think you can move things around later without requesting a new "key" for it to work with. Perhaps moving your files is the sole source of things not working now, and you simply need to request a new code for this installation path. Again, everything in webhosting is path (and rights) restricted, limited to the "User" who wishes to access something on a site.

    3. As far as

    What is a UA string? Can I provide you with that?

    goes, I am sure it means User Agent string, but I don't know the details. So I asked my friend Google and she said this: http://en.wikipedia.org/wiki/User_agent. So I can't use your UA string (once you find it), but clearly Greg (fisharebest) is indicating that PGV does depend on this information, and it could be a source of 403 - Forbidden errors if what is being used to access your site is invalid or incorrect. This clue would help our developers re-code for this Google API and make it "play well" with PGV in the future!

    Question: Exactly which Google Labs tool are you having this problem with? Is it Fetch or Analytics?

    Hopefully this all gives you more info toward resolving your problem(s). Please keep us posted as you progress. Roland_L

     
  • Greg Roach

    Greg Roach - 2010-03-28

    PGV uses the User Agent string to identify good/bad robots.

    The "google labs bot" will have its own UA string.

    So, my best guess is that PGV doesn't like this.  But without knowing the string, this is just speculation.
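
    To illustrate the general idea only (a made-up sketch, not the real PGV code), UA-based detection amounts to something like:

    // Illustrative sketch - not the real PGV code.
    // Flag the request as a spider when the UA string names a known bot.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $known_bots = array('Googlebot', 'bingbot', 'Slurp');
    $SEARCH_SPIDER = '';
    foreach ($known_bots as $bot) {
        if (stripos($ua, $bot) !== false) {
            $SEARCH_SPIDER = $bot; // downstream code then serves reduced pages or a 403
            break;
        }
    }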

     
  • rick

    rick - 2010-03-28

    Hi,

    As a test I moved the phpgedview content back to /var/www/webx/web/phpgedview and ran the Google Labs test again.
    The result:

    URL: http://mydomain/phpgedview
    Date: Sun Mar 28 13:24:03 PDT 2010
    Googlebot Type: Web
    HTTP/1.1 301 Moved Permanently
    Date: Sun, 28 Mar 2010 20:24:03 GMT
    Server: Apache
    Location: http://mydomain/phpgedview/
    Vary: Accept-Encoding
    Content-Encoding: gzip
    Content-Length: 206
    Keep-Alive: timeout=15, max=100
    Connection: Keep-Alive
    Content-Type: text/html; charset=iso-8859-1

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>301 Moved Permanently</title>
    </head><body>
    <h1>Moved Permanently</h1>
    <p>The document has moved <a href="http://mydomain/phpgedview/">here</a>.</p>
    </body></html>

    I am not sure, but it looks like somewhere in the phpgedview code it is defined that the phpgedview source should be located at /rootwebsite/phpgedview instead of /rootwebsite.

    Is it possible to confirm?

    Rick.

     
  • Greg Roach

    Greg Roach - 2010-03-28

    You will get the 301 if your SERVER_URL setting is different to the URL used to contact the site.

    It is how we redirect www.example.com/~user/phpgedview to www.mydomain.com
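
    In rough outline (a sketch assuming $SERVER_URL holds the configured canonical address; not the exact PGV code), the check amounts to:

    // Rough sketch - not the exact PGV code.
    // Redirect to the canonical address when the request arrived on a
    // different URL (a real implementation would preserve the path).
    $requested = 'http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    if (strpos($requested, $SERVER_URL) !== 0) {
        header('HTTP/1.1 301 Moved Permanently');
        header('Location: ' . $SERVER_URL);
        exit;
    }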

     
  • rick

    rick - 2010-03-28

    I saw your replies as I posted my test result.

    I am using the first Labs test; in Dutch it is called 'Ophalen als Googlebot', which translates to something like 'Retrieve or Fetch as Googlebot'.

    In my log file is:
    mydomain 66.249.65.54 - -  "GET / HTTP/1.1" 403 82 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    Rick

     
  • rick

    rick - 2010-03-28

    OK Greg, understood. Well, the Google bot seems to work okay when the website is defined as mydomain/phpgedview, but when I tell the system that the site should be only mydomain, the 403 is returned. The 301 is because I just copied the content and did not update the configuration file. No worries about that.

    Rick

     
  • rick

    rick - 2010-03-28

    @Roland,

    Well, I am my own host; I am running the websites from my own server and not depending on a commercial provider. The advantage is that I can do whatever I need or want to. I am using Ubuntu, and ISPadmin serves different virtual domains, so I currently have about 6 active ;-)

    With all the others there are no issues, and before the move there were also no issues with phpgedview, but now it is a p.i.t.b.

    Rick

     
  • Greg Roach

    Greg Roach - 2010-03-28

    Don't forget that robots.txt must only be used in the root.  i.e. www.example.com/robots.txt  You cannot put it in a subdirectory, such as www.example.com/phpgedview/robots.txt

    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    This is the UA string.  Check your PGV logs.  If it blocks a bot (based on UA string), it will record it there.

     
  • rick

    rick - 2010-03-28

    The robots.txt is in the root!
    Greg, thanks for pointing me to the PGV log file. I had not checked it before, but at the time of the log entry posted in message 12 (which is an entry from the Apache log file), no entry was written in the PGV log file. There are some crawler blocks mentioned, but not from that IP address.
    So apparently the blocking is not from phpgedview. This makes it even more difficult.

    Some of the source files contain:

    if (!defined('PGV_PHPGEDVIEW')) {
        header('HTTP/1.0 403 Forbidden');
        exit;
    }

    Could it be that the PGV_PHPGEDVIEW constant is configured to point at the previous path/configuration?

    Rick.

     
  • Greg Roach

    Greg Roach - 2010-03-28

    This block of logic is simply used to prevent people from trying to access an include file.  PGV_PHPGEDVIEW is set in session.php, so it says that you cannot load this until you have loaded session.php - and session.php checks that it can only be loaded by designated scripts.
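
    In other words (an illustrative sketch; the exact value PGV assigns may differ), session.php does something like this:

    // session.php marks the request as having entered through a proper
    // entry point, so include files can refuse to run stand-alone via
    // the !defined('PGV_PHPGEDVIEW') guard quoted above.
    define('PGV_PHPGEDVIEW', true);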

     
  • rick

    rick - 2010-03-28

    Thanks Greg,

    Well, it looks like I need to move everything back to the previous configuration, where the structure is website/phpgedview.
    Either there is a file or configuration somewhere that keeps telling Apache that the genealogy is located at domain/phpgedview,
    or a hidden .htaccess file is doing a bad job. I checked it all, but no clue yet.

    I will keep you posted on progress. Google is important because most visitors enter the site via a Google search.

    Have a good night, Rick.

     
  • Greg Roach

    Greg Roach - 2010-03-28

    Have you tried getting a plug-in for Firefox which lets you set your UA string? This lets you impersonate a search engine and see what content is returned to it.
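
    If you have shell access, an alternative to the plug-in (plain curl, nothing PGV-specific) is to request the headers with a spoofed UA yourself:

    # Fetch only the response headers, pretending to be Googlebot.
    curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://mywebsite/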

     
  • rick

    rick - 2010-03-28

    Hi Greg,
    No, I didn't, but I will try it.
    In the meantime I looked at session_spider.php and found the place where the string 'Sorry, this page is not available for search engine bots.' is produced. To test, I changed the text a little and wondered what Google would return.
    The message is:

    HTTP/1.0 403 Forbidden
    Date: Sun, 28 Mar 2010 21:38:43 GMT
    Server: Apache
    X-Powered-By: PHP/5.2.6-2ubuntu4.6
    Content-Encoding: gzip
    Vary: Accept-Encoding
    Content-Length: 86
    Connection: close
    Content-Type: text/html

    Sorry spider, this page is not available for search engine bots.

    The word 'spider' was added by my change, which proves that phpgedview is blocking Google. Now I need to find out why.
    Maybe the plugin will help to find the cause.

    Rick.

     
  • Greg Roach

    Greg Roach - 2010-03-28

    I have no idea how Google Labs works, but the "web developer" plug-in for Firefox will show you these response headers.  In combination with the default-user-agent plugin, you can test all this locally, without needing to go to an external site.

     
  • rick

    rick - 2010-03-29

    Hi Greg,
    I downloaded Firefox and the required plugin, and got the same error as Google does. Then I tried to find the problem, and I managed to clear the error by changing the file session_spider.php.

    I added the browser 'Googlebot' to the array $real_browsers and saved it.
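
    Roughly like this (illustrative only; the real array contains more entries than shown here):

    // Test change: append 'Googlebot' so it is treated as a real browser.
    $real_browsers = array('MSIE', 'Firefox', 'Opera', 'Safari', 'Googlebot');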
    Tested with Firefox and Google Labs, and now all is okay, without any error.

    Can you confirm this is the right action to take?

    Rick.

     
  • Greg Roach

    Greg Roach - 2010-03-29

    Probably not.  It will allow Google to index all the pages in your site that robots should not see - as if it were a real user.  e.g. time-consuming relationship charts, etc.  It will also get the theme options, which let it see multiple copies of everything (which can affect your page rank badly).

    One more thought.  You are not trying to access any of the pages that search engines are not allowed to see, are you?

     
  • rick

    rick - 2010-03-30

    Hi Greg,

    Well, if the home page of PHPGEDVIEW is such a page, then yes. I have tried different user-agent settings. Mozilla/5.0 is not a problem; the page will be displayed. But Mozilla/5.0 (compatible etc…..) will show 'Search engine not allowed'. To me it is the code. Why it was working in the previous configuration is because of the 301 Moved result, I guess.

    Rick

     
  • rick

    rick - 2010-03-30

    Greg, I removed it again. You're right, Googlebot became a normal user.
    Continued testing:
    When I removed the 'exit' in session_spider.php:

    if ($SEARCH_SPIDER && !in_array(PGV_SCRIPT_NAME, $bots_not_allowed)) {
        header("HTTP/1.0 403 Forbidden");
        print "Sorry spider, this page is not available for search engine bots.";
        // exit;
    }

    I get the following info back as Googlebot:

    ~Search Engine Detected~
    PhpGedView automatically provides search engines with smaller data files with fewer links. The data is limited to the individual and immediate family, without adding information about grand parents or grand children. Many reports and server-intensive pages like the calendar are off limits to the spiders.
    Attempts by the spiders to go to those pages result in showing this page. If you are seeing this text, the software believes you are a search engine spider. Below is the list of pages that are allowed to be spidered and will provide the abbreviated data.
    Real users who follow search engine links into this site will see the full pages and data, and not this page.

    Search Engine Spider Detected: Googlebot/ http://www.google.com/bot.html

    Does this give you a clue about what is happening?
    Note that it is just the home page that is being requested!

    Rick.

     
