Thread: [Phplib-users] robots and session-id's
From: Sascha W. <sas...@gm...> - 2002-03-07 16:41:30
Hi out there,

I'm coding a shop with the help of phplib (thanks a lot to all the developers at this point). One goal is that every page with product info can be indexed by search engines, to make the rare products easier to find - so I can't lock all pages away from robots.

At this point I was pretty astonished that I found no standard solution to prevent URLs with session ids from being indexed by robots!

All hints I found (most in the archive of this list) recommended checking $HTTP_USER_AGENT against a list of known user agents and not starting phplib's page-management functions if a robot was identified. That doesn't seem very workable to me, as I think it's hard to maintain an up-to-date list of ALL relevant search engines.

Now my questions:

- Does anyone know a better (perfect and simple) solution for this problem?

OR

- Wouldn't it be simpler and more effective to check $HTTP_USER_AGENT against a few known browser strings? (At least every browser has something like "...compatible; MSIE..." in its string, and I guess that list is shorter and easier to maintain than a list of robots.) Does anyone know some PROs and CONs of this assumption?

Thanks in advance,

Sascha.
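A minimal sketch of the robot-list approach Sascha describes, for reference. It assumes phplib's page_open(); the robot substrings and the Example_Session class name are placeholders, not from any particular setup:

<?php
// Skip phplib's session management for known robots, so no session
// id is ever put into a URL that a search engine will index.
// The robot list below is an assumption -- extend it as needed.
$robots = array("googlebot", "slurp", "scooter", "ia_archiver");

$is_robot = false;
foreach ($robots as $r) {
    if (stristr($HTTP_USER_AGENT, $r)) {
        $is_robot = true;
        break;
    }
}

if (!$is_robot) {
    // Only real visitors get a session (and thus a session id).
    page_open(array("sess" => "Example_Session"));
}
?>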
From: Lindsay H. <fm...@fm...> - 2002-03-07 17:30:51
One thing I've found is that some search engines _won't_ go into pages for which there's session data passed back to the client. I'm not sure if the rejection is on account of GET data or attempts to set cookies, or both. To get at least the top page of such sites indexed, I generally take a snapshot of the top page and serve it as straight HTML. The site doesn't try to establish a session until a visitor goes to the satellite pages.

This doesn't answer your question, nor I'm sure is it encouraging news with regard to getting your product catalog indexed, but I thought I'd pass it on.

--
Lindsay Haisley       | "Everything works | PGP public key
FMP Computer Services | if you let it"    | available at
512-259-1190          | (The Roadie)      | <http://www.fmp.com/pubkeys>
http://www.fmp.com    |                   |
From: Sascha W. <sas...@gm...> - 2002-03-08 12:06:14
At 11:30 07.03.2002 -0600, you wrote:

>One thing I've found is that some search engines _won't_ go into pages
>for which there's session data passed back to the client.

As long as that doesn't apply to all search engines, you can't really rely on it. I've read about people having big problems with a session id indexed on AltaVista, with all users coming from there getting the same shopping_cart.

>This doesn't answer your question, nor I'm sure is it encouraging news
>with regard to getting your product catalog indexed, but I thought I'd
>pass it on.

Thank you sincerely for your effort anyway,

Sascha.
From: Martin L. <mar...@ma...> - 2002-03-08 06:54:07
Hello Sascha,

Thursday, March 07, 2002, 5:58:37 PM, you wrote:

SW> At this point I was pretty astonished that I found no standard solution
SW> to prevent URLs with session ids from being indexed by robots!

SW> - Does anyone know a better (perfect and simple) solution for this problem?

Just a quick thought: no search engine that I know of will ever make a POST, they just follow links. So maybe it is possible for you to have the user post an (empty) form to get into the "search-engine-locked" part of the site? This might be a little dirty workaround, but depending on your project it might be the easiest solution.

--
Martin                            mailto:mar...@ma...
Mail me for my public PGP-key
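A sketch of what Martin describes. Robots follow plain links but do not submit forms, so the session-managed area is entered through a submit button; the file name enter_shop.php is a placeholder:

<?php
// Sketch of Martin's workaround. A robot never submits a form, so
// the session-managed area is only reachable via POST.
// enter_shop.php (a placeholder name) would call page_open() and
// start the session; crawlers never get past this page.
?>
<form method="post" action="enter_shop.php">
    <input type="submit" value="Enter the shop">
</form>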
From: Sascha W. <sas...@gm...> - 2002-03-08 12:39:49
Hi Martin,

>Just a quick thought: no search engine that I know of will ever make a
>POST, they just follow links. So maybe it is possible for you to have
>the user post an (empty) form to get into the "search-engine-locked"
>part of the site?

I'm not sure if I understood you right, but there is no "search-engine-locked" part. The whole site SHALL BE INDEXED, but WITHOUT session ids.

I think the problem is that phplib's page management sends an HTTP "Location:" header with a session id in the URL to the client if there isn't already one in a cookie or in the URL! That means that even if a robot requests a URL WITHOUT a session id, a URL WITH a session id will be indexed unless the page management is disabled for robots. Or am I wrong?

greets,

Sascha.
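For reference, the behaviour Sascha describes, sketched outside phplib. This is a hedged reconstruction, not the literal phplib source, and the variable names are placeholders:

<?php
// Hedged reconstruction of the get-mode fallback: if no session id
// arrives by cookie or URL, redirect to the same URL with a fresh
// id appended -- which is exactly the URL a robot then indexes.
$sess_name = "Example_Session";

if (!isset($HTTP_COOKIE_VARS[$sess_name]) &&
    !isset($HTTP_GET_VARS[$sess_name])) {
    $id = md5(uniqid("phplib"));
    header("Location: http://$HTTP_HOST$PHP_SELF?$sess_name=$id");
    exit;
}
?>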
From: Tarique S. <ta...@sa...> - 2002-03-08 13:03:32
On Fri, 8 Mar 2002, Sascha Weise wrote:

Hello Folks,

Wouldn't it be better to check whether the session ID in question matches the browser which created it? The simplest way (and not without loopholes) is to keep track of the IP which created the session ID and create a new session if they don't match. There are several other ways to narrow down the browser-to-session-ID matching.

Also, not having the session ID as part of the GET query helps; note this does not necessarily mean that you POST it or use cookies (URL embedded!?).

HTH

Tarique

--
=============================================================
PHP Applications for E-Biz: http://www.sanisoft.com
Indian PHP User Group: http://groups.yahoo.com/group/in-phpug
=============================================================
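A sketch of the IP idea using phplib's register(). The mismatch handling is an assumption (check your phplib version), and see the AOL proxy caveat later in this thread:

<?php
// Bind the session to the IP that created it. register() and
// delete() are phplib Session methods; the redirect on mismatch
// is an assumption, not standard phplib behaviour.
$sess->register("session_ip");

if (!isset($session_ip)) {
    $session_ip = $REMOTE_ADDR;    // first request: remember creator
} elseif ($session_ip != $REMOTE_ADDR) {
    $sess->delete();               // id reused from another host:
    header("Location: http://$HTTP_HOST$PHP_SELF");   // start fresh
    exit;
}
?>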
From: Stephen W. <wo...@sw...> - 2002-03-08 14:46:30
I thought session IDs timed out, so even if your site was indexed with session IDs they would/should not be valid when a search user comes back to the site, and your software should generate a new ID.

The idea of keeping the IP address is a good one too, because clearly a user coming in from a different host can not be in the same session. It could be the same user, but not the same session.

-Steve
From: Sascha W. <sas...@gm...> - 2002-03-08 16:30:55
>I thought session IDs timed out, so even if your site was indexed with
>session IDs they would/should not be valid when a search user comes back
>to the site, and your software should generate a new ID.

IMHO there is no validity check of the ids. I may be completely wrong, but that was my conclusion after I tried requesting URLs with SESSIDs which I had deleted beforehand from the active_sessions table, or with "invalid" SESSIDs like "...=01". As a result I found these ids in the active_sessions table again afterwards. There was definitely NO NEW ID created if ANY session id was found.

If that was a stupid test, please let me know.

Sascha.
From: Joe S. <jo...@be...> - 2002-03-08 16:54:52
On Fri, Mar 08, 2002 at 05:46:50PM +0100, Sascha Weise wrote:

> IMHO there is no validity check of the ids.
> There was definitely NO NEW ID created if ANY session id was found.

This is known, and another reason not to use the get fallback on e-commerce sites. Also, phplib's garbage collection of stale sessions doesn't delete old sessions all the time. Mr. Chaney has proof again:

http://marc.theaimsgroup.com/?t=95599153200002&r=1&w=2

He suggested a REFERER check to see if it was an internal link:

http://marc.theaimsgroup.com/?l=phplib&m=96732284720675&w=2

He has since written his own auth library, I believe, and doesn't use phplib's session and auth.

Joe
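A sketch of the REFERER check those posts describe. The host name and variable names are placeholders, where phplib actually picks the id up differs between versions, and many clients send no Referer header at all:

<?php
// Honour a session id from the URL only when the visitor followed
// an internal link; treat the unset() as a sketch.
$sess_name = "Example_Session";
$internal  = isset($HTTP_REFERER) &&
             strstr($HTTP_REFERER, "http://www.example.com/");

if (!$internal && isset($HTTP_GET_VARS[$sess_name])) {
    // Id arrived from outside (bookmark, search engine index):
    // discard it so a fresh session gets created instead.
    unset($HTTP_GET_VARS[$sess_name]);
    unset($GLOBALS[$sess_name]);
}
?>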
From: Tarique S. <ta...@sa...> - 2002-03-09 04:11:49
On Fri, 8 Mar 2002, Joe Stewart wrote:

> This is known, and another reason not to use the get fallback on
> e-commerce sites. Also, phplib's garbage collection of stale sessions
> doesn't delete old sessions all the time.

This is easily remedied by setting gc_probability - set it to 100%.

Tarique

--
=============================================================
PHP Applications for E-Biz: http://www.sanisoft.com
Indian PHP User Group: http://groups.yahoo.com/group/in-phpug
=============================================================
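As a phplib subclass. gc_probability and gc_time are real Session properties; the class and session names are placeholders:

<?php
// Run phplib's session garbage collection on every request, as
// Tarique suggests.
class My_Session extends Session {
    var $classname      = "My_Session";
    var $gc_probability = 100;    // percent; phplib's default is lower
    var $gc_time        = 1440;   // purge sessions idle longer than
}                                 // this many minutes
?>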
From: Michael C. <mdc...@mi...> - 2002-03-22 15:49:54
On Fri, Mar 08, 2002 at 09:46:19AM -0500, Stephen Woodbridge wrote:

> I thought session IDs timed out, so even if your site was indexed with
> session IDs they would/should not be valid when a search user comes back
> to the site, and your software should generate a new ID.

Even if a session id disappears from the database due to garbage collection, it will immediately be recreated when someone comes in using it. That's essentially how they are created in the first place. Checking the HTTP_REFERER [sic] will mostly solve that problem.

> The idea of keeping the IP address is a good one too, because clearly a
> user coming in from a different host can not be in the same session. It
> could be the same user, but not the same session.

Please read the archives pointed to elsewhere in this thread. AOL, and a number of other sizeable internet services, use transparent proxies to do the actual requests, so there's no guarantee that the IP will remain consistent across page views. If it were as simple as IP addresses, we wouldn't bother with session ids.

I have written a new authentication class which uses php4's much faster sessioning, and I now use that exclusively. Actually, I'm going to be converting my phplib sites over at some point. (This isn't a slam on phplib; it made php3 quite useable.) I'm finishing up development of a "generic" version that will make it easy for a competent programmer to get a site up quickly. As an example, I just completed a fully functioning e-commerce site in 16 hours, and I can shave another 4 or so hours off that. I will post an announcement when it is ready for all.

Michael

--
Michael Darrin Chaney
mdc...@mi...
http://www.michaelchaney.com/
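For reference, the php4 sessioning Michael mentions is the standard PHP 4 API; this sketch is not his authentication class, and the variable names are placeholders:

<?php
// Plain PHP 4 native sessions, independent of phplib.
session_start();

if (!session_is_registered("cart")) {
    $cart = array();
    session_register("cart");   // "cart" now persists across requests
}

$cart[] = "product_42";         // hypothetical item id
?>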
From: Joe S. <jo...@be...> - 2002-03-08 15:32:08
This question seems to come up often. My opinion has been to use cookie mode only and forget the get fallback; the get fallback just makes for ugly URLs if someone is trying to bookmark the site, email the link, or even write it down.

There was a discussion about this on the phpslash-user list recently:

http://marc.theaimsgroup.com/?t=101346376400018&r=1&w=2

continued -

http://marc.theaimsgroup.com/?t=101346579700003&r=1&w=2

Nathan suggests that future phplib versions should support php4's transparent SIDs.

On Fri, Mar 08, 2002 at 09:46:19AM -0500, Stephen Woodbridge wrote:

> I thought session IDs timed out, so even if your site was indexed with
> session IDs they would/should not be valid when a search user comes back
> to the site, and your software should generate a new ID.

This is my understanding also, if everything is working properly.

> The idea of keeping the IP address is a good one too, because clearly a
> user coming in from a different host can not be in the same session.

IPs can come from pools. AOL users in particular don't come from one IP over the course of their visit. This has been discussed here before:

http://sourceforge.net/mailarchive/message.php?msg_id=1131438

Joe
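Joe's cookie-only setup as a phplib subclass. $mode and $fallback_mode are real Session properties; whether an empty fallback_mode is accepted may depend on the phplib version:

<?php
// Carry the session id in a cookie only -- no id in any URL, so
// nothing for a search engine to index. Names are placeholders.
class My_Session extends Session {
    var $classname     = "My_Session";
    var $mode          = "cookie";   // cookie transport only
    var $fallback_mode = "";         // disable the get fallback
}
?>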
From: Mike G. <Mik...@sa...> - 2002-03-08 16:10:31
Joe Stewart wrote:

> Nathan suggests that future phplib versions should support php4's
> transparent SIDs.

How far off are we from having this ability? It seems to me that there is something there in the latest pre-release, but that it is buggy?? If so, is there ongoing work to make that work?
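For reference, the php4 feature in question is switched on in php.ini; hedged: in PHP 4 of this era it also requires a build configured with --enable-trans-sid.

; php.ini
session.use_trans_sid = 1    ; rewrite relative links to carry the id
session.use_cookies   = 1    ; still prefer cookies when accepted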
From: Roy C. <rc...@ho...> - 2002-03-09 05:25:51
This one is kinda on the phplib track. I was working with pgmarket, which uses the phplib template class and some other elements, with a straight out-of-the-box install. When I turn on error-checking mode I get undefined indexes all over the place. My question is really pretty much: "Is this normal?"

First, the treemenu for the left sidebar has a problem (the first 3 error messages), in the following code:
/*********************************************/
/* Get Node numbers to expand                */
/*********************************************/
if ($p != "") $explevels = explode("|", $p);
$i = 0;
while ($i < count($explevels))
{
    $expand[$explevels[$i]] = 1;
    $i++;
}
which produces the following errors:
Warning: Undefined variable: p in
/usr/local/httpd/htdocs/pgmarket/lib/treemenu.inc.php
on line 94
Warning: Undefined variable: explevels in
/usr/local/httpd/htdocs/pgmarket/lib/treemenu.inc.php
on line 97
Warning: Undefined variable: explevels in
/usr/local/httpd/htdocs/pgmarket/lib/treemenu.inc.php
on line 136
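A minimal fix sketch (not pgmarket's code): the first three warnings come from $p and $explevels being used before they are set, so initialising them silences all three.

<?php
// Initialise before use -- behaviour is unchanged when $p is set.
if (!isset($p)) $p = "";
$explevels = ($p != "") ? explode("|", $p) : array();
$expand    = array();

for ($i = 0; $i < count($explevels); $i++) {
    $expand[$explevels[$i]] = 1;
}
?>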
Then there is a series of undefined indexes which come from the snippet of code below (if I comment it out, the errors vanish):

"session_stringtsf" => $SESSION["stringtsf"],
"session_concatenation_checked_or" => $SESSION["concatenation"] != "AND" ? "checked" : "",
"session_concatenation_checked_and" => $SESSION["concatenation"] == "AND" ? "checked" : "",
"session_case_sensitive_checked" => $SESSION["case_sensitive"] ? "checked" : ""
Warning: Undefined index: stringtsf in
/usr/local/httpd/htdocs/pgmarket/header.php on line 19
Warning: Undefined index: concatenation in
/usr/local/httpd/htdocs/pgmarket/header.php on line 20
Warning: Undefined index: concatenation in
/usr/local/httpd/htdocs/pgmarket/header.php on line 21
Warning: Undefined index: case_sensitive in
/usr/local/httpd/htdocs/pgmarket/header.php on line 22
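The same pattern quiets the header.php warnings (a sketch, not pgmarket's code): test each $SESSION key with isset() before reading it. The variable names on the left are placeholders.

<?php
// Guarded reads; empty() treats a missing key as false.
$stringtsf  = isset($SESSION["stringtsf"]) ? $SESSION["stringtsf"] : "";
$concat_and = isset($SESSION["concatenation"]) &&
              $SESSION["concatenation"] == "AND";
$case_sens  = !empty($SESSION["case_sensitive"]);
?>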
And then a set of unrelated undefined indexes:

Warning: Undefined index: there_are_special_products_blck in
/usr/local/httpd/htdocs/pgmarket/lib/template.inc.php on line 210

Warning: Undefined index: user in
/usr/local/httpd/htdocs/pgmarket/lib/pgmarket.inc.php on line 91

Warning: Undefined index: user in
/usr/local/httpd/htdocs/pgmarket/lib/pgmarket.inc.php on line 91

Warning: Undefined index: user in
/usr/local/httpd/htdocs/pgmarket/lib/pgmarket.inc.php on line 91

Warning: Undefined index: user in
/usr/local/httpd/htdocs/pgmarket/lib/pgmarket.inc.php on line 91
--
Dr. Roy F. Cabaniss
9704048 or US2002021452
Head Boll of the Evil Weevils
From: Sascha W. <sas...@gm...> - 2002-03-08 11:00:11
Hi Steve,

>One problem I have noticed by watching my logs is that a lot of robots
>masquerade as browsers

How did you find that out?

>So the question becomes can you make a list of robots that you care
>about knowing that you will never catch all of them anyway because
>some don't play fair.

Right, those who don't play fair can't be catered for anyway. But don't you think it's easier to make a list of about 5 browsers than a list of an unknown number of robots I (should) care about?

Thanks for your answer,

Sascha.
From: Sascha W. <sas...@gm...> - 2002-03-08 12:40:13
>My logs list 186 User agents - so what does one do in such situations?

Hhmmmmm... perhaps I haven't got enough experience with the logs of well-visited sites ;-( But I thought most of them would at least contain something like "...compatible; MSIE...". Don't they?

Sascha.