From: <mni...@mo...> - 2004-06-23 04:26:27
>>>>> "Eric" == Eric Anderson <and...@ce...> writes: > Mojo B. Nichols wrote: >>> Actually I think it may be my perl and just the sockets... can >>> somebody else try this on linux? I said client because that >>> fails, but upon closer inspection it doesn't seem that simple. >>> >> Whew no not my sockets. Basically the problem was two fold: One my >> client database didn't have my client in there. If I add it >> blindly to the database that takes it past that point. Then my url >> seed db was either empty or broken or something. removing that >> index allowed it to start working (reseeded it etc). I'm going to >> shuffle this off to the side and see if I can figure out where it >> went wrong. Perhaps the seeded url db is in cvs? I'll check it out. > Glad to hear you got it working! That makes me feel better > anyway.. :) >> I'm curious about this client db and its intended use. >> > Basically, to keep clients from being able to check in/upload data > for *ANY* arbitrary url they desire. That way, an evil indexer > can't fake an index for it's own website, with all kinds of fake or > misleading data in it, causing our index to be invalid. They > request URL's to index, then they must check in those URL's. You > can't check in URL's you have not checked out.. > Maybe it's time we have someone write up some documentation on all > this? How to use each piece, with example and syntax, etc.. What > do you think? Last I checked documentation was pretty there although this being a new method may need to be added. It sounds as if we need to add them to the client db upon sending a set of urls. I have to think about this a little more. I thought we could use client redundancy and checksums to insure index integrity. As we receive a batch we put it in a queue as soon as its redundant client (or clients) return with indexes and they checkout the master accepts them. The theory here is that then a rogue client would have to occupy a large percentage of client machines to every get skewed results in. As for the client check in alone, it seems like it could still be manipulated. If they obtain a set of urls what is to prevent them from under reporting those urls, or other such mischievousness. Any way at the end of the day it probably doesn't hurt to ensure that urls sent to a client come back from a client... so I'm not really arguing against it. In fact it would be a necessary step in preventing a rogue client from just sending the required amount of skewed indexes to try to fool the master in the redundancy scheme. We can dub the redundancy scheme RAIC Redundant Array of Independent Clients:-) or some other such nonsense. mojo -- When the Apple IIc was introduced, the informative copy led off with a couple of asterisked sentences: It weighs less than 8 pounds.* And costs less than $1,300.** In tiny type were these "fuller explanations": * Don't asterisks make you suspicious as all get out? Well, all this means is that the IIc alone weights 7.5 pounds. The power pack, monitor, an extra disk drive, a printer and several bricks will make the IIc weigh more. Our lawyers were concerned that you might not be able to figure this out for yourself. ** The FTC is concerned about price fixing. You can pay more if you really want to. Or less. -- Forbes |