From: <mni...@mo...> - 2004-06-23 04:26:27
>>>>> "Eric" == Eric Anderson <and...@ce...> writes: > Mojo B. Nichols wrote: >>> Actually I think it may be my perl and just the sockets... can >>> somebody else try this on linux? I said client because that >>> fails, but upon closer inspection it doesn't seem that simple. >>> >> Whew no not my sockets. Basically the problem was two fold: One my >> client database didn't have my client in there. If I add it >> blindly to the database that takes it past that point. Then my url >> seed db was either empty or broken or something. removing that >> index allowed it to start working (reseeded it etc). I'm going to >> shuffle this off to the side and see if I can figure out where it >> went wrong. Perhaps the seeded url db is in cvs? I'll check it out. > Glad to hear you got it working! That makes me feel better > anyway.. :) >> I'm curious about this client db and its intended use. >> > Basically, to keep clients from being able to check in/upload data > for *ANY* arbitrary url they desire. That way, an evil indexer > can't fake an index for it's own website, with all kinds of fake or > misleading data in it, causing our index to be invalid. They > request URL's to index, then they must check in those URL's. You > can't check in URL's you have not checked out.. > Maybe it's time we have someone write up some documentation on all > this? How to use each piece, with example and syntax, etc.. What > do you think? Last I checked documentation was pretty there although this being a new method may need to be added. It sounds as if we need to add them to the client db upon sending a set of urls. I have to think about this a little more. I thought we could use client redundancy and checksums to insure index integrity. As we receive a batch we put it in a queue as soon as its redundant client (or clients) return with indexes and they checkout the master accepts them. The theory here is that then a rogue client would have to occupy a large percentage of client machines to every get skewed results in. As for the client check in alone, it seems like it could still be manipulated. If they obtain a set of urls what is to prevent them from under reporting those urls, or other such mischievousness. Any way at the end of the day it probably doesn't hurt to ensure that urls sent to a client come back from a client... so I'm not really arguing against it. In fact it would be a necessary step in preventing a rogue client from just sending the required amount of skewed indexes to try to fool the master in the redundancy scheme. We can dub the redundancy scheme RAIC Redundant Array of Independent Clients:-) or some other such nonsense. mojo -- When the Apple IIc was introduced, the informative copy led off with a couple of asterisked sentences: It weighs less than 8 pounds.* And costs less than $1,300.** In tiny type were these "fuller explanations": * Don't asterisks make you suspicious as all get out? Well, all this means is that the IIc alone weights 7.5 pounds. The power pack, monitor, an extra disk drive, a printer and several bricks will make the IIc weigh more. Our lawyers were concerned that you might not be able to figure this out for yourself. ** The FTC is concerned about price fixing. You can pay more if you really want to. Or less. -- Forbes |