From: Ning L. <nin...@gm...> - 2008-03-19 20:03:54
On Wed, Mar 19, 2008 at 2:06 PM, Doug Cutting <cu...@ap...> wrote:
> I had a slightly different idea. I thought that the id string would be
> the external id provided by the application that we return with hits,
> e.g., a uri, a filename, etc. We'd also have a numeric 'position' value
> that places the document on the ring. The position would, by default,
> be the hash of the id, but an application might override that. It would
> be a bug for an application to ever provide different positions for the
> same id.

Originally, I was thinking of simply using the application-specified
external id as its 'position' value on the ring. We'd have one value
instead of two. No need to check if different positions are ever
provided for the same id. The ring distribution won't be uniform in
this case. But we have to deal with this case anyway. So the main
downside I see is the performance cost with strings - computation,
memory... That's why I'm fine with a separate 'position' value.

> I'd imagined that positions would be longs, but Yonik has argued that
> they might as well be ints, and I can't think why they couldn't, if
> we're going to keep the string id too. That makes the default
> implementation in Java much easier, since it can be hashCode().

I'm not insisting on longs. But here is what I reasoned. :)

I imagined a good number of the applications which would use Bailey
would be similar to an email system - the application would provide
the 'position' values so that a search on a fraction of all the
documents spans a relatively small number of nodes. Let's use Yonik's
suggestion to assign such 'position' values:

> Of course, fixing my bug it would be (username.hashCode() << 29) |
> (id.hashCode() >>> 3)

One user may have one document. Another may have a lot. Is 29 bits for
username enough? Maybe. But is 3 bits for the documents of a user
enough? That means a user's documents cannot span more than 8 nodes.

Maybe I over-thought the problem. :)

Ning
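To make the bit budget concrete, here is a minimal sketch of packing a username hash and an id hash into one int position, assuming Java's 32-bit int semantics. The class `RingPosition`, its methods, and the `userBits` parameter are hypothetical names for illustration only, not part of Bailey. With `userBits = 29` the sketch gives the 29/3 split discussed above (8 distinct positions per user); with `userBits = 3` it evaluates the quoted expression exactly as written.

```java
// Hypothetical helper, not part of Bailey: packs two hashes into one
// 32-bit ring position so that one user's documents stay close together.
final class RingPosition {

    // Default position as described in the thread: the hash of the external id.
    static int defaultPosition(String id) {
        return id.hashCode();
    }

    // The top `userBits` bits come from the low bits of the username hash;
    // the remaining (32 - userBits) bits come from the high bits of the id hash.
    static int position(String username, String id, int userBits) {
        if (userBits < 1 || userBits > 31) {
            throw new IllegalArgumentException("userBits must be in 1..31");
        }
        int docBits = 32 - userBits;
        return (username.hashCode() << docBits) | (id.hashCode() >>> userBits);
    }

    // Number of distinct positions available to a single user's documents.
    static long slotsPerUser(int userBits) {
        return 1L << (32 - userBits);
    }

    public static void main(String[] args) {
        System.out.println(slotsPerUser(29)); // prints 8
        System.out.println(slotsPerUser(3));  // prints 536870912

        // Two documents of the same user share the same top 29 bits.
        int a = position("ning", "msg-1", 29);
        int b = position("ning", "msg-2", 29);
        System.out.println((a >>> 3) == (b >>> 3)); // prints true
    }
}
```

The trade-off is the one raised above: more bits for the username means fewer collisions between users but fewer positions, and so fewer nodes, for any single user's documents.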