Re: [bailey-developers] SF.net SVN: bailey: [17] trunk/src

On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
>  - RangeResults belongs in the ddb implementation package.  The top-level
>  package has the end-user API, and RangeResults are an implementation detail.

Agree. I'll make the change.

>  - in Hadoop, the way we handle threads is to stop them with
>  Thread.interrupt() rather than by setting a flag.  In the thread, always
>  treat InterruptedException as a signal to exit.  The run() loop should
>  check !this.isInterrupted().

OK. We can use the same practice.
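
For reference, a minimal sketch of that pattern as I understand it. The
class names and the sleep standing in for real work are hypothetical;
this is not code from Bailey or Hadoop, only an illustration of stopping
a thread with Thread.interrupt() instead of a flag.

```java
/** Hypothetical worker thread; the name and the "work" are placeholders. */
class InterruptibleWorker extends Thread {
    @Override
    public void run() {
        // Keep looping until another thread calls interrupt() on us.
        while (!isInterrupted()) {
            try {
                doOneUnitOfWork();
            } catch (InterruptedException e) {
                // Throwing InterruptedException clears the interrupt status,
                // so the loop condition would not see it; exit explicitly.
                return;
            }
        }
    }

    /** Stand-in for real work; any blocking call here may be interrupted. */
    private void doOneUnitOfWork() throws InterruptedException {
        Thread.sleep(10);
    }
}

class InterruptDemo {
    public static void main(String[] args) throws InterruptedException {
        InterruptibleWorker worker = new InterruptibleWorker();
        worker.start();
        Thread.sleep(50);    // let the worker run for a moment
        worker.interrupt();  // request shutdown; no stop flag needed
        worker.join();       // returns once run() has exited
        System.out.println("worker alive: " + worker.isAlive());
    }
}
```

The explicit return in the catch block matters: without it, an interrupt
delivered during a blocking call would be swallowed and the loop would
keep running.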

>  - shouldn't the host get the hostMap from the master?  And shouldn't it
>  periodically refresh both the hostMap and the logMap from the master?
>  For this, and instead of using ClientToMaster within a host, we need to
>  add a method to HostToMaster protocol that returns a Mapper for the
>  subset of the ring that concerns the calling node or host.  Initially
>  it might return the full mapper, but, eventually, it should only
>  transmit the node's neighborhood, or perhaps the host's neighborhoods.

Agree on all the points. I wrote the withlog package with just enough
functionality to demonstrate the use of the log interface. Much more
needs to be done for host-to-master interactions. I figure we should
come up with a detailed design for the two types of load balancing:
moving a node from one host to another, and adding/removing a node.

>  - Should we have a single propagator per host, instead of per node?
>  That would conserve calls to the master, and a single propagation thread
>  would throttle things, so that indexing doesn't overwhelm search
>  performance.  OTOH, we might sometimes want to propagate changes faster
>  than a single thread can.  But that's probably better dealt with
>  explicitly rather than having a thread per node...

Agree. That part needs to be optimized; e.g., the synchronizer currently
retrieves one doc at a time after processing the log.

>  - some possible name improvements:
>    Tuple -> NodeState ?
>    logMap -> overlappingNodes or neighbors?
>    propagator -> retriever? synchronizer?

I like the better names. :) I said very early on that I'm not good at names. :)

>  - the point where we have a new log event from a neighbor, and need to
>  resolve it against ourselves seems like a good point for a method call.

The database should decide whether a version of a doc already
exists, right? Should we add a method to the database class?
Also, there will be a delay between checking if a version of a doc
already exists and adding the version of the doc into the database
(after retrieving it from a neighbor). Does this mean we always
have to check if a version of a doc already exists before adding it?
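
One way to avoid the gap between the check and the add would be a single
atomic "add if absent" method on the database. A rough sketch follows,
assuming an in-memory map keyed by doc id and version. VersionedStore,
addIfAbsent() and hasVersion() are names made up for illustration, not
part of the current database class.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical store; stands in for whatever the database class becomes. */
class VersionedStore {
    // Key is "<id>@<version>"; the value is the document content.
    private final ConcurrentMap<String, String> docs = new ConcurrentHashMap<>();

    private static String key(String id, long version) {
        return id + "@" + version;
    }

    /** Cheap pre-check, e.g. to skip fetching a doc from a neighbor. */
    boolean hasVersion(String id, long version) {
        return docs.containsKey(key(id, version));
    }

    /**
     * Atomically stores the version unless it is already present.
     * Returns true only if this call stored it.
     */
    boolean addIfAbsent(String id, long version, String content) {
        return docs.putIfAbsent(key(id, version), content) == null;
    }
}
```

With something like this, the earlier existence check is only an
optimization to avoid an unnecessary retrieval; correctness comes from
addIfAbsent(), so a version that arrived in the meantime is not added
twice.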

>  I think we should add a Range element to Query that narrows it.  But we
>  first need to define what it means in terms of other public API
>  elements.  I think we define it in terms of the document's "position"
>  field, which is the hashCode of its id by default, but can be explicitly
>  specified.  Does that sound right?

Sounds good.
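
To check my understanding, here is a small sketch of the idea. The
classes PositionedDoc and PositionRange are hypothetical, and the
wrap-around handling assumes positions live on a ring; neither is in
the current API.

```java
/** Hypothetical document with a "position" field. */
class PositionedDoc {
    final String id;
    final int position;

    /** By default, the position is the hashCode of the id. */
    PositionedDoc(String id) {
        this(id, id.hashCode());
    }

    /** The position can also be specified explicitly. */
    PositionedDoc(String id, int position) {
        this.id = id;
        this.position = position;
    }
}

/** Hypothetical range over positions: start inclusive, end exclusive. */
class PositionRange {
    final int start;
    final int end;

    PositionRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /** True if the position falls in the range; start == end means empty. */
    boolean contains(int position) {
        if (start <= end) {
            return position >= start && position < end;
        }
        // Assumption: a range with start > end wraps around the ring.
        return position >= start || position < end;
    }

    /** A query narrowed by this range would only match such documents. */
    boolean matches(PositionedDoc doc) {
        return contains(doc.position);
    }
}
```

Defining the range only in terms of the position keeps it independent
of how ids are hashed, which is what allows the position to be
overridden.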

>  Does getDocs() need to be in the top-level application API?  At some
>  point we need to distinguish between full documents and "outline"
>  documents.  E.g., if we're storing full-text then we don't want to
>  transmit that to search clients when they're just displaying hits.  We
>  might, e.g., add a list of fields to be retrieved to Query.  But I don't
>  yet see a case where an application will need to fetch a set of
>  documents by id.  Except for search results, one-at-a-time access will
>  be more typical, no?

I thought getDoc() meant getting the full document. I want to use
getDocs() in log processing - after we process a number of log
entries from a node and identify the docs we should retrieve, we
call getDocs() to get those docs. What do you think?
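
Roughly what I have in mind, as a sketch only. DocSource, LogBatcher and
the use of plain strings for ids and docs are all made up for
illustration; the real getDocs() signature is still open.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical view of a neighbor that can return several docs at once. */
interface DocSource {
    Map<String, String> getDocs(Collection<String> ids);
}

class LogBatcher {
    /**
     * From the doc ids seen in a batch of log entries, returns the distinct
     * ids we do not hold yet, in the order they first appear in the log.
     */
    static List<String> idsToFetch(List<String> logEntryIds, Set<String> alreadyHave) {
        Set<String> wanted = new LinkedHashSet<>(logEntryIds);
        wanted.removeAll(alreadyHave);
        return new ArrayList<>(wanted);
    }

    /** Fetches all missing docs with a single getDocs() call. */
    static Map<String, String> fetchMissing(DocSource neighbor,
                                            List<String> logEntryIds,
                                            Set<String> alreadyHave) {
        List<String> ids = idsToFetch(logEntryIds, alreadyHave);
        if (ids.isEmpty()) {
            return Collections.emptyMap();  // nothing to retrieve, skip the call
        }
        return neighbor.getDocs(ids);
    }
}
```

The point is one round trip per batch of log entries instead of one per
doc.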

Ning