From: Ning L. <nin...@gm...> - 2008-03-31 15:55:20
On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
> - RangeResults belongs in the ddb implementation package. The top-level
> package has the end-user API, and RangeResults are an implementation detail.

Agree. I'll make the change.

> - in Hadoop, the way we handle threads is to stop them with
> Thread.interrupt() rather than by setting a flag. In the thread, always
> treat InterruptedException as a signal to exit. The run() loop should
> check !this.isInterrupted().

OK. We can use the same practice.

> - shouldn't the host get the hostMap from the master? And shouldn't it
> periodically refresh both the hostMap and the logMap from the master?
> For this, and instead of using ClientToMaster within a host, we need to
> add a method to HostToMaster protocol that returns a Mapper for the
> subset of the ring that concerns the calling node or host. Initially
> it might return the full mapper, but, eventually, it should only
> transmit the node's neighborhood, or perhaps the host's neighborhoods.

Agree on all the points. I wrote the withlog package with only the minimal functionality needed to demonstrate the use of the log interface. Much more needs to be done for host-to-master interactions. I figure we should come up with a detailed design for how to carry out the two types of load balancing: moving a node from one host to another, and adding or removing a node.

> - Should we have a single propagator per host, instead of per node?
> That would conserve calls to the master, and a single propagation thread
> would throttle things, so that indexing doesn't overwhelm search
> performance. OTOH, we might sometimes want to propagate changes faster
> than a single thread can. But that's probably better dealt with
> explicitly rather than having a thread per node...

Agree. This part needs to be optimized. For example, the synchronizer currently retrieves one doc at a time after processing the log.

> - some possible name improvements:
> Tuple -> NodeState ?
> logMap -> overlappingNodes or neighbors?
> propagator -> retriever? synchronizer?

I like the better names. :) I said very early on that I'm not good at names. :)

> - the point where we have a new log event from a neighbor, and need to
> resolve it against ourselves seems like a good point for a method call.

The database should decide whether a version of a doc already exists, right? Should we add a method to the database class? Also, there will be a delay between checking whether a version of a doc already exists and adding that version to the database (after retrieving it from a neighbor). Does this mean we always have to check whether a version of a doc already exists before adding it?

> I think we should add a Range element to Query that narrows it. But we
> first need to define what it means in terms of other public API
> elements. I think we define it in terms of the document's "position"
> field, which is the hashCode of its id by default, but can be explicitly
> specified. Does that sound right?

Sounds good.

> Does getDocs() need to be in the top-level application API? At some
> point we need to distinguish between full documents and "outline"
> documents. E.g., if we're storing full-text then we don't want to
> transmit that to search clients when they're just displaying hits. We
> might, e.g., add a list of fields to be retrieved to Query. But I don't
> yet see a case where an application will need to fetch a set of
> documents by id. Except for search results, one-at-a-time access will
> be more typical, no?

I thought getDoc() means getting the full document. I want to use getDocs() in log processing: after we process a number of log entries from a node and identify the docs we should retrieve, we call getDocs() to get those docs. What do you think?

Ning
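
P.S. For reference, here is a minimal sketch of the interrupt-based shutdown convention discussed above: no stop flag, the run() loop checks !this.isInterrupted(), and InterruptedException is treated as a signal to exit. The class name InterruptibleWorker, the iteration counter, and the 10 ms sleep are hypothetical placeholders for illustration only, not part of the ddb code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical worker thread; stands in for a propagator/synchronizer.
class InterruptibleWorker extends Thread {
    // Counts loop passes; a placeholder for real propagation work.
    private final AtomicInteger iterations = new AtomicInteger();

    @Override
    public void run() {
        // No "running" flag: the interrupt status is the only stop signal.
        while (!this.isInterrupted()) {
            try {
                iterations.incrementAndGet();
                Thread.sleep(10); // placeholder for blocking work
            } catch (InterruptedException e) {
                // Treat the exception as a signal to exit the thread.
                return;
            }
        }
    }

    int iterations() {
        return iterations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        InterruptibleWorker worker = new InterruptibleWorker();
        worker.start();
        Thread.sleep(100);   // let the worker make a few passes
        worker.interrupt();  // request shutdown
        worker.join(1000);   // wait for the thread to exit
        System.out.println("alive after interrupt: " + worker.isAlive());
    }
}
```

Returning from the catch block matters here because Thread.sleep() clears the interrupt status when it throws, so relying on the loop condition alone could keep the thread running.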