From: Doug C. <cu...@ap...> - 2008-03-28 21:42:11
ni...@us... wrote:
> 1 Add the protocol and the classes related to log propagation. Also add a simple implementation in the withlog package and a test case TestSimpleDbWithLog.

This stuff looks great! You're a tour de force!

A few minor comments:

- RangeResults belongs in the ddb implementation package. The top-level package has the end-user API, and RangeResults are an implementation detail.

- In Hadoop, the way we handle threads is to stop them with Thread.interrupt() rather than by setting a flag. In the thread, always treat InterruptedException as a signal to exit. The run() loop should check !this.isInterrupted().

- Shouldn't the host get the hostMap from the master? And shouldn't it periodically refresh both the hostMap and the logMap from the master? For this, and instead of using ClientToMaster within a host, we need to add a method to the HostToMaster protocol that returns a Mapper for the subset of the ring that concerns the calling node or host. Initially it might return the full mapper, but, eventually, it should only transmit the node's neighborhood, or perhaps the host's neighborhoods.

- Should we have a single propagator per host, instead of per node? That would conserve calls to the master, and a single propagation thread would throttle things, so that indexing doesn't overwhelm search performance. OTOH, we might sometimes want to propagate changes faster than a single thread can. But that's probably better dealt with explicitly rather than having a thread per node...

- Some possible name improvements:
    Tuple -> NodeState ?
    logMap -> overlappingNodes or neighbors?
    propagator -> retriever? synchronizer?

- The point where we have a new log event from a neighbor, and need to resolve it against ourselves, seems like a good point for a method call.

> 2 Add the RangedDatabase class which contains NodeStatus, Database and Log.
> 3 Add "getDocs" to the Database class to retrieve a number of documents.
> This will be used to improve performance during the log propagation.
> Q: Should Database be aware of Range to support filtered queries based on Range? Or do we make RangedDatabase add a clause to a query before passing it down to Database?

I think we should add a Range element to Query that narrows it. But we first need to define what it means in terms of other public API elements. I think we define it in terms of the document's "position" field, which is the hashCode of its id by default, but can be explicitly specified. Does that sound right?

Does getDocs() need to be in the top-level application API? At some point we need to distinguish between full documents and "outline" documents. E.g., if we're storing full-text then we don't want to transmit that to search clients when they're just displaying hits. We might, e.g., add a list of fields to be retrieved to Query. But I don't yet see a case where an application will need to fetch a set of documents by id. Except for search results, one-at-a-time access will be more typical, no?

Sorry I've not been more involved this week...

Doug
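
P.S. To make the thread-handling comment above concrete, here is a rough sketch of the interrupt-based shutdown pattern. The class names Propagator and InterruptDemo, the propagateOnce() method, and the sleep intervals are all made up for illustration; none of this is taken from the patch.

```java
/** Hypothetical stand-in for the propagation thread; not the class from the patch. */
class Propagator extends Thread {
    @Override
    public void run() {
        // Check the thread's interrupt status instead of a separate "running" flag.
        while (!this.isInterrupted()) {
            try {
                propagateOnce();
                Thread.sleep(10); // placeholder pause between passes
            } catch (InterruptedException e) {
                // Interrupted while blocked: treat it as the signal to exit.
                return;
            }
        }
    }

    // Placeholder for fetching and applying log events from neighbors.
    void propagateOnce() throws InterruptedException {
        // no-op in this sketch
    }
}

class InterruptDemo {
    /** Starts a Propagator, lets it run for runMillis, interrupts it, and reports whether it exited. */
    static boolean startThenInterrupt(long runMillis) {
        Propagator p = new Propagator();
        p.start();
        try {
            Thread.sleep(runMillis);
            p.interrupt(); // request shutdown; no flag needed
            p.join(2000);  // wait for the run() loop to finish
        } catch (InterruptedException e) {
            // The caller itself was interrupted: restore the status and give up.
            Thread.currentThread().interrupt();
            p.interrupt();
            return false;
        }
        return !p.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("exited after interrupt: " + startThenInterrupt(100));
    }
}
```

The loop condition covers an interrupt that arrives while the thread is running, and the catch block covers one that arrives while it is blocked in sleep(), since sleep() clears the interrupt status when it throws.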