From: Doug C. <cu...@ap...> - 2008-04-02 17:36:40
Ning Li wrote:
> Yes, we have to log and propagate deletes correctly.
>
> What I'm worried about is the impact of the version check on the index
> build performance. As you said, for general synchronization, we always
> need to check versions. After we check a database/log and decide to
> add/delete a document, we call Database's addDoc/removeDoc method.
> In this addDoc/removeDoc method, we first parse the document for
> addDoc. Then in the same critical section, we have to check again if
> it is the latest version and applies the add/delete. Is checking the log
> for the version for a delete expensive here? And a log is not part of
> the Database abstraction, but part of RangedDatabase...

First, I think we need to add an abstract Database service-provider
interface, called perhaps RangeDatabase, that's different from Database,
adding methods that will be critical to good performance that must be
implemented by, e.g., HeapDatabase and LuceneDatabase.

Second, I don't yet see a way around checking versions when documents
are added or deleted. The ugliest bit is that we have to keep track of
the version of every document that's ever been deleted, in case a
long-offline node comes online and reports a stale addition. That table
could grow without bound. Sigh. Do you see a way around this?

Perhaps a node could discard old deletions after a time, keeping track
of the log entry number of the oldest retained deletion. Attempts to
sync starting with an older entry number should be rejected and should
trigger a complete copy-based replacement of the stale index. The hazard
is that, if a document is added to a single node, then that node goes
offline for a long time, then, when it comes online, the addition will
be lost. Not great.

Doug
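To make the discard-old-deletions idea concrete, here is a rough Java sketch of a deletion table that can be pruned and that refuses syncs starting before the oldest retained entry. Every name in it (DeletionLog, recordDelete, isStaleAdd, discardBefore, canSyncFrom) is hypothetical and not part of the Database API discussed in this thread. It also assumes that versions are comparable longs per document and that a delete at version v supersedes any add at version v or lower.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: these names do not come from the thread or from Lucene.
class DeletionLog {
    // One retained deletion: which document, at what version, at which log entry.
    private static final class Tombstone {
        final String docId;
        final long version;
        final long entry;
        Tombstone(String docId, long version, long entry) {
            this.docId = docId;
            this.version = version;
            this.entry = entry;
        }
    }

    private final Map<String, Tombstone> byDoc = new HashMap<>();
    private final TreeMap<Long, Tombstone> byEntry = new TreeMap<>();
    private long nextEntry = 0;       // entry number the next deletion will get
    private long oldestRetained = 0;  // syncs starting before this are rejected

    // Record a deletion and return its log entry number.
    synchronized long recordDelete(String docId, long version) {
        Tombstone previous = byDoc.get(docId);
        if (previous != null) {
            if (previous.version >= version) {
                return previous.entry; // already have an equal or newer deletion
            }
            byEntry.remove(previous.entry);
        }
        Tombstone t = new Tombstone(docId, version, nextEntry++);
        byDoc.put(docId, t);
        byEntry.put(t.entry, t);
        return t.entry;
    }

    // True if a retained deletion supersedes an add of docId at this version.
    synchronized boolean isStaleAdd(String docId, long version) {
        Tombstone t = byDoc.get(docId);
        return t != null && t.version >= version;
    }

    // Drop every deletion with an entry number below the given one.
    synchronized void discardBefore(long entry) {
        Map<Long, Tombstone> old = byEntry.headMap(entry, false);
        for (Tombstone t : old.values()) {
            byDoc.remove(t.docId);
        }
        old.clear(); // clearing the view also removes the entries from byEntry
        oldestRetained = Math.max(oldestRetained, Math.min(entry, nextEntry));
    }

    // False means the peer is too far behind and needs a full index copy.
    synchronized boolean canSyncFrom(long startEntry) {
        return startEntry >= oldestRetained;
    }
}
```

The sketch shows the hazard as well: once a deletion has been discarded, isStaleAdd can no longer catch a stale add for that document, which is why a peer whose start entry fails canSyncFrom has to be replaced by a full copy rather than synced incrementally.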