From: Doug C. <cu...@ap...> - 2008-02-29 21:24:43
Ning Li wrote:
> On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
>> A unique updateable index per document would be nice, but I'm not yet
>> entirely convinced it is practical.
>
> Not if short glitches are not acceptable. In BigTable, a tablet is served
> by a single tablet server. I wonder if they find it to be a problem.

BigTable points towards a different architecture, where all modifications are logged to a shared filesystem, and a single node handles both updates and searches for a given range of ids. Perhaps we should consider this more seriously.

We want to scale flexibly both in collection size and in search traffic. If search traffic is low, then indexes might be large, and if search traffic is high, indexes might be smaller and replication might be higher. But, with no search node replication, system performance tops out at the rate at which a single node can process queries on a tiny index, which is not infinite. So you'd probably want to add read-only replicas onto the BigTable model. But then, when you have lots of writes, you don't fully utilize your cluster, and our writes are much more compute-intensive than BigTable writes. I think configuring a cluster in this model would be more complicated and less fluid.

Finally, as you observed, there would be hiccups whenever a node fails. Hiccups affect a small percentage of BigTable clients, only those touching a tablet on the failed node. But, in distributed search, every query touches a large portion of the nodes. So, in a 1000-node cluster, a failure might delay 0.1% of BigTable users, but might delay 33% of distributed search users (assuming 3-way replication). So search can be much more sensitive to this.

So I'm not convinced that the BigTable model is as appropriate for distributed full-text search as consistent hashing.

Thoughts?

Doug
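
For reference, the 0.1% versus 33% estimate above can be reproduced with a short Python sketch. It is only a back-of-the-envelope model, not a measurement. It assumes that BigTable requests are spread uniformly and each touches a tablet on exactly one node, and that every search query touches every shard and picks one of that shard's replicas uniformly at random. The function names are hypothetical and used only for illustration.

```python
from fractions import Fraction


def bigtable_fraction_delayed(nodes: int) -> Fraction:
    """Fraction of requests delayed by one node failure.

    Assumption: each request touches a tablet on exactly one node,
    and load is spread uniformly across all nodes.
    """
    return Fraction(1, nodes)


def search_fraction_delayed(replication: int) -> Fraction:
    """Fraction of queries delayed by one node failure.

    Assumption: every query touches every shard and, for each shard,
    picks one of its `replication` replicas uniformly at random.
    The failed node holds one replica of one shard, so a query is
    delayed exactly when it picks that replica.
    """
    return Fraction(1, replication)


if __name__ == "__main__":
    bigtable = bigtable_fraction_delayed(1000)
    search = search_fraction_delayed(3)
    print(f"BigTable-style requests delayed: {float(bigtable):.1%}")
    print(f"Distributed search queries delayed: {float(search):.1%}")
```

Under these assumptions the search figure depends only on the replication factor, not on cluster size, which is why adding nodes does not reduce the share of queries that notice a failure.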