From: Doug C. <cu...@ap...> - 2008-03-19 20:13:47
Someone at Y! last week asked why Bailey doesn't use HDFS. I gave the following reasons:

- performance: by keeping indexes local, search & indexing will be faster
- reliability: Bailey replicates already, so HDFS replication is redundant
- continuous growth: consistent hashing lets us add and remove nodes without fundamentally changing the way the index is partitioned. A host-independent partitioning in HDFS would be too static.

He countered:

- for decent search performance, the majority of the index must be in memory anyway. I conceded that much of the benefit of local indexes might come from the filesystem buffer cache, which HDFS lacks.
- for decent indexing performance, we could persist only logs + index checkpoints to HDFS (once it supports append).
- even consistent hashing will require the master to be somewhat involved in indexing as nodes are added and removed. Is that really inherently more complicated than having the master dole out subdirectories from a central HDFS repository, merging and splitting them as needed?

The advantage of HDFS-based indexes is that nodes have less state. The disadvantage is that you have to run HDFS (if you're not already), and that performance will probably always be a bit less.

I don't see a clear advantage either way, and thus tend towards fewer dependencies and better performance.

Other thoughts?

Doug
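
[Editor's note: for readers unfamiliar with the consistent-hashing point above, the following is a minimal, hypothetical Java sketch of that partitioning idea. The names (IndexRing, addNode, nodeFor) and the virtual-node count are illustrative only and are not Bailey's actual API; they just show why adding or removing a node moves only the keys adjacent to that node's positions on the ring.]

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.SortedMap;
    import java.util.TreeMap;

    /**
     * Minimal consistent-hash ring (illustrative, not Bailey's code).
     * Each node is hashed onto the ring at several virtual positions;
     * a document key belongs to the first node at or after its hash.
     */
    public class IndexRing {
        private static final int REPLICAS = 100; // virtual nodes per physical node
        private final TreeMap<Long, String> ring = new TreeMap<Long, String>();

        public void addNode(String node) {
            for (int i = 0; i < REPLICAS; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        public void removeNode(String node) {
            for (int i = 0; i < REPLICAS; i++) {
                ring.remove(hash(node + "#" + i));
            }
        }

        /** Node responsible for the given document key. */
        public String nodeFor(String docKey) {
            if (ring.isEmpty()) throw new IllegalStateException("no nodes");
            SortedMap<Long, String> tail = ring.tailMap(hash(docKey));
            // wrap around to the ring's first position if we fall past the last node
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String key) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                // take the first 8 digest bytes as a position on the ring
                for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) {
            IndexRing ring = new IndexRing();
            ring.addNode("node-a");
            ring.addNode("node-b");
            ring.addNode("node-c");
            System.out.println("doc-42 -> " + ring.nodeFor("doc-42"));
            ring.addNode("node-d"); // only keys near node-d's positions remap
            System.out.println("doc-42 -> " + ring.nodeFor("doc-42"));
        }
    }

[When a node joins or leaves, only roughly 1/N of the keys change owner, which is why the index partitioning stays stable as the cluster grows or shrinks; the open question in the thread is how much master coordination that still requires in practice.]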