From: Doug C. <cu...@ap...> - 2008-03-19 20:13:47
Someone at Y! last week asked why Bailey doesn't use HDFS. I gave the following reasons:

- performance: by keeping indexes local, search & indexing will be faster
- reliability: Bailey replicates already, so HDFS replication is redundant
- continuous growth: consistent hashing lets us add and remove nodes without fundamentally changing the way the index is partitioned. A host-independent partitioning in HDFS would be too static.

He countered:

- for decent search performance, the majority of the index must be in memory anyway. I conceded that much of the benefit of local indexes might come from the filesystem buffer cache, which HDFS lacks.
- for decent indexing performance, we could persist only logs + index checkpoints to HDFS (once it supports append).
- even consistent hashing will require the master to be somewhat involved in indexing as nodes are added and removed. Is that really inherently more complicated than having the master dole out subdirectories from a central HDFS repository, merging and splitting them as needed?

The advantage of HDFS-based indexes is that nodes have less state. The disadvantage is that you have to run HDFS (if you're not already), and that performance will probably always be a bit less.

I don't see a clear advantage either way, and thus tend towards fewer dependencies and better performance.

Other thoughts?

Doug
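
[Editor's note: for readers unfamiliar with the consistent-hashing point above, the following is a minimal, hypothetical Java sketch of that partitioning idea. The names (IndexRing, addNode, nodeFor) and the virtual-node count are illustrative only and are not Bailey's actual API; they just show why adding or removing a node moves only the keys adjacent to that node's positions on the ring.]

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.SortedMap;
    import java.util.TreeMap;

    /**
     * Minimal consistent-hash ring (illustrative, not Bailey's code).
     * Each node is hashed onto the ring at several virtual positions;
     * a document key belongs to the first node at or after its hash.
     */
    public class IndexRing {
        private static final int REPLICAS = 100; // virtual nodes per physical node
        private final TreeMap<Long, String> ring = new TreeMap<Long, String>();

        public void addNode(String node) {
            for (int i = 0; i < REPLICAS; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        public void removeNode(String node) {
            for (int i = 0; i < REPLICAS; i++) {
                ring.remove(hash(node + "#" + i));
            }
        }

        /** Node responsible for the given document key. */
        public String nodeFor(String docKey) {
            if (ring.isEmpty()) throw new IllegalStateException("no nodes");
            SortedMap<Long, String> tail = ring.tailMap(hash(docKey));
            // wrap around to the ring's first position if we fall past the last node
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String key) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                // take the first 8 digest bytes as a position on the ring
                for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) {
            IndexRing ring = new IndexRing();
            ring.addNode("node-a");
            ring.addNode("node-b");
            ring.addNode("node-c");
            System.out.println("doc-42 -> " + ring.nodeFor("doc-42"));
            ring.addNode("node-d"); // only keys near node-d's positions remap
            System.out.println("doc-42 -> " + ring.nodeFor("doc-42"));
        }
    }

[When a node joins or leaves, only roughly 1/N of the keys change owner, which is why the index partitioning stays stable as the cluster grows or shrinks; the open question in the thread is how much master coordination that still requires in practice.]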