From: Ning L. <nin...@gm...> - 2008-03-20 23:35:20
On Wed, Mar 19, 2008 at 4:13 PM, Doug Cutting <cu...@ap...> wrote:
> Someone at Y! last week asked why Bailey doesn't use HDFS. I gave the
> following reasons:
>
> - performance: by keeping indexes local search & indexing will be faster
> - reliability: bailey replicates already, so hdfs replication is redundant
> - continuous growth: consistent hashing lets us add and remove nodes
> without fundamentally changing the way the index is partitioned. a
> host-independent partitioning in HDFS would be too static.

Don't we still need something like consistent hashing or range
partitioning to partition the index even if HDFS is used? What is a
static host-independent partitioning in HDFS?

> He countered:
> - for decent search performance, the majority of the index must be in
> memory anyway. i conceded that much of the benefit of local indexes
> might come from the filesystem buffer cache, which hdfs lacks.

If HDFS is used and a partition's index cannot fit in memory, caching in
the local file system (or keeping a complete copy in the local FS) is
definitely necessary.

> - for decent indexing performance, we could persist only logs + index
> checkpoints to HDFS (once it supports append).

I'll do some calculations later in this email.

> - even consistent hashing will require the master to be somewhat
> involved in indexing as nodes are added and removed. is that really
> inherently more complicated than having the master dole out
> subdirectories from a central hdfs repository, merging and splitting
> them as needed?

Using HDFS won't be more complicated.

Now let's compute how much data is stored using HDFS vs. using the local
FS. Assume the replication level in HDFS is h, and the read/write
replication level for the index is r. "A document d" means the original
(text) copy of the document; "the index form of document d" means the
inverted form of the document in the index (typically much smaller than
the original copy).
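As an aside on the consistent-hashing question above: the point of the
technique is that adding or removing a node only remaps the keys adjacent
to that node on the hash ring. Here is a minimal sketch (not Bailey's
actual code; the class and node names are made up for illustration):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: adding or removing a node only
    remaps the keys whose clockwise successor was that node."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual points per node, to smooth the load
        self._ring = []       # sorted list of (hash, node) points
        for n in nodes:
            self.add_node(n)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash("%s#%d" % (node, i)), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # first ring point clockwise of the key's hash, wrapping around
        i = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[i % len(self._ring)][1]
```

With this, removing a node leaves every key that was *not* on that node
assigned exactly as before, which is the "continuous growth" property
Doug mentions.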
Case 1: a simple example case. In my design of a scalable indexer on
HDFS, only batch update is supported, and a batch update is carried out
using a Map/Reduce job. No logging is required. So we have h+r copies of
each document in its index form, assuming each index server caches a
copy in its local FS.

Case 2: Now, what about a scalable index supporting incremental update
using HDFS? If multiple readers of a partition share the same copy of
the partition in HDFS (which is possible with single-writer,
multi-reader), we also have h+r copies of each document in its index
form. In addition, we'll have h copies of each document in the log if we
don't want to lose data. So the total is h+r copies in index form plus h
copies of the doc.

Case 3: If multiple readers of a partition do not share the same copy of
the partition in HDFS (as in Bailey), we'd have r * (1+h) copies of each
document in its index form and r * (1+h) copies of each document.

Case 4: In the current Bailey design, we have r copies of each document
in its index form and r copies of each document.

Cases 1 and 2 will have more copies of each document if the getDoc()
operation is to be supported. I think cases 1, 2 and 4 all make sense,
but probably not case 3.

Well, did I totally misinterpret how Bailey would use HDFS?

Ning
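As a back-of-envelope check of the four cases, the copy counts can be
tabulated in a few lines. This is just the arithmetic from the email;
the sample values h=3 (a common HDFS replication default) and r=2 are
assumptions for illustration:

```python
def copies(case, h, r):
    """Copies per document (index form, raw doc) for the four cases:
    h = HDFS replication level, r = index read/write replication level."""
    if case == 1:   # batch-only Map/Reduce indexer on HDFS; no log kept
        return {"index": h + r, "doc": 0}
    if case == 2:   # incremental; readers share one HDFS copy, plus the log
        return {"index": h + r, "doc": h}
    if case == 3:   # incremental; each reader keeps its own HDFS copy
        return {"index": r * (1 + h), "doc": r * (1 + h)}
    if case == 4:   # current Bailey design: local FS replication only
        return {"index": r, "doc": r}
    raise ValueError("unknown case: %r" % case)

for c in range(1, 5):
    print(c, copies(c, h=3, r=2))
```

With h=3 and r=2 this gives 5, 5+3, 8, and 2 index-form copies for cases
1 through 4 respectively, which makes concrete why case 3 looks like the
unattractive option.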