From: Ning L. <nin...@gm...> - 2008-03-20 23:35:20
On Wed, Mar 19, 2008 at 4:13 PM, Doug Cutting <cu...@ap...> wrote:
> Someone at Y! last week asked why Bailey doesn't use HDFS. I gave the
> following reasons:
>
> - performance: by keeping indexes local search & indexing will be faster
> - reliability: bailey replicates already, so hdfs replication is redundant
> - continuous growth: consistent hashing lets us add and remove nodes
> without fundamentally changing the way the index is partitioned. a
> host-independent partitioning in HDFS would be too static.

Don't we still need something like consistent hashing or range
partitioning to partition the index even if HDFS is used? What is a
static host-independent partitioning in HDFS?

> He countered:
> - for decent search performance, the majority of the index must be in
> memory anyway. i conceded that much of the benefit of local indexes
> might come from the filesystem buffer cache, which hdfs lacks.

If HDFS is used and a partition's index cannot fit in memory, caching in
the local file system (or keeping a complete copy in the local FS) is
definitely necessary.

> - for decent indexing performance, we could persist only logs + index
> checkpoints to HDFS (once it supports append).

I'll do some calculations later in this email.

> - even consistent hashing will require the master to be somewhat
> involved in indexing as nodes are added and removed. is that really
> inherently more complicated than having the master dole out
> subdirectories from a central hdfs repository, merging and splitting
> them as needed?

Using HDFS won't be more complicated.

Now let's compute how much data is stored using HDFS vs. using the local
FS. Assume the replication level in HDFS is h, and the read/write
replication level for the index is r. "A document d" means the original
(text) copy of the document; "the index form of document d" means the
inverted form of the document in the index (typically much smaller than
the original copy).
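As an aside on the consistent-hashing question above: the point of the
technique is that adding or removing a node only remaps the keys adjacent
to that node on the hash ring. Here is a minimal sketch (not Bailey's
actual code; the class and node names are made up for illustration):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: adding or removing a node only
    remaps the keys whose clockwise successor was that node."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual points per node, to smooth the load
        self._ring = []       # sorted list of (hash, node) points
        for n in nodes:
            self.add_node(n)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash("%s#%d" % (node, i)), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # first ring point clockwise of the key's hash, wrapping around
        i = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[i % len(self._ring)][1]
```

With this, removing a node leaves every key that was *not* on that node
assigned exactly as before, which is the "continuous growth" property
Doug mentions.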
Case 1: a simple example case. In my design of a scalable indexer on
HDFS, only batch update is supported, and a batch update is carried out
using a Map/Reduce job. No logging is required. So we have h+r copies of
each document in its index form, assuming each index server caches a
copy in its local FS.

Case 2: Now, what about a scalable index supporting incremental update
using HDFS? If multiple readers of a partition share the same copy of
the partition in HDFS (which is possible with single-writer,
multi-reader), we also have h+r copies of each document in its index
form. In addition, we'll have h copies of each document in the log if we
don't want to lose data. So the total is h+r copies in index form plus h
copies of the doc.

Case 3: If multiple readers of a partition do not share the same copy of
the partition in HDFS (as in Bailey), we'd have r * (1+h) copies of each
document in its index form and r * (1+h) copies of each document.

Case 4: In the current Bailey design, we have r copies of each document
in its index form and r copies of each document.

Cases 1 and 2 will have more copies of each document if the getDoc()
operation is to be supported. I think cases 1, 2 and 4 all make sense,
but probably not case 3.

Well, did I totally misinterpret how Bailey would use HDFS?

Ning
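As a back-of-envelope check of the four cases, the copy counts can be
tabulated in a few lines. This is just the arithmetic from the email;
the sample values h=3 (a common HDFS replication default) and r=2 are
assumptions for illustration:

```python
def copies(case, h, r):
    """Copies per document (index form, raw doc) for the four cases:
    h = HDFS replication level, r = index read/write replication level."""
    if case == 1:   # batch-only Map/Reduce indexer on HDFS; no log kept
        return {"index": h + r, "doc": 0}
    if case == 2:   # incremental; readers share one HDFS copy, plus the log
        return {"index": h + r, "doc": h}
    if case == 3:   # incremental; each reader keeps its own HDFS copy
        return {"index": r * (1 + h), "doc": r * (1 + h)}
    if case == 4:   # current Bailey design: local FS replication only
        return {"index": r, "doc": r}
    raise ValueError("unknown case: %r" % case)

for c in range(1, 5):
    print(c, copies(c, h=3, r=2))
```

With h=3 and r=2 this gives 5, 5+3, 8, and 2 index-form copies for cases
1 through 4 respectively, which makes concrete why case 3 looks like the
unattractive option.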