As it stands now, JDBM-based backend implementations
like the SystemBackend and the DefaultBackend have a
severe bottleneck relating to duplicate keys that
must be overcome.
There are situations where a Table (a.k.a. a B+Tree)
will store more than one value for a single key. In
this case the B+Tree would need to support duplicate
keys, but it does not. One of the roles of the Table
abstraction on top of the JDBM BTree is to enable the
use of duplicates. Currently this is implemented by
storing a TreeSet within the BTree for a key rather
than storing values directly. Values for the key are
then stored in the TreeSet, which sorts them using a
Comparator.
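To make the current scheme concrete, here is a minimal
sketch written against JDBM's BTree API. The
DupTableSketch class and its members are placeholder
names for illustration only, not the actual Table
implementation in the backend.

    import java.io.IOException;
    import java.util.Comparator;
    import java.util.TreeSet;

    import jdbm.btree.BTree;

    public class DupTableSketch
    {
        private final BTree bt;                    // outer JDBM BTree backing the Table
        private final Comparator valueComparator;  // orders the duplicate values

        public DupTableSketch( BTree bt, Comparator valueComparator )
        {
            this.bt = bt;
            this.valueComparator = valueComparator;
        }

        /**
         * Adds a value under a key: the entire TreeSet of duplicates is
         * deserialized, modified in memory, then reserialized and stored.
         */
        public void put( Object key, Object value ) throws IOException
        {
            TreeSet values = ( TreeSet ) bt.find( key );  // pulls in every duplicate

            if ( values == null )
            {
                values = new TreeSet( valueComparator );
            }

            values.add( value );
            bt.insert( key, values, true );               // writes every duplicate back
        }
    }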
This configuration is fine when the number of
duplicates is small, however certain pathological
circumstances arise where a key will have as many
values as there are entries within the backend. For
example, the existence index, which maps attribute
names to the primary keys of entries in its forward
BTree, would have as many values for the
'objectClass' key as there are entries within the
backend. This is the case because every entry in the
backend has an objectClass attribute. Reading, adding
or deleting a single value for the objectClass key
will therefore move a TreeSet holding all of those
values to and from main memory every single time. The
slowdown when 50,000+ values are stored for the key
can be crippling, with the degradation growing as the
backend itself grows.
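As a rough illustration of that cost using the
hypothetical DupTableSketch above (bt and comparator
are assumed to be set up elsewhere), populating the
existence index rereads and rewrites the whole
'objectclass' value set on every add:

    DupTableSketch existenceIndex = new DupTableSketch( bt, comparator );

    // Every entry has an objectClass attribute, so each put below reloads
    // and rewrites the full set of primary keys accumulated so far for
    // this single key.
    for ( long id = 1; id <= 50000; id++ )
    {
        existenceIndex.put( "objectclass", new Long( id ) );
    }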
To solve this problem we can use an inner BTree
instead of a TreeSet to store the values for the key.
The inner BTree need not have values for its tuples
since the key is always the same: it can simply store,
as its own keys, the values associated with the key in
the outer BTree. This way we save space, and the
values are kept sorted as they are added to the inner
BTree as keys. A threshold can be used to trigger the
replacement of a TreeSet by a BTree when the number of
duplicates surpasses it. This threshold can be made
variable and used as a tuning parameter.
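The following is a minimal sketch of that idea,
assuming JDBM's BTree.createInstance(), BTree.load()
and getRecid() calls; the class, method and constant
names here are hypothetical placeholders, not a
committed design.

    import java.io.IOException;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.TreeSet;

    import jdbm.RecordManager;
    import jdbm.btree.BTree;

    public class DupsToInnerBTree
    {
        // the inner tree only needs its keys, so a dummy value is stored
        private static final Object NO_VALUE = "";

        /**
         * Migrates a TreeSet of duplicates into a dedicated inner BTree whose
         * keys are the duplicate values, and returns its record id so the
         * caller can store that id under the key in the outer BTree.
         */
        public static long promote( RecordManager recman, TreeSet values,
                                    Comparator comparator ) throws IOException
        {
            BTree inner = BTree.createInstance( recman, comparator );

            for ( Iterator it = values.iterator(); it.hasNext(); )
            {
                inner.insert( it.next(), NO_VALUE, true );
            }

            return inner.getRecid();
        }

        /**
         * Adds one duplicate. Small sets stay a TreeSet; once the threshold
         * is crossed the set is promoted to an inner BTree, after which only
         * the single new value touches disk on each add.
         */
        public static Object add( RecordManager recman, Object container,
                                  Object value, Comparator comparator,
                                  int threshold ) throws IOException
        {
            if ( container instanceof TreeSet )
            {
                TreeSet set = ( TreeSet ) container;
                set.add( value );

                if ( set.size() > threshold )
                {
                    return new Long( promote( recman, set, comparator ) );
                }

                return set;
            }

            // already promoted: the container is the inner BTree's record id
            BTree inner = BTree.load( recman, ( ( Long ) container ).longValue() );
            inner.insert( value, NO_VALUE, true );
            return container;
        }
    }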
A BTree-within-a-BTree configuration will result in at
most two IO operations, if any. The majority of keys
(say 80%-95%) across all Tables/BTrees will be single
valued. However there will be the 5%-20% that have
duplicates, and an even smaller subset of those will
hold enough values to become a performance bottleneck.
The right threshold value depends on the population of
duplicate keys and the distribution of the number of
values they hold.
Although critical for moderate, large and extra-large
directories, the lack of this feature does not stop us
from dealing with small configurations on the order
of thousands of entries. In the tens of thousands
range we will see severe degradation, especially when
indices are built on attributes like 'objectClass'.
So this is high priority but does not supersede
critical architectural features required for
operation.
Alex
Wes will take this one and has been working on it.
Wes, do you have a status update, and can you attach
some background material to this tracker?