Thread: RE: re[6]: [Jdbm-developer] primary and secondary indexes in the record manager

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Kevin,

I would rather keep the existing features and promote the design to a =
b-link
tree.

I actually have a use case for generalized values for the btrees.  In =
order
to improve read performance over indexed statements some of the =
attributes
of the statement are redundantly persisted in the btree values.

In terms of an indexing and constraint system, I am fine with that as a
feature but I see it as layered on, separable and not critical path for =
a
jdbm2 initial release.  Plus I just don't see how you are going to get =
that
without one thing or another that you seem to find distasteful.

-bryan

________________________________________
From: jdb...@li...
[mailto:jdb...@li...] On Behalf Of Kevin =
Day
Sent: Tuesday, April 11, 2006 6:00 PM
To: JDBM Developer listserv
Subject: re[6]: [Jdbm-developer] primary and secondary indexes in the =
record
manager

Bryan-
=A0
I don't use JDO in my projects because it is much heavier weight than =
is
appropriate for what I need.=A0 I also disagree with their use of byte =
code
injection, and big gnarly configuration files.
=A0
In terms of whether indexing is a jdbm concern or not, I think I'll =
have to
defer to Alex on this one.=A0 My understanding from past emails on the =
subject
was that some form of indexing and constraint system was an extremely
desirable feature for the core jdbm2 implementation.
=A0
=A0
When I say that the indexes need to reside outside the record manager, =
I
mean that the standard recman semantecs don't apply.=A0 The new recman =
always
operates in a transaction context, updates to index data structures =
don't
take place in a transaction context=A0- that's why I say the indexes =
need to
live outside of the record manager.
=A0
Take a look at the sample class in the proof of concept code I emailed =
last
week, and you'll see that we are *not* making insert/fetch/update calls
against a transaction object.=A0 Instead, we make the calls against the =
record
manager, which in turn determines which transactional context the =
operations
should be performed.
=A0
=A0
Finally, the BTree implementation for the indexes is going to be =
completely
different than a btree that lives inside the record manager.=A0 First =
off,
it's going to be a b-link tree.=A0 Second, it's values will always be =
recids.=A0
Third, browser traversal and lookup will implicitly be fetching =
records,
based on the current transaction context.=A0 So, the index is *much* =
closer to
the record manager itself than to the objects stored inside the record
manager...
=A0
- K
=A0
=A0
=A0=20

> Kevin,

My critique of JDO was more of implementation approaches, but you raise =
a
fair point. =A0Would JDO for jdbm answer the mail for you? =A0I know =
that there
are OS JDO projects -- I believe Apache hosts one for example. =A0Could =
you
leverage the JDO framework (reflection, etc.) but use jdbm2 as the
persistence store? =A0Or is that exactly what you are planning?

However, I still do not see why management of primary and secondary =
indices
is a "jdbm" level concern. =A0This feature can clearly be built within =
a
framework over jdbm, so why introduce it within jdbm? =A0It seems to me =
that
we are talking about the separation of concerns between record =
management
with transactions, concurrency, etc. and an oodbms with transparent =
access
to and use of the native java fields. =A0I see the latter as interested =
but
cleanly separable from the former -- just a layer on top.

I still don't follow why the b-link nodes would exist outside of the =
record
manager API. =A0Perhaps we are just using the terms slightly =
differently. =A0I
see that we need to use distinct pages for index nodes, but not that =
the
data structures on those pages need to be any different. =A0It is more =
that
the free space on those pages is not available for the physical record
allocator for user objects, but only for node allocations on the btree.

-bryan

________________________________________
From: jdb...@li...
[mailto:jdb...@li...] On Behalf Of Kevin =
Day
Sent: Tuesday, April 11, 2006 3:33 PM
To: JDBM Developer listserv
Subject: re[4]: [Jdbm-developer] primary and secondary indexes in the =
record
manager

Bryan-
=A0
MS COM and GOM have a lot in common, actually - but that's a =
conversation
for another day after we ship jdbm2 :-)
=A0
=A0
I think that you may have a mis-perception of how JDO works.=A0 To the =
actual
user of the system, it's completely transparent.=A0 There is work that =
has to
be done in=A0an external configuration file=A0to define the O/R =
mapping, and
setting up the indexes in the underlying datastore, but once that's =
done,
using JDO is almost exactly what we are talking about for jdbm - you =
start a
transaction, you do stuff and you call commit (the biggest difference =
is
that in JDO I don't think you have to call update() when you=A0change =
an
object=A0).
=A0
With the GOM approach, the overhead of property access, etc... is not =
the
concern for me=A0- it's the impact on the developer and the style of =
coding.=A0
It certainly would be possible to use annotations and automatically =
generate
objects that conform to the described interface, etc...=A0 But it moves =
far
afield from the concept of POJOs.
=A0
Run-time overriding (for example using AspectJ) would be another way to =
get
at this - but you force the user to declare which calls can change the
object state, etc... - it's just a ton of extra work just to get
persistence.
=A0
=A0
=A0
Anyway, I think that it is best if we focus on the jdbm subsystem =
itself=A0at
this stage - we can certainly include hooks, etc... for making the =
higher
level stuff happen - but I'd like to focus on getting the core =
functionality
of a highly concurrent, fully functional=A0oodbm up and running.=A0 One =
of the
core features of most oodbms=A0is the ability to navigate (and =
query)=A0the
object store using indexes, and I think we need to seriously consider =
if/how
to do this in jdbm.=A0 If a higher level system doesn't take advantage =
of
those indexes, then that's fine - no harm done.
=A0
The vast majority of jdbm users are using it as a persistent tree map, =
which
implies index based access.=A0 For high concurrency, we need to offer =
these
users map capability that is concurrent aware, and if we are going down =
that
path, then we should at least talk about secondary indexes, because =
having
multi-keyed maps is useful for a huge number of development situations.
=A0
BTW - another reason for why index trees shouldn't reside in the =
regular
record manager is concurrent updates - if tx1 causes the tree to =
re-balance,
we need to have a mechanism for ensuring that tx2's logical view of the =
tree
stays the same, but that any changes that tx2 wanted to make will still
work.=A0 That's why I'm proposing a b-link structure that holds recid
references, and those recids are resolved in the context of whatever
transaction happens to be active.=A0=20
=A0
This approach (I think) implies a certain amount of garbage collection
functionality, and the GC operation requires visibility to both the
versioned and non-versioned logical store.=A0 This also implies that =
the GC
must *know* which objects are index objects, which in turn implies that =
the
indexes are in some way managed in jdbm (something akin to SQL INSERT =
INDEX,
DROP INDEX, etc... commands).
=A0
- K
=A0=20

> Kevin,

That is interesting feedback. =A0I personally do not think of COM when =
I think
about GOM. =A0I am not actually very read in on the Microsoft =
architectures
and have not looked at them in several years, but I do not consider =
them to
be persistent object database systems, data extensible, etc.

In any case, you can get strongly typed behavior from GOM by using =
property
and association classes to define property and link constraints. =A0I =
do not
find much overhead in the property access and type checking mechanisms.
2/3rds of the current time for our application on a bulk load operation =
is
the serialization of objects and laying them down on to page images. =
=A0If we
use a hard reference cache so that no cache evictions occur until the =
tx
commit, then the processing time drops by 2/3rds. =A0This is why one of =
my
goals has been a rewrite of the transaction system and physical row
allocation systems of jdbm.

I certainly agree that handling index updates transparently is =
difficult for
"standard" java objects. =A0I think that an interesting approach could =
be done
with Java 5 annotations and a BCE package. =A0From what I have been =
able to
see about JDO, there are very few implementations that provide what I =
might
call a native java persistent object system that is performance =
oriented.
E.g., JDO achieved through an O-R mapper and stored in an RDBMS appears =
to
be a common strategy.

What I have see is pretty non-transparent. =A0You write your native =
java
objects to be explicitly persistence aware and do things like indicate =
when
a property value change has invalidated an index entry for an object. =
=A0I
take it that this solution is not satisfying for you and that you are
looking for a means to achieve transparent persistence for native java?

What I tend to do with GOM is write "skins" which implement application
interfaces and are backed by the generic data object. =A0That gives me =
the
application semantics for the persistent objects and strong type =
checking as
well as interoperability across GOM implementations. =A0I don't find =
this to
be that much effort, but it does smack of something that would benefit =
from
automation. =A0You could also subclass the generic object for a =
specific
implementation and then operate in a type safe traditional OO manner, =
but
that locks you into a specific object hierarchy and backend.

I am noodling with both UML driven authoring of the class, property and
association constraints and BCE based optimizations for skins over live
generic objects but I do not have anything concrete to offer in either =
case.

-bryan

________________________________________
From: jdb...@li...
[mailto:jdb...@li...] On Behalf Of Kevin =
Day
Sent: Tuesday, April 11, 2006 12:49 PM
To: JDBM Developer listserv
Subject: re[2]: [Jdbm-developer] primary and secondary indexes in the =
record
manager

Bryan-
=A0
Yes - I've looked at the GOM implementation (and have worked =
extensively
with similar approaches - Microsoft COM, mozilla cross platform COM,
etc...).=A0 I can see the utility in some situations, but for the =
development
I do, losing the runtime type checking would be a massive problem.=A0 =
I'm
trying to think of jdbm as a true object database, and not just a way =
of
storing key/value pairs.=A0 Certainly, I could create wrapper classes =
that
call the GOM framework, then work in an object oriented manner, but =
that is
a heck of a lot of work (and boring work at that).
=A0
As a user of an oodbm, I expect to be able to work with objects the way =
I
normally work with objects, but have the persistent capabilities.=A0 =
The
solution I'm going for is much closer to JDO than COM...=A0 So, the =
idea of
using events and listeners is fine - but we have to have a true object
oriented mechanism for driving those events.=A0 Short of code =
injection,
aspect=A0weaving,=A0or some ugly notification requirement imposed by =
the API, I
just don't see how to achieve this.=A0 The alternative, then, is to =
provide
some sort of reverse lookup method that either caches index info for =
all
retrieved records or provides some sort of view into the index keys (or =
node
locations) for each recid.
=A0
=A0
=A0
Expanding on my comment re: blink trees:=A0 We could certainly store =
the index
pages on DATA pages, if desired - but there may be advantages (locality
again!) to storing the index pages in their own page thread.=A0 Given =
that
index page objects are likely to change much, much more frequently than =
data
pages, it may make sense from a caching and free page perspective to =
keep
them in a separate page thread.
=A0
Likewise, in my proof of concept, I could have stored the version =
tables on
DATA pages, but I chose not to because version tables are extremely =
short
lived, and will undergo a lot of churn.=A0 If the version tables were
interspersed with the data, we'd wind up with a huge amount of
fragmentation.=A0 I think (but can not say with absolute certainty) =
that
maintaining these data structures in a separate page thread should =
optimize
caching behavior, allocation, etc...
=A0
- K
=A0=20

>=20
Kevin,=20
=A0
Can you expand on your comment about blink trees? =A0Why do you see =
that they
need to be outside of the record management API? =A0I was hoping that =
some
data structures persisted as records could opt out of the default
concurrency control mechanisms and use consistency based mechanisms, =
but
that does not mean that we need to store them directly on pages does =
it?=20
=A0
I see index maintenance as the role of a framework layer over the =
record
management API. =A0Have you looked at the framework that I have =
implemented?
=A0[1]. =A0It handles index maintenance by catching property updates =
and
generating events. =A0Indices register as listeners for events and =
handle the
removal under the old key and the insert under the new key (the old =
key, new
key, and the object are part of the event). =A0This framework is =
designed
around generic data rather than explicitly declared java fields and
reflection. =A0If you want to manage indices and your classes have =
fields that
are getting serialized, serve as your index keys, and you have setters =
for
those fields then you need to generate notification events from those
setters and have your indices registered as listeners.=20
=A0
-bryan=20
=A0
[1]
http://proto.cognitiveweb.org/projects/cweb/multiproject/cweb-generic-na=
tive
/index.html=20
=A0

From: jdb...@li...
[mailto:jdb...@li...] On Behalf Of Kevin =
Day
Sent: Monday, April 10, 2006 9:46 PM
To: JDBM Developer listserv
Subject: [Jdbm-developer] primary and secondary indexes in the record
manager=20
=A0

One thing that I've been thinking about in a background thread:=20

=A0

We've discussed the possibility of making indexes a first order citizen =
of
jdbm (instead of storing a BTree in the record manager itself, on the =
data
pages). =A0This has lots of advantages from a concurrency perspective =
(we can
use BLink trees). =A0It also creates a programming paradigm that is =
much
closer to how people are used to working with databases.=20

=A0

This is where secondary indexes come in.=20

=A0

My experience so far with implementing multi-index constructs in jdbm =
has
been that keeping the indexes in sync is an absolute bugger. =A0I've =
had to
resort to special serializers that get the serialized form of the keys =
used
by a given record when that record is initially retrieved, then use =
those
keys to remove and re-insert values during any update of the record.=20

=A0

The problem with this is that you wind up having to deserialize the =
full
record, then turn around and serialize keys for each index that record =
is a
part of. =A0And you have to do it for every single read (it isn't =
possible to
determine the index key values during the update itself, because the =
object
itself has already been changed).=20

=A0

Another approach to this problem would be to maintain index meta data =
in a
suplimentary index, keyed by recid. =A0Whenever an update is performed, =
the
recid would be retrieved, and the page and node of the object could be
retrieved. =A0This would allow for efficient updates, but changes to =
the main
index tree (like a re-balance operation) could require a significant =
number
of changes in the siplimentary index.=20

=A0

A compromise would be to store only the page where the node for a given
record exists, then do a linear search for the recid during updates.=20

=A0

I'm really not sure where to go with all of this, but I think it's
appropriate to start talking about it. =A0If we have a suplimentary =
index,
then it may make sense to toss the physical row location into it and =
ditch
the translation pages entirely. =A0Individual record lookups would =
require a
tree search, but in 99% of my uses of jdbm this is the case already. =
=A0This
is closer to the primary key index idea that Alex floated awhile back.=20

=A0

I'm not at all saying this is the way to go, but I would like to begin
getting a handle on how we can maintain indexes for objects stored in =
jdbm.=20

=A0

- K <

-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting =
language
that extends applications into web and mobile media. Attend the live =
webcast
and join the prime developer group breaking into this new coding =
territory!
http://sel.as-us.falkag.net/sel?cmd

<

-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting =
language
that extends applications into web and mobile media. Attend the live =
webcast
and join the prime developer group breaking into this new coding =
territory!
http://sel.as-us.falkag.net/sel?cmd

<

Thread: RE: re[6]: [Jdbm-developer] primary and secondary indexes in the record manager

jdbm-developer