Thread: [Hibernate] Hibernate Lucene integration

[Resent with better formatting]

I've worked a lot recently on the Hibernate Lucene integration. Here are 
the concepts, the new features and the todo list.
Please comment and give feedback.

My work is committed in branches/Lucene_Integration because we'll 
probably need to be based on Hibernate 3.3.

*Concepts*
Each time you change an object's state, the Lucene index is updated and 
kept in sync. This is done through the Hibernate event system.
Whether an entity is indexed or not and whether a property is indexed or 
not is defined through annotations.
You can also search through your domain model using Lucene and retrieve 
managed objects. The whole idea here is to do a nice integration between 
the search engine and the ORM without losing the search engine's power, 
hence most of the Lucene API remains. To sum up: query Lucene, get 
managed objects back.

*Mapping*
A given entity is mapped to an index. A Lucene index is stored in a 
Directory; a Directory is a Lucene abstraction for an index storage 
system. It can be a memory directory (RAMDirectory), a file system 
directory (FSDirectory) or any other kind of backend. Hibernate Lucene 
introduces the notion of DirectoryProvider, which you can configure and 
define on a per entity basis (and which has a default). The 
concept is very similar to ConnectionProvider.

Lucene only works with Strings, so you can define a @FieldBridge which 
transforms a Java property into a Lucene Field (and potentially 
vice-versa). A simpler (more useful?) version, StringBridge, handles the 
transformation of a Java property into a String.
Some built-in FieldBridges exist. @FieldBridge is very much like a 
Hibernate Type. In particular, I introduced the notion of precision in 
dates (year, month, .. second, millisecond). FieldBridge and 
StringBridge give a lot of flexibility in the way property indexing is 
designed.
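
To make the date precision idea concrete, here is a minimal sketch of a 
string bridge that truncates a date to a chosen resolution. It is only an 
illustration: the names StringBridge, Resolution and DateBridge and the 
method objectToString() are assumptions made for this example, not the 
types committed in the branch, and java.time is used purely for brevity.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical interface: turns a Java property value into an indexable String.
interface StringBridge<T> {
    String objectToString(T value);
}

// Hypothetical precision levels; each one keeps fewer or more date components.
enum Resolution {
    YEAR("yyyy"),
    MONTH("yyyyMM"),
    DAY("yyyyMMdd"),
    HOUR("yyyyMMddHH"),
    MINUTE("yyyyMMddHHmm"),
    SECOND("yyyyMMddHHmmss"),
    MILLISECOND("yyyyMMddHHmmssSSS");

    final DateTimeFormatter formatter;

    Resolution(String pattern) {
        this.formatter = DateTimeFormatter.ofPattern(pattern);
    }
}

// Formats a date down to the configured resolution. The fixed-width,
// most-significant-first layout keeps lexicographic order equal to
// chronological order, which matters for a String-only index.
final class DateBridge implements StringBridge<LocalDateTime> {
    private final Resolution resolution;

    DateBridge(Resolution resolution) {
        this.resolution = resolution;
    }

    @Override
    public String objectToString(LocalDateTime value) {
        return value == null ? null : resolution.formatter.format(value);
    }
}
```

A coarser resolution means fewer distinct terms in the index, at the cost 
of not being able to search below that precision.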

*Querying*
I've introduced the notion of LuceneSession, which implements Session and 
actually delegates to a regular Hibernate Session. This LuceneSession 
has a /createLuceneQuery()/ method and an /index()/ method.

/session.createLuceneQuery(lucene.Query, Class[])/ takes a Lucene query 
as a parameter and the list of targeted entities. Using a Lucene query 
as a parameter gives the full Lucene flexibility (no abstraction on top 
of it). An /org.hibernate.Query/ object is returned.
You can (must) use pagination. A Lucene query also returns the number of 
matching results (regardless of the pagination): query.resultSize(), a 
sort of count(*).
   /list()/ returns the list of matching objects. It heavily depends on 
batch-size to be efficient (i.e. proxies are created for all the results 
and then initialized).
There might be alternative strategies here (select ... where id in ( , , 
, ) ), but the real benefit would come if combined with the dynamic 
fetching profile we talked about a while ago.
   /iterate()/ has the same semantics as the regular method in Hibernate, 
meaning the objects are initialized one by one.
   /scroll()/ allows an efficient navigation into the resultset, 
(objects are loaded one by one though).
Combined with the dynamic fetch profile, this would definitely be a 
killer pair (searching the Lucene index, and fetching the appropriate 
object graph).
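
As an illustration of the pagination and of the "where id in ( , , , )" 
alternative mentioned above, here is a minimal sketch that works on a 
plain list of matching ids. HitPager and its methods are hypothetical 
names for this example only; the real query object would get the ids from 
the Lucene hits, and the total hit count (what resultSize() reports) is 
simply the size of the full id list, independent of the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the branch.
final class HitPager {

    // Returns the ids in the window [firstResult, firstResult + maxResults),
    // clamped to the bounds of the hit list.
    static <T> List<T> page(List<T> hitIds, int firstResult, int maxResults) {
        int from = Math.min(Math.max(firstResult, 0), hitIds.size());
        int to = (int) Math.min((long) from + Math.max(maxResults, 0), hitIds.size());
        return new ArrayList<>(hitIds.subList(from, to));
    }

    // Splits the ids into groups of at most batchSize, one group per
    // "select ... where id in (...)" statement.
    static <T> List<List<T>> batches(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            int end = Math.min(i + batchSize, ids.size());
            result.add(new ArrayList<>(ids.subList(i, end)));
        }
        return result;
    }
}
```

With 10 hits, a page starting at 2 with 5 results and a batch size of 2 
would load the objects in three statements, while the reported result 
size stays 10.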

/session.index(Object)/ is currently not implemented: it requires some 
modifications of SessionImpl or of LuceneSession. This feature is useful 
to initialize / refresh the index in a batch way (i.e. loading the data 
and applying the indexing process on this set of data).
Basically the object is added to the index queue. At flush() time, the 
queue is processed.
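
The queue-then-flush behaviour described above can be sketched as 
follows. This is only an assumption about how it might look: IndexQueue 
is a hypothetical class, and the Consumer stands in for whatever actually 
writes a batch of documents to the Lucene index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical per-session queue of entities waiting to be indexed.
final class IndexQueue {
    private final List<Object> pending = new ArrayList<>();
    private final Consumer<List<Object>> indexWriter; // stand-in for the Lucene write

    IndexQueue(Consumer<List<Object>> indexWriter) {
        this.indexWriter = indexWriter;
    }

    // Called by index(Object): nothing touches the index yet.
    void index(Object entity) {
        pending.add(entity);
    }

    // Called at flush() time: hands all queued entities to the writer in one
    // batch and returns how many were processed.
    int flush() {
        if (pending.isEmpty()) {
            return 0;
        }
        List<Object> batch = new ArrayList<>(pending);
        pending.clear();
        indexWriter.accept(batch);
        return batch.size();
    }
}
```

Handing the writer one batch per flush, rather than one call per object, 
is what lets the index be opened and closed once for the whole set.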

Design considerations:
The delegation vs subclassing strategy for LuceneSession (i.e. 
LuceneSession delegating to a regular Session, allowing simple wrapping, 
or LuceneSessionImpl being a subclass of SessionImpl) is an ongoing 
discussion.
Using a subclassing model would allow the LuceneSession to keep 
operation queues (for batch indexing either through object changes or 
through session.index() ), but it does not allow a potential Hibernate - 
XXX integration on the same subclassing model. Batching is essential in 
Lucene for performance reasons.
Using the delegation model requires some SessionImpl modifications to be 
able to keep track of a generic context. This context will keep the 
operation queues.
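
To illustrate the delegation option, here is a minimal sketch in which 
the wrapped session exposes a generic context and the wrapper stores its 
operation queue there. All types are hypothetical stand-ins 
(SimpleSession, CoreSession, IndexingSession and the context() method are 
assumptions for this example, not the Hibernate Session API), and the 
"index" is just a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down session contract.
interface SimpleSession {
    void save(Object entity);
    void flush();
    Map<String, Object> context(); // the generic context the core session would expose
}

// Stand-in for the core session: it knows nothing about indexing.
final class CoreSession implements SimpleSession {
    private final Map<String, Object> context = new HashMap<>();

    public void save(Object entity) { /* persistence omitted */ }
    public void flush() { /* SQL flush omitted */ }
    public Map<String, Object> context() { return context; }
}

// Wrapper using delegation: its queue lives in the delegate's context,
// so the wrapper itself holds no per-session state of its own.
final class IndexingSession implements SimpleSession {
    private static final String QUEUE_KEY = "index.queue";
    private final SimpleSession delegate;
    final List<Object> indexed = new ArrayList<>(); // stand-in for the index

    IndexingSession(SimpleSession delegate) {
        this.delegate = delegate;
    }

    @SuppressWarnings("unchecked")
    private List<Object> queue() {
        return (List<Object>) delegate.context()
                .computeIfAbsent(QUEUE_KEY, k -> new ArrayList<Object>());
    }

    public void save(Object entity) {
        delegate.save(entity);
        queue().add(entity); // queued, not yet indexed
    }

    public void flush() {
        delegate.flush();
        indexed.addAll(queue()); // process the whole queue as one batch
        queue().clear();
    }

    public Map<String, Object> context() {
        return delegate.context();
    }
}
```

Because the context is generic, another integration could wrap the same 
core session and keep its own entry under a different key, which is the 
case the subclassing model does not handle.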

*ToDo*
Argue about the LuceneSession design and pick one (Steve/Emmanuel/Feel 
free to join the dance)

Find a way to keep the DocumentBuilder (sort of EntityPersister) at the 
SessionFactory level rather than the EventListener level (Steve/Emmanuel)

Implement the use of FieldBridge for all properties. It is currently 
used for the id property only (trivial).

Batch changes: to do that I need to be able to keep a session-related 
queue of all insert/update changes. I can't in the current design 
because SessionImpl does not have such a concept and because the 
LuceneSession is built on the delegation model. We need to discuss the 
strategy here (delegation vs subclassing).

Massive batch changes: in some systems, we don't really bother with "real 
time" index synchronization. For those, a common batch changes queue 
(i.e. across several sessions) would make sense, with a queue flush 
happening every hour, for example.

Clustered Directory: think about that. A JDBC Directory might not be the 
perfect solution actually.

fetch profile

Align the field indexing annotations to Lucene 2.0

Think about Analyzer to give the same flexibility @Boost provides

Make Lucene query parameterizable: query.setParameter();

Implement additional strategies to load objects on query.list()

hibernate-devel