From: Michael B. <mbe...@mb...> - 2007-02-19 15:05:34
Oystein Reigem wrote:

> > From the discussions I learn there is work going on with including
> > Lucene's indexing capabilities in eXist.

[Mr Sourceforge has gone into retentive mode and didn't send me either Oystein's posting or Wolfgang's reply, though I did see the subsequent follow-ups. And attempting to catch up via Gmane suggests that something has gone wrong with the Sourceforge listserver's threading.]

I don't think anyone talked about "including Lucene's indexing capabilities". There are (as Wolfgang and I have separately tried to emphasise) two quite distinct tracks involving Lucene.

One, originally devised by Adam and being taken forward by Patrick, is a way of keeping a Lucene and an eXist docbase in synch. Put at its simplest, any operation that causes eXist to reindex will trigger a comparable re-indexing in Lucene. This allows an application to mix and match the distinctive strengths of the two very different architectures.

The other undertaking is completely unrelated, in both techniques and purposes. It is to make some aspects of eXist's indexation (primarily though not exclusively tokenization) pluggable, with an API that would allow the use of Lucene components, among others, for that purpose. But the low-level storage and retrieval architecture that such a pluggable indexer would serve would remain distinctively eXist's, and would not necessarily imply any use of Lucene's core techniques, which address different needs.

> > Of course it will allow simple stuff like using different tokenizers.

Actually it's not that simple at all. Those of us who are especially keen to see a pluggable tokenizer architecture are not much bothered about things like where and when an apostrophe or a hyphen marks a token delimiter.
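To make the first track concrete, here is a minimal sketch of the listener pattern that such a synchronization could rest on. Every name in it (XmlStore, IndexListener, MirrorFulltextIndex) is invented for illustration; the actual eXist trigger API and Lucene IndexWriter calls look quite different. The point is only the shape: the XML store fires an event on every (re)index or removal, and the fulltext side mirrors it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical listener notified whenever the XML store (re)indexes a document. */
interface IndexListener {
    void documentIndexed(String uri, String text);
    void documentRemoved(String uri);
}

/** Stand-in for the eXist side: every update drives its own indexing AND any listeners. */
class XmlStore {
    private final List<IndexListener> listeners = new ArrayList<>();

    void addListener(IndexListener l) { listeners.add(l); }

    void store(String uri, String text) {
        // ... eXist's own structural indexing would happen here ...
        for (IndexListener l : listeners) l.documentIndexed(uri, text);
    }

    void remove(String uri) {
        for (IndexListener l : listeners) l.documentRemoved(uri);
    }
}

/** Stand-in for the Lucene side: mirrors the docbase into a fulltext index. */
class MirrorFulltextIndex implements IndexListener {
    final Map<String, String> index = new HashMap<>();
    public void documentIndexed(String uri, String text) { index.put(uri, text); }
    public void documentRemoved(String uri) { index.remove(uri); }
}

public class SyncDemo {
    public static void main(String[] args) {
        XmlStore store = new XmlStore();
        MirrorFulltextIndex lucene = new MirrorFulltextIndex();
        store.addListener(lucene);

        store.store("/db/docs/a.xml", "<p>hello</p>");
        System.out.println(lucene.index.containsKey("/db/docs/a.xml"));
        store.remove("/db/docs/a.xml");
        System.out.println(lucene.index.isEmpty());
    }
}
```

Because both updates flow through the same event, the two indexes cannot drift apart, which is what lets an application query each engine for what it is best at.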
We want eXist to work properly with languages where the sort of things that simple tokenizers can recognise (derived as they are from lexers for artificial languages) are of no use whatsoever in identifying the units we want to index and search on. Amend the statement to read "it will allow really complex stuff like using different tokenizers" and it might capture some of the importance of this matter for working with non-European languages (not to mention a few languages that are geographically "European" but not linguistically so).

> > I assume eXist gets the ability to index and search non-XML documents.

I fervently hope not. I for one will drop it like a hot brick if there is any attempt to divorce it from the XML data model. What the line Adam and Patrick are pioneering could allow is for developers and deployers to build applications that will allow *users* to index and search non-XML documents, transparently using Lucene for that purpose in tandem with eXist. But I for one hope the X in eXist is there for a purpose, and is there to stay.

> > But will the queries still be XQuery queries? In case - what is it that
> > gets queried? A pure text version of the original documents?

As above. I trust they will always be queries expressed either in W3C standard forms, or in ways which seem likely to be compatible with, or readily adaptable to, whatever emerges as the W3C way of performing the tasks involved (currently updates and fulltext operations).

> > What about features that are unique to Lucene queries - ranked
> > searching, boosting a term, etc.?

These aren't unique to Lucene. They are common to most purpose-built fulltext information retrieval systems, which eXist is not. There is no reason why eXist shouldn't get them (they are completely compatible with the XML data model, unlike some of the features of Lucene expressly designed to allow other ways of implementing "regions" in textual data).
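A small sketch of why this is "really complex stuff": a whitespace-based tokenizer collapses an unsegmented string (as in Japanese or Chinese) into one useless token, whereas even a crude overlapping-bigram strategy at least produces indexable units. The Tokenizer interface and both classes below are invented for illustration, not any actual eXist or Lucene API; real analyzers for these languages do far more than bigrams.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pluggable tokenizer contract (illustrative only). */
interface Tokenizer {
    List<String> tokenize(String text);
}

/** Whitespace splitting: adequate for many space-delimited European languages. */
class WhitespaceTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }
}

/** Overlapping character bigrams: a crude fallback for unsegmented scripts. */
class BigramTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }
}

public class TokenizerDemo {
    public static void main(String[] args) {
        Tokenizer ws = new WhitespaceTokenizer();
        Tokenizer bi = new BigramTokenizer();
        // Space-delimited text: whitespace splitting finds the word units.
        System.out.println(ws.tokenize("full text search"));
        // Unsegmented text ("fulltext search" in Japanese): whitespace
        // splitting yields one opaque token, bigrams yield overlapping units.
        System.out.println(ws.tokenize("全文検索"));
        System.out.println(bi.tokenize("全文検索"));
    }
}
```

A pluggable architecture means the choice between such strategies (or a proper morphological analyzer) becomes a per-collection configuration rather than a property baked into the engine.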
eXist doesn't currently have them, but as W3C deliberations on fulltext extensions become clearer that is very likely to change. But that would be because they are necessary features of any fullblown fulltext engine, not because of any incorporation of Lucene specifically into eXist.

Michael Beddow