From: Michael B. <mbe...@mb...> - 2007-02-19 15:05:34
Oystein Reigem wrote:

> > From the discussions I learn there is work going on with including
> > Lucene's indexing capabilities in eXist.

[Mr Sourceforge has gone into retentive mode and didn't send me either Oystein's posting or Wolfgang's reply, though I did see the subsequent follow-ups. And attempting to catch up via Gmane suggests that something has gone wrong with the Sourceforge listserver's threading.]

I don't think anyone talked about "including Lucene's indexing capabilities". There are (as Wolfgang and I have separately tried to emphasise) two quite distinct tracks involving Lucene.

One, originally devised by Adam and being taken forward by Patrick, is a way of keeping a Lucene and an eXist docbase in synch. Put at its simplest, any operation that causes eXist to reindex will trigger a comparable re-indexing in Lucene. This allows an application to mix and match the distinctive strengths of the two very different architectures.

The other undertaking is completely unrelated, in both techniques and purposes. It is to make some aspects of eXist's indexation (primarily though not exclusively tokenization) pluggable, with an API that would allow the use of Lucene components, among others, for that purpose. But the low-level storage and retrieval architecture that such a pluggable indexer would serve would remain distinctively eXist's, and would not necessarily imply any use of Lucene's core techniques, which address different needs.

> > Of course it will allow simple stuff like using different tokenizers.

Actually it's not that simple at all. Those of us who are especially keen to see a pluggable tokenizer architecture are not much bothered about things like where and when an apostrophe or a hyphen marks a token delimiter.
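To make the first track concrete, here is a minimal sketch of the listener pattern that such a synchronization could rest on. Every name in it (XmlStore, IndexListener, MirrorFulltextIndex) is invented for illustration; the actual eXist trigger API and Lucene IndexWriter calls look quite different. The point is only the shape: the XML store fires an event on every (re)index or removal, and the fulltext side mirrors it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical listener notified whenever the XML store (re)indexes a document. */
interface IndexListener {
    void documentIndexed(String uri, String text);
    void documentRemoved(String uri);
}

/** Stand-in for the eXist side: every update drives its own indexing AND any listeners. */
class XmlStore {
    private final List<IndexListener> listeners = new ArrayList<>();

    void addListener(IndexListener l) { listeners.add(l); }

    void store(String uri, String text) {
        // ... eXist's own structural indexing would happen here ...
        for (IndexListener l : listeners) l.documentIndexed(uri, text);
    }

    void remove(String uri) {
        for (IndexListener l : listeners) l.documentRemoved(uri);
    }
}

/** Stand-in for the Lucene side: mirrors the docbase into a fulltext index. */
class MirrorFulltextIndex implements IndexListener {
    final Map<String, String> index = new HashMap<>();
    public void documentIndexed(String uri, String text) { index.put(uri, text); }
    public void documentRemoved(String uri) { index.remove(uri); }
}

public class SyncDemo {
    public static void main(String[] args) {
        XmlStore store = new XmlStore();
        MirrorFulltextIndex lucene = new MirrorFulltextIndex();
        store.addListener(lucene);

        store.store("/db/docs/a.xml", "<p>hello</p>");
        System.out.println(lucene.index.containsKey("/db/docs/a.xml"));
        store.remove("/db/docs/a.xml");
        System.out.println(lucene.index.isEmpty());
    }
}
```

Because both updates flow through the same event, the two indexes cannot drift apart, which is what lets an application query each engine for what it is best at.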
We want eXist to work properly with languages where the sort of things that simple tokenizers can recognise (derived as they are from lexers for artificial languages) are of no use whatsoever in identifying the units we want to index and search on. Amend the statement to read "it will allow really complex stuff like using different tokenizers" and it might capture some of the importance of this matter for working with non-European languages (not to mention a few languages that are geographically "European" but not linguistically so).

> > I assume eXist gets the ability to index and search non-XML documents.

I fervently hope not. I for one will drop it like a hot brick if there is any attempt to divorce it from the XML data model. What the line Adam and Patrick are pioneering could allow is for developers and deployers to build applications that will allow *users* to index and search non-XML documents, transparently using Lucene for that purpose in tandem with eXist. But I for one hope the X in eXist is there for a purpose, and is there to stay.

> > But will the queries still be XQuery queries? In case - what is it that
> > gets queried? A pure text version of the original documents?

As above. I trust they will always be queries expressed either in W3C standard forms, or in ways which seem likely to be compatible with, or readily adaptable to, whatever emerges as the W3C way of performing the tasks involved (currently updates and fulltext operations).

> > What about features that are unique to Lucene queries - ranked
> > searching, boosting a term, etc.?

These aren't unique to Lucene. They are common to most purpose-built fulltext information retrieval systems, which eXist is not. There is no reason why eXist shouldn't get them (they are completely compatible with the XML data model, unlike some of the features of Lucene expressly designed to allow other ways of implementing "regions" in textual data).
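A small sketch of why this is "really complex stuff": a whitespace-based tokenizer collapses an unsegmented string (as in Japanese or Chinese) into one useless token, whereas even a crude overlapping-bigram strategy at least produces indexable units. The Tokenizer interface and both classes below are invented for illustration, not any actual eXist or Lucene API; real analyzers for these languages do far more than bigrams.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pluggable tokenizer contract (illustrative only). */
interface Tokenizer {
    List<String> tokenize(String text);
}

/** Whitespace splitting: adequate for many space-delimited European languages. */
class WhitespaceTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }
}

/** Overlapping character bigrams: a crude fallback for unsegmented scripts. */
class BigramTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }
}

public class TokenizerDemo {
    public static void main(String[] args) {
        Tokenizer ws = new WhitespaceTokenizer();
        Tokenizer bi = new BigramTokenizer();
        // Space-delimited text: whitespace splitting finds the word units.
        System.out.println(ws.tokenize("full text search"));
        // Unsegmented text ("fulltext search" in Japanese): whitespace
        // splitting yields one opaque token, bigrams yield overlapping units.
        System.out.println(ws.tokenize("全文検索"));
        System.out.println(bi.tokenize("全文検索"));
    }
}
```

A pluggable architecture means the choice between such strategies (or a proper morphological analyzer) becomes a per-collection configuration rather than a property baked into the engine.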
eXist doesn't currently have them, but as W3C deliberations on fulltext extensions become clearer that is very likely to change. But that would be because they are necessary features of any fullblown fulltext engine, not because of any incorporation of Lucene specifically into eXist.

Michael Beddow