From: Christian W. <wi...@ka...> - 2003-07-31 12:58:20
"Michael Beddow" <mbn...@mb...> writes: > In his latest CVS announcement, Wolfgang characteristically didn't reveal > the amount of effort he has put into clarifying and resolving this issue > over the past few days, for which eXist users (including vast numbers of > potential users in Asia who badly need an XML database that handles their > writing systems properly) should be even more grateful to him than we all > are already. Indeed, Wolfgang did a tremendous work tracking down and crushing all these bugs. Thanks, Wolfgang! [...] > > There are two issues here. > > 1) Avoiding incorrect agglomeration of ideographical sequences into > meaningless tokens. This is relatively simple. The lexer would need to > recognise codepoints in the character ranges concerned (for which there are > defined Unicode range names) and treat each such character as a discrete > token. This would allow eXist's fulltext index and the matching methods that > invoke it to operate at single-character level on ideographic ranges. Its > implicit ANDing behaviours would then also allow CJKV character sequences to > be matched correctly in most circumstances Exactly. This could be added to the current code without breaking anything. > > 2) Avoiding, consequent upon (1), the incorrect splitting of semantically > coherent ideographic character sequences which ought to be indexed as > unbroken sequences. This is a HUGE problem. Though much research has been > done on it, AFAIK there is no context-free method for doing this. Not only > are look-aheads, look-behinds, and dictionary lookups required to identify > candidate boundaries, there can then be disambiguation problems requiring a > second tier of lookups and possibly much backtracking. I don't think eXist > could be expected to address this issue. However, the lexer is pluggable, > and if someone can come up with the appropriate code, it could be slotted > in. I would expect a dramatic drop in performance, though. 
Some of the ambiguities could not even reasonably be solved with such an
approach (in some cases they are intended to be ambiguous :-) or are
simply an artefact of the writing system. I think English would show the
same problems if written without spaces; white space is in fact the most
important markup that we use. So I simply do not think it would be
practical to put this burden on eXist; those who really think they need
it (and can live with the results) should do some preprocessing before
they send the stuff to eXist. This has the additional advantage of making
manual intervention possible.

> I personally would be unhappy about a switch to n-gram based indexing
> (which in any case would not really eliminate problem 2, though it would
> remove some of the unwanted side-effects of Latin word-based lexing)
> because I have serious uses for eXist's ability to expose its fulltext
> index for each collection (which I use as the basis for creating text
> concordances driven by eXist). I see this as an important additional
> application for eXist in the area of linguistic research for which there
> is at the moment no equivalent Open Source tool. I hope to produce a
> paper on this aspect before too long.

Looking forward to that paper. I hope you do not misunderstand me here:
my suggestions were intended as additional (configurable) possibilities
beside what we have now, not instead of it. Replacing the current scheme
with n-grams would probably break most existing applications.

> Christian's suggestion re U+200B or U+200C and suchlike strikes me as
> promising. If the lexer were configured to treat such characters as word
> boundaries between characters in the ideographic ranges, then the onus
> would be on preparers of CJK documents to use these characters to
> delineate sequences. Provided they did so, eXist would then be able to
> handle both Latin "words" and CJK sequences with equal appropriateness.
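To make the U+200B / U+200C convention concrete, here is a minimal sketch (hypothetical names, not an eXist API) of the splitting step a lexer or preprocessor would perform: document preparers insert the invisible boundary characters inside ideographic runs, and the code splits on them.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the zero-width-boundary convention: U+200B (ZERO WIDTH SPACE)
// and U+200C (ZERO WIDTH NON-JOINER) mark word boundaries that are
// invisible when the text is displayed. Names here are illustrative only.
public class ZwspBoundarySketch {

    // Split a run of text on the invisible boundary characters.
    public static List<String> splitOnInvisibleBoundaries(String text) {
        return Arrays.asList(text.split("[\u200B\u200C]+"));
    }

    public static void main(String[] args) {
        // Two two-character words separated by a zero width space.
        String marked = "\u4eac\u90fd\u200B\u5927\u5b66";
        System.out.println(splitOnInvisibleBoundaries(marked));
    }
}
```

Displayed or printed, the marked text is indistinguishable from the unmarked original, which is what makes this scheme attractive: the segmentation effort (and responsibility) stays with the document preparer, while eXist only has to treat one extra character class as a word boundary.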
It just occurred to me that a user option could be to pipe a document
through an XSLT transformation to produce the index. That would allow
very fine-grained control over every aspect of indexing and could, for
example, also deal with things like regularization, text-critical
editions and other instances of non-linear text. If this were a
configurable option, only those willing to accept the overhead and
performance hit (on index generation only, I assume) would be affected.

All the best,

Christian

--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN