From: Christian W. <wi...@ka...> - 2003-07-31 12:58:20
"Michael Beddow" <mbn...@mb...> writes: > In his latest CVS announcement, Wolfgang characteristically didn't reveal > the amount of effort he has put into clarifying and resolving this issue > over the past few days, for which eXist users (including vast numbers of > potential users in Asia who badly need an XML database that handles their > writing systems properly) should be even more grateful to him than we all > are already. Indeed, Wolfgang did a tremendous work tracking down and crushing all these bugs. Thanks, Wolfgang! [...] > > There are two issues here. > > 1) Avoiding incorrect agglomeration of ideographical sequences into > meaningless tokens. This is relatively simple. The lexer would need to > recognise codepoints in the character ranges concerned (for which there are > defined Unicode range names) and treat each such character as a discrete > token. This would allow eXist's fulltext index and the matching methods that > invoke it to operate at single-character level on ideographic ranges. Its > implicit ANDing behaviours would then also allow CJKV character sequences to > be matched correctly in most circumstances Exactly. This could be added to the current code without breaking anything. > > 2) Avoiding, consequent upon (1), the incorrect splitting of semantically > coherent ideographic character sequences which ought to be indexed as > unbroken sequences. This is a HUGE problem. Though much research has been > done on it, AFAIK there is no context-free method for doing this. Not only > are look-aheads, look-behinds, and dictionary lookups required to identify > candidate boundaries, there can then be disambiguation problems requiring a > second tier of lookups and possibly much backtracking. I don't think eXist > could be expected to address this issue. However, the lexer is pluggable, > and if someone can come up with the appropriate code, it could be slotted > in. I would expect a dramatic drop in performance, though. 
Some of the ambiguities could not even reasonably be solved with such an
approach (in some cases they are intended to be ambiguous :-) or are
simply an artefact of the writing system. I think English would show the
same problems if written without spaces; white space is in fact the most
important markup that we use. So I simply do not think it would be
practical to put this burden on eXist; those who really think they need
it (and can live with the results) should do some preprocessing before
they send the stuff to eXist. This has the additional advantage of making
manual intervention possible.

> I personally would be unhappy about a switch to n-gram based indexing
> (which in any case would not really eliminate problem 2, though it would
> remove some of the unwanted side-effects of Latin word-based lexing)
> because I have serious uses for eXist's ability to expose its fulltext
> index for each collection (which I use as the basis for creating text
> concordances driven by eXist). I see this as an important additional
> application for eXist in the area of linguistic research for which there
> is at the moment no equivalent Open Source tool. I hope to produce a
> paper on this aspect before too long.

Looking forward to that paper. I hope you do not misunderstand me here:
my suggestions were intended as additional (configurable) possibilities
beside what we have now, not instead of it. Replacing the current scheme
with n-grams would probably break most existing applications.

> Christian's suggestion re U+200B or U+200C and suchlike strikes me as
> promising. If the lexer were configured to treat such characters as word
> boundaries between characters in the ideographic ranges, then the onus
> would be on preparers of CJK documents to use these characters to
> delineate sequences. Provided they did so, eXist would then be able to
> handle both Latin "words" and CJK sequences with equal appropriateness.
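To make the U+200B / U+200C convention concrete, here is a minimal sketch (hypothetical names, not an eXist API) of the splitting step a lexer or preprocessor would perform: document preparers insert the invisible boundary characters inside ideographic runs, and the code splits on them.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the zero-width-boundary convention: U+200B (ZERO WIDTH SPACE)
// and U+200C (ZERO WIDTH NON-JOINER) mark word boundaries that are
// invisible when the text is displayed. Names here are illustrative only.
public class ZwspBoundarySketch {

    // Split a run of text on the invisible boundary characters.
    public static List<String> splitOnInvisibleBoundaries(String text) {
        return Arrays.asList(text.split("[\u200B\u200C]+"));
    }

    public static void main(String[] args) {
        // Two two-character words separated by a zero width space.
        String marked = "\u4eac\u90fd\u200B\u5927\u5b66";
        System.out.println(splitOnInvisibleBoundaries(marked));
    }
}
```

Displayed or printed, the marked text is indistinguishable from the unmarked original, which is what makes this scheme attractive: the segmentation effort (and responsibility) stays with the document preparer, while eXist only has to treat one extra character class as a word boundary.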
It just occurred to me that a user option could be to pipe a document
through an XSLT transformation to produce the index. That would allow
very fine-grained control over every aspect of indexing and could, for
example, also deal with things like regularization, text-critical
editions and other instances of non-linear text. If this were a
configurable option, only those willing to accept the overhead and
performance hit (on index generation only, I assume) would be affected.

All the best,

Christian

--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN