From: Michael B. <mbe...@mb...> - 2006-01-25 17:16:28
Alex Milowski wrote:
> Has anyone tried out the full text support with Chinese/Japanese or
> script-based languages like Thai?

Yes. The good news is that eXist itself won't trash the encoding of either the documents or the queries (though some of the application frameworks and client libraries commonly used with eXist will, sometimes in a big way).

That said, eXist works as well as any text indexing and retrieval system can when it

(a) attempts to tokenise on word boundaries on the Western model,
(b) in so doing employs the default Unicode classes, which result in ideographic characters being treated as "words" even where they are actually part of a longer sequence that forms a semantic unit (and is thus roughly equivalent to a "word" in Western terms), and
(c) has no inbuilt knowledge of positional or composite encoding such as are used in Arabic and Thai respectively.

Whether that is good enough depends very much on the use-cases (and to some extent also on the data). If the SimpleTokenizer class is left as is, then Chinese users who enter multi-ideographic sequences, or Japanese users who, as well as such sequences, enter ideographs with affixed inflections that employ a syllabary, will not get the matches they expect, because eXist will index each ideograph in the sequence as a "word" in its own right. So either

(a) such searches must be intercepted and rewritten using the near() function against the fulltext index (sketched in the first example below), or
(b) a string-range index plus standard XPath operators/functions, not eXist's fulltext ones, must be used, or
(c) SimpleTokenizer needs to be modified to alter the treatment of ideographs.

None of these solutions is wholly satisfactory. Proper handling of such languages requires specialised tokenization in both the indexer and the query parser, and for both Chinese and Japanese that can only be achieved with the help of a lexicon accessible to the tokenizer itself, or by pre-processing the documents to be stored using a segmenter, also partially lexicon-driven, which (in the most common case) inserts ASCII spaces between the semantic character-sequence units it detects (the second sketch below shows the mechanism). Once those spaces are present, of course, the eXist tokenizer can be told to treat ideographs as "letters", not "words", and it will correctly tokenise the sequences at the inserted ASCII space boundaries. The spaces can be stripped out when presenting results to the user, who however may have to be told to insert ASCII spaces in some types of query terms, since the query parser knows nothing about segmenting.

Yet another approach is to abandon the attempt to index at the lexical token level and instead use n-gram indexation (third sketch below), which is what is done AFAIK by many industrial-strength CJK text retrieval systems (and also by Sleepycat's BDB XML), but that would take eXist rather far from its current indexation model. N-gram indexation doesn't of course remove the problem of sequence segmentation, but it does prevent a whitespace boundary model getting in the way of index building.

Pluggable tokenizers of the kind catered for in Lucene would be an important step forward, but I myself am not convinced that a truly writing-system-independent fulltext retrieval system is possible. That doesn't mean that, given sufficient resources, eXist couldn't be adapted to work with any specific "exotic" writing system.
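To make option (a) concrete, here is a rough sketch in Java (eXist's own language) of the kind of query interception I mean. The class and method names are my own invention, and the exact near() syntax should be checked against the eXist version you are actually running; treat this as an illustration of the rewrite, not as working eXist integration code.

// Hypothetical helper: rewrite a run of CJK ideographs into a near()
// query so that adjacent ideographs must occur together, instead of
// each ideograph matching as an independent "word".
public class CjkQueryRewriter {

    /** True for code points in the main CJK Unified Ideographs block. */
    static boolean isIdeograph(int cp) {
        return Character.UnicodeBlock.of(cp)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    /** Turns e.g. "中文检索" into near(//SPEECH, '中 文 检 索'):
        every ideograph becomes its own indexed token, constrained
        to adjacency by near(). */
    static String rewrite(String contextPath, String query) {
        StringBuilder terms = new StringBuilder();
        int prev = -1;
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            // Split only around ideographs; leave Western words intact.
            if (prev != -1 && (isIdeograph(cp) || isIdeograph(prev))) {
                terms.append(' ');
            }
            terms.appendCodePoint(cp);
            prev = cp;
            i += Character.charCount(cp);
        }
        return "near(" + contextPath + ", '" + terms + "')";
    }

    public static void main(String[] args) {
        System.out.println(rewrite("//SPEECH", "中文检索"));
        // prints: near(//SPEECH, '中 文 检 索')
    }
}

The same interception point could just as easily route the query to a string-range index instead (option (b) above).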
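The segmenting pre-processor is easier to show than to describe. Below, equally hedged, is the standard greedy longest-match ("maximum matching") heuristic: walk the text, always take the longest lexicon entry that matches at the current position, and emit ASCII spaces between the units found. Production segmenters use very large lexicons plus statistical disambiguation; the four-entry lexicon here is purely a toy.

import java.util.Set;

// Lexicon-driven segmenter of the kind described above: it inserts
// ASCII spaces between the semantic character sequences it detects,
// falling back to single characters for material not in the lexicon.
public class MaxMatchSegmenter {

    private final Set<String> lexicon;
    private final int maxWordLen;

    MaxMatchSegmenter(Set<String> lexicon) {
        this.lexicon = lexicon;
        this.maxWordLen =
                lexicon.stream().mapToInt(String::length).max().orElse(1);
    }

    String segment(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(maxWordLen, text.length() - i);
            // Shrink the candidate until it matches a lexicon entry
            // or is reduced to a single character.
            while (len > 1 && !lexicon.contains(text.substring(i, i + len))) {
                len--;
            }
            if (out.length() > 0) out.append(' ');
            out.append(text, i, i + len);
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        MaxMatchSegmenter seg = new MaxMatchSegmenter(
                Set.of("自然", "言語", "処理", "自然言語"));
        System.out.println(seg.segment("自然言語処理"));
        // prints: 自然言語 処理  (the longest lexicon match wins)
    }
}

Once a document has been through such a pass, the stored text has Western-style word boundaries and the existing whitespace-driven indexer works unmodified.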
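And the n-gram alternative, reduced to its core: both documents and queries are decomposed into overlapping fixed-length character sequences (bigrams are typical for CJK), so no notion of a word boundary is needed at index time at all. This is only an illustration of the principle, not anything that plugs into eXist as it stands.

import java.util.ArrayList;
import java.util.List;

// Overlapping character bigrams: a query longer than two characters
// is itself bigrammed and matched as a phrase of adjacent grams, so
// word segmentation is never required for index building.
public class BigramTokenizer {

    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        for (int i = 0; i + 1 < cps.length; i++) {
            grams.add(new String(cps, i, 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文信息检索"));
        // prints: [中文, 文信, 信息, 息检, 检索]
    }
}

The usual price is a considerably larger index and some false matches across unit boundaries, a trade-off worth keeping in mind when weighing this route.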
But I would not advise any developer to invest significant effort in supporting text retrieval in non-Western scripts unless (a) they themselves have a good working knowledge of the language(s) concerned and/or (b) they have ready access to someone who does know the language AND who has the patience and communication skills to explain to the developer the complex linguistic issues that a processing system needs to handle. Otherwise the risk of creating a product that a client simply can't use (even though it appears to meet that client's initial specs) is very high.

Michael Beddow