From: Michael B. <mbe...@mb...> - 2006-01-25 17:16:28
Alex Milowski wrote:
> Has anyone tried out the full text support with Chinese/Japanese or
> script-based languages like Thai?

Yes. The good news is that eXist itself won't trash the encoding of either the documents or the queries (though some of the application frameworks and client libraries commonly used with eXist will, sometimes in a big way).

That said, eXist works as well as any text indexing and retrieval system can when it

(a) attempts to tokenise on word boundaries on the Western model,
(b) in so doing employs the default Unicode classes, which result in ideographic characters being treated as "words" even where they are actually part of a longer sequence that forms a semantic unit (and is thus roughly equivalent to a "word" in Western terms), and
(c) has no inbuilt knowledge of positional or composite encoding such as are used in Arabic and Thai respectively.

Whether that is good enough depends very much on the use-cases (and to some extent also on the data). If the SimpleTokenizer class is left as is, then Chinese users who enter multi-ideographic sequences, or Japanese users who, as well as such sequences, enter ideographs with affixed inflections that employ a syllabary, will not get the matches they expect, because eXist will index each ideograph in the sequence as a "word" in its own right. So either

(a) such searches must be intercepted and rewritten using the near() function against the fulltext index (sketched in the first example below), or
(b) a string-range index plus standard XPath operators/functions, not eXist's fulltext ones, must be used, or
(c) SimpleTokenizer needs to be modified to alter the treatment of ideographs.

None of these solutions is wholly satisfactory. Proper handling of such languages requires specialised tokenization in both the indexer and the query parser, and for both Chinese and Japanese that can only be achieved with the help of a lexicon accessible to the tokenizer itself, or by pre-processing the documents to be stored using a segmenter, also partially lexicon-driven, which (in the most common case) inserts ASCII spaces between the semantic character-sequence units it detects (the second sketch below shows the mechanism). Once those spaces are present, of course, the eXist tokenizer can be told to treat ideographs as "letters", not "words", and it will correctly tokenise the sequences at the inserted ASCII space boundaries. The spaces can be stripped out when presenting results to the user, who however may have to be told to insert ASCII spaces in some types of query terms, since the query parser knows nothing about segmenting.

Yet another approach is to abandon the attempt to index at the lexical token level and instead use n-gram indexation (third sketch below), which is what is done AFAIK by many industrial-strength CJK text retrieval systems (and also by Sleepycat's BDB XML), but that would take eXist rather far from its current indexation model. N-gram indexation doesn't of course remove the problem of sequence segmentation, but it does prevent a whitespace boundary model getting in the way of index building.

Pluggable tokenizers of the kind catered for in Lucene would be an important step forward, but I myself am not convinced that a truly writing-system-independent fulltext retrieval system is possible. That doesn't mean that, given sufficient resources, eXist couldn't be adapted to work with any specific "exotic" writing system.
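To make option (a) concrete, here is a rough sketch in Java (eXist's own language) of the kind of query interception I mean. The class and method names are my own invention, and the exact near() syntax should be checked against the eXist version you are actually running; treat this as an illustration of the rewrite, not as working eXist integration code.

// Hypothetical helper: rewrite a run of CJK ideographs into a near()
// query so that adjacent ideographs must occur together, instead of
// each ideograph matching as an independent "word".
public class CjkQueryRewriter {

    /** True for code points in the main CJK Unified Ideographs block. */
    static boolean isIdeograph(int cp) {
        return Character.UnicodeBlock.of(cp)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    /** Turns e.g. "中文检索" into near(//SPEECH, '中 文 检 索'):
        every ideograph becomes its own indexed token, constrained
        to adjacency by near(). */
    static String rewrite(String contextPath, String query) {
        StringBuilder terms = new StringBuilder();
        int prev = -1;
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            // Split only around ideographs; leave Western words intact.
            if (prev != -1 && (isIdeograph(cp) || isIdeograph(prev))) {
                terms.append(' ');
            }
            terms.appendCodePoint(cp);
            prev = cp;
            i += Character.charCount(cp);
        }
        return "near(" + contextPath + ", '" + terms + "')";
    }

    public static void main(String[] args) {
        System.out.println(rewrite("//SPEECH", "中文检索"));
        // prints: near(//SPEECH, '中 文 检 索')
    }
}

The same interception point could just as easily route the query to a string-range index instead (option (b) above).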
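The segmenting pre-processor is easier to show than to describe. Below, equally hedged, is the standard greedy longest-match ("maximum matching") heuristic: walk the text, always take the longest lexicon entry that matches at the current position, and emit ASCII spaces between the units found. Production segmenters use very large lexicons plus statistical disambiguation; the four-entry lexicon here is purely a toy.

import java.util.Set;

// Lexicon-driven segmenter of the kind described above: it inserts
// ASCII spaces between the semantic character sequences it detects,
// falling back to single characters for material not in the lexicon.
public class MaxMatchSegmenter {

    private final Set<String> lexicon;
    private final int maxWordLen;

    MaxMatchSegmenter(Set<String> lexicon) {
        this.lexicon = lexicon;
        this.maxWordLen =
                lexicon.stream().mapToInt(String::length).max().orElse(1);
    }

    String segment(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(maxWordLen, text.length() - i);
            // Shrink the candidate until it matches a lexicon entry
            // or is reduced to a single character.
            while (len > 1 && !lexicon.contains(text.substring(i, i + len))) {
                len--;
            }
            if (out.length() > 0) out.append(' ');
            out.append(text, i, i + len);
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        MaxMatchSegmenter seg = new MaxMatchSegmenter(
                Set.of("自然", "言語", "処理", "自然言語"));
        System.out.println(seg.segment("自然言語処理"));
        // prints: 自然言語 処理  (the longest lexicon match wins)
    }
}

Once a document has been through such a pass, the stored text has Western-style word boundaries and the existing whitespace-driven indexer works unmodified.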
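And the n-gram alternative, reduced to its core: both documents and queries are decomposed into overlapping fixed-length character sequences (bigrams are typical for CJK), so no notion of a word boundary is needed at index time at all. This is only an illustration of the principle, not anything that plugs into eXist as it stands.

import java.util.ArrayList;
import java.util.List;

// Overlapping character bigrams: a query longer than two characters
// is itself bigrammed and matched as a phrase of adjacent grams, so
// word segmentation is never required for index building.
public class BigramTokenizer {

    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        for (int i = 0; i + 1 < cps.length; i++) {
            grams.add(new String(cps, i, 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文信息检索"));
        // prints: [中文, 文信, 信息, 息检, 检索]
    }
}

The usual price is a considerably larger index and some false matches across unit boundaries, a trade-off worth keeping in mind when weighing this route.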
But I would not advise any developer to invest significant effort in supporting text retrieval in non-Western scripts unless (a) they themselves have a good working knowledge of the language(s) concerned and/or (b) they have ready access to someone who does know the language AND who has the patience and communication skills to explain to the developer the complex linguistic issues that a processing system needs to handle. Otherwise the risk of creating a product that a client simply can't use (even though it appears to meet that client's initial specs) is very high.

Michael Beddow