From: Achyuth P. <ach...@gm...> - 2023-07-14 06:27:42
Hi Developers,

I am attaching the tokens generated by Java Lucene and by CLucene. For non-Latin text, StandardAnalyzer gives me different tokens in the two libraries. Is there a way to make CLucene generate the same tokens as Java Lucene?

Thanks & Regards,
Achyuth Pramod

On Mon, Jul 10, 2023 at 6:44 PM Kostka Bořivoj <ko...@to...> wrote:

> CLucene supports at least Unicode plane 0.
>
> CLucene uses wchar_t as its internal representation, while indexes use UTF-8.
>
> You must not set ENABLE_ASCII_MODE in CMake during the build, otherwise only
> US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported.
>
> I'm not 100% sure about the Standard Analyzer, because we don't use it, but I
> can't see any problem in it.
>
> In your Greek query, the problem could also be with lowercasing and the
> "final sigma" (ς) character (see https://en.wikipedia.org/wiki/Sigma).
>
> Hope this helps
>
> Borivoj
>
> *From:* Achyuth Pramod [mailto:ach...@gm...]
> *Sent:* Monday, July 10, 2023 2:32 PM
> *To:* clu...@li...
> *Subject:* [CLucene-dev] Inquiry about CLucene's UTF-8 support
>
> Dear developers,
>
> I am using CLucene in my project and would like to ask about UTF-8
> support in the Standard Analyzer. Specifically, I would like to know whether
> the Standard Analyzer handles tokenization and text processing correctly for
> non-Latin UTF-8 encoded text.
>
> Could you please confirm whether the Standard Analyzer in CLucene has built-in
> support for UTF-8 encoded text? If not, are there any recommended
> alternatives or additional analyzers that provide better support for
> non-Latin UTF-8 text?
>
> Below are the search results for a few queries:
>
> Max Docs: 1
> Num Docs: 1
> Current Version: 1688707923968.0
> Term count: 66
>
> Enter query string: dignissimos
> Searching for: dignissimos
>
> 0. /home/nonLatin100Rows.csv - 0.04746387
>
> Search took: 0 ms.
> Screen dump took: 0 ms.
>
> Enter query string: διαχειριστής
> Searching for:
>
> Search took: 0 ms.
> Screen dump took: 0 ms.
>
> Thank you for your time.
>
> - Achyuth Pramod
>
> _______________________________________________
> CLucene-developers mailing list
> CLu...@li...
> https://lists.sourceforge.net/lists/listinfo/clucene-developers
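
The final-sigma point in the quoted reply can be illustrated with a small, library-independent sketch. It does not call CLucene or Java Lucene and makes no claim about how either analyzer lowercases text; the helper names (`per_char_lower`, `fold_sigma`) are hypothetical. It only shows how a lowercasing step that maps each character in isolation can produce a term ending in σ (U+03C3) while the typed query ends in ς (U+03C2), and how folding both forms to one sigma makes them comparable. It also shows that the Greek query is 12 characters but 24 bytes in UTF-8, which is why byte-oriented (ASCII-mode) handling would not see it as letters.

```python
# Illustrative sketch only: plain Python, no CLucene or Lucene calls.
# Helper names below are hypothetical, not part of either library.

FINAL_SIGMA = "\u03c2"  # ς, used at the end of a word
SIGMA = "\u03c3"        # σ, used elsewhere

# The Greek query from the thread, written with escapes so the
# code points are unambiguous: διαχειριστής
QUERY = (
    "\u03b4\u03b9\u03b1\u03c7\u03b5\u03b9"
    "\u03c1\u03b9\u03c3\u03c4\u03ae\u03c2"
)


def per_char_lower(text: str) -> str:
    """Lowercase each character in isolation, with no word context.

    A capital sigma therefore always becomes σ, never ς.
    """
    return "".join(ch.lower() for ch in text)


def fold_sigma(text: str) -> str:
    """Map final sigma to regular sigma so both spellings compare equal."""
    return text.replace(FINAL_SIGMA, SIGMA)


if __name__ == "__main__":
    utf8_bytes = QUERY.encode("utf-8")
    print("characters:", len(QUERY), "UTF-8 bytes:", len(utf8_bytes))

    # Text stored in upper case, then lowercased character by character.
    indexed_term = per_char_lower(QUERY.upper())
    print("raw match:   ", indexed_term == QUERY)
    print("folded match:", fold_sigma(indexed_term) == fold_sigma(QUERY))
```

If both the indexing side and the query side apply the same folding, the two spellings compare equal; whether that is the actual cause of the empty "Searching for:" line above would still need to be checked against the attached token dumps.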