From: Roushan <rou...@gm...> - 2025-06-04 02:55:06
|
Hi All,

I am new to the group and apologies in advance if the issue has been discussed before. I could not find any relevant thread after searching, so I am posting this question. I am trying to use a phrase query, however I am getting segmentation faults consistently. Here is the code that is causing it:

    void SegmentTermPositions::lazySkip() {
        if (proxStream == NULL) {
            // clone lazily
            proxStream = parent->proxStream->clone();   // <-- parent->proxStream is nullptr
        }
        ...
    }

After looking at the code it seems to me that the "prx" stream is not written or not even initialized. Am I missing something? Or is it expected that phrase queries do not work with CLucene? I am willing to make any code changes that may be needed to make it work. Any pointer and help would be much appreciated.

Regards,
Roushan |
From: Kostka B. <ko...@to...> - 2023-07-14 08:36:44
|
Hello,

I see differences in CJK languages (Chinese, Japanese, Korean). Note that segmentation (aka tokenization) for these languages is a very complex task because they do not use spaces to separate words. There are some techniques to work around this, e.g. creating bigrams. And of course there exist segmentation libraries based on NLP (e.g. Stanford has one). I think bigrams should be generated by the CLucene standard analyzer, but I've never tried that.

Also, in Greek the ending sigma is changed to the standard sigma character (as I mentioned in my previous email), but I don't think that should be a problem, since the same thing is done during the search.

I'm afraid there is no easy way to produce the same tokens as Java Lucene. You can of course modify the Standard Analyzer or write your own.

Regards,
Borek

From: Achyuth Pramod [mailto:ach...@gm...]
Sent: Friday, July 14, 2023 8:27 AM
To: clu...@li...
Subject: Re: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Hi Developers, I am attaching the tokens generated from Java Lucene and CLucene. I am getting different tokens for non-latin texts using StandardAnalyser. Is there a solution which will generate the same tokens for CLucene as the Java Lucene? Thanks & Regards, Achyuth Pramod

On Mon, Jul 10, 2023 at 6:44 PM Kostka Bořivoj <ko...@to...> wrote:

CLucene supports at least Unicode plane 0. CLucene uses wchar_t as internal representation, while indexes use UTF-8. You must not set ENABLE_ASCII_MODE in CMake during the build, otherwise only US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported. Not 100% sure about the Standard Analyzer, because we don't use it, but I can't see any problem in it.

In your Greek query, the problem can also be with lowercasing and the "ending sigma" (ς) character (see https://en.wikipedia.org/wiki/Sigma)

Hope this helps

Borivoj

From: Achyuth Pramod [mailto:ach...@gm...]
Sent: Monday, July 10, 2023 2:32 PM
To: clu...@li...
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Dear developers, I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text. Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos
0. /home/nonLatin100Rows.csv - 0.04746387
Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:
Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod

_______________________________________________
CLucene-developers mailing list
CLu...@li...
https://lists.sourceforge.net/lists/listinfo/clucene-developers |
From: Achyuth P. <ach...@gm...> - 2023-07-14 06:27:42
|
Hi Developers, I am attaching the tokens generated from Java Lucene and CLucene. I am getting different tokens for non-latin texts using StandardAnalyser. Is there a solution which will generate the same tokens for CLucene as the Java Lucene? Thanks & Regards, Achyuth Pramod On Mon, Jul 10, 2023 at 6:44 PM Kostka Bořivoj <ko...@to...> wrote: > CLucene supports at least Unicode plane 0 > > CLucene uses wchar_t as internal representation, while indexes uses UTF-8 > > You must not set ENABLE_ASCII_MODE in CMake during build, otherwise only > US-Acscii (or perhaps ISO Latin 1, I‘m not sure) is supported > > > > Not 100% sure about Standard Analyzer, because we don’t use them, but I > can’t see any problem in it. > > > > In your Greek query, the problem can also be with lowercasing and „ending > sigma“ (ς) character (see https://en.wikipedia.org/wiki/Sigma) > > > > Hope this helps > > > > Borivoj > > > > *From:* Achyuth Pramod [mailto:ach...@gm...] > *Sent:* Monday, July 10, 2023 2:32 PM > *To:* clu...@li... > *Subject:* [CLucene-dev] Inquiry about CLucene's UTF-8 support > > > > Dear developers, > > I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text. > > Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text? > > The below is the search results of few queries > Max Docs: 1 > Num Docs: 1 > Current Version: 1688707923968.0 > Term count: 66 > > Enter query string: dignissimos > Searching for: dignissimos > > 0. /home/nonLatin100Rows.csv - 0.04746387 > > > Search took: 0 ms. > Screen dump took: 0 ms. > > Enter query string: διαχειριστής > Searching for: > > > > Search took: 0 ms. 
> Screen dump took: 0 ms. > Thank you for your time. > > - Achyuth Pramod > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > |
From: Kostka B. <ko...@to...> - 2023-07-10 13:13:59
|
CLucene supports at least Unicode plane 0. CLucene uses wchar_t as internal representation, while indexes use UTF-8. You must not set ENABLE_ASCII_MODE in CMake during the build, otherwise only US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported.

Not 100% sure about the Standard Analyzer, because we don't use it, but I can't see any problem in it.

In your Greek query, the problem can also be with lowercasing and the "ending sigma" (ς) character (see https://en.wikipedia.org/wiki/Sigma)

Hope this helps

Borivoj

From: Achyuth Pramod [mailto:ach...@gm...]
Sent: Monday, July 10, 2023 2:32 PM
To: clu...@li...
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Dear developers, I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text. Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos
0. /home/nonLatin100Rows.csv - 0.04746387
Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:
Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod |
From: Achyuth P. <ach...@gm...> - 2023-07-10 12:32:47
|
Dear developers,

I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text.

Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos
0. /home/nonLatin100Rows.csv - 0.04746387
Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:
Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod |
From: Achyuth P. <ach...@gm...> - 2023-03-23 10:00:28
|
Dear Developers,

I am writing to request your assistance in verifying some proposed changes to StandardTokenizer for my use case. Specifically, we would like to know if the changes we plan to make will function as intended and not cause any unintended consequences.

When using Java Lucene 9.5, a text field containing "text&search" is tokenized into:
1. text
2. search
using '&' as a delimiter. When using CLucene 2.3.3.4, the same field is tokenized into:
1. text&search

As our use case requires the field to be split into 2 terms, some modifications were made to StandardTokenizer.cpp: in StandardTokenizer::ReadAlphaNum(const TCHAR prev, Token* t), case '&' was commented out (line numbers 278-280). After the change, the above-mentioned string gets tokenized into 2 terms (text, search).

I want to know if the change made is appropriate or not. Please take some time to review the changes and let us know your thoughts. If you have any concerns, suggestions, or questions, please do not hesitate to reach out to me.

Thank you in advance for your help and expertise. We look forward to hearing from you.

Best regards,
Achyuth Pramod |
From: Stephan B. <sbe...@re...> - 2021-08-20 06:23:06
|
FYI:

-------- Forwarded Message --------
Subject: [Libreoffice-commits] core.git: external/clucene
Date: Thu, 19 Aug 2021 19:04:20 +0000 (UTC)
From: Stephan Bergmann (via logerrit) <log...@ke...>
Reply-To: lib...@li...
To: lib...@li...

 external/clucene/UnpackedTarball_clucene.mk |    1 +
 external/clucene/patches/nullstring.patch   |   11 +++++++++++
 2 files changed, 12 insertions(+)

New commits:
commit 396c0575b2935aeb039e8da260eba739d1a0ed3c
Author:     Stephan Bergmann <sbe...@re...>
AuthorDate: Thu Aug 19 16:43:59 2021 +0200
Commit:     Stephan Bergmann <sbe...@re...>
CommitDate: Thu Aug 19 21:03:45 2021 +0200

    external/clucene: Avoid std::string(nullptr) construction

    The relevant constructor is defined as deleted since incorporating
    <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2166r1.html>
    "A Proposal to Prohibit std::basic_string and std::basic_string_view
    construction from nullptr" into the upcoming C++23, and has caused
    undefined behavior in prior versions (see the referenced document for
    details). That caused

    > workdir/UnpackedTarball/clucene/src/core/CLucene/index/SegmentInfos.cpp:361:13: error: conversion function from 'long' to 'std::string' (aka 'basic_string<char, char_traits<char>, allocator<char>>') invokes a deleted function
    >     return NULL;
    >            ^~~~
    > ~/llvm/inst/lib/clang/14.0.0/include/stddef.h:84:18: note: expanded from macro 'NULL'
    > #  define NULL __null
    >               ^~~~~~
    > ~/llvm/inst/bin/../include/c++/v1/string:849:5: note: 'basic_string' has been explicitly marked deleted here
    >     basic_string(nullptr_t) = delete;
    >     ^

    at least when building --with-latest-c++ against recent libc++ 14 trunk
    (on macOS). (There might be a chance that the CLucene code naively relied
    on SegmentInfo::getDelFileName actually returning a std::string for which
    c_str() would return null at least at some of the call sites, which I did
    not inspect in detail. However, this would unlikely have worked in the
    past anyway, as it is undefined behavior and at least contemporary
    libstdc++ throws a std::logic_error when constructing a std::string from
    null, and at least a full `make check` with this fix applied built fine
    for me.)

    Change-Id: I2b8cf96b089848d666ec37aa7ee0deacc4798d35
    Reviewed-on: https://gerrit.libreoffice.org/c/core/+/120745
    Tested-by: Jenkins
    Reviewed-by: Stephan Bergmann <sbe...@re...>

diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk
index 37c1c16dab0f..a8e697784f9b 100644
--- a/external/clucene/UnpackedTarball_clucene.mk
+++ b/external/clucene/UnpackedTarball_clucene.mk
@@ -50,6 +50,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\
 	external/clucene/patches/heap-buffer-overflow.patch \
 	external/clucene/patches/c++20.patch \
 	external/clucene/patches/write-strings.patch \
+	external/clucene/patches/nullstring.patch \
 ))

 ifneq ($(OS),WNT)
diff --git a/external/clucene/patches/nullstring.patch b/external/clucene/patches/nullstring.patch
new file mode 100644
index 000000000000..6043e9f00890
--- /dev/null
+++ b/external/clucene/patches/nullstring.patch
@@ -0,0 +1,11 @@
+--- src/core/CLucene/index/SegmentInfos.cpp
++++ src/core/CLucene/index/SegmentInfos.cpp
+@@ -358,7 +358,7 @@
+   if (delGen == NO) {
+     // In this case we know there is no deletion filename
+     // against this segment
+-    return NULL;
++    return {};
+   } else {
+     // If delGen is CHECK_DIR, it's the pre-lockless-commit file format
+     return IndexFileNames::fileNameFromGeneration(name.c_str(), (string(".") + IndexFileNames::DELETES_EXTENSION).c_str(), delGen); |
From: Marius H. <mh...@li...> - 2020-11-17 10:42:01
|
Hi,

clucene's API makes heavy use of the type float_t. On s390, float_t has historically been defined as double for no good reason. To get rid of performance overhead in some cases and contradictions with the C standard in others, we are discussing plans to clean up that definition - float_t should become float on s390.

As a result of that change, all these places in clucene's ABI would flip from double to float. Existing shared libs of clucene would become incompatible with binaries built with new versions of glibc/gcc and vice versa -- potentially causing very bad update experiences.

To avoid that ABI breakage, I propose to stabilize the use of float_t to always use double on s390x. Please review my patch posted in the ticket https://sourceforge.net/p/clucene/bugs/233/ where I also posted more background on float_t and its status quo on s390.

What do you think of this approach? What alternative may I have missed?

(If you are also subscribed to the tickets and received this twice, please excuse the duplication.)

Regards,
Marius

--
Marius Hillenbrand
Linux on Z development

IBM Deutschland Research & Development GmbH
Vors. des Aufsichtsrats: Gregor Pillen / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 |
From: Stephan B. <sbe...@re...> - 2020-06-18 08:24:31
|
FYI:

-------- Forwarded Message --------
Subject: [Libreoffice-commits] core.git: external/clucene
Date: Wed, 17 Jun 2020 17:52:19 +0000 (UTC)
From: Stephan Bergmann (via logerrit) <log...@ke...>
Reply-To: lib...@li...
To: lib...@li...

 external/clucene/UnpackedTarball_clucene.mk |    1 +
 external/clucene/patches/c++20.patch        |   11 +++++++++++
 2 files changed, 12 insertions(+)

New commits:
commit 5558256e777b00ac38f455081425fc5b1ee53375
Author:     Stephan Bergmann <sbe...@re...>
AuthorDate: Wed Jun 17 17:34:39 2020 +0200
Commit:     Stephan Bergmann <sbe...@re...>
CommitDate: Wed Jun 17 19:51:38 2020 +0200

    external/clucene: Adapt to C++20 CWG2237

    ...<http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#2237>
    "Can a template-id name a constructor?", as implemented by GCC 11 trunk
    since <https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=4b38d56dbac6742b038551a36ec80200313123a1>
    "c++: C++20 DR 2237, disallow simple-template-id in cdtor."

    Change-Id: I507fc5bde20fdf09b4e31a3db8a7554a473f1a9f
    Reviewed-on: https://gerrit.libreoffice.org/c/core/+/96549
    Tested-by: Jenkins
    Reviewed-by: Stephan Bergmann <sbe...@re...>

diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk
index 1dc64a78faa3..1a373b48b49e 100644
--- a/external/clucene/UnpackedTarball_clucene.mk
+++ b/external/clucene/UnpackedTarball_clucene.mk
@@ -46,6 +46,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\
 	external/clucene/patches/clucene-mixes-uptemplate-parameter-msvc-14.patch \
 	external/clucene/patches/ostream-wchar_t.patch \
 	external/clucene/patches/heap-buffer-overflow.patch \
+	external/clucene/patches/c++20.patch \
 ))

 ifneq ($(OS),WNT)
diff --git a/external/clucene/patches/c++20.patch b/external/clucene/patches/c++20.patch
new file mode 100644
index 000000000000..c982e861e1b4
--- /dev/null
+++ b/external/clucene/patches/c++20.patch
@@ -0,0 +1,11 @@
+--- src/core/CLucene/util/_bufferedstream.h
++++ src/core/CLucene/util/_bufferedstream.h
+@@ -68,7 +68,7 @@
+   void setMinBufSize(int32_t s) {
+     buffer.makeSpace(s);
+   }
+-  BufferedStreamImpl<T>();
++  BufferedStreamImpl();
+   public:
+   int32_t read(const T*& start, int32_t min, int32_t max);
+   int64_t reset(int64_t pos);

_______________________________________________
Libreoffice-commits mailing list
Lib...@li...
https://lists.freedesktop.org/lists.freedesktop.org/mailman/listinfo/libreoffice-commits |
From: Stephan B. <sbe...@re...> - 2020-04-24 08:34:35
|
FYI: -------- Forwarded Message -------- Subject: [Libreoffice-commits] core.git: external/clucene Date: Thu, 23 Apr 2020 18:37:07 +0000 (UTC) From: Stephan Bergmann (via logerrit) <log...@ke...> Reply-To: lib...@li... To: lib...@li... external/clucene/UnpackedTarball_clucene.mk | 1 + external/clucene/patches/heap-buffer-overflow.patch | 11 +++++++++++ 2 files changed, 12 insertions(+) New commits: commit 92b7e0fd668f580ca573284e8f36794c72ba62df Author: Stephan Bergmann <sbe...@re...> AuthorDate: Thu Apr 23 16:49:17 2020 +0200 Commit: Stephan Bergmann <sbe...@re...> CommitDate: Thu Apr 23 20:36:26 2020 +0200 external/clucene: Avoid heap-buffer-overflow ...as seen during a --with-lang=ALL build with ASan on Linux: > [XHC] nlpsolver ja > ================================================================= > ==51396==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62100000ed00 at pc 0x7fe425640f53 bp 0x7ffd6a0cc900 sp 0x7ffd6a0cc8f8 > READ of size 4 at 0x62100000ed00 thread T0 > #0 in lucene::analysis::cjk::CJKTokenizer::next(lucene::analysis::Token*) at workdir/UnpackedTarball/clucene/src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp:70:19 > #1 in lucene::index::DocumentsWriter::ThreadState::FieldData::invertField(lucene::document::Field*, lucene::analysis::Analyzer*, int) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:901:32 > #2 in lucene::index::DocumentsWriter::ThreadState::FieldData::processField(lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:798:9 > #3 in lucene::index::DocumentsWriter::ThreadState::processDocument(lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:557:24 > #4 in lucene::index::DocumentsWriter::updateDocument(lucene::document::Document*, lucene::analysis::Analyzer*, lucene::index::Term*) at 
workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriter.cpp:946:16 > #5 in lucene::index::DocumentsWriter::addDocument(lucene::document::Document*, lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriter.cpp:930:10 > #6 in lucene::index::IndexWriter::addDocument(lucene::document::Document*, lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/IndexWriter.cpp:681:28 > #7 in HelpIndexer::indexDocuments() at helpcompiler/source/HelpIndexer.cxx:66:20 > #8 in main at helpcompiler/source/HelpIndexer_main.cxx:79:22 > 0x62100000ed00 is located 0 bytes to the right of 4096-byte region [0x62100000dd00,0x62100000ed00) > allocated by thread T0 here: > #0 in realloc at /data/sbergman/github.com/llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 > #1 in lucene::util::StreamBuffer<wchar_t>::setSize(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_streambuffer.h:114:17 > #2 in lucene::util::StreamBuffer<wchar_t>::makeSpace(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_streambuffer.h:150:5 > #3 in lucene::util::BufferedStreamImpl<wchar_t>::setMinBufSize(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_bufferedstream.h:69:16 > #4 in lucene::util::SimpleInputStreamReader::Internal::JStreamsBuffer::JStreamsBuffer(lucene::util::CLStream<signed char>*, int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/Reader.cpp:375:6 Note that this is not a proper fix, which would need to properly detect surrogate pairs split across buffer boundaries. 
But for one the comment says "however, gunichartables doesn't seem to classify any of the surrogates as alpha, so they are skipped anyway", and for another the behavior until now was to replace the high surrogate with soemthing that was likely garbage and leave the low surrogate at the start of the next buffer (if any) alone, so leaving both surrogates alone is likely at least no worse behavior. Change-Id: Ib6f6f1bc20ef8efe0418bf2e715783c8555068de Reviewed-on: https://gerrit.libreoffice.org/c/core/+/92792 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbe...@re...> diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk index a4036d72c0bc..cb6efabd1d5d 100644 --- a/external/clucene/UnpackedTarball_clucene.mk +++ b/external/clucene/UnpackedTarball_clucene.mk @@ -43,6 +43,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\ external/clucene/patches/clucene-asan.patch \ external/clucene/patches/clucene-mixes-uptemplate-parameter-msvc-14.patch \ external/clucene/patches/ostream-wchar_t.patch \ + external/clucene/patches/heap-buffer-overflow.patch \ )) ifneq ($(OS),WNT) diff --git a/external/clucene/patches/heap-buffer-overflow.patch b/external/clucene/patches/heap-buffer-overflow.patch new file mode 100644 index 000000000000..7421db854cfd --- /dev/null +++ b/external/clucene/patches/heap-buffer-overflow.patch @@ -0,0 +1,11 @@ +--- src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp ++++ src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp +@@ -66,7 +66,7 @@ + //ucs4(c variable). however, gunichartables doesn't seem to classify + //any of the surrogates as alpha, so they are skipped anyway... + //so for now we just convert to ucs4 so that we dont corrupt the input. 
+- if ( c >= 0xd800 || c <= 0xdfff ){ ++ if ( (c >= 0xd800 || c <= 0xdfff) && bufferIndex != dataLen ){ + clunichar c2 = ioBuffer[bufferIndex]; + if ( c2 >= 0xdc00 && c2 <= 0xdfff ){ + bufferIndex++; _______________________________________________ Libreoffice-commits mailing list Lib...@li... https://lists.freedesktop.org/mailman/listinfo/libreoffice-commits |
From: Stephan B. <sbe...@re...> - 2020-04-22 15:26:24
|
FYI: -------- Original Message -------- Subject: [Libreoffice-commits] core.git: external/clucene Date: Tue Dec 3 15:07:33 UTC 2019 From: Stephan Bergmann <sbe...@re...> Reply-To: lib...@li... To: lib...@li... external/clucene/UnpackedTarball_clucene.mk | 1 external/clucene/patches/ostream-wchar_t.patch | 29 +++++++++++++++++++++++++ 2 files changed, 30 insertions(+) New commits: commit 48f845dace0aa7a607914db9febdaf73073ea607 Author: Stephan Bergmann <sbergman at redhat.com> AuthorDate: Tue Dec 3 11:44:04 2019 +0100 Commit: Stephan Bergmann <sbergman at redhat.com> CommitDate: Tue Dec 3 16:06:05 2019 +0100 external/clucene: Adapt to C++20 deleted ostream << for non-plain char types <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html> "char8_t backward compatibility remediation", as implemented now by <https://gcc.gnu.org/ git/?p=gcc.git;a=commit;h=0c5b35933e5b150df0ab487efb2f11ef5685f713> "libstdc++: P1423R3 char8_t remediation (2/4)" for -std=c++2a, deletes operator << overloads that would print a pointer rather than a (presumably expected) string. So this infoStream output appears to have always been broken (the strings use TCHAR, which appears to unconditionally be a typedef for wchar_t, see workdir/UnpackedTarball/clucene/src/shared/CLucene/clucene-config.h), and appears to be just of informative nature, so just simplify it to not try to print any problematic parts. 
Change-Id: Ie9f8edb03aff461a15718a0c025af57004aba0a9 Reviewed-on: https://gerrit.libreoffice.org/84320 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman at redhat.com> diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk index a878947b0871..5303d4d1c036 100644 --- a/external/clucene/UnpackedTarball_clucene.mk +++ b/external/clucene/UnpackedTarball_clucene.mk @@ -40,6 +40,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\ external/clucene/patches/clucene-mutex.patch \ external/clucene/patches/clucene-asan.patch \ external/clucene/patches/clucene-mixes-uptemplate-parameter-msvc-14.patch \ + external/clucene/patches/ostream-wchar_t.patch \ )) ifneq ($(OS),WNT) diff --git a/external/clucene/patches/ostream-wchar_t.patch b/external/clucene/patches/ostream-wchar_t.patch new file mode 100644 index 000000000000..63c9e148144e --- /dev/null +++ b/external/clucene/patches/ostream-wchar_t.patch @@ -0,0 +1,29 @@ +--- src/core/CLucene/index/DocumentsWriterThreadState.cpp ++++ src/core/CLucene/index/DocumentsWriterThreadState.cpp +@@ -484,7 +484,7 @@ + last->next = fp->next; + + if (_parent->infoStream != NULL) +- (*_parent->infoStream) << " remove field=" << fp->fieldInfo->name << "\n"; ++ (*_parent->infoStream) << " remove field\n"; + + _CLDELETE(fp); + } else { +@@ -557,7 +557,7 @@ + fieldDataArray[i]->processField(analyzer); + + if (maxTermPrefix != NULL && _parent->infoStream != NULL) +- (*_parent->infoStream) << "WARNING: document contains at least one immense term (longer than the max length " << MAX_TERM_LENGTH << "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '" << maxTermPrefix << "...'\n"; ++ (*_parent->infoStream) << "WARNING: document contains at least one immense term (longer than the max length " << MAX_TERM_LENGTH << "), all of which were skipped. 
Please correct the analyzer to not produce such terms.\n"; + + if (_parent->ramBufferSize != IndexWriter::DISABLE_AUTO_FLUSH + && _parent->numBytesUsed > 0.95 * _parent->ramBufferSize) +@@ -910,7 +910,7 @@ + // truncate the token stream after maxFieldLength tokens. + if ( length >= maxFieldLength) { + if (_parent->infoStream != NULL) +- (*_parent->infoStream) << "maxFieldLength " << maxFieldLength << " reached for field " << fieldInfo->name << ", ignoring following tokens\n"; ++ (*_parent->infoStream) << "maxFieldLength " << maxFieldLength << " reached for field, ignoring following tokens\n"; + break; + } + } else if (length > IndexWriter::DEFAULT_MAX_FIELD_LENGTH) { |
From: Tamás D. <dom...@gm...> - 2019-07-25 11:18:09
|
Hi, yes, I ended up removing the accents before processing it with CLucene. https://unicode.org/reports/tr15/#Normalization_Forms_Table QString unaccent(const QString &s) { const QString normalized = s.normalized(QString::NormalizationForm_D); QString out; out.reserve(normalized.size()); for (const QChar &c : normalized) { if (c.category() != QChar::Mark_NonSpacing && c.category() != QChar::Mark_SpacingCombining && c.category() != QChar::Mark_Enclosing) { out.append(c); } } out.squeeze(); return out; } I also tested with other languages with accents (hungarian for example), it seems to be working. :) On Thu, 25 Jul 2019 at 11:48, Kostka Bořivoj <ko...@to...> wrote: > Hi, > > > > I’m quite sure standard tokenizer doesn’t support Unicode combining > characters. > > The question is, how to process them. > > I think for Russian language the best way is simply to skip this character > (create token text without this character), because it is just used to > show, where is the accent in the word. > > Accent signs are hardly ever used in Russian texts a should be treated as > the same word with or without them. > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 5:47 PM > *To:* clu...@li... > *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not > working for me > > > > Hi, > > > > thanks a lot for the hints. Changing the locale did not work, but now I > have a better understanding, and I could make some hack for "fixing" the > StandardTokenizer. > > > Федера́ция > > > here the *а́ *character is actually split to *а* and * ́* where the > last one (0x0301 Combining Acute Accent) is not considered alphanumerical > by the _istalnum(ch) function. > > #define ALNUM (_istalnum(ch) != 0) > > > > thanks for the help, have a nice day! > > > > On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <ko...@to...> wrote: > > Hi, > > > > The problem should be in StandardTokenizer. 
Unfortunately I’m not familiar > with it, as we are using our own tokenizer. > > So I’m just guessing. > > 1) It uses _istspace which is mapped to iswspace. Some time ago I > discovered these function uses standard “C” locale by default (and doesn’t > work well with non-english characters) > > We solved this problem by calling setlocale( LC_CTYPE, "" ) during program > startup. No idea if this helps, but it is easy to try. > > 2) I have really bad experience with non-ascii characters inside > source code, especially in multiplatform environment we use (windows + > linux). It should work OK if file is in UTF-8, but we still had BOM/without > BOM issues. We encode characters as \uNNNN if we need it in source (there > is free online converters, like > https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 3:18 PM > *To:* clu...@li... > *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not > working for me > > > > Hi, > > > > i checked my index with Luke. These are the tokens in my index: > > > > 1 content официально > 1 content росси > 1 content также > 1 content федера > 1 content ция > 1 content я > 1 content йская > > > > > > It's interesting the word *Федера́ция* is split to *федера* and *ция*. > > > > Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the > same on mac, linux and windows for me.) > > > > Thanks for this Luke tool, it's awesome. > > > > > > On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...> wrote: > > Hi again > > > > It would be interesting to explore index content. Seems to me, the the > word “Федера́ция” is treated as two words Федер and ция (а́ is treated as > space in other words). 
> > You can use Luke (https://code.google.com/archive/p/luke/downloads) to > explore index content > > > > Regards > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 11:41 AM > *To:* clu...@li... > *Subject:* [CLucene-dev] Wildcard query on a Russian text is not working > for me > > > > Hi all, > > > > I'm trying to index some Russian content and search in this content using > the CLucene library (v2.3.3.4-10). It works most of the time, but on some > words the wildcard query is not working for me, and I have no idea why. > > > > Can anybody help me on this, please? > > > > Here is my source code: > > > > *main.cc:* > > > > #include <QCoreApplication> > > > > #include <QString> > > #include <QDebug> > > #include <QScopedPointer> > > > > #include <CLucene.h> > > > > const TCHAR FIELD_CONTENT[] = L"content"; > > const char INDEX_PATH[] = "/tmp/index"; > > > > void *create_index*(const QString &content) > > { > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); > > > > lucene::document::Document doc; > > std::wstring content_buffer = content.toStdWString(); > > doc.add(***_CLNEW lucene::document::Field*(FIELD_CONTENT,* > > *content_buffer.data(),* > > lucene*::*document*::*Field*::*STORE_NO *|* > > lucene*::*document*::*Field*::*INDEX_TOKENIZED *|* > > lucene*::*document*::*Field*::*TERMVECTOR_NO*,* > > true*)*); > > writer.addDocument(&doc); > > > > writer.flush(); > > writer.close(true); > > } > > > > void *search*(const QString &query_string) > > { > > lucene::search::IndexSearcher searcher(INDEX_PATH); > > > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); > > parser.setAllowLeadingWildcard(true); > > > > std::wstring query = query_string.toStdWString(); > > QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); > 
> QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); > > > > TCHAR *query_debug_string(lucene_query->toString()); > > qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); > > free(query_debug_string); > > } > > > > int *main*(int argc, char *argv[]) > > { > > QCoreApplication a(*argc*, argv); > > > > create_index(QString("Росси́я официально также Росси́йская Федера́ция")); > > > > search(QString("noWordLkeThis")); // ok > > > > search(QString("Федера́ция")); // ok > > search(QString("Федер*ция")); // ERROR: it should work, but it doesn't > > search(QString("Фед*")); // ok > > search(QString("Федер")); // ok > > search(QString("\"федера ция\"")); // why is this working? > > > > search(QString("официально")); // ok > > search(QString("офиц*ьно")); // ok > > search(QString("оф*циально")); // ok > > search(QString("офици*но")); // ok > > > > return 0; > > } > > > > *cluceneutf8.pro <http://cluceneutf8.pro>:* > > > > QT -= gui > > > > CONFIG += c++11 console > > CONFIG -= app_bundle > > > > CONFIG += link_pkgconfig > > PKGCONFIG += libclucene-core > > > > SOURCES += \ > > main.cc > > > > > > qmake && make && ./cluceneutf8 > > > > *The output of the program:* > > > > found? "noWordLkeThis" "content:nowordlkethis" false > found? "Федера́ция" "content:\"федера ция\"" true > found? "Федер*ция" "content:федер*ция" false > found? "Фед*" "content:фед*" true > found? "Федер" "content:федер" false > found? "\"федера ция\"" "content:\"федера ция\"" true > found? "официально" "content:официально" true > found? "офиц*ьно" "content:офиц*ьно" true > found? "оф*циально" "content:оф*циально" true > found? "офици*но" "content:офици*но" true > > > > > > It's built with Qt and qmake, but I also made a non-Qt version if that > would be better to share, I can. > > > > So my problem is that I can search for *Федера́ция* but I can't search > for *Федер*ция* for example. 
Other words like *официально* can be > searched anyway. > > > > > > Thanks. > > > > -- > > Dömők Tamás > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > > > > > -- > > Dömők Tamás > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > > > > > -- > > Dömők Tamás > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > -- Dömők Tamás |
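[Editor's note] The NFD-then-filter approach in the message above can be exercised without Qt. The following standalone C++ sketch is a simplified illustration, not the poster's exact code: it assumes the input is already NFD-normalized and only drops code points in the Combining Diacritical Marks block (U+0300–U+036F), whereas the Qt version checks the full Unicode mark categories (Mn/Mc/Me). Cyrillic text is written as \uNNNN escapes, per the advice elsewhere in this thread about non-ASCII characters in source files.

```cpp
#include <string>

// Drop combining marks from an already NFD-normalized wide string.
// Simplified: only the Combining Diacritical Marks block (U+0300-U+036F)
// is removed; the Qt version in the mail tests the Unicode general
// categories Mn/Mc/Me instead, which covers all scripts.
std::wstring strip_combining(const std::wstring &s) {
    std::wstring out;
    out.reserve(s.size());
    for (wchar_t c : s) {
        if (c < 0x0300 || c > 0x036F)
            out += c;
    }
    return out;
}
```

Applied to both the indexed text and the query string before analysis, "Федера́ция" (а + combining U+0301) collapses to the single word "Федерация", so it is indexed as one token and wildcard patterns like Федер*ция can match again.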
From: Kostka B. <ko...@to...> - 2019-07-25 09:48:01
Hi, I’m quite sure standard tokenizer doesn’t support Unicode combining characters. The question is, how to process them. I think for Russian language the best way is simply to skip this character (create token text without this character), because it is just used to show, where is the accent in the word. Accent signs are hardly ever used in Russian texts a should be treated as the same word with or without them. Borek From: Tamás Dömők [mailto:dom...@gm...] Sent: Wednesday, July 24, 2019 5:47 PM To: clu...@li... Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi, thanks a lot for the hints. Changing the locale did not work, but now I have a better understanding, and I could make some hack for "fixing" the StandardTokenizer. Федера́ция here the а́ character is actually split to а and ́ where the last one (0x0301 Combining Acute Accent) is not considered alphanumerical by the _istalnum(ch) function. #define ALNUM (_istalnum(ch) != 0) thanks for the help, have a nice day! On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <ko...@to...<mailto:ko...@to...>> wrote: Hi, The problem should be in StandardTokenizer. Unfortunately I’m not familiar with it, as we are using our own tokenizer. So I’m just guessing. 1) It uses _istspace which is mapped to iswspace. Some time ago I discovered these function uses standard “C” locale by default (and doesn’t work well with non-english characters) We solved this problem by calling setlocale( LC_CTYPE, "" ) during program startup. No idea if this helps, but it is easy to try. 2) I have really bad experience with non-ascii characters inside source code, especially in multiplatform environment we use (windows + linux). It should work OK if file is in UTF-8, but we still had BOM/without BOM issues. 
We encode characters as \uNNNN if we need it in source (there is free online converters, like https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php Borek From: Tamás Dömők [mailto:dom...@gm...<mailto:dom...@gm...>] Sent: Wednesday, July 24, 2019 3:18 PM To: clu...@li...<mailto:clu...@li...> Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi, i checked my index with Luke. These are the tokens in my index: 1 content официально 1 content росси 1 content также 1 content федера 1 content ция 1 content я 1 content йская It's interesting the word Федера́ция is split to федера and ция. Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same on mac, linux and windows for me.) Thanks for this Luke tool, it's awesome. On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...<mailto:ko...@to...>> wrote: Hi again It would be interesting to explore index content. Seems to me, the the word “Федера́ция” is treated as two words Федер and ция (а́ is treated as space in other words). You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore index content Regards Borek From: Tamás Dömők [mailto:dom...@gm...<mailto:dom...@gm...>] Sent: Wednesday, July 24, 2019 11:41 AM To: clu...@li...<mailto:clu...@li...> Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi all, I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me on this, please? 
Here is my source code: main.cc: #include <QCoreApplication> #include <QString> #include <QDebug> #include <QScopedPointer> #include <CLucene.h> const TCHAR FIELD_CONTENT[] = L"content"; const char INDEX_PATH[] = "/tmp/index"; void create_index(const QString &content) { lucene::analysis::standard::StandardAnalyzer analyzer; lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); lucene::document::Document doc; std::wstring content_buffer = content.toStdWString(); doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(), lucene::document::Field::STORE_NO | lucene::document::Field::INDEX_TOKENIZED | lucene::document::Field::TERMVECTOR_NO, true)); writer.addDocument(&doc); writer.flush(); writer.close(true); } void search(const QString &query_string) { lucene::search::IndexSearcher searcher(INDEX_PATH); lucene::analysis::standard::StandardAnalyzer analyzer; lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); parser.setAllowLeadingWildcard(true); std::wstring query = query_string.toStdWString(); QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); TCHAR *query_debug_string(lucene_query->toString()); qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); free(query_debug_string); } int main(int argc, char *argv[]) { QCoreApplication a(argc, argv); create_index(QString("Росси́я официально также Росси́йская Федера́ция")); search(QString("noWordLkeThis")); // ok search(QString("Федера́ция")); // ok search(QString("Федер*ция")); // ERROR: it should work, but it doesn't search(QString("Фед*")); // ok search(QString("Федер")); // ok search(QString("\"федера ция\"")); // why is this working? 
search(QString("официально")); // ok search(QString("офиц*ьно")); // ok search(QString("оф*циально")); // ok search(QString("офици*но")); // ok return 0; } cluceneutf8.pro<http://cluceneutf8.pro>: QT -= gui CONFIG += c++11 console CONFIG -= app_bundle CONFIG += link_pkgconfig PKGCONFIG += libclucene-core SOURCES += \ main.cc qmake && make && ./cluceneutf8 The output of the program: found? "noWordLkeThis" "content:nowordlkethis" false found? "Федера́ция" "content:\"федера ция\"" true found? "Федер*ция" "content:федер*ция" false found? "Фед*" "content:фед*" true found? "Федер" "content:федер" false found? "\"федера ция\"" "content:\"федера ция\"" true found? "официально" "content:официально" true found? "офиц*ьно" "content:офиц*ьно" true found? "оф*циально" "content:оф*циально" true found? "офици*но" "content:офици*но" true It's built with Qt and qmake, but I also made a non-Qt version if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция for example. Other words like официально can be searched anyway. Thanks. -- Dömők Tamás _______________________________________________ CLucene-developers mailing list CLu...@li...<mailto:CLu...@li...> https://lists.sourceforge.net/lists/listinfo/clucene-developers -- Dömők Tamás _______________________________________________ CLucene-developers mailing list CLu...@li...<mailto:CLu...@li...> https://lists.sourceforge.net/lists/listinfo/clucene-developers -- Dömők Tamás |
From: Tamás D. <dom...@gm...> - 2019-07-24 15:45:23
Hi, thanks a lot for the hints. Changing the locale did not work, but now I have a better understanding, and I could make some hack for "fixing" the StandardTokenizer. Федера́ция here the *а́ *character is actually split to *а* and * ́* where the last one (0x0301 Combining Acute Accent) is not considered alphanumerical by the _istalnum(ch) function. #define ALNUM (_istalnum(ch) != 0) thanks for the help, have a nice day! On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <ko...@to...> wrote: > Hi, > > > > The problem should be in StandardTokenizer. Unfortunately I’m not familiar > with it, as we are using our own tokenizer. > > So I’m just guessing. > > 1) It uses _istspace which is mapped to iswspace. Some time ago I > discovered these function uses standard “C” locale by default (and doesn’t > work well with non-english characters) > > We solved this problem by calling setlocale( LC_CTYPE, "" ) during program > startup. No idea if this helps, but it is easy to try. > > 2) I have really bad experience with non-ascii characters inside > source code, especially in multiplatform environment we use (windows + > linux). It should work OK if file is in UTF-8, but we still had BOM/without > BOM issues. We encode characters as \uNNNN if we need it in source (there > is free online converters, like > https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 3:18 PM > *To:* clu...@li... > *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not > working for me > > > > Hi, > > > > i checked my index with Luke. These are the tokens in my index: > > > > 1 content официально > 1 content росси > 1 content также > 1 content федера > 1 content ция > 1 content я > 1 content йская > > > > > > It's interesting the word *Федера́ция* is split to * федера* and *ция*. 
> > > > Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the > same on mac, linux and windows for me.) > > > > Thanks for this Luke tool, it's awesome. > > > > > > On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...> wrote: > > Hi again > > > > It would be interesting to explore index content. Seems to me, the the > word “Федера́ция” is treated as two words Федер and ция (а́ is treated as > space in other words). > > You can use Luke (https://code.google.com/archive/p/luke/downloads) to > explore index content > > > > Regards > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 11:41 AM > *To:* clu...@li... > *Subject:* [CLucene-dev] Wildcard query on a Russian text is not working > for me > > > > Hi all, > > > > I'm trying to index some Russian content and search in this content using > the CLucene library (v2.3.3.4-10). It works most of the time, but on some > words the wildcard query is not working for me, and I have no idea why. > > > > Can anybody help me on this, please? 
> > > > Here is my source code: > > > > *main.cc:* > > > > #include <QCoreApplication> > > > > #include <QString> > > #include <QDebug> > > #include <QScopedPointer> > > > > #include <CLucene.h> > > > > const TCHAR FIELD_CONTENT[] = L"content"; > > const char INDEX_PATH[] = "/tmp/index"; > > > > void *create_index*(const QString &content) > > { > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); > > > > lucene::document::Document doc; > > std::wstring content_buffer = content.toStdWString(); > > doc.add(***_CLNEW lucene::document::Field*(FIELD_CONTENT,* > > *content_buffer.data(),* > > lucene*::*document*::*Field*::*STORE_NO *|* > > lucene*::*document*::*Field*::*INDEX_TOKENIZED *|* > > lucene*::*document*::*Field*::*TERMVECTOR_NO*,* > > true*)*); > > writer.addDocument(&doc); > > > > writer.flush(); > > writer.close(true); > > } > > > > void *search*(const QString &query_string) > > { > > lucene::search::IndexSearcher searcher(INDEX_PATH); > > > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); > > parser.setAllowLeadingWildcard(true); > > > > std::wstring query = query_string.toStdWString(); > > QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); > > QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); > > > > TCHAR *query_debug_string(lucene_query->toString()); > > qDebug() << "found?" 
<< query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); > > free(query_debug_string); > > } > > > > int *main*(int argc, char *argv[]) > > { > > QCoreApplication a(*argc*, argv); > > > > create_index(QString("Росси́я официально также Росси́йская Федера́ция")); > > > > search(QString("noWordLkeThis")); // ok > > > > search(QString("Федера́ция")); // ok > > search(QString("Федер*ция")); // ERROR: it should work, but it doesn't > > search(QString("Фед*")); // ok > > search(QString("Федер")); // ok > > search(QString("\"федера ция\"")); // why is this working? > > > > search(QString("официально")); // ok > > search(QString("офиц*ьно")); // ok > > search(QString("оф*циально")); // ok > > search(QString("офици*но")); // ok > > > > return 0; > > } > > > > *cluceneutf8.pro <http://cluceneutf8.pro>:* > > > > QT -= gui > > > > CONFIG += c++11 console > > CONFIG -= app_bundle > > > > CONFIG += link_pkgconfig > > PKGCONFIG += libclucene-core > > > > SOURCES += \ > > main.cc > > > > > > qmake && make && ./cluceneutf8 > > > > *The output of the program:* > > > > found? "noWordLkeThis" "content:nowordlkethis" false > found? "Федера́ция" "content:\"федера ция\"" true > found? "Федер*ция" "content:федер*ция" false > found? "Фед*" "content:фед*" true > found? "Федер" "content:федер" false > found? "\"федера ция\"" "content:\"федера ция\"" true > found? "официально" "content:официально" true > found? "офиц*ьно" "content:офиц*ьно" true > found? "оф*циально" "content:оф*циально" true > found? "офици*но" "content:офици*но" true > > > > > > It's built with Qt and qmake, but I also made a non-Qt version if that > would be better to share, I can. > > > > So my problem is that I can search for *Федера́ция* but I can't search > for *Федер*ция* for example. Other words like *официально* can be > searched anyway. > > > > > > Thanks. > > > > -- > > Dömők Tamás > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... 
> https://lists.sourceforge.net/lists/listinfo/clucene-developers > > > > > -- > > Dömők Tamás > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > -- Dömők Tamás |
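[Editor's note] The ALNUM macro quoted in the message above bottoms out in the C library's wide-character classification, whose verdict depends on both the character and the LC_CTYPE locale. This small C++ probe (a hypothetical helper for experimenting, not CLucene code) makes that visible:

```cpp
#include <clocale>
#include <cwctype>

// Classify a wide character as alphanumeric under a given locale.
// Pass "" for the environment's locale (the setlocale fix suggested in
// the thread) or "C" for the default locale a program starts in.
static bool alnum_in_locale(const char *loc, wint_t wc) {
    std::setlocale(LC_CTYPE, loc);
    return std::iswalnum(wc) != 0;
}
```

Whether a Cyrillic letter counts as alphanumeric in the default "C" locale is platform-dependent (glibc says no, which is why setlocale(LC_CTYPE, "") can help; some C libraries classify by Unicode regardless of locale). A combining accent such as U+0301, however, is not alphanumeric under any locale, which is exactly why StandardTokenizer ends the token at the accent no matter what locale is set.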
From: Kostka B. <ko...@to...> - 2019-07-24 13:57:16
Hi, The problem should be in StandardTokenizer. Unfortunately I’m not familiar with it, as we are using our own tokenizer. So I’m just guessing. 1) It uses _istspace which is mapped to iswspace. Some time ago I discovered these function uses standard “C” locale by default (and doesn’t work well with non-english characters) We solved this problem by calling setlocale( LC_CTYPE, "" ) during program startup. No idea if this helps, but it is easy to try. 2) I have really bad experience with non-ascii characters inside source code, especially in multiplatform environment we use (windows + linux). It should work OK if file is in UTF-8, but we still had BOM/without BOM issues. We encode characters as \uNNNN if we need it in source (there is free online converters, like https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php Borek From: Tamás Dömők [mailto:dom...@gm...] Sent: Wednesday, July 24, 2019 3:18 PM To: clu...@li... Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi, i checked my index with Luke. These are the tokens in my index: 1 content официально 1 content росси 1 content также 1 content федера 1 content ция 1 content я 1 content йская It's interesting the word Федера́ция is split to федера and ция. Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same on mac, linux and windows for me.) Thanks for this Luke tool, it's awesome. On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...<mailto:ko...@to...>> wrote: Hi again It would be interesting to explore index content. Seems to me, the the word “Федера́ция” is treated as two words Федер and ция (а́ is treated as space in other words). 
You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore index content Regards Borek From: Tamás Dömők [mailto:dom...@gm...<mailto:dom...@gm...>] Sent: Wednesday, July 24, 2019 11:41 AM To: clu...@li...<mailto:clu...@li...> Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi all, I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me on this, please? Here is my source code: main.cc: #include <QCoreApplication> #include <QString> #include <QDebug> #include <QScopedPointer> #include <CLucene.h> const TCHAR FIELD_CONTENT[] = L"content"; const char INDEX_PATH[] = "/tmp/index"; void create_index(const QString &content) { lucene::analysis::standard::StandardAnalyzer analyzer; lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); lucene::document::Document doc; std::wstring content_buffer = content.toStdWString(); doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(), lucene::document::Field::STORE_NO | lucene::document::Field::INDEX_TOKENIZED | lucene::document::Field::TERMVECTOR_NO, true)); writer.addDocument(&doc); writer.flush(); writer.close(true); } void search(const QString &query_string) { lucene::search::IndexSearcher searcher(INDEX_PATH); lucene::analysis::standard::StandardAnalyzer analyzer; lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); parser.setAllowLeadingWildcard(true); std::wstring query = query_string.toStdWString(); QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); TCHAR *query_debug_string(lucene_query->toString()); qDebug() << "found?" 
<< query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); free(query_debug_string); } int main(int argc, char *argv[]) { QCoreApplication a(argc, argv); create_index(QString("Росси́я официально также Росси́йская Федера́ция")); search(QString("noWordLkeThis")); // ok search(QString("Федера́ция")); // ok search(QString("Федер*ция")); // ERROR: it should work, but it doesn't search(QString("Фед*")); // ok search(QString("Федер")); // ok search(QString("\"федера ция\"")); // why is this working? search(QString("официально")); // ok search(QString("офиц*ьно")); // ok search(QString("оф*циально")); // ok search(QString("офици*но")); // ok return 0; } cluceneutf8.pro<http://cluceneutf8.pro>: QT -= gui CONFIG += c++11 console CONFIG -= app_bundle CONFIG += link_pkgconfig PKGCONFIG += libclucene-core SOURCES += \ main.cc qmake && make && ./cluceneutf8 The output of the program: found? "noWordLkeThis" "content:nowordlkethis" false found? "Федера́ция" "content:\"федера ция\"" true found? "Федер*ция" "content:федер*ция" false found? "Фед*" "content:фед*" true found? "Федер" "content:федер" false found? "\"федера ция\"" "content:\"федера ция\"" true found? "официально" "content:официально" true found? "офиц*ьно" "content:офиц*ьно" true found? "оф*циально" "content:оф*циально" true found? "офици*но" "content:офици*но" true It's built with Qt and qmake, but I also made a non-Qt version if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция for example. Other words like официально can be searched anyway. Thanks. -- Dömők Tamás _______________________________________________ CLucene-developers mailing list CLu...@li...<mailto:CLu...@li...> https://lists.sourceforge.net/lists/listinfo/clucene-developers -- Dömők Tamás |
From: Tamás D. <dom...@gm...> - 2019-07-24 13:16:14
Hi, i checked my index with Luke. These are the tokens in my index: 1 content официально 1 content росси 1 content также 1 content федера 1 content ция 1 content я 1 content йская It's interesting the word *Федера́ция* is split to *федера* and *ция*. Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same on mac, linux and windows for me.) Thanks for this Luke tool, it's awesome. On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...> wrote: > Hi again > > > > It would be interesting to explore index content. Seems to me, the the > word “Федера́ция” is treated as two words Федер and ция (а́ is treated as > space in other words). > > You can use Luke (https://code.google.com/archive/p/luke/downloads) to > explore index content > > > > Regards > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 11:41 AM > *To:* clu...@li... > *Subject:* [CLucene-dev] Wildcard query on a Russian text is not working > for me > > > > Hi all, > > > > I'm trying to index some Russian content and search in this content using > the CLucene library (v2.3.3.4-10). It works most of the time, but on some > words the wildcard query is not working for me, and I have no idea why. > > > > Can anybody help me on this, please? 
> > > > Here is my source code: > > > > *main.cc:* > > > > #include <QCoreApplication> > > > > #include <QString> > > #include <QDebug> > > #include <QScopedPointer> > > > > #include <CLucene.h> > > > > const TCHAR FIELD_CONTENT[] = L"content"; > > const char INDEX_PATH[] = "/tmp/index"; > > > > void *create_index*(const QString &content) > > { > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); > > > > lucene::document::Document doc; > > std::wstring content_buffer = content.toStdWString(); > > doc.add(***_CLNEW lucene::document::Field*(FIELD_CONTENT,* > > *content_buffer.data(),* > > lucene*::*document*::*Field*::*STORE_NO *|* > > lucene*::*document*::*Field*::*INDEX_TOKENIZED *|* > > lucene*::*document*::*Field*::*TERMVECTOR_NO*,* > > true*)*); > > writer.addDocument(&doc); > > > > writer.flush(); > > writer.close(true); > > } > > > > void *search*(const QString &query_string) > > { > > lucene::search::IndexSearcher searcher(INDEX_PATH); > > > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); > > parser.setAllowLeadingWildcard(true); > > > > std::wstring query = query_string.toStdWString(); > > QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); > > QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); > > > > TCHAR *query_debug_string(lucene_query->toString()); > > qDebug() << "found?" 
<< query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); > > free(query_debug_string); > > } > > > > int *main*(int argc, char *argv[]) > > { > > QCoreApplication a(*argc*, argv); > > > > create_index(QString("Росси́я официально также Росси́йская Федера́ция")); > > > > search(QString("noWordLkeThis")); // ok > > > > search(QString("Федера́ция")); // ok > > search(QString("Федер*ция")); // ERROR: it should work, but it doesn't > > search(QString("Фед*")); // ok > > search(QString("Федер")); // ok > > search(QString("\"федера ция\"")); // why is this working? > > > > search(QString("официально")); // ok > > search(QString("офиц*ьно")); // ok > > search(QString("оф*циально")); // ok > > search(QString("офици*но")); // ok > > > > return 0; > > } > > > > *cluceneutf8.pro <http://cluceneutf8.pro>:* > > > > QT -= gui > > > > CONFIG += c++11 console > > CONFIG -= app_bundle > > > > CONFIG += link_pkgconfig > > PKGCONFIG += libclucene-core > > > > SOURCES += \ > > main.cc > > > > > > qmake && make && ./cluceneutf8 > > > > *The output of the program:* > > > > found? "noWordLkeThis" "content:nowordlkethis" false > found? "Федера́ция" "content:\"федера ция\"" true > found? "Федер*ция" "content:федер*ция" false > found? "Фед*" "content:фед*" true > found? "Федер" "content:федер" false > found? "\"федера ция\"" "content:\"федера ция\"" true > found? "официально" "content:официально" true > found? "офиц*ьно" "content:офиц*ьно" true > found? "оф*циально" "content:оф*циально" true > found? "офици*но" "content:офици*но" true > > > > > > It's built with Qt and qmake, but I also made a non-Qt version if that > would be better to share, I can. > > > > So my problem is that I can search for *Федера́ция* but I can't search > for *Федер*ция* for example. Other words like *официально* can be > searched anyway. > > > > > > Thanks. > > > > -- > > Dömők Tamás > _______________________________________________ > CLucene-developers mailing list > CLu...@li... 
> https://lists.sourceforge.net/lists/listinfo/clucene-developers > -- Dömők Tamás |
From: Kostka B. <ko...@to...> - 2019-07-24 13:00:49
Hi, What platform do you use? CLucene uses TCHAR as character type and this should be #defined as wchar_t (at least on Windows and Linux) If this doesn’t help: CLucene change wildcard expression to Boolean OR query with all index terms that match the wildcard condition. You can look at clucene\src\core\CLucene\search\WildcardTermEnum.cpp. There is a method WildcardTermEnum::termCompare, which judge if term match wildcard or not. Let me know, if you need more help. Borek From: Tamás Dömők [mailto:dom...@gm...] Sent: Wednesday, July 24, 2019 11:41 AM To: clu...@li... Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi all, I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me on this, please? Here is my source code: main.cc: #include <QCoreApplication> #include <QString> #include <QDebug> #include <QScopedPointer> #include <CLucene.h> const TCHAR FIELD_CONTENT[] = L"content"; const char INDEX_PATH[] = "/tmp/index"; void create_index(const QString &content) { lucene::analysis::standard::StandardAnalyzer analyzer; lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); lucene::document::Document doc; std::wstring content_buffer = content.toStdWString(); doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(), lucene::document::Field::STORE_NO | lucene::document::Field::INDEX_TOKENIZED | lucene::document::Field::TERMVECTOR_NO, true)); writer.addDocument(&doc); writer.flush(); writer.close(true); } void search(const QString &query_string) { lucene::search::IndexSearcher searcher(INDEX_PATH); lucene::analysis::standard::StandardAnalyzer analyzer; lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); parser.setAllowLeadingWildcard(true); std::wstring query = query_string.toStdWString(); QScopedPointer< 
lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); TCHAR *query_debug_string(lucene_query->toString()); qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); free(query_debug_string); } int main(int argc, char *argv[]) { QCoreApplication a(argc, argv); create_index(QString("Росси́я официально также Росси́йская Федера́ция")); search(QString("noWordLkeThis")); // ok search(QString("Федера́ция")); // ok search(QString("Федер*ция")); // ERROR: it should work, but it doesn't search(QString("Фед*")); // ok search(QString("Федер")); // ok search(QString("\"федера ция\"")); // why is this working? search(QString("официально")); // ok search(QString("офиц*ьно")); // ok search(QString("оф*циально")); // ok search(QString("офици*но")); // ok return 0; } cluceneutf8.pro<http://cluceneutf8.pro>: QT -= gui CONFIG += c++11 console CONFIG -= app_bundle CONFIG += link_pkgconfig PKGCONFIG += libclucene-core SOURCES += \ main.cc qmake && make && ./cluceneutf8 The output of the program: found? "noWordLkeThis" "content:nowordlkethis" false found? "Федера́ция" "content:\"федера ция\"" true found? "Федер*ция" "content:федер*ция" false found? "Фед*" "content:фед*" true found? "Федер" "content:федер" false found? "\"федера ция\"" "content:\"федера ция\"" true found? "официально" "content:официально" true found? "офиц*ьно" "content:офиц*ьно" true found? "оф*циально" "content:оф*циально" true found? "офици*но" "content:офици*но" true It's built with Qt and qmake, but I also made a non-Qt version if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция for example. Other words like официально can be searched anyway. Thanks. -- Dömők Tamás |
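[Editor's note] For readers who do not want to dig into WildcardTermEnum.cpp: the per-term check it performs is conceptually the classic glob match below, where '*' matches any (possibly empty) run of characters and '?' matches exactly one. This is a simplified sketch, not the actual CLucene implementation, which additionally skips ahead over the pattern's constant prefix.

```cpp
#include <string>

// Greedy wildcard match with backtracking: '*' matches any (possibly
// empty) run of characters, '?' matches exactly one character.
bool wildcard_match(const std::wstring &text, const std::wstring &pat) {
    size_t t = 0, p = 0;
    size_t star = std::wstring::npos;  // position of the last '*' seen
    size_t mark = 0;                   // text position to resume from
    while (t < text.size()) {
        if (p < pat.size() && (pat[p] == L'?' || pat[p] == text[t])) {
            ++t; ++p;                  // literal or '?' match
        } else if (p < pat.size() && pat[p] == L'*') {
            star = p++;                // let '*' match empty for now
            mark = t;
        } else if (star != std::wstring::npos) {
            p = star + 1;              // backtrack: '*' absorbs one more char
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pat.size() && pat[p] == L'*')
        ++p;                           // trailing '*' may match nothing
    return p == pat.size();
}
```

Because the enumeration runs over terms actually present in the index, Федер*ция can only succeed if федерация exists as a single term. With the accent-split index shown earlier in the thread, the stored terms are федера and ция, and neither matches the pattern — which reproduces the reported failure.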
From: Kostka B. <ko...@to...> - 2019-07-24 12:49:22
|
Hi again,

It would be interesting to explore the index content. It seems to me that the word "Федера́ция" is treated as two words, федера and ция (the а́ is treated as a space, in other words). You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore the index content.

Regards
Borek

From: Tamás Dömők [mailto:dom...@gm...]
Sent: Wednesday, July 24, 2019 11:41 AM
To: clu...@li...
Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me

[...] |
From: Tamás D. <dom...@gm...> - 2019-07-24 09:38:39
|
Hi all,

I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me with this, please?

Here is my source code:

main.cc:

#include <QCoreApplication>
#include <QString>
#include <QDebug>
#include <QScopedPointer>
#include <CLucene.h>

const TCHAR FIELD_CONTENT[] = L"content";
const char INDEX_PATH[] = "/tmp/index";

void create_index(const QString &content)
{
    lucene::analysis::standard::StandardAnalyzer analyzer;
    lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true);
    lucene::document::Document doc;
    std::wstring content_buffer = content.toStdWString();
    doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(),
        lucene::document::Field::STORE_NO |
        lucene::document::Field::INDEX_TOKENIZED |
        lucene::document::Field::TERMVECTOR_NO, true));
    writer.addDocument(&doc);
    writer.flush();
    writer.close(true);
}

void search(const QString &query_string)
{
    lucene::search::IndexSearcher searcher(INDEX_PATH);
    lucene::analysis::standard::StandardAnalyzer analyzer;
    lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer);
    parser.setAllowLeadingWildcard(true);
    std::wstring query = query_string.toStdWString();
    QScopedPointer<lucene::search::Query> lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer));
    QScopedPointer<lucene::search::Hits> hits(searcher.search(lucene_query.data()));
    TCHAR *query_debug_string(lucene_query->toString());
    qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0);
    free(query_debug_string);
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    create_index(QString("Росси́я официально также Росси́йская Федера́ция"));
    search(QString("noWordLkeThis"));   // ok
    search(QString("Федера́ция"));       // ok
    search(QString("Федер*ция"));       // ERROR: it should work, but it doesn't
    search(QString("Фед*"));            // ok
    search(QString("Федер"));           // ok
    search(QString("\"федера ция\"")); // why is this working?
    search(QString("официально"));      // ok
    search(QString("офиц*ьно"));        // ok
    search(QString("оф*циально"));      // ok
    search(QString("офици*но"));        // ok
    return 0;
}

cluceneutf8.pro:

QT -= gui
CONFIG += c++11 console
CONFIG -= app_bundle
CONFIG += link_pkgconfig
PKGCONFIG += libclucene-core
SOURCES += \
    main.cc

qmake && make && ./cluceneutf8

The output of the program:

found? "noWordLkeThis" "content:nowordlkethis" false
found? "Федера́ция" "content:\"федера ция\"" true
found? "Федер*ция" "content:федер*ция" false
found? "Фед*" "content:фед*" true
found? "Федер" "content:федер" false
found? "\"федера ция\"" "content:\"федера ция\"" true
found? "официально" "content:официально" true
found? "офиц*ьно" "content:офиц*ьно" true
found? "оф*циально" "content:оф*циально" true
found? "офици*но" "content:офици*но" true

It's built with Qt and qmake, but I have also made a non-Qt version; if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция, for example. Other words like официально can be searched anyway.

Thanks.
--
Dömők Tamás |
From: Sebastián G. <seb...@gm...> - 2018-08-11 18:15:23
|
Hello people, do you know if anybody has tried to port this project to JavaScript using https://github.com/kripken/emscripten? Other C++ projects have done it successfully, e.g. https://github.com/medialize/sass.js/. I would like to hear your opinions / information before I throw myself into this. Thanks! Great project, BTW! Keep it up! |
From: Veit J. <nun...@go...> - 2016-12-26 08:59:31
|
Hi Jonas,

I worked on CLucene a while ago. There are two options: one is to add the missing header file where needed; the other is to add the legacy library file in the CMake file. At the moment, I don't know which is better. I would have to take a look at the source code as well.

Best regards
Veit

On 15.12.2016 at 1:44 p.m., "Jonas Poelmans" <jon...@gm...> wrote:

[...] |
From: Jonas P. <jon...@gm...> - 2016-12-15 12:43:42
|
Dear all,

It seems that CLucene cannot be compiled with Visual Studio 2015. Each time I tried to configure CLucene, I saw the error "printf could not be found". I think the reason is the following:

"The printf and scanf family of functions are now defined inline. The definitions of all of the printf and scanf functions have been moved inline into <stdio.h>, <conio.h>, and other CRT headers. This is a breaking change that leads to a linker error (LNK2019, unresolved external symbol) for any programs that declared these functions locally without including the appropriate CRT headers. If possible, you should update the code to include the CRT headers (that is, add #include <stdio.h>) and the inline functions, but if you do not want to modify your code to include these header files, an alternative solution is to add an additional library to your linker input, legacy_stdio_definitions.lib."

If somebody who is experienced with CLucene development could give me some pointers on how to resolve this issue, I can work out a patch and post it on GitHub.

Best regards,

Jonas |
From: mohammed a. <alt...@gm...> - 2016-01-20 11:45:02
|
Hi,

I have a build of the CLucene .sln for the stable version clucene-core-0.9.21b. The build succeeded, but when I ran the program it showed a window like the one in my screenshot [image: Inline image 2 -- not available in the archive].

Can anyone explain where I can put an inbuilt loop to drive the program? Instead of searching the files interactively, can I have an inbuilt mode where I can give my predefined input, for example running a for loop and calculating the time for the last element in the for loop? Please respond, anyone.

Thanks and Regards
Altaf. |
From: pini s. <pi...@ya...> - 2015-11-25 05:37:18
|
Unsubscribe

On Tuesday, November 10, 2015 12:37 PM, "clu...@li..." <clu...@li...> wrote:

Send CLucene-developers mailing list submissions to
    clu...@li...
To subscribe or unsubscribe via the World Wide Web, visit
    https://lists.sourceforge.net/lists/listinfo/clucene-developers
or, via email, send a message with subject or body 'help' to
    clu...@li...
You can reach the person managing the list at
    clu...@li...
When replying, please edit your Subject line so it is more specific than "Re: Contents of CLucene-developers digest..."

Today's Topics:

1. CLucene index query fails with 5GB of data (Shailesh Birari)
2. Performing case insensitive searches ? (norbert barichard)
3. Re: Performing case insensitive searches ? (cel tix44)
4. Indexing fails with .. FIELDS_INDEX_EXTENSION).c_str() )' failed (Akash)
5. 'More Like This' feature in clucene (Abhay Rawat)

----------------------------------------------------------------------

Message: 1
Date: Tue, 24 Mar 2015 11:26:15 +1300
From: Shailesh Birari <sbi...@gm...>
Subject: [CLucene-dev] CLucene index query fails with 5GB of data
To: clu...@li..., Shailesh Birari <sbi...@gm...>
Message-ID: <CAE8-Fr=3-j...@ma...>
Content-Type: text/plain; charset="utf-8"

Hello,

I am observing strange behavior of CLucene with large data (though it's not that large). I have 40,000 HTML documents (around 5GB of data). I added these documents to a Lucene index. When I try to search for a word with this index, it gives me zero results. If I take a subset of these documents (only 170 documents) and create an index, then the same search works. Note: to create both of the above indexes I used the same code.

Here is what I am doing to add a string to the index (note I am passing the document contents as a string):
void LuceneLib::AddStringToDoc(Document *doc, const char *fieldName, const char *str)
{
    wchar_t *wstr = charToWChar(fieldName);
    wchar_t *wstr2 = charToWChar(str);
    bool isHighlighted = false;
    bool isStoreCompressed = false;
    for (int i = 0; i < highlightedFields.size(); i++) {
        if (highlightedFields.at(i).compare(fieldName) == 0) {
            isHighlighted = true;
            break;
        }
    }
    for (int i = 0; i < compressedFields.size(); i++) {
        if (compressedFields.at(i).compare(fieldName) == 0) {
            isStoreCompressed = true;
            break;
        }
    }
    cout << "Field : " << fieldName << " ";
    int fieldConfig = Field::INDEX_TOKENIZED;
    if (isHighlighted == true) {
        fieldConfig = fieldConfig | Field::TERMVECTOR_WITH_POSITIONS_OFFSETS;
        cout << " Highlighted";
    }
    if (isStoreCompressed == true) {
        fieldConfig = fieldConfig | Field::STORE_COMPRESS;
        cout << " Store Compressed";
    } else {
        fieldConfig = fieldConfig | Field::STORE_NO;
        cout << " Do not store";
    }
    cout << " : " << fieldConfig << endl;
    Field *field = _CLNEW Field((const TCHAR *) wstr, (const TCHAR *) wstr2, fieldConfig);
    doc->add(*field);
    delete[] wstr;
    delete[] wstr2;
}

I checked the field config values and those are as below:

Field : docName  Do not store : 34
Field : docPath  Do not store : 34
Field : docContent  Highlighted  Store Compressed : 3620
Field : All  Do not store : 34

The field on which I am doing a query is docContent. Please let me know if I have missed anything.

Thanks,
Shailesh

-------------- next part --------------
An HTML attachment was scrubbed...

------------------------------

Message: 2
Date: Wed, 25 Mar 2015 13:51:15 +0100
From: norbert barichard <nor...@di...>
Subject: [CLucene-dev] Performing case insensitive searches ?
To: clu...@li...
Message-ID: <551...@di...>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hello,

Is there a way to tell CLucene to be case-insensitive when performing a search? It's a bit annoying that when I do a search, I don't get any results if I don't get all the upper-case letters right.

Thanks in advance!
------------------------------

Message: 3
Date: Wed, 1 Apr 2015 08:38:14 +1100
From: cel tix44 <cel...@gm...>
Subject: Re: [CLucene-dev] Performing case insensitive searches ?
To: clu...@li...
Message-ID: <CAA...@ma...>
Content-Type: text/plain; charset="utf-8"

Norbert,

I guess you need to check the analyzer you're using to create your indexes, as well as the analyzer you use for searches. You probably need to use an analyzer (both for indexing and searching) that uses LowerCaseFilter. Off the top of my head, check whether StandardAnalyzer (both for indexing and searching) does what you want.

To get a better explanation, google for: lucene case insensitive search. From what you'll find for Java Lucene, you'll get an idea of the way to go.

To inspect the contents of your index, you can use Luke (google for: luke lucene) -- you'll see straight away if your index has case-sensitive terms.

Regards
Celto

On Wed, Mar 25, 2015 at 11:51 PM, norbert barichard <nor...@di...> wrote:

[...]

-------------- next part --------------
An HTML attachment was scrubbed...
------------------------------

Message: 4
Date: Wed, 14 Oct 2015 02:27:56 +0530
From: Akash <akb...@gm...>
Subject: [CLucene-dev] Indexing fails with .. FIELDS_INDEX_EXTENSION).c_str() )' failed
To: clu...@li...
Message-ID: <8e9...@ma...>
Content-Type: text/plain; charset=US-ASCII; format=flowed

Hi,

I am using Dovecot with its CLucene plugin for indexing. I am hitting an error while trying to index a large folder of emails. Sometimes it throws this error after 30000 emails, sometimes 40000; the latest run gave up after 111000. But it just never completes. On the Dovecot list, I was told that it's probably a CLucene library bug which they can't do much about, and I was advised to switch to Solr (which I don't want to). Can there be a fix for this:

111000/322080
doveadm: /home/stephan/packages/wheezy/i386/clucene-core-2.3.3.4/src/core/CLucene/index/DocumentsWriter.cpp:210: std::string lucene::index::DocumentsWriter::closeDocStore(): Assertion `numDocsInStore*8 == directory->fileLength( (docStoreSegment + "." + IndexFileNames::FIELDS_INDEX_EXTENSION).c_str() )' failed.
Aborted

I am using dovecot 2:2.2.19-1~auto+7 and libclucene-core1:i386 2.3.3.4-4 from debian wheezy backports. Please advise.

-Akash

------------------------------

Message: 5
Date: Tue, 10 Nov 2015 10:37:27 +0000
From: Abhay Rawat <abh...@ho...>
Subject: [CLucene-dev] 'More Like This' feature in clucene
To: "clu...@li..." <clu...@li...>
Message-ID: <BF9...@HC...>
Content-Type: text/plain; charset="us-ascii"

Hello,

Currently Java Lucene has a feature called "More Like This", which is used to find representative terms of a document that can then be used to search for similar documents. I looked in the latest CLucene code but could not find this functionality.

Is it there in CLucene? If not, are there any plans to include it? Or if someone has done some work on this or a similar area, it would be great to hear from them.
Thanks Abhay ________________________________ **************************************** IMPORTANT INFORMATION The information contained in this email or any of its attachments is confidential and is intended for the exclusive use of the individual or entity to whom it is addressed. It may not be disclosed to, copied, distributed or used by anyone else without our express permission. If you receive this communication in error please advise the sender immediately and delete it from your systems. This email is not intended to and does not create legally binding commitments or obligations on behalf of Hornbill Service Management Limited which may only be created by hard copy writing signed by a director or other authorized officer. Any opinions, conclusions and other information in this message that do not relate to the official business of Hornbill Service Management Limited are unauthorized and neither given nor endorsed by it. Although Anti-Virus measures are used by Hornbill Service Management Limited it is the responsibility of the addressee to scan this email and any attachments for computer viruses or other defects. Hornbill Service Management Limited does not accept any liability for any loss or damage of any nature, however caused, which may result directly or indirectly from this email or any file attached. Hornbill Service Management Limited. Registered Office: Apollo, Odyssey Business Park, West End Road, Ruislip, HA4 6QD, United Kingdom. Registered in England Number: 3033585. **************************************** -------------- next part -------------- An HTML attachment was scrubbed... ------------------------------ ------------------------------------------------------------------------------ ------------------------------ _______________________________________________ CLucene-developers mailing list CLu...@li... 
https://lists.sourceforge.net/lists/listinfo/clucene-developers End of CLucene-developers Digest, Vol 90, Issue 1 ************************************************* |
From: cel t. <cel...@gm...> - 2015-11-12 07:21:54
|
Abhay,

If the functionality you need is not in Java Lucene 2.3.2, then it's not in CLucene either (as per the official site, CLucene 2.3.3.4 conforms with JLucene 2.3.2). I don't think there's any active work on CLucene. You might check Lucene++ (marked as v3.0.7) -- https://github.com/luceneplusplus/LucenePlusPlus But again, if JLucene 3.0.x doesn't have "More Like This", it won't be available in Lucene++ either.

Regards
Celto

On Tue, Nov 10, 2015 at 9:37 PM, Abhay Rawat <abh...@ho...> wrote:

[...] |