From: Roushan <rou...@gm...> - 2025-06-04 02:55:06
|
Hi All,

I am new to the group and apologies in advance if the issue has been discussed before. I could not find any relevant thread after searching, so I am posting this question. I am trying to use a phrase query, however I am getting segmentation faults consistently. Here is the code that is causing it:

    void SegmentTermPositions::lazySkip() {
        if (proxStream == NULL) {
            // clone lazily
            proxStream = parent->proxStream->clone();   // <-- parent->proxStream is nullptr
        }
        ...
    }

After looking at the code it seems to me that the "prx" stream is not written or not even initialized. Am I missing something? Or is it expected that phrase queries do not work with CLucene? I am willing to make any code changes that may be needed to make it work. Any pointer and help would be much appreciated.

Regards,
Roushan |
From: Kostka B. <ko...@to...> - 2023-07-14 08:36:44
|
Hello,

I see differences in CJK languages (Chinese, Japanese, Korean). Note that segmentation (aka tokenization) for these languages is a very complex task because they do not use spaces to separate words. There are some techniques to work around this, e.g. creating bigrams. And of course there exist segmentation libraries based on NLP (e.g. Stanford has one). I think bigrams should be generated by the CLucene standard analyzer, but I've never tried that.

Also, in Greek the ending sigma is changed to the standard sigma character (as I mentioned in my previous email), but I don't think that should be a problem, since the same thing is done during the search.

I'm afraid there is no easy way to produce the same tokens as Java Lucene. You can of course modify the Standard Analyzer or write your own.

Regards,
Borek

From: Achyuth Pramod [mailto:ach...@gm...]
Sent: Friday, July 14, 2023 8:27 AM
To: clu...@li...
Subject: Re: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Hi Developers, I am attaching the tokens generated from Java Lucene and CLucene. I am getting different tokens for non-latin texts using StandardAnalyser. Is there a solution which will generate the same tokens for CLucene as the Java Lucene? Thanks & Regards, Achyuth Pramod

On Mon, Jul 10, 2023 at 6:44 PM Kostka Bořivoj <ko...@to...> wrote:

CLucene supports at least Unicode plane 0. CLucene uses wchar_t as internal representation, while indexes use UTF-8. You must not set ENABLE_ASCII_MODE in CMake during the build, otherwise only US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported. Not 100% sure about the Standard Analyzer, because we don't use it, but I can't see any problem in it.

In your Greek query, the problem can also be with lowercasing and the "ending sigma" (ς) character (see https://en.wikipedia.org/wiki/Sigma)

Hope this helps

Borivoj

From: Achyuth Pramod [mailto:ach...@gm...]
Sent: Monday, July 10, 2023 2:32 PM
To: clu...@li...
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Dear developers, I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text. Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos
0. /home/nonLatin100Rows.csv - 0.04746387
Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:
Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod

_______________________________________________
CLucene-developers mailing list
CLu...@li...
https://lists.sourceforge.net/lists/listinfo/clucene-developers |
From: Achyuth P. <ach...@gm...> - 2023-07-14 06:27:42
|
Hi Developers, I am attaching the tokens generated from Java Lucene and CLucene. I am getting different tokens for non-latin texts using StandardAnalyser. Is there a solution which will generate the same tokens for CLucene as the Java Lucene? Thanks & Regards, Achyuth Pramod On Mon, Jul 10, 2023 at 6:44 PM Kostka Bořivoj <ko...@to...> wrote: > CLucene supports at least Unicode plane 0 > > CLucene uses wchar_t as internal representation, while indexes uses UTF-8 > > You must not set ENABLE_ASCII_MODE in CMake during build, otherwise only > US-Acscii (or perhaps ISO Latin 1, I‘m not sure) is supported > > > > Not 100% sure about Standard Analyzer, because we don’t use them, but I > can’t see any problem in it. > > > > In your Greek query, the problem can also be with lowercasing and „ending > sigma“ (ς) character (see https://en.wikipedia.org/wiki/Sigma) > > > > Hope this helps > > > > Borivoj > > > > *From:* Achyuth Pramod [mailto:ach...@gm...] > *Sent:* Monday, July 10, 2023 2:32 PM > *To:* clu...@li... > *Subject:* [CLucene-dev] Inquiry about CLucene's UTF-8 support > > > > Dear developers, > > I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text. > > Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text? > > The below is the search results of few queries > Max Docs: 1 > Num Docs: 1 > Current Version: 1688707923968.0 > Term count: 66 > > Enter query string: dignissimos > Searching for: dignissimos > > 0. /home/nonLatin100Rows.csv - 0.04746387 > > > Search took: 0 ms. > Screen dump took: 0 ms. > > Enter query string: διαχειριστής > Searching for: > > > > Search took: 0 ms. 
> Screen dump took: 0 ms. > Thank you for your time. > > - Achyuth Pramod > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > |
From: Kostka B. <ko...@to...> - 2023-07-10 13:13:59
|
CLucene supports at least Unicode plane 0. CLucene uses wchar_t as internal representation, while indexes use UTF-8. You must not set ENABLE_ASCII_MODE in CMake during the build, otherwise only US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported.

Not 100% sure about the Standard Analyzer, because we don't use it, but I can't see any problem in it.

In your Greek query, the problem can also be with lowercasing and the "ending sigma" (ς) character (see https://en.wikipedia.org/wiki/Sigma)

Hope this helps

Borivoj

From: Achyuth Pramod [mailto:ach...@gm...]
Sent: Monday, July 10, 2023 2:32 PM
To: clu...@li...
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Dear developers, I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text. Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos
0. /home/nonLatin100Rows.csv - 0.04746387
Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:
Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod |
From: Achyuth P. <ach...@gm...> - 2023-07-10 12:32:47
|
Dear developers,

I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text.

Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos
0. /home/nonLatin100Rows.csv - 0.04746387
Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:
Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod |
From: Achyuth P. <ach...@gm...> - 2023-03-23 10:00:28
|
Dear Developers,

I am writing to request your assistance in verifying some proposed changes to StandardTokenizer for my use case. Specifically, we would like to know if the changes we plan to make will function as intended and not cause any unintended consequences.

When using Java Lucene 9.5, a text field containing "text&search" is tokenized into:
1. text
2. search
using '&' as a delimiter. When using CLucene 2.3.3.4, the same field is tokenized into:
1. text&search

As our use case requires the field to be split into 2 terms, some modifications were made to StandardTokenizer.cpp: in StandardTokenizer::ReadAlphaNum(const TCHAR prev, Token* t), case '&' was commented out (line numbers 278-280). After the change, the above-mentioned string gets tokenized into 2 terms (text, search).

I want to know if the change made is appropriate or not. Please take some time to review the changes and let us know your thoughts. If you have any concerns, suggestions, or questions, please do not hesitate to reach out to me.

Thank you in advance for your help and expertise. We look forward to hearing from you.

Best regards,
Achyuth Pramod |
From: Stephan B. <sbe...@re...> - 2021-08-20 06:23:06
|
FYI:

-------- Forwarded Message --------
Subject: [Libreoffice-commits] core.git: external/clucene
Date: Thu, 19 Aug 2021 19:04:20 +0000 (UTC)
From: Stephan Bergmann (via logerrit) <log...@ke...>
Reply-To: lib...@li...
To: lib...@li...

 external/clucene/UnpackedTarball_clucene.mk |    1 +
 external/clucene/patches/nullstring.patch   |   11 +++++++++++
 2 files changed, 12 insertions(+)

New commits:
commit 396c0575b2935aeb039e8da260eba739d1a0ed3c
Author:     Stephan Bergmann <sbe...@re...>
AuthorDate: Thu Aug 19 16:43:59 2021 +0200
Commit:     Stephan Bergmann <sbe...@re...>
CommitDate: Thu Aug 19 21:03:45 2021 +0200

    external/clucene: Avoid std::string(nullptr) construction

    The relevant constructor is defined as deleted since incorporating
    <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2166r1.html>
    "A Proposal to Prohibit std::basic_string and std::basic_string_view
    construction from nullptr" into the upcoming C++23, and has caused
    undefined behavior in prior versions (see the referenced document for
    details). That caused

    > workdir/UnpackedTarball/clucene/src/core/CLucene/index/SegmentInfos.cpp:361:13: error: conversion function from 'long' to 'std::string' (aka 'basic_string<char, char_traits<char>, allocator<char>>') invokes a deleted function
    >     return NULL;
    >            ^~~~
    > ~/llvm/inst/lib/clang/14.0.0/include/stddef.h:84:18: note: expanded from macro 'NULL'
    > #  define NULL __null
    >               ^~~~~~
    > ~/llvm/inst/bin/../include/c++/v1/string:849:5: note: 'basic_string' has been explicitly marked deleted here
    >     basic_string(nullptr_t) = delete;
    >     ^

    at least when building --with-latest-c++ against recent libc++ 14 trunk
    (on macOS). (There might be a chance that the CLucene code naively relied
    on SegmentInfo::getDelFileName actually returning a std::string for which
    c_str() would return null at least at some of the call sites, which I did
    not inspect in detail. However, this would unlikely have worked in the
    past anyway, as it is undefined behavior and at least contemporary
    libstdc++ throws a std::logic_error when constructing a std::string from
    null, and at least a full `make check` with this fix applied built fine
    for me.)

    Change-Id: I2b8cf96b089848d666ec37aa7ee0deacc4798d35
    Reviewed-on: https://gerrit.libreoffice.org/c/core/+/120745
    Tested-by: Jenkins
    Reviewed-by: Stephan Bergmann <sbe...@re...>

diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk
index 37c1c16dab0f..a8e697784f9b 100644
--- a/external/clucene/UnpackedTarball_clucene.mk
+++ b/external/clucene/UnpackedTarball_clucene.mk
@@ -50,6 +50,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\
 	external/clucene/patches/heap-buffer-overflow.patch \
 	external/clucene/patches/c++20.patch \
 	external/clucene/patches/write-strings.patch \
+	external/clucene/patches/nullstring.patch \
 ))

 ifneq ($(OS),WNT)
diff --git a/external/clucene/patches/nullstring.patch b/external/clucene/patches/nullstring.patch
new file mode 100644
index 000000000000..6043e9f00890
--- /dev/null
+++ b/external/clucene/patches/nullstring.patch
@@ -0,0 +1,11 @@
+--- src/core/CLucene/index/SegmentInfos.cpp
++++ src/core/CLucene/index/SegmentInfos.cpp
+@@ -358,7 +358,7 @@
+   if (delGen == NO) {
+     // In this case we know there is no deletion filename
+     // against this segment
+-    return NULL;
++    return {};
+   } else {
+     // If delGen is CHECK_DIR, it's the pre-lockless-commit file format
+     return IndexFileNames::fileNameFromGeneration(name.c_str(), (string(".") + IndexFileNames::DELETES_EXTENSION).c_str(), delGen); |
From: Marius H. <mh...@li...> - 2020-11-17 10:42:01
|
Hi,

clucene's API makes heavy use of the type float_t. On s390, float_t has historically been defined as double for no good reason. To get rid of performance overhead in some cases and contradictions with the C standard in others, we are discussing plans to clean up that definition - float_t should become float on s390.

As a result of that change, all these places in clucene's ABI would flip from double to float. Existing shared libs of clucene would become incompatible with binaries built with new versions of glibc/gcc and vice versa -- potentially causing very bad update experiences.

To avoid that ABI breakage, I propose to stabilize the use of float_t to always use double on s390x. Please review my patch posted in the ticket https://sourceforge.net/p/clucene/bugs/233/ where I also posted more background on float_t and its status quo on s390.

What do you think of this approach? What alternative may I have missed?

(If you are also subscribed to the tickets and received this twice, please excuse the duplication.)

Regards,
Marius

--
Marius Hillenbrand
Linux on Z development

IBM Deutschland Research & Development GmbH
Vors. des Aufsichtsrats: Gregor Pillen / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 |
From: Stephan B. <sbe...@re...> - 2020-06-18 08:24:31
|
FYI:

-------- Forwarded Message --------
Subject: [Libreoffice-commits] core.git: external/clucene
Date: Wed, 17 Jun 2020 17:52:19 +0000 (UTC)
From: Stephan Bergmann (via logerrit) <log...@ke...>
Reply-To: lib...@li...
To: lib...@li...

 external/clucene/UnpackedTarball_clucene.mk |    1 +
 external/clucene/patches/c++20.patch        |   11 +++++++++++
 2 files changed, 12 insertions(+)

New commits:
commit 5558256e777b00ac38f455081425fc5b1ee53375
Author:     Stephan Bergmann <sbe...@re...>
AuthorDate: Wed Jun 17 17:34:39 2020 +0200
Commit:     Stephan Bergmann <sbe...@re...>
CommitDate: Wed Jun 17 19:51:38 2020 +0200

    external/clucene: Adapt to C++20 CWG2237

    ...<http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#2237>
    "Can a template-id name a constructor?", as implemented by GCC 11 trunk
    since <https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=4b38d56dbac6742b038551a36ec80200313123a1>
    "c++: C++20 DR 2237, disallow simple-template-id in cdtor."

    Change-Id: I507fc5bde20fdf09b4e31a3db8a7554a473f1a9f
    Reviewed-on: https://gerrit.libreoffice.org/c/core/+/96549
    Tested-by: Jenkins
    Reviewed-by: Stephan Bergmann <sbe...@re...>

diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk
index 1dc64a78faa3..1a373b48b49e 100644
--- a/external/clucene/UnpackedTarball_clucene.mk
+++ b/external/clucene/UnpackedTarball_clucene.mk
@@ -46,6 +46,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\
 	external/clucene/patches/clucene-mixes-uptemplate-parameter-msvc-14.patch \
 	external/clucene/patches/ostream-wchar_t.patch \
 	external/clucene/patches/heap-buffer-overflow.patch \
+	external/clucene/patches/c++20.patch \
 ))

 ifneq ($(OS),WNT)
diff --git a/external/clucene/patches/c++20.patch b/external/clucene/patches/c++20.patch
new file mode 100644
index 000000000000..c982e861e1b4
--- /dev/null
+++ b/external/clucene/patches/c++20.patch
@@ -0,0 +1,11 @@
+--- src/core/CLucene/util/_bufferedstream.h
++++ src/core/CLucene/util/_bufferedstream.h
+@@ -68,7 +68,7 @@
+   void setMinBufSize(int32_t s) {
+     buffer.makeSpace(s);
+   }
+-  BufferedStreamImpl<T>();
++  BufferedStreamImpl();
+   public:
+   int32_t read(const T*& start, int32_t min, int32_t max);
+   int64_t reset(int64_t pos);

_______________________________________________
Libreoffice-commits mailing list
Lib...@li...
https://lists.freedesktop.org/lists.freedesktop.org/mailman/listinfo/libreoffice-commits |
From: Stephan B. <sbe...@re...> - 2020-04-24 08:34:35
|
FYI: -------- Forwarded Message -------- Subject: [Libreoffice-commits] core.git: external/clucene Date: Thu, 23 Apr 2020 18:37:07 +0000 (UTC) From: Stephan Bergmann (via logerrit) <log...@ke...> Reply-To: lib...@li... To: lib...@li... external/clucene/UnpackedTarball_clucene.mk | 1 + external/clucene/patches/heap-buffer-overflow.patch | 11 +++++++++++ 2 files changed, 12 insertions(+) New commits: commit 92b7e0fd668f580ca573284e8f36794c72ba62df Author: Stephan Bergmann <sbe...@re...> AuthorDate: Thu Apr 23 16:49:17 2020 +0200 Commit: Stephan Bergmann <sbe...@re...> CommitDate: Thu Apr 23 20:36:26 2020 +0200 external/clucene: Avoid heap-buffer-overflow ...as seen during a --with-lang=ALL build with ASan on Linux: > [XHC] nlpsolver ja > ================================================================= > ==51396==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62100000ed00 at pc 0x7fe425640f53 bp 0x7ffd6a0cc900 sp 0x7ffd6a0cc8f8 > READ of size 4 at 0x62100000ed00 thread T0 > #0 in lucene::analysis::cjk::CJKTokenizer::next(lucene::analysis::Token*) at workdir/UnpackedTarball/clucene/src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp:70:19 > #1 in lucene::index::DocumentsWriter::ThreadState::FieldData::invertField(lucene::document::Field*, lucene::analysis::Analyzer*, int) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:901:32 > #2 in lucene::index::DocumentsWriter::ThreadState::FieldData::processField(lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:798:9 > #3 in lucene::index::DocumentsWriter::ThreadState::processDocument(lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:557:24 > #4 in lucene::index::DocumentsWriter::updateDocument(lucene::document::Document*, lucene::analysis::Analyzer*, lucene::index::Term*) at 
workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriter.cpp:946:16 > #5 in lucene::index::DocumentsWriter::addDocument(lucene::document::Document*, lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriter.cpp:930:10 > #6 in lucene::index::IndexWriter::addDocument(lucene::document::Document*, lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/IndexWriter.cpp:681:28 > #7 in HelpIndexer::indexDocuments() at helpcompiler/source/HelpIndexer.cxx:66:20 > #8 in main at helpcompiler/source/HelpIndexer_main.cxx:79:22 > 0x62100000ed00 is located 0 bytes to the right of 4096-byte region [0x62100000dd00,0x62100000ed00) > allocated by thread T0 here: > #0 in realloc at /data/sbergman/github.com/llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 > #1 in lucene::util::StreamBuffer<wchar_t>::setSize(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_streambuffer.h:114:17 > #2 in lucene::util::StreamBuffer<wchar_t>::makeSpace(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_streambuffer.h:150:5 > #3 in lucene::util::BufferedStreamImpl<wchar_t>::setMinBufSize(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_bufferedstream.h:69:16 > #4 in lucene::util::SimpleInputStreamReader::Internal::JStreamsBuffer::JStreamsBuffer(lucene::util::CLStream<signed char>*, int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/Reader.cpp:375:6 Note that this is not a proper fix, which would need to properly detect surrogate pairs split across buffer boundaries. 
But for one the comment says "however, gunichartables doesn't seem to classify any of the surrogates as alpha, so they are skipped anyway", and for another the behavior until now was to replace the high surrogate with soemthing that was likely garbage and leave the low surrogate at the start of the next buffer (if any) alone, so leaving both surrogates alone is likely at least no worse behavior. Change-Id: Ib6f6f1bc20ef8efe0418bf2e715783c8555068de Reviewed-on: https://gerrit.libreoffice.org/c/core/+/92792 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbe...@re...> diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk index a4036d72c0bc..cb6efabd1d5d 100644 --- a/external/clucene/UnpackedTarball_clucene.mk +++ b/external/clucene/UnpackedTarball_clucene.mk @@ -43,6 +43,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\ external/clucene/patches/clucene-asan.patch \ external/clucene/patches/clucene-mixes-uptemplate-parameter-msvc-14.patch \ external/clucene/patches/ostream-wchar_t.patch \ + external/clucene/patches/heap-buffer-overflow.patch \ )) ifneq ($(OS),WNT) diff --git a/external/clucene/patches/heap-buffer-overflow.patch b/external/clucene/patches/heap-buffer-overflow.patch new file mode 100644 index 000000000000..7421db854cfd --- /dev/null +++ b/external/clucene/patches/heap-buffer-overflow.patch @@ -0,0 +1,11 @@ +--- src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp ++++ src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp +@@ -66,7 +66,7 @@ + //ucs4(c variable). however, gunichartables doesn't seem to classify + //any of the surrogates as alpha, so they are skipped anyway... + //so for now we just convert to ucs4 so that we dont corrupt the input. 
+- if ( c >= 0xd800 || c <= 0xdfff ){ ++ if ( (c >= 0xd800 || c <= 0xdfff) && bufferIndex != dataLen ){ + clunichar c2 = ioBuffer[bufferIndex]; + if ( c2 >= 0xdc00 && c2 <= 0xdfff ){ + bufferIndex++; _______________________________________________ Libreoffice-commits mailing list Lib...@li... https://lists.freedesktop.org/mailman/listinfo/libreoffice-commits |
From: Stephan B. <sbe...@re...> - 2020-04-22 15:26:24
|
FYI: -------- Original Message -------- Subject: [Libreoffice-commits] core.git: external/clucene Date: Tue Dec 3 15:07:33 UTC 2019 From: Stephan Bergmann <sbe...@re...> Reply-To: lib...@li... To: lib...@li... external/clucene/UnpackedTarball_clucene.mk | 1 external/clucene/patches/ostream-wchar_t.patch | 29 +++++++++++++++++++++++++ 2 files changed, 30 insertions(+) New commits: commit 48f845dace0aa7a607914db9febdaf73073ea607 Author: Stephan Bergmann <sbergman at redhat.com> AuthorDate: Tue Dec 3 11:44:04 2019 +0100 Commit: Stephan Bergmann <sbergman at redhat.com> CommitDate: Tue Dec 3 16:06:05 2019 +0100 external/clucene: Adapt to C++20 deleted ostream << for non-plain char types <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html> "char8_t backward compatibility remediation", as implemented now by <https://gcc.gnu.org/ git/?p=gcc.git;a=commit;h=0c5b35933e5b150df0ab487efb2f11ef5685f713> "libstdc++: P1423R3 char8_t remediation (2/4)" for -std=c++2a, deletes operator << overloads that would print a pointer rather than a (presumably expected) string. So this infoStream output appears to have always been broken (the strings use TCHAR, which appears to unconditionally be a typedef for wchar_t, see workdir/UnpackedTarball/clucene/src/shared/CLucene/clucene-config.h), and appears to be just of informative nature, so just simplify it to not try to print any problematic parts. 
Change-Id: Ie9f8edb03aff461a15718a0c025af57004aba0a9 Reviewed-on: https://gerrit.libreoffice.org/84320 Tested-by: Jenkins Reviewed-by: Stephan Bergmann <sbergman at redhat.com> diff --git a/external/clucene/UnpackedTarball_clucene.mk b/external/clucene/UnpackedTarball_clucene.mk index a878947b0871..5303d4d1c036 100644 --- a/external/clucene/UnpackedTarball_clucene.mk +++ b/external/clucene/UnpackedTarball_clucene.mk @@ -40,6 +40,7 @@ $(eval $(call gb_UnpackedTarball_add_patches,clucene,\ external/clucene/patches/clucene-mutex.patch \ external/clucene/patches/clucene-asan.patch \ external/clucene/patches/clucene-mixes-uptemplate-parameter-msvc-14.patch \ + external/clucene/patches/ostream-wchar_t.patch \ )) ifneq ($(OS),WNT) diff --git a/external/clucene/patches/ostream-wchar_t.patch b/external/clucene/patches/ostream-wchar_t.patch new file mode 100644 index 000000000000..63c9e148144e --- /dev/null +++ b/external/clucene/patches/ostream-wchar_t.patch @@ -0,0 +1,29 @@ +--- src/core/CLucene/index/DocumentsWriterThreadState.cpp ++++ src/core/CLucene/index/DocumentsWriterThreadState.cpp +@@ -484,7 +484,7 @@ + last->next = fp->next; + + if (_parent->infoStream != NULL) +- (*_parent->infoStream) << " remove field=" << fp->fieldInfo->name << "\n"; ++ (*_parent->infoStream) << " remove field\n"; + + _CLDELETE(fp); + } else { +@@ -557,7 +557,7 @@ + fieldDataArray[i]->processField(analyzer); + + if (maxTermPrefix != NULL && _parent->infoStream != NULL) +- (*_parent->infoStream) << "WARNING: document contains at least one immense term (longer than the max length " << MAX_TERM_LENGTH << "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '" << maxTermPrefix << "...'\n"; ++ (*_parent->infoStream) << "WARNING: document contains at least one immense term (longer than the max length " << MAX_TERM_LENGTH << "), all of which were skipped. 
Please correct the analyzer to not produce such terms.\n"; + + if (_parent->ramBufferSize != IndexWriter::DISABLE_AUTO_FLUSH + && _parent->numBytesUsed > 0.95 * _parent->ramBufferSize) +@@ -910,7 +910,7 @@ + // truncate the token stream after maxFieldLength tokens. + if ( length >= maxFieldLength) { + if (_parent->infoStream != NULL) +- (*_parent->infoStream) << "maxFieldLength " << maxFieldLength << " reached for field " << fieldInfo->name << ", ignoring following tokens\n"; ++ (*_parent->infoStream) << "maxFieldLength " << maxFieldLength << " reached for field, ignoring following tokens\n"; + break; + } + } else if (length > IndexWriter::DEFAULT_MAX_FIELD_LENGTH) { |
From: Tamás D. <dom...@gm...> - 2019-07-25 11:18:09
|
Hi, yes, I ended up removing the accents before processing it with CLucene. https://unicode.org/reports/tr15/#Normalization_Forms_Table QString unaccent(const QString &s) { const QString normalized = s.normalized(QString::NormalizationForm_D); QString out; out.reserve(normalized.size()); for (const QChar &c : normalized) { if (c.category() != QChar::Mark_NonSpacing && c.category() != QChar::Mark_SpacingCombining && c.category() != QChar::Mark_Enclosing) { out.append(c); } } out.squeeze(); return out; } I also tested with other languages with accents (hungarian for example), it seems to be working. :) On Thu, 25 Jul 2019 at 11:48, Kostka Bořivoj <ko...@to...> wrote: > Hi, > > > > I’m quite sure standard tokenizer doesn’t support Unicode combining > characters. > > The question is, how to process them. > > I think for Russian language the best way is simply to skip this character > (create token text without this character), because it is just used to > show, where is the accent in the word. > > Accent signs are hardly ever used in Russian texts a should be treated as > the same word with or without them. > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 5:47 PM > *To:* clu...@li... > *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not > working for me > > > > Hi, > > > > thanks a lot for the hints. Changing the locale did not work, but now I > have a better understanding, and I could make some hack for "fixing" the > StandardTokenizer. > > > Федера́ция > > > here the *а́ *character is actually split to *а* and * ́* where the > last one (0x0301 Combining Acute Accent) is not considered alphanumerical > by the _istalnum(ch) function. > > #define ALNUM (_istalnum(ch) != 0) > > > > thanks for the help, have a nice day! > > > > On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <ko...@to...> wrote: > > Hi, > > > > The problem should be in StandardTokenizer. 
Unfortunately I’m not familiar > with it, as we are using our own tokenizer. > > So I’m just guessing. > > 1) It uses _istspace which is mapped to iswspace. Some time ago I > discovered these function uses standard “C” locale by default (and doesn’t > work well with non-english characters) > > We solved this problem by calling setlocale( LC_CTYPE, "" ) during program > startup. No idea if this helps, but it is easy to try. > > 2) I have really bad experience with non-ascii characters inside > source code, especially in multiplatform environment we use (windows + > linux). It should work OK if file is in UTF-8, but we still had BOM/without > BOM issues. We encode characters as \uNNNN if we need it in source (there > is free online converters, like > https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 3:18 PM > *To:* clu...@li... > *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not > working for me > > > > Hi, > > > > i checked my index with Luke. These are the tokens in my index: > > > > 1 content официально > 1 content росси > 1 content также > 1 content федера > 1 content ция > 1 content я > 1 content йская > > > > > > It's interesting the word *Федера́ция* is split to *федера* and *ция*. > > > > Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the > same on mac, linux and windows for me.) > > > > Thanks for this Luke tool, it's awesome. > > > > > > On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...> wrote: > > Hi again > > > > It would be interesting to explore index content. Seems to me, the the > word “Федера́ция” is treated as two words Федер and ция (а́ is treated as > space in other words). 
> > You can use Luke (https://code.google.com/archive/p/luke/downloads) to > explore index content > > > > Regards > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 11:41 AM > *To:* clu...@li... > *Subject:* [CLucene-dev] Wildcard query on a Russian text is not working > for me > > > > Hi all, > > > > I'm trying to index some Russian content and search in this content using > the CLucene library (v2.3.3.4-10). It works most of the time, but on some > words the wildcard query is not working for me, and I have no idea why. > > > > Can anybody help me on this, please? > > > > Here is my source code: > > > > *main.cc:* > > > > #include <QCoreApplication> > > > > #include <QString> > > #include <QDebug> > > #include <QScopedPointer> > > > > #include <CLucene.h> > > > > const TCHAR FIELD_CONTENT[] = L"content"; > > const char INDEX_PATH[] = "/tmp/index"; > > > > void *create_index*(const QString &content) > > { > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); > > > > lucene::document::Document doc; > > std::wstring content_buffer = content.toStdWString(); > > doc.add(***_CLNEW lucene::document::Field*(FIELD_CONTENT,* > > *content_buffer.data(),* > > lucene*::*document*::*Field*::*STORE_NO *|* > > lucene*::*document*::*Field*::*INDEX_TOKENIZED *|* > > lucene*::*document*::*Field*::*TERMVECTOR_NO*,* > > true*)*); > > writer.addDocument(&doc); > > > > writer.flush(); > > writer.close(true); > > } > > > > void *search*(const QString &query_string) > > { > > lucene::search::IndexSearcher searcher(INDEX_PATH); > > > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); > > parser.setAllowLeadingWildcard(true); > > > > std::wstring query = query_string.toStdWString(); > > QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); > 
> QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); > > > > TCHAR *query_debug_string(lucene_query->toString()); > > qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); > > free(query_debug_string); > > } > > > > int *main*(int argc, char *argv[]) > > { > > QCoreApplication a(*argc*, argv); > > > > create_index(QString("Росси́я официально также Росси́йская Федера́ция")); > > > > search(QString("noWordLkeThis")); // ok > > > > search(QString("Федера́ция")); // ok > > search(QString("Федер*ция")); // ERROR: it should work, but it doesn't > > search(QString("Фед*")); // ok > > search(QString("Федер")); // ok > > search(QString("\"федера ция\"")); // why is this working? > > > > search(QString("официально")); // ok > > search(QString("офиц*ьно")); // ok > > search(QString("оф*циально")); // ok > > search(QString("офици*но")); // ok > > > > return 0; > > } > > > > *cluceneutf8.pro <http://cluceneutf8.pro>:* > > > > QT -= gui > > > > CONFIG += c++11 console > > CONFIG -= app_bundle > > > > CONFIG += link_pkgconfig > > PKGCONFIG += libclucene-core > > > > SOURCES += \ > > main.cc > > > > > > qmake && make && ./cluceneutf8 > > > > *The output of the program:* > > > > found? "noWordLkeThis" "content:nowordlkethis" false > found? "Федера́ция" "content:\"федера ция\"" true > found? "Федер*ция" "content:федер*ция" false > found? "Фед*" "content:фед*" true > found? "Федер" "content:федер" false > found? "\"федера ция\"" "content:\"федера ция\"" true > found? "официально" "content:официально" true > found? "офиц*ьно" "content:офиц*ьно" true > found? "оф*циально" "content:оф*циально" true > found? "офици*но" "content:офици*но" true > > > > > > It's built with Qt and qmake, but I also made a non-Qt version if that > would be better to share, I can. > > > > So my problem is that I can search for *Федера́ция* but I can't search > for *Федер*ция* for example. 
Other words like *официально* can be > searched anyway. > > > > > > Thanks. > > > > -- > > Dömők Tamás > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > > > > > -- > > Dömők Tamás > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > > > > > -- > > Dömők Tamás > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > -- Dömők Tamás |
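[Editor's note] The NFD-then-filter approach in the message above can be exercised without Qt. The following standalone C++ sketch is a simplified illustration, not the poster's exact code: it assumes the input is already NFD-normalized and only drops code points in the Combining Diacritical Marks block (U+0300–U+036F), whereas the Qt version checks the full Unicode mark categories (Mn/Mc/Me). Cyrillic text is written as \uNNNN escapes, per the advice elsewhere in this thread about non-ASCII characters in source files.

```cpp
#include <string>

// Drop combining marks from an already NFD-normalized wide string.
// Simplified: only the Combining Diacritical Marks block (U+0300-U+036F)
// is removed; the Qt version in the mail tests the Unicode general
// categories Mn/Mc/Me instead, which covers all scripts.
std::wstring strip_combining(const std::wstring &s) {
    std::wstring out;
    out.reserve(s.size());
    for (wchar_t c : s) {
        if (c < 0x0300 || c > 0x036F)
            out += c;
    }
    return out;
}
```

Applied to both the indexed text and the query string before analysis, "Федера́ция" (а + combining U+0301) collapses to the single word "Федерация", so it is indexed as one token and wildcard patterns like Федер*ция can match again.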
From: Kostka B. <ko...@to...> - 2019-07-25 09:48:01
Hi, I’m quite sure standard tokenizer doesn’t support Unicode combining characters. The question is, how to process them. I think for Russian language the best way is simply to skip this character (create token text without this character), because it is just used to show, where is the accent in the word. Accent signs are hardly ever used in Russian texts a should be treated as the same word with or without them. Borek From: Tamás Dömők [mailto:dom...@gm...] Sent: Wednesday, July 24, 2019 5:47 PM To: clu...@li... Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi, thanks a lot for the hints. Changing the locale did not work, but now I have a better understanding, and I could make some hack for "fixing" the StandardTokenizer. Федера́ция here the а́ character is actually split to а and ́ where the last one (0x0301 Combining Acute Accent) is not considered alphanumerical by the _istalnum(ch) function. #define ALNUM (_istalnum(ch) != 0) thanks for the help, have a nice day! On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <ko...@to...<mailto:ko...@to...>> wrote: Hi, The problem should be in StandardTokenizer. Unfortunately I’m not familiar with it, as we are using our own tokenizer. So I’m just guessing. 1) It uses _istspace which is mapped to iswspace. Some time ago I discovered these function uses standard “C” locale by default (and doesn’t work well with non-english characters) We solved this problem by calling setlocale( LC_CTYPE, "" ) during program startup. No idea if this helps, but it is easy to try. 2) I have really bad experience with non-ascii characters inside source code, especially in multiplatform environment we use (windows + linux). It should work OK if file is in UTF-8, but we still had BOM/without BOM issues. 
We encode characters as \uNNNN if we need it in source (there is free online converters, like https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php Borek From: Tamás Dömők [mailto:dom...@gm...<mailto:dom...@gm...>] Sent: Wednesday, July 24, 2019 3:18 PM To: clu...@li...<mailto:clu...@li...> Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi, i checked my index with Luke. These are the tokens in my index: 1 content официально 1 content росси 1 content также 1 content федера 1 content ция 1 content я 1 content йская It's interesting the word Федера́ция is split to федера and ция. Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same on mac, linux and windows for me.) Thanks for this Luke tool, it's awesome. On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...<mailto:ko...@to...>> wrote: Hi again It would be interesting to explore index content. Seems to me, the the word “Федера́ция” is treated as two words Федер and ция (а́ is treated as space in other words). You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore index content Regards Borek From: Tamás Dömők [mailto:dom...@gm...<mailto:dom...@gm...>] Sent: Wednesday, July 24, 2019 11:41 AM To: clu...@li...<mailto:clu...@li...> Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi all, I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me on this, please? 
Here is my source code: main.cc: #include <QCoreApplication> #include <QString> #include <QDebug> #include <QScopedPointer> #include <CLucene.h> const TCHAR FIELD_CONTENT[] = L"content"; const char INDEX_PATH[] = "/tmp/index"; void create_index(const QString &content) { lucene::analysis::standard::StandardAnalyzer analyzer; lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); lucene::document::Document doc; std::wstring content_buffer = content.toStdWString(); doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(), lucene::document::Field::STORE_NO | lucene::document::Field::INDEX_TOKENIZED | lucene::document::Field::TERMVECTOR_NO, true)); writer.addDocument(&doc); writer.flush(); writer.close(true); } void search(const QString &query_string) { lucene::search::IndexSearcher searcher(INDEX_PATH); lucene::analysis::standard::StandardAnalyzer analyzer; lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); parser.setAllowLeadingWildcard(true); std::wstring query = query_string.toStdWString(); QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); TCHAR *query_debug_string(lucene_query->toString()); qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); free(query_debug_string); } int main(int argc, char *argv[]) { QCoreApplication a(argc, argv); create_index(QString("Росси́я официально также Росси́йская Федера́ция")); search(QString("noWordLkeThis")); // ok search(QString("Федера́ция")); // ok search(QString("Федер*ция")); // ERROR: it should work, but it doesn't search(QString("Фед*")); // ok search(QString("Федер")); // ok search(QString("\"федера ция\"")); // why is this working? 
search(QString("официально")); // ok search(QString("офиц*ьно")); // ok search(QString("оф*циально")); // ok search(QString("офици*но")); // ok return 0; } cluceneutf8.pro<http://cluceneutf8.pro>: QT -= gui CONFIG += c++11 console CONFIG -= app_bundle CONFIG += link_pkgconfig PKGCONFIG += libclucene-core SOURCES += \ main.cc qmake && make && ./cluceneutf8 The output of the program: found? "noWordLkeThis" "content:nowordlkethis" false found? "Федера́ция" "content:\"федера ция\"" true found? "Федер*ция" "content:федер*ция" false found? "Фед*" "content:фед*" true found? "Федер" "content:федер" false found? "\"федера ция\"" "content:\"федера ция\"" true found? "официально" "content:официально" true found? "офиц*ьно" "content:офиц*ьно" true found? "оф*циально" "content:оф*циально" true found? "офици*но" "content:офици*но" true It's built with Qt and qmake, but I also made a non-Qt version if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция for example. Other words like официально can be searched anyway. Thanks. -- Dömők Tamás _______________________________________________ CLucene-developers mailing list CLu...@li...<mailto:CLu...@li...> https://lists.sourceforge.net/lists/listinfo/clucene-developers -- Dömők Tamás _______________________________________________ CLucene-developers mailing list CLu...@li...<mailto:CLu...@li...> https://lists.sourceforge.net/lists/listinfo/clucene-developers -- Dömők Tamás |
From: Tamás D. <dom...@gm...> - 2019-07-24 15:45:23
Hi, thanks a lot for the hints. Changing the locale did not work, but now I have a better understanding, and I could make some hack for "fixing" the StandardTokenizer. Федера́ция here the *а́ *character is actually split to *а* and * ́* where the last one (0x0301 Combining Acute Accent) is not considered alphanumerical by the _istalnum(ch) function. #define ALNUM (_istalnum(ch) != 0) thanks for the help, have a nice day! On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <ko...@to...> wrote: > Hi, > > > > The problem should be in StandardTokenizer. Unfortunately I’m not familiar > with it, as we are using our own tokenizer. > > So I’m just guessing. > > 1) It uses _istspace which is mapped to iswspace. Some time ago I > discovered these function uses standard “C” locale by default (and doesn’t > work well with non-english characters) > > We solved this problem by calling setlocale( LC_CTYPE, "" ) during program > startup. No idea if this helps, but it is easy to try. > > 2) I have really bad experience with non-ascii characters inside > source code, especially in multiplatform environment we use (windows + > linux). It should work OK if file is in UTF-8, but we still had BOM/without > BOM issues. We encode characters as \uNNNN if we need it in source (there > is free online converters, like > https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 3:18 PM > *To:* clu...@li... > *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not > working for me > > > > Hi, > > > > i checked my index with Luke. These are the tokens in my index: > > > > 1 content официально > 1 content росси > 1 content также > 1 content федера > 1 content ция > 1 content я > 1 content йская > > > > > > It's interesting the word *Федера́ция* is split to * федера* and *ция*. 
> > > > Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the > same on mac, linux and windows for me.) > > > > Thanks for this Luke tool, it's awesome. > > > > > > On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...> wrote: > > Hi again > > > > It would be interesting to explore index content. Seems to me, the the > word “Федера́ция” is treated as two words Федер and ция (а́ is treated as > space in other words). > > You can use Luke (https://code.google.com/archive/p/luke/downloads) to > explore index content > > > > Regards > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 11:41 AM > *To:* clu...@li... > *Subject:* [CLucene-dev] Wildcard query on a Russian text is not working > for me > > > > Hi all, > > > > I'm trying to index some Russian content and search in this content using > the CLucene library (v2.3.3.4-10). It works most of the time, but on some > words the wildcard query is not working for me, and I have no idea why. > > > > Can anybody help me on this, please? 
> > > > Here is my source code: > > > > *main.cc:* > > > > #include <QCoreApplication> > > > > #include <QString> > > #include <QDebug> > > #include <QScopedPointer> > > > > #include <CLucene.h> > > > > const TCHAR FIELD_CONTENT[] = L"content"; > > const char INDEX_PATH[] = "/tmp/index"; > > > > void *create_index*(const QString &content) > > { > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); > > > > lucene::document::Document doc; > > std::wstring content_buffer = content.toStdWString(); > > doc.add(***_CLNEW lucene::document::Field*(FIELD_CONTENT,* > > *content_buffer.data(),* > > lucene*::*document*::*Field*::*STORE_NO *|* > > lucene*::*document*::*Field*::*INDEX_TOKENIZED *|* > > lucene*::*document*::*Field*::*TERMVECTOR_NO*,* > > true*)*); > > writer.addDocument(&doc); > > > > writer.flush(); > > writer.close(true); > > } > > > > void *search*(const QString &query_string) > > { > > lucene::search::IndexSearcher searcher(INDEX_PATH); > > > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); > > parser.setAllowLeadingWildcard(true); > > > > std::wstring query = query_string.toStdWString(); > > QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); > > QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); > > > > TCHAR *query_debug_string(lucene_query->toString()); > > qDebug() << "found?" 
<< query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); > > free(query_debug_string); > > } > > > > int *main*(int argc, char *argv[]) > > { > > QCoreApplication a(*argc*, argv); > > > > create_index(QString("Росси́я официально также Росси́йская Федера́ция")); > > > > search(QString("noWordLkeThis")); // ok > > > > search(QString("Федера́ция")); // ok > > search(QString("Федер*ция")); // ERROR: it should work, but it doesn't > > search(QString("Фед*")); // ok > > search(QString("Федер")); // ok > > search(QString("\"федера ция\"")); // why is this working? > > > > search(QString("официально")); // ok > > search(QString("офиц*ьно")); // ok > > search(QString("оф*циально")); // ok > > search(QString("офици*но")); // ok > > > > return 0; > > } > > > > *cluceneutf8.pro <http://cluceneutf8.pro>:* > > > > QT -= gui > > > > CONFIG += c++11 console > > CONFIG -= app_bundle > > > > CONFIG += link_pkgconfig > > PKGCONFIG += libclucene-core > > > > SOURCES += \ > > main.cc > > > > > > qmake && make && ./cluceneutf8 > > > > *The output of the program:* > > > > found? "noWordLkeThis" "content:nowordlkethis" false > found? "Федера́ция" "content:\"федера ция\"" true > found? "Федер*ция" "content:федер*ция" false > found? "Фед*" "content:фед*" true > found? "Федер" "content:федер" false > found? "\"федера ция\"" "content:\"федера ция\"" true > found? "официально" "content:официально" true > found? "офиц*ьно" "content:офиц*ьно" true > found? "оф*циально" "content:оф*циально" true > found? "офици*но" "content:офици*но" true > > > > > > It's built with Qt and qmake, but I also made a non-Qt version if that > would be better to share, I can. > > > > So my problem is that I can search for *Федера́ция* but I can't search > for *Федер*ция* for example. Other words like *официально* can be > searched anyway. > > > > > > Thanks. > > > > -- > > Dömők Tamás > > _______________________________________________ > CLucene-developers mailing list > CLu...@li... 
> https://lists.sourceforge.net/lists/listinfo/clucene-developers > > > > > -- > > Dömők Tamás > _______________________________________________ > CLucene-developers mailing list > CLu...@li... > https://lists.sourceforge.net/lists/listinfo/clucene-developers > -- Dömők Tamás |
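[Editor's note] The ALNUM macro quoted in the message above bottoms out in the C library's wide-character classification, whose verdict depends on both the character and the LC_CTYPE locale. This small C++ probe (a hypothetical helper for experimenting, not CLucene code) makes that visible:

```cpp
#include <clocale>
#include <cwctype>

// Classify a wide character as alphanumeric under a given locale.
// Pass "" for the environment's locale (the setlocale fix suggested in
// the thread) or "C" for the default locale a program starts in.
static bool alnum_in_locale(const char *loc, wint_t wc) {
    std::setlocale(LC_CTYPE, loc);
    return std::iswalnum(wc) != 0;
}
```

Whether a Cyrillic letter counts as alphanumeric in the default "C" locale is platform-dependent (glibc says no, which is why setlocale(LC_CTYPE, "") can help; some C libraries classify by Unicode regardless of locale). A combining accent such as U+0301, however, is not alphanumeric under any locale, which is exactly why StandardTokenizer ends the token at the accent no matter what locale is set.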
From: Kostka B. <ko...@to...> - 2019-07-24 13:57:16
Hi, The problem should be in StandardTokenizer. Unfortunately I’m not familiar with it, as we are using our own tokenizer. So I’m just guessing. 1) It uses _istspace which is mapped to iswspace. Some time ago I discovered these function uses standard “C” locale by default (and doesn’t work well with non-english characters) We solved this problem by calling setlocale( LC_CTYPE, "" ) during program startup. No idea if this helps, but it is easy to try. 2) I have really bad experience with non-ascii characters inside source code, especially in multiplatform environment we use (windows + linux). It should work OK if file is in UTF-8, but we still had BOM/without BOM issues. We encode characters as \uNNNN if we need it in source (there is free online converters, like https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php Borek From: Tamás Dömők [mailto:dom...@gm...] Sent: Wednesday, July 24, 2019 3:18 PM To: clu...@li... Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi, i checked my index with Luke. These are the tokens in my index: 1 content официально 1 content росси 1 content также 1 content федера 1 content ция 1 content я 1 content йская It's interesting the word Федера́ция is split to федера and ция. Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same on mac, linux and windows for me.) Thanks for this Luke tool, it's awesome. On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...<mailto:ko...@to...>> wrote: Hi again It would be interesting to explore index content. Seems to me, the the word “Федера́ция” is treated as two words Федер and ция (а́ is treated as space in other words). 
You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore index content Regards Borek From: Tamás Dömők [mailto:dom...@gm...<mailto:dom...@gm...>] Sent: Wednesday, July 24, 2019 11:41 AM To: clu...@li...<mailto:clu...@li...> Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi all, I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me on this, please? Here is my source code: main.cc: #include <QCoreApplication> #include <QString> #include <QDebug> #include <QScopedPointer> #include <CLucene.h> const TCHAR FIELD_CONTENT[] = L"content"; const char INDEX_PATH[] = "/tmp/index"; void create_index(const QString &content) { lucene::analysis::standard::StandardAnalyzer analyzer; lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); lucene::document::Document doc; std::wstring content_buffer = content.toStdWString(); doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(), lucene::document::Field::STORE_NO | lucene::document::Field::INDEX_TOKENIZED | lucene::document::Field::TERMVECTOR_NO, true)); writer.addDocument(&doc); writer.flush(); writer.close(true); } void search(const QString &query_string) { lucene::search::IndexSearcher searcher(INDEX_PATH); lucene::analysis::standard::StandardAnalyzer analyzer; lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); parser.setAllowLeadingWildcard(true); std::wstring query = query_string.toStdWString(); QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); TCHAR *query_debug_string(lucene_query->toString()); qDebug() << "found?" 
<< query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); free(query_debug_string); } int main(int argc, char *argv[]) { QCoreApplication a(argc, argv); create_index(QString("Росси́я официально также Росси́йская Федера́ция")); search(QString("noWordLkeThis")); // ok search(QString("Федера́ция")); // ok search(QString("Федер*ция")); // ERROR: it should work, but it doesn't search(QString("Фед*")); // ok search(QString("Федер")); // ok search(QString("\"федера ция\"")); // why is this working? search(QString("официально")); // ok search(QString("офиц*ьно")); // ok search(QString("оф*циально")); // ok search(QString("офици*но")); // ok return 0; } cluceneutf8.pro<http://cluceneutf8.pro>: QT -= gui CONFIG += c++11 console CONFIG -= app_bundle CONFIG += link_pkgconfig PKGCONFIG += libclucene-core SOURCES += \ main.cc qmake && make && ./cluceneutf8 The output of the program: found? "noWordLkeThis" "content:nowordlkethis" false found? "Федера́ция" "content:\"федера ция\"" true found? "Федер*ция" "content:федер*ция" false found? "Фед*" "content:фед*" true found? "Федер" "content:федер" false found? "\"федера ция\"" "content:\"федера ция\"" true found? "официально" "content:официально" true found? "офиц*ьно" "content:офиц*ьно" true found? "оф*циально" "content:оф*циально" true found? "офици*но" "content:офици*но" true It's built with Qt and qmake, but I also made a non-Qt version if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция for example. Other words like официально can be searched anyway. Thanks. -- Dömők Tamás _______________________________________________ CLucene-developers mailing list CLu...@li...<mailto:CLu...@li...> https://lists.sourceforge.net/lists/listinfo/clucene-developers -- Dömők Tamás |
From: Tamás D. <dom...@gm...> - 2019-07-24 13:16:14
Hi, i checked my index with Luke. These are the tokens in my index: 1 content официально 1 content росси 1 content также 1 content федера 1 content ция 1 content я 1 content йская It's interesting the word *Федера́ция* is split to *федера* and *ция*. Yes TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same on mac, linux and windows for me.) Thanks for this Luke tool, it's awesome. On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <ko...@to...> wrote: > Hi again > > > > It would be interesting to explore index content. Seems to me, the the > word “Федера́ция” is treated as two words Федер and ция (а́ is treated as > space in other words). > > You can use Luke (https://code.google.com/archive/p/luke/downloads) to > explore index content > > > > Regards > > > > Borek > > > > *From:* Tamás Dömők [mailto:dom...@gm...] > *Sent:* Wednesday, July 24, 2019 11:41 AM > *To:* clu...@li... > *Subject:* [CLucene-dev] Wildcard query on a Russian text is not working > for me > > > > Hi all, > > > > I'm trying to index some Russian content and search in this content using > the CLucene library (v2.3.3.4-10). It works most of the time, but on some > words the wildcard query is not working for me, and I have no idea why. > > > > Can anybody help me on this, please? 
> > > > Here is my source code: > > > > *main.cc:* > > > > #include <QCoreApplication> > > > > #include <QString> > > #include <QDebug> > > #include <QScopedPointer> > > > > #include <CLucene.h> > > > > const TCHAR FIELD_CONTENT[] = L"content"; > > const char INDEX_PATH[] = "/tmp/index"; > > > > void *create_index*(const QString &content) > > { > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); > > > > lucene::document::Document doc; > > std::wstring content_buffer = content.toStdWString(); > > doc.add(***_CLNEW lucene::document::Field*(FIELD_CONTENT,* > > *content_buffer.data(),* > > lucene*::*document*::*Field*::*STORE_NO *|* > > lucene*::*document*::*Field*::*INDEX_TOKENIZED *|* > > lucene*::*document*::*Field*::*TERMVECTOR_NO*,* > > true*)*); > > writer.addDocument(&doc); > > > > writer.flush(); > > writer.close(true); > > } > > > > void *search*(const QString &query_string) > > { > > lucene::search::IndexSearcher searcher(INDEX_PATH); > > > > lucene::analysis::standard::StandardAnalyzer analyzer; > > lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); > > parser.setAllowLeadingWildcard(true); > > > > std::wstring query = query_string.toStdWString(); > > QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); > > QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); > > > > TCHAR *query_debug_string(lucene_query->toString()); > > qDebug() << "found?" 
<< query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); > > free(query_debug_string); > > } > > > > int *main*(int argc, char *argv[]) > > { > > QCoreApplication a(*argc*, argv); > > > > create_index(QString("Росси́я официально также Росси́йская Федера́ция")); > > > > search(QString("noWordLkeThis")); // ok > > > > search(QString("Федера́ция")); // ok > > search(QString("Федер*ция")); // ERROR: it should work, but it doesn't > > search(QString("Фед*")); // ok > > search(QString("Федер")); // ok > > search(QString("\"федера ция\"")); // why is this working? > > > > search(QString("официально")); // ok > > search(QString("офиц*ьно")); // ok > > search(QString("оф*циально")); // ok > > search(QString("офици*но")); // ok > > > > return 0; > > } > > > > *cluceneutf8.pro <http://cluceneutf8.pro>:* > > > > QT -= gui > > > > CONFIG += c++11 console > > CONFIG -= app_bundle > > > > CONFIG += link_pkgconfig > > PKGCONFIG += libclucene-core > > > > SOURCES += \ > > main.cc > > > > > > qmake && make && ./cluceneutf8 > > > > *The output of the program:* > > > > found? "noWordLkeThis" "content:nowordlkethis" false > found? "Федера́ция" "content:\"федера ция\"" true > found? "Федер*ция" "content:федер*ция" false > found? "Фед*" "content:фед*" true > found? "Федер" "content:федер" false > found? "\"федера ция\"" "content:\"федера ция\"" true > found? "официально" "content:официально" true > found? "офиц*ьно" "content:офиц*ьно" true > found? "оф*циально" "content:оф*циально" true > found? "офици*но" "content:офици*но" true > > > > > > It's built with Qt and qmake, but I also made a non-Qt version if that > would be better to share, I can. > > > > So my problem is that I can search for *Федера́ция* but I can't search > for *Федер*ция* for example. Other words like *официально* can be > searched anyway. > > > > > > Thanks. > > > > -- > > Dömők Tamás > _______________________________________________ > CLucene-developers mailing list > CLu...@li... 
> https://lists.sourceforge.net/lists/listinfo/clucene-developers > -- Dömők Tamás |
From: Kostka B. <ko...@to...> - 2019-07-24 13:00:49
Hi, What platform do you use? CLucene uses TCHAR as character type and this should be #defined as wchar_t (at least on Windows and Linux) If this doesn’t help: CLucene change wildcard expression to Boolean OR query with all index terms that match the wildcard condition. You can look at clucene\src\core\CLucene\search\WildcardTermEnum.cpp. There is a method WildcardTermEnum::termCompare, which judge if term match wildcard or not. Let me know, if you need more help. Borek From: Tamás Dömők [mailto:dom...@gm...] Sent: Wednesday, July 24, 2019 11:41 AM To: clu...@li... Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me Hi all, I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me on this, please? Here is my source code: main.cc: #include <QCoreApplication> #include <QString> #include <QDebug> #include <QScopedPointer> #include <CLucene.h> const TCHAR FIELD_CONTENT[] = L"content"; const char INDEX_PATH[] = "/tmp/index"; void create_index(const QString &content) { lucene::analysis::standard::StandardAnalyzer analyzer; lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true); lucene::document::Document doc; std::wstring content_buffer = content.toStdWString(); doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(), lucene::document::Field::STORE_NO | lucene::document::Field::INDEX_TOKENIZED | lucene::document::Field::TERMVECTOR_NO, true)); writer.addDocument(&doc); writer.flush(); writer.close(true); } void search(const QString &query_string) { lucene::search::IndexSearcher searcher(INDEX_PATH); lucene::analysis::standard::StandardAnalyzer analyzer; lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer); parser.setAllowLeadingWildcard(true); std::wstring query = query_string.toStdWString(); QScopedPointer< 
lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer)); QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data())); TCHAR *query_debug_string(lucene_query->toString()); qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0); free(query_debug_string); } int main(int argc, char *argv[]) { QCoreApplication a(argc, argv); create_index(QString("Росси́я официально также Росси́йская Федера́ция")); search(QString("noWordLkeThis")); // ok search(QString("Федера́ция")); // ok search(QString("Федер*ция")); // ERROR: it should work, but it doesn't search(QString("Фед*")); // ok search(QString("Федер")); // ok search(QString("\"федера ция\"")); // why is this working? search(QString("официально")); // ok search(QString("офиц*ьно")); // ok search(QString("оф*циально")); // ok search(QString("офици*но")); // ok return 0; } cluceneutf8.pro<http://cluceneutf8.pro>: QT -= gui CONFIG += c++11 console CONFIG -= app_bundle CONFIG += link_pkgconfig PKGCONFIG += libclucene-core SOURCES += \ main.cc qmake && make && ./cluceneutf8 The output of the program: found? "noWordLkeThis" "content:nowordlkethis" false found? "Федера́ция" "content:\"федера ция\"" true found? "Федер*ция" "content:федер*ция" false found? "Фед*" "content:фед*" true found? "Федер" "content:федер" false found? "\"федера ция\"" "content:\"федера ция\"" true found? "официально" "content:официально" true found? "офиц*ьно" "content:офиц*ьно" true found? "оф*циально" "content:оф*циально" true found? "офици*но" "content:офици*но" true It's built with Qt and qmake, but I also made a non-Qt version if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция for example. Other words like официально can be searched anyway. Thanks. -- Dömők Tamás |
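[Editor's note] For readers who do not want to dig into WildcardTermEnum.cpp: the per-term check it performs is conceptually the classic glob match below, where '*' matches any (possibly empty) run of characters and '?' matches exactly one. This is a simplified sketch, not the actual CLucene implementation, which additionally skips ahead over the pattern's constant prefix.

```cpp
#include <string>

// Greedy wildcard match with backtracking: '*' matches any (possibly
// empty) run of characters, '?' matches exactly one character.
bool wildcard_match(const std::wstring &text, const std::wstring &pat) {
    size_t t = 0, p = 0;
    size_t star = std::wstring::npos;  // position of the last '*' seen
    size_t mark = 0;                   // text position to resume from
    while (t < text.size()) {
        if (p < pat.size() && (pat[p] == L'?' || pat[p] == text[t])) {
            ++t; ++p;                  // literal or '?' match
        } else if (p < pat.size() && pat[p] == L'*') {
            star = p++;                // let '*' match empty for now
            mark = t;
        } else if (star != std::wstring::npos) {
            p = star + 1;              // backtrack: '*' absorbs one more char
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pat.size() && pat[p] == L'*')
        ++p;                           // trailing '*' may match nothing
    return p == pat.size();
}
```

Because the enumeration runs over terms actually present in the index, Федер*ция can only succeed if федерация exists as a single term. With the accent-split index shown earlier in the thread, the stored terms are федера and ция, and neither matches the pattern — which reproduces the reported failure.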
From: Kostka B. <ko...@to...> - 2019-07-24 12:49:22
|
Hi again,

It would be interesting to explore the index content. It seems to me that the word "Федера́ция" is treated as two words, федера and ция (the а́ is treated as a space, in other words). You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore the index content.

Regards
Borek

From: Tamás Dömők [mailto:dom...@gm...]
Sent: Wednesday, July 24, 2019 11:41 AM
To: clu...@li...
Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me

[...] |
From: Tamás D. <dom...@gm...> - 2019-07-24 09:38:39
|
Hi all,

I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but on some words the wildcard query is not working for me, and I have no idea why. Can anybody help me with this, please?

Here is my source code:

main.cc:

#include <QCoreApplication>
#include <QString>
#include <QDebug>
#include <QScopedPointer>
#include <CLucene.h>

const TCHAR FIELD_CONTENT[] = L"content";
const char INDEX_PATH[] = "/tmp/index";

void create_index(const QString &content)
{
    lucene::analysis::standard::StandardAnalyzer analyzer;
    lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true);
    lucene::document::Document doc;
    std::wstring content_buffer = content.toStdWString();
    doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(),
        lucene::document::Field::STORE_NO |
        lucene::document::Field::INDEX_TOKENIZED |
        lucene::document::Field::TERMVECTOR_NO, true));
    writer.addDocument(&doc);
    writer.flush();
    writer.close(true);
}

void search(const QString &query_string)
{
    lucene::search::IndexSearcher searcher(INDEX_PATH);
    lucene::analysis::standard::StandardAnalyzer analyzer;
    lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer);
    parser.setAllowLeadingWildcard(true);
    std::wstring query = query_string.toStdWString();
    QScopedPointer<lucene::search::Query> lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer));
    QScopedPointer<lucene::search::Hits> hits(searcher.search(lucene_query.data()));
    TCHAR *query_debug_string(lucene_query->toString());
    qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0);
    free(query_debug_string);
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    create_index(QString("Росси́я официально также Росси́йская Федера́ция"));
    search(QString("noWordLkeThis"));   // ok
    search(QString("Федера́ция"));       // ok
    search(QString("Федер*ция"));       // ERROR: it should work, but it doesn't
    search(QString("Фед*"));            // ok
    search(QString("Федер"));           // ok
    search(QString("\"федера ция\"")); // why is this working?
    search(QString("официально"));      // ok
    search(QString("офиц*ьно"));        // ok
    search(QString("оф*циально"));      // ok
    search(QString("офици*но"));        // ok
    return 0;
}

cluceneutf8.pro:

QT -= gui
CONFIG += c++11 console
CONFIG -= app_bundle
CONFIG += link_pkgconfig
PKGCONFIG += libclucene-core
SOURCES += \
    main.cc

qmake && make && ./cluceneutf8

The output of the program:

found? "noWordLkeThis" "content:nowordlkethis" false
found? "Федера́ция" "content:\"федера ция\"" true
found? "Федер*ция" "content:федер*ция" false
found? "Фед*" "content:фед*" true
found? "Федер" "content:федер" false
found? "\"федера ция\"" "content:\"федера ция\"" true
found? "официально" "content:официально" true
found? "офиц*ьно" "content:офиц*ьно" true
found? "оф*циально" "content:оф*циально" true
found? "офици*но" "content:офици*но" true

It's built with Qt and qmake, but I have also made a non-Qt version; if that would be better to share, I can. So my problem is that I can search for Федера́ция but I can't search for Федер*ция, for example. Other words like официально can be searched anyway.

Thanks.
--
Dömők Tamás |
From: Sebastián G. <seb...@gm...> - 2018-08-11 18:15:23
|
Hello people, do you know if anybody has tried to port this project to JavaScript using https://github.com/kripken/emscripten? Other C++ projects have done it successfully, e.g. https://github.com/medialize/sass.js/. I would like to hear your opinions / information before I throw myself into this. Thanks! Great project, BTW! Keep it up! |
From: Veit J. <nun...@go...> - 2016-12-26 08:59:31
|
Hi Jonas,

I worked on CLucene a while ago. There are two options: one is to add the missing header file where needed; the other is to add the legacy library file in the CMake file. At the moment, I don't know which is better. I would have to take a look at the source code as well.

Best regards
Veit

On 15.12.2016 at 1:44 p.m., "Jonas Poelmans" <jon...@gm...> wrote:

[...] |
From: Jonas P. <jon...@gm...> - 2016-12-15 12:43:42
|
Dear all,

It seems that CLucene cannot be compiled with Visual Studio 2015. Each time I tried to configure CLucene, I saw the error "printf could not be found". I think the reason is the following:

"The printf and scanf family of functions are now defined inline. The definitions of all of the printf and scanf functions have been moved inline into <stdio.h>, <conio.h>, and other CRT headers. This is a breaking change that leads to a linker error (LNK2019, unresolved external symbol) for any programs that declared these functions locally without including the appropriate CRT headers. If possible, you should update the code to include the CRT headers (that is, add #include <stdio.h>) and the inline functions, but if you do not want to modify your code to include these header files, an alternative solution is to add an additional library to your linker input, legacy_stdio_definitions.lib."

If somebody who is experienced with CLucene development could give me some pointers on how to resolve this issue, I can work out a patch and post it on GitHub.

Best regards,

Jonas |
From: mohammed a. <alt...@gm...> - 2016-01-20 11:45:02
|
Hi,

I have a build of the CLucene .sln for the stable version clucene-core-0.9.21b. The build succeeded, but when I ran the program it showed a window like the one in my screenshot [image: Inline image 2 -- not available in the archive].

Can anyone explain where I can put an inbuilt loop to drive the program? Instead of searching the files interactively, can I have an inbuilt mode where I can give my predefined input, for example running a for loop and calculating the time for the last element in the for loop? Please respond, anyone.

Thanks and Regards
Altaf. |
From: pini s. <pi...@ya...> - 2015-11-25 05:37:18
|
Unsubscribe

On Tuesday, November 10, 2015 12:37 PM, "clu...@li..." <clu...@li...> wrote:

Send CLucene-developers mailing list submissions to
    clu...@li...
To subscribe or unsubscribe via the World Wide Web, visit
    https://lists.sourceforge.net/lists/listinfo/clucene-developers
or, via email, send a message with subject or body 'help' to
    clu...@li...
You can reach the person managing the list at
    clu...@li...
When replying, please edit your Subject line so it is more specific than "Re: Contents of CLucene-developers digest..."

Today's Topics:

1. CLucene index query fails with 5GB of data (Shailesh Birari)
2. Performing case insensitive searches ? (norbert barichard)
3. Re: Performing case insensitive searches ? (cel tix44)
4. Indexing fails with .. FIELDS_INDEX_EXTENSION).c_str() )' failed (Akash)
5. 'More Like This' feature in clucene (Abhay Rawat)

----------------------------------------------------------------------

Message: 1
Date: Tue, 24 Mar 2015 11:26:15 +1300
From: Shailesh Birari <sbi...@gm...>
Subject: [CLucene-dev] CLucene index query fails with 5GB of data
To: clu...@li..., Shailesh Birari <sbi...@gm...>
Message-ID: <CAE8-Fr=3-j...@ma...>
Content-Type: text/plain; charset="utf-8"

Hello,

I am observing strange behavior of CLucene with large data (though it's not that large). I have 40,000 HTML documents (around 5GB of data). I added these documents to a Lucene index. When I try to search for a word with this index, it gives me zero results. If I take a subset of these documents (only 170 documents) and create an index, then the same search works. Note: to create both of the above indexes I used the same code.

Here is what I am doing to add a string to the index (note I am passing the document contents as a string):
void LuceneLib::AddStringToDoc(Document *doc, const char *fieldName, const char *str)
{
    wchar_t *wstr = charToWChar(fieldName);
    wchar_t *wstr2 = charToWChar(str);
    bool isHighlighted = false;
    bool isStoreCompressed = false;
    for (int i = 0; i < highlightedFields.size(); i++) {
        if (highlightedFields.at(i).compare(fieldName) == 0) {
            isHighlighted = true;
            break;
        }
    }
    for (int i = 0; i < compressedFields.size(); i++) {
        if (compressedFields.at(i).compare(fieldName) == 0) {
            isStoreCompressed = true;
            break;
        }
    }
    cout << "Field : " << fieldName << " ";
    int fieldConfig = Field::INDEX_TOKENIZED;
    if (isHighlighted == true) {
        fieldConfig = fieldConfig | Field::TERMVECTOR_WITH_POSITIONS_OFFSETS;
        cout << " Highlighted";
    }
    if (isStoreCompressed == true) {
        fieldConfig = fieldConfig | Field::STORE_COMPRESS;
        cout << " Store Compressed";
    } else {
        fieldConfig = fieldConfig | Field::STORE_NO;
        cout << " Do not store";
    }
    cout << " : " << fieldConfig << endl;
    Field *field = _CLNEW Field((const TCHAR *) wstr, (const TCHAR *) wstr2, fieldConfig);
    doc->add(*field);
    delete[] wstr;
    delete[] wstr2;
}

I checked the field config values and those are as below:

Field : docName  Do not store : 34
Field : docPath  Do not store : 34
Field : docContent  Highlighted  Store Compressed : 3620
Field : All  Do not store : 34

The field on which I am doing a query is docContent. Please let me know if I have missed anything.

Thanks,
Shailesh

-------------- next part --------------
An HTML attachment was scrubbed...

------------------------------

Message: 2
Date: Wed, 25 Mar 2015 13:51:15 +0100
From: norbert barichard <nor...@di...>
Subject: [CLucene-dev] Performing case insensitive searches ?
To: clu...@li...
Message-ID: <551...@di...>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hello,

Is there a way to tell CLucene to be case-insensitive when performing a search? It's a bit annoying that when I do a search, I don't get any results if I don't get all the upper-case letters right.

Thanks in advance!
------------------------------

Message: 3
Date: Wed, 1 Apr 2015 08:38:14 +1100
From: cel tix44 <cel...@gm...>
Subject: Re: [CLucene-dev] Performing case insensitive searches ?
To: clu...@li...
Message-ID: <CAA...@ma...>
Content-Type: text/plain; charset="utf-8"

Norbert,

I guess you need to check the analyzer you're using to create your indexes, as well as the analyzer you use for searches. You probably need to use an analyzer (both for indexing and searching) that uses LowerCaseFilter. Off the top of my head, check whether StandardAnalyzer (both for indexing and searching) does what you want.

To get a better explanation, google for: lucene case insensitive search. From what you'll find for Java Lucene, you'll get an idea of the way to go.

To inspect the contents of your index, you can use Luke (google for: luke lucene) -- you'll see straight away if your index has case-sensitive terms.

Regards
Celto

On Wed, Mar 25, 2015 at 11:51 PM, norbert barichard <nor...@di...> wrote:

[...]

-------------- next part --------------
An HTML attachment was scrubbed...
------------------------------

Message: 4
Date: Wed, 14 Oct 2015 02:27:56 +0530
From: Akash <akb...@gm...>
Subject: [CLucene-dev] Indexing fails with .. FIELDS_INDEX_EXTENSION).c_str() )' failed
To: clu...@li...
Message-ID: <8e9...@ma...>
Content-Type: text/plain; charset=US-ASCII; format=flowed

Hi,

I am using Dovecot with its CLucene plugin for indexing. I am hitting an error while trying to index a large folder of emails. Sometimes it throws this error after 30000 emails, sometimes 40000; the latest run gave up after 111000. But it just never completes. On the Dovecot list, I was told that it's probably a CLucene library bug which they can't do much about, and I was advised to switch to Solr (which I don't want to). Can there be a fix for this:

111000/322080
doveadm: /home/stephan/packages/wheezy/i386/clucene-core-2.3.3.4/src/core/CLucene/index/DocumentsWriter.cpp:210: std::string lucene::index::DocumentsWriter::closeDocStore(): Assertion `numDocsInStore*8 == directory->fileLength( (docStoreSegment + "." + IndexFileNames::FIELDS_INDEX_EXTENSION).c_str() )' failed.
Aborted

I am using dovecot 2:2.2.19-1~auto+7 and libclucene-core1:i386 2.3.3.4-4 from debian wheezy backports. Please advise.

-Akash

------------------------------

Message: 5
Date: Tue, 10 Nov 2015 10:37:27 +0000
From: Abhay Rawat <abh...@ho...>
Subject: [CLucene-dev] 'More Like This' feature in clucene
To: "clu...@li..." <clu...@li...>
Message-ID: <BF9...@HC...>
Content-Type: text/plain; charset="us-ascii"

Hello,

Currently Java Lucene has a feature called "More Like This", which is used to find representative terms of a document that can then be used to search for similar documents. I looked in the latest CLucene code but could not find this functionality.

Is it there in CLucene? If not, are there any plans to include it? Or if someone has done some work on this or a similar area, it would be great to hear from them.
Thanks Abhay ________________________________ **************************************** IMPORTANT INFORMATION The information contained in this email or any of its attachments is confidential and is intended for the exclusive use of the individual or entity to whom it is addressed. It may not be disclosed to, copied, distributed or used by anyone else without our express permission. If you receive this communication in error please advise the sender immediately and delete it from your systems. This email is not intended to and does not create legally binding commitments or obligations on behalf of Hornbill Service Management Limited which may only be created by hard copy writing signed by a director or other authorized officer. Any opinions, conclusions and other information in this message that do not relate to the official business of Hornbill Service Management Limited are unauthorized and neither given nor endorsed by it. Although Anti-Virus measures are used by Hornbill Service Management Limited it is the responsibility of the addressee to scan this email and any attachments for computer viruses or other defects. Hornbill Service Management Limited does not accept any liability for any loss or damage of any nature, however caused, which may result directly or indirectly from this email or any file attached. Hornbill Service Management Limited. Registered Office: Apollo, Odyssey Business Park, West End Road, Ruislip, HA4 6QD, United Kingdom. Registered in England Number: 3033585. **************************************** -------------- next part -------------- An HTML attachment was scrubbed... ------------------------------ ------------------------------------------------------------------------------ ------------------------------ _______________________________________________ CLucene-developers mailing list CLu...@li... 
https://lists.sourceforge.net/lists/listinfo/clucene-developers End of CLucene-developers Digest, Vol 90, Issue 1 ************************************************* |
From: cel t. <cel...@gm...> - 2015-11-12 07:21:54
|
Abhay,

If the functionality you need is not in Java Lucene 2.3.2, then it's not in CLucene either (as per the official site, CLucene 2.3.3.4 conforms with JLucene 2.3.2). I don't think there's any active work on CLucene. You might check Lucene++ (marked as v3.0.7) -- https://github.com/luceneplusplus/LucenePlusPlus But again, if JLucene 3.0.x doesn't have "More Like This", it won't be available in Lucene++ either.

Regards
Celto

On Tue, Nov 10, 2015 at 9:37 PM, Abhay Rawat <abh...@ho...> wrote:

[...] |