Thread: [CLucene-dev] utf8 in lucene files


Hi guys,

According to [1], the strings in Lucene 2.1 indices are in the
"modified UTF-8 encoding" format. I'm a bit surprised by this, because
it means that CLucene, in the most common use case, transforms UTF-8 to
UCS-2 to modified UTF-8. This seems rather wasteful to me. Is there a
reason for it?
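
To make the round trip concrete, here is a minimal sketch of the two conversions. The function names are hypothetical and are not CLucene API, and it assumes well-formed input limited to the Basic Multilingual Plane. The encoder follows the Java definition of modified UTF-8, where U+0000 is written as the two bytes C0 80.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not CLucene API): decode standard UTF-8 into UCS-2
// code units. Assumes well-formed input with sequences of 1 to 3 bytes
// (Basic Multilingual Plane only); no validation is performed.
std::vector<uint16_t> utf8ToUcs2(const std::string& in) {
    auto byteAt = [&in](std::size_t i) {
        return static_cast<unsigned>(static_cast<unsigned char>(in[i]));
    };
    std::vector<uint16_t> out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned b = byteAt(i);
        if (b < 0x80) {                       // 1-byte sequence (ASCII)
            out.push_back(static_cast<uint16_t>(b));
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {      // 2-byte sequence
            out.push_back(static_cast<uint16_t>(
                ((b & 0x1F) << 6) | (byteAt(i + 1) & 0x3F)));
            i += 2;
        } else {                              // 3-byte sequence (assumed)
            out.push_back(static_cast<uint16_t>(
                ((b & 0x0F) << 12) | ((byteAt(i + 1) & 0x3F) << 6) |
                (byteAt(i + 2) & 0x3F)));
            i += 3;
        }
    }
    return out;
}

// Hypothetical helper (not CLucene API): encode UCS-2 code units as
// Java-style modified UTF-8. The only difference from standard UTF-8 for
// BMP text is that U+0000 becomes the two bytes C0 80.
std::string ucs2ToModifiedUtf8(const std::vector<uint16_t>& in) {
    std::string out;
    for (uint16_t c : in) {
        if (c >= 0x0001 && c <= 0x007F) {     // plain ASCII, 1 byte
            out.push_back(static_cast<char>(c));
        } else if (c <= 0x07FF) {             // 2 bytes, includes U+0000
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {                              // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Small self-check: BMP text without NUL comes back byte for byte.
void checkRoundTrip() {
    const std::string text = "h\xC3\xA9";     // "h" followed by U+00E9
    assert(ucs2ToModifiedUtf8(utf8ToUcs2(text)) == text);
}
```

Under these assumptions, text without embedded NUL characters comes out of the second step byte for byte identical to what went into the first, which is why the intermediate UCS-2 step looks like avoidable work.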

The reason I looked into it was that Strigi spends 90% of its indexing
time in CLucene code. So harvesting any low-hanging fruit in CLucene
would mean significantly faster indexing.

Cheers,
Jos

