From: Jos v. d. O. <jvd...@gm...> - 2007-02-12 21:59:12
|
Hi guys,

According to [1], the strings in Lucene 2.1 indices are in the "modified UTF-8 encoding" format. I'm a bit surprised by this, because it means that CLucene, in the most common use case, transforms utf8 to ucs2 to modified-utf8. This seems rather wasteful to me. Is there a reason for it?

The reason I looked into it is that Strigi spends 90% of its indexing time in CLucene code. So harvesting any low-hanging fruit in CLucene would mean significantly faster indexing.

Cheers, Jos |
From: Jos v. d. O. <jvd...@gm...> - 2007-02-12 22:37:23
|
[1] http://lucene.apache.org/java/docs/fileformats.html
[2] http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
[3] http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

Apparently, in Lucene 1.4.3 the text is stored in normal UTF-8 and in 2.1 it is stored in "modified UTF-8" [2,3]. The differences between this format and the standard UTF-8 format are the following:

The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
Only the 1-byte, 2-byte, and 3-byte formats are used.
Supplementary characters are represented in the form of surrogate pairs. |
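The three differences listed above can be sketched in a few lines of Python. This is only an illustration of the byte layout described in [3], not CLucene or Lucene code; the function names `utf16_code_units` and `encode_modified_utf8` are made up for this example.

```python
def utf16_code_units(text):
    """Yield UTF-16 code units, splitting supplementary characters
    (code points >= 0x10000) into a high/low surrogate pair."""
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            yield 0xD800 + (cp >> 10)    # high surrogate
            yield 0xDC00 + (cp & 0x3FF)  # low surrogate
        else:
            yield cp


def encode_modified_utf8(text):
    """Encode text as modified UTF-8 (hypothetical helper, see lead-in)."""
    out = bytearray()
    for unit in utf16_code_units(text):
        if 0x0001 <= unit <= 0x007F:
            # 1-byte form; note that 0x0000 is deliberately excluded
            out.append(unit)
        elif unit <= 0x07FF:
            # 2-byte form; also used for 0x0000, which becomes C0 80
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # 3-byte form; each surrogate is encoded on its own
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)
```

For text without null characters or supplementary characters the result is byte-for-byte identical to standard UTF-8; only those two cases differ.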
From: Ben v. K. <bva...@gm...> - 2007-02-13 10:30:26
|
1.4.3 did have modified-utf8 also; I think it was only identified as non-utf8 after 1.4.3 was released. Someone tried to have it changed, but nothing ever happened (except to change the documentation :).

If clucene was compiled in ascii mode, it would be feasible to make a few changes to accept just utf8. Changes would have to be made in the input/output streams and analysers for this to work. Changing real utf8 to modified utf8 wouldn't be that hard, I think, and would probably be more efficient than running the entire thing in unicode.

cheers
ben |
From: Jos v. d. O. <jvd...@gm...> - 2007-02-13 17:30:43
|
2007/2/13, Ben van Klinken <bva...@gm...>:
> changing real utf to modified utf8 wouldn't be that hard, i think - and would
> probably be more efficient than running the entire thing in unicode.

I'm not sure about that. Conversion of utf8 to modified utf8 or vice versa is not so trivial. Modified utf8 is basically utf16 encoded into utf8. The idea would be to use normal utf8 internally and also read and write that to the index. That would not be compatible with the lucene file format, but it would be much faster, easier and less error prone.

When you say "more efficient than running the entire thing in unicode", do you mean running in utf16 or utf8?

Cheers, Jos |
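To make the "utf16 encoded into utf8" point concrete, here is a rough Python sketch of the reverse conversion, from modified UTF-8 back to standard UTF-8. It assumes well-formed input and does no error checking; `modified_utf8_to_utf8` is a name invented for this example and is not part of CLucene or Lucene.

```python
def modified_utf8_to_utf8(data: bytes) -> bytes:
    """Transcode modified UTF-8 to standard UTF-8 (illustrative sketch,
    assumes well-formed input)."""
    # Step 1: decode the 1-, 2- and 3-byte sequences into UTF-16 code units.
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            units.append(b)
            i += 1
        elif b < 0xE0:
            # 2-byte form; C0 80 decodes to 0x0000 here
            units.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            units.append(((b & 0x0F) << 12)
                         | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3

    # Step 2: merge surrogate pairs back into single code points.
    chars = []
    j = 0
    while j < len(units):
        u = units[j]
        if (0xD800 <= u <= 0xDBFF and j + 1 < len(units)
                and 0xDC00 <= units[j + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[j + 1] - 0xDC00)
            chars.append(chr(cp))
            j += 2
        else:
            chars.append(chr(u))
            j += 1

    # Step 3: re-encode as standard UTF-8. An unpaired surrogate would make
    # this raise UnicodeEncodeError, since it has no valid UTF-8 form.
    return "".join(chars).encode("utf-8")
```

The extra pass over surrogate pairs is the part that a plain byte copy cannot do, which is why the conversion is more than a trivial filter.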
From: Ben v. K. <bva...@gm...> - 2007-02-14 11:44:19
|
Hi,

I don't know if it is a good idea to branch the compatibility of the index. I was talking to Thomas Busch last night and we further discussed this, and the main issue will be in the analyser. We'll lose performance with things like strupr and strlwr, so that may end up being less efficient... Although, something I didn't think of last night is that the penalty for doing these scans will mostly be at index time, which is not as critical for speed as the search side. But basically the analysers would have to be completely rewritten.

So, given we were to go down this path... If we allow utf8 internally and make the index incompatible, the index will have to be overtly incompatible, i.e. the index must be completely incompatible so that there can be no mistake about running it from jlucene and getting strange errors.

I believe the streams have flags, so we can set a version flag so that jlucene cannot load a modified clucene index. At the same time, we can look at writing fixed-length VInts - which would mean writing all VInts backwards, which means we can write content more efficiently to the index. I think we've discussed this on the list before.

So IF we did make this change, we'd also have a packaging problem - we'd have to have 2 different packages floating around - a utf8 and a ucs compilation. I'm not sure how wise that would be...

Comments anyone?

ben |
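For readers unfamiliar with VInts: in the Lucene file format a VInt stores a non-negative integer in a variable number of bytes, seven bits per byte with the low-order bits first, and the high bit of each byte signals that more bytes follow. The sketch below shows only that standard encoding. It does not attempt the fixed-length variant mentioned above, since the thread does not spell out its details. The function names `write_vint` and `read_vint` are illustrative, not CLucene API.

```python
def write_vint(value: int) -> bytes:
    """Encode a non-negative integer as a Lucene-style VInt."""
    if value < 0:
        raise ValueError("VInt cannot encode negative values")
    out = bytearray()
    while value > 0x7F:
        # low 7 bits, with the continuation bit set
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)


def read_vint(data: bytes, pos: int = 0):
    """Decode a VInt starting at pos; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return result, pos
        shift += 7
```

Because the length of a VInt depends on its value, a writer cannot reserve space for a length prefix before it knows the final value, which is the kind of streaming problem a fixed-width form would sidestep.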
From: Rene R. <ren...@gm...> - 2007-02-14 12:01:18
|
I'm all for the UTF-8 ;) UTF-32 is just a memory hog. And I believe that fixed length VInt's were also the solution to the streaming problem of stored keyword terms :) |
From: Jos v. d. O. <jvd...@gm...> - 2007-02-14 18:24:48
|
I agree that it is a large change. The main efficiency gain would be in indexing, I think. Maybe toupper and tolower will be more expensive (I doubt it, actually), but you avoid two encoding conversions. Also, the index will shrink for non-western languages that use characters >= 0x10000, since those need 4 bytes in utf8 and 6 in modified utf8.

Nevertheless, having a separate implementation is a pain and it's really only worth it if the JLucene guys play along. Java support for real utf8 should be good by now, since that's what the java programs on web servers talk all day.

Cheers, Jos |
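The size argument can be checked with a small Python sketch that counts bytes per code point under the modified UTF-8 rules and compares the total with standard UTF-8. `modified_utf8_length` is a hypothetical helper written for this illustration only.

```python
def modified_utf8_length(text: str) -> int:
    """Number of bytes text would occupy in modified UTF-8."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            total += 2   # null is stored as C0 80
        elif cp <= 0x7F:
            total += 1
        elif cp <= 0x7FF:
            total += 2
        elif cp <= 0xFFFF:
            total += 3
        else:
            total += 6   # two surrogates, 3 bytes each
    return total


def standard_utf8_length(text: str) -> int:
    """Number of bytes text occupies in standard UTF-8."""
    return len(text.encode("utf-8"))
```

Under these rules the saving only applies to code points at or above 0x10000. Text that stays within the Basic Multilingual Plane, which includes most everyday CJK text, takes 3 bytes per character in both encodings.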
From: Ben v. K. <bva...@gm...> - 2007-02-15 08:54:17
|
yep, i agree about the jlucene thing. the static vint thing would be nice also

ben |
From: Jos v. d. O. <jvd...@gm...> - 2007-02-15 11:29:31
|
can you point me to the discussion about static vints? I'd like to read up on it.

Cheers, Jos |
From: Ben v. K. <bva...@gm...> - 2007-02-15 12:53:22
|
it started off on java-dev mailing list as 'VInt's as prefix. Was: bytecount as prefix'

ben |