[CLucene-dev] [ clucene-Bugs-1304447 ] StandardTokenizer.cpp: unicode support broken ...

Bugs item #1304447, was opened at 2005-09-26 03:35
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=558446&aid=1304447&group_id=80013

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: himanshu (cryptomaniac)
Assigned to: Nobody/Anonymous (nobody)
Summary: StandardTokenizer.cpp: unicode support broken ...

Initial Comment:
StandardTokenizer.cpp, version 0.9.8:

It seems that the standard Unicode functions 
"_istalpha" and "_istalnum" do not work properly 
for some languages. 

Is the Unicode library up to date with the Unicode character set?

I tested CLucene by indexing Hindi documents. The 
StandardTokenizer is not able to tokenize a Hindi stream 
because "_istalpha" and "_istalnum" return wrong 
values for some characters.

I don't know about other languages, but Unicode 
support seems to be definitely broken, because the 
tokenizer does not work properly.
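
To make the symptom concrete, here is a minimal C++ sketch of a word-character 
test that keeps a Hindi word together. It rests on an assumption that this 
report does not confirm: that the failing characters are the Devanagari 
dependent vowel signs and the virama, which Unicode classifies as combining 
marks rather than letters. The names isDevanagariWordChar, isWordChar and 
splitWords are hypothetical and are not part of CLucene.

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical helper, not part of CLucene: true for code points in the
// Devanagari block (U+0900..U+097F) that can appear inside a word. This
// covers letters, digits, dependent vowel signs and the virama, and
// excludes the danda punctuation (U+0964, U+0965) and the abbreviation
// sign (U+0970).
static bool isDevanagariWordChar(wchar_t c) {
    if (c < 0x0900 || c > 0x097F) return false;
    return c != 0x0964 && c != 0x0965 && c != 0x0970;
}

// Hypothetical word-character test: accept Devanagari word characters
// directly and defer to the C library for everything else. The result of
// iswalnum depends on the active locale.
static bool isWordChar(wchar_t c) {
    return isDevanagariWordChar(c) ||
           std::iswalnum(static_cast<std::wint_t>(c)) != 0;
}

// Split text into runs of word characters (a deliberately simplified
// stand-in for a tokenizer loop).
static std::vector<std::wstring> splitWords(const std::wstring& text) {
    std::vector<std::wstring> words;
    std::wstring current;
    for (wchar_t c : text) {
        if (isWordChar(c)) {
            current.push_back(c);
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

With this test, a word such as L"\u0939\u093F\u0928\u094D\u0926\u0940" comes 
back from splitWords as a single six-character token. If the character test 
rejected the vowel signs and the virama, the same word would be cut at each 
of them.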

----------------------------------------------------------------------
