From: SourceForge.net <no...@so...> - 2005-09-26 03:35:34
Bugs item #1304447, was opened at 2005-09-26 03:35
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=558446&aid=1304447&group_id=80013

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: himanshu (cryptomaniac)
Assigned to: Nobody/Anonymous (nobody)
Summary: StandardTokenizer.cpp: unicode support broken ...

Initial Comment:
StandardTokenizer.cpp, version 0.9.8:

The character classification functions "_istalpha" and "_istalnum" do not seem to work properly for some languages. Is the Unicode library up to date with the Unicode character set?

I tested CLucene by indexing Hindi documents. The StandardTokenizer cannot tokenize a Hindi stream because "_istalpha" and "_istalnum" return wrong values for some characters.

I don't know about other languages, but Unicode support appears to be broken, since the tokenizer does not work properly on this input.

----------------------------------------------------------------------
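
The report does not say which characters were misclassified. One plausible cause, offered here only as an assumption, is that Devanagari vowel signs and the virama are combining marks rather than letters, so a classifier that accepts only letter categories would split a Hindi word in the middle. The sketch below illustrates a possible workaround under that assumption. The names isDevanagariWordChar, isWordChar, and tokenize are hypothetical and are not part of CLucene; std::iswalnum stands in for "_istalnum".

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical helper (not CLucene API): treat every code point in the
// Devanagari block U+0900..U+097F as part of a word, except the danda,
// double danda, and abbreviation sign, which are punctuation.
bool isDevanagariWordChar(wchar_t c) {
    if (c < 0x0900 || c > 0x097F) return false;
    return c != 0x0964 && c != 0x0965 && c != 0x0970;
}

// Hypothetical replacement for a bare alnum check: accept whatever the
// C library accepts, plus the Devanagari range handled above.
bool isWordChar(wchar_t c) {
    return std::iswalnum(static_cast<std::wint_t>(c)) != 0
        || isDevanagariWordChar(c);
}

// Minimal tokenizer sketch: split the input into maximal runs of word
// characters. This is far simpler than StandardTokenizer and only shows
// the effect of the classification change.
std::vector<std::wstring> tokenize(const std::wstring& text) {
    std::vector<std::wstring> tokens;
    std::wstring current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (isWordChar(text[i])) {
            current.push_back(text[i]);
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

With this check, a word such as L"\u0939\u093F\u0928\u094D\u0926\u0940" stays in one token even though it contains vowel signs and a virama. A range check of this kind covers only one script; a general fix would need classification tables that follow the Unicode character database.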