From: Hanno S. <ha...@ha...> - 2011-07-31 09:01:54
On Sun, Jul 31, 2011 at 5:20 AM, Antonio Carrasco Valero on gmail
<car...@gm...> wrote:
> The biggest hassle is to put together the mappings of "similar" characters
> I.e., all the following unicodes could match for each other:

Not really. We already have and ship such a list and use it as part of
plone.i18n. It depends on the http://pypi.python.org/pypi/Unidecode
library, which has a pretty comprehensive list and maps about 46000
characters from the entire Unicode range.

So all we need to do is:

    from plone.i18n.normalizer.base import baseNormalize
    ascii = baseNormalize('some text')

The baseNormalize function only uses the Unidecode mappings with some
upper limit - as the phonetic mappings for Asian languages aren't good
enough. This makes sense for this use-case as well, as Asian languages
need different approaches for search anyways, like not doing
whitespace delimited splitting.

But thanks for the pointer :)

Hanno
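
For readers without plone.i18n at hand, the idea of "a character mapping applied only below some upper limit" can be sketched with the standard library alone. This is not the plone.i18n or Unidecode implementation: `normalize_sketch`, `MAPPING` and `UPPER_LIMIT` are hypothetical names, the four-entry table stands in for Unidecode's much larger one, the cut-off value is an assumption (the thread does not state the real limit), and the NFKD fallback is a swapped-in technique for characters missing from the tiny table.

```python
import unicodedata

# Hypothetical cut-off: characters at or above this code point are left
# untouched. The real limit used by plone.i18n is not given in this thread.
UPPER_LIMIT = 0x3000

# Tiny illustrative table; Unidecode ships a far larger one.
MAPPING = {
    0x00DF: "ss",  # LATIN SMALL LETTER SHARP S
    0x00E6: "ae",  # LATIN SMALL LETTER AE
    0x00F8: "o",   # LATIN SMALL LETTER O WITH STROKE
    0x0142: "l",   # LATIN SMALL LETTER L WITH STROKE
}


def normalize_sketch(text):
    """Map characters below UPPER_LIMIT towards ASCII; keep the rest as-is."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128 or cp >= UPPER_LIMIT:
            # Plain ASCII, or above the limit: pass through unchanged.
            out.append(ch)
        elif cp in MAPPING:
            out.append(MAPPING[cp])
        else:
            # Fallback (assumption, not Unidecode's approach): decompose
            # and drop combining marks. Characters without a decomposition
            # stay as they are.
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)
```

Under these assumptions, text in scripts above the limit comes back unchanged, which leaves it free to be handled by a separate, language-specific search strategy.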