This patch modifies docutils.nodes.make_id() to create more readable IDs for some non-English languages. It does this by replacing accented characters with non-accented ones. I think it is much better than replacing them with "-".
The patch file
Logged In: YES
user_id=1264592
Originator: NO
Hi, I stumbled across this patch... I once did something similar, but I used the unicodedata module to do most of the replacing:
>>> import unicodedata
>>> u = u'áëîñ ß æøå'
>>> unicodedata.normalize('NFKD', u).encode('ASCII', 'ignore')
The normalize function splits combined accented letters into their base letter and the combining accent; converting the result to ASCII then strips away the accents on most letters. In the example above, the sharp s (ß) and the Danish æ and ø are removed completely, since they do not count as accented letters. So for best results one would still need a (small) replacement table for these cases.
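A minimal sketch of the combined approach in modern Python (the function name and table entries are illustrative, not the exact docutils code):

```python
import unicodedata

# Small fallback table for letters that NFKD cannot decompose, because
# they are base letters in their own right rather than accented forms.
# The exact entries and spellings here are assumptions for illustration.
FALLBACK = {
    ord('ß'): 'ss',
    ord('æ'): 'ae',
    ord('ø'): 'o',
}

def deaccent(text):
    """Strip accents via NFKD decomposition plus a small fallback table."""
    # Replace the non-decomposable letters first...
    text = text.translate(FALLBACK)
    # ...then split accented letters into base letter + combining accent
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with 'ignore' drops the combining accents.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(deaccent('áëîñ ß æøå'))  # -> aein ss aeoa
```

Note that å survives without a table entry: unlike ß, æ, and ø, it decomposes under NFKD into "a" plus a combining ring, so the ASCII step reduces it to a plain "a".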
Logged In: YES
user_id=147070
Originator: NO
File Added: make_id_deaccent_by_unicodedata
Patch using unicodedata.normalize
Logged In: YES
user_id=147070
Originator: NO
Using unicodedata.normalize, the translation dict can be reduced to 41 entries.
normalize does not exist in Python 2.2, but then string.translate also fails there with "1-n mappings are currently not implemented".
File Added: make_id_deaccent_by_unicodedata
Logged In: YES
user_id=7733
Originator: NO
Updated the patch:
* import unicodedata
* handle Python 2.2 gracefully (change requires Python 2.3+)
* remove the 'ij' ligature case (handled by unicodedata.normalize)
See discussion at http://thread.gmane.org/gmane.text.docutils.devel/4354
File Added: make_id_deaccent_by_unicodedata.patch
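The graceful Python 2.2 handling presumably amounts to a feature check, since the unicodedata module itself exists in 2.2 but its normalize function does not. A hedged sketch of that idea (the function name is illustrative, not taken from the patch):

```python
import unicodedata

def decompose(text):
    """NFKD-decompose text when unicodedata.normalize is available
    (Python 2.3+); on older interpreters, return the text unchanged
    so callers degrade gracefully instead of raising AttributeError."""
    if hasattr(unicodedata, 'normalize'):
        return unicodedata.normalize('NFKD', text)
    return text
```

On any modern interpreter the first branch is taken, so 'é' comes back as 'e' followed by a combining acute accent.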
Thank you for your contribution! It has been checked in to the
Docutils repository.
You can download the most current snapshot from:
http://docutils.sourceforge.net/docutils-snapshot.tgz