make_id(): deaccent characters

#42 make_id(): deaccent characters

Status: closed-accepted

Owner: engelbert gruber

Labels: None

Priority: 5

Updated: 2008-09-28

Created: 2008-01-24

Creator: András Mohari

Private: No

This patch modifies docutils.nodes.make_id() to create more readable IDs for some non-English languages. It does this by replacing accented characters with non-accented ones. I think it is much better than replacing them with "-".

Discussion

András Mohari - 2008-01-24

The patch file

make-id-deaccent.patch

Martin Geisler - 2008-02-03

Hi, I stumbled across this patch... I have done something similar once, but there I used the unicodedata module to do most of the replacing:

>>> import unicodedata
>>> u = u'áëîñ ß æøå'
>>> unicodedata.normalize('NFKD', u).encode('ASCII', 'ignore')

The normalize function splits the combined accented letters into their base letter and the combining accent. Converting this to ASCII strips away the accents nicely on most letters. In the above example the sharp s (ß) and the Danish æ and ø are removed completely since they don't count as accented letters. So for best results one would still need a (small) replacement table for these cases.
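
A minimal sketch of the approach described above, written for Python 3 (where the encoded bytes have to be decoded back to a string). The helper name `deaccent` and the contents of `EXTRA_REPLACEMENTS` are assumptions made for illustration; they are not the table from the patch.

```python
import unicodedata

# Hypothetical replacement table for letters that NFKD does not decompose.
EXTRA_REPLACEMENTS = {
    ord('ß'): 'ss',
    ord('æ'): 'ae',
    ord('ø'): 'o',
}

def deaccent(text):
    """Return an ASCII approximation of text."""
    # Replace the letters that have no decomposition first.
    text = text.translate(EXTRA_REPLACEMENTS)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Dropping non-ASCII code points discards the combining marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(deaccent('áëîñ ß æøå'))  # prints: aein ss aeoa
```

Without the table, the same input yields only `aein  a`, which is the loss of ß, æ and ø mentioned above.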

engelbert gruber - 2008-08-25

assigned_to: nobody --> grubert

engelbert gruber - 2008-09-04

File Added: make_id_deaccent_by_unicodedata

engelbert gruber - 2008-09-04

patch using unicodedata normalize

make_id_deaccent_by_unicodedata

engelbert gruber - 2008-09-04

Using unicodedata.normalize, the translation dict can be reduced to 41 entries.
normalize does not exist in Python 2.2, but string.translate is limited there as well:
"1-n mappings are currently not implemented" in Python 2.2.

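For comparison, 1-n mappings (one character replaced by several) are supported by the translate method of strings in current Python 3. The sketch below shows the idea; the table entries are illustrative assumptions, not the 41 entries from the patch.

```python
# Illustrative table only: keys are code points, values are replacement strings.
table = {
    0x00DF: 'ss',  # ß becomes two characters (a 1-n mapping)
    0x00E6: 'ae',  # æ becomes two characters (a 1-n mapping)
    0x00F8: 'o',   # ø becomes one character (a 1-1 mapping)
}

result = 'straße, æble, ø'.translate(table)
print(result)  # prints: strasse, aeble, o
```
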
File Added: make_id_deaccent_by_unicodedata

David Goodger - 2008-09-04

make_id_deaccent_by_unicodedata.patch

David Goodger - 2008-09-04

Updated the patch:

* import unicodedata
* handle Python 2.2 gracefully (change requires Python 2.3+)
* removed 'ij' ligature case (handled by unicodedata.normalize)
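
One common way to handle a missing unicodedata.normalize gracefully is to guard the import and skip normalization when it is unavailable. The following is a sketch of that general pattern, not the code from the patch; `normalize_if_available` is a hypothetical helper name.

```python
try:
    from unicodedata import normalize
except ImportError:
    # Assumed fallback: this interpreter has no normalization support.
    normalize = None

def normalize_if_available(text):
    """Apply NFKD decomposition when supported, else return text unchanged."""
    if normalize is None:
        return text
    return normalize('NFKD', text)

# The 'ij' ligature (U+0133) decomposes to the two letters "ij" under NFKD.
print(normalize_if_available('\u0133'))  # prints: ij
```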

See discussion at http://thread.gmane.org/gmane.text.docutils.devel/4354
File Added: make_id_deaccent_by_unicodedata.patch

engelbert gruber - 2008-09-28

Thank you for your contribution! It has been checked in to the
Docutils repository.

You can download the most current snapshot from:
http://docutils.sourceforge.net/docutils-snapshot.tgz

engelbert gruber - 2008-09-28

status: open --> closed-accepted