#42 make_id(): deaccent characters

closed-accepted
None
5
2008-09-28
2008-01-24
András Mohari
No

This patch modifies docutils.nodes.make_id() to create more readable IDs for some non-English languages. It does this by replacing accented characters with non-accented ones. I think it is much better than replacing them with "-".

Discussion

  • András Mohari
    2008-01-24

    The patch file

     
  • Martin Geisler
    2008-02-03

    Logged In: YES
    user_id=1264592
    Originator: NO

    Hi, I stumbled across this patch... I have done something similar once, but there I used the unicodedata module to do most of the replacing:

    >>> import unicodedata
    >>> u = u'áëîñ ß æøå'
    >>> unicodedata.normalize('NFKD', u).encode('ASCII', 'ignore')
    'aein  a'

    The normalize function splits the combined accented letters into their base letter and the combining accent. Converting this to ASCII strips away the accents nicely on most letters. In the above example the sharp s (ß) and the Danish æ and ø are removed completely since they don't count as accented letters. So for best results one would still need a (small) replacement table for these cases.
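A minimal sketch of the approach described above, written for Python 3 (the session above is Python 2). The function name `deaccent` and the contents of the `EXTRA` table are illustrative assumptions, not taken from the patch.

```python
import unicodedata

# Hypothetical replacement table for letters that NFKD does not decompose.
# These entries are assumptions for illustration, not the patch's table.
EXTRA = {
    "ß": "ss",
    "æ": "ae",
    "ø": "o",
}


def deaccent(text):
    """Return text with accents stripped and non-ASCII leftovers dropped."""
    # Apply the small table first, so these letters are not lost below.
    for char, replacement in EXTRA.items():
        text = text.replace(char, replacement)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with 'ignore' drops the combining marks
    # (and any other character that has no ASCII equivalent).
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With the table in place, `deaccent("áëîñ ß æøå")` keeps a stand-in for every letter instead of silently dropping ß, æ and ø.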

     
    • assigned_to: nobody --> grubert
     
  • Logged In: YES
    user_id=147070
    Originator: NO

    File Added: make_id_deaccent_by_unicodedata

     
  • Logged In: YES
    user_id=147070
    Originator: NO

    Using unicodedata.normalize, the translation dict can be reduced to 41 entries.
    normalize does not exist in Python 2.2, but string.translate is limited there as well:
    "1-n mappings are currently not implemented" in Python 2.2.
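To illustrate what a 1-n mapping in a translation table looks like, here is a sketch using Python 3's `str.translate`, which accepts multi-character replacement strings. The table entries and the name `apply_table` are illustrative assumptions; the patch's actual 41-entry dict is not reproduced here.

```python
# Hypothetical translation table: code point -> replacement string.
# Every entry except ø is a 1-n mapping (one character becomes two).
TABLE = {
    0x00DF: "ss",  # ß, sharp s
    0x00E6: "ae",  # æ
    0x00F8: "o",   # ø
    0x0153: "oe",  # œ
}


def apply_table(text):
    """Replace each character found in TABLE; leave all others untouched."""
    return text.translate(TABLE)
```

Characters that are not keys in the table pass through unchanged, so a table like this can be combined with a separate normalization step for the accented letters.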

    File Added: make_id_deaccent_by_unicodedata

     
  • David Goodger
    2008-09-04

    Logged In: YES
    user_id=7733
    Originator: NO

    Updated the patch:

    * import unicodedata
    * handle Python 2.2 gracefully (change requires Python 2.3+)
    * removed 'ij' ligature case (handled by unicodedata.normalize)

    See discussion at http://thread.gmane.org/gmane.text.docutils.devel/4354
    File Added: make_id_deaccent_by_unicodedata.patch
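One way to degrade gracefully when `unicodedata.normalize` is unavailable is to detect the function once and skip de-accenting if it is missing. This is only a sketch of that idea in Python 3 syntax, not the code from the patch; the names `strip_accents` and `normalizer` are assumptions.

```python
import unicodedata

# Feature detection: None when the running Python lacks normalize().
normalize = getattr(unicodedata, "normalize", None)


def strip_accents(text, normalizer=normalize):
    """Drop combining marks if normalization is available, else return text as is."""
    if normalizer is None:
        # No normalize(): leave the text untouched rather than fail.
        return text
    decomposed = normalizer("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

Passing `normalizer=None` explicitly exercises the fallback path on interpreters where `normalize` does exist.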

     
    • status: open --> closed-accepted