
#42 make_id(): deaccent characters

closed-accepted
None
5
2008-09-28
2008-01-24
No

This patch modifies docutils.nodes.make_id() to create more readable IDs for some non-English languages. It does this by replacing accented characters with non-accented ones. I think it is much better than replacing them with "-".
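
As a rough illustration of the difference, here is a minimal sketch. It is not the actual docutils.nodes.make_id() code, and the helper names hyphen_id and deaccented_id are made up for this example. It assumes Python 3.

```python
import re
import unicodedata

# Anything that is not a lowercase ASCII letter or digit.
_NON_ID = re.compile(r'[^a-z0-9]+')

def hyphen_id(text):
    """Hypothetical helper: replace each run of other characters with '-'."""
    return _NON_ID.sub('-', text.lower()).strip('-')

def deaccented_id(text):
    """Hypothetical helper: strip accents first, then hyphenate the rest."""
    decomposed = unicodedata.normalize('NFKD', text.lower())
    ascii_text = decomposed.encode('ascii', 'ignore').decode('ascii')
    return _NON_ID.sub('-', ascii_text).strip('-')

print(hyphen_id('András Mohari'))      # andr-s-mohari
print(deaccented_id('András Mohari'))  # andras-mohari
```

The second form keeps the base letter, so the ID stays recognizable as the original word.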

Discussion

  • András Mohari

    András Mohari - 2008-01-24

    The patch file

     
  • Martin Geisler

    Martin Geisler - 2008-02-03

    Logged In: YES
    user_id=1264592
    Originator: NO

    Hi, I stumbled across this patch... I did something similar once, but there I used the unicodedata module to do most of the replacing:

    >>> import unicodedata
    >>> u = u'áëîñ ß æøå'
    >>> unicodedata.normalize('NFKD', u).encode('ASCII', 'ignore')

    The normalize function splits each combined accented letter into its base letter and the combining accent. Encoding the result as ASCII with errors ignored then drops the accents and keeps the base letters, which works nicely for most letters. In the above example the sharp s (ß) and the Danish æ and ø are removed completely, since they are not decomposed into a base letter plus an accent. So for best results one would still need a (small) replacement table for these cases.
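
    The behaviour described above can be checked with a short sketch. It assumes Python 3, where the encoded result has to be decoded back to str, and strip_accents is a made-up helper name, not part of any patch in this ticket.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: NFKD-decompose, then drop all non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

sample = 'áëîñ ß æøå'

# 'á' decomposes into the base letter plus a combining accent.
first_two = unicodedata.normalize('NFKD', 'á')
print([unicodedata.name(c) for c in first_two])
# ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']

# Accents are stripped; ß, æ and ø disappear, only the spaces remain.
print(repr(strip_accents(sample)))  # 'aein  a'

# These letters are unchanged by NFKD, so they need a replacement table.
print([c for c in 'ßæø' if unicodedata.normalize('NFKD', c) == c])
```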

     
  • engelbert gruber

    • assigned_to: nobody --> grubert
     
  • engelbert gruber

    Logged In: YES
    user_id=147070
    Originator: NO

    File Added: make_id_deaccent_by_unicodedata

     
  • engelbert gruber

    patch using unicodedata normalize

     
  • engelbert gruber

    Logged In: YES
    user_id=147070
    Originator: NO

    Using unicodedata.normalize, the translation dict can be reduced to 41 entries.
    normalize does not exist in Python 2.2, but string.translate is limited there as well:
    "1-n mappings are currently not implemented" in Python 2.2.
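
    For illustration, a small translation dict combined with normalize might look like the sketch below. It assumes Python 3, where str.translate accepts 1-n mappings. The FALLBACK table and the deaccent name are invented for this example; they are not the 41-entry dict from the patch.

```python
import unicodedata

# Illustrative table only: code point -> replacement string.
# Covers letters that NFKD normalization leaves untouched.
FALLBACK = {
    0x00DF: 'sz',  # ß, a 1-n mapping
    0x00E6: 'ae',  # æ, a 1-n mapping
    0x00F8: 'o',   # ø
}

def deaccent(text):
    """Hypothetical helper: apply the table, then strip accents."""
    translated = text.translate(FALLBACK)
    decomposed = unicodedata.normalize('NFKD', translated)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(deaccent('áëîñ ß æøå'))  # aein sz aeoa
```

    The table only needs entries for letters that normalize cannot decompose, which is why it can stay small.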

    File Added: make_id_deaccent_by_unicodedata

     
  • David Goodger

    David Goodger - 2008-09-04
     
  • David Goodger

    David Goodger - 2008-09-04

    Logged In: YES
    user_id=7733
    Originator: NO

    Updated the patch:

    * import unicodedata
    * handle Python 2.2 gracefully (change requires Python 2.3+)
    * removed 'ij' ligature case (handled by unicodedata.normalize)
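
    One possible way to degrade gracefully when normalize is unavailable is sketched below. This is an assumption about the general approach, not the code from the updated patch; the deaccent name is hypothetical and the example is written for Python 3.

```python
try:
    # Missing in very old interpreters such as Python 2.2.
    from unicodedata import normalize
except ImportError:
    normalize = None

def deaccent(text):
    """Hypothetical helper: strip accents if possible, else return text as is."""
    if normalize is None:
        # No deaccenting available; fall back to the old behaviour.
        return text
    decomposed = normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(deaccent('ĳsselmeer'))  # the 'ĳ' ligature decomposes to 'ij'
```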

    See discussion at http://thread.gmane.org/gmane.text.docutils.devel/4354
    File Added: make_id_deaccent_by_unicodedata.patch

     
  • engelbert gruber

    Thank you for your contribution! It has been checked in to the
    Docutils repository.

    You can download the most current snapshot from:
    http://docutils.sourceforge.net/docutils-snapshot.tgz

     
  • engelbert gruber

    • status: open --> closed-accepted
     
