#207 Change transcription of non-ASCII chars

HTML writer (7)

The function make_id() in nodes.py converts a string into an identifier. To do so, it escapes or replaces some non-ASCII characters. In German, the usual way to transcribe umlauts and eszett is ä -> ae, ö -> oe, ü -> ue, ß -> ss. Docutils instead produces ä -> a, ö -> o, ü -> u, ß -> sz, which is wrong for German.

The dict _non_id_translate_digraphs should have its ß entry changed (0x00df: u'ss', # ligature sz/ss) and entries for the umlauts added (0x00e4: u'ae', 0x00f6: u'oe', 0x00fc: u'ue').
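
The effect of the proposed entries can be illustrated with a minimal sketch. It is not the Docutils implementation: `PROPOSED_DIGRAPHS` and `sketch_make_id()` are hypothetical names, and the function only loosely imitates an id-making routine (lowercase, apply the table, strip remaining accents, hyphenate).

```python
import re
import unicodedata

# Hypothetical stand-in for the table in nodes.py; only the four
# entries named in this ticket are included.
PROPOSED_DIGRAPHS = {
    0x00df: 'ss',  # ß, ligature sz/ss
    0x00e4: 'ae',  # ä
    0x00f6: 'oe',  # ö
    0x00fc: 'ue',  # ü
}


def sketch_make_id(text, digraphs=PROPOSED_DIGRAPHS):
    """Simplified illustration of id generation (not Docutils' make_id)."""
    # Lowercase first so that Ä, Ö, Ü hit the lowercase table entries.
    ident = text.lower().translate(digraphs)
    # Decompose remaining accented characters and drop the non-ASCII parts.
    ident = unicodedata.normalize('NFKD', ident)
    ident = ident.encode('ascii', 'ignore').decode('ascii')
    # Replace every run of other characters with a single hyphen.
    return re.sub(r'[^a-z0-9]+', '-', ident).strip('-')


print(sketch_make_id('Größe und Maß'))  # groesse-und-mass
print(sketch_make_id('Über uns'))       # ueber-uns
```

Applying the table before the accent-stripping step matters: once ü has been decomposed and reduced to u, the digraph rule can no longer match.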


  • Stefan Merten

    Stefan Merten - 2012-10-14
    • status: open --> closed-rejected
  • Günter Milde

    Günter Milde - 2015-03-23
    • summary: Change transcription of special German chars --> Change transcription of non-ASCII chars
    • status: closed-rejected --> pending-remind
    • Group: --> Default
  • Günter Milde

    Günter Milde - 2015-03-23

    Reopening: especially for people using non-Latin scripts, the generated ids are far from optimal (cf. https://sourceforge.net/p/docutils/feature-requests/42/).

    Especially for deep links into generated documents, it would be a vast improvement to create ids based on a Latin transliteration or to use IRIs (http://tools.ietf.org/html/rfc3987) instead of URIs in the href attribute.

    To address the stability problem for re-generated documents, I suggest a "make-id" config setting for the HTML writer with the alternatives: "legacy" (as currently), "transliterate", "encode".

    To be "language-agnostic" (not restricted to German umlauts),
    transliteration can use the Unidecode module.
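
The "transliterate" and "encode" alternatives suggested in this comment might look roughly like the sketch below. The function names are hypothetical and nothing here is existing Docutils code. Unidecode is a third-party package, so the sketch falls back to a cruder standard-library approach (NFKD decomposition, dropping whatever is not ASCII) when it is not installed. The "legacy" alternative is left out because it is simply the current behaviour.

```python
import unicodedata
from urllib.parse import quote

try:
    from unidecode import unidecode  # third-party, optional
except ImportError:
    unidecode = None


def transliterate(text):
    """Return an ASCII approximation of text (the "transliterate" idea)."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback: decompose accented characters, then drop non-ASCII parts.
    # Unlike Unidecode, this loses characters without a decomposition.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def encode(text):
    """Percent-encode the UTF-8 bytes of text (the "encode" idea)."""
    return quote(text, safe='')


print(transliterate('über'))  # uber
print(encode('über'))         # %C3%BCber
```

The fallback only approximates Unidecode for Latin letters with diacritics; for non-Latin scripts it would return an empty string, which is exactly the case where Unidecode is needed.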

  • Günter Milde

    Günter Milde - 2015-04-13

    As transliteration is language-dependent and Docutils documents have a language setting, Unidecode should be supplemented by language-specific transliteration rules, e.g. ü->u by default but ü->ue in "de" and "se".
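
A minimal sketch of that layering, assuming a per-language override table that is applied before a generic fallback. `LANGUAGE_RULES` and `transliterate_for()` are hypothetical names; only German rules are filled in, and the generic step uses the standard library rather than Unidecode to keep the example self-contained.

```python
import unicodedata

# Hypothetical per-language overrides, keyed by language code.
# Only lowercase letters are covered to keep the sketch short.
LANGUAGE_RULES = {
    'de': {0x00e4: 'ae', 0x00f6: 'oe', 0x00fc: 'ue', 0x00df: 'ss'},
}


def transliterate_for(text, language_code=None):
    """Apply language-specific rules first, then a generic ASCII fallback."""
    rules = LANGUAGE_RULES.get(language_code, {})
    text = text.translate(rules)
    # Generic default: strip diacritics, so ü becomes u.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(transliterate_for('müde', 'de'))  # muede
print(transliterate_for('müde'))        # mude
```

Languages without an entry in the table get the generic result, so adding a language only requires a new dictionary.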

  • Günter Milde

    Günter Milde - 2015-09-02
    • status: pending-remind --> open
