#197 unicode error with date directive

closed-fixed
nobody
None
5
2012-07-22
2012-07-21
Toshio Kuratomi
No

I received this bug report against our package in Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=786867

Taking a look, the date directive passes in a byte string under Python 2, but the class then tries to feed that byte string to unicode(), which decodes as ASCII by default. When localized, the date directive can inject dates with non-ASCII characters, so that decoding can fail. We need to fix that.

The following patch seems to work:
Index: docutils-0.9.1/docutils/parsers/rst/directives/misc.py
===================================================================
--- docutils-0.9.1.orig/docutils/parsers/rst/directives/misc.py
+++ docutils-0.9.1/docutils/parsers/rst/directives/misc.py
@@ -10,6 +10,7 @@ import sys
 import os.path
 import re
 import time
+import locale
 from docutils import io, nodes, statemachine, utils
 from docutils.error_reporting import SafeString, ErrorString
 from docutils.parsers.rst import Directive, convert_directive_function
@@ -474,6 +475,17 @@ class Date(Directive):
                 'a substitution definition.' % self.name)
         format = '\n'.join(self.content) or '%Y-%m-%d'
         text = time.strftime(format)
+        if sys.version_info < (3, 0):
+            try:
+                text = unicode(text, locale.getpreferredencoding())
+            except UnicodeError:
+                try:
+                    text = unicode(text, 'utf-8')
+                except UnicodeError:
+                    # Fallback to something that can decode all bytes to
+                    # something. Alternative fallback would be to decode
+                    # with errors='replace'
+                    text = unicode(text, 'latin-1')
         return [nodes.Text(text)]
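
As an illustration only (this is not part of the submitted patch), the same fallback chain can be sketched in Python 3 terms, where the input is an explicit bytes object. The helper name decode_with_fallbacks and its encodings parameter are hypothetical and exist only for this example.

```python
import locale


def decode_with_fallbacks(raw, encodings=None):
    """Decode bytes, trying each encoding in turn.

    Hypothetical helper: mirrors the patch's order of attempts
    (locale encoding, then UTF-8, then Latin-1 as a last resort).
    """
    if encodings is None:
        encodings = [locale.getpreferredencoding(False), "utf-8"]
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except (UnicodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail,
    # but the result may be mojibake rather than the intended text.
    return raw.decode("latin-1")


if __name__ == "__main__":
    # "7月" (July) encoded as UTF-8 is not valid ASCII, so the second
    # candidate is the one that succeeds here.
    print(decode_with_fallbacks("7月".encode("utf-8"), ["ascii", "utf-8"]))
```

The Latin-1 step never raises, which is why it works as a final fallback, but it can silently produce wrong characters. That trade-off is the one the comment in the patch alludes to when it mentions errors='replace' as an alternative.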

Out of curiosity, I also checked how often nodes.Text() receives a byte str instead of a unicode object, using the following patch:

Index: docutils-0.9.1/docutils/nodes.py

--- docutils-0.9.1.orig/docutils/nodes.py
+++ docutils-0.9.1/docutils/nodes.py
@@ -329,6 +329,12 @@ class Text(Node, reprunicode):
     else:
         def __new__(cls, data, rawsource=None):
             """Prevent the rawsource argument from propagating to str."""
+            # Python2 is more lenient about mixing str and unicode than
+            # python3 mixing bytes and str but the danger is that our tests
+            # will give only ascii values to this function and be fine but in
+            # the real world someone will give it non-ascii and then crash
+            if isinstance(data, str):
+                raise TypeError('expecting unicode data, not str')
             return reprunicode.__new__(cls, data)

     def __init__(self, data, rawsource=''):

The results were quite bad:

[...]
  File "/srv/git/python-docutils/docutils-0.9.1/docutils/nodes.py", line 337, in __new__
    raise TypeError('expecting unicode data, not str')
TypeError: expecting unicode data, not str

----------------------------------------------------------------------
Ran 1192 tests in 8.961s

FAILED (errors=802)
[...]

These are all potential failure points -- whether they can fail in practice depends on whether the data being sent in can contain non-ASCII values or not.
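
To make the point about ASCII-only test data concrete, here is a minimal Python 3 sketch. It is an analogy, not docutils code: StrictText and decodes_as_ascii are hypothetical names, and the bytes/str split in Python 3 stands in for the str/unicode split in Python 2.

```python
class StrictText(str):
    """Hypothetical text type that refuses undecoded bytes up front."""

    def __new__(cls, data):
        if isinstance(data, bytes):
            raise TypeError("expecting str data, not bytes")
        return super().__new__(cls, data)


def decodes_as_ascii(raw):
    """Return True if an implicit ASCII decode of raw would succeed."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # ASCII-only data passes, which is why a test suite can stay green.
    print(decodes_as_ascii(b"2012-07-22"))
    # A localized month name does not, so the failure only shows up
    # once real-world data arrives.
    print(decodes_as_ascii("7月".encode("utf-8")))
    # The strict type rejects bytes regardless of their content.
    try:
        StrictText(b"2012-07-22")
    except TypeError as exc:
        print("rejected:", exc)
```

Rejecting the wrong type at construction time turns a data-dependent crash into a deterministic one, which is what the experiment with the 802 errors exploits.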

Discussion

  • Patch to fix issue as an attachment

     
  • Here's a simpler reproducer that might be turned into a test case:

    test.sh:

    #!/bin/sh

    LC_ALL=ja_JP.utf8 rst2html --traceback test.rst

    test.rst:

    .. |date| date:: %d %B %Y

     
  • Günter Milde
    2012-07-22

    Thanks for the bug report and patch. A modified¹ fix and a test case² have now been added to the source.

    ¹ Some Python implementations do not support a locale; we instead warn if a failure to decode results in corrupted data.

    ² The test case revealed that we also need to encode the format string. The test case with the Japanese locale fails to fail if this locale is not installed on the test system (the fallback C locale works fine). This is why I test with a non-ASCII character in the format string.

    Most of the 802 potential error sources are literal strings. I leave checking them all as an exercise for the reader...

     
  • Günter Milde
    2012-07-22

    • status: open --> closed-fixed
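
The reply above mentions warning when a decoding failure corrupts data, rather than falling back through several encodings. The committed fix is not shown in this ticket, so the following is only a sketch of what such a strategy could look like in Python 3. The function name decode_or_warn and the wording of the message are assumptions made for this example.

```python
import warnings


def decode_or_warn(raw, encoding="utf-8"):
    """Decode bytes; on failure, substitute U+FFFD and emit a warning.

    Hypothetical helper, not the docutils implementation.
    """
    try:
        return raw.decode(encoding)
    except UnicodeError:
        # Undecodable bytes become U+FFFD, so the caller still gets text,
        # and the warning makes the data loss visible.
        text = raw.decode(encoding, errors="replace")
        warnings.warn(
            "could not decode %r with %r; undecodable bytes were replaced"
            % (raw, encoding)
        )
        return text


if __name__ == "__main__":
    print(decode_or_warn(b"22 July 2012", "ascii"))
    print(decode_or_warn("22 7月 2012".encode("utf-8"), "ascii"))
```

Compared with a Latin-1 fallback, this keeps the output visibly marked where information was lost instead of producing plausible-looking but wrong characters.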