The biggest value of YAML is its readability. A document full of "\xDCber" cannot be read by humans.

Let us read the specification:
p. 5.1 Character Set:
YAML streams use the printable subset of the Unicode character set. <...> On output, a YAML processor must only produce these acceptable characters, and should also escape all non-printable Unicode characters.
(German, Russian and Greek characters are printable, so nothing here requires escaping them.)

p. 5.2 Character Encoding:
A YAML processor must support the UTF-16 and UTF-8 character encodings. If a character stream does not begin with a byte order mark (#FEFF), the character encoding shall be UTF-8.
(It does not say that the output should be ASCII.)
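
To make the encoding rule quoted above concrete, here is a minimal sketch in Python of how a reader might pick a decoder from the byte order mark. The function names are hypothetical and this is not how any particular YAML library does it; it only uses the standard codecs module and assumes the input is a complete byte string.

```python
import codecs

def detect_yaml_encoding(data: bytes) -> str:
    """Choose a Python codec name for a YAML byte stream (illustrative only)."""
    # A UTF-16 BOM in either byte order selects UTF-16; Python's "utf-16"
    # codec reads the BOM, picks the endianness and strips the mark.
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # Otherwise the stream is UTF-8; "utf-8-sig" also tolerates an
    # optional UTF-8 encoded BOM at the start.
    return "utf-8-sig"

def decode_yaml_stream(data: bytes) -> str:
    """Decode the bytes with the codec selected by detect_yaml_encoding."""
    return data.decode(detect_yaml_encoding(data))
```

Note that nothing in this rule involves ASCII: a stream without a BOM is simply treated as UTF-8.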

Supporting old terminals by emitting ASCII is fine as an option, but escaping Unicode should not be the default behavior.

P.S. This is where PyYAML and SnakeYAML deviate: SnakeYAML emits ASCII only when it is explicitly requested.
(Java has been Unicode-friendly from the very beginning.)


On Mon, Feb 23, 2009 at 5:42 PM, Kirill Simonov <xi@gamma.dn.ua> wrote:
Andrey Somov wrote:

By default PyYAML outputs an ASCII character stream escaping Unicode characters:

'Über' -> '\xDCber'

Technically, it's "\xDCber", not '\xDCber'.  The former is a representation of the text:

    Über

while the latter is a representation of the text:

    \xDCber

>>> print yaml.load('Über')
>>> print yaml.load(r''' "\xDCber" ''')
>>> print yaml.load(r''' '\xDCber' ''')
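
The session above is Python 2. As a rough Python 3 equivalent, the following sketch checks the same distinction with assertions instead of printed output. It assumes PyYAML is installed and uses yaml.safe_load; the variable names are only for illustration.

```python
import yaml  # PyYAML (third-party); assumed to be installed

# Double-quoted style: \xDC is an escape sequence and decodes to 'Ü'.
double_quoted = yaml.safe_load(r'"\xDCber"')

# Single-quoted style: no escape processing, the backslash stays literal.
single_quoted = yaml.safe_load(r"'\xDCber'")

# Plain style with the raw character.
plain = yaml.safe_load("Über")

assert double_quoted == plain == "Über"
assert single_quoted == "\\xDCber"
```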

PyYAML is completely in compliance with the YAML specs here.  There are two choices for the emitter when it encounters a non-ASCII character: either emit the scalar in the UTF-8 encoding or use the double-quote style and escape non-ASCII characters.  Both are correct and supported by the PyYAML emitter.  By default, the emitter uses the conservative approach: it escapes non-ASCII characters since it ensures that the document is always readable on ASCII terminals.  While it may produce a less readable result, it's safer.
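
The two emitter choices described above can be compared side by side. This is a small sketch assuming Python 3 with PyYAML installed; the sample mapping is made up for illustration, and allow_unicode is the yaml.dump option that switches off the escaping.

```python
import yaml  # PyYAML (third-party); assumed to be installed

data = {"name": "Über"}  # hypothetical sample document

# Default: non-ASCII characters are escaped in a double-quoted scalar.
escaped = yaml.dump(data)
# escaped == 'name: "\\xDCber"\n'

# Opt-in: printable non-ASCII characters are written as they are.
readable = yaml.dump(data, allow_unicode=True)
# readable == 'name: Über\n'

# Both forms are valid YAML and load back to the same data.
assert yaml.safe_load(escaped) == yaml.safe_load(readable) == data
```

Both outputs describe the same document, so the disagreement in this thread is only about which of the two should be the default.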