#173 Custom representer for unicode strings not used if string contains non-ASCII characters

closed
nobody
None
minor
task
2020-01-23
2020-01-23
No

Hi,

This is a bug I originally found with the standard Python yaml module (a.k.a. python-yaml on Debian). When I tried reporting it back in July, I was informed that yaml was no longer maintained and that I should investigate ruamel.yaml.

So I've taken the test case I wrote then and ported it over to ruamel.yaml, and I'm able to reproduce the same bug. The situation is this: we want to be able to represent strings that have multi-line values in a human-readable form. That is:

a_value: |
    Like this.
    Each line kept separate, with no control characters obscuring it.

not like this:

a_value: "This is what we typically get\nNearly impossible to read."

I've been able to reproduce this problem on a couple of versions of ruamel.yaml, including the Debian Jessie package and the latest on pypi (0.15.34).

The attached test case produces the following output:

RC=0 stuartl@rikishi ~ $ python2 /tmp/pyyaml.py 
/home/stuartl/.local/lib64/python2.7/site-packages/ruamel/yaml/resolver.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if value == u'':
"\nThis is a multi-line\n  string that has some characters\nthat break pyyaml, such
  as these:\n\n    \xB9\u2082\xB3\u2084\u2075\u2086\u2077\u2088\n"

Python 3 is the same, although we avoid a UnicodeWarning there. I was expecting output like this:

|

  This is a multi-line
    string that has some characters
  that break pyyaml, such as these:

      ¹₂³₄⁵₆⁷₈

Regards,
Stuart Longland

(originally posted on 2017-11-07 at 07:02:45 by Stuart Longland <Stuart Longland@bitbucket>)

1 Attachment

Discussion

  • Anthon van der Neut

    First of all, apologies for the late reply; I am finally catching up with things a bit. This is more of an incorrect-usage post, and as such I would have preferred a question tagged ruamel.yaml on StackOverflow.

    This is easily solved (even with the version available last November):

    import sys
    import ruamel.yaml
    from ruamel.yaml.scalarstring import PreservedScalarString as lit
    
    break_string=lit('''
    This is a multi-line
      string that has some characters
    that break pyyaml, such as these:
    kkkkkkkkkk
        ¹₂³₄⁵₆⁷₈
    ''')
    
    yaml = ruamel.yaml.YAML()
    yaml.dump(break_string, sys.stdout)
    

    gives:

    |
    
    This is a multi-line
      string that has some characters
    that break pyyaml, such as these:
    kkkkkkkkkk
        ¹₂³₄⁵₆⁷₈
    

    There are a few things to take note of:

    • this works with Python3 and Python2,
    • I am not sure why you expect the output to be indented; that is not necessary (see e.g. the second document in example 9.3 of the YAML 1.2 spec)
    • the above might not be loadable by PyYAML, which insists on indenting root-level literal block style scalars. In practice that is only a problem if you intend to make (multi-)document streams containing only these strings. As soon as they are part of a sequence or mapping, the point becomes moot.
    • the PreservedScalarString is what this gets loaded as. You can round-trip this in ruamel.yaml
    • don't mess with the CLoader/CDumper, it only supports the old YAML 1.1
    • if you specify from __future__ import unicode_literals, marking a Python string with u, as your break_string is, is superfluous
    • my recommendation is that you always import print_function if you import from __future__ anyway. For single-argument print usage that doesn't make much difference, but as soon as you add a second one, Python2 will give you a tuple if you don't import print_function
    • PyYAML (and ruamel.yaml) have a streaming interface. PyYAML also streams to an internal buffer and gives the contents of that buffer back if you don't supply a stream. If you then stream that result out using print, you have wasted time and memory: stream directly to stdout instead. Your print() also adds an extra newline after the one which PyYAML puts at the end of the document.

    (originally posted on 2018-08-13 at 06:29:22)

     
  • Anthon van der Neut

    • status set to closed

    (originally posted on 2018-08-13 at 06:30:01)

     
