
#427 Unbreakable words are not wrapped

Status: open
Owner: nobody
Labels: dump (2)
Priority: minor
Type: bug
Updated: 2023-07-13
Created: 2022-05-09
Private: No

(ruamel.yaml 0.17.21)

The width property does not respect unbreakable words, such as URLs.

For example, in the snippet below, the description string, originally wrapped to avoid overflowing 120 characters in the document, is dumped as a single line.

from ruamel.yaml import YAML
import sys

yaml = YAML()
yaml.width = 120

input = yaml.load('''\
arn:
  description: ARN of the Log Group to source data from. The expected format is documented at
     https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazoncloudwatchlogs.html#amazoncloudwatchlogs-resources-for-iam-policies
''')

yaml.dump(input, sys.stdout)

Output, shown as a diff against the input:

 arn:
-  description: ARN of the Log Group to source data from. The expected format is documented at
-    https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazoncloudwatchlogs.html#amazoncloudwatchlogs-resources-for-iam-policies
+  description: ARN of the Log Group to source data from. The expected format is documented at https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazoncloudwatchlogs.html#amazoncloudwatchlogs-resources-for-iam-policies

I am not personally bothered about dump() being that tolerant with unbreakable words, but I am currently facing issues with a project that uses yamllint with the following rule:

extends: default

rules:

  line-length:
    max: 120
    allow-non-breakable-inline-mappings: true

Documents that used to pass the linter's checks are now failing it, and require manual intervention to wrap strictly at 120 characters.
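
To see which dumped lines would trip a 120-character limit, a rough check along the following lines may help. It is only a sketch, not a replacement for yamllint: the function name `overlong_lines` is made up for illustration, and it only approximates what I understand yamllint's `allow-non-breakable-words` option (enabled by default) to do, namely accept an over-long line that consists of a single word with no spaces. It does not model `allow-non-breakable-inline-mappings`.

```python
def overlong_lines(text, max_len=120):
    """Return (line number, length) pairs for over-long lines that could be wrapped.

    Hypothetical helper: a line made of indentation plus one unbreakable
    word (for example a URL on its own line) is not reported, because
    there is no space at which it could be broken.
    """
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) <= max_len:
            continue
        if ' ' not in line.strip():
            # Single unbreakable word: nothing to wrap, so skip it.
            continue
        flagged.append((number, len(line)))
    return flagged


if __name__ == "__main__":
    sample = "a: b\n  " + "x" * 130 + "\ntext more " + "y" * 130 + "\n"
    for number, length in overlong_lines(sample):
        print(f"line {number}: {length} characters")
```

This is consistent with the report above: the URL on its own continuation line is an unbreakable word, whereas the single dumped line contains spaces and exceeds the limit.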

Discussion

  • Anthon van der Neut

    I would have to look into why this happens. It is probably in the emitter (write_plain), and I don't think I ever touched that code in a way that could have introduced this behavior (i.e. I think ruamel.yaml inherited it from PyYAML).

    However, I see two possible solutions that don't involve manual intervention once set up.
    One is to switch to a folded scalar: the following dumps input2 back exactly as it was loaded, and the assertion at the end doesn't raise an error.

    from ruamel.yaml import YAML
    import sys
    
    yaml = YAML()
    yaml.width = 120
    
    input = yaml.load('''\
    arn:
      description: ARN of the Log Group to source data from. The expected format is documented at
         https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazoncloudwatchlogs.html#amazoncloudwatchlogs-resources-for-iam-policies
    ''')
    
    input2 = yaml.load('''\
    arn:
      description: >-
        ARN of the Log Group to source data from. The expected format is documented at
        https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazoncloudwatchlogs.html#amazoncloudwatchlogs-resources-for-iam-policies
    ''')
    
    yaml.dump(input2, sys.stdout)
    assert input['arn']['description'] == input2['arn']['description']
    

    So any program that subsequently loads this output should get the same value.
    Such folded scalars round-trip with "break information", but that means they are non-trivial to update.

    If you need to update them, the other automatable solution is a transform function on dump that reads each line, checks whether it is too long because its last word pushed it past 120 characters, and then puts that word on the next line on its own. If you want something like that, post this on StackOverflow and tag ruamel.yaml and I'll get you an answer (probably within a day).
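
A minimal sketch of the transform idea described above, not the code from the StackOverflow answer. The function name `wrap_overlong_last_word` and the choice to indent the moved word two spaces deeper than its line are assumptions made for illustration. It also assumes every over-long line belongs to a plain (or otherwise foldable) scalar; applied to literal block scalars or comments it would change their content.

```python
def wrap_overlong_last_word(text, width=120):
    """Move the last word of an over-long line onto a line of its own.

    Hypothetical helper: a line is only touched when it exceeds `width`
    solely because of its last word, i.e. the rest of the line fits.
    """
    out = []
    for line in text.splitlines():
        stripped = line.rstrip()
        head, sep, last = stripped.rpartition(' ')
        head = head.rstrip()
        if len(stripped) > width and sep and head.strip() and len(head) <= width:
            indent = len(line) - len(line.lstrip(' '))
            out.append(head)
            # Indent deeper than the current line so that the word is
            # still parsed as a continuation of the same plain scalar.
            out.append(' ' * (indent + 2) + last)
        else:
            out.append(line)
    return '\n'.join(out) + ('\n' if text.endswith('\n') else '')


if __name__ == "__main__":
    dumped = "k:\n  d: see docs at https://example.com/a/very/long/path\n"
    print(wrap_overlong_last_word(dumped, width=20), end='')
```

With ruamel.yaml, a function like this can be passed as the `transform` argument of `dump()`, for example `yaml.dump(data, sys.stdout, transform=wrap_overlong_last_word)`; it receives the dumped text as a string and returns the string that is actually written.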

     
  • Antoine Cotten - 2022-05-09

    The first option is a good workaround. It doesn't scale for me because I'm dealing with > 70 files, and a large number of documentation strings similar to the one in my example. But it's definitely something worth considering.
    In fact, now that you mention it, I'm using description: | in a few places and these aren't reformatted either.

    The transform function sounds interesting. I just asked a new question on StackOverflow as you suggested. Thanks!

     
  • Antoine Cotten - 2022-05-09

    Another observation is that lines often overflow even with much shorter words.

    In the example below, the 120-character limit falls right at

    are documented in the|

    but dump() causes that same line to become

    are documented in the| AWS

    (where | marks the 120-character limit).

     nested:
       nested:
         nested:
           nested:
             nested:
               nested:
                 nested:
                   region:
    -                description: Code of the AWS region to source metrics from. Available region codes are documented in the
    -                  AWS General Reference at https://docs.aws.amazon.com/general/latest/gr/rande.html#regional-endpoints.
    +                description: Code of the AWS region to source metrics from. Available region codes are documented in the AWS
    +                  General Reference at https://docs.aws.amazon.com/general/latest/gr/rande.html#regional-endpoints.
    

    The workaround shared on StackOverflow works, so thanks a lot for this, and for your responsiveness. The code needs a little tweaking for cases where more text follows a manually wrapped word, but these are just details.

     
  • Ben Brown - 2023-07-13

    The fix implemented for this feels like a step away from what would be expected of the round trip dumper, where you would expect format to be preserved.

    I wouldn't terribly mind if this were opt-in, but there is a hardcoded default best_width of 80 that triggers this behaviour with release 0.17.22 and later.

    It would at least be nice if there were a way to disable this behaviour, such as setting the width to -1 or 0, rather than having to set some arbitrarily large width.

     

    Last edit: Ben Brown 2023-07-13
