From: Oren Ben-K. <or...@ri...> - 2001-12-01 09:53:55
|
Brian Ingerson wrote:
> I think that "\n" should load as CRLF on DOS and LF on Unix. Is that the
> current understanding?

Nope. The current wording says that CR/LF/CRLF are always loaded as LF. The emitter/formatter is free to convert it back to CRLF if it wants to. Many platforms do this for you on the fly when reading DOS text files - for example, when opening a stdio file without the 'binary' bit set. Since YAML is always a text file it seems reasonable to do the same. Doesn't Perl do it too when you open a text file on Windows without a binary indication?

Have fun,

Oren Ben-Kiki
|
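The rule Oren describes (CR, LF and CRLF all load as LF) can be sketched in a few lines of Python. This is only an illustration of the wording quoted above, not code from any YAML implementation; the name `normalize_to_lf` is made up for the example.

```python
import re

# Hypothetical helper, not part of any YAML library.
# CRLF is listed first so that it collapses to one LF rather than two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_to_lf(text: str) -> str:
    """Load-time rule from the thread: CR, LF and CRLF all become LF."""
    return _LINE_BREAK.sub("\n", text)


dos = "key: value\r\nother: 1\r\n"
old_mac = "key: value\rother: 1\r"
unix = "key: value\nother: 1\n"

# All three on-disk forms load to the same text.
assert normalize_to_lf(dos) == normalize_to_lf(old_mac) == unix
```

Python's own text-mode `open()` applies a similar translation on input by default, which is the kind of platform behaviour the message refers to.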
From: Brian I. <in...@tt...> - 2001-12-01 14:46:00
|
On 01/12/01 11:54 +0200, Oren Ben-Kiki wrote:
> Brian Ingerson wrote:
> > I think that "\n" should load as CRLF on DOS and LF on Unix. Is that the
> > current understanding?
>
> Nope. The current wording says that CR/LF/CRLF are always loaded as LF. The
> emitter/formatter is free to convert it back to CRLF if it wants to.

I was actually referring to the syntax itself, not the real line endings.

    dos: "This line ends with CRLF\n"

Cheers, Brian

PS sorry about the repost
|
From: Clark C . E. <cc...@cl...> - 2001-12-01 16:46:18
|
| On 01/12/01 11:54 +0200, Oren Ben-Kiki wrote:
| > Brian Ingerson wrote:
| > > I think that "\n" should load as CRLF on DOS and LF on Unix. Is that the
| > > current understanding?
| >
| > Nope. The current wording says that CR/LF/CRLF are always loaded as LF. The
| > emitter/formatter is free to convert it back to CRLF if it wants to.
|
| I was actually referring to the syntax itself, not the real line endings.
|
|   dos: "This line ends with CRLF\n"

According to the specification, YAML can be written using either LF or CRLF (depending on the platform), and at parse time both LF and CRLF are turned into LF.

The canonical YAML (a project in the future) will impose further restrictions: only LF on output, rules on which scalar styles to use, etc.

Best,

Clark
|
From: Brian I. <in...@tt...> - 2001-12-01 19:08:23
|
On 01/12/01 11:58 -0500, Clark C . Evans wrote:
> | On 01/12/01 11:54 +0200, Oren Ben-Kiki wrote:
> | > Brian Ingerson wrote:
> | > > I think that "\n" should load as CRLF on DOS and LF on Unix. Is that the
> | > > current understanding?
> | >
> | > Nope. The current wording says that CR/LF/CRLF are always loaded as LF. The
> | > emitter/formatter is free to convert it back to CRLF if it wants to.
> |
> | I was actually referring to the syntax itself, not the real line endings.
> |
> |   dos: "This line ends with CRLF\n"
>
> According to the specification, YAML can be written
> using either LF or CRLF (depending on the platform),
> and at parse time both LF and CRLF are turned into LF.
>
> The canonical YAML (a project in the future) will
> impose further restrictions: only LF on output,
> rules on which scalar styles to use, etc.

Geez. That's twice misunderstood. I want to say that "backslash followed by an 'n'" in an escaped string will be translated into the native line ending of that platform.

Hmmm. I see the problem now. OK here's a new suggestion:

- \
  A parser must translate CRLF and CR into LF before parsing. That's
  what's in the spec today.
- \
  A loader should turn LF into the native line ending for that
  platform.

Does that make sense? Here's an example of how I might use YAML, that would require such behavior:

#!/usr/bin/perl
use YAML;

# Read the user's name from a YAML profile.
$profile = LoadFile('~/.profile');
$name = $profile->{name};

# The prompts are block scalars, so their line breaks come from the loader.
$questionaire = Load <<"--YAML";
---
prompts:
    - |
        Hello $name. I'm going to ask you some questions.
        Are you ready?
    - |
        What shape is this:
        ____
        | |
        |__|
--YAML

# Print each prompt as loaded and collect one line of input per prompt.
foreach $prompt (@{$questionaire->{prompts}}) {
    print $prompt;
    push @answers, <STDIN>;
}
StoreFile("/var/answers/$name", \@answers);
__END__

I guess that's a long way of saying that when an application prints a chunk of text that it got from a YAML loader, it will expect the line endings to be correct.

Cheers, Brian
|
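Brian's two-step suggestion (the parser normalizes to LF, then the loader maps LF to the platform's line ending) might look roughly like this in Python. The function names are invented for the illustration, and `os.linesep` is used as a stand-in for "the native line ending"; neither comes from the thread or from any YAML library.

```python
import os
import re


def parse_normalize(raw: str) -> str:
    """Parser step (what the spec already says): CRLF and bare CR become LF."""
    return re.sub(r"\r\n|\r", "\n", raw)


def load_native(scalar: str, native: str = os.linesep) -> str:
    """Loader step (Brian's proposal): LF becomes the platform line ending.

    `native` is a parameter only so the behaviour of another platform can be
    simulated; a real loader would presumably not expose it this way.
    """
    return scalar.replace("\n", native)


raw = "Hello.\r\nAre you ready?\r\n"
scalar = parse_normalize(raw)

# Simulate a DOS loader and a Unix loader handing text to the application.
assert load_native(scalar, native="\r\n") == "Hello.\r\nAre you ready?\r\n"
assert load_native(scalar, native="\n") == "Hello.\nAre you ready?\n"
```

One caveat when trying this: a Python text-mode stream already translates "\n" to the native line ending on output, so printing a string that has been through `load_native` on Windows would double the carriage returns. That is the same stdio behaviour Oren mentions earlier in the thread.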
From: Clark C . E. <cc...@cl...> - 2001-12-01 20:54:00
|
| Geez. That's twice misunderstood.

Sorry for avoiding the topic. This one is a pain in the arse. There are two options I know of, and each one has its problems.

...

In most languages (C, Python, Java), \n is line feed (LF, 0x0A), \r is carriage return (CR, 0x0D), and \r\n is the carriage return/line feed pair (CRLF, 0x0D 0x0A). Thus, these are our escape sequences. Furthermore, the specification currently says that CRLF and LF are both normalized to LF when reading. Hence, in your example, according to the specification, \n would only be LF and not CRLF as you would prefer.

Note that this convention was copied from the XML specification, so it is not "new" behavior. I've not heard of Perl people having trouble with this behavior, so I figured adopting it from XML without too much question wouldn't be such a bad idea.

However, for the record, IBM fellas have the same gripe... They have NEL (0x85 in Unicode), which is similar to CRLF. When it gets normalized to LF everything gets screwed up for them. Thus, they can round-trip NEL or have readability, but not both.

Right now, the normalization to LF makes things pretty easy in most cases, and still enables round-tripping, since \r\n or \x85 can be used in escaped scalars to store specific line-ending requirements. However, it doesn't give the preferred behavior you described.

...

We can fix this, but then we give up the ability to round-trip various forms of new lines using escape sequences. Here is a proposal. Let NL stand for new line and be one of LF, CRLF, NEL depending on the platform. Then, when loading, LF, CR, CRLF, NEL will all be converted to NL. Furthermore, we drop \r from the escape sequences and only have \n, which is synonymous with NL. Also, to be on the safe side, the hex equivalents of LF, CR, CRLF, and NEL become errors since they will not round-trip.

This allows YAML line endings to work as expected for all platforms and allows for line endings to be automagically converted to their appropriate version on different platforms.

The difficulty with this proposal is that it prevents applications from distinguishing between CR, LF, CRLF, or NEL. If they could distinguish, then we could not normalize. For example, in Windows, people often use LF for a "soft" return that doesn't start a new paragraph (where CRLF does). Thus, this particular application distinguishes between these two characters. Furthermore, as our spec stands, one can completely store a binary value using our escaped scalar... with this proposal, we give up the ability to store arbitrary binary data in this manner.

Unfortunately, we can't have it both ways. Either we pick a "basis", such as LF, and use it as the standard new line representation, allowing escaping of other new line conventions. Or we let the new line representation "float" -- in this case, round-tripping of the new line characters between platforms using different line endings will not be reliable, and thus escaping of new lines should be forbidden.

I hope this helps explain the compromise. I've opted for LF standardization, so that escaping will round-trip as expected.

...

So, using the spec as written, you'd have to add the \r yourself in your program. Or, you could use escaped scalars and use \r\n at the end of each line to add the carriage returns explicitly. Neither option is great, but this is what is required... and by the way, it is how XML works.

Or... if we went with the proposal above, there would be no way to encode binary values which happened to have \r, \n in them using the escaped scalar.

I guess I'm open to either proposal. Given that we will natively support the base64 encoding, I don't mind the normalization. So... I think it comes down to the use cases, which ones are more popular.

I hope this explained it well... sorry for all of the duplication.

Best, ;) Clark
|
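The trade-off between the two options can be shown with a toy model in Python. Everything here is an assumption made for illustration: the functions are invented, and the "escaped scalar" supports only `\r`, `\n` and `\\`, which is far less than the real syntax.

```python
import re


def dump_escaped(value: str) -> str:
    """LF basis: write CR and LF as explicit escapes so they survive as-is."""
    return value.replace("\\", "\\\\").replace("\r", "\\r").replace("\n", "\\n")


def load_escaped(text: str) -> str:
    """Decode the toy escapes; any other escape raises KeyError."""
    table = {"r": "\r", "n": "\n", "\\": "\\"}
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(table[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def load_floating(text: str, native: str) -> str:
    """Floating NL: every line-break form becomes the reader's native one."""
    return re.sub(r"\r\n|\r|\n|\x85", native, text)


# An application value that uses LF as a "soft" break and CRLF as a hard one.
value = "soft\nhard\r\n"

# With an LF basis plus escapes, the exact characters round-trip.
assert load_escaped(dump_escaped(value)) == value

# With a floating NL, the soft/hard distinction is lost on every platform.
assert load_floating(value, "\n") == "soft\nhard\n"
assert load_floating(value, "\r\n") == "soft\r\nhard\r\n"
```

Under the floating rule neither platform gets back the original value, which is the reason the message gives for forbidding newline escapes in that case.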
From: Clark C . E. <cc...@cl...> - 2001-12-01 21:06:55
|
| Hmmm. I see the problem now. OK here's a new suggestion:
|
| - \
|   A parser must translate CRLF and CR into LF before parsing. That's
|   what's in the spec today.
| - \
|   A loader should turn LF into the native line ending for that
|   platform.

Yes. This is the second option that I was trying to explain. If we do this, we cannot distinguish between the various forms of new lines. So, in this case, we should remove \r from the spec, make \n platform dependent, and forbid \x0A, \x0D, \x85.

| I guess that's a long way of saying that when an application prints a
| chunk of text that it got from a YAML loader, it will expect the line
| endings to be correct.

Right. And if we go this way, all bets on distinguishing between various line-ending styles are off.

....

That said, I don't mind doing it. It just means that applications can't use the distinction between various line endings as significant.

Best,

Clark

P.S. My remark in the previous email about using binary values in escaped style won't work for obvious reasons.... I don't know what I was thinking. Thus, there is less reason to keep the spec as it is.
|
From: Oren Ben-K. <or...@ri...> - 2001-12-01 21:05:52
|
Clark C . Evans wrote:
> ... I hope this helps explain the compromise.
> I've opted for LF standardization, so that
> escaping will round-trip as expected.

Nice description Clark. For what it's worth, I opt for keeping things as they are. It seems this way one has better control on the outcome; if we go the other way there would be things we simply can't do (e.g., represent certain control strings using YAML strings).

Have fun,

Oren Ben-Kiki
|
From: Clark C . E. <cc...@cl...> - 2001-12-01 21:22:10
|
| Nice description Clark. For what it's worth, I opt for keeping things as they
| are. It seems this way one has better control on the outcome; if we go the
| other way there would be things we simply can't do (e.g., represent certain
| control strings using YAML strings).

I'm more open to change. XML has had some problems with this approach. For instance, let's say you are on DOS and are saving a string "Hello\r\nWorld" to a YAML scalar. You would have a choice. To really save the string, as is, one would have to use the escaped scalar or quoted string. This would effectively prevent the block and folded scalars from being used in an automated way on IBM Mainframe/Windows boxen. I'm not sure this is so smart. Alternatively, it would require that the user specify which strings must be "preserved" and which need not be.

All said and done, the current spec as is does have this messy problem. And the IBM mainframe guys don't like it at all. It seems that this is a pretty large use case, and no doubt, why Brian caught it. I'm sure users of XML have to grapple with this daily...

That said, I think the case where exact new line sequences are important will be a relatively rare use case... and in this use case, the base64 encoding is always available.

So, after some additional thought... I'm leaning more towards the change:

1. Add NEL to the list of "line ending" sequences.
2. Add NL as a "new line" which is platform dependent.
3. Have \n refer to NL.
4. Remove \r and the new line escapes \x85, \x0A, \x0D.

Best,

Clark
|
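The four-point change could be sketched as a small decoder in Python. This is one reading of the list above, not spec text: `decode`, its `nl` parameter and the error messages are all invented, and only `\n`, `\\` and `\xHH` are modelled.

```python
import os
import re

# Point 4: hex escapes for LF, CR and NEL are rejected.
FORBIDDEN_HEX = {0x0A, 0x0D, 0x85}

# One pass over the text: literal line breaks first, then backslash escapes.
_TOKEN = re.compile(r"\r\n|[\r\n\x85]|\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def decode(body: str, nl: str = os.linesep) -> str:
    """Decode a toy escaped scalar under the proposed rules."""

    def repl(match):
        esc = match.group(1)
        if esc is None:
            # Points 1 and 2: literal LF, CR, CRLF or NEL all become NL.
            return nl
        if esc == "n":
            # Point 3: \n refers to the platform-dependent NL.
            return nl
        if esc == "\\":
            return "\\"
        if len(esc) == 3 and esc[0] == "x":
            code = int(esc[1:], 16)
            if code in FORBIDDEN_HEX:
                raise ValueError("newline hex escape is not allowed: " + esc)
            return chr(code)
        # Point 4 also removes \r; it and anything else unknown is an error.
        raise ValueError("unsupported escape: " + repr(esc))

    return _TOKEN.sub(repl, body)


assert decode(r"Hello\nWorld", nl="\r\n") == "Hello\r\nWorld"
assert decode("Hello\r\nWorld", nl="\n") == "Hello\nWorld"
assert decode(r"\x41\\", nl="\n") == "A\\"
```

With these rules the DOS string "Hello\r\nWorld" from the message loads as "Hello", NL, "World" on any platform, and an application that needs the exact bytes would have to fall back to base64, as the message suggests.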
From: Oren Ben-K. <or...@ri...> - 2001-12-01 21:29:47
|
Clark C . Evans wrote:
> ... if we go this way, all bets on distinguishing
> between various line-ending styles are off.
>
> ....
>
> That said, I don't mind doing it. It just means that
> applications can't use the distinction between various
> line endings as significant.

I do. It seems to me that with the current way one can do anything - including platform-specific end-of-lines; at worst this requires some filter between the parser and the application. At best this would be a convenience flag in parsers on afflicted platforms ('convert all new lines in text to native form'). As long as it is clear this is a non-YAML operation there's no problem with it being available.

If you remove \r etc. this means that certain things become impossible. Other things become much more difficult. Take digital signatures for example - currently I can just feed the parsed text into a generator; the other way I'd have to guess which newline encoding was used in generating the signature and reverse-map the newlines to it. Ugh and Yuck.

I say let Brian have his "non-YAML-mapping-of-eol-to-native" flag in his Perl parser but keep the spec as it is.

Have fun,

Oren Ben-Kiki
|
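The signature concern can be made concrete with a short Python sketch. A SHA-256 digest stands in for a real digital signature, and the candidate list and function names are assumptions made for the example.

```python
import hashlib

# Hypothetical list of line endings a signer might have used.
CANDIDATES = {"LF": "\n", "CRLF": "\r\n", "NEL": "\x85"}


def fingerprint(text: str) -> str:
    """SHA-256 over the UTF-8 bytes; a stand-in for a real signature."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def guess_signed_form(loaded: str, native: str, expected: str):
    """Reverse-map native line breaks to each candidate until one matches."""
    for name, nl in CANDIDATES.items():
        if fingerprint(loaded.replace(native, nl)) == expected:
            return name
    return None


# With a floating NL, a signer on DOS would sign the CRLF form of the text.
signed_text = "amount: 10\r\npayee: bob\r\n"
expected = fingerprint(signed_text)

# The same document loaded on Unix has LF line breaks, so the digest differs.
loaded_on_unix = signed_text.replace("\r\n", "\n")
assert fingerprint(loaded_on_unix) != expected

# The verifier has to guess which encoding was signed.
assert guess_signed_form(loaded_on_unix, "\n", expected) == "CRLF"
```

With the spec as it stands, the parsed text always uses LF, so signer and verifier would hash the same characters and no guessing step would be needed.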
From: Clark C . E. <cc...@cl...> - 2001-12-01 22:01:55
|
| > ... if we go this way, all bets on distinguishing
| > between various line-ending styles are off.
| >
| > ....
| >
| > That said, I don't mind doing it. It just means that
| > applications can't use the distinction between various
| > line endings as significant.
|
| I do. It seems to me that with the current way one can do anything

Well, it can't encode binary values, due to the various Unicode encodings (UTF-8, UTF-16LE, UTF-16BE, etc.).

| including platform-specific end-of-lines

1. It can't do this without using the quoted scalar for anything with a new line (a pretty substantial limitation).
2. Any new platforms will pick \n or \r\n for their new line indicator.

| At best this would be a convenience flag in parsers on
| afflicted platforms ('convert all new lines in text to
| native form'). As long as it is clear this is a non-YAML
| operation there's no problem with it being available.

icky.

| If you remove \r etc. this means that certain things become
| impossible.

Well, it just requires that anything using \x0A, \x0D, \x85 for anything other than new line indication must use the base64 encoding. The only "use case" I know is the ::MessageBox function, which uses \r by itself to signify a "soft-return". Perhaps there is a compromise that allows \r all by itself?

| Other things become much more difficult. Take digital signatures for example
| - currently I can just feed the parsed text into a generator; the other way
| I'd have to guess which newline encoding was used in generating the
| signature and reverse-map the newlines to it.

Ahh. Well, we can adopt a variant of Brian's proposal. The parser can have a setting which provides what NL is, and this setting can default to LF. Effectively, we can say that the "generic" layer uses LF for the NL, and the "binding" layer is application specific. Since the signature generator works at the generic layer, this isn't a problem.

| I say let Brian have his "non-YAML-mapping-of-eol-to-native"
| flag in his Perl parser but keep the spec as it is.

Yes; however, I can see Brian insisting that this setting would be the default for his processor... and given that it would be a rather obscure setting, most people won't change it. Thus, effectively, YAML won't be able to distinguish between the various line endings. So, we should probably plan for it in advance, no?

Interesting. This isn't the side of the fence I thought I'd be arguing on.

Best,

Clark
|
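Clark's compromise (an NL setting that defaults to LF, with a "generic" layer that always uses LF and an application-specific "binding" layer) might be sketched like this in Python. The `Loader` class, its `nl` field and the method names are hypothetical, and a SHA-256 digest again stands in for a signature.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Loader:
    # Binding-layer setting for what NL means; defaults to LF as proposed.
    nl: str = "\n"

    def generic(self, raw: str) -> str:
        """Generic layer: line breaks are always LF, whatever the setting."""
        return re.sub(r"\r\n|\r", "\n", raw)

    def bound(self, raw: str) -> str:
        """Binding layer: LF is mapped to the configured NL for the app."""
        return self.generic(raw).replace("\n", self.nl)


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


raw = "first\r\nsecond\n"
unix_loader = Loader()
dos_loader = Loader(nl="\r\n")

# Applications see their own line endings at the binding layer.
assert unix_loader.bound(raw) == "first\nsecond\n"
assert dos_loader.bound(raw) == "first\r\nsecond\r\n"

# A signature generator working at the generic layer is unaffected.
assert digest(unix_loader.generic(raw)) == digest(dos_loader.generic(raw))
```

The design question raised in the message remains: if a processor ships with a non-LF default, most applications will only ever see the binding-layer text.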
From: Oren Ben-K. <or...@ri...> - 2001-12-02 07:18:55
|
Clark C . Evans [mailto:cc...@cl...] wrote:
> Ok. So it stays as is. That said, I think I'd like
> to add NEL (x85) and bare CR (x0D) to the line
> normalization rules; per the Blueberry conversation
> a few months ago...[1] I've posted a query to the
> XML list and I'll be interested in the results.

Bare CR already is. As per http://www.unicode.org/unicode/reports/tr13/ :

"Even if you know which characters represents NLF on your particular platform, on input and in interpretation, treat CR, LF, CRLF, and NEL the same. Only on output do you need to distinguish between them."

I think we should add NEL, PS and LS. That's "the right thing to do".
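A minimal Python sketch of the TR13 advice, extended with the LS and PS characters Oren proposes adding. Mapping LS (U+2028) and PS (U+2029) to LF is an assumption about how "adding" them to the normalization rules would work, and the function names are invented.

```python
import re

# CR, LF, CRLF, NEL (U+0085), LS (U+2028) and PS (U+2029).
# CRLF is listed first so the pair becomes a single LF.
_NLF = re.compile("\r\n|[\r\n\x85\u2028\u2029]")


def read_normalized(text: str) -> str:
    """On input, treat every newline function the same: map it to LF."""
    return _NLF.sub("\n", text)


def write_with(text: str, nlf: str) -> str:
    """Only on output does the platform's newline function matter."""
    return text.replace("\n", nlf)


sample = "a\rb\nc\r\nd\x85e\u2028f\u2029g"
normalized = read_normalized(sample)

assert normalized == "a\nb\nc\nd\ne\nf\ng"
assert write_with(normalized, "\r\n") == "a\r\nb\r\nc\r\nd\r\ne\r\nf\r\ng"
```

Collapsing PS to LF loses the line/paragraph distinction that LS and PS were meant to carry, which may or may not matter for YAML scalars.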