From: Oren Ben-K. <or...@ri...> - 2001-12-02 07:45:52
|
Sorry - hit the wrong button :-(

Clark C . Evans [mailto:cc...@cl...] wrote:

> Ok. So it stays as is. That said, I think I'd like
> to add NEL (x85) and bare CR (x0D) to the line
> normalization rules; per the Blueberry conversation
> a few months ago...[1] I've posted a query to the
> XML list and I'll be interested in the results.

Bare CR already is. About the rest (NEL, PS and LS) - there's a lovely little document at http://www.unicode.org/unicode/reports/tr13/ which describes what we (or anyone else) should be doing:

"Even if you know which characters represents NLF on your particular platform, on input and in interpretation, treat CR, LF, CRLF, and NEL the same. Only on output do you need to distinguish between them."

We do that - we convert them all to LF. That's allowed by the rules:

"1. If you do know the exact usage of any NLF, then convert it to LS or PS.
2. If you don't know the exact usage of any NLF, remap it to your platform NLF. (This doesn't really help you in interpreting Unicode text unless you are the only source of that text, since someone else may have left in LF, CR, CRLF, or NEL.)"

We legitimately use #2 and declare our platform (YAML's) NLF to be LF. So we are covered as far as CR/LF/NEL are concerned.

We don't allow an unescaped FF in a YAML file, so we don't have to break our heads on that one.

That leaves PS and LS (Paragraph and Line Separators). These are a pain. The rules are:

"A readline function should stop at NLF, LS, FF, or PS."

So, we must treat LS and PS as valid line breaks at least within text scalars - the simplest is to allow them as a line break everywhere. As for what we do with them afterward:

"1. Always interpret PS as paragraph separator and LS as line separator.
2. In word processing, interpret any NLF the same as PS.
3. In simple text editors, interpret any NLF the same as LS.
4. In parsing, choose the safest interpretation."
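A minimal sketch of the input-side handling described above, in Python. It is not code from this thread: the names `normalize_nlf` and `read_lines` are made up for illustration, and FF is left out on the assumption that it never appears unescaped in the input. Every NLF (CR, LF, CRLF, NEL) is mapped to LF, and a readline-style splitter then stops at LF, LS, or PS.

```python
import re

NEL = "\u0085"  # NEXT LINE
LS = "\u2028"   # LINE SEPARATOR
PS = "\u2029"   # PARAGRAPH SEPARATOR

# CRLF comes first in the alternation so it counts as one break, not two.
_NLF = re.compile("\r\n|\r|\n|" + NEL)

# After normalization, a line ends at LF, LS, or PS.
_BREAKS = "\n" + LS + PS
_LINE = re.compile("[^" + _BREAKS + "]*[" + _BREAKS + "]|[^" + _BREAKS + "]+")


def normalize_nlf(text):
    """Map CR, LF, CRLF and NEL to LF; LS and PS are left untouched."""
    return _NLF.sub("\n", text)


def read_lines(text):
    """Split text into lines, keeping each line's terminator (LF, LS or PS)."""
    return _LINE.findall(normalize_nlf(text))
```

Keeping the terminator on each line lets a later stage decide what to do with LS and PS instead of losing that information at split time.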
Outside scalars we throw away the line break characters anyway, so there's no issue of what PS/LS should map to.

Inside text scalars I suggest that we never convert PS/LS into LF or fold them into a space. If someone is using them, presumably he has a good reason to, and he's aware that Notepad wouldn't handle it well.

Thoughts?

Have fun,

Oren Ben-Kiki
From: Clark C . E. <cc...@cl...> - 2001-12-02 15:19:23
|
| About the rest (NEL, PS and LS) - there's a lovely little document in
| http://www.unicode.org/unicode/reports/tr13/ which describes what we (or
| anyone else) should be doing

Great. So CR, LF, CRLF, and NEL are all normalized to LF.

| Outside scalars we throw away the line break characters anyway
| so there's no issue of what PS/LS should map to.
|
| Inside text scalars I suggest that we never convert PS/LS into LF or fold them
| into a space. If someone is using them presumably he has a good reason to,
| and he's aware that notepad wouldn't handle it well.

Offhand, I think we should normalize PS/LS just like the others, unless, of course, they are escaped.

...

The problem with the scalar treatment is the edge case. (PS = line ended using PS instead of CR/LF/CRLF/NEL)

    one: bing PS
    two: \\PS
    bop PS
    foo PS
    three: bar PS

So, if we treat them differently inside and outside, bing works. However, bop and foo become a single one-line thingy. Now, what happens to bar? Is it part of the map "two"?

...

Clark
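To make the two options under discussion concrete, here is a rough Python sketch. It is an illustration only, not code from this thread: `normalize_breaks` and its `keep_ls_ps` flag are hypothetical names, the sample is a simplified stand-in for the example above, and escape handling is out of scope. With `keep_ls_ps=True`, only CR, LF, CRLF, and NEL become LF; with `keep_ls_ps=False`, LS and PS are normalized as well.

```python
import re

NEL, LS, PS = "\u0085", "\u2028", "\u2029"


def normalize_breaks(text, keep_ls_ps):
    """Map line breaks to LF; optionally leave LS and PS as they are."""
    breaks = ["\r\n", "\r", "\n", NEL]  # CRLF first so it is one break
    if not keep_ls_ps:
        breaks += [LS, PS]
    return re.sub("|".join(breaks), "\n", text)


# Three "lines", the first two ended with PS, the last with CRLF.
sample = "bop" + PS + "foo" + PS + "bar\r\n"

kept = normalize_breaks(sample, keep_ls_ps=True)   # PS survives in the text
flat = normalize_breaks(sample, keep_ls_ps=False)  # every break becomes LF

# An LF-only consumer sees one line in `kept` and three in `flat`.
print(kept.count("\n"), flat.count("\n"))
```

Under the first policy, anything downstream that only looks for LF treats `bop`, `foo`, and `bar` as one line, which is the kind of surprise the edge case above points at.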