#10 faithful character i/o

Status: open
Owner: None
Priority: 5
Updated: 2006-12-31
Created: 2003-05-03
Private: No

CLISP READ-CHAR reads bytes 10 and 13 as #\Newline:
<http://article.gmane.org/gmane.lisp.clisp.general/6970>
<http://article.gmane.org/gmane.lisp.clisp.general/4718>
Is it possible to read them differently?

Discussion

  • Bruno Haible

    Bruno Haible - 2004-03-18

    Logged In: YES
    user_id=5923

    No. Accepting CR, LF and CRLF as different variations of
    #\Newline implements the recommendations of the Unicode
    consortium in
    http://www.unicode.org/reports/tr13/tr13-9.html. Quote:
    "Even if you know which characters represents NLF on your
    particular platform, on input and in interpretation, treat
    CR, LF, CRLF ...L the same. Only on output do you need to
    distinguish between them."

    It also reflects user wishes: 1) For years, GCC used to give
    parse errors on some C input files that used CRLF as line
    terminators, whereas with just LF the parse succeeded. 2)
    GNU gettext had similar problems, and they were reported as a
    bug, because apparently users on Unix sometimes have files
    written on Windows on their disks.
    The way CLISP does it prevents this kind of bug from the
    outset.

    There is no need to add complexity to CLISP to implement
    the paradigms of the 1980s, which are no longer valid in
    today's world.
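
The input rule quoted above (treat CR, LF and CRLF alike when reading) can be illustrated with a minimal sketch. This is not CLISP's implementation; the function name `normalize_newlines` is made up for this example, and each byte is simply taken as a Latin-1 code point.

```python
def normalize_newlines(data: bytes) -> str:
    # Map CR, LF and CRLF each to a single "\n"; pass every other byte
    # through unchanged (interpreted as a Latin-1 code point).
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x0D:  # CR
            out.append("\n")
            # If this CR starts a CRLF pair, swallow the LF as well.
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                i += 1
        elif b == 0x0A:  # LF
            out.append("\n")
        else:
            out.append(chr(b))
        i += 1
    return "".join(out)


# Unix, Mac and DOS line ends all read back as the same text.
print(repr(normalize_newlines(b"a\rb\nc\r\nd")))
```

Under this rule a reader cannot tell afterwards which terminator the file used, which is exactly what the original request objects to.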

     
  • Sam Steingold

    Sam Steingold - 2004-05-25
    • assigned_to: nobody --> haible
    • status: open --> closed-invalid
     
  • Sam Steingold

    Sam Steingold - 2004-05-25

    Logged In: YES
    user_id=5735

    This item is now closed as invalid.
    Thanks to Bruno for clarifying it.
    See <impnotes.html#clhs-newline>
    for the exhaustive treatment of the matter.

     
  • Sam Steingold

    Sam Steingold - 2006-10-16
    • status: closed-invalid --> open
     
  • Sam Steingold

    Sam Steingold - 2006-11-17

    Logged In: YES
    user_id=5735
    Originator: YES

    Suppose we add a :line-terminator-strict slot to encodings, making the newline input "faithful":

            :UNIX               :MAC                  :DOS
    CR      #\Return            #\Newline             #\Return
    LF      #\Newline           #\Linefeed            #\Linefeed
    CRLF    #\Return#\Newline   #\Newline#\Linefeed   #\Newline

    (row: input characters; column: line terminator of the encoding).

    Alas, in CLISP #\Linefeed == #\Newline (as explicitly permitted, etc.), so the reality is:

            :UNIX               :MAC                 :DOS
    CR      #\Return            #\Newline            #\Return
    LF      #\Newline           #\Newline            #\Newline
    CRLF    #\Return#\Newline   #\Newline#\Newline   #\Newline

    which plain sucks for everything but the :UNIX line terminator.

    How about using something other than 10 for #\Newline?
    How about 0? (i.e., #\Null = #\Newline)
    0 does not normally occur in _text_ streams, so it would not cause the confusion we are experiencing.
    Just about any other control character (except bs/tab/nl/ret) would do too.
    http://en.wikipedia.org/wiki/ASCII
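
The two tables above can be reproduced with a small model. This is a sketch under stated assumptions, not CLISP code: `decode_faithful`, `TERMINATORS` and the `newline_code` parameter are names invented here. Characters are represented by their integer codes, and only the byte sequence matching the encoding's own line terminator is turned into the newline code.

```python
CR, LF = 0x0D, 0x0A

# Byte sequence that each line-terminator setting treats as the end of a line.
TERMINATORS = {"unix": [LF], "mac": [CR], "dos": [CR, LF]}


def decode_faithful(data, line_terminator, newline_code):
    # Return a list of character codes. Only the encoding's own terminator
    # becomes newline_code; any other CR or LF byte is kept as it is.
    term = TERMINATORS[line_terminator]
    out = []
    i = 0
    while i < len(data):
        if list(data[i:i + len(term)]) == term:
            out.append(newline_code)
            i += len(term)
        else:
            out.append(data[i])
            i += 1
    return out


# Newline shares code 10 with Linefeed: two different DOS inputs collide.
print(decode_faithful(b"\r\n", "dos", 10), decode_faithful(b"\n", "dos", 10))

# Newline moved to a code that does not occur in the input (here 0x80):
# the same two inputs now stay distinguishable.
print(decode_faithful(b"\r\n", "dos", 0x80), decode_faithful(b"\n", "dos", 0x80))
```

With `newline_code=10` the model gives the second table, and with any code other than 10 and 13 it gives the first.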

     
  • Sam Steingold

    Sam Steingold - 2006-11-17

    Logged In: YES
    user_id=5735
    Originator: YES

    Actually, using #\Code128==#\U0080 seems to be a good option!

     
  • Bruno Haible

    Bruno Haible - 2006-11-20

    Logged In: YES
    user_id=5923
    Originator: NO

    Such a :line-terminator-strict option is indeed theoretically possible.
    You would need to assign #\Newline to a different code point, outside the
    Unicode range, for example #x110000. (The Unicode people for some time
    favoured the use of #x85 as a 3rd newline character, but apparently
    dropped the idea.)

    So reading in normal mode would produce:

            :UNIX               :MAC                  :DOS
    CR      #\Return            #\Newline             #\Return
    LF      #\Newline           #\Linefeed            #\Linefeed
    CRLF    #\Return#\Newline   #\Newline#\Linefeed   #\Newline

    And reading in :line-terminator-strict would produce:

            :UNIX                :MAC                 :DOS
    CR      #\Return             #\Return             #\Return
    LF      #\Linefeed           #\Linefeed           #\Linefeed
    CRLF    #\Return#\Linefeed   #\Return#\Linefeed   #\Return#\Linefeed

    But what would be the effect of such a change?
    - No longer (eql #\Newline #\Linefeed) -> backward compatibility problem,
    - No longer (= (char-code #\Newline) 10) -> Unix compatibility problem
      (because we would be copying a DOS concept into a Unix world),
    - .fas files that are edited with an editor on Windows (and thus get
      LF converted into CRLF) change their meaning when they are saved.

    So forget about it. It creates more problems than it solves.

     
  • Bruno Haible

    Bruno Haible - 2006-11-20
    • status: open --> closed-rejected
     
  • Sam Steingold

    Sam Steingold - 2006-11-20

    Logged In: YES
    user_id=5735
    Originator: YES

    >Such a :line-terminator-strict option is indeed theoretically possible.
    >You would need to assign #\Newline to a different code point, outside the
    >Unicode range, for example #x110000.

    I don't see why I cannot use #x80 (#\Code128==#\U0080) for newline.
    I am not inventing a new Unicode char; I am assigning an integer to a CLISP character, and this integer (128) is not used at this time.

    Also, your tables indicate that you are missing the point of my message.
    Your first table (identical to my first table) is what you get if :line-terminator-strict is non-nil and #\Newline is distinct from both #\Linefeed and #\Return.
    Your second table is relevant only to binary input and cannot be produced under any combination of the :line-terminator-strict and separate #\Newline proposals.

     
  • Sam Steingold

    Sam Steingold - 2006-12-26

    Logged In: YES
    user_id=5735
    Originator: YES

    I don't see any compatibility issues.
    Any text stream knows its preferred encoding, so #\Newline is never written as its char-code.
    The issue of editing .fas files on woe32 is fairly rare, and a problem would occur only if strings contain embedded newlines.
    This should be addressed by always quoting CR and LF in all strings, symbols and package names in compiled files (we know that we are reading from a compiled file when the stream is the same as *load-file*).
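
The quoting idea can be sketched as follows. The backslash escape syntax and all three function names are hypothetical, chosen only for illustration; the comment above does not say how CLISP would actually quote these characters in .fas files.

```python
def quote_line_breaks(text):
    # Escape backslashes first so the encoding stays reversible, then
    # replace raw CR and LF with two-character escapes (hypothetical syntax).
    return text.replace("\\", "\\\\").replace("\r", "\\r").replace("\n", "\\n")


def unquote_line_breaks(quoted):
    # Inverse of quote_line_breaks.
    out = []
    i = 0
    while i < len(quoted):
        ch = quoted[i]
        if ch == "\\" and i + 1 < len(quoted):
            nxt = quoted[i + 1]
            out.append({"r": "\r", "n": "\n"}.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def editor_lf_to_crlf(file_text):
    # Model of an editor that rewrites every LF in a file as CRLF.
    return file_text.replace("\n", "\r\n")


original = "line one\nline two"

# Stored raw, the embedded newline is changed by the editor.
print(editor_lf_to_crlf(original) == original)

# Stored quoted, the string contains no raw CR or LF for the editor to touch.
stored = quote_line_breaks(original)
print(unquote_line_breaks(editor_lf_to_crlf(stored)) == original)
```

Only the line breaks between top-level forms would then be affected by a terminator conversion, and those do not change the meaning of the file.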

     
  • Sam Steingold

    Sam Steingold - 2006-12-31
    • status: closed-rejected --> open
     
