#10 faithful character i/o

Status: open
Owner: Bruno Haible
Labels: None
Priority: 5
Updated: 2006-12-31
Created: 2003-05-03
Creator: Sam Steingold
Private: No

CLISP READ-CHAR reads bytes 10 and 13 as #\Newline:
<http://article.gmane.org/gmane.lisp.clisp.general/6970>
<http://article.gmane.org/gmane.lisp.clisp.general/4718>
Is it possible to read them differently?

Discussion

1 2 > >> (Page 1 of 2)
  • Bruno Haible
    2004-03-18

    Logged In: YES
    user_id=5923

    No. Accepting CR, LF and CRLF as different variations of
    #\Newline implements the recommendations of the Unicode
    consortium in
    http://www.unicode.org/reports/tr13/tr13-9.html. Quote:
    "Even if you know which characters represents NLF on your
    particular platform, on input and in interpretation, treat
    CR, LF, CRLF ...L the same. Only on output do you need to
    distinguish between them."

    It also reflects user wishes: 1) For years, GCC gave parse
    errors on some C input files that used CRLF as line
    terminators, whereas with just LF the parse succeeded. 2)
    GNU gettext had similar problems, which were reported as a
    bug, because apparently users on Unix sometimes have
    Windows-written files on their disks.
    The way CLISP does it prevents this kind of bug from the
    beginning.

    There is no need to add complexity to CLISP to implement
    the paradigms of the 1980s, which are just not valid any
    more in today's world.
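
A minimal sketch of the input rule described in the comment above, where CR, LF, and CRLF are each read as a single newline. It is written in Python purely for illustration; the function name `normalize_newlines` is hypothetical, ASCII input is assumed, and this is not CLISP's actual implementation.

```python
def normalize_newlines(data: bytes) -> str:
    """Decode bytes (assumed ASCII), reading CR, LF, and CRLF each as one '\n'."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 13:  # CR
            # A CR immediately followed by LF counts as one line terminator.
            if i + 1 < len(data) and data[i + 1] == 10:
                i += 1
            out.append("\n")
        elif b == 10:  # lone LF
            out.append("\n")
        else:
            out.append(chr(b))
        i += 1
    return "".join(out)


# Mixed Unix, classic Mac, and DOS line ends all collapse to '\n'.
text = normalize_newlines(b"a\rb\nc\r\nd")
```

With this rule a file written on Windows parses the same as one written on Unix, at the cost of no longer being able to tell from the characters read which bytes were in the file.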

     
  • Sam Steingold
    2004-05-25

    • assigned_to: nobody --> haible
    • status: open --> closed-invalid
     
  • Sam Steingold
    2004-05-25

    Logged In: YES
    user_id=5735

    this item is now closed as invalid.
    thanks to Bruno for clarifying it.
    see <impnotes.html#clhs-newline>
    for the exhaustive treatment of the matter.

     
  • Sam Steingold
    2006-10-16

    • status: closed-invalid --> open
     
  • Sam Steingold
    2006-11-17

    Logged In: YES
    user_id=5735
    Originator: YES

    Suppose we add a :line-terminator-strict slot to encodings, making the newline input "faithful":

            :UNIX               :MAC                  :DOS
    CR      #\Return            #\Newline             #\Return
    LF      #\Newline           #\Linefeed            #\Linefeed
    CRLF    #\Return#\Newline   #\Newline#\Linefeed   #\Newline

    (row: input characters; column: line terminator of the encoding).

    alas, in CLISP #\Linefeed == #\Newline (as explicitly permitted &c), so the reality is thus:

            :UNIX               :MAC                  :DOS
    CR      #\Return            #\Newline             #\Return
    LF      #\Newline           #\Newline             #\Newline
    CRLF    #\Return#\Newline   #\Newline#\Newline    #\Newline

    which plain sucks for everything but the :UNIX line terminator.

    How about using something other than 10 for Newline?
    How about 0? (i.e., #\Null = #\Newline)
    0 does not normally occur in _text_ streams, so it will not cause the confusion we are experiencing.
    just about any control character (except bs/tab/nl/ret) would do too.
    http://en.wikipedia.org/wiki/ASCII
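
The two tables in the comment above can be reproduced with a short sketch. It is in Python for illustration only and makes several assumptions: characters are modelled as the strings "Return", "Linefeed", and "Newline"; the names `read_faithful`, `TERMINATORS`, and `newline_is_linefeed` are hypothetical; and the `newline_is_linefeed` flag stands in for the CLISP fact that #\Linefeed and #\Newline are the same character. It is not the proposed CLISP code.

```python
# Byte sequence that counts as the line terminator for each encoding mode.
TERMINATORS = {"unix": b"\n", "mac": b"\r", "dos": b"\r\n"}
# Names for the two control bytes when they are NOT the line terminator.
NAMES = {13: "Return", 10: "Linefeed"}


def read_faithful(data, line_terminator, newline_is_linefeed=False):
    """Return the list of character names a 'faithful' reader would produce."""
    term = TERMINATORS[line_terminator]
    chars = []
    i = 0
    while i < len(data):
        # The encoding's own terminator is the only thing read as Newline.
        if data.startswith(term, i):
            chars.append("Newline")
            i += len(term)
            continue
        name = NAMES.get(data[i], chr(data[i]))
        # Model Linefeed and Newline being one and the same character.
        if newline_is_linefeed and name == "Linefeed":
            name = "Newline"
        chars.append(name)
        i += 1
    return chars


# First table, CRLF row: Newline distinct from Linefeed.
distinct = {m: read_faithful(b"\r\n", m) for m in TERMINATORS}
# Second table, CRLF row: Linefeed and Newline identical.
merged = {m: read_faithful(b"\r\n", m, newline_is_linefeed=True) for m in TERMINATORS}
```

In the merged case, CR and LF both come out as "Newline" under the "mac" terminator, so the reader can no longer tell them apart. That is the confusion which giving #\Newline a code other than 10 is meant to avoid.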

     
  • Sam Steingold
    2006-11-17

    Logged In: YES
    user_id=5735
    Originator: YES

    actually, using #\Code128==#\U0080 seems to be a good option!

     
  • Bruno Haible
    2006-11-20

    Logged In: YES
    user_id=5923
    Originator: NO

    Such a :line-terminator-strict option is indeed theoretically possible.
    You would need to assign #\Newline to a different code point, outside the
    Unicode range, for example #x110000. (The Unicode people for some time
    favoured the use of #x85 as a 3rd newline character, but apparently
    dropped the idea.)

    So reading in normal mode would produce:

            :UNIX               :MAC                  :DOS
    CR      #\Return            #\Newline             #\Return
    LF      #\Newline           #\Linefeed            #\Linefeed
    CRLF    #\Return#\Newline   #\Newline#\Linefeed   #\Newline

    And reading in :line-terminator-strict would produce:

            :UNIX               :MAC                :DOS
    CR      #\Return            #\Return            #\Return
    LF      #\Linefeed          #\Linefeed          #\Linefeed
    CRLF    #\Return#\Linefeed  #\Return#\Linefeed  #\Return#\Linefeed

    But what would be the effect of such a change:
    - No longer (eql #\Newline #\Linefeed) -> backward compatibility problem,
    - No longer (= (char-code #\Newline) 10) -> Unix compatibility problem
      (because we would be copying a DOS concept into a Unix world),
    - .fas files that are edited with an editor on Windows (and thus get
      LF converted into CRLF) change their meaning when being saved.

    So forget about it. It creates more problems than it solves.

     
  • Bruno Haible
    2006-11-20

    • status: open --> closed-rejected
     
  • Sam Steingold
    2006-11-20

    Logged In: YES
    user_id=5735
    Originator: YES

    >Such a :line-terminator-strict option is indeed theoretically possible.
    >You would need to assign #\Newline to a different code point, outside the
    >Unicode range, for example #x110000.

    I don't see why I cannot use #x80 (#\Code128==#\U0080) for newline.
    I am not inventing a new unicode char, I am assigning an integer to a CLISP character, and this integer (128) is not used at this time.

    also, your tables indicate that you are missing the point of my message.
    Your first table (identical to my first table) is what you get if :line-terminator-strict is non-nil and #\newline is distinct from both #\lf and #\cr.
    your second table is relevant only to binary input and cannot be produced under any combination of the :line-terminator-strict and separate #\nl proposals.

     