Rationale-sweet

Sweet-expressions (t-expressions)

Sweet-expressions start with neoteric-expressions and make indentation meaningful.

Sweet-expressions eliminate many parentheses, and so make programs more readable, by making indentation itself meaningful. Real Lisp programs are already indented, and tools (like editors and pretty-printers) are used to try to keep the indentation (used by humans) and parentheses (used by the computers) in sync. By making the indentation (which humans depend on) actually used by the computer, they are automatically kept in sync, and many parentheses become unnecessary.

The page http://www.gregslepak.com/on-lisps-readability shows one of the many examples of endless closing parentheses and brackets to close an expression, and the confusion that happens when indentation does not match the parentheses. bhurt's response to that article is telling: "I'm always somewhat amazed by the claim that the parens 'just disappear', as if this is a good thing. Bugs live in the difference between the code in your head and the code on the screen - and having the parens in the wrong place causes bugs. And autoindenting isn't the answer- I don't want the indenting to follow the parens, I want the parens to follow the indenting. The indenting I can see, and can see is correct."

An IDE can help keep the indentation consistent with the parentheses, but needing IDEs is considered by some a language smell (see http://www.recursivity.com/blog/2012/10/28/ides-are-a-language-smell/ ). If you need special tools to work around problems with the notation, then the notation itself is a problem.

A solution, of course, is to make the indentation actually matter: Now you don't need an endless march of parentheses, and indentation can't be confusing because it is actually used.

"In praise of mandatory indentation..." notes that it can be helpful to have mandatory indentation:

It hurts me to say that something so shallow as requiring a few extra spaces can have a bigger effect than, say, Hindley-Milner type inference. - Chris Okasaki

Other languages, including Python, Haskell, Occam, and Icon, use indentation to indicate structure, so this is a proven idea. The language Cobra (a variant of Python with strong compile-time typechecking) has decided to use indentation too. PLOT (Programming Language for Old Timers) is another indentation-sensitive language that claims to be a "new dialect of Lisp" even though it does not have S-Expressions, NIL, conses, atoms, or parenthesized Polish prefix syntax; one of the main reasons its developer thinks it's better is that it supports "conventional" syntax instead of traditional Lisp s-expressions. In short, clearly indentation-sensitive languages are considered useful by many, and far more software is developed today using indentation-sensitive syntaxes as compared to traditional S-expressions.

There's a lot of past work on indentation to represent s-expressions, too. Examples include:

  • Paul Graham (developer of Arc) is known to be an advocate of indentation for this purpose. As I noted above, Kragen Sitaker’s notes on Graham and Arc discuss how indentation can really help (in this notation, functions with no parameters need to be surrounded by parentheses, to distinguish them from atoms - “oh well”). Graham's RTML is implemented using Lisp, but uses indentation instead of parentheses to define structure. RTML is a proprietary programming language that at least was used by Yahoo!’s Yahoo! Store and Yahoo! Site hosting products (though Yahoo may have been transitioning away from it). See Paul Graham’s comments about the RTML language design and this introduction to RTML by Yahoo.
  • Darius Bacon's “indent” file includes his own implementation of a Python/Haskell-like syntax for Scheme using indentation in place of parentheses; in that file he also includes Paul D. Fernhout's implementation of an indentation approach. Bacon's syntax for indenting uses colons in a way that is limiting (it interferes with other uses of the colon in various Lisp-like languages). I have not had a chance to examine Paul D. Fernhout's yet. (It also includes an I-expression implementation.) All of the files are released under the MIT/X license. (Darius Bacon also created mlish, an infix syntax front end.)
  • Lispin discusses a way to get S-expressions with indentation.

The sweet-expression indentation system is based on Scheme SRFI-49 ("surfi-49"), aka I-expressions. The basic rules of SRFI-49 (I-expression) indentation are kept in sweet-expressions; these are:

  • An indented line is a parameter of its parent.
  • Later terms on a line are parameters of the first term.
  • A line with exactly one term, and no child lines, is simply that term; multiple terms are wrapped into a list.
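The three rules above can be sketched in a few lines of Python. This is only an illustration, not the SRFI-49 or sweet-expression reader: it assumes the input is already split into lines, that there are no blank lines, that indentation uses spaces only, and that every term is a whitespace-separated atom. The names indent_of and parse_block are hypothetical.

```python
from typing import List, Tuple, Union

Datum = Union[str, list]


def indent_of(line: str) -> int:
    """Count leading spaces (this sketch ignores tabs and '!')."""
    return len(line) - len(line.lstrip(" "))


def parse_block(lines: List[str], start: int) -> Tuple[Datum, int]:
    """Parse the expression starting at lines[start]; return (datum, next index)."""
    level = indent_of(lines[start])
    terms: List[Datum] = list(lines[start].split())
    i = start + 1
    had_children = False
    # Rule 1: each more deeply indented line is a parameter of its parent.
    while i < len(lines) and indent_of(lines[i]) > level:
        child, i = parse_block(lines, i)
        terms.append(child)
        had_children = True
    # Rule 3: exactly one term and no child lines is simply that term.
    if len(terms) == 1 and not had_children:
        return terms[0], i
    # Rule 2: later terms on the line are parameters of the first term.
    return terms, i


datum, _ = parse_block(["foo", "  bar", "    nitz", "  quux meow"], 0)
print(datum)
```

Here nested Python lists stand in for s-expressions, so the printed value corresponds to (foo (bar nitz) (quux meow)).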

These basic rules seem quite intuitive, and seem to be what most people "expect" indentation to mean. We're grateful to the SRFI-49 author for his work, and at first, we just used SRFI-49 directly. However, SRFI-49 turned out to have problems in practice, so based on that experience and experimentation we made several changes to it. At this point, we think we've corrected the problems of SRFI-49, but we are still building on a useful foundation we're happy to credit.

Below are some aspects to how sweet-expressions deal with indentation, particularly noting changes from SRFI-49.

Blank lines

In sweet-expressions, a blank line always terminates a datum, so once you've entered a complete expression, "Enter Enter" will always end it. In contrast, in sweet-expressions blank lines before an expression starts are ignored. The "blank lines at the beginning are ignored" rule eliminates a usability problem with the original SRFI-49 (I-expression) spec, in which two sequential blank lines before an expression surprisingly returned (). This was a serious usability problem, and one that was quickly fixed. The sample implementation did end expressions on a blank line - the problem was that the spec didn't clearly capture this.

It would be possible to have blank lines end an expression "only in interactive use" - Python does this. However, this means that you couldn't cut-and-paste files into the interpreter and have them work. I believe it's important to have exactly the same syntax in both cases, so "Enter Enter" always ends an expression.

Of course, people sometimes want to have something like a blank line in the middle of an s-expression. Thus, comment-only lines are ignored and not considered blank lines; that means you can use them for that purpose. The indentation of comment-only lines is ignored - that way, you don't have to worry about keeping them indented the same way.

Since a line with only indentation may look exactly identical to a blank line, we decided to clearly state that "a line with only indentation is an empty line". This eliminates some nasty usability problems that could arise if a "blank" line actually had some whitespace in it.
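The blank-line rules in this section can be sketched as a small line filter. This is an illustrative model only: it assumes lines arrive without their trailing newline, that a comment-only line is one whose first non-indent character is ";", and it models only the blank-line rule (a real reader also ends a datum on other conditions). The function names are hypothetical.

```python
from typing import List


def is_blank(line: str) -> bool:
    """A line with only indentation counts as an empty line."""
    return line.strip(" \t") == ""


def is_comment_only(line: str) -> bool:
    """True when the first non-indent character starts a ';' comment."""
    return line.lstrip(" \t").startswith(";")


def split_datums(lines: List[str]) -> List[List[str]]:
    """Group lines into one chunk per datum using the blank-line rules."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if is_comment_only(line):
            continue  # ignored, whatever its indentation; not a blank line
        if is_blank(line):
            if current:  # a blank line after content ends the datum
                chunks.append(current)
                current = []
            continue  # blank lines before a datum starts are skipped
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

With this model, a comment-only line can be used as a visual separator inside an expression, while a whitespace-only line ends the expression just as a truly empty one does.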

Indentation characters

Some like to use spaces to indent; others like tabs. Python allows either, and SRFI-49 allows either as well - you just have to be consistent. Thus, people can use what they like.

One problem with horizontal whitespace characters is that they can get lost in many transports (HTML readers, etc.). And sometimes there are indented groups that you'd like to highlight. On the mailing list, users started to use characters (initially period+space) to show where indentation occurred so that they wouldn't get lost. Eventually, the idea was hit upon that perhaps we needed to allow a non-whitespace character for indentation. This is highly unorthodox, but at a stroke it eliminates the complaints some have about syntactically-important indentation, and it provides an easy way to highlight certain indented groups.

At first, we tried to use period, or period+space, as the indent. But period has too many other traditional meanings in Lisp-like languages, including beginning a number (.9), beginning a symbol (.xyz), and as a special operator to set the cdr of a list. Implementations needed an unread-char function, which is not standard in Scheme R5RS. Eventually the "!" was used; it practically never begins a line, and if you need it, (. !) will work. This character is a great way to highlight indented groups.
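As a rough sketch of how space, tab, and "!" can all serve as indentation characters, the following Python splits a line into its indentation and its content, then compares two indentation strings. The prefix comparison is an assumption about how "be consistent" could be enforced, not a statement of what any particular implementation does; split_indent and compare_indent are hypothetical names.

```python
INDENT_CHARS = " \t!"


def split_indent(line: str):
    """Return (indent, rest); indent is the leading run of space, tab, or '!'."""
    i = 0
    while i < len(line) and line[i] in INDENT_CHARS:
        i += 1
    return line[:i], line[i:]


def compare_indent(parent: str, child: str) -> str:
    """Compare indentation strings; mixed styles that do not nest are rejected."""
    if child == parent:
        return "same"
    if child.startswith(parent):
        return "deeper"
    if parent.startswith(child):
        return "shallower"
    raise ValueError("inconsistent indentation")


print(split_indent("! ! foo"))
```

Comparing whole indentation strings, rather than counting columns, avoids having to decide how wide a tab is.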

Disabling indentation processing with paired characters

Indentation processing is disabled inside (...), [ ... ], and { ... }. This was also true of SRFI-49, and of Python, and has wonderful side-effects:

  • Indent parsing becomes very safe to use with existing code. Pre-existing code will almost certainly start each expression with an opening parenthesis, disabling the indentation processing it wasn't expecting.
  • It makes it easy to disable indentation processing whenever it is inconvenient. For example, it supports dealing with text that is very close to running off the right-hand side, or is complex to express with indentation.
  • It is similar to what other indentation-sensitive languages do, such as Python.
  • It is a very easy rule to explain.

This means that infix processing by curly-infix disables indentation processing; in practice this doesn't seem to be a problem.
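A minimal sketch of the "paired characters disable indentation" rule: track nesting depth, and treat a newline as a line boundary only when the depth is zero. This simplification ignores strings, character literals, and comments (which a real reader must handle), and logical_lines is a hypothetical name.

```python
OPEN = "([{"
CLOSE = ")]}"


def logical_lines(text: str):
    """Split text at newlines that fall outside (), [], and {}."""
    depth = 0
    lines = []
    current = []
    for ch in text:
        if ch in OPEN:
            depth += 1
        elif ch in CLOSE:
            depth = max(depth - 1, 0)
        if ch == "\n" and depth == 0:
            lines.append("".join(current))  # indentation-significant boundary
            current = []
        else:
            current.append(ch)  # newlines inside delimiters are plain whitespace
    if current:
        lines.append("".join(current))
    return lines
```

Traditional code such as "(define (f x)\n(* x\n2))" comes back as a single logical line, which is why pre-existing parenthesized code is unaffected by indentation processing.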

Disabling indentation processing with an initial indent

Initial indentation also disables indentation processing, which improves backward compatibility and makes it easy to disable indentation processing where convenient.

This improves backward compatibility because a program that uses odd formatting with different meaning for sweet-expressions is more likely to have initial indents.

And even if this is not true, it's trivially easy to add an initial indent on oddly-formatted old files. This provides a trivial escape, making it easy to support old files. Then even if you have ancient code with odd formatting like:

(compute me) (show me)

It would still "just work" if there is any initial indentation. I'd like this reader to be a drop-in replacement for read(), so minimizing incompatibilities is important.

There is a risk that this indentation will be accidental (e.g., a user might enter a blank line in the middle of a routine and then start the next line indented). However, this is less likely to happen interactively (users can typically see something happened immediately), and editors can easily detect and show where surprising indentation is occurring (e.g., through highlighting), so this risk appears to be minimal.

Disabling on initial indent also deals with a subtle potential problem in implementation.

In a reader implementation, if we tried to just accept some indentation of the first line and use it as the starting point, we create problems. Typically readers return a whole value once that value has been determined, and in many cases it's tricky to store state (such as that new indentation value) for an arbitrary port. By disabling indentation processing, we eliminate the need to store such state, as well as giving users a useful tool.

Since this latter point isn't obvious, here's a little more detailed explanation. Obviously, to make indentation syntactically meaningful, you need to know where an expression indents, and where it ends. If you read in a line, and it has the same indentation level, that should end the previous expression. If its indentation is less, it should close out all the lines with deeper or equal indentation. But we're trying to minimize the changes to the underlying language, and in particular, we don't want to change the "read" interface and we're not assuming arbitrary amounts of unread-char. Scheme R5RS, for example, doesn't have a standard unread-char at all. So let's say you are trying to read the following:

! ! foo
! ! ! bar
! ! eggs
! ! cheese

You might expect this to return three datums: (foo bar), eggs, and cheese. It won't, in a typical implementation; here's why.

First read(): Reads foo, bar, and it consumes the indentation of "eggs" so that it can determine that another expression starts at the same level. It returns (foo bar).

Second read(): Reads eggs with NO indentation, because the indentation was consumed by the first read() so it could determine when it was finished. It then reads the indentation of cheese, which has an indentation more than zero. It returns (eggs cheese), and we've consumed it all.
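The two read() calls just described can be reproduced with a small simulation. This is a sketch of a hypothetical naive reader, not the sweet-expression implementation: the Port class models a source with read and one-character peek but no unread, terms are whitespace-separated atoms, there are no blank lines, and indentation is compared by length.

```python
class Port:
    """Forward-only character source: read and one-character peek, no unread."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def read(self):
        ch = self.peek()
        if ch:
            self.pos += 1
        return ch


def read_indent(port):
    indent = ""
    while port.peek() and port.peek() in " \t!":
        indent += port.read()
    return indent


def read_line_terms(port):
    chars = ""
    while port.peek() and port.peek() != "\n":
        chars += port.read()
    port.read()  # consume the newline, if any
    return chars.split()


def read_body(port, level):
    """Read one expression; also return the next line's already-consumed indent."""
    terms = read_line_terms(port)
    had_children = False
    next_indent = read_indent(port) if port.peek() else None
    while next_indent is not None and len(next_indent) > len(level):
        child, next_indent = read_body(port, next_indent)
        terms.append(child)
        had_children = True
    datum = terms[0] if len(terms) == 1 and not had_children else terms
    return datum, next_indent


def naive_read(port):
    """Accept the first line's indentation as the base level (the naive design)."""
    if not port.peek():
        return None  # end of input
    base = read_indent(port)
    datum, _lost_indent = read_body(port, base)  # the consumed indent is thrown away
    return datum


port = Port("! ! foo\n! ! ! bar\n! ! eggs\n! ! cheese\n")
print(naive_read(port))
print(naive_read(port))
```

The first call yields the list for (foo bar); the second yields the list for (eggs cheese), because the indentation of "eggs" was consumed by the first call and cannot be pushed back.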

Solutions:

  • If you have unlimited unread-char, there is no problem: just unconsume characters once you've found the end. But many Lisps don't have that.
  • Read could store indentation state associated with the port. But the user could call other routines, and a naive implementation would read the wrong values. You'd have to re-wrap the entire I/O system if you really wanted to be able to undo the indentation reliably. That creates a complicated implementation that is likely to be unreliable, and it's lousy for performance.

So for all the reasons above, initial indent disables indentation processing for that line.

Grouping and splicing

SRFI-49 had a mechanism for defining lists of lists, using the symbol "group". This was a valuable contribution, since there needs to be some way to show lists of lists. But after use, it was determined that using an alphabetic symbol to indicate a special abbreviation was a mistake; all other abbreviations use punctuation, and this should too. This symbol is called the GROUP symbol, and occurs at the start of a line (after indentation).

A different problem is that sometimes you'd like to have a set of parameters, where they are at the "same level" but writing them as indented parameters takes up too much vertical space. An obvious example is keywords in various Lisps; having to write this is painful:

foo
  keyword1:
  parameter1
  keyword2:
  parameter2
  ....

David A. Wheeler created an early splicing proposal (www.mail-archive.com). After much discussion, to solve the latter problem, the SPLIT symbol was created, so that you could do:

foo
  keyword1: \\ parameter1
  keyword2: \\ parameter2
  ....

At first the symbol \ was used for SPLIT, but this would cause serious problems on Lisps that supported slashification. After long discussion, the symbol \\ was decided on for both; although the number of characters in the underlying symbol could vary (depending on whether or not slashification was used), this was irrelevant and seemed to work everywhere. By using the same symbol for both GROUP and SPLIT, we reduced the number of different symbols that users needed to escape.

We dropped the SRFI-49 method for escaping the symbol by repeating it (group group); the (. e) escape mechanism is more regular, and makes it far more obvious that some special escape is going on.
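One way to picture SPLIT is as a rewrite that turns a single line into several lines at the same indentation. The sketch below assumes whitespace-separated tokens and treats a line consisting only of \\ as GROUP (left untouched); it also drops a leading \\ that is followed by other datums, which is the behavior discussed in the next section. The name expand_split is hypothetical, and this is not how a real reader is structured.

```python
SPLIT = r"\\"  # the two-character symbol: backslash, backslash


def expand_split(line: str):
    """Rewrite one line into the equivalent same-indentation lines."""
    body = line.lstrip(" \t!")
    indent = line[: len(line) - len(body)]
    tokens = body.split()
    if tokens == [SPLIT]:
        return [line]  # a lone \\ is GROUP, not SPLIT
    groups = []
    current = []
    for tok in tokens:
        if tok == SPLIT:
            if current:  # a leading \\ has nothing before it, so it adds nothing
                groups.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        groups.append(current)
    return [indent + " ".join(group) for group in groups]


for out in expand_split(r"  keyword1: \\ parameter1"):
    print(out)
```

Applied to the keyword example above, each "keywordN: \\ parameterN" line becomes two sibling lines, so the compact form means the same thing as the taller one.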

Why does initial \\ mean nothing if there are datums afterwards on the same line?

Since "let" occurs in many programs, it would have been possible to define \\ to allow this:

let
! \\ var1 $ bar x
! !  var2 $ quux x
! nitz var1 var2

We discussed this, but after long discussion we decided on a defined semantic that means that "\\" is an empty symbol, making that expression exactly the same as:

let
! var1 $ bar x
! !  var2 $ quux x
! nitz var1 var2

which is presumably not what the writer intended.

But we did this intentionally. It turns out that there are situations where you want a \\ as an empty symbol, even when text follows it on the line. An example is Arc's if-then-else, where the items are logically paired but, in terms of list structure, all sit at the same level. For example:

if
! condition1()
! \\ action1()
! condition2()
! \\ action2()
! \\ otherwise-action()

It's easy to handle let* with an extra line, but there's no easy way to insert a short pseudo-comment character in the front unless we do it this way.

The multi-line nature of let* turns out to be not a real problem, for 2 reasons:

  1. It turns out that in many "let*s" the variable settings can be put on one line. As of 2012-08-02 the "sweeten.sscm" has 305 non-blank, non-comment lines as determined by:

grep -v '^$' sweeten.sscm | grep -v '^ *;' | wc -l

Of those, 13 lines use let or let*, and only one of those "lets" uses \\. It's not worth optimizing a case that only happens approximately once in 300 lines.

  2. Using the abbreviations as intended is REALLY clear, even though it uses an extra vertical line:

    let*
    ! \\
    ! ! var1 ...
    ! ! var2 ...
    ! do-in-the-let()

So the savings for let* aren't significant, the semantics as designed are clear, and we are intentionally using that notation for another purpose where it's not as easy to use an alternative.

Traditional notation abbreviations

As with SRFI-49, a leading traditional abbreviation (quote, comma, backquote, or comma-at), followed by space or tab, is that operator applied to the sweet-expression starting at the same line. This makes it easy to quote (or whatever) complex indented structures.

Sublist

Alan Manuel Gloria noted that certain constructs were common and annoying to express, e.g., first(second(third(fourth))), and based on Haskell experience, suggested being able to write them as first $ second $ third(fourth). Again, the idea is that this is an abbreviation for a common-enough practice.

This is another example (like GROUP/SPLIT) of a construct that, when you need it, is incredibly useful. It's not all that unusual to have a few processing or cleanup functions that take a single argument, and for all the "real work" to be nested in something else. This would require several levels of indentation without sublist, but they are easily handled with sublist. An example is scsh, which has functions like "run" that take another list. With sublist, this is easily expressed. For example, here's a sweet-expression using scsh:

run $ grep -i "xx.*zz" <(oldfile) >(newfile)

Another way to view this is that "$" is meant to be used to "compress" vertically any deeply-indented code that happens to be just monotonically indenting more. So:

foo
! bar
! ! nitz
! ! ! quux meow

is the same as:

foo $ bar
! nitz
! ! quux meow

which is also the same as:

foo $ bar $ nitz
! quux meow

which is also the same as:

foo $ bar $ nitz $ quux meow

For another example, note that:

foo
! bar

is the same as:

foo $ bar

is the same as:

foo(bar)

is the same as:

(foo bar)

Usually, if the expressions are short, you'd just write "foo(bar)". But if "bar" is so lengthy that indentation is useful, "$" can reduce the vertical and horizontal space needed to express it.
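The equivalences above suggest a simple recursive reading of "$" within one line: everything to the right of the first "$" becomes a single final parameter of what is on the left. The sketch below works on a list of whitespace-separated tokens and assumes both sides of each "$" are non-empty; expand_sublist is a hypothetical name, and child lines are not modeled.

```python
def expand_sublist(tokens):
    """Expand '$' in one line of tokens into nested lists."""
    if "$" not in tokens:
        # Same rule as a plain line: one term is itself, several form a list.
        return tokens[0] if len(tokens) == 1 else list(tokens)
    i = tokens.index("$")
    left, right = tokens[:i], tokens[i + 1:]
    # The whole right-hand side is one parameter appended to the left side.
    return list(left) + [expand_sublist(right)]


print(expand_sublist("foo $ bar $ nitz $ quux meow".split()))
```

The printed nested list corresponds to (foo (bar (nitz (quux meow)))), matching the fully indented form, and "foo $ bar" gives the list for (foo bar).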

After discussion, sublist was accepted in July 2012.

More information

More rationale information is available in SRFI-110.


