Thomas Mertes - 2008-12-02

Hello,
I am writing a chapter about the structured syntax definition
of Seed7. It would be nice to get some feedback about it before
the final release. Here is it (read it with a monospaced font):

9. STRUCTURED SYNTAX DEFINITION

    Many languages have predefined constructs like statements and
  operators. This constructs have fixed syntax and semantics. A
  natural language or some meta language is used to describe the
  syntax and semantics of the constructs. Mostly the programmers
  are not allowed to define new syntactic constructs. But sometimes
  a full description of the syntax and semantics of a construct
  written in the programming language itself is helpful. This
  avoids the need to use ambiguous natural language descriptions
  and it is also unnecessary to learn an additional meta language.
  A formal description of new constructs in the language itself
  gives also the opportunity to define new constructs. Note that a
  compiler-compiler does not offer this opportunity and has also a
  meta language.

    There are different notations to specify the syntax of
  programming languages. The syntax statements of Seed7 use a
  notation called Seed7 Structured Syntax Description (abbreviated
  with S7SSD). The Backus-Naur Form (BNF) and its variants like the
  Extended Backus-Naur Form (EBNF) are another example of such a
  syntax specification. Since it is easier to understand new
  concepts when they are compared to well known concepts, the EBNF
  will be used as base to explain the S7SSD.

9.1 The Extended Backus-Naur Form

    As the name says the Extended Backus-Naur Form is an extension
  of the BNF. The extensions allow the definition of repetitions and
  optional parts without the use of recursive definitons. The EBNF
  has the following elements:

     - Nonterminal symbols consist of lower case letters and
       underline characters.
     - Terminal symbols are quoted strings or names in upper case
       characters which describe unprintable characters (control
       characters).
     - The concatenation of nonterminal and/or terminal symbols is
       described by writing them in sequence.
     - With | two alternatives can be separated.
     - Expressions of the extended Backus-Naur form can be put under
       parenthesis.
     - When a subexpression is optional it is enclosed in squared
       brackets [ ... ] .
     - When a subexpression may be omitted or repeated it is
       enclosed in curly braces { ... } .

  The syntax of the extended Backus-Naur form can be described with
  extended Backus-Naur form.

    syntax_description ::=
      { statement } .

    statement ::=
      identifier '::=' expression '.' .

    expression ::=
      term { '|' term } .

    term ::=
      factor factor .

    factor ::=
      identifier | string | '(' expression ')' |
      '[' expression ']' | '{' expression '}' .

9.2 The syntax of a statement

    To explain the Seed7 Structured Syntax Description we propose
  that a new statement, the 'loop' statement, should be defined. A
  'loop' statement is similar to 'while' and 'repeat' loops but
  instead of having the conditional exit at the beginning or at the
  end it's conditional exit is in the middle of the loop. This
  middle conditional exit is part of the 'loop' statement. Therefore
  it should not be confused with the 'exit' statement that exists in
  some programming languages. An example of the 'loop' statement is:

    loop
      ch := getc(inFile);
    until ch = '\n' do
      stri &:= str(ch);
    end loop;

  An equivalent solution without the usage of the 'loop' statement
  would be:

    repeat
      ch := getc(inFile);
      if ch <> '\n' then
        stri &:= str(ch);
      end if;
    until ch = '\n';

  The S7SSD of the 'loop' statement is:

    $ syntax expr: .loop.().until.().do.().end.loop is -> 25;

  For now we concentrate at the heart of the S7SSD, the expression:

    .loop.().until.().do.().end.loop

  For the purpose of the syntax description we can just remove the
  dots, which gives:

      loop () until () do () end loop

  With EBNF the 'loop' statement can be described as:

    loop_statement ::=
      'loop'
        statement
      'until' expression 'do'
        statement
      'end' 'loop' .

  An EBNF description may use many nonterminal symbols such as
  'statement' or 'expression'. The S7SSD does not distinguish
  between different nonterminal symbols. Instead the S7SSD only
  knows one nonterminal symbol: ()

  Therefore the S7SSD cannot distinguish between 'statement',
  'expression' or something else. At the syntax level any kind of
  expression can by substituted for a S7SSD nonterminal symbol ().
  With EBNF it is possible to describe constraints such as the type
  of an expression. The S7SSD relies on semantic checks to verify
  such constraints. An expression like

    loop
      35
    until 1+2 do
      integer
    end loop

  would be legal given the S7SSD of the 'loop' statement:

    $ syntax expr: .loop.().until.().do.().end.loop is -> 25;

  The definition of the semantic of the 'loop' statement is:

    const proc: loop
                  (in proc: statements1)
                until (ref func boolean: condition) do
                  (in proc: statements2)
                end loop is func
      local
        var boolean: exitLoop is FALSE;
      begin
        repeat
          statements1;
          if not condition then
            statements2;
          else
            exitLoop := TRUE;
          end if;
        until exitLoop;
      end func;

  This semantic definition determines which types are accepted
  between the keywords. For the 'loop' example with the semantic
  errors (see above) you would get an error message like:

    *** chkloop.sd7(35):51: Match for {loop 35 until {1 + 2 } do integer end loop } failed

9.3 Priority and assoziativity

    When a syntax construct has parameters before the first symbol
  or after the last symbol the priority and the associativity of the
  construct are significant. Constructs with stronger priority bind
  their parameters earlier than constructs with weaker priority. The
  priority is described by a natural number (inclusive 0). The
  strongest priority is 0. Weaker priorities are described by larger
  numbers. What bind means is can be declared with an example:

                                       =
    A + B = C * D                    /   \                                     /     \     * priority  6                  +       *
    + priority  7                /   \&nbsp;  /   \     = priority 12               A     B C     D

  First the * takes its parameters, then the + and at last the =
  follows.

  The associativity describes, in which order constructs with equal
  priority bind their parameters. For example

    A - B - C

  can be interpreted in two ways:

    (A - B) - C    or   A - (B - C)

  There are four associativities possible:
                                                   Symbol

    Binding from left to right                       ->

    Binding from right to left                       <-

    Neither the left nor the right parameter
    are allowed to have the same priority            <->

    At the left side there is a binding from
    left to right and at the right side there
    is a binding from right to left                 -><-

  The last two possibilities give no legal interpretation in the
  subtraction example. The third kind of assiciativity ( <-> ) is
  used by the equal operator ( = ) of Pascal because there a
  expression like

    A = B = C

  is not legal.

  There is a second way to describe the associativity. The
  associativity describes if an operand must have a stronger
  priority than the priority of the operator. For example:

                             -                     7
    A - B - C              /   \&nbsp;                /   \                           /     \&nbsp;          <=7 /     \ <7
    - priority 7 ->      /       \&nbsp;            /       \                         -         C           7         0
                      /   \&nbsp;                /   \                      /     \&nbsp;          <=7 /     \ <7
                    /       \&nbsp;            /       \                    A         B           0         0

  The numbers in the nodes of the right tree show the priority of
  each sub expression (sub tree). With < and <= the required
  condition for the priority of an operand is described. An
  interpretation is legal if all this conditions are met. If there
  are more than one legal interpretations or no legal
  interpretation the expression is illegal.

  Table for the possibilities of associativity:

    +---------------+------------------------------+
    | associativity |     The priority of the      |
    +---------------+--------------+---------------+
    |               | left operand | right operand |
    |               |    must be   |    must be    |
    +---------------+--------------+---------------+
    |      ->       |      <=      |      <        |
    |      <-       |      <       |      <=       |
    |      <->      |      <       |      <        |
    |     -><-      |      <=      |      <=       |
    +---------------+--------------+---------------+
    |               |  than that of the operator   |
    +---------------+------------------------------+

  The parameter before the operator symbol is called left operand.
  The parameter after the last symbol of a construct is called
  right operand. In case of normal operators the last symbol of a
  construct and the operator symbol are identical. If this is not
  the case there is a third kind of operand. Between the operator
  symbol and the last symbol of a construct are the middle operands.
  Middle operands can have any priority.

9.4 The syntax of operators

    The S7SSD treats everything as operator description. Operators
  have priority and assoziativity. The priority and associativity
  determine in which succession the S7SSD syntax rules get applied.
  To explain priority and assoziativity we use the basic arithmetic
  operations (+,-,*,/). To describe them with EBNF you can write:

    factor :=
      number | name .

    expression_5 ::=
      factor |
      ( '+' expression_5 ) |
      ( '-' expression_5 ) .

    expression_6 ::=
      expression_5 |
      ( expression_6 '*' expression_7 ) |
      ( expression_6 '/' expression_7 ) .

    expression_7 ::=
      expression_6 |
      ( expression_7 '+' expression_6 ) |
      ( expression_7 '-' expression_6 ) .

  This describes the following things:

     - The operators have different priorities:
        - Plus and minus signs are executed first
        - Multiplication and division are executed second.
        - Addition and subtraction are executed last.
     - This priorities are exactly what you expect from an
       arithmetic expression.
     - Additionally you see that ++2 is allowed and interpreted as
       +(+(2)) which means that the plus sign is a right-associative
       operator.
     - You can also see that a*b*c is allowed and interpreted as
       (a*b)*c which means that the multiplication is a
       left-associative operator.

  All this things can also be described with S7SSD:

    $ syntax expr: . + .() is     <-   5;
    $ syntax expr: . - .() is     <-   5;
    $ syntax expr: .(). * .() is   ->  6;
    $ syntax expr: .(). / .() is   ->  6;
    $ syntax expr: .(). + .() is   ->  7;
    $ syntax expr: .(). - .() is   ->  7;

  As you can see the S7SSD is shorter as the description with EBNF.
  A syntax statement is expained as follows:

     - The $ is used to introduce all hardcoded statements.
     - The keyword 'syntax' introduces a structured syntax
       description.
     - The result of the recognized expression will have
       the type 'expr'. The type 'expr' is used between the syntax
       and the semantic analysis. The type 'expr' describes
       expressions which are syntactically analyzed but not
       semantically analyzed. After the semantic analysis (and
       during the runtime) the type 'expr' is not used.
     - The colon ':' is used as separator between type and syntax
       description.
     - A dot expression like '.(). * .()' is introduced (as can
       probably be guessed by the name) with a dot. For the purpose
       of the syntax description you can just remove the dots in
       your mind: '() * ()'
     - The symbol 'is' is used in all Seed7 declarations as
       separator between the name and the value.
     - The associativity is described with one of the symbols
       -> (left-associative), <- (right-associative),
       <-> (not associative) and -><- (both associativitys).
       When there are no left or right operands, as it is the case
       for the 'loop' statement, the associativity is irrelevant.
     - Finally the priority of the syntax construct is defined with
       a integer literal like '6'. The priority '6' is used for the
       operators '*', '/', 'div', 'rem', 'mdiv' and 'mod'.

  The S7SSD can also be easily used to do automatic syntax
  recognition.

  There are also other things which are out of the scope of the
  S7SSD. The syntax of tokens (whitespace, comments, identifiers and
  literals) and expressions (parentheses, function calls and dot
  expressions) is hardcoded. The hardcoded constructs are described
  in chapter 10 (Tokens) and chapter 11 (Expressions).

  For the reasons mentioned above it is not possible to transform
  every EBNF syntax description into S7SSD. Transforming S7SSD
  descriptions to EBNF is always possible.

  The advantage of the S7SSD lies in its simplicity and that a fast
  automated syntax recognition algorithm can be easily implemented.
  It is exactly the combination of hardcoded syntax recognition and
  flexible syntax rules that make it successful.

9.5 Comparison of EBNF and S7SSD

  In the S7SSD of the 'loop' statement

    $ syntax expr: .loop.().until.().do.().end.loop is -> 25;

  are no nonterminal expressions '()' before the first keyword or
  after the last keyword. Therefore the assoziativity does not play
  any role. The nonterminal expressions '()' of the 'loop' statement
  are all surrounded by keywords and therefore they can have any
  priority. As priority of the 'loop' 25 is choosen just because
  most other statements have also priority 25. The assignments
  (:= +:= *:= ...) have priority 20 and all operators used in
  arithmetic, boolean and string expressions have priorities less
  than 20. BTW: The semicolon operator (;) is defined with the
  priority 50. Operators with a priority of 0 get their parameters
  before operators with priority 1 and so on.

  The corresponding EBNF description of the 'loop' statement would
  be:

    expression_25 ::=
      'loop'
        expression_127
      'until' expression_127 'do'
        expression_127
      'end' 'loop' .

  You must keep in mind that alternative rules for expression_25 are
  also possible and that for every priority level a rule like

    expression_127 ::= expression_126 .

  is defined. Additionally the following rules are defined:

    expression_0 ::= token | parentheses_expression |
      call_expression | dot_expression .

    token ::=
      identifier | literal .

    parentheses_expression ::=
      '(' expression_127 ')' .

    call_expression ::=
      expression_127 [ '('
      [ expression_127 { ',' expression_127 } ]
      ')' ] .

    dot_expression ::=
      [ '.' ] call_expression { '.' call_expression } .

  When you want to use some special syntax which should be only
  allowed at some place you do the following:

     - Define the special syntax with S7SSD in a way that does not
       contradict with the rest of the syntax definitions.
     - Use semantic definitions to make sure that this syntax
       construct can only be used at the place desired.

  The 'elsif' and 'else' parts of the 'if' statement use this
  technic. The syntax of the 'elsif' and 'else' statement parts is:

    $ syntax expr: .elsif.().then.() is               <- 60;
    $ syntax expr: .elsif.().then.().() is            <- 60;
    $ syntax expr: .else.() is                        <- 60;

  The types 'ELSIF_RESULT' and 'ELSIF_PROC' are just defined to be
  usable for the 'elsif' and 'else' parts. A special variant of the
  'if' statement is defined with a parameter of type 'ELSIF_PROC':

    $ syntax expr: .if.().then.().().end.if is        -> 25;

    const proc: if (in boolean param) then
                  (in proc param)
                (in ELSIF_PROC param)
                end if is                      action "PRC_IF_ELSIF";

  With 'action' the primitive actions are used (instead of a high
  level definition written in Seed7). Primitive actions are
  explained here: http://seed7.sourceforge.net/manual/actions.htm
  The semantic part of the 'else' and 'elsif' part are defined with:

    const ELSIF_PROC: elsif (in boolean param) then
                        (in proc param) is     action "PRC_IF";
    const ELSIF_PROC: elsif (in boolean param) then
                        (in proc param)
                      (in ELSIF_PROC param) is action "PRC_IF_ELSIF";
    const ELSIF_PROC: else
                        (in void param) is     ELSIF_EMPTY;

  Since no other functions of type 'ELSIF_PROC' are defined only
  legal 'if' statements can be written.

  An explanation of the 'if' statement is here
  http://seed7.sourceforge.net/manual/stats.htm#if-statement

================================

Thanks in advance for your effort.

Greetings Thomas Mertes

Seed7 Homepage:  http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.