[ooc-compiler] Strings and things (longish)

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Folks,

Unfortunately I missed the last discussion of strings, but here are a
few thoughts. 

Michael van Acken wrote:
> 
> "Griebling, Michael" <mic...@ho...> writes:
> 
> > [...]
> > Some features I'd like to see are the overloaded operators
> > from NO (Native Oberon).  Not sure if they're going to be
> > in LO eventually.  With these operators and allowing structured
> > returns you can implement strings entirely within modules
> > with no other compiler support required.

This is interesting. One can probably do strings using just objects, but
the use of operator overloading might be a good way of off-loading some
complexity from the compiler to the libraries.

The best use for structured returns that I can see is for implementing
complex numbers (are there other uses outside scientific computing?).
I've written complex libraries (admittedly, in C) using both procedural
and object-oriented models (eg. similar to OOC's ComplexMath). In
computations like the FFT, its not feasible to return an object for each
computation because this introduces a _huge_ extra cost in
heap-management which easily exceeds the cost of the actual computation.
Using the procedural approach leads to code that is rather difficult to
read, but is faster. Often, its faster again to expand the calculation
completely inline. With an optimiser the inline result is faster.
Without an optimiser (eg. in most Oberon compilers), its slower because
the array address calculations are duplicated unnecessarily. Here are
two alternatives for the inner loop of the FFT:

#if usecomplexlib
  /* procedural */
  cmul(&w, &a[i2], &b);
  csub(&a[i], &b, &a[i2]);
  cadd(&a[i], &b, &a[ i]);
#else
  /* inline */
  ai2=a[i2]; ai=a[i];
  b.re=w.re*ai2.re-w.im*ai2.im;
  b.im=w.re*ai2.im+w.im*ai2.re;
  a[i2].re=ai.re-b.re; a[i2].im=ai.im-b.im;
  a[i].re=ai.re+b.re;  a[i].im=ai.im+b.im;
#endif

The use of structured returns and overloading would allow one to write
code that is probably equivalent to the procedural version in terms of
performance. It is certainly the most readable alternative:

  b := w * a[i2];
  a[i2] := a[i] - b;
  a[i] := a[i] + b;

Presumably, this is the motivation behind the OberonX extensions.

Hmm. Almost as neat as FORTRAN...

> Let's investigate a little bit further in this direction.
> 
> 1. How much convenience can a statically typed language offer as far
>    as string handling is concerned?  Or, putting it the other way
>    round, taking the string support of a script language like Python,
>    which features cannot be implemented in a language similar to O2?

I don't know Python, but I'll comment on what I know.

1) Java.

Java provides Strings which are immutable, and StringBuffers which are
mutable (allowing insert, delete, etc). Java has a concatenation
operator for strings, but its most powerful feature is the ability to
turn any object into a string. Therefore, one can write:

  Object o; double x, y;
  ...
  System.out.println("Object " + o + " is at (" + x + ", " + y + ")");

Since every Java object has a toString() method, this is trivial to do.
If the user does not define a toString method, the default method in the
Object class uses meta-programming to output selected (public) fields of
objects. This is _extremely_ useful for debugging and general formatting
of messages.

The basic Java string class has about 50 methods for comparing whole or
parts of strings, converting case, extracting sub-strings, and
converting from various other types (eg. int, float, char).

The Java compiler uses StringBuffers (not Strings) within string
expressions, which means that its string operations are actually very
efficient. This seems to me to be a good approach.

<QUOTE>
String buffers are used by the compiler to implement the binary string
concatenation operator +. For example, the code: 

          x = "a" + 4 + "c"

is compiled to the equivalent of: 

          x = new StringBuffer().append("a").append(4).append("c")
                                .toString()

which creates a new string buffer (initially empty), appends the string
representation of each operand to the string buffer in turn,
and then converts the contents of the string buffer to a string.
Overall, this avoids creating many temporary strings. 
</QUOTE>

Note that the StringBuffer append() method is overloaded to handle all
primitive data types.

There is one trap in the Java string model. Since Strings are objects,
the equality operator "==" actually tests for identity not equality.
Therefore, the expression 
  s == "hello"
is _always_ false, because a variable string can never be identical to a
constant string. To get the intended effect, one needs to use:
  s.equals("hello");
Its easy to make this mistake. The compiler will never complain because
its actually not an error.

2) Component Pascal

Component Pascal extends Oberon-2 strings by defining String types:

<QUOTE>
6.6 String Types

Values of a string type are sequences of characters terminated by a null
character (0X). The length of a string is the number of characters it
contains excluding the null character.

Strings are either constants or stored in an array of character type.
There are no predeclared identifiers for string types because there is
no need to use them in a declaration.
</QUOTE>

Their argument is that string types are unnecessary because strings can
be contained within an ARRAY OF CHAR. I think this is a mistake: ARRAYs
can contain things that are not valid strings (eg. are non-terminated).
Their solution is to introduce the "$" designator:

<QUOTE>
If a designates an array of character type, then a$ denotes the null
terminated string contained in a. It leads to a run-time error if a does
not contain a 0X character. The $ selector is applied implicitly if a is
used as an operand of the concatenation operator (Ch. 8.2.4), a
relational operator (Ch. 8.2.5), or one of the predeclared procedures
LONG and SHORT (Ch. 10.3).
</QUOTE>

This allows character arrays to be used as "carriers" (my term) for
strings, but one still cannot, for example, write a simple function that
returns a string. It must either take a character array as an OUT
parameter, or return a pointer to an ARRAY OF CHAR containing the
string. In most cases, the Blackbox framework uses the former approach
which means that data structures tend to be littered with arrays of
arbitrary fixed lengths. To make a string of unknown length persistent
(eg. for assigning to a variable, or returning as a function result),
one still needs to do an explicit allocation once the length has been
calculated. The neatest way to do this is to write a procedure to make a
heap copy. For example:

PROCEDURE CopyString(a : ARRAY OF CHAR) : POINTER TO ARRAY OF CHAR;
VAR result : POINTER TO ARRAY OF CHAR;
BEGIN
  NEW(result, LEN(a$)+1);
  result^ := a$;
  RETURN result;
END CopyString;

PROCEDURE MakeString1(a, b, c : ARRAY OF CHAR) : POINTER TO ARRAY OF
CHAR;
BEGIN
  RETURN CopyString(a$ + ":" + b$ + "/" + c$);
END MakeString1;

This means that at run time the system has to build the string, make a
copy to pass it by value, calculate its length, and then make another
copy. Alternatively, I could have written:

PROCEDURE MakeString2(a, b, c : ARRAY OF CHAR) : POINTER TO ARRAY OF
CHAR;
VAR result : POINTER TO ARRAY OF CHAR;
BEGIN
  NEW(result, LEN(a$)+LEN(b$)+LEN(c$)+3);
  RETURN CopyString(a$ + ":" + b$ + "/" + c$);
END MakeString2;

This saves one copy operation, but is more error prone because the
length is explicitly calculated. Both versions 1 and 2 require that the
original arguments (a, b, c) be passed by value, which introduces
another copying stage. Aesthetically, its not quite right that the input
and output types need to be different. Unfortunately, this is necessary
if one wants to be able to pass string constants to the functions.
Really, one should be able to just write:

PROCEDURE MakeString3(a, b, c : STRING) : STRING;
BEGIN
  RETURN a + ":" + b + "/" + c;
END MakeString3;

If STRING values are immutable, one can always pass them by reference
which makes string operations very efficient. Note that in Oberon, its
not possible to define an immutable string of non-arbitrary length
(unless its encapsulated in an opaque object). You can write:

  x- : ARRAY 128 OF CHAR;

and protect x from write access. However, if one writes:

  x- : POINTER TO ARRAY OF CHAR;

only the pointer is read-only; the array is still writable. There is no
equivalent of the C declaration:

  const char * const x;

3) BASIC

This is not a joke! BASIC has always had adequate string support. Its
really not a big deal.

BASIC supports immutable strings as fundamental types. It provides
concatenation as a primtive operator. Other facilities are provided by
built-in functions. The standard ones seem to be:

LEFT$, RIGHT$ and MID$ return substrings
STR$ converts any type (usually ordinal types) to a string
VAL returns a variant value for a string. If this is assigned or cast to
a particular type it probably generates a run-time check.
INSTR$ determines the ordinal position of a sub-string within a string

> 2. What support in the libraries and within the compiler is necessary
>    to get convenient string handling?

The Oberon-2 language already has extensive support for strings. The
problem is that strings currently exist only as constants, not as true
types.

In the Oberon-2 language report:
  3 (4) defines a string as a sequence of characters.
  8.2 (4) defines relations =,  #, < <=, >, and >= for strings.
  9.1 (3) allows strings to be assigned to character arrays.
  Appendix A defines strings (constants) as Assignment- and
    Array-compatible with ARRAY OF CHAR
  No operators are defined for strings.

Here are the changes that I think need to happen:

1) Strings should be true types. 

2) String concatenation should be supported via the "+" operator.

3) String variables and literals need to be treated uniformly. The
easiest way to do this is to make strings immutable. This also makes it
possible to pass strings by reference, which should be efficient.

4) Strings should not be unnecessarily tied up with character arrays. In
Oberon, strings are already assignment-compatible with ARRAY OF CHAR.
There should be a standard procedure to extract a string from an ARRAY
OF CHAR. Eg.
  STR(VAR v : ARRAY OF CHAR) : STRING;

5) Provide some low-level operations on strings. A minimalist set would
probably be:

LEN(v : STRING) : LONGINT;
  Return number of characters in string v

AT(v : STRING; i: INTEGER) : CHAR;
  Return the character at position i in v

SUB(v : STRING; i, j : INTEGER) : STRING;
  Return a sub-string of v starting at position i and ending at position
j.

For compatibility with C language APIs, it should be possible to pass a
STRING where a POINTER TO ARRAY OF CHAR is expected. OOC already
implements this using the [CSTRING] tag in interface modules.

Another workable alternative would be to use a model similar to Java:
define mutable and immutable strings as "inner classes" that can then be
implemented in their own modules, but get some special treatment by the
compiler. With the exception of concatenation, most string operations
can be just as efficiently implemented as type-bound procedures.

To support both CHAR and LONGCHAR, probably requires both STRING and
LONGSTRING. String concatenation needs to widen the result to the
largest string type. It might also be important to allow explicit
widening / narrowing of string values. Component Pascal defines:
  LONG(s : STRING) : LONGSTRING
  SHORT(s : LONGSTRING) : STRING

Having two string types makes things a bit annoying at the low level,
especially since their implementations will be almost identical. Java
supports only 16-bit characters, so it has only one string type. A
solution might be to generalise strings by making the STRING type
parametric over ordinal types. Therefore, one could define:
  STRING OF CHAR;
  STRING OF LONGCHAR;
  STRING OF INTEGER;
  ... etc
This is actually not hard to implement. I once wrote something similar
using C++ templates, but perhaps it is best not to speak of such
things...

Cheers,
  Stewart