Menu

Container arrays (or "what the heck am I working on")

2018-10-11
2018-10-16
  • Scott Franco

    Scott Franco - 2018-10-11

    Container arrays (or "what the heck am I working on")

    Container arrays for Pascaline originate back in 1993, when I was working on a
    new compiler for the 80386 series. I did that because I had already recoded
    my original Z80 assembly coded compiler front end to be Pascal only in the
    1980's, and because the off the shelf 32 bit compilers I was using by that time
    were dying as a business model (another story).

    Container arrays are perhaps the oldest major feature of Pascaline and are
    still one of the most important. As Wirth often said, you can't look at a
    feature in isolation, but rather how it meshes with the rest of Pascal, and
    how it dovetails with other Pascaline features.

    Pretty much everyone using ISO 7185 Pascal feels the need for variable string
    length allocation, and perhaps other types of arrays as well. There are
    solutions to that using standard ISO 7185 Pascal (yes, there are). One of these
    was used in Don Knuth's TeX system, which used a pool of strings placed into
    a single large character array. Another one was the "quanta" system I used
    in the Pascal-P5 compiler.

    Before the 80386 compiler, IP Pascal handled the need for variable length string
    allocation by simply assuming that all strings were some fairly long length
    (like 250 characters), then using a call:

    type string = packed array [1..250] of char;
         pstring = ^string;
    
    var p: pstring;
    
    procedure getspace(var p: pstring; l: integer);
    
    ...
    
    getspace(p, 42);
    

    getspace() was, of course, implemented in assembly language and was an
    an alternative to new(). It would be the equivalent to new(p), except that it
    only allocated 42 characters, and not the full 250. It meant that you only
    allocated what you needed, that you needed a separate routine for each type,
    and (most importantly) that there was no protection against overrunning the
    end of the string. Just like in C, if you index off the end of the array, you
    are corrupting the system [1].

    In the run up to 80386 IP Pascal, I went over several plans to implement
    variable length arrays. In this, I have to owe a debt of gratitude to the
    ISO 10206 standard, a standard I have been very critical in the past, mainly
    because it got me thinking in the correct direction. In fact, many of the
    things I really didn't like about ISO 10206 I ended up adopting, and container
    arrays is one of them.

    After a few years of producing various plans for variable length array
    allocation, I settled on the following founding ideas:

    1. Variable array allocation should not increase the cost of standard Pascal
      arrays. At all. Whereas it is reasonable to expect that use of variable length
      arrays would increase the cost when using them, a reasonable Pascal user could
      expect that their program not suffer speed or space inefficiencies while using
      legacy Pascal code. In specific, this stopped the idea of having a single
      normalized form for arrays that indicated their length, regardless of if they
      were ISO 7185 arrays or extended Pascaline arrays.

    2. They should interoperate completely with ISO 7185 Pascal. Specifically,
      variable arrays should be assignment and expression compatible with ISO 7185
      arrays, in both directions (to and from variable arrays). An ISO 7185 array
      should be able to be passed as a variable length array parameter.

    3. Variable arrays should not implement the ability to vary the length of the
      array at runtime. Rather, they only allocate the space in the array according
      to the programmer request, and then enforce protected access to that space.
      This does not mean that tracking variable length arrays is not possible, only
      that it is left to the user. The compiler only provides the ability to mark off
      a length of space in an array, then enforce legal access to that space. What
      you do with that space is up to you.

    4. The system should work for any type of array.
      The condition #3, is very important. I have seen other systems that
      attempt to give the programmer the "illusion" of variable length strings (Basic
      comes to mind), but really implement it by cutting off a fixed length of space
      and controlling the "size" of it via procedures and functions. This dramatically
      increases the complexity of the concept. Pascaline has such a series of string
      functions and procedures, but it is in a completely portable, user specified
      library.

    I called this "containers", and later "container arrays", because I found that
    the term "container" was already in common use for other things. The term
    "container" goes to point #3 above. Pascaline does not implement true variable
    length arrays, but rather allows you to create fixed arrays of arbitrary length
    that "contain" the actual array you want.

    Container arrays are built on one single idea, in fact a powerful one, that
    carries over to class/object orientation as well: a complex pointer. Each
    complex pointer consists of:

    0: Base data pointer.
    1: Template/descriptor.

    The base pointer is simply the address of the data in the array. The template
    describes the length of the array. In 1993, this was simply the integer length
    of the array. In Pascaline, this got expanded to a template pointer, which I
    will explain later.

    The advantage of keeping the target data as a binary pair is that at any time
    it can interoperate with standard ISO 7185 arrays. Going from ISO 7185 fixed
    array to complex pointer means adding a template to the array pointer. Going
    from container array to standard array means checking the template against the
    fixed length and dropping the template.

    So in the 1993 compiler, you specify:

    type string = packed array of char;
                  pstring = ^string;
    
    var p: pstring;
    
    ...
    
    new(p, 25);
    
    p^ := 'hi there bob            ';
    

    Means to create an array of 25 elements and assign it to the pointer, which is
    a complex pointer. Unlike the getspace() scheme described before, the compiler
    enforces any accesses to the array, so p^[26] gives you an error. you will note
    also that the assignment above actually mixes fixed and container array types,
    since a fixed string is being assigned to a container array.

    The original 1993 implementation only used vectors (single dimension arrays) and
    the "template" was always just a length. In fact, the first couple of times I
    used this method, it struck me as pretty ugly. In the assignment:

    new(p, 25);
    
    p^ := 'hi there bob            ';
    

    you have to create a string of the exact correct number of characters (25),
    then assign to it. It has to match the string it is being assigned from exactly.
    This applies even if you are using two container arrays:

    var p1, p2: pstring;
    
    ...
    
    new(p1, 25);
    new(p2, 25);
    
    p1^ := p2^;
    

    But note that this only applies if you are matching two different arrays. In the
    correct context, it actually makes a lot of sense:

    procedure lcases(var s: string);
    
    var p: pstring;
        i: integer;
    
    begin
    
       new(p, max(s));
       for i := 1 to max(s) do p^[i] := lcase(s[i]);
       s := p^
    
    end;
    

    Where max() is a built in function giving you the size of the array argument.

    In fact, creating, using, and then discarding an array is a frequent paradigm
    in Pascaline. The string library of Pascaline is built on that (as indeed is
    the language C#).

    One of the more interesting characteristics of the 1993 compiler is it allowed
    easily passing constant strings to routines. In ISO 7185 Pascal, you can do:

    type string = packed array [1..12] of char;
    
    procedure prtstr(p: string);
    
    ...
    
    prtstr('hello, world');
    

    But there is a lot wrong with that. First, the string length is fixed, so only
    one size string can be passed. Second, this is using value parameter rules, so
    the constant string is in fact copied entirely to a buffer before being passed
    to the routine.

    Now contrast that with Pascaline:

    type string = packed array of char;
    
    procedure prtstr(view p: string);
    
    ...
    
    prtstr('hello, world');
    

    Now the string can be any length, and nothing is copied, because the rules of
    "view" parameters are that the parameter can be any expression result, and
    cannot be modified during the routine. This means the compiler is free to pass
    the actual string, not a copy.

    ON TO PASCALINE

    The 1993 version of container arrays had several limitations. First, it could
    not be used with value parameters. VAR and VIEW parameters, yes, but not value.
    Also, it could not allocate multidimensional arrays (matrices or higher).
    Finally, it could only be used with pointer or dynamically allocated variables.
    None of this was a binding limitation. For value parameters, you make copies of
    arrays. For matrix or higher array levels, you roll your own:

    type vector = array of integer;
         pvector = ^vector;
         matrix = array of pvector;
         pmatrix = ^matrix;
    
    var mp: pmatrix;
    
    ...
    
    new(mp, 100); ! allocate matrix top index
    for i := 1 to 100 do new(mp[i], 100); { allocate matrix second level }
    
    mp^[42]^[73] := 1;
    

    Ungainly, but serviceable. In fact, this is how you make so called "sparse
    arrays" which are popular in C, where sparse means that each vector in a matrix
    can have its own size.

    For local variables of container arrays, you simply have to create a dynamic
    pointer at the start of a block, and terminate it at the end.

    In pascaline, container arrays are orthogonal over all these operations. Any
    index level of array is acceptable. They are usable with any parameter mode,
    value, VAR, VIEW or OUT. And declaring them as variables is not only possible, it
    is encouraged as part of the "Pascaline style":

    program test;
    
    type maxtrix = array of array of integer;
    
    var m(100, 100): matrix;
    
    procedure setmatrix(mt: matrix);
    
    begin
    
      m[42, 73] := 1;
    
    end;
    
    begin
    
       setmatrix(m)
    
    end.
    

    In this trival example, matrix m is created as a program global, passed as a
    value parameter to setmatrix, and operated on. Because it is a value
    parameter, a copy of the entire matrix must be created, or 10,000 integers,
    quite a lot of data. Its not really modified by setmatrix, since it is just a
    copy. And we don't worry about how or where the compiler creates the matrix m,
    or how it gets rid of it.

    Note that the "array of array of..." notation becomes a side effect of not
    specifying the indices of the array. Another way to do that (which Pascaline
    DOES NOT implement) would be array [*,*] of integer[3].

    Now going back to the idea of a complex pointer internal implementation:

    0: Base data pointer.
    1: Template/descriptor.

    What is different? Well, the template is no longer just a length, but a pointer
    to a true template, as in a list of indices:

    template address -> 100,100

    In other words, its a table that describes the layout of the matrix in terms of
    its index sizes.

    It turns out, this is backwards compatible with both simple length based
    templates, because templates with only a single dimension are simplified to only
    hold a single length. Thus:

    Array type          Contents of template field
    ================================================
    ISO 7185            None
    Vector              Length
    Matrix or higher    Pointer to template table
    

    In operations, pointers to arrays are always stated in their simplest form. When
    the above matrix is "sliced", or has only a single vector picked out:

    m[42]

    Then it is automatically simplified to address/length only form. If an array is
    reduced to an element:

    m[42,73]

    The template part of the pointer is dropped entirely.

    In the case of matrices or higher order indexing, ISO 7185 Pascal arrays
    interoperate with container arrays by using what I call "pick up templates". For
    each fixed array in ISO 7815 Pascal, we keep a template filled out at compile
    time handy in the constants area. If we need to, say, assign a fixed matrix to
    a container array based matrix, we can convert the fixed version to a complex
    pointer by simply adding a pointer to the fixed table in memory.

    WHERE DID YOU PUT THAT?

    It turns out that WHERE you put container arrays is not only interesting, but
    it adds a new capability to the language, one that is hidden from the
    programmer. A simple global container array:

    program x;
    
    var a(10): array of integer;
    
    begin
    
       ...
    
    end.
    

    In Pascal-P6 this is actually a synonym for:

    program x;
    
    type ai = array of integer;
    
    var ap: ^ai;
    
    begin
    
       new(ap, 10);
       ...
       dispose(ap)
    
    end.
    

    Note that in the case of containers, you don't specify the length at dispose.
    The allocation of program variables is done statically, that is, they are
    compiled in to the program and the total space allocated to globals does not
    change. The compiler cannot allocate variable spaces in that that are not known
    until runtime. So what it does instead is allocate them dynamically, and it
    does that automatically for you, both when setting them up and tearing them
    down.

    What happens if the variable is a local?

    program x;
    
    procedure y;
    
    var m(100, 100): array of array of integer;
    
    begin
    
       ...
    
    end;
    
    begin
    
       ...
    
    end.
    

    Here, procedure y could allocate matrix m as a dynamic variable, but there is
    a better way: it can allocate them on the stack. This has several advantages.
    It reduces heap fragmentation, is usually faster than dynamic allocation, and
    has no overhead on teardown at all. When the stack frame for a routine is thrown
    away, the allocation for such variables gets thrown away with it.

    In fact, this is the basis of the C function alloca() which is available on
    some systems. The compiler actually gives you this for free[2]

    INITIALIZERS

    You probably have noticed that the examples so far are just a different form
    of ISO 7185 fixed arrays. The declaration:

    var m(100, 100): array of array of integer;

    Is equivalent to the ISO 7185 declaration:

    var m: array [1..100,1..100] of integer;

    In Pascaline, the first form is preferred. Why? In fact, A.N.Habermann gave the
    reason back in 1973 when he identified the problem with fixed arrays in Pascal:

    "The size of an array can hardly be considered as part of it's type"

    That is to say, the basic form of an array as a structure does not, and should
    not, change with its length, any more than a file with 11 elements is completely
    different than a file with 10 elements, or a set of 11 members is different from
    a set of 10 members, etc.

    In fact, fixed array limits are the only part of Pascal type specifications that
    have the property of making the type incompatible. A subrange of 1..10 is
    compatible with a subrange of 1..11. The only thing that comes close is a
    variant record:

    type r = record case boolean of false: (i: integer); true: (c: char) end;

    But in that case it is a list of different forms grouped under the same type[4].
    Thus, Pascaline considers arrays to be equal in type if their element type is
    the same and in the same format, which (curiously) makes their packed and
    unpacked status incompatible. More on that later. Pascaline considers the sizes
    of arrays to be a property of the variable, not its type. Thus:

    var a(100), b(200): array of integer;

    Are different size arrays, but both compatible with each other (and any other
    array of integer).

    What about:

    var a(100): packed array of integer;
        b(200): array of integer;
    

    These are not compatible arrays. Why? The reason is that the packing method used
    to store the values could actually change the format of the data in the array.
    If we have:

    var a(100): packed array of boolean;
        b(200): array of boolean;
    

    This (hopefully) becomes more obvious. An array of boolean probably takes one
    byte for each of the boolean values. Thus b is 200 bytes long. However, a could
    well be 12.5 bytes long, or 13 bytes (rounded up), because 100/8 = 12.5. The
    compiler can, if it is able, store booleans in only one bit, and stuff as many
    of those bits into the available space as it can. This is not just a change in
    array size, but a fundamental change in format.

    In fact, Pascaline considers any container arrays with the same packed status
    and base type to be equivalent. This is is perhaps the most questionable feature
    of Pascaline container arrays, but it has a purpose. Any library function with
    an equivalent array parameter is automatically compatible.

    Ok, so if these forms are equivalent:

    var m(100, 100): array of array of integer;

    var m: array [1..100,1..100] of integer;

    What is the point? Well, it is because the initializers for the variable are not
    limited to constants. They can be full expressions. They can contain constants,
    parameters and even functions:

    type vector = array of integer;
    
    procedure addone(view v: vector);
    
    var vc(max(v)): vector;
        i: integer;
    
    begin
    
      for i := 1 to max(v) do vc[i] := v[i]+1;
      ...
    
    end;
    

    This routine creates a local array vc with the same size as the parameter v,
    then copies the contents of v to vc and operates on it. A more complex
    example would be:

    type vector = array of integer;
    
    procedure addone(view v: vector);
    
    function vsize: integer;
    
    begin
    
       vsize := max(v)
    
    end;
    
    var vc(vsize): vector;
        i: integer;
    
    begin
    
      for i := 1 to max(v) do vc[i] := v[i]+1;
      ...
    
    end;
    

    Here we actually use a function to determine the size.

    The one thing you cannot, or more correctly should not, do is use local
    variables in initializers, since by definition they are not themselves
    initialized:

    type vector = array of integer;
    
    procedure addone(view v: vector);
    
    var x: integer;
    var vc(x): vector;
    
    begin
    
    end;
    

    This results in an error because x is not initialized. You could get around
    this (but don't!):

    type vector = array of integer;
    
    procedure addone(view v: vector);
    
    var x: integer;
    
    function vsize: integer;
    
    begin
    
       x := 10;
       doit := 0
    
    end;
    
    var vc(vsize+x): vector;
        i: integer;
    
    begin
    
      for i := 1 to max(v) do vc[i] := v[i]+1;
      ...
    
    end;
    

    This is dirty pool in a lot of ways. First, the function vsize has a side
    effect, which is bad form. Second, the evaluation of vsize and x is order
    dependent, and Pascal does not guarantee order of execution of expression
    operators (nor should it!). In fact, I am willing to bet there is no clean way
    to introduce locals into initializers (globals are a different story, as well
    as locals belonging to an outer block).

    SUMMING UP

    Container arrays add a lot of new capability to ISO 7185 Pascal, and finally
    solve the issue of creating and working with variable length array data in a
    way seamless with original Pascal. They also dovetail with Pascaline being
    an object oriented language, but that is a subject for another time.

    [1] In fact, the Z80 IP Pascal compiler had several features in common with C,
    including its modular structure.

    [2] To be fair, this could actually work in the program block as well, and
    Pascal-P6 actually does allocate some variables on the stack.

    [3] I have occasionally thought about adding this notation, but there is not
    a pressing need to declare arrays with huge numbers of indices.

    [4] Which makes variant records redundant with classes. However, although
    Pascaline proposes better methods to do some tasks than ISO 7185 Pascal, it
    never replaces them (or "depreciates" them).

     

    Last edit: Scott Franco 2018-10-17
  • Scott Franco

    Scott Franco - 2018-10-16

    An epilog of sorts: Why Pascaline is different?

    So GPC implements part of the ISO 10206 standard for "schemas". Here is a schema program:

    samiam@samiam-MACH-WX9:~/projects/franco/pascal/pascal-p6$ cat -n test1.pas
         1  program test(output);
         2  
         3  type a (n: integer) = array [1..n] of integer;
         4       b = array [1..10] of integer;
         5  
         6  var av: a(10);
         7      bv: b;
         8      i: integer;
         9  
        10  procedure q(x: a);
        11  
        12  begin
        13  end;
        14  
        15  begin
        16  
        17     for i := 1 to 10 do bv[i] := i+20;
        18     q(bv);
        19     av := bv;
        20     bv := av;
        21     q(av);
        22     for i := 1 to 10 do writeln(av[i]);
        23  
        24  end.
    samiam@samiam-MACH-WX9:~/projects/franco/pascal/pascal-p6$ gpc -o test1 test1.pas
    test1.pas: In main program:
    test1.pas:18: error: type mismatch in argument 1 of `q'
    test1.pas:10: error:  routine declaration
    test1.pas:19: error: type mismatch in array assignment
    test1.pas:20: error: type mismatch in array assignment
    samiam@samiam-MACH-WX9:~/projects/franco/pascal/pascal-p6$ 
    

    What this did was create two types, a and b, and variables from those types, av and bv. Type a is schema for integer array with n members. Type b is a fixed array with 10 integers.

    We also create a procedure q that takes a parameter of type a. This is a legal schema array parameter, it takes any length of integer array.

    Note that none of the combinations of schema types and non-schema types are compatible with each other, even though they contain the same number of array elements and same base type. A standard Pascal array cannot be assigned to a schema array, nor a schema to a standard Pascal array[1]. A standard Pascal array cannot be passed as a schema parameter. Note that in line 21 we demonstrated that this is perfectly legal -- when passing a schema array to a schema parameter.

    This is one reason I was not wild about the ISO 10206 standard when it came out. Its like saying "we have this neat new powerful type we added to the language for you, but it does not interoperate with the old type". Thus you are forced to convert all of your old array types to the new "improved" type to get work done. ISO 10206 is not the only Pascal extension to do this. In Borland Pascal, and several others, they have a string type that can accept any string length. And they added named file handling routines. But the named file handling only accepts extended strings. Thus:

    type filename: packed array [1..20] of char;
    
    var s: filename;
    
    assign(f, s);
    

    Does not work, because s, although a perfectly valid string definition in original Pascal, is invalid for file handling procedures. You are forced to use a string type specific to that particular language implementation[2].

    This also has performance implications. In ISO 10206, the fact that schema types are incompatible with original Pascal arrays makes it possible to have all schema arrays have a standard format, that is:

    1. Minimum index.
    2. Maximum index.
    3. Array data.

    (ISO 10206 gives the ability to specify both the minimum and maximum index value).

    Thus it is easy to handle schemas, because they all have the same machine level format. However, since you are forced to use schemas everywhere to take advantage of schemas, you also promoted your old Pascal fixed format arrays to schemas, and thus they all have the overhead of specifying the indexes for the array, even if you don't need it. An array:

    type a = array [1..100] of integer;

    Might be just what you need, and if you want to call a sort routine like sort(x), but a general purpose one formulated with schemas, you have to use a schema array to be able to use that routine.

    Now lets do the same program in Pascaline:

    samiam@samiam-MACH-WX9:~/projects/franco/pascal/pascal-p6$ p6 test
    Compiling test...
    P6 Pascal compiler vs. 0.1.x
    
         1      -56 program test(output); 
         2      -56  
         3      -56 type a = array of integer; 
         4      -56      b = array [1..10] of integer; 
         5      -56  
         6      -56 var av(10): a; 
         7      -56     bv: b; 
         8      -56     i: integer; 
         9      -56  
        10      -56 procedure q(x: a); 
        11      -72  
        12      -72 begin 
        13       18 end;
        14       19  
        15       19 begin 
        16       22  
        17       22    for i := 1 to 10 do bv[i] := i+20; 
        18       51    q(bv); 
        19       56    av := bv; 
        20       64    bv := av; 
        21       72    q(av); 
        22       77    for i := 1 to 10 do writeln(av[i]); 
        23      109  
        24      109 end. 
    
    Errors in program: 0
    

    [1] Actually, they can be assigned if you assign one element at a time. This is, in fact, a design principle of Pascaline, that a language construct is valid if it can be shown to be possible with other means in the same language, the "principle of equivalents" in the Pascaline specification.

    [2] The implementation specific libraries of Pascal-P6 actually use these calls to adapt to particular compilers. In that case the method used is to copy from an ISO 7185 Pascal standard string to a string type specific to the host compiler.

     

    Last edit: Scott Franco 2018-10-17

Log in to post a comment.