Basically a touch of Forth

Authors:

Basically a touch of Forth

If you write a compiler like 4tH, you need to write some serious applications to show people it is not a toy. I considered a BASIC interpreter serious enough, so I went for that one. True, I didn't truly write it from scratch, it was based on existing code. When it was finished it was a very limited implementation. It didn't even have a REM statement.

So, I called it uBasic (unaware at the moment that about half a dozen of people had thought of that name as well). I added some whistles and bells to make it half useful and started to port some example programs. I hadn't really written "Minimal BASIC" in a long while and then I realized again why.

It's so simple to write barely maintainable code. At the moment you're writing it, you're not quite really aware, but once it is finished, you forget about the underlying logic fairly quickly. And although I don't mind using global variables, once you have to reserve some for use in subroutines - the fun goes right out the window. So all that needed to change.

Structured programming

I started with structured programming. Maybe you're not aware of it, but even the most primitive BASIC has one single structured programming statement: FOR..NEXT. In uBasic FOR..NEXT is implemented using a stack. As we will see, that has several advantages, but also several drawbacks. The main one is that you can't just leave a FOR..NEXT loop as you will corrupt (and probably exhaust) the stack. So this one is a big "nono":

10 FOR x=1 TO 10
20 IF x=5 THEN GOTO 40
30 NEXT x
40 PRINT x

In other BASICs this is not a problem. E.g. Sinclair BASIC (which was used in the ZX Spectrum) has special variables which apart from the value contain the information concerning the FOR..NEXT loop, like limit, increment and location of the original FOR statement (you have to jump to somewhere, don't you). In uBasic all this information is stored in the stack frame, along with the address of the bound variable.

That last one makes uBasic very versatile. You can use a local variable in a FOR..NEXT loop - or even an array element. All those variables are located statically, so if we were to adopt the Sinclair Spectrum scheme, we would be forced to turn at least a certain type into FOR variables. That would help porting classic BASIC programs (which are usually not too well-written), but otherwise it would simply be too restrictive. Especially when you're trying to do some structured programming.

The best way was to offer some constructs that would you to leave a FOR..NEXT loop in a controlled way. But that's quite a challenge, since it is virtually impossible to determine whether a GOTO effectively exits a FOR..NEXT loop. RETURN would be not as much of a problem, but it would require that we save the current nesting level before the jump and compare it upon return, discarding any superfluous stack frames if needed. So at best it's only half the solution.

And there we are. Either we put the responsibility of managing the FOR..NEXT stack entirely in the hands of the programmer or we introduce half a solution and additional overhead. Not an easy choice. I chose consistency. Introduce UNLOOP. UNLOOP simply pops the stack frame and discards it. Yes, just like Forth!

On the other hand, I also wanted something more transparent. Introduce BREAK. BREAK calls UNLOOP and subsequently leaves the loop by going to the matching NEXT. The routine that jumps to the nearest NEXT is factored out, so we simply have to call it. CONTINUE is simply a NEXT (well, almost) that isn't caught by that routine. Job done. So now we can write:

10 FOR x=1 TO 10
20 IF x=5 THEN BREAK
30 NEXT x
40 PRINT x

Note we can write as many "IF .. THEN BREAK" statements as we want. Now define a statement that evaluates an expression and if is non-zero, it calls BREAK. Welcome, UNTIL. So, now we can write:

10 FOR x=1 TO 10
20 UNTIL x=5
30 NEXT x
40 PRINT x

WHILE is even easier. It is almost identical to UNTIL, you just have to reverse the flag. Note that uBasic has a few treats that make it even easier to turn it into a language, that supports structured programming. First, it doesn't really require line numbers. If you don't refer to one, you don't need 'em. Second, since all looping information is stored in the stack frame, we don't really need the variable after NEXT. I just added that to make it easier to port stuff. If you are Sinclair BASIC you desperately need that variable, because that is where your looping info is. So we could write the above just as easily as:

For x=1 To 10
  Until x=5
Next

Print x

No questions asked! I always liked C's for statement, so I started some work to make FOR a little more flexible, making TO optional as well. Then I even did away with the initialization, leaving nothing more than a dead FOR statement, which built a stack frame consisting of nothing more than a dummy value, some flags and the address of FOR. That's a far cry from (TOS) the address of the bound variable, the increment, the address of FOR and the loop limit.

Still, these values are not useless. they allow to go through NEXT very quickly by signalling they have no bound variable and comparing the values always returns FALSE. Finally, we add a few aliasses, like DO and LOOP and we can do all this:

Do While x<5
  (..)
Loop

Do
  (..)
  Until x=5
Loop

Or even:

Do
  (..)
While x<10
  (..)
While y%13>0
  (..) 
Until z=0
Loop

Sounds familiar, huh? It's a plain descendant of 4tH's BEGIN..WHILE..WHILE..REPEAT construct. It's much more flexible than your average BASIC loop and uses far less code than implementing different loop constructs.

Data stack

uBasic didn't have any functions to begin with and it took me a little while before I got that right. But when I did, it was easy to add new ones. Sure, the average SGN(), ABS() and RND() were quickly implemented. One day I thought that when I can't have functions and procedures, may be I should try it the Forth way. So I implemented a data stack.

You could push values on the stack with the PUSH statement:

PUSH 1,2,3,4,5

And retrieve them with the POP() function:

LET x = POP()

For added flexibility I even added a TOS() function, which can be regarded as an equivalent of R@. It was cool! You could pass a value to a subroutine by pushing it on the stack and handle most stuff using the stack:

9000 PUSH ABS(POP()%62832) : IF TOS()>31416 THEN PUSH 62832-POP()
     LET A=TOS()>15708 : IF A THEN PUSH 31416-POP()
     PUSH TOS() : PUSH (POP()*POP())/10000 : PUSH 10000+((10000*-(TOS()/56))/10000)
     PUSH 10000+((POP()*-(TOS()/30))/10000): PUSH 10000+((POP()*-(TOS()/12))/10000)
     PUSH 10000+((POP()*-(POP()/2))/10000) : IF A THEN PUSH -POP()
     RETURN

Returning a value to the caller was as easy as pushing it on the stack and let the caller retrieve it through POP(). But could we do better? Yeah, sure.. But for that we needed one more thing: local variables.

Local variables

This may sound pretty sophisticated, but as a matter of fact it is very simple. Of course, it is hard to define symbols when you don't have a symbol table, but I could use fixed symbols. 4tH supports local variables as well, but it is implemented in a library that doesn't have access to the symbol table. So, in 4tH local variables have fixed names. There is no reason to do it differently here. Still, I quickly found out that 4tH's local variable library was not well suited to interface with uBasic. So I wrote a dedicated one.

Implementing local variables is really very simple. You just need a large stack or array. I always find it easiest to maintain it from high memory to low memory, but that's just me. You start off with a pointer in the top cell pointing nowhere. That's how you detect you can't free any more frames. A GOSUB simply allocates a new frame below that pointer, containing a pointer, pointing to the previous pointer.

uBasic local variables

Let's say you need three locals. Easy, you just issue the statement "LOCAL(3)" and it will free up the required space - that's it: no initialization! You follow the current pointer (which hasn't changed) to the previous pointer and calculate the offset. And there you are. RETURN simply discards the current frame and reinstates the previous pointer. And presto: you got local variables.

Note it's not a problem to call LOCAL() more than once, because it will just move the current pointer to lower memory, freeing up space without affecting the current variables. And RETURN will still be able to free up the whole thing.

Tying it all together

Now we want to go a step further and that requires tinkering with the GOSUB and RETURN statements once more. Because we want to pass those variables much more transparently than:

PUSH B*20000/573 : GOSUB 9000 : LET I=(R*POP())/10000

It's simple: check whether GOSUB has any trailing arguments after the label. If so, call PUSH to put them on the stack. Nice, they are on the stack now. How do we get 'em off? Simple, define a statement PARAM(), which is basically the same as LOCAL(), but uses the stack to initialize it's variables.

Now that may be a problem, because the first parameter is at the bottom of the stack, right? Wrong. We know where the last defined local variable is, because it's right above the current pointer. So we pop a value and put it there, move up, pop another value and repeat the whole shebang. Done!

Now it's time to turn to RETURN. We do the same thing here, but in reverse. We check whether it is followed by an expression and if it is, we push the result on the stack. But now it becomes really tricky. How do we get it off there? There is no other way to do it, we have to turn GOSUB into a function.

Now this is the really tricky part. Whatever we do, it has to be bulletproof enough to survive recursive calls. To make a long story short, we need to execute a second interpreter until we are sure we've returned - that is, reached the address on the return stack. It's not as dramatic as it sounds, though. We didn't have to make a second uBasic, it's just a simple one line loop. We are Forthers - we know our interpreters.

The result was FUNC(), which is much more capable than its "Minimal BASIC" cousin FN(), since you can turn entire subroutines into functions. The only thing left was getting rid of these pesky numeric labels. But that's simple. All we have to do is convince the tokenizer that the thing he encounters is a number. It only checks the first character. I decided to use the underscore for that.

Now, when the interpreter calls its bluff and tries to convert it into a real number, we quickly test the first character of the "number". Is it an underscore, we don't convert it to a number, but hash the entire token. The result is the same: we get a number. Note that the order of labels is not significant to uBasic. You can easily start the first line with "200" and the second with "100". It still will run perfectly.

Conclusion

All in all, we're done now. We can condense that entire line into:

LET I=(R*FUNC(_Cos(B*20000/573)))/10000

Well, that isn't too alien, is it? And if you want to know what our cosine routine now looks like:

_COS PARAM(1) : PUSH ABS(A@%62832) : IF TOS()>31416 THEN PUSH 62832-POP()
     LET A@=TOS()>15708 : IF A@ THEN PUSH 31416-POP()
     PUSH TOS() : PUSH (POP()*POP())/10000 : PUSH 10000+((10000*-(TOS()/56))/10000)
     PUSH 10000+((POP()*-(TOS()/30))/10000): PUSH 10000+((POP()*-(TOS()/12))/10000)
     PUSH 10000+((POP()*-(POP()/2))/10000) : IF A@ THEN PUSH -POP()
     RETURN

What? Are we still using the stack? Yes - it hasn't disappeared. We've just offered another way to access it by abstracting it, but it's still the same old stack. Sure, you could turn the cosine subroutine into a fully structured version if you wish, but if you violate the principles behind it by not balancing its stack properly, you would still get stack errors.

I think uBasic is one of the very few programming languages that exposes its underlying abstractions so transparently, so you can play with it and study its internals. Basically, it's just a BASIC with a touch of Forth.

4tH compiler Wiki

A Forth compiler with a little difference