
16 Blocks

Introduction

Classic Forth doesn't deal with files at all. The entire mass-storage medium is divided into blocks: chunks of 1K each. In order to store source code, a block is divided into 16 lines of 64 characters - without any line terminators. So if you use only two characters of a line, the rest is padded with blanks.

Converting a block to an ordinary text file isn't difficult. Create a buffer of 64 characters, read in 64 characters at a time, -TRAILING the thing and spit it out as a perfectly terminated line. That's it.
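
In ANS Forth terms, that loop could look like this - just a sketch, which assumes a block file is current (selected with e.g. gforth's USE) and that the text file handle is kept in out-file (both assumptions, not part of the original programs):

    0 value out-file                  \ file id of the text file (assumption)

    : line>text ( c-addr -- )         \ write one 64-character block line
      64 -trailing                    \ strip the trailing blanks
      out-file write-line throw ;     \ WRITE-LINE adds the terminator

    : block>text ( u -- )             \ convert block u to 16 text lines
      block 16 0 do
        dup line>text  64 chars +     \ step to the next line in the block
      loop drop ;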

However, doing the inverse is harder. Much harder than you think.

A walk around the block

First of all, text files are not made with blocks in mind. Most lines will be longer than 64 characters. That means you have to break them up at one point or another. Of course, you can simply dump them, but cosmetically and practically that isn't the right way to go. Granted, due to Forth's nature, spilling a single text line over two block lines won't prevent it from compiling properly, but it makes editing - or maintaining in general - nearly impossible.

To tokenize a line, you could resort to PARSE, but if there are leading spaces, all that's returned is an empty string. You then have to check >IN to find out whether you got an empty string because a delimiter was skipped - or because you've reached the end of the buffer. You could prevent that by using PARSE-WORD or PARSE-NAME, but those will make you completely lose the layout of the original. You'll just get back individual tokens.
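
In a sketch, that >IN test boils down to this - note that BL PARSE consumes only one leading space per call:

    : token ( -- c-addr u eob? )     \ eob? is true at the end of the buffer
      >in @  bl parse                ( old-in c-addr u )
      dup 0= if
        rot >in @ =                  \ empty: end of buffer if >IN is unchanged
      else
        rot drop false               \ a real token was found
      then ;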

And we're not done yet. Some words - often string related words like ." or S" - require you to switch delimiters, since they're not delimited by a space, but by a double quote, for example. Splitting such a line isn't too wise. Let's say ABORT" still fits on the current line, but the message itself has to start on the next line. Given that it's not good practice to fill a line up to the full 64 characters - that might join two different words together as far as the compiler is concerned - an extra space has to be added to prevent this unintentional merge.

So the best thing is to consider the expression ABORT" This is an error!" as one single token. Of course, if that token exceeds 63 characters, all you can do is "dump it as is" and hope for the best. You can't win 'em all. Always adding a space after it may disturb the original layout - but not doing so may introduce errors in an otherwise perfectly compiling program. Darned if you do, darned if you don't, so to speak.
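
The gist of such an expansion, as a sketch - it assumes the token and its string live in the same input buffer, so plain address arithmetic applies:

    : quoted? ( c-addr u -- f )        \ does the token end in a double quote?
      dup 0= if 2drop false exit then
      chars + 1 chars - c@ [char] " = ;

    : expand ( c-addr u -- c-addr u' ) \ glue ABORT" and its string together
      2dup quoted? if
        [char] " parse                 ( c u c2 u2 )
        + char+                        \ address just past the closing quote
        2 pick - nip                   \ new length, measured from c-addr
      then ;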

And finally, blocks are not text files. If you want to insert an entire 400-line subroutine into a text file, you just do it. In Forth, you have to make space for those 26-odd blocks by moving all the blocks following it down the line. In order to preserve some room for expansion, it is rare to see blocks filled up to their 16-line limit. Most of the time about 25% of a block is left empty for that reason.
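
What such a shuffle takes, as a sketch - with a hypothetical last-block variable tracking the last block in use, and a scratch area so even a single block buffer suffices:

    variable last-block               \ last block in use (assumption)
    create scratch 1024 chars allot   \ intermediate area, one block in size

    : copy-block ( from to -- )       \ safe even with a single block buffer
      swap block scratch 1024 move
      scratch swap block 1024 move update ;

    : make-room ( u n -- )            \ open up n blocks in front of block u
      swap dup last-block @           ( n u u last )
      do                              \ assumes last-block >= u
        over i +  i swap  copy-block  \ copy block i to block i+n
      -1 +loop
      drop last-block +! ;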

And so 4tH..

In 4tH, things are even more complex. Before v3.64.1, expressions like CHAR A or [DEFINED] DUP required one single space between the word and its argument. So these had to be taken into consideration as well. On top of that, the compiler required that backslash comments were properly terminated. The solution was to eliminate those comments completely.

This opened up another problem. Let's assume that your program begins with lots of comments - just because you thought it needed proper documentation. That means your converted program begins with lots of completely empty blocks.

Version 1

Now, when the first publicly available 4tH was released back in 1996, it came with converters in both directions. The first version of the "text to block" converter was simple, but quite effective. It read a line and determined whether it fit in a block line. If it did, it was converted "as is". If not, it extended the parsed token whenever a word required an additional string. If the token was still too long, the line was split at the last viable space - and that was repeated until the last part of the line had been written to the block.
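
The heart of that split - finding the last viable space - could look like this sketch; 63, so there is still room for a terminating space:

    : split-point ( c-addr u -- n )   \ length of the left-hand part,
      63 min                          \ ending in a blank
      begin dup while
        2dup 1- chars + c@ bl = if nip exit then
        1-
      repeat nip ;                    \ 0 means: no viable space found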

If the line reduced to an empty line - after eliminating the backslash comments - a flag was set. If the next line proved to be empty as well, a "newline" was suppressed, greatly reducing the number of blocks required.
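
A sketch of just that flag logic, with a hypothetical next-block-line performing the actual advance:

    variable was-empty                \ flag: the previous line was empty

    : ?newline ( u -- )               \ u = length of the line just parsed
      0= dup was-empty @ and          \ empty now AND empty before?
      0= if next-block-line then      \ only then the "newline" is suppressed
      was-empty ! ;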

Finally, any non-printable characters were converted - either to a space or a caret.
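
That cleanup is simple enough; a sketch, which maps tabs to a space and everything else unprintable to a caret:

    : clean ( c-addr u -- )           \ replace non-printables in place
      chars over + swap ?do
        i c@ dup bl [char] ~ 1+ within 0= if
          9 = if bl else [char] ^ then i c!   \ 9 = tab
        else drop then
      loop ;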

Needless to say, a great deal of the operations required were applied directly to the TIB - simply because that was the easiest thing to do. 4tH allows that practice, but it isn't considered a recommendable one.

Still, the program produced quite acceptable results. A conversion of the uBasic/4tH program failed only at a single CHAR A occurrence - which ultimately led to the elimination of this notorious bug. The single-space restriction still applies to INCLUDE, but there it is much less likely to cause trouble and much easier to trace.

Furthermore, if you plan to convert to a block file, you can use the equivalent [NEEDS word instead of INCLUDE.

Version 2

This program is shorter than its predecessor and offers two extra options: you can add a header to each block and determine to what degree the blocks have to be filled. Its internals are completely different. When it reads a line, it composes tokens. It does this by first determining the number of leading spaces. Then it reads the token - and expands it if necessary, for S" for example. If the character at the current position of the tokenizer is a delimiter, it becomes part of the token - otherwise, the delimiter in question would simply vanish. Not a good idea.
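
Counting those leading spaces is straightforward; a sketch:

    : leading ( c-addr u -- n )       \ number of leading blanks
      0 >r
      begin dup 0> if over c@ bl = else false then
      while 1 /string r> 1+ >r repeat
      2drop r> ;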

The consequence of this is that any trailing blank is considered part of the token. Hence, long lines aren't filled up to the 63rd character, but only up to the 62nd. But that is only a minor problem in the scheme of things.

Every time a block line is composed, it is closely monitored. If a token threatens to exceed the length of a block line, the current line is terminated and the entire token is dumped "as is", at its full length. The counters are updated properly, though. Any subsequent tokens are evaluated in the usual way - unless they too exceed the current block line.

Detecting empty lines is handled quite differently as well. The value of >IN is stored before reading a token. If >IN was zero and the token read is empty, the entire line is flagged as empty. In that case, no leading spaces are written to the block file. If the previous line was empty as well, the "newline" is suppressed.

A line is considered completely processed when it is either empty or the value of >IN hasn't changed.

Tests

Both programs were tested using the uBasic/4tH source code. Both conversions compiled flawlessly under v3.64.1 to identical compilants - which means no strings were elongated by the converters. The conversions resulted in proper block files of the correct length.

The text file was slightly over 47K. Version 1 converted it to an 88-screen block file; Version 2 required one block more, because its "newline" suppression algorithm proved to be slightly less effective.

Both files were converted almost instantly on a 64-bit Intel i5.

The 4tH block editor

For the reasons stated in the introduction, the 4tH editor includes a "block to text" converter. It is a quite straightforward routine that changes very little in the source. For example, there is no detection of consecutive empty lines.

Being a descendant of the original "FIG editor", it is quite capable - but quite tedious to work with. I never imagined anybody designing an entire program with this editor - it was more like, "if I want to do some light development, I've got an editor to work with".

Apart from its "FIG" roots, it's quite an old program. It came with every single publicly available 4tH release and was integrated into the 4tH executable not much later. However, it was developed before the BLOCK wordset was implemented.

Since 4tH couldn't read block files the "Forth way", another solution had to be found. As usual, the ZX Spectrum was my inspiration - Abersoft Forth, to be exact. Since the ZX Spectrum didn't support files at all, Abersoft Forth created a "RAM disk" on this 48K machine. ZX Forth - which was actually Artic Forth - used headerless blocks on cassette tape. That worked fine, but was even more tedious, since it only had a single screen buffer.

This resulted in the 4tH block editor reading an entire block file into a buffer and transferring blocks from this buffer to a single screen buffer. When a write command was given, the entire block file buffer was simply written back to disk. The size of this block file buffer was set to 64K - although for technical reasons it had to be much smaller in the 16-bit MS-DOS versions.

I rarely see a 32K+ 4tH source file - because 4tH code is quite dense - let alone somebody developing one in a block editor. So IMHO the size of the block file buffer is still sufficient for most users. In the meantime, another block file editor has been added: RetroEd - which does use the file-based ANS BLOCK wordset.

Are blocks still relevant?

Yes, not only because it's quite easy to create an editor for this format, but also because it has some unexpected advantages. It is quite possible to make 4tH programs multitask - with one important limitation: "all I/O has to be performed within a single context".

So what does that mean? It means that when you switch from one program to another, all files have to be closed. You don't have to monitor that - 4tH will do it for you, since its VM ultimately controls all I/O. So, if you have an open file and issue PAUSE, 4tH will close it. You will get control back at some point, no matter what - but your file handle is gone. Invalidated. Welcome, I/O errors.

Blocks, however, work differently. Suppose you've opened a block file using the Sourceforge API. You may think you've got an open file, but all you've got is one or more block buffers and a block file name. When you request a block, the file is opened and the correct block is loaded into one of the buffers. Then the file is closed again.

That's an operation you can easily perform within one single context. That's also the main reason why 4tsh scripts are written in block files - apart from the fact that loops and conditionals are much easier to implement when the entire source is already in memory.

Blocks are also a great idea for data files - since frequently used blocks are efficiently buffered. And if your record size is a power of two, records fit neatly into a 1K block - some padding may be required, though.
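
E.g. with 64-byte records, exactly 16 of them fit a block; a sketch, assuming block numbers start at 1:

    : record ( n -- c-addr )          \ address of record n, 64 bytes each
      16 /mod                         \ 16 records per block
      1+ block                        \ blocks counted from 1 (assumption)
      swap 64 chars * + ;

After modifying a record, an UPDATE marks the block dirty - the buffer mechanism takes care of the rest.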

So don't dismiss blocks as an ancient, outdated technique. Once you dive into it, it proves to be a very elegant way to store data - and IMHO it should be used more often. And 4tH has got the tools for it..