
You can't do string handling in Forth!

thebeez


That's one complaint about Forth you'll often see on blogs and forums. And I see where these people are coming from. There are some strange omissions in the ANS-Forth 94 standard.

First of all, although I will try to refrain from rumors and hearsay as much as I can, some things cannot be explained by merely referring to the ANS-Forth 94 standard. But I'll try to do my best to indicate these explanations properly.

Counted strings

C popularized the "zero terminated string", but before C other languages - like Pascal - were popular. And the internal representation of the traditional Pascal string is identical to Forth's "counted string".

So what is a "counted string"? A counted string is essentially divided into two parts. First the "count byte" - which contains the length of the string - and then the string itself. Problem is: if you want to handle strings that are longer than 255 characters, you're in big trouble. A zero terminated string doesn't suffer from that limitation, since it's delimited by a null byte. Which means you can't define any strings containing that character, of course, and still expect its standard string handling functions to work properly.

Both implementations have their strengths and weaknesses. If you want the length of a counted string, just get the "count byte". If you want to know the size of a zero terminated string, you've got to count all the characters one by one until you reach the null byte.
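A minimal sketch of that layout in standard Forth, assuming a system with 8-bit characters. The name GREETING is made up for this example; the string is built by hand so the count byte is visible.

```forth
\ A counted string: one count byte followed by the characters.
create greeting  5 c,  char H c,  char e c,  char l c,  char l c,  char o c,

greeting c@ .          \ fetch the count byte directly: prints 5
greeting count type    \ COUNT ( c-addr -- c-addr+1 u ), then TYPE: prints Hello
```

Getting the length is a single fetch; there is no scan for a terminator.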

If you want to shorten a counted string, all you have to do is change the count byte. In C you have to get your hands dirty, because you have to poke in that null byte somewhere - and irretrievably destroy its original contents in the process. If you want to skip the first few characters, you have to move the remaining string part of a counted string to lower memory and adjust the count byte accordingly - again: losing the original string. In C, you just advance the string pointer past them.
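Both operations on a counted string can be sketched as follows. SAMPLE and SKIP-LEADING are hypothetical names, not standard words, and the sketch assumes the caller never skips more characters than the string holds.

```forth
\ Drop the first n characters of the counted string at c-addr.
: skip-leading ( n c-addr -- )
  >r
  r@ count rot /string    \ ( c-addr' u' ) the part to keep
  dup r@ c!               \ write the new length to the count byte
  r> char+ swap move ;    \ slide the kept characters to the front

create sample  7 c,  char F c, char o c, char r c, char t c,
                     char h c, char e c, char r c,     \ "Forther"

5 sample c!               \ shorten: only the count byte changes
sample count type         \ prints Forth
2 sample skip-leading     \ skip "Fo": the characters are moved down
sample count type         \ prints rth
```

Shortening leaves the old characters in memory; skipping overwrites them, which is the loss the paragraph above describes.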

Both forms have problems with appending strings - since you can't figure out that easily what the original size of its allocation was. Yes - I know of sizeof(). But often you just get the size of the string pointer - which is not particularly helpful. With Forth - you're on your own. But that aside - with a zero terminated string you let the new string overwrite the original null byte - and presto: you're done. With a counted string you have to figure out what the size of the new string is (hopefully it's not over 255 characters) and write that back to the count byte.

So are counted strings better? As always - it depends on the situation. But IMHO a maximum length of 255 characters is rather limiting in the 64-bit era. The maintainers of the Pascal standard figured the same, and a whole slew of new string types was added. Which is rather confusing, but at least you do have a choice.

ANS-Forth 94

In ANS-Forth 94 a new string type was added, the so-called "address/count" representation. Here, the "address" represented the address of the string and the "count" the length of the string. Since the "count" was a full cell wide, the 255 characters limit was no longer a problem.

Or was it? Don't get me wrong: the "address/count" representation was the best thing to happen to Forth strings since the invention of sliced bread, but there was one thing missing: how do you store an "address/count" string? Sure, you could store the two values using 2! - but not the string itself. And to this very day, there is no standard way to store a string. Not even a counted string. You can only convert a counted string to an "address/count" representation using COUNT.
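To make the gap concrete, here is a small sketch using only standard words. SAVED and REMEMBER are names invented for the example.

```forth
2variable saved     \ room for two cells: an address and a count

\ 2! saves the address and the count, not the characters themselves.
: remember ( -- )  s" hello" saved 2! ;

remember
saved 2@ type       \ prints hello
```

This only works because the characters compiled by S" stay where they are. Nothing was copied, so SAVED cannot hold on to a string that lives in a buffer which is about to be reused.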

Even worse, what COUNT does is fetch the contents of a byte and increment the pointer. That's the exact same definition as another unofficial, but widely used Forth word named C@+. And people do use it that way. Not that I agree. IMHO, if you see COUNT you'd expect it to operate on a counted string. You don't expect to see it in a tight loop processing a binary string containing BCD digits.
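A sketch of that usage, assuming a byte-addressed system where the count is a single byte. C@+ is not a standard word, and SUM-BYTES and BCD are made-up names.

```forth
\ On such a system C@+ and COUNT are the same operation.
: c@+ ( c-addr -- c-addr+1 char )  count ;

\ Add up u raw bytes, with COUNT used as "fetch byte and advance".
: sum-bytes ( c-addr u -- sum )
  0 swap 0 ?do      \ ( c-addr sum )
    swap count      \ ( sum c-addr+1 byte )
    rot +           \ ( c-addr+1 sum' )
  loop nip ;

create bcd  1 c, 2 c, 3 c,
bcd 3 sum-bytes .   \ prints 6
```

Nothing in SUM-BYTES involves a counted string, which is exactly why COUNT reads oddly there.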

So why? Well, here we come to the point where oral tradition takes precedence over what the standard actually says. In 1994 it wasn't clear which character representation would become dominant. It could be UTF-8. Or UTF-16. For this reason, the word CHARS was introduced - since there was a possibility that in the near future a character would take up several address units.

In 2010 it became clear that UTF-8 was the winner - and hence the XCHAR wordset was accepted. As a consequence the difference between "address unit" and "character" was completely eradicated with the "1 CHAR=1" proposal. That is in alignment with the C-standard and in practice not much of a problem. Not even when you're using EBCDIC.

But while there were plenty of developments in the string department, there was still no "official" way to store strings. Which is pretty ridiculous when you think about it. And although several prominent members of the Forth community declared the counted string "almost obsolescent", it was still the only way to represent a full string that was formally supported by the ANS-Forth 94 standard.

Consequences

To this day, many Forth programs are still limited to strings of at most 255 characters. Some implementers have made a variant on the counted string by e.g. storing a "count cell" instead of a "count byte". Others, like 4tH, have simply embraced the "zero terminated string". For 4tH it was an easy choice, since it has close ties with C.

Some have developed a whole slew of new "string words" to allow the use of these new types. So there you'll find words like ZCOUNT. Others, like 4tH, assume that programs are well written and properly abstract the use of COUNT and the like. They run into trouble, however, if programs assume "carnal knowledge" of strings - like using C@ to get the length of the string. And they themselves often "forget" that the assumption that the address of the string is equivalent to the start of their "zero terminated string" is only valid within their own implementation. That'll teach me to use COUNT properly.

And let's not forget the abuse of COUNT as an "unofficial" equivalent of C@+. Which will run haywire in any of these non-standard implementations. It's even "featured" in official example implementations (sic!).

And yes, there is still no official way to store or append a string. The use of PLACE and +PLACE (as proposed by Wil Baden several years ago) is widespread - even in 4tH. But it's not part of any standard or proposal. If you don't think that's weird - I think it is.
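For readers who haven't met them, this is roughly what these two words look like when written for classic counted strings. Treat it as a sketch of the commonly seen definitions, not as the text of any proposal: it checks neither the 255 character limit nor the size of the destination buffer. BUF is a made-up name.

```forth
\ Store an address/count string at c-addr2 as a counted string.
: place ( c-addr1 u c-addr2 -- )
  2dup 2>r              \ remember the count and the destination
  char+ swap move       \ copy the characters behind the count byte
  2r> c! ;              \ then write the count byte

\ Append an address/count string to the counted string at c-addr2.
: +place ( c-addr1 u c-addr2 -- )
  dup >r count +        \ ( c-addr1 u end ) first free character
  swap dup >r move      \ copy the new characters there
  r> r@ c@ + r> c! ;    \ add u to the old count byte

create buf 256 allot
s" Hello" buf place
s" , world" buf +place
buf count type          \ prints Hello, world
```

On a system that ships its own PLACE and +PLACE, loading this simply redefines them.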

So what's against all that? Well, the old mantra of "counted strings are almost out" still rules. And because they're "almost out", there is no need to discuss any proposal. And that's how the world goes round.

So what's the solution?

IMHO the solution would be to abstract strings out of existence. We already made great progress by making the "address/count" representation standard; all we need to do now is offer the multitude of "custom" string implementations a word set to abstract the operations on them. There are three operations I consider "essential":

  1. Storing a string from an "address/count" representation;
  2. Appending a string to a stored string from an "address/count" representation;
  3. Converting a stored string to an "address/count" representation.
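The three operations listed above can be sketched on top of one possible stored format, a count cell followed by the characters. The names S!, S+! and S@ and their stack effects are assumptions made for this example, STR1 is a made-up buffer, and the caller has to make sure the buffer is large enough.

```forth
\ Stored string = one count cell, then the characters.
: s@  ( s-addr -- c-addr u )      \ 3. stored string -> address/count
  dup cell+ swap @ ;

: s!  ( c-addr u s-addr -- )      \ 1. store an address/count string
  2dup 2>r cell+ swap move        \ copy the characters
  2r> ! ;                         \ then write the count cell

: s+! ( c-addr u s-addr -- )      \ 2. append an address/count string
  dup >r s@ +                     \ ( c-addr u end ) first free character
  swap dup >r move                \ copy the new characters there
  r> r> +! ;                      \ add u to the count cell

create str1  1 cells 64 chars + allot
s" abc" str1 s!
s" def" str1 s+!
str1 s@ type                      \ prints abcdef
```

A program that sticks to these three words never touches the count itself, so the same source would work unchanged if an implementation used a count byte or a terminating null instead.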

I know these are essential, because it's all I needed to implement a tiny dynamic string library. If you want to chop off trailing characters, you can do fine using an "address/count" representation. If you want to clip off leading characters, you can do that using /STRING. If you want to do anything else, you can compose that string by using one of the string storing functions above. So if you want to manipulate strings, you can get away with the "address/count" representation more often than you think. E.g. making a string upper or lower case.
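A short sketch of how far the "address/count" pair alone gets you, using only standard words (-TRAILING and /STRING come from the String word set, WITHIN from the Core extensions). UPCASE and MSG are names invented for this example, and UPCASE only handles 7-bit ASCII letters.

```forth
\ Convert the lower case ASCII letters of a string in place.
: upcase ( c-addr u -- )
  over + swap ?do
    i c@ dup [char] a [char] z 1+ within
    if 32 - i c! else drop then
  loop ;

create msg  char h c, char e c, char l c, char l c, char o c,
            char ! c, bl c, bl c,          \ "hello!  "

msg 8          \ ( c-addr u ) the whole buffer
-trailing      \ drop the trailing spaces: the count becomes 6
1-             \ chop off the "!" as well: the count becomes 5
2dup upcase    \ HELLO, changed in place
2 /string      \ clip off two leading characters
type           \ prints LLO
```

Apart from the case conversion, every step only changes the two numbers on the stack; no characters are moved.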

Another function which can be useful is C>S which converts a character into a (temporary) "address/count" string - further reducing the need for "carnal knowledge". "Temporary strings" are nothing new to Forth - ever heard of S"? I wonder if there is any serious Forth out there that doesn't offer any internal facilities for that.
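One way such a word could look, as a sketch only: it uses a single static one-character buffer (the hypothetical C>S-BUF), so the result is overwritten by the next call. A real implementation, 4tH's included, may well do this differently.

```forth
create c>s-buf  1 chars allot      \ hypothetical one-character buffer

\ Turn a character into a temporary one-character string.
: c>s ( char -- c-addr 1 )
  c>s-buf c!  c>s-buf 1 ;

char x c>s type                    \ prints x
```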

So, what's the score? There have been lots of discussions in the past - very much in line with this solution. Some people even proposed names for these words, like S!, S+! and S@. But none of these proposals ever even got close to being seriously discussed - let alone be voted upon. Some claim the discussion was over with the introduction of the XCHAR wordset - but I doubt that very much.

The XCHAR wordset simply isn't used that much - and I can understand that since it seems like overkill for most English language 7-bit ASCII programs. Let alone that it still lacks words for storing and retrieving strings. And it is badly integrated with the other word sets.

How does this affect 4tH?

Frankly - very little. 4tH does offer a full set of words to convert to and from an "address/count" representation. And since its internal format is identical to "zero terminated strings" there is no practical limit to the length of strings.

There are plenty of example programs that do "heavy duty" string handling. E.g. 4tH's own preprocessor is written in 4tH.

Does this mean that Forth's lousy reputation in this regard can be easily fixed? Frankly? Sure. All it needs is one single, clever proposal that can be easily adapted to the current custom implementations and we're there.

The real question is: is the Forth community that smart or does it value conservatism more?