[Nebula-Discuss] '4CCO' (four character codes) versus Symbols

Hi guys,

I'd like to share some details about an extension we have made to the
Nebula 2 engine kernel, to start some discussion and get some
comments. Basically it is the replacement of four character codes
(fourcc or 4cc) with symbols. I'll introduce the concepts with
some examples and details about the implementation. BTW, this is a
copy of a post from my blog, which I started recently,
http://sharedming.blogspot.com, although there is not much to see there
yet.

A fourcc is basically a very efficient way to represent a string. It
is small (just the size of an integer variable), and comparing
fourccs is much faster than comparing strings. An additional advantage is
that it can be used as a hash key for fast lookups. Those
are the advantages, but what are the drawbacks? Just one: it makes
programming more cumbersome. Some examples of fourccs are 'SCPN' for
"SetCompanyName" and 'GCRS' for "GetCurrentState".

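To make the size and comparison argument concrete, here is a minimal
sketch of how a fourcc can be packed into a 32-bit integer. The names
MakeFourCC and FourCCToString are hypothetical, chosen for this example
only; they are not the Nebula API, and the byte order (first character
in the highest byte) is an assumption.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not Nebula API): pack four characters into one
// 32-bit integer, first character in the most significant byte.
constexpr std::uint32_t MakeFourCC(char a, char b, char c, char d)
{
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8)  |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// Hypothetical helper: unpack the four characters again, e.g. for logging.
inline std::string FourCCToString(std::uint32_t code)
{
    std::string text(4, ' ');
    text[0] = static_cast<char>((code >> 24) & 0xFF);
    text[1] = static_cast<char>((code >> 16) & 0xFF);
    text[2] = static_cast<char>((code >> 8) & 0xFF);
    text[3] = static_cast<char>(code & 0xFF);
    return text;
}

// Comparing two fourccs is a single integer comparison.
static_assert(MakeFourCC('S', 'C', 'P', 'N') != MakeFourCC('G', 'C', 'R', 'S'),
              "different codes give different integers");
```

Note that the code itself says nothing about "SetCompanyName": that
relation still has to be registered and remembered separately.
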
I agree that fourccs are efficient and I'd like to keep that somehow,
but they make the programmer's life more difficult: they force us to
write more code than needed, which makes the resulting code harder
to read and maintain. "Less code, better code". Let's look at what a
programmer needs to do in order to use fourccs:

* Create a fourcc for each string it represents (and remember it).
* Register somehow the relation between the fourcc and its string.
* Write two versions of each method, one accessed by string
(slower) and another accessed by fourcc (faster but harder to use and
read).

There are several examples of these in the Nebula 2 code base, like the
command names and the signal names. And there are even more examples
in Nebula 3, like the class names and the attributes.

Let's get to the point. A symbol is basically a constant
string, a string that does not change during the runtime of the
application. Using symbols, the programmer just has to remember one
string and code one version of the function. That's all: less work and,
more importantly, code that is easier to read and therefore to maintain.

Let's see some examples of usage to clarify it:

void IncIntAttribute(nSymbol attributeName)
{
  int val = this->GetIntAttribute(attributeName);
  val++;
  this->SetIntAttribute(attributeName, val);
}

obj->RegisterIntAttribute( NS(LoopCount) );
obj->SetIntAttribute( NS(LoopCount), 0 );
obj->IncIntAttribute( NS(LoopCount) );

Note: the macro NS(XXX) is a preprocessor macro that does some magic
to convert the parameter XXX into an actual value (NS is a shortcut
for NEBULASYMBOL). Actually this is the hard part of the system, but
it can be done since symbols are known at compile time.

Implementation details:

* There is a preprocessor macro NS(XXX) which basically maps into a
C++ preprocessor define. These defines are generated automatically in
a process explained later on. For example NS(LoopCount) translates
into the preprocessor define NSYMBOLID_LoopCount (which maps to an
integer).

#define NS(XXX) NSYMBOLID_ ## XXX

* There is a nSymbolId type which is basically a typedef of an int.
This is the same size as a fourcc.

* There is a nSymbol C++ object, which wraps a nSymbolId and provides
some handy functions to do conversions to and from strings and fast
symbol comparison. Passing nSymbol and nSymbolId as function arguments
is as efficient as with fourccs.

* How to calculate the symbol id? Any mapping from a string to
an integer can be used. But one property must be enforced: it must
always give the same value for the same symbol, in any source file. That's
why we use the CRC (Cyclic Redundancy Check) algorithm, which could
in theory produce collisions (two different symbols given the
same id), but this has never happened in practice. If it
does happen, we detect it and warn the programmer.
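
As an illustration of the id calculation and the collision check, here
is a minimal sketch. The post does not say which CRC variant is used,
so the standard CRC-32 (reflected polynomial 0xEDB88320) is an
assumption here, and SymbolCrc32 and SymbolIdRegistry are hypothetical
names made up for this example.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Assumption: standard CRC-32 (reflected polynomial 0xEDB88320), computed
// bit by bit. The CRC variant actually used is not stated in the post.
inline std::uint32_t SymbolCrc32(const std::string& name)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : name)
    {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit)
        {
            // If the lowest bit is set, shift and apply the polynomial.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical generator-side registry that remembers which name owns
// each id, so that a collision can be reported to the programmer.
class SymbolIdRegistry
{
public:
    // Returns false if a different name already owns this id.
    bool AddWithId(std::uint32_t id, const std::string& name)
    {
        auto result = owners.emplace(id, name);
        return result.second || result.first->second == name;
    }

    // Computes the id from the name and registers it.
    bool Add(const std::string& name)
    {
        return AddWithId(SymbolCrc32(name), name);
    }

private:
    std::map<std::uint32_t, std::string> owners;
};
```

A build step could call Add() for every symbol it finds and print a
warning whenever it returns false.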

* When to calculate the symbol id? It can be done in several ways,
but basically it has to happen after the source code is written
and before it is compiled. It could be a pre-compile build
step, or part of the build system of Nebula. This process
generates an include file, common to the whole target, which contains all
the symbols used in the target, for example:

#define NSYMBOLID_CLASS 2819245958

#define NSYMBOLID_nroot 4018013252
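
A pre-compile step of this kind could be sketched roughly as follows.
This is only an illustration under assumptions: CollectSymbols,
GenerateSymbolHeader and SymbolHashFn are hypothetical names, the scan
is a simple regular expression over the source text (it does not skip
comments or string literals), and the hash function is passed in so
that any string-to-integer mapping can be plugged in.

```cpp
#include <cstdint>
#include <regex>
#include <set>
#include <sstream>
#include <string>

// Hypothetical signature for the string-to-id mapping (e.g. a CRC).
typedef std::uint32_t (*SymbolHashFn)(const std::string&);

// Collect every distinct identifier used as NS(identifier) in the text.
inline std::set<std::string> CollectSymbols(const std::string& source)
{
    static const std::regex pattern(R"(\bNS\(\s*([A-Za-z_][A-Za-z0-9_]*)\s*\))");
    std::set<std::string> symbols;
    for (std::sregex_iterator it(source.begin(), source.end(), pattern), end;
         it != end; ++it)
    {
        symbols.insert((*it)[1].str());
    }
    return symbols;
}

// Emit one "#define NSYMBOLID_<name> <id>" line per symbol.
inline std::string GenerateSymbolHeader(const std::set<std::string>& symbols,
                                        SymbolHashFn hash)
{
    std::ostringstream out;
    for (const std::string& name : symbols)
    {
        out << "#define NSYMBOLID_" << name << " " << hash(name) << "\n";
    }
    return out.str();
}
```

The build system would run this over all the source files of a target
and write the result into the common include file.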

* Additionally, we have a symbol table, which maps from
nSymbolId to a C string. So when the string has to be recovered from
the symbol there is a small penalty, although this operation is not
normally needed (and in some systems it is kept only as a debug
feature). There is also an autogenerated C++ file (generated in the
same build process) which does the automatic registration of the
symbol ids and symbol strings.
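
To show how these pieces could fit together, here is a minimal sketch
of the symbol table, the registration code and the nSymbol wrapper.
Only the names nSymbolId, nSymbol, NS and the NSYMBOLID_ defines come
from the description above; SymbolTable, RegisterGeneratedSymbols and
all member functions are hypothetical. The typedef is unsigned here
(an assumption) so that the sample ids fit in 32 bits.

```cpp
#include <string>
#include <unordered_map>

// Assumption: unsigned, so that ids such as 4018013252 fit in 32 bits.
typedef unsigned int nSymbolId;

// Hypothetical symbol table: maps an id back to its string.
class SymbolTable
{
public:
    static SymbolTable& Instance()
    {
        static SymbolTable table;
        return table;
    }

    void Register(nSymbolId id, const char* name) { names[id] = name; }

    // Returns an empty string for ids that were never registered.
    const char* Lookup(nSymbolId id) const
    {
        auto it = names.find(id);
        return it != names.end() ? it->second.c_str() : "";
    }

private:
    std::unordered_map<nSymbolId, std::string> names;
};

// What the generated include file might contain ('u' suffix added here).
#define NSYMBOLID_CLASS 2819245958u
#define NSYMBOLID_nroot 4018013252u
#define NS(XXX) NSYMBOLID_ ## XXX

// What the generated registration file might do (hypothetical function).
inline void RegisterGeneratedSymbols()
{
    SymbolTable::Instance().Register(NS(CLASS), "CLASS");
    SymbolTable::Instance().Register(NS(nroot), "nroot");
}

// Sketch of the wrapper: comparison is an integer comparison, and the
// string is only looked up when somebody asks for it.
class nSymbol
{
public:
    nSymbol(nSymbolId symbolId) : id(symbolId) {}

    bool operator==(const nSymbol& other) const { return id == other.id; }
    bool operator!=(const nSymbol& other) const { return id != other.id; }

    nSymbolId GetId() const { return id; }
    const char* GetString() const { return SymbolTable::Instance().Lookup(id); }

private:
    nSymbolId id;
};
```

Because the constructor is implicit, a function taking a nSymbol can be
called directly with NS(CLASS), as in the usage examples earlier.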

cheers
  Mateu