Re: [Gramadoir-devel] [rkwtavdw@puknet.puk.ac.za: tokenization]

> In Afrikaans we have the word 'n (which is similar to the English word a)
>
> I tried adding the following rule to token-af.in:
> 'n:<L>
>
> but when I run gram-af.pl on a text that includes a 'n, it gives the error
> that n is unknown (for some reason it strips the '). I thought that the
> above rule should solve the problem... any ideas?

One thing I didn't mention earlier is that
the *order* of the rules in token-af.in is
very important. They are applied in sequence from
the top to the bottom. So it is important that
no earlier rule tokenizes the n on its own before
the rule for 'n gets a chance to apply.

With this in mind, it looks like the best place
to put your new rule would be just after
[A-Za-z-0-9][A-Za-z-'-]*[A-Za-z-0-9]:<c>

(which doesn't apply to n since it tokenizes only things of
length at least two) and just before

[A-Za-z-]:<c>

which handles the length one words.
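
To see why the position matters, here is a rough sketch in Python of
"apply each rule in turn to whatever text is still untokenized". It is
not the actual gramadoir code; the function name tokenize and the
simplified ASCII-only patterns are assumptions made for illustration.

```python
import re

def tokenize(text, rules):
    """Apply each (pattern, tag) rule in order to the still-untagged text."""
    segments = [(text, None)]              # tag None means "not yet tokenized"
    for pattern, tag in rules:
        regex = re.compile(pattern)
        updated = []
        for chunk, chunk_tag in segments:
            if chunk_tag is not None:      # already claimed by an earlier rule
                updated.append((chunk, chunk_tag))
                continue
            pos = 0
            for m in regex.finditer(chunk):
                if m.start() > pos:        # keep the untouched text before the match
                    updated.append((chunk[pos:m.start()], None))
                updated.append((m.group(), tag))
                pos = m.end()
            if pos < len(chunk):           # keep the untouched text after the last match
                updated.append((chunk[pos:], None))
        segments = updated
    return [(chunk, tag) for chunk, tag in segments if tag is not None]

# Simplified stand-ins for the rules discussed above (assumed, ASCII only).
WORD = r"[A-Za-z0-9][A-Za-z'-]*[A-Za-z0-9]"   # words of length at least two
LETTER = r"[A-Za-z]"                          # words of length one
APOS_N = r"'n"                                # the new Afrikaans rule

good = [(WORD, "c"), (APOS_N, "L"), (LETTER, "c")]   # 'n rule before single letters
bad = [(WORD, "c"), (LETTER, "c"), (APOS_N, "L")]    # 'n rule too late

print(tokenize("'n boek", good))
print(tokenize("'n boek", bad))
```

With the "good" order the text comes out as 'n and boek. With the "bad"
order the single-letter rule grabs the n first, so the 'n rule never
sees it and the apostrophe is left behind.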

Scottish Gaelic has a similar situation with "a'":
http://cvs.sourceforge.net/viewcvs.py/gramadoir/gd/token-gd.in?view=markup

Scanner generators like flex work a bit differently.
They will find the rule that matches the most text possible,
even if it comes near the end of the list of rules; the order
only matters when two rules match the same amount of text.
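
For comparison, a rough Python sketch of the longest-match strategy.
This only imitates the idea; it is not how flex is implemented, and
the function name longest_match_tokens and the simplified ASCII-only
patterns are assumptions made for illustration.

```python
import re

def longest_match_tokens(text, rules):
    """At each position, take the rule that matches the most text."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for regex, tag in compiled:
            m = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie in length.
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, tag)
        if best is None:
            pos += 1                       # skip a character no rule accepts
        else:
            tokens.append((best[0].group(), best[1]))
            pos = best[0].end()
    return tokens

# Deliberately "wrong" order: the single-letter rule comes first.
RULES = [
    (r"[A-Za-z]", "c"),                            # words of length one
    (r"'n", "L"),                                  # Afrikaans 'n
    (r"[A-Za-z0-9][A-Za-z'-]*[A-Za-z0-9]", "c"),   # words of length at least two
]

print(longest_match_tokens("'n boek", RULES))
```

Even with the single-letter rule listed first, boek is taken whole by
the last rule because that match is longer, and 'n is matched as a
unit because it is the only rule that matches at the apostrophe.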

Kevin