Re: [Gramadoir-devel] [rkwtavdw@puknet.puk.ac.za: tokenization]
From: Kevin S. <sca...@sl...> - 2005-09-22 14:32:32
> In Afrikaans we have the word 'n (which is similar to the English word a)
>
> I tried adding the following rule to token-af.in:
> 'n:<L>
>
> but when I run gram-af.pl on a text that includes a 'n, it gives the error
> that n is unknown (for some reason it strips the '). I thought that the
> above rule should solve the problem... any ideas?

One thing I didn't mention earlier is that the *order* of the rules in
token-af.in is very important. They are applied in sequence from the
top to the bottom. So it is important that you don't tokenize the n
alone before you include the rule for 'n.

With this in mind, it looks like the best place to put your new rule
would be just after

[A-Za-z-0-9][A-Za-z-'-]*[A-Za-z-0-9]:<c>

(which doesn't apply to n since it tokenizes only things of length at
least two) and just before

[A-Za-z-]:<c>

which handles the length one words.

Scottish Gaelic has a similar situation with "a'":
http://cvs.sourceforge.net/viewcvs.py/gramadoir/gd/token-gd.in?view=markup

Scanner generators like flex work a bit differently. They will find the
rule that matches the most text possible, even if it comes near the end
of the list of tokens.

Kevin
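
To make the ordering point concrete, here is a rough Python sketch of the two behaviours described above. It is only an illustration and not how gramadoir itself is implemented. The regular expressions are simplified ASCII-only stand-ins for the real rules, and the function names and the sample sentence are made up for the example.

```python
import re

# Simplified ASCII-only stand-ins for the rules discussed above (assumptions,
# not the actual contents of token-af.in).
WORD2 = (r"[A-Za-z0-9][A-Za-z'-]*[A-Za-z0-9]", "c")  # words of length >= 2
WORD1 = (r"[A-Za-z]", "c")                           # one-letter words
APOS_N = (r"'n", "L")                                # the new Afrikaans rule


def tokenize_in_sequence(text, rules):
    """Apply each rule, top to bottom, to whatever text is still untagged."""
    segments = [(text, None)]  # (chunk, tag); tag None means not yet tokenized
    for pattern, tag in rules:
        next_segments = []
        for chunk, old_tag in segments:
            if old_tag is not None:
                # Already claimed by an earlier rule; later rules cannot touch it.
                next_segments.append((chunk, old_tag))
                continue
            pos = 0
            for m in re.finditer(pattern, chunk):
                if m.start() > pos:
                    next_segments.append((chunk[pos:m.start()], None))
                next_segments.append((m.group(), tag))
                pos = m.end()
            if pos < len(chunk):
                next_segments.append((chunk[pos:], None))
        segments = next_segments
    # Keep only the tagged tokens; leftovers (spaces, stray ') are dropped here.
    return [(chunk, tag) for chunk, tag in segments if tag is not None]


def tokenize_longest_match(text, rules):
    """At each position pick the rule that matches the most text (flex-like)."""
    compiled = [(re.compile(p), tag) for p, tag in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for regex, tag in compiled:
            m = regex.match(text, pos)
            # Strictly longer wins, so ties go to the earlier rule.
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, tag)
        if best is None:
            pos += 1  # skip characters that no rule matches
            continue
        tokens.append((best[0].group(), best[1]))
        pos = best[0].end()
    return tokens


text = "dit is 'n boek"                 # hypothetical sample sentence
bad_order = [WORD2, WORD1, APOS_N]      # one-letter rule before the 'n rule
good_order = [WORD2, APOS_N, WORD1]     # 'n rule placed as suggested above

print(tokenize_in_sequence(text, bad_order))
print(tokenize_in_sequence(text, good_order))
print(tokenize_longest_match(text, bad_order))
```

With the one-letter rule ahead of the 'n rule, the sequential version tags n on its own and the apostrophe is left over, which matches the symptom reported. Moving the 'n rule up fixes it, while the longest-match version gives 'n as a single token in either order.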