Hello

I'm trying to make an improved version of the Danish pattern file for Hyphen, but I have run into a possible bug (or feature).

I typed some strong splitting rules by hand, and then turned PatGen loose to find the rest.

One of my rules was:
8m7m8

PatGen made this rule:
umme4r

The output from ../hyphen-2.8.3/example -n -d hyph_da_DK_2013.dic testword.txt:
0870300000
tyg=ge=gummi
 - tyg=gegummi
 - tygge=gummi

The same test with only these two rules in the pattern file:
0000000000
tyggegummi

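As you can see, umme4r overrides 8m7m8, even though umme4r does not match the word: tyggegummi contains "ummi", not "ummer". I expected the 7 between the two m's to give tyggegum=mi.

Is this a bug? Or must every pattern be unique, with none contained in another?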


Kind regards

Esben Aaberg



Another example, ready to cut and paste:

hyph_da_DK_test.dic:
UTF-8
LEFTHYPHENMIN 1
RIGHTHYPHENMIN 1
m1m
um2me

test.txt:
xxxmmxxx
tyggegummi
gummilakker
gummilagger
gummi
flummi
flummy
lomme
lumme
ummr

Result:
00010000
xxxm=mxxx
 - xxxm=mxxx
0000000000
tyggegummi
00000000000
gummilakker
00000000000
gummilagger
00000
gummi
000000
flummi
000000
flummy
00100
lom=me
 - lom=me
00200
lumme
0000
ummr
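
For comparison, here is a minimal Python sketch of the result I would expect from plain Liang-style pattern matching (every pattern whose letters occur in the word is applied, the highest digit wins at each position, and odd values allow a split). This is not Hyphen's actual implementation; the function names parse_pattern and hyphenate are made up for illustration, and the defaults assume LEFTHYPHENMIN 1 and RIGHTHYPHENMIN 1 as in the test file above.

```python
import re


def parse_pattern(pattern):
    """Split e.g. 'um2me' into letters 'umme' and gap values [0, 0, 2, 0, 0]."""
    letters = re.sub(r"[0-9]", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0  # letters seen so far, i.e. the index of the current gap
    for ch in pattern:
        if ch in "0123456789":
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns, left_min=1, right_min=1):
    """Apply every matching pattern; the highest value wins, odd values split."""
    padded = "." + word + "."  # '.' marks the word boundaries, as in TeX patterns
    levels = [0] * (len(padded) + 1)  # levels[k] = value of the gap before padded[k]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:  # visit every occurrence, including overlapping ones
            for offset, value in enumerate(values):
                levels[start + offset] = max(levels[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces = []
    for i, ch in enumerate(word):
        allowed = max(left_min, 1) <= i <= len(word) - max(right_min, 1)
        if allowed and levels[i + 1] % 2 == 1:
            pieces.append("=")
        pieces.append(ch)
    return "".join(pieces)


for w in ["tyggegummi", "gummi", "lomme", "lumme", "ummr"]:
    print(hyphenate(w, ["m1m", "um2me"]))
# tyggegum=mi
# gum=mi
# lom=me
# lumme
# um=mr
```

Under these assumptions only lumme is blocked by um2me, since it is the only test word that contains "umme". The same sketch with the rules 8m7m8 and umme4r gives tyggegum=mi, which is what I expected from the first test.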