jjmeric - 2013-06-12

Hi there! I'm new to the list and almost as new to Hunspell. Apologies in advance, and please be tolerant of my learning curve...

We are working on a spell checker for Bambara (local name Bamanan), a language spoken in West Africa across several countries, mainly Mali, Burkina Faso, Guinea and Ivory Coast. Extensive work on the language has been done in recent years in St Petersburg and here at INALCO in Paris, culminating in the ongoing building of a corpus, see http://cormand.tge-adonis.fr/
A spell checker is a logical outcome of that work.

We are quite happy with our first attempt: the results so far are very promising. A lot more work is needed on affixes, but it is already very usable as it is. It even supports tones (Bambara is a tonal language).

The big missing part so far is compound words, and this is no small detail, since noun and verb composition is a very productive process in Bambara.

We understand that Bambara is probably no more difficult in this respect than German, Hungarian or Quechua. But we do need initial guidance from the experts!

We had a look at COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND and COMPOUNDPERMITFLAG. They look fine, but they also look not strict enough for our needs, which I'll try to explain in my own words:
We have a limited list of composition patterns, or profiles. (My understanding is that COMPOUNDPATTERN is something different from this notion; you may prove me wrong!)
For instance, for nouns, the patterns are as follows.
Code: N=noun, ADJ=adjective, PP=affix, V=verb, NUM=numeral; "yɛrɛ" means "self"
N + N
N + ADJ + N
N + V
yɛrɛ + V
yɛrɛ + V + N
N + V + N
V + N
N + PP + N
N + PP + V

As you can see, other profiles such as V+PP+V or V+PP+N are not to be accepted!
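
To make the requirement concrete, here is a minimal Python sketch of the behaviour we are after. It is not Hunspell syntax and says nothing about how Hunspell would implement it; the names ALLOWED_PROFILES and is_allowed are made up for illustration, and it assumes each component of a compound has already been tagged with its word class.

```python
# Illustrative only: not Hunspell syntax, all names here are hypothetical.
# Each profile is a sequence of word-class tags, copied from the list above.
ALLOWED_PROFILES = {
    ("N", "N"),
    ("N", "ADJ", "N"),
    ("N", "V"),
    ("yɛrɛ", "V"),
    ("yɛrɛ", "V", "N"),
    ("N", "V", "N"),
    ("V", "N"),
    ("N", "PP", "N"),
    ("N", "PP", "V"),
}


def is_allowed(profile):
    """Return True only if the tag sequence is one of the listed profiles."""
    return tuple(profile) in ALLOWED_PROFILES


if __name__ == "__main__":
    print(is_allowed(["N", "PP", "V"]))  # a listed profile
    print(is_allowed(["V", "PP", "V"]))  # not listed, so it must be refused
```

In other words, the check is a closed whitelist of whole sequences, not a set of independent rules about what may come first, in the middle, or last.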

How have you approached this, or how would you approach it?

Thanks for your help, and thanks for taking the time to read this (off-topic comments or questions are welcome as well!)