From: <ne...@gy...> - 2005-08-24 23:17:17
|
Bram Moolenaar <Br...@mo...> wrote:

> Laci -
>
> I now took some time to look through the hunspell code (version 1.0.8) and the manual page. Following are my remarks.
>
> Please don't take these remarks too negatively! I know developing code like this is difficult and requires a lot of knowledge. I have a lot of experience with portable C programming, not much with specific languages, esp. Hungarian. My remarks are aimed at a definition of a generic affix file format, which can be used by many spelling tools. That has different requirements than making a Hungarian spell checker. I also think that my experience with various alternatives helps you to avoid going in the wrong direction.

Dear Bram,

Your work is fantastic! I am happy to see your great solutions! Thank you for this and for your wonderful help. In the near future I'd like to improve Hunspell with your help (extend/integrate/replace Hunspell with your work). Great! It is a gift for my development.

> Feel free to respond on each item separately. This has grown way too long!
>
> BTW, the maillist at Yahoo appears to be dead. Is there a new one?

I have moderated your letter, because magyarispell.yahoogroups.com is a mailing list about Hungarian dictionary specifics, conducted in Hungarian. I'm very sorry that I haven't responded to your letter yet. I have just created a mailing list on Sourceforge. I will post your letter and this one to hun...@li... when it is created, today or tomorrow.

> > Using a hash table
>
> It appears you have run into a lot of trouble isolating a word. I have had the same problem in Vim when I was using a hash table. Since then I have switched to using a trie. Then it's not necessary to first locate or guess the end of the word. This makes all the code a lot simpler, especially for making suggestions and for compound words. You can also use words that end in a dot, e.g. "etc.", or include a space, with no extra effort.

Sounds very good to me.
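The point about not needing to isolate the word first can be illustrated with a minimal trie sketch (an illustration only, not Vim's actual data structure): the trie itself reports every position where a stored word ends, so words like "etc." need no special handling.

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def matches_at(self, text, start):
        """Yield the length of every word that matches at text[start:].
        The trie tells us where each word ends, so we never have to
        guess the end of the word first -- "etc." works with no effort."""
        node = self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            if node.is_word:
                yield i - start + 1

trie = Trie(["etc.", "et", "cetera"])
print(list(trie.matches_at("etc. and more", 0)))  # both "et" and "etc." match
```

With a hash table the caller must first decide where the word ends before it can look anything up; here the lookup and the end-of-word detection are one and the same walk.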
For multiple affix stripping, a trie would be a much better format. I have one small problem with the trie, though: we need a morphological analyser for grammar checking and for suggesting synonyms with affixes. For morphological analysis the trie must be extended with state information (roots and morphemes with morphological descriptions). That seems a little difficult to me, because we also need to handle root and affix homonyms. In addition, for morphological derivation we need transducers, or some less elegant data structure (linear affix search in the trie data).

> When you go into making suggestions you will find that the hash table is making it nearly impossible to find words with more than one insert/delete/swap/replace edit operation. The trie I'm using makes this possible. I can't say the code is simple, but I've already written it and it works very well for all languages. Currently suggestions with up to three or four edit operations are found. This can be tuned, it's a trade-off with speed, not a limitation of the mechanism.

Kevin Hendricks wrote an n-gram suggestion code; I have now extended it with a refinement based on the longest common subsequence algorithm. For example, it works well for foreign name suggestions (Montesquo -> Montesquieu) and for non-neighbouring differences (permenant -> permanent). But this function would also work well with a trie.

> When trying to locate word breaks to check for compound words you run into trouble. You need to try every position to make sure you don't miss a possible compounding. With the trie you only need to try at the end of a recognized word. That is a huge speed increase, esp. when compounding more than two words. Perhaps this is also a solution for Thai without spaces between words.

I think Hunspell's left-to-right recursive compound check algorithm is similar. When the left word is bad, Hunspell doesn't check the right word(s). (But Hunspell calls a lot of functions when it checks substrings, so you are right.)
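The left-to-right recursive compound check can be sketched like this (a simplified sketch assuming a plain word set; the real Hunspell code also applies affixes, compound flags, and syllable limits):

```python
def check_compound(word, dictionary, min_part=3, max_words=None, _depth=1):
    """Return True if `word` splits left-to-right into dictionary words.
    When the left part is not a word we never recurse into the right
    part, which prunes most of the search space."""
    if word in dictionary:
        return True
    if max_words is not None and _depth >= max_words:
        return False
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in dictionary and check_compound(
                right, dictionary, min_part, max_words, _depth + 1):
            return True
    return False

words = {"tomato", "bicycle", "soup", "shop"}
print(check_compound("tomatosoup", words))   # True
print(check_compound("bicyclesoap", words))  # False
```

Bram's trie-based version improves on this further: instead of testing every split position `i`, the trie walk directly reports the few positions where a recognized word actually ends.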
Speed is an interesting question. I think a hash is faster for the roots, but a trie is better for affix checking. A trie is also more compact, which may be a great advantage with the CPU cache.

> Another advantage of the trie is that it works for utf-8 without much trouble. Especially when the words with affixes applied are put in the trie, so that the conditions can be checked while building the trie, instead of when spell checking the words. Note that not all affixes can be pre-processed this way, it takes too much time to generate the trie then (for that reason I haven't been able to check about its size yet). For Hebrew and Hungarian I currently don't put words with prefixes in the word trie, they are stored in a separate trie. That's a bit like the compound word mechanism.

Yes, there is a similarity, and a trade-off between affix checking and compounding. Multiple affix stripping would be ideal for agglutinative languages, but implementing it at run time is impossible. (Hunspell's file format is ready for defining multiple affixes, but for now Hunspell handles only twofold suffix stripping. I can imagine an off-line multiple affixes -> twofold affixes precompiler. I understand you have made a single affixes -> affixed words run-time precompiler.) (BTW, a generalized solution for spell checking and for morphological analysis and generation, for every language and with the best efficiency, would be a two-level rule compiler. Sorry, I have no experience with the difficulties of this method. A GPL-ed tool: http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html)

> Vim stores the trie in a .spl file, together with all other required data. For most languages the .spl file is only 50% the size of the .dic file.

The first item on the Hunspell TODO list is mmap support for dictionary files. Michael Meeks, an OOo developer, has written a MySpell patch to solve memory problems in multi-user environments.
I will make an optimized binary dic format for Hunspell that can be shared between users in a client-server environment via the mmap support of the operating system. This format will be optional, for backward compatibility: Hunspell will look for this format first, then for the original uncompressed format.

> As Geoff Kuenning (the author of ispell) wrote: Using a hashtable is a dead end. He refers to an article by Kemal Oflazer that proposes using a finite state machine. That's what the Vim code is doing.

I'm sorry to say I have no chance to implement a better solution myself, but I hope I can integrate your better solution/code into Hunspell.

> The choice between a hash table or a trie has only minor impact on the affix file. The only thing I can think of is that Vim doesn't need the TRY entry, since the trie specifies which characters may appear at a certain position in the word. Thus the remarks below are valid no matter if you continue to use a hash table or not.
>
> > LANG in affix file
>
> A generic remark about the affix file format is the use of the LANG entry. This results in various checks that are not specific for one language to depend on the language name. I think that's a bad choice. For example, the code to do compounding with dashes now depends on LANG to specify Hungarian or German. But it probably also applies to Dutch and other languages. These mechanisms should be defined in the affix file separately, e.g. with a COMPOUNDDASH item.

I will remove this language-specific code. Compounding with dashes needs a hybrid solution. Real compound words are handled with dash compounds and prefixes (though with your excellent COMPOUNDFLAGS there is no need to use dash prefixes). Other compounds (for example, in Hungarian, twin-word-like word pairs, or word lists) are checked by the grammar checker.

> Another example is syllable counting. It's easy to define this in a generic way and avoid doing it only for Hungarian.
> Then other languages to which this applies can do the same thing. And possibly for Hungarian the method may need to be tuned to the word list used.

You are right.

> A few more hints about reducing the dependency on LANG:
>
> - Remove all the language-specific strings from affixmgr.cxx. I see strings like "ccscs" for Hungarian consonants. These can be defined in the affix file.
>
> - I think that the rule to replace the sharp s by SS when making a word upper case can be done independent of the language. It depends on the character itself, not the language. Same for the suggestion to change '-' to ' '. Vim actually tries that change for all non-word characters.
>
> - Some specific affix names are used, these should really be defined in the affix file.
>
> - In SuggestMgr::twowords() check_forbidden() is only called for Hungarian. I would think this is a generic check, not specific for Hungarian.

This call is specific to Hungarian, because we put a dash between the two words when the bad compound word is in the dictionary. You are right, it would be fine to generalize this possibility (for example, with a SUGGESTTWOWORDSWITHDASH flag).

> - I see repl_check() being used inside compound_check(). This means that REP items are used to check for correct spelling. So far REP items were only used for making suggestions. That might be wrong, or otherwise it should be explained in the documentation.

I have documented it in the source code above the repl_check() function. Using REP there is a great advantage for Hungarian and presumably for other languages with rich compounding. You are right, we need to generalize this option, too. I have already planned it (BACKCHECKCOMPOUNDS or a similar flag). LANG is a sandbox for developers, and an excuse for me. I have a lot of doubts about my improvements. I have only been trying to generalize the language-specific code of Hunspell for a short time, since the summer. Thank you for your help!
> > Flags in affix file
>
> I notice you also allow flags for affixes (the ones that cannot be used in the .dic file). I find this a bit confusing, using flags on flags (except for double suffixes, that's something different). I already added the "rare" item for affixes and recently the "nocomp" item (disallow compounding for a word with this suffix).
>
> Similarly I could add "comp" to allow compounding for a word with this affix, like you do with COMPOUNDFLAG. Using these verbose names instead of flags makes the affix file easier to understand, while the extra file size is negligible.

We have a sophisticated preprocessor for Hunspell, named Hunlex, under development.

> On the other hand, it looks like the affix flags replace the flags for the word it's used with. In that case it becomes a generic mechanism and should be described as such. I can see this is useful: for a specific suffix you can use different flags. This would disallow using a prefix, for example, by using a slash and no flags.
>
> But if it's also possible to keep using the flags of the word, or the flags are optionally added, then it becomes complicated. And it's already complicated enough...
>
> At least we should avoid using a character flag for things that cannot be used in the word list. This especially applies to COMPOUNDFORBIDFLAG. I'm currently using "nocomp" after the affix for this.
>
> The ONLYINCOMPOUND flag appears only to apply to the affix file. But it would also be useful for the word list, to specify words that can only be used in a compound, not by themselves.
>
> It's unclear to me what COMPOUNDROOT is used for. Can you explain that?

COMPOUNDROOT flags the compound words in the dictionary. For example:

----- aff -----
COMPOUNDROOT x
----- dic -----
kávé # coffee
szünet # pause
kávészünet/x # coffee pause

In Hungarian orthography there is a special rule: compound words with at least
3 words and 7 syllables must be written with a dash:

kávészünetigény # need for a coffee break
kávészünet-rendelet # coffee break regulations

So why don't we simply remove compounds from the dictionary? For example, we may have a ready-to-use dictionary with a lot of unflagged compounds, or we may want to restrict compound support to dictionary compounds when we need stricter spell checking (proof-reading etc.). Sometimes the REP back-checking in compound_check() forbids good compounds, too, and then we need to put these compounds into the dictionary with the COMPOUNDROOT flag.

> > Compound word count
>
> I had already implemented the maximum nr of words for compounding and the maximum nr of syllables for compounding. But I guessed a word would have to meet both criteria. The new documentation specifies that a word needs to meet one of these rules. Is this always so? If so I'll change my implementation.

Both criteria are needed for Hungarian, but I separated them for other languages. It is great! I hope that with your effort we can make better language tools, and not only for Vim. :)

> If not then adding an affix item to choose between OR/AND could be used.
>
> I have another method to define compound words, I'll include the current Vim help text about this at the end. I'm not sure in how far this replaces some of the language-specific mechanisms that are in the hunspell code. It will certainly replace COMPOUNDFLAG, COMPOUNDBEGIN, COMPOUNDMIDDLE and COMPOUNDLAST in a generic way. It supports compounding of a certain group of words with another group of words. This rules out nonsense words. E.g., when allowing "tomato" and "bicycle" for start, and "soup" and "shop" for last, you can make the legal "tomatosoup" and "bicycleshop", but also "bicyclesoup", which is wrong.

It is wonderful!

> > Word characters
>
> I notice you use WORDCHARS for this. In Vim I made it a bit more generic: FOL, LOW and UPP lines specify both word characters and case folding.
> This is required for when the locale is not available on a system or another locale is currently being used. Especially relevant for utf-8, which allows a user to edit text without adjusting the locale. But also to allow editing text in various Microsoft codepages on Unix, for which there is no locale.
>
> In Vim MIDWORD specifies characters that should only be considered to be word characters when used in between word characters. Especially useful for ' and -.

It is not enough. For example, a final dot may be a full stop or the dot after an abbreviation. OpenOffice.org has the following algorithm for this. An example, where OOo checks "dxg.":

1. OOo sends "dxg." to Myspell. Myspell (and now Hunspell 1.0.9) suggests "dog" (without the period). The user chooses this item, and
2. OOo adds the period back, replacing "dxg." with "dog."

And with an abbreviation:

1. OOo checks "exc.". Myspell suggests "etc."
2. OOo replaces "exc." with "etc."

Hunspell's former default period suggestion (I set it up for Abiword back then) has been made optional via the SUGSWITHDOTS affix parameter. For a grammar checker we need more sophisticated tokenization. (For example, we have word pairs, like "vice versa" etc.)

> > Affix required
>
> I notice PSEUDOROOT, which specifies that a word can't be used without affixes. I'm using NEEDAFFIX for that. I guess they do the same thing. I find NEEDAFFIX simpler to understand. (I'm not a linguist!)

You are right. I will use it. (BTW, NEEDAFFIX is the better name also because this flag is usable for affixes, too. See tests/pseudoroot3.*)

> I wonder if something similar should be used for words that are only valid when used in a compound word? ONLYINCOMPOUND could be re-used for this.

I have planned it, because for now this is a little awkward with NEEDAFFIX and a zero morpheme:

NEEDAFFIX A
COMPOUNDFLAG B
ONLYINCOMPOUND C
SFX X Y 1
SFX X 0 0/BC .
--- dic ---
foo/A

> > Decapitalising
>
> The method apparently requires specifying an affix for each letter a word can start with.
> I wonder, is it ever allowed to have an upper-case letter halfway a compound word? I don't think so, it would make the word huh-cap. A DECAPITALIZE item in the affix file would be sufficient. Perhaps the rule about what to do after a dash needs to be specified explicitly. I think for German they always keep the capital.

But not in Hungarian geographical names: Nyugat-Európa (Western Europe), nyugat-európai (Western European), etc. I think the idea is good, but the implementation is difficult in Hunspell. (BTW, for now I will handle Hungarian geographical names with special dash prefixes: E -> -e, Európa -> -európai.)

> > Circumfix
>
> The mechanism to match a prefix with a suffix looks very complicated to me. It also doesn't appear to be possible to use more than one flag for this, thus all suffixes with CIRCUMFIX can be used with all prefixes with CIRCUMFIX. That probably requires writing a separate suffix for each possible prefix. Or vice versa.

See the tests/circumfix example.

> Why not add a new item that defines both at the same time? Something like:
>
> PSFX {flag} {pchop} {padd} {pcond} {schop} {sadd} {scond}
>
> The {pcond} would need to match at the start of the word, {scond} at the end. Similarly for the chop and add strings. Perhaps flags could be added after {sadd}, like you do with other suffixes.
>
> An additional advantage of this item is that you can add a prefix while using a condition on the end of the word. Don't know if there is a language where this is useful though... :)

I'm afraid to implement new syntax in the affix file. But we can implement a more sophisticated preprocessor grammar for users. Hunlex has a circumfix syntax similar to yours.

> Would it be possible to use this mechanism for a compound word, so that the prefix applies to the first word and the suffix to the last word? Or is that not needed for any language?

Quite right.
There is, in Hungarian, with compound adjectives:

-barát (-friendly)
felhasználóbarát (user-friendly)
legfelhasználóbarátabb (most user-friendly)
legmacskabarátabb (most cat-friendly)
etc.

But it is not frequent in our 1.5 billion word Hungarian corpus:

[laci@lalilili szabaly]$ look leg szoszablya.txt | grep barátabb
leg-felhasználóbarátabb 1 1 0 0
legbababarátabb 1 1 1 0
legbarátabbak 1 1 1 0
legbőrbarátabb 1 1 1 1
legcsaládbarátabb 1 1 0 0
legfelhasználó-barátabb 5 5 1 0
legfelhasználóbarátabb 26 26 6 2
legfelhasználóbarátabbikát 3 3 0 0
legfogyasztóbarátabb 1 1 0 0
legfénybarátabb 1 1 0 0
leghátbarátabb 6 6 6 6
legkörnyezetbarátabb 36 36 30 22
legkörnyezetbarátabb. 2 2 2 2
legkörnyezetbarátabbnak 4 4 1 1
legközönségbarátabb 1 1 0 0
leglóbarátabb 3 3 3 2
legmelegbarátabb 1 1 1 0
legmosleybarátabb 2 2 0 0
legnyomdafestékbarátabb 2 2 2 2
legpolgárbarátabb 1 1 1 1
legrádióbarátabb 1 1 0 0
legrádióbarátabbnak 1 1 0 0
legszuperlóbarátabb 1 1 1 1
legtermelőbarátabb 2 2 2 0
legtermészetbarátabb 3 3 3 3
legtuningbarátabb 2 2 0 0
legtuningbarátabbak 1 1 0 0
legállatbarátabb 5 5 4 0
legönkormányzatbarátabb 1 1 1 1
legügyfélbarátabb 3 3 1 1

It seems Hungarian doesn't like compound adjectives with superlatives. But this feature would be fine for Hungarian, perhaps as a special kind of compounding. (But in the artificial (orthographical?) linguistic paradigm, `leg' (~most) is not a root, but a prefix.)

> > More flags
>
> I agree that the problem with running out of single character flags is relevant. Your solution is to allow two-character flags and numbered ones, separated by commas. A third method, which would be attractive generally, is to use single-character flags as before, and also allow an upper case letter with another character, thus Aa, Bx, etc. Using A to Z as a single character flag isn't possible then. This has two advantages:
>
> - Allows using single character flags for the most common ones.
> - Two-character flags are much easier to read when concatenated: AbZcNf
> - Allows for about 68 + 26 x 68 = 1832 flags. That should be sufficient for all languages, especially with the double-suffix mechanism.

Hunspell also has a similar, but unrestricted, mode (FLAG long). I like your multi-byte long format. :) I will implement it in Hunspell if you support it in Vim. I also think 1832 flags are enough for affixes.

> A few more things that I noticed:
>
> It very much helps in the code if you add a comment above each method to briefly explain what it does and what the return value is. I see functions returning 0 or 1 without suggesting what that means. This makes it a lot more difficult to understand the code.

I will do it.

> COMPOUNDMIN is defined as "Minimum length of words in compound words. Default value is 3 letter." I understand this means that characters are counted instead of bytes. But how about counting non-letters? Would "d'e" be valid for compounding? From the code I deduce that the explanation should be that the length is defined in characters, not letters.

Yes, you are right. Thanks. But with Unicode, these characters may be multibyte UTF-8 characters.

> The manpage mentions: "Hungarian has a standard ASCII character set (ISO 8859-2)". "ASCII" should be "8-bit". ASCII is only 7-bit.

I will fix it.

> SYLLABLENUM appears to be a hack. Any plans to replace this with something more generic? Perhaps defining flags for adding one, two or three syllables to a word?

Yes, it is a hack for Hungarian, with hardwired affixes in the code. We need to calculate the syllable count of some (derivational) suffixes for Hungarian. I will make a DERIVATIVE(AFFIX) affix attribute for stemming, and I will replace SYLLABLENUM with language-specific code for calculating the syllable counts of derivational affixes.

> About ACCENT: Vim uses the MAP items for this.
> This also implies that these changes in accents don't need to be put in REP items.

I have already removed ACCENT support. Thanks for your comment. I think we need this little MAP-REP redundancy for sorting the suggestions better. I have changed the order of the MAP and REP function calls in the Hunspell source for this purpose. REP suggestions need higher priority, because REP contains typical (frequent) errors.

> You appear to use FORBIDDENWORD where Vim uses BAD. I mostly use the ! flag for BAD to make it stand out.

I like it. FORBIDDENWORD will be deprecated in Hunspell. ! is nice. :)

> Well, that's enough for now!

Well, I had planned long ago a non-deterministic finite state automaton for compound checking, with an ugly and incomprehensible syntax. Your COMPOUNDFLAGS has a wonderful, admirably clear syntax. It is a huge advantage over the old compound support. I had started to convert our idiotic syllable-counting Hungarian compound rule to a non-deterministic automaton, but I also wanted to encode the affix flags into these rules. (Here is my first (and last) horrible attempt:

alap (base)                1      2      3      4      5
1 1_01 ->                 2_12 | 2_13 | 2_14 | 2_15 | 2_1n
2 1_02 ->                 2_23 | 2_24 | 2_25 | 2_2n | 2_2n
3 1_03 ->                 2_34 | 2_35 | 2_3n | 2_3n | 2_3n
4 1_04 ->                 2_45 | 2_4n | 2_4n | 2_4n | 2_4n
5 1_0n ->                 2_nn | 2_nn | 2_nn | 2_nn | 2_nn

harmadik, előző szótagszám (third word, by previous syllable count)
2 (2_12) ->                                        3_23 | 3_24 | 3_25 | 3_26
3 (2_13, 2_23, 3_23) ->                            3_34 | 3_35 | 3_36
4 (2_14, 2_24, 2_34, 3_24, 3_34) ->                3_45 | 3_46
5 (2_15, 2_25, 2_35, 2_45, 3_25, 3_35, 3_45) ->    3_56

1 szótagúak (1-syllable words): 1_01 2_12 2_23 2_34 2_45 2_nn 3_23 3_34 3_45 3_56
2 szótagúak (2-syllable words): 1_02 2_13 2_24 2_35 2_4n 2_nn 3_24 3_35 3_46
3 szótagúak (3-syllable words): 1_03 2_14 2_25 2_3n 2_4n 2_nn 3_25 3_36
4 szótagúak (4-syllable words): 1_04 2_15 2_2n

I imagined state definitions with next states in the affix file, but it would have been more difficult to use.) With a lot of COMPOUNDFLAGS items, I hope I can define a similar non-deterministic automaton with a much simpler syntax.
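Since a COMPOUNDFLAGS item (quoted in full in the Vim help text further on) is already regexp-like, checking whether a sequence of word flags forms an allowed compound can be sketched by translating each item into an anchored regular expression. This is an illustration under the assumption of single-character flags, not Vim's or Hunspell's actual implementation:

```python
import re

def compoundflags_to_regex(pattern):
    """Translate a COMPOUNDFLAGS item like "s[xyz]*e" or "c+" into an
    anchored regex over the string of word flags.  The item syntax
    (flags, [...] alternates, * and +) is already a regexp subset."""
    return re.compile(r"\A(?:" + pattern + r")\Z")

def allowed(flag_sequence, patterns):
    """A compound is allowed if the flags of its words, concatenated
    in order, match any COMPOUNDFLAGS item."""
    return any(compoundflags_to_regex(p).match(flag_sequence)
               for p in patterns)

rules = ["c+", "se"]            # the example rules from the help text
print(bool(allowed("ccc", rules)))  # borkborkbork -> True
print(bool(allowed("se", rules)))   # onionsoup    -> True
print(bool(allowed("es", rules)))   # souponion    -> False
```

This also shows why the mechanism is effectively a non-deterministic automaton: the regex engine explores the alternatives, so the affix file author never has to write out explicit state tables like the one above.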
Other interesting types of compounding in Hungarian:

- numbers: 1999 = ezerkilencszázkilencvenkilenc, 2005 = kétezer-öt, 1 872 453 123 = egymilliárd-nyolcszázhetvenkétmillió-négyszázötvenháromezer-százhuszonhárom,
- compounds with the moving rule: személygépkocsinyereménybetétkönyvkiviteliengedély-kérés,
- twin words: pörögtünk-forogtunk (the words need the same affix: pörgött-forgott, but *pörögtünk-forgott).

I admire your excellent work. Congratulations! I will implement COMPOUNDFLAGS in Hunspell. With my reverse logic, Vim will be the best interactive interface for our spell checker development. :) Thanks a lot. I'm very sorry that with my English I cannot express my happiness well enough. :)

Best regards,
Laci

> - Bram
>
> Part of the current Vim help on affix items:
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
> COMPOUND WORDS *spell-affix-compound*
>
> A compound word is a longer word made by concatenating words that appear in the .dic file. To specify which words may be concatenated a character is used. This character is put in the list of affixes after the word. We will call this character a flag here. Obviously these flags must be different from any affix IDs used.
>
> *spell-COMPOUNDFLAG*
> The Myspell compatible method uses one flag, specified with COMPOUNDFLAG. All words with this flag combine in any order. This means there is no control over which word comes first. Example:
> COMPOUNDFLAG c
>
> *spell-COMPOUNDFLAGS*
> A more advanced method to specify how compound words can be formed uses multiple items with multiple flags. This is not compatible with Myspell 3.0. Let's start with an example:
> COMPOUNDFLAGS c+
> COMPOUNDFLAGS se
>
> The first line defines that words with the "c" flag can be concatenated in any order. The second line defines compound words that are made of one word with the "s" flag and one word with the "e" flag.
> With this dictionary:
> bork/c
> onion/s
> soup/e
>
> You can make these words:
> bork
> borkbork
> borkborkbork
> (etc.)
> onion
> soup
> onionsoup
>
> The COMPOUNDFLAGS item may appear multiple times. The argument is made out of one or more groups, where each group can be:
> one flag e.g., c
> alternate flags inside [] e.g., [abc]
> Optionally this may be followed by:
> * the group appears zero or more times, e.g., sm*e
> + the group appears one or more times, e.g., c+
>
> This is similar to the regexp pattern syntax (but not the same!). A few examples with the sequence of word flags they require:
> COMPOUNDFLAGS x+        x xx xxx etc.
> COMPOUNDFLAGS yz        yz
> COMPOUNDFLAGS x+z       xz xxz xxxz etc.
> COMPOUNDFLAGS yx+       yx yxx yxxx etc.
> COMPOUNDFLAGS [abc]z    az bz cz
> COMPOUNDFLAGS [abc]+z   az aaz abaz bz baz bcbz cz caz cbaz etc.
> COMPOUNDFLAGS a[xyz]+   ax axx axyz ay ayx ayzz az azy azxy etc.
> COMPOUNDFLAGS sm*e      se sme smme smmme etc.
> COMPOUNDFLAGS s[xyz]*e  se sxe sxye sxyxe sye syze sze szye szyxe etc.
>
> *spell-COMPOUNDMIN*
> The minimal byte length of a word used for concatenation is specified with COMPOUNDMIN. Example:
> COMPOUNDMIN 5
>
> When omitted a minimal length of 3 bytes is used. Obviously you could just leave out the compound flag from short words instead, this feature is present for compatibility with Myspell.
>
> *spell-COMPOUNDMAX*
> The maximum number of words that can be concatenated into a compound word is specified with COMPOUNDMAX. Example:
> COMPOUNDMAX 3
>
> When omitted there is no maximum. It applies to all compound words. To set a limit for words with specific flags make sure the items in COMPOUNDFLAGS where they appear don't allow too many words.
>
> *spell-COMPOUNDSYLMAX*
> The maximum number of syllables that a compound word may contain is specified with COMPOUNDSYLMAX. Example:
> COMPOUNDSYLMAX 6
>
> This has no effect if there is no SYLLABLE item.
> Without COMPOUNDSYLMAX there is no limit on the number of syllables.
>
> *spell-SYLLABLE*
> The SYLLABLE item defines characters or character sequences that are used to count the number of syllables in a word. Example:
> SYLLABLE aáeéiíoóöőuúüűy/aa/au/ea/ee/ei/ie/oa/oe/oo/ou/uu/ui
>
> Before the first slash is the set of characters that are counted for one syllable, also when repeated and mixed, until the next character that is not in this set. After the slash come sequences of characters that are counted for one syllable. These are preferred over using characters from the set. With the example "ideeen" has three syllables, counted by "i", "ee" and "e".
>
> Only case-folded letters need to be included.
>
> Another way to restrict compounding was mentioned above: adding "nocomp" after an affix causes all words that are made with that affix not to be used for compounding. |spell-affix-nocomp|
|
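The SYLLABLE counting rule quoted above can be sketched as follows (a sketch of the described behaviour with a shortened vowel set, not Vim's actual code): listed sequences are tried first and each count one syllable; otherwise a run of set characters counts one syllable.

```python
def count_syllables(word, charset, sequences):
    """Count syllables per the SYLLABLE item description: a listed
    sequence counts one syllable and is preferred; otherwise a run of
    set characters counts one syllable until a non-set character."""
    count = 0
    i = 0
    in_run = False  # inside a run of set characters already counted
    while i < len(word):
        # Try the preferred multi-character sequences first, longest first.
        for seq in sorted(sequences, key=len, reverse=True):
            if word.startswith(seq, i):
                count += 1
                i += len(seq)
                in_run = False
                break
        else:
            if word[i] in charset:
                if not in_run:
                    count += 1
                    in_run = True
            else:
                in_run = False
            i += 1
    return count

seqs = ["aa", "au", "ea", "ee", "ei", "ie", "oa", "oe", "oo", "ou", "uu", "ui"]
print(count_syllables("ideeen", set("aeiouy"), seqs))  # 3: "i", "ee", "e"
```

This reproduces the "ideeen" example from the help text: "i" counts one, the preferred "ee" sequence one, and the remaining "e" one.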
From: Bram M. <Br...@mo...> - 2005-08-25 14:02:06
|
Laci -

I have subscribed to the hunspell-devel list. Perhaps you would like to invite others to subscribe to this list? Especially the people involved with Myspell and OpenOffice.org spelling.

I don't know where this is going, but it certainly looks good. I'm glad you are taking over a few of my suggestions. Hopefully we can really work out a common affix file format. Whether we can share the binary file format remains to be seen. We can at least try. But it's a secondary goal.

If you like how the trie works in Vim, you might consider using the same code and/or mechanisms in hunspell. At least it will be good if there is a library with the spell code I'm now using for Vim, so that other programs can use the functionality.

A few things added this week:

- Two-letter flags. I'm using the "FLAG long" item in the affix file, just like hunspell. "FLAG num" is also supported. As an extra I implemented "FLAG huh", which allows both single-character flags and two-character flags starting with A-Z. Perhaps "huh" isn't a good name, since besides "Aa" and "Xx", which form HuH-cap words, "AA", "AX", etc. are also allowed. What would be a better name for this? "short-long"? "mix"? Hmm, perhaps "caplong" is clearer.

- Concatenating words without spaces in between. Mostly for Thai and similar languages. I'm using NOBREAK in the affix file for this. I'm not sure it works properly, I don't understand Thai and only found one usable word list (which apparently is partly corrupt).

An important thing to add next is specifying flags on an affix, like hunspell uses:

SFX a chop add/FLAGS cond

I haven't been able to figure out how these FLAGS are to be used exactly. Do they replace the flags of the word, add to them, or something more complicated? We need a clear definition for the people who write an affix file. And besides serving the purpose of Hungarian, it should be generic enough for other languages. Obviously the FLAGS are used for second level affixes.
The flags from the word itself will not be used for this, they are only used for the first level affixes.

The word the affix is used with may define flags for compounding. The FLAGS may also define compounding flags, since a word plus affix may compound in a different way. We need to define how this works exactly.

I can see it can be useful to have the FLAGS overrule the compound flags of the word, so that the affix changes how compounding can be done. You already mentioned this is needed for Hungarian. I can also see it can be useful to keep the compound flags of the word. Especially if the affix doesn't change the compounding and it can be used on words with different compound flags.

Logically there are these possibilities:
1. OR: compound flags of the word and affix are both used.
2. AND: compound only if a flag appears both on the word and in FLAGS.
3. SET: use FLAGS only.

Which one of these are you currently using? Would it be necessary to allow the other ones? We could use an extra item after the affix for this, just like I now have "rare". Perhaps these could be used: compOR, compAND, compSET. Example:

SFX a 0 add/FLAGS . compOR

Using "compSET" and not including compound flags in FLAGS is equivalent to disallowing compounding. This replaces my current "nocomp" flag. Using "compOR" and not including compound flags in FLAGS is equivalent to using the flags of the word without modifications.

We don't need all three, one of them can be the default. A complicated default mechanism would be to use compOR when FLAGS does not contain compound flags and compSET when FLAGS does have compound flags. This may be confusing, since it's not directly clear from the flags which ones are for compounding. I currently think using compOR as the default would be most useful. I would expect compAND is rarely used. compSET would then be used to disable and redefine compounding.

And finally a wild idea: for the affix, separate the flags for second level affixes and for compounding.
This makes it easier to understand and reduces mistakes. SFX a 0 add/AFFIXFLAGS . comp/COMPFLAGS And since the condition is in a strange place now, this might be better: SFX a 0 add . /AFFIXFLAGS comp/COMPFLAGS I don't know why you had put the flags on the "add" part of the affix, putting them separately with a leading slash seems simpler to me, while it still allows for an optional morphological field. I'll move the rest of my reply to a second message. - Bram -- hundred-and-one symptoms of being an internet addict: 97. Your mother tells you to remember something, and you look for a File/Save command. /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ Project leader for A-A-P -- http://www.A-A-P.org /// \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html /// |
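To make the three proposed combination modes concrete, here is a minimal Python sketch; the function and argument names (compound_flags, word_flags, affix_flags) and the set representation are illustrative only, not part of any actual hunspell or Vim API:

```python
def compound_flags(word_flags, affix_flags, mode="compOR"):
    """Combine a word's compound flags with the FLAGS of an applied
    affix, under the three proposed modes."""
    if mode == "compOR":    # flags of the word and the affix are both used
        return word_flags | affix_flags
    if mode == "compAND":   # compound only on flags present in both
        return word_flags & affix_flags
    if mode == "compSET":   # the affix FLAGS replace the word's flags
        return affix_flags
    raise ValueError("unknown mode: " + mode)
```

Note how compSET with an empty FLAGS set yields no compound flags at all, i.e. compounding is disallowed, matching the "nocomp" remark above, while compOR with an empty FLAGS set leaves the word's flags unmodified.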
From: Bram M. <Br...@mo...> - 2005-08-25 18:47:26
|
Laci - The second part of my reply. > > Using a hash table > > > > It appears you have run into a lot of trouble isolating a word. I > > have had the same problem in Vim when I was using a hash table. > > Since then I have switched to using a trie. Then it's not necessary > > to first locate or guess the end of the word. This makes all the > > code a lot simpler, especially for making suggestions and for > > compound words. You can also use words that end in a dot, e.g. > > "etc.", or include a space, with no extra effort. > > Sounds very good to me. > For multiple affix stripping a trie will be a better format. > > I have a little problem with the trie. > We need a morphological analyser for grammar checking, or suggestion > of synonyms with affixes. > > For morphological analysis the trie must be extended with state > information (roots and morphemes with morphological descriptions). It > seems a little difficult to me, because we need to handle root and affix > homonyms, too. In addition, for morphological derivation, we need > transducers, or less suitable data structures (linear affix > search in trie data). It would be possible to add morphological information to the trie. The compression is based on combining words with identical tails. If words have different morphological info their tails cannot be shared. Thus compression will be less efficient. By how much is difficult to predict. If the morphological information can be put in a small number of properties it might still work. For example "has suffix of three letters" can be added. Then you can go back three letters to find other suffixes at that point. Obviously, flags like "verb" and "noun" can be added easily. > > When you go into making suggestions you will find that the hash table > > is making it nearly impossible to find words with more than one > > insert/delete/swap/replace edit operation. The trie I'm using makes > > this possible. 
I can't say the code is simple, but I've already > > written it and it works very well for all languages. Currently > > suggestions with up to three or four edit operations are found. > > This can be tuned, it's a trade-off with speed, not a limitation of > > the mechanism. > > Kevin Hendricks had written an ngram suggestion code, which I have now extended > with a refinement based on the longest common subsequence algorithm. > For example it works well for foreign name suggestion (Montesquo -> > Montesquieu) and non-neighbouring differences (permenant -> permanent). > But this function is also good for a trie. Is this the SuggestMgr::ngsuggest() method? It apparently looks through all the root words. In Aspell there are remarks that using ngrams is very slow. I haven't tried it. I did try sound folding and know that doing something for every possible word can be very slow. First going over root words is an optimization for languages with lots of affixes, that is a good hint. However, you still need to try affixes in a second step. I doubt it will give better results than what I'm already doing. And it won't find suggestions by splitting the word. For example, "Montesquo" results in these suggestions with Vim (first five, using English): Change "Montesquo" to: 1 "Monte quo" 2 "Mont's quo" 3 "Mantes quo" 4 "Montesquieu" 5 "Monte's quo" This takes a fraction of a second on my system. > > Vim stores the trie in a .spl file, together with all other required > > data. For most languages the .spl file is only 50% the size of the .dic file. > > The first element of the Hunspell TODO is mmap support for dictionary > files. Michael Meeks, an OOo developer, has written a MySpell patch to > solve memory problems in multi-user environments. I will make an > optimized binary dic format for Hunspell that can be shared between users > in a client-server environment by the mmap support of the operating system. 
> > This format will be optional for backward compatibility: Hunspell > searches for this format first, then the original uncompressed format. I think mmap only works on Unix systems. I don't like non-portable solutions. Now that you can buy 1Gbyte of memory for $100 and most word lists don't use more than 5 Mbyte, it's not worth spending much time on these memory saving mechanisms. Especially when they complicate the code. > > LANG in affix file > > > > A generic remark about the affix file format is the use of the LANG > > entry. This results in various checks that are not specific for one > > language to depend on the language name. I think that's a bad choice. > > For example, the code to do compounding with dashes now depends on > > LANG to specify Hungarian or German. But it probably also applies > > to Dutch and other languages. These mechanisms should be defined in > > the affix file separately, e.g. with a COMPOUNDDASH item. > > I will remove this language-specific code. I'm glad you agree. > Compounding with dashes needs a hybrid solution. Real compound > words are handled with dash compounds and prefixes (but with your excellent > COMPOUNDFLAGS, there is no need to use dash prefixes). Other > compounds (for example in Hungarian, twin word-like word pairs, or > word lists) are checked by the grammar checker. I don't intend to add a grammar checker, because it requires context. I already have enough trouble with line breaks. Thus I want to put as much checking as possible in the words. I don't quite understand your remark. Perhaps dashes can be used in COMPOUNDFLAGS to specify rules where dashes are inserted in between words? Something like: COMPOUNDFLAGS s-m*e Which means: Starting word, dash, any number of middle words, ending word. But the rules for compound words (word count, syllable count) may work differently then. Perhaps that is why you want to leave it to the grammar checker? Simplest would be to see a dash not as a word character. Would that work for Hungarian? 
Not for other languages though. Could make compound rules for "before a dash" and "after a dash". It gets complicated then, but perhaps it's needed anyway. > > - I see repl_check() being used inside compound_check(). This means > > that REP items are used to check for correct spelling. So far REP > > items were only used for making suggestions. That might be wrong, > > or otherwise it should be explained in the documentation. > > I have documented it in the source code before the repl_check() function. > Using REP is a great advantage for Hungarian and presumably other > languages with rich compounding. You are right, we need to generalize > this option, too. I have already planned it. (BACKCHECKCOMPOUNDS > or a similar name.) I'm afraid I don't understand the comment above repl_check(). I don't see how you can use REP items to check a word for being valid or not. You made a remark elsewhere that sometimes good compound words are rejected, thus tuning the REP items for this might be required. The example shows how a compound word can be wrong, but since the same word appears as good word anyway, it doesn't matter for spell checking. I guess you must do this for morphological purposes. The REP items define arbitrary changes to a word to be able to find suggestions. In general the REP items cannot be used to check words for being right or wrong, because they change a word in an arbitrary way. You could perhaps limit the REP items for another purpose, but then the suggestions would suffer. Thus this is a wrong dependency. Perhaps separate items need to be added to check compound words? I'm still wondering what you are actually doing here. It seems to be a way to define an additional condition for compounds. > LANG is a sandbox for developers, and an excuse for me. > I have a lot of doubts about my improvements. I have been > trying to generalize the language-specific code of Hunspell only for a > short time, since summer. Thank you for your help! That's OK. 
But currently experimental things are mixed with things which are intended for actual use, which is confusing. Adding remarks to the docs and/or code helps a lot, e.g., "experimental". That would at least help me decide what to include in Vim. > > It's unclear to me what COMPOUNDROOT is used for. Can you explain > > that? > > COMPOUNDROOT marks the compound words in the dictionary. For > example: > > -----aff--- > COMPOUNDROOT x > > -----dic--- > kávé # coffee > szünet # pause > kávészünet/x # coffee pause > > In Hungarian orthography there is a special rule. Compound words with min. > 3 words and 7 syllables have to be written with a dash: > > kávészünetigény # (need for coffee break) > kávészünet-rendelet # (coffee break regulations) > > Yes, why would we remove compounds from the dictionary? > For example, we have a ready-to-use dictionary with a > lot of unmarked compounds, or we want to restrict compound support > to dictionary compounds when we need stricter spell checking > (proof-reading etc.). Sometimes REP back checking in compound_check() > forbids good compounds too, and we need to put these compounds > into the dictionary, with the COMPOUNDROOT flag. If I understand it correctly, this flag means that the word actually is a compound word, and thus when using it in a compound word it must be counted as two words when checking the compounding rules. The flag name is confusing for me. COMPOUNDWORD would be clearer, except that it was previously used for something else (maximum word count). Perhaps "COMPOUNDED" is a better name. Theoretically a word could be a compound of more than two words. Perhaps this applies to German. Just to be ready for this, we could use "COMPOUNDED2" and "COMPOUNDED3". > > Compound word count > > > > I had already implemented the maximum nr of words for compounding and > > the maximum nr of syllables for compounding. But I guessed a word > > would have to meet both criteria. 
The new documentation specifies > > that a word needs to meet one of these rules. Is this always so? If > > so I'll change my implementation. > > Both criteria are needed for Hungarian, but I separated them for other > languages. Now you say the opposite of what I see in the code, thus I'm confused. Please be precise: What exactly are the rules for compounding for the number of words and syllables? > > In Vim MIDWORD specifies characters that should only be considered > > to be word characters when used in between word characters. > > Especially useful for ' and -. > > It is not enough. For example a last dot may be a period or a > dot after abbreviations. Vim doesn't need to know. If the word without a dot appears in the word list, then the dot is a full stop. If the word appears with the dot, then it is required and the word without the dot is misspelled. The dot then is included in the word. If the sentence end cannot be decided, that's a problem for grammatical analysis. I understand that when you use a hash table you need to figure out where the word ends, thus it's much more complicated. But you need to check the word list to find out if the dot is a full stop or part of an abbreviation. The affix file specifications can only give the information that the dot MIGHT be part of a word. Since Vim doesn't use the hash mechanism we don't need to know. > OpenOffice.org has the following algorithm for this. With an example: > > OOo checks "dxg." > > 1. send "dxg." to Myspell > > Myspell (and now Hunspell 1.0.9) suggests dog (without period). > The user chooses this item and > > 2. OOo adds the period to dog, and replaces "dxg." with "dog." How can OOo know that the dot should be added back? Also, changing "dxg." to "dog" means two changes, thus a bad score. Unless you add special code to ignore the dot. > 1. OOo checks "exc." > > Myspell suggests "etc." > > 2. OOo replaces "exc." with "etc." 
> > Hunspell's former default period suggestion (I set it for Abiword > back then) has been made optional by the SUGSWITHDOTS affix parameter. Of course you can add SUGSWITHDOTS and I'll have Vim ignore it. > For a grammar checker we need a more sophisticated tokenization. > (For example, we have word pairs, like "vice versa" etc.) Well, I'm glad I don't need to play these tricks for Vim. Using the trie this all comes for free. > > I wonder if something similar should be used for words that are only > > valid when used in a compound word? ONLYINCOMPOUND could be re-used > > for this. > > I have planned it, because now this is a little difficult with NEEDAFFIX and > a zero morpheme: > > NEEDAFFIX A > COMPOUNDFLAG B > ONLYINCOMPOUND C > > SFX X Y 1 > SFX X 0 0/BC . > > --- dic --- > foo/A I'm afraid this example looks wrong. "foo/A" means "foo" requires an affix, but there isn't one. Anyway, it's clear that using ONLYINCOMPOUND in the .dic file is a lot simpler. I was thinking that analogous to NEEDAFFIX we could use NEEDCOMPOUND instead of ONLYINCOMPOUND. A bit more consistency. > > Decapitalising > > > > The method apparently requires specifying an affix for each letter a > > word can start with. I wonder, is it ever allowed to have an > > upper-case letter halfway through a compound word? I don't think so, it would > > make the word huh-cap. A DECAPITALIZE item in the affix file would be > > sufficient. Perhaps the rule about what to do after a dash needs to be > > specified explicitly. I think for German they always keep the > > capital. > > But not in Hungarian geographical names. > > Nyugat-Európa (Western Europe) > nyugat-európai (Western European) > etc. > > I think the idea is good, but the implementation is difficult > in Hunspell. > > (BTW. Now I will handle Hungarian geographical names with special dash > prefixes: E -> -e Európa -> -európai) Hmm, I suppose you have to do that for every letter. I don't like that. 
I think for German the rule is: When compounding without a dash, the leading capital is made lower case; when compounding with a dash the leading capital is kept. I haven't seen an exception yet. Is the rule for Hungarian to always make the leading capital lower case, also when there is a dash? If not, then is it easy to make a generic rule or are flags needed to specify it on the specific words? In the last case it might actually be simpler to add the words to the .dic file with the NEEDCOMPOUND flag. That depends on how many words this applies to. > > Circumfix > > > > The mechanism to match a prefix with a suffix looks very complicated > > to me. It also doesn't appear to be possible to use more than one > > flag for this, thus all suffixes with CIRCUMFIX can be used with all > > prefixes with CIRCUMFIX. That probably requires writing a separate > > suffix for each possible prefix. > > Or vice versa. See tests/circumfix example. Hmm, this suggests that when the morphological information isn't needed, the affix file can be written as: PFX A Y 1 PFX A 0 leg . PFX B Y 1 PFX B 0 legesleg . SFX C Y 3 SFX C 0 obb/AB . Perhaps there is another situation where you really can't add a suffix without adding a prefix as well. But then you can use the NEEDAFFIX flag to avoid using "obb" by itself. Can't you always use the NEEDAFFIX flag in the situations where you now use CIRCUMFIX? > > Why not add a new item that defines both at the same time? Something > > like: > > PSFX {flag} {pchop} {padd} {pcond} {schop} {sadd} {scond} > > > > The {pcond} would need to match at the start of the word, {scond} at > > the > > end. Similarly for the chop and add strings. Perhaps flags could be > > added after {sadd}, like you do with other suffixes. > > > > An additional advantage of this item is that you can add a prefix > > while using a condition on the end of the word. Don't know if there > > is a language where this is useful though... 
> > :) > > I'm afraid to implement new syntax in the affix file. > But we can implement a more sophisticated preprocessor grammar > for users. Hunlex has a circumfix syntax similar to yours. That creates a dependency on a tool, which then becomes required, and defines another file format. I'd rather invest a bit of time in implementing this new affix type. Since it's a combination of the existing prefix and suffix, it's not really new. Anyway, so long as no word list uses this it makes no sense to add support for it to Vim. > > Would it be possible to use this mechanism for a compound word, so > > that the prefix applies to the first word and the suffix to the last > > word? Or is that not needed for any language? > > Quite right. There is one in Hungarian with compound adjectives [example removed] > It seems Hungarian doesn't like compound adjectives with > superlatives. > > But this feature would be fine for Hungarian, perhaps as a special > compounding. But in the artificial? (orthographical?) linguistic > paradigm `leg' (~most) is not a root, but a prefix.) OK, it was just an idea. It will have to wait until someone finds use for it. > > COMPOUNDMIN is defined as "Minimum length of words in compound words. > > Default value is 3 letter." I understand this means that characters > > are counted instead of bytes. But how about counting non-letters? > > Would "d'e" be valid for compounding? From the code I deduce that > > the explanation should be that the length is defined in characters, > > not letters. > > Yes, you are right. Thanks. But with Unicode, these characters may > be multibyte UTF-8 characters. Right. And since the rule doesn't change if you write a word in an 8-bit encoding or in UTF-8 it should really count characters, not bytes. 
> > Yes, it is a hack for Hungarian with hardwired affixes in the code. > We need to calculate the syllable number of > some (derivative) suffixes for Hungarian. > I will make a DERIVATIVE(AFFIX) affix attribute for stemming, and > I will replace SYLLABLENUM with language-specific code for calculating > syllable numbers of derivative affixes. I mentioned the COMPOUNDED2 and COMPOUNDED3 items above. We can do something similar for syllables. Or just specify the syllable count directly: PFX B 0 legesleg . =2 Using the "=" character to recognize the syllable count item here. Perhaps it's good to do this for all extra items: PFX B 0 legesleg . =2 /FLAGS :morpho-text #comment That starts looking like a generic mechanism, easy to parse. It would be easy to define something for words or affixes with a different number of syllables than what the normal algorithm would come up with. The trouble is doing it for the combination of a word and an affix. Especially if there is a chop string. At least, if you check affix names in the code, we could just as well add a flag to the affix in the .aff file. Hard-wiring affix names in the source code should be avoided at all costs. > > About ACCENT: Vim uses the MAP items for this. This also implies that > > these changes in accents don't need to be put in REP items. > > I have already removed ACCENT support. Thanks for your comment. > > I think we need this little MAP-REP redundancy for sorting > the suggestions better. I have changed the order of the MAP and REP > function calls in the Hunspell source for this purpose. REP suggestions > need higher priority, because REP contains typical (frequent) errors. The score for suggestions is something that can be tuned again and again. Of course it's possible to have REP items for accented characters too. > Well, long ago I planned a non-deterministic > finite state automaton for compound checking with an ugly > and incomprehensible syntax. 
> > Your COMPOUNDFLAGS has a wonderful, admirably clear syntax. > It has a huge advantage over the old compound support. I'm glad you like it. Together with the possibility to use many flags this hopefully covers most rules. However, the idea to also make compound rules with conditions still exists. You will have to tell me whether these conditions would still be needed. I know that for some languages the base word has to be modified when used in a compound. Defining this can be complicated, as a few examples for German show (e.g., Frau - Fräulein, adding "lein" changes a to ä). Hopefully it's sufficient to define the base word without a compound flag and the modified word with a compound flag and NEEDCOMPOUND. So long as we don't have a way to define these complicated rules that's the way to do it. > I imagined state definitions with next states in the affix file, but > it would be more difficult to use. I suppose Hungarian linguists don't define compounding with a state machine :-). Sticking close to how linguists define the language is often best, because language specialists can then write the .aff and .dic files. > With my reverse logic, Vim will be the best interactive > interface for our spell checker development. :) As you noticed I can mostly implement something quickly to try it out. That makes discussions a lot easier. And it certainly helps to locate bugs! - Bram -- To be rich is not the end, but only a change of worries. /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ Project leader for A-A-P -- http://www.A-A-P.org /// \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html /// |
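A minimal sketch of the longest-common-subsequence refinement mentioned in the message above; this is the textbook dynamic-programming algorithm, not the actual SuggestMgr::ngsuggest() code, and the similarity normalization is only one plausible choice:

```python
def lcs_len(a, b):
    # Dynamic programming over two rows: prev[j] holds the LCS length
    # of the processed prefix of a against b[:j].
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, ch2 in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == ch2 else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def similarity(bad, cand):
    # Normalize by the longer word so unrelated long candidates score low.
    return lcs_len(bad, cand) / max(len(bad), len(cand))
```

With this measure "Montesquieu" shares a subsequence of length 8 with "Montesquo" and "permanent" shares one of length 7 with "permenant", which is why such foreign names and non-neighbouring differences can still rank well even though they need several edit operations.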
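The COMPOUNDFLAGS patterns discussed above (e.g. s-m*e: a starting word, a dash, any number of middle words, an ending word) map naturally onto regular expressions over one chosen flag per word. A hedged sketch, with all names invented for illustration and a dash treated as a pseudo-word carrying the literal flag '-':

```python
import re
from itertools import product

def compound_ok(pattern, word_flags):
    # word_flags: one set of compound flags per component of the
    # candidate compound.  Try every way of picking one flag per
    # component and match the concatenation against the pattern,
    # read directly as a regular expression.
    regex = re.compile(pattern + "$")
    return any(regex.match("".join(pick))
               for pick in product(*word_flags))
```

This brute-force product is only for clarity; a real checker would walk the pattern while scanning the compound candidate left to right.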
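The Hungarian rule quoted above (a compound of at least three words and more than six syllables must be written with a dash) can be sketched as follows. Counting vowels as syllables works for Hungarian, and a component carrying the COMPOUNDROOT flag counts as two underlying words; the function and argument names are illustrative, not hunspell API:

```python
HU_VOWELS = set("aáeéiíoóöőuúüű")

def needs_dash(components, word_counts):
    # components: the words joined into the candidate compound.
    # word_counts: how many underlying words each component stands for
    # (2 for a dictionary compound carrying COMPOUNDROOT, else 1).
    words = sum(word_counts)
    syllables = sum(ch in HU_VOWELS for w in components for ch in w.lower())
    return words >= 3 and syllables > 6
```

So kávészünet (COMPOUNDROOT: 2 words, 4 syllables) plus rendelet (3 syllables) gives 3 words and 7 syllables and takes the dash, while kávészünet plus igény stays at 6 syllables and does not, matching the kávészünetigény / kávészünet-rendelet examples.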
From: Bram M. <Br...@mo...> - 2005-09-30 10:45:17
|
Laci - I noticed that hunspell 1.1.0 is out. I see there are many improvements and some of my suggestions have been included. It would be good if you announce new releases in this hunspell-devel list. I don't read the other list, since I can't read Hungarian. Before I include some of this in Vim, I would like to know what more will change in the coming months. Especially about the flags used with an affix. Where can I find the Hungarian .dic and .aff files that use the 1.1.0 features? Are there German files that do compounding? - Bram -- Kisses may last for as much as, but no more than, five minutes. [real standing law in Iowa, United States of America] /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ Project leader for A-A-P -- http://www.A-A-P.org /// \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html /// |