From: <ne...@gy...> - 2005-08-24 23:17:17
|
Bram Moolenaar <Br...@mo...> wrote:

> Laci -
>
> I now took some time to look through the hunspell code (version 1.0.8) and the manual page. Following are my remarks.
>
> Please don't take these remarks too negatively! I know developing code like this is difficult and requires a lot of knowledge. I have a lot of experience with portable C programming, not much with specific languages, esp. Hungarian. My remarks are aimed at a definition of a generic affix file format, which can be used by many spelling tools. That has different requirements than making a Hungarian spell checker. I also think that my experience with various alternatives helps you to avoid going in the wrong direction.

Dear Bram,

Your work is fantastic! I am happy to see your great solutions! Thank you for this and for your wonderful help. In the near future I'd like to improve Hunspell with your help (extend/integrate/replace Hunspell with your work). Great! It is a gift for my development.

> Feel free to respond on each item separately. This has grown way too long!
>
> BTW, the maillist at Yahoo appears to be dead. Is there a new one?

I have moderated your letter, because magyarispell.yahoogroups.com is a mailing list about Hungarian dictionary specifics, conducted in Hungarian. I'm very sorry that I haven't responded to your letter yet. I have just created a mailing list on Sourceforge. I will post your letter and this one to hun...@li... when it is created, today or tomorrow.

> > Using a hash table
>
> It appears you have run into a lot of trouble isolating a word. I have had the same problem in Vim when I was using a hash table. Since then I have switched to using a trie. Then it's not necessary to first locate or guess the end of the word. This makes all the code a lot simpler, especially for making suggestions and for compound words. You can also use words that end in a dot, e.g. "etc.", or include a space, with no extra effort.

Sounds very good to me.
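The point about not needing to isolate the word first can be illustrated with a minimal trie sketch (an illustration only, not Vim's actual data structure): the trie itself reports every position where a stored word ends, so words like "etc." need no special handling.

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def matches_at(self, text, start):
        """Yield the length of every word that matches at text[start:].
        The trie tells us where each word ends, so we never have to
        guess the end of the word first -- "etc." works with no effort."""
        node = self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            if node.is_word:
                yield i - start + 1

trie = Trie(["etc.", "et", "cetera"])
print(list(trie.matches_at("etc. and more", 0)))  # both "et" and "etc." match
```

With a hash table the caller must first decide where the word ends before it can look anything up; here the lookup and the end-of-word detection are one and the same walk.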
For multiple affix stripping, a trie would be a much better format. I have one small problem with the trie, though: we need a morphological analyser for grammar checking and for suggesting synonyms with affixes. For morphological analysis the trie must be extended with state information (roots and morphemes with morphological descriptions). That seems a little difficult to me, because we also need to handle root and affix homonyms. In addition, for morphological derivation we need transducers, or some less elegant data structure (linear affix search in the trie data).

> When you go into making suggestions you will find that the hash table is making it nearly impossible to find words with more than one insert/delete/swap/replace edit operation. The trie I'm using makes this possible. I can't say the code is simple, but I've already written it and it works very well for all languages. Currently suggestions with up to three or four edit operations are found. This can be tuned, it's a trade-off with speed, not a limitation of the mechanism.

Kevin Hendricks wrote an n-gram suggestion code; I have now extended it with a refinement based on the longest common subsequence algorithm. For example, it works well for foreign name suggestions (Montesquo -> Montesquieu) and for non-neighbouring differences (permenant -> permanent). But this function would also work well with a trie.

> When trying to locate word breaks to check for compound words you run into trouble. You need to try every position to make sure you don't miss a possible compounding. With the trie you only need to try at the end of a recognized word. That is a huge speed increase, esp. when compounding more than two words. Perhaps this is also a solution for Thai without spaces between words.

I think Hunspell's left-to-right recursive compound check algorithm is similar. When the left word is bad, Hunspell doesn't check the right word(s). (But Hunspell calls a lot of functions when it checks substrings, so you are right.)
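The left-to-right recursive compound check can be sketched like this (a simplified sketch assuming a plain word set; the real Hunspell code also applies affixes, compound flags, and syllable limits):

```python
def check_compound(word, dictionary, min_part=3, max_words=None, _depth=1):
    """Return True if `word` splits left-to-right into dictionary words.
    When the left part is not a word we never recurse into the right
    part, which prunes most of the search space."""
    if word in dictionary:
        return True
    if max_words is not None and _depth >= max_words:
        return False
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in dictionary and check_compound(
                right, dictionary, min_part, max_words, _depth + 1):
            return True
    return False

words = {"tomato", "bicycle", "soup", "shop"}
print(check_compound("tomatosoup", words))   # True
print(check_compound("bicyclesoap", words))  # False
```

Bram's trie-based version improves on this further: instead of testing every split position `i`, the trie walk directly reports the few positions where a recognized word actually ends.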
Speed is an interesting question. I think a hash is faster for the roots, but a trie is better for affix checking. A trie is also more compact, which may be a great advantage with the CPU cache.

> Another advantage of the trie is that it works for utf-8 without much trouble. Especially when the words with affixes applied are put in the trie, so that the conditions can be checked while building the trie, instead of when spell checking the words. Note that not all affixes can be pre-processed this way, it takes too much time to generate the trie then (for that reason I haven't been able to check about its size yet). For Hebrew and Hungarian I currently don't put words with prefixes in the word trie, they are stored in a separate trie. That's a bit like the compound word mechanism.

Yes, there is a similarity, and a trade-off between affix checking and compounding. Multiple affix stripping would be ideal for agglutinative languages, but implementing it at run time is impossible. (Hunspell's file format is ready for defining multiple affixes, but for now Hunspell handles only twofold suffix stripping. I can imagine an off-line multiple affixes -> twofold affixes precompiler. I understand you have made a single affixes -> affixed words run-time precompiler.) (BTW, a generalized solution for spell checking and for morphological analysis and generation, for every language and with the best efficiency, would be a two-level rule compiler. Sorry, I have no experience with the difficulties of this method. A GPL-ed tool: http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html)

> Vim stores the trie in a .spl file, together with all other required data. For most languages the .spl file is only 50% the size of the .dic file.

The first item on the Hunspell TODO list is mmap support for dictionary files. Michael Meeks, an OOo developer, has written a MySpell patch to solve memory problems in multi-user environments.
I will make an optimized binary dic format for Hunspell that can be shared between users in a client-server environment via the mmap support of the operating system. This format will be optional, for backward compatibility: Hunspell will look for this format first, then for the original uncompressed format.

> As Geoff Kuenning (the author of ispell) wrote: Using a hashtable is a dead end. He refers to an article by Kemal Oflazer that proposes using a finite state machine. That's what the Vim code is doing.

I'm sorry to say I have no chance to implement a better solution myself, but I hope I can integrate your better solution/code into Hunspell.

> The choice between a hash table or a trie has only minor impact on the affix file. The only thing I can think of is that Vim doesn't need the TRY entry, since the trie specifies which characters may appear at a certain position in the word. Thus the remarks below are valid no matter if you continue to use a hash table or not.
>
> > LANG in affix file
>
> A generic remark about the affix file format is the use of the LANG entry. This results in various checks that are not specific for one language to depend on the language name. I think that's a bad choice. For example, the code to do compounding with dashes now depends on LANG to specify Hungarian or German. But it probably also applies to Dutch and other languages. These mechanisms should be defined in the affix file separately, e.g. with a COMPOUNDDASH item.

I will remove this language-specific code. Compounding with dashes needs a hybrid solution. Real compound words are handled with dash compounds and prefixes (though with your excellent COMPOUNDFLAGS there is no need to use dash prefixes). Other compounds (for example, in Hungarian, twin-word-like word pairs, or word lists) are checked by the grammar checker.

> Another example is syllable counting. It's easy to define this in a generic way and avoid doing it only for Hungarian.
> Then other languages to which this applies can do the same thing. And possibly for Hungarian the method may need to be tuned to the word list used.

You are right.

> A few more hints about reducing the dependency on LANG:
>
> - Remove all the language-specific strings from affixmgr.cxx. I see strings like "ccscs" for Hungarian consonants. These can be defined in the affix file.
>
> - I think that the rule to replace the sharp s by SS when making a word upper case can be done independent of the language. It depends on the character itself, not the language. Same for the suggestion to change '-' to ' '. Vim actually tries that change for all non-word characters.
>
> - Some specific affix names are used, these should really be defined in the affix file.
>
> - In SuggestMgr::twowords() check_forbidden() is only called for Hungarian. I would think this is a generic check, not specific for Hungarian.

This call is specific to Hungarian, because we put a dash between the two words when the bad compound word is in the dictionary. You are right, it would be fine to generalize this possibility (for example, with a SUGGESTTWOWORDSWITHDASH flag).

> - I see repl_check() being used inside compound_check(). This means that REP items are used to check for correct spelling. So far REP items were only used for making suggestions. That might be wrong, or otherwise it should be explained in the documentation.

I have documented it in the source code above the repl_check() function. Using REP there is a great advantage for Hungarian and presumably for other languages with rich compounding. You are right, we need to generalize this option, too. I have already planned it (BACKCHECKCOMPOUNDS or a similar flag). LANG is a sandbox for developers, and an excuse for me. I have a lot of doubts about my improvements. I have only been trying to generalize the language-specific code of Hunspell for a short time, since the summer. Thank you for your help!
> > Flags in affix file
>
> I notice you also allow flags for affixes (the ones that cannot be used in the .dic file). I find this a bit confusing, using flags on flags (except for double suffixes, that's something different). I already added the "rare" item for affixes and recently the "nocomp" item (disallow compounding for a word with this suffix).
>
> Similarly I could add "comp" to allow compounding for a word with this affix, like you do with COMPOUNDFLAG. Using these verbose names instead of flags makes the affix file easier to understand, while the extra file size is negligible.

We have a sophisticated preprocessor for Hunspell, named Hunlex, under development.

> On the other hand, it looks like the affix flags replace the flags for the word it's used with. In that case it becomes a generic mechanism and should be described as such. I can see this is useful: for a specific suffix you can use different flags. This would disallow using a prefix, for example, by using a slash and no flags.
>
> But if it's also possible to keep using the flags of the word, or the flags are optionally added, then it becomes complicated. And it's already complicated enough...
>
> At least we should avoid using a character flag for things that cannot be used in the word list. This especially applies to COMPOUNDFORBIDFLAG. I'm currently using "nocomp" after the affix for this.
>
> The ONLYINCOMPOUND flag appears only to apply to the affix file. But it would also be useful for the word list, to specify words that can only be used in a compound, not by themselves.
>
> It's unclear to me what COMPOUNDROOT is used for. Can you explain that?

COMPOUNDROOT flags the compound words in the dictionary. For example:

----- aff -----
COMPOUNDROOT x
----- dic -----
kávé # coffee
szünet # pause
kávészünet/x # coffee pause

In Hungarian orthography there is a special rule: compound words with at least
3 words and 7 syllables must be written with a dash:

kávészünetigény # need for a coffee break
kávészünet-rendelet # coffee break regulations

So why don't we simply remove compounds from the dictionary? For example, we may have a ready-to-use dictionary with a lot of unflagged compounds, or we may want to restrict compound support to dictionary compounds when we need stricter spell checking (proof-reading etc.). Sometimes the REP back-checking in compound_check() forbids good compounds, too, and then we need to put these compounds into the dictionary with the COMPOUNDROOT flag.

> > Compound word count
>
> I had already implemented the maximum nr of words for compounding and the maximum nr of syllables for compounding. But I guessed a word would have to meet both criteria. The new documentation specifies that a word needs to meet one of these rules. Is this always so? If so I'll change my implementation.

Both criteria are needed for Hungarian, but I separated them for other languages. It is great! I hope that with your effort we can make better language tools, and not only for Vim. :)

> If not then adding an affix item to choose between OR/AND could be used.
>
> I have another method to define compound words, I'll include the current Vim help text about this at the end. I'm not sure in how far this replaces some of the language-specific mechanisms that are in the hunspell code. It will certainly replace COMPOUNDFLAG, COMPOUNDBEGIN, COMPOUNDMIDDLE and COMPOUNDLAST in a generic way. It supports compounding of a certain group of words with another group of words. This rules out nonsense words. E.g., when allowing "tomato" and "bicycle" for start, and "soup" and "shop" for last, you can make the legal "tomatosoup" and "bicycleshop", but also "bicyclesoup", which is wrong.

It is wonderful!

> > Word characters
>
> I notice you use WORDCHARS for this. In Vim I made it a bit more generic: FOL, LOW and UPP lines specify both word characters and case folding.
> This is required for when the locale is not available on a system or another locale is currently being used. Especially relevant for utf-8, which allows a user to edit text without adjusting the locale. But also to allow editing text in various Microsoft codepages on Unix, for which there is no locale.
>
> In Vim MIDWORD specifies characters that should only be considered to be word characters when used in between word characters. Especially useful for ' and -.

It is not enough. For example, a final dot may be a full stop or the dot after an abbreviation. OpenOffice.org has the following algorithm for this. An example, where OOo checks "dxg.":

1. OOo sends "dxg." to Myspell. Myspell (and now Hunspell 1.0.9) suggests "dog" (without the period). The user chooses this item, and
2. OOo adds the period back, replacing "dxg." with "dog."

And with an abbreviation:

1. OOo checks "exc.". Myspell suggests "etc."
2. OOo replaces "exc." with "etc."

Hunspell's former default period suggestion (I set it up for Abiword back then) has been made optional via the SUGSWITHDOTS affix parameter. For a grammar checker we need more sophisticated tokenization. (For example, we have word pairs, like "vice versa" etc.)

> > Affix required
>
> I notice PSEUDOROOT, which specifies that a word can't be used without affixes. I'm using NEEDAFFIX for that. I guess they do the same thing. I find NEEDAFFIX simpler to understand. (I'm not a linguist!)

You are right. I will use it. (BTW, NEEDAFFIX is the better name also because this flag is usable for affixes, too. See tests/pseudoroot3.*)

> I wonder if something similar should be used for words that are only valid when used in a compound word? ONLYINCOMPOUND could be re-used for this.

I have planned it, because for now this is a little awkward with NEEDAFFIX and a zero morpheme:

NEEDAFFIX A
COMPOUNDFLAG B
ONLYINCOMPOUND C
SFX X Y 1
SFX X 0 0/BC .
--- dic ---
foo/A

> > Decapitalising
>
> The method apparently requires specifying an affix for each letter a word can start with.
> I wonder, is it ever allowed to have an upper-case letter halfway a compound word? I don't think so, it would make the word huh-cap. A DECAPITALIZE item in the affix file would be sufficient. Perhaps the rule about what to do after a dash needs to be specified explicitly. I think for German they always keep the capital.

But not in Hungarian geographical names: Nyugat-Európa (Western Europe), nyugat-európai (Western European), etc. I think the idea is good, but the implementation is difficult in Hunspell. (BTW, for now I will handle Hungarian geographical names with special dash prefixes: E -> -e, Európa -> -európai.)

> > Circumfix
>
> The mechanism to match a prefix with a suffix looks very complicated to me. It also doesn't appear to be possible to use more than one flag for this, thus all suffixes with CIRCUMFIX can be used with all prefixes with CIRCUMFIX. That probably requires writing a separate suffix for each possible prefix. Or vice versa.

See the tests/circumfix example.

> Why not add a new item that defines both at the same time? Something like:
>
> PSFX {flag} {pchop} {padd} {pcond} {schop} {sadd} {scond}
>
> The {pcond} would need to match at the start of the word, {scond} at the end. Similarly for the chop and add strings. Perhaps flags could be added after {sadd}, like you do with other suffixes.
>
> An additional advantage of this item is that you can add a prefix while using a condition on the end of the word. Don't know if there is a language where this is useful though... :)

I'm afraid to implement new syntax in the affix file. But we can implement a more sophisticated preprocessor grammar for users. Hunlex has a circumfix syntax similar to yours.

> Would it be possible to use this mechanism for a compound word, so that the prefix applies to the first word and the suffix to the last word? Or is that not needed for any language?

Quite right.
There is, in Hungarian, with compound adjectives:

-barát (-friendly)
felhasználóbarát (user-friendly)
legfelhasználóbarátabb (most user-friendly)
legmacskabarátabb (most cat-friendly)
etc.

But it is not frequent in our 1.5 billion word Hungarian corpus:

[laci@lalilili szabaly]$ look leg szoszablya.txt | grep barátabb
leg-felhasználóbarátabb 1 1 0 0
legbababarátabb 1 1 1 0
legbarátabbak 1 1 1 0
legbőrbarátabb 1 1 1 1
legcsaládbarátabb 1 1 0 0
legfelhasználó-barátabb 5 5 1 0
legfelhasználóbarátabb 26 26 6 2
legfelhasználóbarátabbikát 3 3 0 0
legfogyasztóbarátabb 1 1 0 0
legfénybarátabb 1 1 0 0
leghátbarátabb 6 6 6 6
legkörnyezetbarátabb 36 36 30 22
legkörnyezetbarátabb. 2 2 2 2
legkörnyezetbarátabbnak 4 4 1 1
legközönségbarátabb 1 1 0 0
leglóbarátabb 3 3 3 2
legmelegbarátabb 1 1 1 0
legmosleybarátabb 2 2 0 0
legnyomdafestékbarátabb 2 2 2 2
legpolgárbarátabb 1 1 1 1
legrádióbarátabb 1 1 0 0
legrádióbarátabbnak 1 1 0 0
legszuperlóbarátabb 1 1 1 1
legtermelőbarátabb 2 2 2 0
legtermészetbarátabb 3 3 3 3
legtuningbarátabb 2 2 0 0
legtuningbarátabbak 1 1 0 0
legállatbarátabb 5 5 4 0
legönkormányzatbarátabb 1 1 1 1
legügyfélbarátabb 3 3 1 1

It seems Hungarian doesn't like compound adjectives with superlatives. But this feature would be fine for Hungarian, perhaps as a special kind of compounding. (But in the artificial (orthographical?) linguistic paradigm, `leg' (~most) is not a root, but a prefix.)

> > More flags
>
> I agree that the problem with running out of single character flags is relevant. Your solution is to allow two-character flags and numbered ones, separated by commas. A third method, which would be attractive generally, is to use single-character flags as before, and also allow an upper case letter with another character, thus Aa, Bx, etc. Using A to Z as a single character flag isn't possible then. This has two advantages:
>
> - Allows using single character flags for the most common ones.
> - Two-character flags are much easier to read when concatenated: AbZcNf
> - Allows for about 68 + 26 x 68 = 1832 flags. That should be sufficient for all languages, especially with the double-suffix mechanism.

Hunspell also has a similar, but unrestricted, mode (FLAG long). I like your multi-byte long format. :) I will implement it in Hunspell if you support it in Vim. I also think 1832 flags are enough for affixes.

> A few more things that I noticed:
>
> It very much helps in the code if you add a comment above each method to briefly explain what it does and what the return value is. I see functions returning 0 or 1 without suggesting what that means. This makes it a lot more difficult to understand the code.

I will do it.

> COMPOUNDMIN is defined as "Minimum length of words in compound words. Default value is 3 letter." I understand this means that characters are counted instead of bytes. But how about counting non-letters? Would "d'e" be valid for compounding? From the code I deduce that the explanation should be that the length is defined in characters, not letters.

Yes, you are right. Thanks. But with Unicode, these characters may be multibyte UTF-8 characters.

> The manpage mentions: "Hungarian has a standard ASCII character set (ISO 8859-2)". "ASCII" should be "8-bit". ASCII is only 7-bit.

I will fix it.

> SYLLABLENUM appears to be a hack. Any plans to replace this with something more generic? Perhaps defining flags for adding one, two or three syllables to a word?

Yes, it is a hack for Hungarian, with hardwired affixes in the code. We need to calculate the syllable count of some (derivational) suffixes for Hungarian. I will make a DERIVATIVE(AFFIX) affix attribute for stemming, and I will replace SYLLABLENUM with language-specific code for calculating the syllable counts of derivational affixes.

> About ACCENT: Vim uses the MAP items for this.
> This also implies that these changes in accents don't need to be put in REP items.

I have already removed ACCENT support. Thanks for your comment. I think we need this little MAP-REP redundancy for sorting the suggestions better. I have changed the order of the MAP and REP function calls in the Hunspell source for this purpose. REP suggestions need higher priority, because REP contains typical (frequent) errors.

> You appear to use FORBIDDENWORD where Vim uses BAD. I mostly use the ! flag for BAD to make it stand out.

I like it. FORBIDDENWORD will be deprecated in Hunspell. ! is nice. :)

> Well, that's enough for now!

Well, I had planned long ago a non-deterministic finite state automaton for compound checking, with an ugly and incomprehensible syntax. Your COMPOUNDFLAGS has a wonderful, admirably clear syntax. It is a huge advantage over the old compound support. I had started to convert our idiotic syllable-counting Hungarian compound rule to a non-deterministic automaton, but I also wanted to encode the affix flags into these rules. (Here is my first (and last) horrible attempt:

alap (base)                1      2      3      4      5
1 1_01 ->                 2_12 | 2_13 | 2_14 | 2_15 | 2_1n
2 1_02 ->                 2_23 | 2_24 | 2_25 | 2_2n | 2_2n
3 1_03 ->                 2_34 | 2_35 | 2_3n | 2_3n | 2_3n
4 1_04 ->                 2_45 | 2_4n | 2_4n | 2_4n | 2_4n
5 1_0n ->                 2_nn | 2_nn | 2_nn | 2_nn | 2_nn

harmadik, előző szótagszám (third word, by previous syllable count)
2 (2_12) ->                                        3_23 | 3_24 | 3_25 | 3_26
3 (2_13, 2_23, 3_23) ->                            3_34 | 3_35 | 3_36
4 (2_14, 2_24, 2_34, 3_24, 3_34) ->                3_45 | 3_46
5 (2_15, 2_25, 2_35, 2_45, 3_25, 3_35, 3_45) ->    3_56

1 szótagúak (1-syllable words): 1_01 2_12 2_23 2_34 2_45 2_nn 3_23 3_34 3_45 3_56
2 szótagúak (2-syllable words): 1_02 2_13 2_24 2_35 2_4n 2_nn 3_24 3_35 3_46
3 szótagúak (3-syllable words): 1_03 2_14 2_25 2_3n 2_4n 2_nn 3_25 3_36
4 szótagúak (4-syllable words): 1_04 2_15 2_2n

I imagined state definitions with next states in the affix file, but it would have been more difficult to use.) With a lot of COMPOUNDFLAGS items, I hope I can define a similar non-deterministic automaton with a much simpler syntax.
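Since a COMPOUNDFLAGS item (quoted in full in the Vim help text further on) is already regexp-like, checking whether a sequence of word flags forms an allowed compound can be sketched by translating each item into an anchored regular expression. This is an illustration under the assumption of single-character flags, not Vim's or Hunspell's actual implementation:

```python
import re

def compoundflags_to_regex(pattern):
    """Translate a COMPOUNDFLAGS item like "s[xyz]*e" or "c+" into an
    anchored regex over the string of word flags.  The item syntax
    (flags, [...] alternates, * and +) is already a regexp subset."""
    return re.compile(r"\A(?:" + pattern + r")\Z")

def allowed(flag_sequence, patterns):
    """A compound is allowed if the flags of its words, concatenated
    in order, match any COMPOUNDFLAGS item."""
    return any(compoundflags_to_regex(p).match(flag_sequence)
               for p in patterns)

rules = ["c+", "se"]            # the example rules from the help text
print(bool(allowed("ccc", rules)))  # borkborkbork -> True
print(bool(allowed("se", rules)))   # onionsoup    -> True
print(bool(allowed("es", rules)))   # souponion    -> False
```

This also shows why the mechanism is effectively a non-deterministic automaton: the regex engine explores the alternatives, so the affix file author never has to write out explicit state tables like the one above.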
Other interesting types of compounding in Hungarian:

- numbers: 1999 = ezerkilencszázkilencvenkilenc, 2005 = kétezer-öt, 1 872 453 123 = egymilliárd-nyolcszázhetvenkétmillió-négyszázötvenháromezer-százhuszonhárom,
- compounds with the moving rule: személygépkocsinyereménybetétkönyvkiviteliengedély-kérés,
- twin words: pörögtünk-forogtunk (the words need the same affix: pörgött-forgott, but *pörögtünk-forgott).

I admire your excellent work. Congratulations! I will implement COMPOUNDFLAGS in Hunspell. With my reverse logic, Vim will be the best interactive interface for our spell checker development. :) Thanks a lot. I'm very sorry that with my English I cannot express my happiness well enough. :)

Best regards,
Laci

> - Bram
>
> Part of the current Vim help on affix items:
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
> COMPOUND WORDS *spell-affix-compound*
>
> A compound word is a longer word made by concatenating words that appear in the .dic file. To specify which words may be concatenated a character is used. This character is put in the list of affixes after the word. We will call this character a flag here. Obviously these flags must be different from any affix IDs used.
>
> *spell-COMPOUNDFLAG*
> The Myspell compatible method uses one flag, specified with COMPOUNDFLAG. All words with this flag combine in any order. This means there is no control over which word comes first. Example:
> COMPOUNDFLAG c
>
> *spell-COMPOUNDFLAGS*
> A more advanced method to specify how compound words can be formed uses multiple items with multiple flags. This is not compatible with Myspell 3.0. Let's start with an example:
> COMPOUNDFLAGS c+
> COMPOUNDFLAGS se
>
> The first line defines that words with the "c" flag can be concatenated in any order. The second line defines compound words that are made of one word with the "s" flag and one word with the "e" flag.
> With this dictionary:
> bork/c
> onion/s
> soup/e
>
> You can make these words:
> bork
> borkbork
> borkborkbork
> (etc.)
> onion
> soup
> onionsoup
>
> The COMPOUNDFLAGS item may appear multiple times. The argument is made out of one or more groups, where each group can be:
> one flag e.g., c
> alternate flags inside [] e.g., [abc]
> Optionally this may be followed by:
> * the group appears zero or more times, e.g., sm*e
> + the group appears one or more times, e.g., c+
>
> This is similar to the regexp pattern syntax (but not the same!). A few examples with the sequence of word flags they require:
> COMPOUNDFLAGS x+        x xx xxx etc.
> COMPOUNDFLAGS yz        yz
> COMPOUNDFLAGS x+z       xz xxz xxxz etc.
> COMPOUNDFLAGS yx+       yx yxx yxxx etc.
> COMPOUNDFLAGS [abc]z    az bz cz
> COMPOUNDFLAGS [abc]+z   az aaz abaz bz baz bcbz cz caz cbaz etc.
> COMPOUNDFLAGS a[xyz]+   ax axx axyz ay ayx ayzz az azy azxy etc.
> COMPOUNDFLAGS sm*e      se sme smme smmme etc.
> COMPOUNDFLAGS s[xyz]*e  se sxe sxye sxyxe sye syze sze szye szyxe etc.
>
> *spell-COMPOUNDMIN*
> The minimal byte length of a word used for concatenation is specified with COMPOUNDMIN. Example:
> COMPOUNDMIN 5
>
> When omitted a minimal length of 3 bytes is used. Obviously you could just leave out the compound flag from short words instead, this feature is present for compatibility with Myspell.
>
> *spell-COMPOUNDMAX*
> The maximum number of words that can be concatenated into a compound word is specified with COMPOUNDMAX. Example:
> COMPOUNDMAX 3
>
> When omitted there is no maximum. It applies to all compound words. To set a limit for words with specific flags make sure the items in COMPOUNDFLAGS where they appear don't allow too many words.
>
> *spell-COMPOUNDSYLMAX*
> The maximum number of syllables that a compound word may contain is specified with COMPOUNDSYLMAX. Example:
> COMPOUNDSYLMAX 6
>
> This has no effect if there is no SYLLABLE item.
> Without COMPOUNDSYLMAX there is no limit on the number of syllables.
>
> *spell-SYLLABLE*
> The SYLLABLE item defines characters or character sequences that are used to count the number of syllables in a word. Example:
> SYLLABLE aáeéiíoóöőuúüűy/aa/au/ea/ee/ei/ie/oa/oe/oo/ou/uu/ui
>
> Before the first slash is the set of characters that are counted for one syllable, also when repeated and mixed, until the next character that is not in this set. After the slash come sequences of characters that are counted for one syllable. These are preferred over using characters from the set. With the example "ideeen" has three syllables, counted by "i", "ee" and "e".
>
> Only case-folded letters need to be included.
>
> Another way to restrict compounding was mentioned above: adding "nocomp" after an affix causes all words that are made with that affix not to be used for compounding. |spell-affix-nocomp|
|
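The SYLLABLE counting rule quoted above can be sketched as follows (a sketch of the described behaviour with a shortened vowel set, not Vim's actual code): listed sequences are tried first and each count one syllable; otherwise a run of set characters counts one syllable.

```python
def count_syllables(word, charset, sequences):
    """Count syllables per the SYLLABLE item description: a listed
    sequence counts one syllable and is preferred; otherwise a run of
    set characters counts one syllable until a non-set character."""
    count = 0
    i = 0
    in_run = False  # inside a run of set characters already counted
    while i < len(word):
        # Try the preferred multi-character sequences first, longest first.
        for seq in sorted(sequences, key=len, reverse=True):
            if word.startswith(seq, i):
                count += 1
                i += len(seq)
                in_run = False
                break
        else:
            if word[i] in charset:
                if not in_run:
                    count += 1
                    in_run = True
            else:
                in_run = False
            i += 1
    return count

seqs = ["aa", "au", "ea", "ee", "ei", "ie", "oa", "oe", "oo", "ou", "uu", "ui"]
print(count_syllables("ideeen", set("aeiouy"), seqs))  # 3: "i", "ee", "e"
```

This reproduces the "ideeen" example from the help text: "i" counts one, the preferred "ee" sequence one, and the remaining "e" one.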
From: Bram M. <Br...@mo...> - 2005-08-25 14:02:06
|
Laci -

I have subscribed to the hunspell-devel list. Perhaps you would like to invite others to subscribe to this list? Especially the people involved with Myspell and OpenOffice.org spelling.

I don't know where this is going, but it certainly looks good. I'm glad you are taking over a few of my suggestions. Hopefully we can really work out a common affix file format. Whether we can share the binary file format remains to be seen. We can at least try. But it's a secondary goal.

If you like how the trie works in Vim, you might consider using the same code and/or mechanisms in hunspell. At least it will be good if there is a library with the spell code I'm now using for Vim, so that other programs can use the functionality.

A few things added this week:

- Two-letter flags. I'm using the "FLAG long" item in the affix file, just like hunspell. "FLAG num" is also supported. As an extra I implemented "FLAG huh", which allows both single-character flags and two-character flags starting with A-Z. Perhaps "huh" isn't a good name, since besides "Aa" and "Xx", which form HuH-cap words, "AA", "AX", etc. are also allowed. What would be a better name for this? "short-long"? "mix"? Hmm, perhaps "caplong" is clearer.

- Concatenating words without spaces in between. Mostly for Thai and similar languages. I'm using NOBREAK in the affix file for this. I'm not sure it works properly, I don't understand Thai and only found one usable word list (which apparently is partly corrupt).

An important thing to add next is specifying flags on an affix, like hunspell uses:

SFX a chop add/FLAGS cond

I haven't been able to figure out how these FLAGS are to be used exactly. Do they replace the flags of the word, add to them, or something more complicated? We need a clear definition for the people who write an affix file. And besides serving the purpose of Hungarian, it should be generic enough for other languages. Obviously the FLAGS are used for second level affixes.
The flags from the word itself will not be used for this, they are only used for the first level affixes.

The word the affix is used with may define flags for compounding. The FLAGS may also define compounding flags, since a word plus affix may compound in a different way. We need to define how this works exactly.

I can see it can be useful to have the FLAGS overrule the compound flags of the word, so that the affix changes how compounding can be done. You already mentioned this is needed for Hungarian. I can also see it can be useful to keep the compound flags of the word. Especially if the affix doesn't change the compounding and it can be used on words with different compound flags.

Logically there are these possibilities:
1. OR: compound flags of the word and affix are both used.
2. AND: compound only if a flag appears both on the word and in FLAGS.
3. SET: use FLAGS only.

Which one of these are you currently using? Would it be necessary to allow the other ones? We could use an extra item after the affix for this, just like I now have "rare". Perhaps these could be used: compOR, compAND, compSET. Example:

SFX a 0 add/FLAGS . compOR

Using "compSET" and not including compound flags in FLAGS is equivalent to disallowing compounding. This replaces my current "nocomp" flag. Using "compOR" and not including compound flags in FLAGS is equivalent to using the flags of the word without modifications.

We don't need all three, one of them can be the default. A complicated default mechanism would be to use compOR when FLAGS does not contain compound flags and compSET when FLAGS does have compound flags. This may be confusing, since it's not directly clear from the flags which ones are for compounding. I currently think using compOR as the default would be most useful. I would expect compAND is rarely used. compSET would then be used to disable and redefine compounding.

And finally a wild idea: for the affix, separate the flags for second level affixes and for compounding.
This makes it easier to understand and reduces mistakes. SFX a 0 add/AFFIXFLAGS . comp/COMPFLAGS And since the condition is in a strange place now, this might be better: SFX a 0 add . /AFFIXFLAGS comp/COMPFLAGS I don't know why you had put the flags on the "add" part of the affix, putting them separately with a leading slash seems simpler to me, while it still allows for an optional morphological field. I'll move the rest of my reply to a second message. - Bram -- hundred-and-one symptoms of being an internet addict: 97. Your mother tells you to remember something, and you look for a File/Save command. /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ Project leader for A-A-P -- http://www.A-A-P.org /// \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html /// |
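To make the three proposed combination modes concrete, here is a minimal Python sketch; the function and argument names (compound_flags, word_flags, affix_flags) and the set representation are illustrative only, not part of any actual hunspell or Vim API:

```python
def compound_flags(word_flags, affix_flags, mode="compOR"):
    """Combine a word's compound flags with the FLAGS of an applied
    affix, under the three proposed modes."""
    if mode == "compOR":    # flags of the word and the affix are both used
        return word_flags | affix_flags
    if mode == "compAND":   # compound only on flags present in both
        return word_flags & affix_flags
    if mode == "compSET":   # the affix FLAGS replace the word's flags
        return affix_flags
    raise ValueError("unknown mode: " + mode)
```

Note how compSET with an empty FLAGS set yields no compound flags at all, i.e. compounding is disallowed, matching the "nocomp" remark above, while compOR with an empty FLAGS set leaves the word's flags unmodified.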
From: Bram M. <Br...@mo...> - 2005-08-25 18:47:26
|
Laci - The second part of my reply. > > Using a hash table > > > > It appears you have run into a lot of trouble isolating a word. I > > have had the same problem in Vim when I was using a hash table. > > Since then I have switched to using a trie. Then it's not necessary > > to first locate or guess the end of the word. This makes all the > > code a lot simpler, especially for making suggestions and for > > compound words. You can also use words that end in a dot, e.g. > > "etc.", or include a space, with no extra effort. > > Sounds very good to me. > For multiple affix stripping a trie will be a better format. > > I have a little problem with the trie. > We need a morphological analyser for grammar checking, or suggestion > of synonyms with affixes. > > For morphological analysis the trie must be extended with state > information (roots and morphemes with morphological descriptions). It > seems a little difficult to me, because we need to handle root and affix > homonyms, too. In addition, for morphological derivation, we need > transducers, or less suitable data structures (linear affix > search in trie data). It would be possible to add morphological information to the trie. The compression is based on combining words with identical tails. If words have different morphological info their tails cannot be shared. Thus compression will be less efficient. By how much is difficult to predict. If the morphological information can be put in a small number of properties it might still work. For example "has suffix of three letters" can be added. Then you can go back three letters to find other suffixes at that point. Obviously, flags like "verb" and "noun" can be added easily. > > When you go into making suggestions you will find that the hash table > > is making it nearly impossible to find words with more than one > > insert/delete/swap/replace edit operation. The trie I'm using makes > > this possible. 
I can't say the code is simple, but I've already > > written it and it works very well for all languages. Currently > > suggestions with up to three or four edit operations are found. > > This can be tuned, it's a trade-off with speed, not a limitation of > > the mechanism. > > Kevin Hendricks had written an ngram suggestion code, which I have now extended > with a refinement based on the longest common subsequence algorithm. > For example it works well for foreign name suggestion (Montesquo -> > Montesquieu) and non-neighbouring differences (permenant -> permanent). > But this function is also good for a trie. Is this the SuggestMgr::ngsuggest() method? It apparently looks through all the root words. In Aspell there are remarks that using ngrams is very slow. I haven't tried it. I did try sound folding and know that doing something for every possible word can be very slow. First going over root words is an optimization for languages with lots of affixes, that is a good hint. However, you still need to try affixes in a second step. I doubt it will give better results than what I'm already doing. And it won't find suggestions by splitting the word. For example, "Montesquo" results in these suggestions with Vim (first five, using English): Change "Montesquo" to: 1 "Monte quo" 2 "Mont's quo" 3 "Mantes quo" 4 "Montesquieu" 5 "Monte's quo" This takes a fraction of a second on my system. > > Vim stores the trie in a .spl file, together with all other required > > data. For most languages the .spl file is only 50% the size of the .dic file. > > The first element of the Hunspell TODO is mmap support for dictionary > files. Michael Meeks, an OOo developer, has written a MySpell patch to > solve memory problems in multi-user environments. I will make an > optimized binary dic format for Hunspell that can be shared between users > in a client-server environment by the mmap support of the operating system. 
> > This format will be optional for backward compatibility: Hunspell > searches for this format first, then the original uncompressed format. I think mmap only works on Unix systems. I don't like non-portable solutions. Now that you can buy 1Gbyte of memory for $100 and most word lists don't use more than 5 Mbyte, it's not worth spending much time on these memory saving mechanisms. Especially when they complicate the code. > > LANG in affix file > > > > A generic remark about the affix file format is the use of the LANG > > entry. This results in various checks that are not specific for one > > language to depend on the language name. I think that's a bad choice. > > For example, the code to do compounding with dashes now depends on > > LANG to specify Hungarian or German. But it probably also applies > > to Dutch and other languages. These mechanisms should be defined in > > the affix file separately, e.g. with a COMPOUNDDASH item. > > I will remove this language-specific code. I'm glad you agree. > Compounding with dashes needs a hybrid solution. Real compound > words are handled with dash compounds and prefixes (but with your excellent > COMPOUNDFLAGS, there is no need to use dash prefixes). Other > compounds (for example in Hungarian, twin word-like word pairs, or > word lists) are checked by the grammar checker. I don't intend to add a grammar checker, because it requires context. I already have enough trouble with line breaks. Thus I want to put as much checking as possible in the words. I don't quite understand your remark. Perhaps dashes can be used in COMPOUNDFLAGS to specify rules where dashes are inserted in between words? Something like: COMPOUNDFLAGS s-m*e Which means: Starting word, dash, any number of middle words, ending word. But the rules for compound words (word count, syllable count) may work differently then. Perhaps that is why you want to leave it to the grammar checker? Simplest would be to see a dash not as a word character. Would that work for Hungarian? 
Not for other languages though. Could make compound rules for "before a dash" and "after a dash". It gets complicated then, but perhaps it's needed anyway. > > - I see repl_check() being used inside compound_check(). This means > > that REP items are used to check for correct spelling. So far REP > > items were only used for making suggestions. That might be wrong, > > or otherwise it should be explained in the documentation. > > I have documented it in the source code before the repl_check() function. > Using REP is a great advantage for Hungarian and presumably other > languages with rich compounding. You are right, we need to generalize > this option, too. I have already planned it. (BACKCHECKCOMPOUNDS > or a similar name.) I'm afraid I don't understand the comment above repl_check(). I don't see how you can use REP items to check a word for being valid or not. You made a remark elsewhere that sometimes good compound words are rejected, thus tuning the REP items for this might be required. The example shows how a compound word can be wrong, but since the same word appears as good word anyway, it doesn't matter for spell checking. I guess you must do this for morphological purposes. The REP items define arbitrary changes to a word to be able to find suggestions. In general the REP items cannot be used to check words for being right or wrong, because they change a word in an arbitrary way. You could perhaps limit the REP items for another purpose, but then the suggestions would suffer. Thus this is a wrong dependency. Perhaps separate items need to be added to check compound words? I'm still wondering what you are actually doing here. It seems to be a way to define an additional condition for compounds. > LANG is a sandbox for developers, and an excuse for me. > I have a lot of doubts about my improvements. I have been > trying to generalize the language-specific code of Hunspell only for a > short time, since summer. Thank you for your help! That's OK. 
But currently experimental things are mixed with things which are intended for actual use, which is confusing. Adding remarks to the docs and/or code helps a lot, e.g., "experimental". That would at least help me decide what to include in Vim. > > It's unclear to me what COMPOUNDROOT is used for. Can you explain > > that? > > COMPOUNDROOT marks the compound words in the dictionary. For > example: > > -----aff--- > COMPOUNDROOT x > > -----dic--- > kávé # coffee > szünet # pause > kávészünet/x # coffee pause > > In Hungarian orthography there is a special rule. Compound words with min. > 3 words and 7 syllables have to be written with a dash: > > kávészünetigény # (need for coffee break) > kávészünet-rendelet # (coffee break regulations) > > Yes, why would we remove compounds from the dictionary? > For example, we have a ready-to-use dictionary with a > lot of unmarked compounds, or we want to restrict compound support > to dictionary compounds when we need stricter spell checking > (proof-reading etc.). Sometimes REP back checking in compound_check() > forbids good compounds too, and we need to put these compounds > into the dictionary, with the COMPOUNDROOT flag. If I understand it correctly, this flag means that the word actually is a compound word, and thus when using it in a compound word it must be counted as two words when checking the compounding rules. The flag name is confusing for me. COMPOUNDWORD would be clearer, except that it was previously used for something else (maximum word count). Perhaps "COMPOUNDED" is a better name. Theoretically a word could be a compound of more than two words. Perhaps this applies to German. Just to be ready for this, we could use "COMPOUNDED2" and "COMPOUNDED3". > > Compound word count > > > > I had already implemented the maximum nr of words for compounding and > > the maximum nr of syllables for compounding. But I guessed a word > > would have to meet both criteria. 
The new documentation specifies > > that a word needs to meet one of these rules. Is this always so? If > > so I'll change my implementation. > > Both criteria are needed for Hungarian, but I separated them for other > languages. Now you say the opposite of what I see in the code, thus I'm confused. Please be precise: What exactly are the rules for compounding for the number of words and syllables? > > In Vim MIDWORD specifies characters that should only be considered > > to be word characters when used in between word characters. > > Especially useful for ' and -. > > It is not enough. For example a last dot may be a period or a > dot after abbreviations. Vim doesn't need to know. If the word without a dot appears in the word list, then the dot is a full stop. If the word appears with the dot, then it is required and the word without the dot is misspelled. The dot then is included in the word. If the sentence end cannot be decided, that's a problem for grammatical analysis. I understand that when you use a hash table you need to figure out where the word ends, thus it's much more complicated. But you need to check the word list to find out if the dot is a full stop or part of an abbreviation. The affix file specifications can only give the information that the dot MIGHT be part of a word. Since Vim doesn't use the hash mechanism we don't need to know. > OpenOffice.org has the following algorithm for this. With an example: > > OOo checks "dxg." > > 1. send "dxg." to Myspell > > Myspell (and now Hunspell 1.0.9) suggests dog (without period). > The user chooses this item and > > 2. OOo adds the period to dog, and replaces "dxg." with "dog." How can OOo know that the dot should be added back? Also, changing "dxg." to "dog" means two changes, thus a bad score. Unless you add special code to ignore the dot. > 1. OOo checks "exc." > > Myspell suggests "etc." > > 2. OOo replaces "exc." with "etc." 
> > Hunspell's former default period suggestion (I set it for Abiword > back then) has been made optional by the SUGSWITHDOTS affix parameter. Of course you can add SUGSWITHDOTS and I'll have Vim ignore it. > For a grammar checker we need a more sophisticated tokenization. > (For example, we have word pairs, like "vice versa" etc.) Well, I'm glad I don't need to play these tricks for Vim. Using the trie this all comes for free. > > I wonder if something similar should be used for words that are only > > valid when used in a compound word? ONLYINCOMPOUND could be re-used > > for this. > > I have planned it, because now this is a little difficult with NEEDAFFIX and > a zero morpheme: > > NEEDAFFIX A > COMPOUNDFLAG B > ONLYINCOMPOUND C > > SFX X Y 1 > SFX X 0 0/BC . > > --- dic --- > foo/A I'm afraid this example looks wrong. "foo/A" means "foo" requires an affix, but there isn't one. Anyway, it's clear that using ONLYINCOMPOUND in the .dic file is a lot simpler. I was thinking that analogous to NEEDAFFIX we could use NEEDCOMPOUND instead of ONLYINCOMPOUND. A bit more consistency. > > Decapitalising > > > > The method apparently requires specifying an affix for each letter a > > word can start with. I wonder, is it ever allowed to have an > > upper-case letter halfway through a compound word? I don't think so, it would > > make the word huh-cap. A DECAPITALIZE item in the affix file would be > > sufficient. Perhaps the rule about what to do after a dash needs to be > > specified explicitly. I think for German they always keep the > > capital. > > But not in Hungarian geographical names. > > Nyugat-Európa (Western Europe) > nyugat-európai (Western European) > etc. > > I think the idea is good, but the implementation is difficult > in Hunspell. > > (BTW. Now I will handle Hungarian geographical names with special dash > prefixes: E -> -e Európa -> -európai) Hmm, I suppose you have to do that for every letter. I don't like that. 
I think for German the rule is: When compounding without a dash, the leading capital is made lower case; when compounding with a dash the leading capital is kept. I haven't seen an exception yet. Is the rule for Hungarian to always make the leading capital lower case, also when there is a dash? If not, then is it easy to make a generic rule or are flags needed to specify it on the specific words? In the last case it might actually be simpler to add the words to the .dic file with the NEEDCOMPOUND flag. That depends on how many words this applies to. > > Circumfix > > > > The mechanism to match a prefix with a suffix looks very complicated > > to me. It also doesn't appear to be possible to use more than one > > flag for this, thus all suffixes with CIRCUMFIX can be used with all > > prefixes with CIRCUMFIX. That probably requires writing a separate > > suffix for each possible prefix. > > Or vice versa. See tests/circumfix example. Hmm, this suggests that when the morphological information isn't needed, the affix file can be written as: PFX A Y 1 PFX A 0 leg . PFX B Y 1 PFX B 0 legesleg . SFX C Y 3 SFX C 0 obb/AB . Perhaps there is another situation where you really can't add a suffix without adding a prefix as well. But then you can use the NEEDAFFIX flag to avoid using "obb" by itself. Can't you always use the NEEDAFFIX flag in the situations where you now use CIRCUMFIX? > > Why not add a new item that defines both at the same time? Something > > like: > > PSFX {flag} {pchop} {padd} {pcond} {schop} {sadd} {scond} > > > > The {pcond} would need to match at the start of the word, {scond} at > > the > > end. Similarly for the chop and add strings. Perhaps flags could be > > added after {sadd}, like you do with other suffixes. > > > > An additional advantage of this item is that you can add a prefix > > while using a condition on the end of the word. Don't know if there > > is a language where this is useful though... 
> > :) > > I'm afraid to implement new syntax in the affix file. > But we can implement a more sophisticated preprocessor grammar > for users. Hunlex has a circumfix syntax similar to yours. That creates a dependency on a tool, which then becomes required, and defines another file format. I'd rather invest a bit of time in implementing this new affix type. Since it's a combination of the existing prefix and suffix, it's not really new. Anyway, so long as no word list uses this it makes no sense to add support for it to Vim. > > Would it be possible to use this mechanism for a compound word, so > > that the prefix applies to the first word and the suffix to the last > > word? Or is that not needed for any language? > > Quite right. There is one in Hungarian with compound adjectives [example removed] > It seems Hungarian doesn't like compound adjectives with > superlatives. > > But this feature would be fine for Hungarian, perhaps as a special > compounding. But in the artificial? (orthographical?) linguistic > paradigm `leg' (~most) is not a root, but a prefix.) OK, it was just an idea. It will have to wait until someone finds use for it. > > COMPOUNDMIN is defined as "Minimum length of words in compound words. > > Default value is 3 letter." I understand this means that characters > > are counted instead of bytes. But how about counting non-letters? > > Would "d'e" be valid for compounding? From the code I deduce that > > the explanation should be that the length is defined in characters, > > not letters. > > Yes, you are right. Thanks. But with Unicode, these characters may > be multibyte UTF-8 characters. Right. And since the rule doesn't change if you write a word in an 8-bit encoding or in UTF-8 it should really count characters, not bytes. 
> > Yes, it is a hack for Hungarian with hardwired affixes in the code. > We need to calculate the syllable number of > some (derivative) suffixes for Hungarian. > I will make a DERIVATIVE(AFFIX) affix attribute for stemming, and > I will replace SYLLABLENUM with language-specific code for calculating > syllable numbers of derivative affixes. I mentioned the COMPOUNDED2 and COMPOUNDED3 items above. We can do something similar for syllables. Or just specify the syllable count directly: PFX B 0 legesleg . =2 Using the "=" character to recognize the syllable count item here. Perhaps it's good to do this for all extra items: PFX B 0 legesleg . =2 /FLAGS :morpho-text #comment That starts looking like a generic mechanism, easy to parse. It would be easy to define something for words or affixes with a different number of syllables than what the normal algorithm would come up with. The trouble is doing it for the combination of a word and an affix. Especially if there is a chop string. At least, if you check affix names in the code, we could just as well add a flag to the affix in the .aff file. Hard-wiring affix names in the source code should be avoided at all costs. > > About ACCENT: Vim uses the MAP items for this. This also implies that > > these changes in accents don't need to be put in REP items. > > I have already removed ACCENT support. Thanks for your comment. > > I think we need this little MAP-REP redundancy for sorting > the suggestions better. I have changed the order of the MAP and REP > function calls in the Hunspell source for this purpose. REP suggestions > need higher priority, because REP contains typical (frequent) errors. The score for suggestions is something that can be tuned again and again. Of course it's possible to have REP items for accented characters too. > Well, long ago I planned a non-deterministic > finite state automaton for compound checking with an ugly > and incomprehensible syntax. 
> > Your COMPOUNDFLAGS has a wonderful, admirably clear syntax. > It has a huge advantage over the old compound support. I'm glad you like it. Together with the possibility to use many flags this hopefully covers most rules. However, the idea to also make compound rules with conditions still exists. You will have to tell me whether these conditions would still be needed. I know that for some languages the base word has to be modified when used in a compound. Defining this can be complicated, as a few examples for German show (e.g., Frau - Fräulein, adding "lein" changes a to ä). Hopefully it's sufficient to define the base word without a compound flag and the modified word with a compound flag and NEEDCOMPOUND. So long as we don't have a way to define these complicated rules that's the way to do it. > I imagined state definitions with next states in the affix file, but > it would be more difficult to use. I suppose Hungarian linguists don't define compounding with a state machine :-). Sticking close to how linguists define the language is often best, because language specialists can then write the .aff and .dic files. > With my reverse logic, Vim will be the best interactive > interface for our spell checker development. :) As you noticed I can mostly implement something quickly to try it out. That makes discussions a lot easier. And it certainly helps to locate bugs! - Bram -- To be rich is not the end, but only a change of worries. /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ Project leader for A-A-P -- http://www.A-A-P.org /// \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html /// |
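A minimal sketch of the longest-common-subsequence refinement mentioned in the message above; this is the textbook dynamic-programming algorithm, not the actual SuggestMgr::ngsuggest() code, and the similarity normalization is only one plausible choice:

```python
def lcs_len(a, b):
    # Dynamic programming over two rows: prev[j] holds the LCS length
    # of the processed prefix of a against b[:j].
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, ch2 in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == ch2 else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def similarity(bad, cand):
    # Normalize by the longer word so unrelated long candidates score low.
    return lcs_len(bad, cand) / max(len(bad), len(cand))
```

With this measure "Montesquieu" shares a subsequence of length 8 with "Montesquo" and "permanent" shares one of length 7 with "permenant", which is why such foreign names and non-neighbouring differences can still rank well even though they need several edit operations.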
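The COMPOUNDFLAGS patterns discussed above (e.g. s-m*e: a starting word, a dash, any number of middle words, an ending word) map naturally onto regular expressions over one chosen flag per word. A hedged sketch, with all names invented for illustration and a dash treated as a pseudo-word carrying the literal flag '-':

```python
import re
from itertools import product

def compound_ok(pattern, word_flags):
    # word_flags: one set of compound flags per component of the
    # candidate compound.  Try every way of picking one flag per
    # component and match the concatenation against the pattern,
    # read directly as a regular expression.
    regex = re.compile(pattern + "$")
    return any(regex.match("".join(pick))
               for pick in product(*word_flags))
```

This brute-force product is only for clarity; a real checker would walk the pattern while scanning the compound candidate left to right.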
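The Hungarian rule quoted above (a compound of at least three words and more than six syllables must be written with a dash) can be sketched as follows. Counting vowels as syllables works for Hungarian, and a component carrying the COMPOUNDROOT flag counts as two underlying words; the function and argument names are illustrative, not hunspell API:

```python
HU_VOWELS = set("aáeéiíoóöőuúüű")

def needs_dash(components, word_counts):
    # components: the words joined into the candidate compound.
    # word_counts: how many underlying words each component stands for
    # (2 for a dictionary compound carrying COMPOUNDROOT, else 1).
    words = sum(word_counts)
    syllables = sum(ch in HU_VOWELS for w in components for ch in w.lower())
    return words >= 3 and syllables > 6
```

So kávészünet (COMPOUNDROOT: 2 words, 4 syllables) plus rendelet (3 syllables) gives 3 words and 7 syllables and takes the dash, while kávészünet plus igény stays at 6 syllables and does not, matching the kávészünetigény / kávészünet-rendelet examples.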
From: Bram M. <Br...@mo...> - 2005-09-30 10:45:17
|
Laci - I noticed that hunspell 1.1.0 is out. I see there are many improvements and some of my suggestions have been included. It would be good if you announce new releases in this hunspell-devel list. I don't read the other list, since I can't read Hungarian. Before I include some of this in Vim, I would like to know what more will change in the coming months. Especially about the flags used with an affix. Where can I find the Hungarian .dic and .aff files that use the 1.1.0 features? Are there German files that do compounding? - Bram -- Kisses may last for as much as, but no more than, five minutes. [real standing law in Iowa, United States of America] /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ Project leader for A-A-P -- http://www.A-A-P.org /// \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html /// |