#20 OOo-like output option for debugging

open
None
5
2008-04-12
2007-11-02
sjurum
No

(sorry for the long post)

When developing spelling dictionaries intended for use in a graphical environment such as OOo, it would be very helpful to be able to spell check texts from the command line in exactly the same way as that environment does. So far I have found no such option.

Such an option would influence several aspects of hunspell, at least the following:
- tokenisation
- suggestions
- suggestion order

Examples:

Tokenisation
------------

Input: www.infonuorra.no

OOo tokenisation: one string - "www.infonuorra.no"

hunspell -a tokenisation: three strings: "www", "infonuorra", "no"

Wanted: a tokenisation behaviour that replicates the one in OOo as closely as possible, so that the same input gives the same tokens in both.
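
To make the difference concrete, here is a minimal Python sketch of the two behaviours. The regular expressions and function names are my own assumptions for illustration; neither is the actual tokeniser of hunspell or OOo.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the real
# hunspell or OOo tokenisers.
LETTERS = r"[^\W\d_]+"  # one or more letters (Unicode-aware)
WORD_ONLY = re.compile(LETTERS)
DOT_AWARE = re.compile(LETTERS + r"(?:\." + LETTERS + r")+|" + LETTERS)


def tokenise_split(text):
    """Split at every non-letter, like the `hunspell -a` result above."""
    return WORD_ONLY.findall(text)


def tokenise_keep_dotted(text):
    """Keep dot-joined strings such as host names as a single token."""
    return DOT_AWARE.findall(text)


print(tokenise_split("www.infonuorra.no"))        # ['www', 'infonuorra', 'no']
print(tokenise_keep_dotted("www.infonuorra.no"))  # ['www.infonuorra.no']
```

A real OOo-compatible mode would need to follow OOo's actual word-boundary rules; the sketch only shows why the same input can give one token or three.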

Suggestions
-----------

Input: Mikkel

OOo suggestions: Mielkke, Baikke, dárkkel, Råhkkel, Fuoikke

hunspell -a suggestions: Mielkke, Mikrof, Baikke, Mierkká

Wanted: a command-line option that produces exactly the same set of suggestions as the library version/OOo version.
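
As a rough illustration of how far apart the two lists above are, the following Python sketch compares them. The function name `compare` is hypothetical; the two lists are copied from this report.

```python
# Suggestion lists for "Mikkel", copied from the report above.
ooo = ["Mielkke", "Baikke", "dárkkel", "Råhkkel", "Fuoikke"]
cli = ["Mielkke", "Mikrof", "Baikke", "Mierkká"]


def compare(a, b):
    """Return shared items, items unique to each list, and whether the
    shared items appear in the same relative order in both lists."""
    shared = [s for s in a if s in b]
    only_a = [s for s in a if s not in b]
    only_b = [s for s in b if s not in a]
    same_order = shared == [s for s in b if s in a]
    return shared, only_a, only_b, same_order


shared, only_ooo, only_cli, same_order = compare(ooo, cli)
print(shared)      # ['Mielkke', 'Baikke']
print(only_ooo)    # ['dárkkel', 'Råhkkel', 'Fuoikke']
print(only_cli)    # ['Mikrof', 'Mierkká']
print(same_order)  # True
```

Only two of the suggestions are shared here, and those two happen to come in the same relative order.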

Suggestion order
----------------

Because the suggestions given by OOo/hunspell and those given by command-line hunspell differ so much in many cases, it is hard to find examples of differences in order (and there might not be any among the suggestions that are identical). But to be able to test the quality of the suggestions (e.g. whether the expected suggestion is among those given, and at which position), it is important that the command-line version of hunspell can produce suggestions in exactly the same order as they are given to OOo and other clients.

To see what such testing can provide in terms of quality measurements and statistics, please have a look at:

http://www.divvu.no/doc/proof/spelling/testing/regression-pl-forrest-smj-20071031.html

(the site is down from time to time - if so, retry in a while)
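
The kind of check meant here (is the expected suggestion present, and at which position) can be sketched in a few lines of Python. This assumes the ispell-style pipe output of `hunspell -a`, where a misspelled word with suggestions is reported on a line starting with `&`. The sample line is constructed from the suggestions listed above (its count and offset fields are assumed), and treating "Baikke" as the expected correction is purely for illustration. The function names are hypothetical.

```python
def parse_suggestions(line):
    """Return (word, suggestions) for an '&' line, or None for any other line."""
    if not line.startswith("& "):
        return None
    head, _, tail = line.partition(": ")
    word = head.split()[1]
    return word, [s.strip() for s in tail.split(",") if s.strip()]


def rank(expected, suggestions):
    """1-based position of `expected` in `suggestions`, or None if absent."""
    try:
        return suggestions.index(expected) + 1
    except ValueError:
        return None


# Constructed sample line, not captured output.
line = "& Mikkel 4 0: Mielkke, Mikrof, Baikke, Mierkká"
word, suggestions = parse_suggestions(line)
print(word, rank("Baikke", suggestions))  # Mikkel 3
print(rank("dárkkel", suggestions))       # None
```

Such ranks are only meaningful for the graphical clients if the command-line tool returns the same suggestions in the same order, which is the point of this request.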

Environment:
OOo: 2.3/MacOS/X11
hunspell: 1.1.12
hunspell dic+aff files: http://divvun.no/static_files/hunspell-sme-smj-30-10-2007.tar.bz2

I used only the smj files, renamed as et_EE (Estonian) in OOo (there is not yet built-in support for smj in OOo; a request for it has been submitted).

smj = Lule Sámi

Discussion

    • assigned_to: nobody --> nemethl
     
  • Logged In: YES
    user_id=726595
    Originator: NO

    Tokenization of paths, URLs and e-mail addresses is now similar to OpenOffice.org.
    The suggestions of OpenOffice.org 2.4 are the same as those of Hunspell 1.1.12 (because OOo 2.4 contains that version). Hunspell 1.2.2 has some nice new features for debugging and analysing dictionaries (for example, try the -m option on dictionaries with and without morphological data).