morfologik-svn Mailing List for Morfologik
Brought to you by:
dawidweiss,
milek_pl
From: <daw...@us...> - 2009-03-27 08:15:07
Revision: 154
          http://morfologik.svn.sourceforge.net/morfologik/?rev=154&view=rev
Author:   dawidweiss
Date:     2009-03-27 08:14:42 +0000 (Fri, 27 Mar 2009)

Log Message:
-----------
Added FSA version that works for me.

Added Paths:
-----------
    fsa/
    fsa/CHANGES
    fsa/INSTALL
    fsa/Makefile
    fsa/README
    fsa/TROUBLESHOOTING
    fsa/Times
    fsa/accent.cc
    fsa/accent.h
    fsa/accent_main.cc
    fsa/build_fsa.cc
    fsa/build_fsa.h
    fsa/builds_fsa.cc
    fsa/buildu_fsa.cc
    fsa/chkmorph.pl
    fsa/common.cc
    fsa/common.h
    fsa/compile_options.h
    fsa/de.acc
    fsa/de.lang
    fsa/de_morph_data.awk
    fsa/de_morph_infix.awk
    fsa/deguess.awk
    fsa/deguess.pl
    fsa/demorph.awk
    fsa/demorph.pl
    fsa/dump.cc
    fsa/filesel.tcl
    fsa/filesel.tcl.in
    fsa/find_irregular.awk
    fsa/find_irregular.pl
    fsa/fr.acc
    fsa/fr.lang
    fsa/fsa.h
    fsa/fsa_accent.1
    fsa/fsa_accent.exe
    fsa/fsa_build.1
    fsa/fsa_build.exe
    fsa/fsa_dump
    fsa/fsa_guess.1
    fsa/fsa_guess.5
    fsa/fsa_guess.exe
    fsa/fsa_hash.1
    fsa/fsa_hash.exe
    fsa/fsa_morph.1
    fsa/fsa_morph.5
    fsa/fsa_morph.exe
    fsa/fsa_prefix.1
    fsa/fsa_prefix.exe
    fsa/fsa_spell.1
    fsa/fsa_spell.exe
    fsa/fsa_ubuild.1
    fsa/fsa_ubuild.exe
    fsa/fsa_version.h
    fsa/fsa_visual.1
    fsa/fsa_visual.exe
    fsa/gendata.pl
    fsa/guess.cc
    fsa/guess.h
    fsa/guess_main.cc
    fsa/hash.cc
    fsa/hash.h
    fsa/hash_main.cc
    fsa/ie1
    fsa/jaccent-skeleton
    fsa/jguess-skeleton
    fsa/jmorph-skeleton
    fsa/jspell-skeleton
    fsa/jspell.el
    fsa/mkindex.cc
    fsa/mmorph23c.awk
    fsa/mmorph23c.pl
    fsa/morph.cc
    fsa/morph.h
    fsa/morph_data.awk
    fsa/morph_data.pl
    fsa/morph_infix.awk
    fsa/morph_infix.pl
    fsa/morph_main.cc
    fsa/morph_prefix.awk
    fsa/morph_prefix.pl
    fsa/nindex.cc
    fsa/nindex.h
    fsa/nnode.cc
    fsa/nnode.h
    fsa/nstr.cc
    fsa/nstr.h
    fsa/one_word_io.cc
    fsa/out
    fsa/pl.acc
    fsa/pl.chcl
    fsa/pl.lang
    fsa/prefix.cc
    fsa/prefix.h
    fsa/prefix_main.cc
    fsa/prep_atg.awk
    fsa/prep_atg.pl
    fsa/prep_ati.awk
    fsa/prep_ati.pl
    fsa/prep_atl.awk
    fsa/prep_atl.pl
    fsa/prep_atp.awk
    fsa/prep_atp.pl
    fsa/putinplace.pl
    fsa/simplify.pl
    fsa/snode.cc
    fsa/sortatt.pl
    fsa/sortondesc.pl
    fsa/spell.cc
    fsa/spell.h
    fsa/spell_main.cc
    fsa/tclmacq-help.txt
    fsa/tclmacq-lang.txt
    fsa/tclmacq.tcl
    fsa/tclmacq.tcl.in
    fsa/text_io.cc
    fsa/unode.cc
    fsa/unode.h
    fsa/visual_main.cc
    fsa/visualize.cc
    fsa/visualize.h

Property Changed:
----------------
    /

Property changes on:
___________________________________________________________________
Modified: svn:ignore
   - fsa*
   +

Added: fsa/CHANGES
===================================================================
--- fsa/CHANGES (rev 0)
+++ fsa/CHANGES 2009-03-27 08:14:42 UTC (rev 154)
@@ -0,0 +1,339 @@
+Version 0.5:
+- Word length in programs using automata increased to 120.
+- Option `clean' provided in Makefile.
+- Option `-v' provided in all programs (gives version details).
+- Sorting of arcs on frequency in optimization phase of automaton creation.
+- Merging two nodes that share the same arc.
+- This file added.
+Version 0.6:
+- Option -v corrected.
+- fr.acc file added to distribution.
+- man pages provided.
+- Compilation options shown in -v in all programs.
+- Option -X provided in fsa_build (makes an index a tergo for word category
+  guessing).
+- New program - fsa_guess - added; it predicts word categories based on
+  word endings.
+- New program - fsa_hash - added; it is used for perfect hashing.
+- Option -i added to programs using automata; it specifies input files.
+- Option -l added to programs using automata; it provides information
+  on language specific features, such as which characters form words,
+  and on case conversions.
+- New module - text_io - provided that processes text files (many words
+  in line, punctuation, etc.), and gives grep-like output.
+Version 0.7:
+- In one_word_io, replacements are now separated by a comma and a space
+  (was: space only); this makes it possible to have a two-word
+  replacement for one word - in other words: now run-on words can be
+  corrected.
+- New compile option RUNON_WORDS added; if turned on, fsa_spell checks
+  for run-on words, i.e. it checks whether inserting a space somewhere
+  inside the word results in two correct words.
+- New compile option CHCLASS added; if turned on, a dedicated file
+  specifies equivalent sequences of characters, so that e.g. `rz' and
+  `z' with a dot above (\.z in TeX) may be only one edit distance unit
+  apart from each other.
+- Emacs interface for spelling correction added; it is an adaptation of
+  ispell.el.
+Version 0.8:
+- New program fsa_morph performs morphological analysis (but not generation).
+- Improved INSTALL guidelines.
+- README more up to date, obsolete data removed, better file list.
+- fsa_guess now guesses lexemes as well (with GUESS_LEXEMES).
+- awk scripts for data preparation.
+Version 0.9:
+- Corrected a bug that caused segmentation violations when using
+  dictionaries of different sizes, and thus prevented users from using
+  personal dictionaries.
+- fsa_guess now recognizes prefixes with GUESS_PREFIX option.
+- New options -g and -p for fsa_guess to simulate compile options.
+- Words and lines can now be of arbitrary length.
+- Binary search in leaf vectors of the register - this does speed up
+  processing considerably.
+- New compile option for creating an index a tergo: GENERALIZE; it gives
+  smallest automata sizes.
+- New compile option STATISTICS prints... wait for it... some statistics
+  in fsa_build.
+Version 0.10:
+- Corrected a bug in fsa_build that showed up when using PRUNE_ARCS options
+  while compiling an index a tergo.
+- Corrected a bug in fsa_guess that prevented the proper use of -g option.
+  Now -g and -p are independent.
+- Introduced a limit on the number of analyses in fsa_guess.
+- Introduced a limit on the depth of search for suffixes.
+- Corrected a bug in fsa_build man page.
+- Changed definitions of node and arc_node classes, so that the automaton
+  requires less memory than before (by a quarter).
+Version 0.11:
+- Corrected a bug in statistics.
+- Option -r added to the function usage() in fsa_spell.
+- Removed random inline in fsa.h.
+- Updated #ifdefs so that all #ifdef NUMBERS are enclosed in #ifdef FLEXIBLE.
+- Updated Makefile so that it contains description of NUMBERS
+- Corrected a bug in fsa_build that appeared while reading long input lines
+- Updated description of -v option for all programs
+- Corrected the effect of GENERALIZE option
+- Introduced -m option in fsa_guess (prediction of mmorph descriptions
+  of words based on inflected forms). mmorph is a morphology program
+  available from ISSCO, Geneva, http://www.issco.unige.ch/
+  or http://issco-www.unige.ch/
+- fsa_build is now faster.
+- Corrected a bug in PRUNE_ARCS option application.
+Version 0.12:
+- Added a new program: fsa_ubuild.
+Version 0.13:
+- Corrected a bug in fsa_ubuild that excluded some words from the automaton;
+  the bug was in the function already_there().
+- Added new program: fsa_visual.
+- Added an entry for version 0.12 in this file.
+Version 0.14:
+- Corrected a bug in Makefile (introduced in 0.12) - there was no rule
+  for making buildu_fsa.o.
+- Changed declarations in fsa.h to simplify their use.
+- Added perl scripts (awk scripts translated with a2p) for portability.
+- Corrected a bug in fsa_hash: -N did not work correctly.
+- fsa_visual uses manhattan edges.
+- Introduced a new compile option STOPBIT that changes the format
+  version, and makes automata smaller (by nearly 20% for large
+  automata).
+- Included more information on data preparation in README, and on
+  compile options in INSTALL.
+- Compiled the package on Solaris using g++ 2.6.0 to improve
+  portability (thanks Sabine).
+Version 0.15:
+- Corrected a bug in list.empty_list - a memory leak that could be a nuisance
+  with fsa_prefix operating on large data.
+- Corrected perl scripts.
+- Added new script: morph_infix.{awk,pl}. It prepares data for an automaton
+  to be used with fsa_morph for languages that have prefixes and infixes
+  (like German).
+- Added new compile option: MORPH_INFIX, and two new runtime options for
+  fsa_morph: -I and -P. They make it possible to use data prepared with
+  morph_infix.{awk,pl}.
+- Added new compile option: POOR_MORPH that enables -A option in
+  fsa_morph. That option enables morphological analysis giving only
+  categories, and no base form.
+- Added new script: morph_prefix.{awk,pl}. It prepares data for an automaton
+  to be used with fsa_morph for languages that have prefixes (like Polish).
+Version 0.16:
+- Corrected a memory leak bug in fsa_morph. Now fsa_morph works two orders
+  of magnitude faster.
+- Corrected manual pages (format errors).
+Version 0.17:
+- Added new compile option for fsa_build and fsa_ubuild:
+  DESCENDING. If on, makes resulting automata smaller, but slower.
+- Improved morph_infix.{awk,pl}.
+- New option -F added to fsa_build and fsa_ubuild. It sets the filler
+  character.
+- New scripts added: prep_ati.{awk,pl}. They prepare data with coded infixes
+  and prefixes for guessing lexemes and categories using fsa_guess.
+- New scripts added: prep_atp.{awk,pl}. They prepare data with coded prefixes
+  for guessing lexemes and categories using fsa_guess.
+- fsa_hash now works correctly with the STOPBIT option.
+- corrected another bug in fsa_hash, which probably lingered there
+  from the beginning, and which made fsa_hash unusable for more than
+  256 words.
+Version 0.18:
+- Added new compile option MORE_COMPR that tries to get more compression
+  when using fsa_build or fsa_ubuild compiled with NEXTBIT.
+Version 0.19:
+- Added new compile option TAILS that enables compression of tails
+  (last transitions) of states.
+- Now MORE_COMPR also tries to squeeze some bytes without NEXTBIT.
+- Corrected a bug in Makefile introduced in 0.18 (one comment too
+  many).
+- Enriched documentation on options in INSTALL.
+- Corrected a bug in fsa_visual that showed up with variable size
+  arcs, i.e. NEXTBIT or TAILS.
+- Added a check on whether -O option should be used in fsa_visual and
+  supply it when necessary even when the user doesn't do that.
+- Added LOOSING_RPM compile option to circumvent a bug in g++ or
+  stdlibc++ found in new rpms (I have to use SuSE now, and I got
+  reports of the same bugs appearing on Red Hat, but no problems on
+  Debian). This does not solve all the problems - if they appear,
+  switch optimization off (remove -O2 from compile options).
+- Added a small program fsa_dump. It is not in Makefile, as it is not
+  tested yet. The source is in dump.cc. The program lists the contents
+  of an automaton as transitions.
+- Added scripts: de_morph_data.{awk,pl} and de_morph_infix.{awk,pl}
+  that produce the 3 column format out of data for fsa_build.
+- Added scripts: demorph.{awk,pl} that produce the 3 column format from
+  the output of fsa_morph.
+Version 0.20
+- Moved mark_inner() to nnode.cc, as it can be used without A_TERGO option.
+- Added information on producing the contents of an automaton.
+- Fixed display of statistics for NEXTBIT and TAILS
+- Corrected placement of conditionals so that compilation without
+  FLEXIBLE is possible. But do use FLEXIBLE!
+- Added a Tcl script - an interface for fsa_guess as a tool for
+  acquisition of descriptions for a morphological dictionary.
+- Added additional information to -v option of all programs.
+- MORE_COMPR is now *much* faster; actually, it became usable.
+- Added a new perl script chkmorph.pl that removes those predictions
+  made by fsa_guess that cannot produce the required flectional form.
+- Added sortatt.pl perl script that sorts words on their
+  categories/features; it is used by the tcl/tk interface, and it is
+  especially useful when comparing output of two descriptions.
+- added gendata.pl - a perl script that generates data for guessing
+  morphological descriptions in mmorph format of unknown words.
+Version 0.21
+- Corrected some bugs in gendata.pl.
+- Added new compile option - WEIGHTED.
+- Corrected a bug in chkmorph.pl.
+- Corrected a bug in fsa_ubuild (thanks to Christen Blom - Dahl)
+- Totally rewritten GENERALIZE. I hope it provides better results.
+- Added new script sortondesc.pl that sorts morphological descriptions
+  of words so that the most probable come first. A description is
+  judged to be more probable when it appears in more words.
+- Corrected a horrible bug in fsa_spell that manifested itself when
+  the edit distance was set to 0. Program gave arbitrary results.
+- Tcl/Tk interface for lexical acquisition is now much more powerful.
+- Added a new script putinplace.pl that should put descriptions chosen
+  with the Tcl/Tk tool in their appropriate places.
+Version 0.22
+- Corrected conditional compilation so that it is now possible to
+  compile without MORE_COMPR.
+- Added guided correction (right mouse button on description) to the
+  dictionary acquisition tool. The interface is improved.
+- Added statistics to the dictionary acquisition interface.
+Version 0.23
+- In the Tcl/Tk tool, corrected output from mmorph matching so that if
+  all values of a feature are generated, nothing comes out, and when
+  no features are generated, the feature name is deleted from the
+  output.
+- In the Tcl/Tk tool, corrected deleting features using the right mouse
+  button menu.
+- Corrected the script chkmorph.pl so that no phony item appears at
+  the end (there is no dangling comma at the end).
+- Added a new option to ignore the filler character in morphology.
+- Corrected building a weighted guessing automaton. It still needs my
+  attention.
+Version 0.24
+- Corrected dropping one hypothesis in sortondesc.pl script.
+- Corrected a bug in fsa_build that made pointer size calculation invalid
+  (thanks to Gertjan van Noord).
+- Corrected a bug in fsa_spell for distances greater than 1 (thanks to
+  Jiri Andel).
+Version 0.25
+- Included perl and tcl scripts in installation in Makefile.
+- Corrected a bug in fsa_hash: null pointers were followed in word->number
+  conversion (thanks to Martin Povolny).
+Version 0.26
+- Included perl and tcl scripts deleted by mistake from 0.25.
+- Corrected Makefile so that it does not delete perl and tcl scripts
+  in make realclean.
+Version 0.27
+- Corrected a bug in Undo operation in Tcl/Tk interface.
+- Moved customization of tclmacq to Makefile.
+- Adapted tclmacq to new version of Tcl/Tk.
+Version 0.28
+- Corrected a bug in tclmacq (Tcl/Tk interface for dictionary
+  acquisition). Sorting was done before (and not after) expansion of
+  alternatives, which resulted in apparently random order.
+- Added some include directives needed in the most recent compilers
+  (thanks to Dawid Weiss).
+- Corrected setting the FILLER character in builds_fsa.cc (thanks to
+  Dawid Weiss).
+- Corrected usage info for dump.cc (thanks to Dawid Weiss).
+Version 0.29
+- Corrected a bug in simplify.pl (it produced duplicates).
+Version 0.30
+- Corrected a bug in fsa_morph. When one entry was a prefix of another
+  entry, the words were the same, but one annotation was shorter than
+  the other one, the longer entry was not printed (thanks to Gertjan
+  van Noord).
+Version 0.31
+- Corrected the use of one variable so that the package compiles with
+  the old set of options (thanks to Michael Daum).
+Version 0.32
+- The package now compiles under g++ 3.1.1.
+Version 0.33
+- jguess is again produced (thanks to Leonoor van der Beek)
+- Corrected fsa_hash so that words not in the dictionary return -1
+  and not a slash (thanks to Vinay Middha).
+- Added a file TROUBLESHOOTING describing the most common problems people have
+  while trying to install and use the package. As a bonus, I included some
+  solutions as well.
+- Added possibility of morphological analysis of words without tags, i.e.
+  stemming or lemmatization (thanks to Gertjan van Noord). Just remove
+  the last annotation separator (+) and anything that follows it from
+  the output of a script preparing morphological data.
+Version 0.34
+- States can have up to 255 (was: 127) outgoing transitions when
+  compiled with STOPBIT (thanks to Gertjan van Noord).
+- Closed memory leaks in handling of lists (thanks to Martin Povolny).
+Version 0.35
+- Corrected a bug introduced in the previous version (deleting the
+  wrong thing).
+Version 0.36
+- Corrected a bug in dynamic growth of strings read from input in
+  programs that use automata, i.e. not fsa_build nor fsa_ubuild
+  (thanks to Gertjan van Noord).
+Version 0.37
+- Replaced recursion with iteration in some programs, e.g. fsa_hash.
+  fsa_hash is now about 3.5 percent faster.
+Version 0.38
+- Introduced a "-a" runtime option to list the contents of the whole
+  dictionary. The updated glibc++ version I have now treats reading an
+  empty line as an error, so there is no way to learn if an empty line
+  was indeed read.
+- Introduced a new compile option DUMP_ALL to suppress printing the
+  leading space in fsa_prefix.
+- Corrected some type errors and vestiges of previous versions when
+  using DEBUG compile option (thanks to Nikolay Ketsaris).
+- Corrected dump.cc to print non-ASCII characters.
+Version 0.39
+- fsa_spell now compiles also without CHCLASS (thanks to Nikolay Ketsaris).
+- Added ios::binary in 3 places for the benefit of those who have the
+  misfortune of being forced to use the virus distribution system from
+  M$.
+- Corrected exit code in fsa_prefix when -a is used (thanks to Marcin
+  Miłkowski)
+- Corrected a bug in fsa_build and fsa_ubuild when -O was
+  used. Certain states were compressed "too much", i.e. comparison of
+  transitions did not work in part_cmp_nodes due to a modification
+  introduced several versions ago.
+- Changed the script ie1 to make it immediately useful for debugging
+  should anything unpredictable happen.
+- Removed the outrageously outdated file ToDo.
+Version 0.40
+- Corrected a bug in the initialization of the H_matrix in fsa_spell
+  (thanks to Guillaume Rousse).
+- Corrected a bug in ie1 (return value for fgrep)
+- Changed the way parameters are passed to most functions in programs
+  that use automata (passing first arc instead of the parent arc).
+  This might have introduced some new errors...
+- Added a parameter to fsa_spell to force the search for replacements
+  (thanks to Guillaume Rousse).
+- Added two new compile options. The first one -- SPARSE -- changes
+  the way the automaton is represented. If the option is used, then
+  most of transitions of the automaton are stored as a sparse
+  matrix. Only annotations (e.g. in morphological dictionaries) are
+  still stored as lists of transitions. The new representation is
+  faster for most tasks, but it takes longer to produce, and it is
+  larger. The option SLOW_SPARSE makes sure that we try to fill in
+  every hole in the sparse matrix, but it results in *VERY* slow
+  construction, and the results are practically the same.
+Version 0.41
+- Corrected a bug in fsa_ubuild that caused the FILLER not to be set (thanks
+  to Marco Baroni)
+Version 0.42
+- Corrected a compile error in mmorph.cc when MORPH_INFIX was undefined.
+- Corrected an error in fsa_prefix that gave infinite loops while
+  listing words with certain prefixes.
+- Corrected a bug that resulted from new glibc++ I/O behaviour (thanks
+  to Gertjan van Noord)
+- Changed the licence so that the package is freer than it used to be.
+Version 0.43
+- Corrected a bug in fsa_morph that never got a chance to manifest
+  itself there because of the way C++ initializes variables, but it
+  was a bug anyway (thanks to Jirka Mikulasek).
+Version 0.44
+- Corrected a bug in fsa_morph that was introduced in version 0.40 and
+  resulted in inability to process infixes (thanks to Marcin Milkowski).
+- Corrected a bug in fsa_guess that was introduced in version 0.40 and
+  resulted in inability to process infixes (thanks to Marcin Milkowski).
+Version 0.45
+- Corrected a bug in counting transitions with the next flag set that
+  resulted in incorrect pointer size in fsa_build/fsa_ubuild (thanks
+  to Marcin Milkowski and Dawid Weiss).

Added: fsa/INSTALL
===================================================================
--- fsa/INSTALL (rev 0)
+++ fsa/INSTALL 2009-03-27 08:14:42 UTC (rev 154)
@@ -0,0 +1,810 @@
+1. COMPILATION
+
+1.1. General
+
+  All programs are written in C++. You need a C++ compiler to compile them.
+  I have used GNU g++ 2.6.0 under SunOS 4.1.4, and later under
+  Solaris. This version was compiled with g++ 2.7.2.1. Previous versions
+  may have problems with templates. I had problems compiling this
+  version with Solaris CC - again, templates were to blame.
+
+  If you work under Unix, and you have a g++ compiler, a simple command:
+
+    make
+
+  should work. If you use a different compiler, append CXX=that_compiler to
+  the command line, e.g.:
+
+    make CXX=CC
+
+  If you use another operating system, and a different compiler, you should
+  have manuals for them. Consult them. Under the infamous so-called
+  operating system from Microsoft, you should consider adding
+  ios::binary to the declaration of fstream dict(...) in file common.cc.
+
+  Note that emacs lisp package works with emacs 19.34, and it will
+  almost certainly not work with emacs 20.
+
+1.2. Compile options
+
+  Before you jump to experiment with various options, or jump out
+  of the window on seeing how many options there are, please note
+  that a default set of them is provided in the Makefile, so do not worry.
+
+  Please note that you can see what options were used for compiling
+  a particular program by invoking it with -v option. When you change
+  compile options and recompile the programs, please do "make clean" first
+  - it may save you a lot of trouble.
+
+  There are some compile options that may be worth trying. First,
+  normal optimization for speed (done by the compiler):
+
+    CPPFLAGS=-O
+  or
+    CPPFLAGS=-O2
+
+  Then there are options used for conditional compilation of the source
+  code. They are specified in CFLAGS with -D, i.e. use e.g.
+
+    make CPPFLAGS=-DJOIN_PAIRS
+
+  to compile the programs with JOIN_PAIRS option on. To specify more
+  options, put them into quotes, e.g.:
+
+    make CPPFLAGS='-DA_TERGO -DSORT_ON_FREQ'
+
+  Be careful when specifying MORE_COMPR. The construction time may
+  rise dramatically when you use -O run-time option of fsa_build or
+  fsa_ubuild. That time is spent not on construction itself, but
+  rather on reordering the arcs, and trying to match them.
+
+  In the following descriptions, the following fields are used:
+    Assumes: those options must be defined.
+    Excludes: those options cannot be defined.
+    Used in: this option is given to programs in the list.
+    Affects: this option changes the output of programs in the list.
+
+1.2.1. Options changing format version number
+
+  In the present version, there are 6 numbered format versions: 0, 1, 2,
+  4, 5, and 128 (or -128, or 0x80). For differences between these formats
+  see file fsa.h. The formats correspond to different settings of the
+  following compile options: LARGE_DICTIONARIES, FLEXIBLE, STOPBIT, NEXTBIT,
+  TAILS, SPARSE. In the following table, LARGE_DICTIONARIES appears as L_D.
+
+        L_D  FLEXIBLE  STOPBIT  NEXTBIT  TAILS  WEIGHTED  SPARSE
+    0:  -    -         -        -        -      -         -
+    1:  -    +         -        -        -      -         -
+    2:  -    +         -        +        -      -         -
+    4:  -    +         +        -        -      -         -
+    5:  -    +         +        +        -      -         -
+    6:  -    +         +        -        +      -         -
+    7:  -    +         +        +        +      -         -
+    8:  -    +         +        +        -      +         -
+    9:  -    +         +        -        -      -         +
+   10:  -    +         +        +        -      -         +
+   11:  -    +         +        -        +      -         +
+   12:  -    +         +        +        +      -         +
+  128:  +    -         -        -        -      -         -
+
+  Note that in order to produce an automaton in format version 8, -W
+  runtime option must be given to fsa_build or fsa_ubuild. Otherwise
+  version 5 will be produced.
+
+
+  FLEXIBLE
+    makes it possible to produce dictionaries (automata) tailored to
+    particular needs. The size of arcs is determined dynamically. This
+    should be on, as the old way gives (usually) bigger
+    dictionaries. This option also makes the automata portable - another
+    reason for using it. I may remove inflexible code from the future
+    versions of this package.
+    Assumes: no options.
+    Excludes: LARGE_DICTIONARIES.
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild.
+    When to use: always.
+
+  LARGE_DICTIONARIES
+    This is an old option to be used without FLEXIBLE when the automaton
+    gets too big. Note that FLEXIBLE makes it possible to produce
+    dictionaries of any size while making them as small as possible, so
+    you do not need this LARGE_DICTIONARIES. I am not sure whether it
+    still works.
+    Assumes: no options.
+    Excludes: FLEXIBLE, STOPBIT, NEXTBIT, NUMBERS.
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild.
+    When to use: never.
+
+  NEXTBIT
+    introduces a 1b flag that is set when the target of the arc
+    is placed right after the current one in the automaton, and cleared
+    otherwise. In case the flag is set,
+    the go_to field, i.e. the address of the node to which this arc
+    points, is dropped - only the (1 byte) part that contains
+    the flag is kept. This usually produces smaller automata, as there
+    are frequently chains of nodes one following another, and for the
+    arcs of those nodes it is not necessary to store the whole addresses
+    of the next nodes in those chains. However, since the nodes are no
+    longer fixed size, and we have additional 1b flag that takes place
+    in the go_to field, the size of the resulting automaton may actually
+    be higher when the additional 2-3 bytes cross the byte boundary in
+    the go_to field. Also note that in order to increase the
+    compression, the numbering scheme is different from the usual one in
+    that it starts numbering the children from the last arc. This is
+    done in order to have more nodes lying just after the arc that
+    points to them.
+    Assumes: FLEXIBLE.
+    Excludes: LARGE_DICTIONARIES, JOIN_PAIRS.
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild, fsa_prefix.
+    When to use: always.
+
+  STOPBIT
+    replaces counters that hold the number of arcs for each node with
+    one bit for each arc that says whether it is the last one in the
+    node. This gives smaller automata, although maybe a fraction of a
+    percent slower. Note that while automata produced with this option
+    are never larger than those produced without it, for some automata,
+    the size does not change. The reason is that 1-bit markers have to
+    find room in the goto bytes, and they may provoke crossing the byte
+    barrier.
+    Assumes: FLEXIBLE.
+    Excludes: LARGE_DICTIONARIES, JOIN_PAIRS.
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild, fsa_prefix.
+    When to use: always.
+
+  TAILS
+    introduces a 1b flag that is set for a particular node when the tail
+    of that node (i.e. a number of arcs that are the last arcs of the
+    node) matches the tail of another node somewhere else in the
+    automaton. If the byte is set, then the present arc is followed by
+    the address of the isomorphic tail in another node in the
+    automaton. For example, if we have node A with arcs (a, c, d) (we
+    skip the addresses, and markers of finality for brevity), and a node
+    B with arcs (b, c, d), then in node B, we can have only an arc with
+    b, and a pointer to (c, d) from A. The arc b in B has the flag
+    set. Or we can do that the other way round, i.e. node A may contain
+    only arc a with the flag set, and the node B is written in
+    whole. Note that the flag takes space in the goto field, so it may
+    lead to an increase in space. However, it should normally produce
+    smaller automata. It always leads to longer construction times.
+    Assumes: FLEXIBLE, STOPBIT.
+    Excludes: LARGE_DICTIONARIES, JOIN_PAIRS.
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild, fsa_prefix.
+    When to use: for static dictionaries after testing (for certain
+                 sizes the automata can actually be bigger).
+
+  WEIGHTED
+    introduces weights in every arc. The weights are proportional to the
+    number of strings recognized in the part of the automaton reachable
+    via that arc. Weights take only one byte, so if the number of
+    strings is too large to fit into one byte, the weights on all arcs
+    of the parent node are decreased proportionally. This option
+    requires more memory during construction process, and automata are
+    larger (they may even contain multiple copies of isomorphic nodes,
+    but with different weights). However, this option makes it possible
+    to introduce probabilities to fsa_guess.
+    Assumes: FLEXIBLE, STOPBIT, NEXTBIT, A_TERGO.
+    Excludes: LARGE_DICTIONARIES, JOIN_PAIRS.
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild, fsa_guess.
+    When to use: for adding new words to a morphological dictionary, for tagging.
+
+  SPARSE
+    introduces sparse matrix representation. If there is no annotation
+    separator in the strings, the entire automaton is stored using it
+    (except for some dummy data). If there are annotations, they are
+    stored in the traditional format (list of transitions). This option
+    gives fast recognition times, fast word to number
+    conversion (perfect hashing), but larger dictionaries, slow listing
+    of contents, slow number to word conversion, and slow search for
+    candidates in spelling correction, slow guessing, slow construction.
+    Assumes: FLEXIBLE, STOPBIT
+    Excludes: LARGE_DICTIONARIES, JOIN_PAIRS, WEIGHTED
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild, fsa_prefix.
+    When to use: When the programs should be optimized for speed rather
+                 than for size, number to word conversion speed is not
+                 critical to the system, and spelling correction is
+                 called mostly on correct words. See file `Times' for
+                 results of my experiments.
+
+1.2.2. Options changing format without changing format version number
+
+  NUMBERS
+    makes it possible to build automata that have word numbering
+    information in them, and to use them. That information is used by
+    fsa_hash. To build automata that have the numbering information in
+    them, use -N option of fsa_build. Note that when using -N, it is
+    not arcs, but bytes that are addressable, so we need 2 or usually 3
+    bits more for the goto field. This in turn may be translated into
+    increasing the arc size by one byte. Even when we have room for
+    those additional bits in the current byte frame, note that the
+    numbering information also takes up space (as many bytes as it takes to
+    number all words stored in the automaton). You cannot use
+    compression (runtime option -O) with -N.
+    Assumes: FLEXIBLE.
+    Excludes: LARGE_DICTIONARIES.
+    Used in: all programs.
+    Affects: fsa_build, fsa_ubuild, fsa_hash, fsa_prefix.
+    When to use: if you use perfect hashing.
+
+1.2.3. Options changing the size of the automaton without changing the format
+
+  DESCENDING
+    makes the resulting automaton built with -O a bit smaller, but much slower.
+    Assumes: SORT_ON_FREQ.
+    Excludes: No options.
+    Used in: fsa_build, fsa_ubuild.
+    Affects: fsa_build, fsa_ubuild, fsa_prefix, fsa_hash.
+    When to use: If you want a bit smaller but a bit slower to use automata.
+
+  JOIN_PAIRS
+    makes the resulting automaton smaller if you use fsa_build with "-O"
+    (the option of fsa_build, or fsa_ubuild, not the compiler). It works
+    by sharing one arc by two two-arc nodes, where possible.
+    Assumes: No options.
+    Excludes: STOPBIT, NEXTBIT.
+    Used in: fsa_build, fsa_ubuild.
+    Affects: fsa_build, fsa_ubuild, fsa_prefix, fsa_hash.
+    When to use: never.
+
+  MORE_COMPR
+    changes the order of arcs to get more compression. Requires more
+    memory. With -O, the execution time is much, much longer.
+    Assumes: NEXTBIT or STOPBIT.
+    Excludes: No options.
+    Used in: fsa_build, fsa_ubuild.
+ Affects: fsa_build, fsa_ubuild, fsa_prefix, fsa_hash. + When to use: for static dictionaries. + + SORT_ON_FREQ + makes the automaton smaller (independently of JOIN_PAIRS). It + works by sorting the arcs on frequency. Note that this changes the + order of words in the automaton. If DESCENDING is not set, it can make the + resulting automaton built with -O faster. + Assumes: no options. + Excludes: no options. + Used in: fsa_build, fsa_ubuild. + Affects: fsa_build, fsa_ubuild, fsa_prefix, fsa_hash. + When to use: always except for cases when you build something huge in + real time. + +1.2.4. Options affecting the way guessing automata (index a tergo) are built. + + A_TERGO + enables -X option in fsa_build. This creates an index a tergo (a + guessing automaton). + Assumes: no options. + Excludes: no options. + Used in: fsa_build, fsa_ubuild. + Affects: fsa_build, fsa_ubuild, fsa_guess. + When to use: if you use fsa_guess. + + GENERALIZE + In fsa_build called with -X option, reduces the size of the automaton + while losing the advantage of always correctly annotating words that + are already in the dictionary. This option makes the automaton + smaller than PRUNE_ARCS. + Assumes: A_TERGO. + Excludes: PRUNE_ARCS. + Used in: fsa_build, fsa_ubuild. + Affects: fsa_build, fsa_ubuild, fsa_guess. + When to use: if you use fsa_guess for adding new words to a dictionary. + + PRUNE_ARCS + launches additional pruning during guessing automaton (index a + tergo) creation. The resulting automaton will be smaller, and + predictions narrower (maybe more precise, but those less probable + may be missing). Automata produced with this option are larger than + with GENERALIZE. + Assumes: A_TERGO. + Excludes: GENERALIZE. + Used in: fsa_build, fsa_ubuild. + Affects: fsa_build, fsa_ubuild, fsa_guess. + When to use: if you use fsa_guess for tagging. + +1.2.5. 
Options affecting the way guessing automata are interpreted + + GUESS_LEXEMES + makes fsa_guess try to guess not only categories, but lexemes as + well. The data must be prepared differently (see man pages for + fsa_build and fsa_guess). Run-time option -g switches off guessing + lexemes. + Assumes: no options. + Excludes: no options. + Used in: fsa_guess. + Affects: fsa_guess. + When to use: if you use fsa_guess for more tasks than tagging. + + GUESS_MMORPH + makes it possible to use -m option in fsa_guess, i.e. prediction of + mmorph descriptions. mmorph is a morphology program developed at + ISSCO, Geneva. + Assumes: no options. + Excludes: no options. + Used in: fsa_guess. + Affects: fsa_guess. + When to use: for using fsa_guess in acquisition of new words for a + morphological dictionary. + + GUESS_PREFIX + makes fsa_guess use information about prefixes to disambiguate + morphological parses. Requires GUESS_LEXEMES. Data must be prepared + differently (see man pages for fsa_build and fsa_guess). Reduces the + size of the a tergo dictionary compared with that created to be used + with GUESS_LEXEMES only. Run-time option -p switches off the use of + prefixes in guessing. + Assumes: GUESS_LEXEMES. + Excludes: no options. + Used in: fsa_guess. + Affects: fsa_guess. + When to use: when you use fsa_guess, and the language you are + working on has prefixes or infixes. + +1.2.6. Options changing the way morphological automata are interpreted + + MORPH_INFIX + makes it possible to use -P and -I options that interpret coded + prefixes (-P), and coded prefixes and infixes (-I) in fsa_morph. + For more details, see README file, and the man page for fsa_morph(5). + Assumes: no options. + Excludes: no options. + Used in: fsa_morph. + Affects: fsa_morph. + When to use: when you use fsa_morph, and the language you are + working on has prefixes or infixes. 
+ + POOR_MORPH + makes it possible to use -A option, so that the automata can contain + only information about categories, and no information about the base + form of an inflected form. + Assumes: no options. + Excludes: no options. + Used in: fsa_morph. + Affects: fsa_morph. + When to use: if you use fsa_morph only for tagging. + +1.2.7. Various options. + + CASECONV + works with fsa_spell. It makes it possible to check capitalized words + as if they were all lowercase. + Assumes: no options. + Excludes: no options. + Used in: fsa_accent, fsa_morph, fsa_spell. + Affects: fsa_accent, fsa_morph, fsa_spell. + When to use: when case conversion is needed. + + CHCLASS + makes it possible to treat certain two-letter sequences in certain + context as if they were single letters. This is useful in + spelling. E.g. in Polish, `rz' and `z' with a dot above (\.z in TeX) + are pronounced in exactly the same way, so they may be confused. This + option makes it possible to treat such replacements as if they were + one edit distance unit apart from each other. This option is used in + fsa_spell. + Assumes: no options. + Excludes: no options. + Used in: fsa_spell. + Affects: fsa_spell. + When to use: for spelling correction in languages for which edit + distance one is not sufficient. + + DEBUG + If you have a few spare months, you can compile the programs with + CFLAGS=-DDEBUG. That will give huge amounts of information about program + internals during execution time. It may also give compile errors. In + debugging the program, I just comment out particular ifdefs. + Assumes: no options. + Excludes: no options. + Used in: all programs. + Affects: all programs. + When to use: never. + + DUMP_ALL + works with fsa_prefix. If you compile the program with this option, + no space will be prepended to listed entries. In particular, this + can list the contents of the dictionary without the need to remove + the leading space. Use -a run-time option to list the contents. 
+ Assumes: no options. + Excludes: no options. + Used in: fsa_prefix. + Affects: fsa_prefix. + When to use: to list the contents of a dictionary. + + LOOSING_RPM + makes it possible to use the programs even on linux distributions + using rpms. The libstdc++ distributed with RedHat and SuSE has + broken I/O. You probably do not need to use that option with more stable + distributions. This option does not fix the -O2 problem, + however. You will still have to use -O only. + Assumes: no options. + Excludes: no options. + Used in: all programs. + Affects: all programs. + When to use: with corrupted versions of libg++, e.g. Red Hat and SuSE. + + PROGRESS + In fsa_build, shows how many lines have been read so far, and what is + being done at the moment, i.e. what phase the processing is in. + Assumes: no options. + Excludes: no options. + Used in: fsa_build, fsa_ubuild. + Affects: fsa_build, fsa_ubuild. + When to use: when you build something huge and you are not sure if + it works. + + RUNON_WORDS + makes it possible to check whether inserting a space inside the + checked word produces two correct words. This works with fsa_spell. + Assumes: no options. + Excludes: no options. + Used in: fsa_spell. + Affects: fsa_spell. + When to use: for spellchecking. + + SHOW_FILLERS + enables printing of filler characters by fsa_prefix (they are normally + not printed). + Assumes: no options. + Excludes: no options. + Used in: fsa_prefix. + Affects: fsa_prefix. + When to use: for diagnostics. + + SLOW_SPARSE + checks for every hole in a sparse matrix whether it can still be filled, + which could lead to smaller automata. This slows down the construction + process for large automata by orders of magnitude. + Assumes: FLEXIBLE, STOPBIT, SPARSE. + Excludes: WEIGHTED, LARGE_DICTIONARIES. + Used in: fsa_build, fsa_ubuild. + Affects: fsa_build, fsa_ubuild. + When to use: If you think you waste too many transitions in a sparse + matrix in small automata. 
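The Assumes/Excludes lines of each entry describe constraints that could, in principle, be checked at compile time. The following is only a minimal sketch under stated assumptions, not code from the fsa package: the helper name `slow_sparse_options_ok` and its boolean parameters are hypothetical, and it encodes only the SLOW_SPARSE entry (assumes FLEXIBLE, STOPBIT, SPARSE; excludes WEIGHTED, LARGE_DICTIONARIES).

```cpp
#include <cassert>  // assert() is only needed for the runtime check in the note below

// Hypothetical helper, not part of the fsa package. It encodes only the
// SLOW_SPARSE entry: assumes FLEXIBLE, STOPBIT, SPARSE; excludes WEIGHTED,
// LARGE_DICTIONARIES. Each parameter says whether that option is defined.
constexpr bool slow_sparse_options_ok(bool slow_sparse, bool flexible,
                                      bool stopbit, bool sparse,
                                      bool weighted, bool large_dictionaries) {
  // If SLOW_SPARSE is off there is nothing to check; otherwise every assumed
  // option must be on and every excluded option must be off.
  return !slow_sparse ||
         (flexible && stopbit && sparse && !weighted && !large_dictionaries);
}

// Compile-time examples of a consistent and an inconsistent combination.
static_assert(slow_sparse_options_ok(true, true, true, true, false, false),
              "SLOW_SPARSE with FLEXIBLE, STOPBIT and SPARSE is consistent");
static_assert(!slow_sparse_options_ok(true, true, true, true, true, false),
              "SLOW_SPARSE together with WEIGHTED must be rejected");
```

A program could also call `assert(slow_sparse_options_ok(...))` once at startup, passing the options it was actually built with; the same pattern would apply to any other entry by listing its own Assumes and Excludes options.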
+ + STATISTICS + In fsa_build, shows some statistics on the resulting automaton: the + number of states, transitions, etc. + Assumes: no option. + Excludes: no options. + Used in: fsa_build, fsa_ubuild. + Affects: fsa_build, fsa_ubuild. + When to use: when you are interested in properties of automata. + +2. CONSTANTS + + Max_word_len + Defined in: common.h + Default value: 120. + Affects: All programs except fsa_build and fsa_ubuild. + Description: + Restrictions: Must be positive. + + LIST_INIT_SIZE + Defined in: common.h + Default value: 16. + Affects: All programs except fsa_build and fsa_ubuild. + Description: Initial size of a list, e.g. list of replacements, list + of dictionary names etc. The bigger, the faster. + Restrictions: Must be positive. + + LIST_STEP_SIZE + Defined in: common.h + Default value: 8. + Affects: All programs except fsa_build and fsa_ubuild. + Description: If a list grows beyond LIST_INIT_SIZE, its size is + increased by this value. The bigger, the faster. + Restrictions: Must be positive. + + MAX_ARCS_PER_NODE + Defined in: fsa.h + Default value: 255 or 128, depending on compile options. + Affects: All programs. + Description: Maximal number of outgoing transitions per state. Do + not change. + Restrictions: Depends on the structure of states and transitions. Do + not change. + + MAX_NOT_CYCLE + Defined in: common.h + Default value: 1024. + Affects: + Description: Maximal length of a string in the automaton. It is used + to detect errors. + Restrictions: Must be positive. + + MAX_VANITY_LEVEL + Defined in: guess.h + Default value: 5. + Affects: fsa_guess. + Description: + Restrictions: + + PAIR_REG_LEN + Defined in: nindex.h + Default value: 32. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + MAX_SPARSE_WAIT + Defined in: nnode.h + Default value: 3. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + Max_edit_distance + Defined in: spell.h + Default value: 3. + Affects: fsa_spell. 
+ Description: + Restrictions: + + WORD_BUFFER_LENGTH + Defined in: build_fsa.cc + Default_value: 128. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + UNREDUCIBLE + Defined in: mkindex.cc + Default value: 4. + Affects: + Description: + Restrictions: + + WITH_ANNOT + Defined in: mkindex.cc + Default value: 2. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: Do not change. + + NO_ANNOT + Defined in: mkindex.cc + Default_value: 1. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: Do not change. + + NODE_TO_BE_REDUCED + Defined in: mkindex.cc + Default value: -5. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: Do not change. + + NODE_UNREDUCIBLE + Defined in: mkindex.cc + Default value: -6. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: Do not change. + + NODE_IN_TAGS + Defined in: mkindex.cc + Default value: -7. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: Do not change. + + NODE_MERGED + Defined in: mkindex.cc + Default value: -8. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: Do not change. + + NODE_TO_BE_MERGED + Defined in: mkindex.cc + Default value: -9. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: Do not change. + + MIN_PRUNE + Defined in: mkindex.cc + Default value: 2. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + MAX_DESTS + Defined in: mkindex.cc + Default value: 32. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + MIN_DESTS_MEMBERS + Defined in: mkindex.cc + Default value: 0. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + MAX_ANNOTS + Defined in: mkindex.cc + Default value: 20. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + MAX_DIFF_ANNOTS + Defined in: mkindex.cc + Default value: 20. + Affects: fsa_build and fsa_ubuild. 
+ Description: + Restrictions: + + MIN_KIDS_TO_MERGE + Defined in: mkindex.cc + Default value: 2. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + MIN_ANNOTS + Defined in: mkindex.cc + Default value: 3. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + AN_NOM + Defined in: mkindex.cc + Default value: 1. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + AN_DENOM + Defined in: mkindex.cc + Default value: 2. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + INDEX_SIZE_STEP + Defined in: nindex.cc and nnode.cc. + Default value: 16. + Affects: fsa_build and fsa_ubuild. + Description: + Restrictions: + + +3. INSTALLATION + + Copy all dictionaries you want to be installed into the source directory + of the package. Dictionaries are provided separately, so make sure you + have copied them (at least those you may need). Note that the + dictionaries on http://www.pg.gda.pl/~jandac/fsa.html have been + prepared a long time ago using only those options that were available + at that time. If you want to use them, compile the program as you + like, try to use the dictionaries, and you will probably get an + error message saying what compile options were used for compilation + of fsa_build that constructed the dictionaries. Save Makefile, + delete some options from it, make clean, make, use fsa_prefix to get + the contents, restore Makefile, make clean, make, and build the + automata again (they should be much smaller with the default set of + options for the current version). + +3.1. Admin part + + For Polish users, you may look at pl.chcl file and uncomment some + lines, if too many users watch too much tv, and read too little. + + There are a few variables in Makefile that you can change. 
These are + PREFIXDIR - parent dir of BINDIR, MANDIR, DICTDIR (default: /usr/local); + BINDIR - where the programs should be placed (default: $PREFIXDIR/bin); + MANDIR - where the man pages should be placed (default: + $PREFIXDIR/man); + DICTDIR - where dictionaries, accent files, language files, and + character class files should be placed (default: + $PREFIXDIR/lib); + LISPDIR - where jspell.el should be placed. I think the site-lisp + directory is better than lisp directory. Check your emacs + version as it normally forms a part of that name. + The directory specified in Makefile by default will + probably not work for you; + TCLMACQDIR- where files supporting execution of the tcl/tk interface + tclmacq should go (help file and language file); + TCLMACQBINDIR + - where tcl scripts and perl scripts supporting tclmacq + should go (it should be the same as BINDIR); + PREP_FCONF- It should be set to \# for Tcl versions prior to 8.2 (I + think), and to nothing for 8.2 and higher. If man + fconfigure shows -encoding option present, then the + variable should be empty, otherwise it should be set to \#; + MANSECT - in which section of the manual the pages should be placed. + + You can specify those variables on the command line, e.g.: + + make installlisp LISPDIR=/utl/share/gnu/emacs/site-lisp + + make install - installs everything, + make installbin - installs the binaries without man pages, + make installman - installs the manpages, + make installscripts - installs interface scripts (jspell & jaccent), + make installlisp - installs jspell.el (byte compile it afterward), + make installdicts - installs dictionaries (if any), accent files, + language files, character class files. + + Note that with newer linux emacs distributions, the LISPDIR should + point to something like /etc/emacs/site-start.d, and the file names + should have a prefix `50'. 
If you put the jspell.el there, it will + be loaded automatically, so that you will not need (require 'jspell) + in your .emacs file. + +3.2. Admin or user part + + The following commands can be put either in site-start.el file in + the site-lisp directory, or in users' .emacs files. The first method + makes the package functions available to all users, and it should be + done by the administrator. The second method enables the functions + on a per-user basis. Note that the emacs functions used to work with + emacs19; they will probably (almost certainly) not work with emacs20. + + ;; make functions known to emacs + (require 'jspell) + + ;; install menus + (define-key-after + (lookup-key global-map [menu-bar edit]) + [jspell] '("Jspell" .jspell-menu-map) 'ispell) + + You may want to specify the default dictionary with e.g.: + + (setq jspell-dictionary "polski") + + You may want to compile additional dictionaries. Read the README + file and the man page for fsa_build. Remember to sort the data and + exclude duplicates (use sort -u) for fsa_build. + + New perl and tcl/tk scripts: sortatt.pl and tclmacq.tcl require + setting some variables located at the top of those files. + +3.3. Mostly user part + + The following variables can be changed by the user in their .emacs + file: + + jmorph-format - defines the format of morphotactic annotations + (tags). It is an argument to format function + (resembles C format in printf). It contains 3 + %s, corresponding to the inflected word, + lexeme, and tag. The correspondence between + those items and particular %s is given by the + variable jmorph-order. Morphotactic + annotations are added by jmorph-* functions. + Example: (setq jmorph-format "%s_%s+%s"). + + jmorph-order - defines the correspondence between the + inflected word, lexeme, and annotations and %s + in jmorph-format; in other words, it defines + the order in which they appear as %s in + jmorph-format. Example: (setq jmorph-order '(1 + 2 3)). 
+ + jspell-morph-sep - defines a separator character that separates a + lexeme from annotations in the output from + jmorph script. Example: (setq jspell-morph-sep + "&") + + jaccent-automatically - accents are restored without asking the user + for permission if there is only one choice. + Example: (setq jaccent-automatically t) + + jmorph-automatically - morphotactic annotations are added without + asking the user for permission if there is + only one choice. Example: (setq + jmorph-automatically nil). + Added: fsa/Makefile =================================================================== --- fsa/Makefile (rev 0) +++ fsa/Makefile 2009-03-27 08:14:42 UTC (rev 154) @@ -0,0 +1,369 @@ +# Makefile for building a final state automaton +# Copyright (c) Jan Daciuk <ja...@pg...>, 1996, 1997, 1998, 1999 +# +# The most difficult parts written by Dominique Petitpierre + +# These define i/o behaviour of programs +TEXT_IO = one_word_io.o # texts as input, grep -like output +WORD_IO = one_word_io.o # one word per line input + + +# Installation program +INSTALL = cp -i + +# C++ compiler +CXX=g++ + +# Compile options (see the file INSTALL for detail) +# A_TERGO - include code to build an index a tergo (recognizing word +# categories) +# CASECONV - the first letter in spellchecking may be uppercase - check +# both upper & lower +# CHCLASS - checks if a string is replaced with another string that +# sounds similar; in the present form, this checks one-letter +# strings against two-letter strings, and vice versa +# DEBUG - produces huge amounts of useless data +# DESCENDING - produces a bit smaller, but much slower automata +# DUMP_ALL - does not print the leading space in fsa_prefix +# FLEXIBLE - arc size should be adapted to automaton size; better +# compression, (slightly) less speed, architecture independence +# GENERALIZE - used with A_TERGO to reduce the size of the guessing +# automaton, and to increase recall +# GUESS_LEXEMES - tries to guess not only tags, but lexemes as 
well +# in fsa_guess +# GUESS_MMORPH - makes it possible to use -m option in fsa_guess to predict +# morphological descriptions of lexemes corresponding to +# unknown inflected words; the descriptions are in the format +# of mmorph - MULTEXT morphology tool developed at ISSCO. +# GUESS_PREFIX - tries to include information about prefixes to disambiguate +# morphological parses in fsa_guess +# JOIN_PAIRS - used to prune the automaton (arcs share memory) with -X +# option in fsa_build +# LARGE_DICTIONARIES +# - to build big but a little bit faster automata (do not use it) +# LOOSING_RPM - to work around a bug in rpm libstdc++ library +# MORE_COMPR - to built smaller automata more slowly +# MORPH_INFIX - makes it possible to use -I and -P options in fsa_morph +# for recognition of coded prefixes and infixes +# NEXTBIT - changes the format of the automaton, so that when there are +# chains of nodes, one following another, one bit is set +# in the goto field to indicate that fact, and only one byte +# from the goto field is used; it usually gives smaller +# automata +# NUMBERS - it is possible to use fsa_hash and build dictionaries for +# perfect hashing +# POOR_MORPH - enables -A option in fsa_morph for morphological analysis +# giving only categories, and no base forms. 
+# PROGRESS - shows how many lines were read, what fsa_build does +# PRUNE_ARCS - used with A_TERGO to reduce the size of the guessing +# automaton, and to increase precision +# RUNON_WORDS - checks whether inserting a space inside the word results +# in two correct words in fsa_spell +# SHOW_FILLERS - the filler character should be displayed in fsa_prefix +# SORT_ON_FREQ - arcs should be sorted on frequency (better compression) +# SLOW_SPARSE - try to fill every hole in sparse matrix representation +# SPARSE - use sparse matrix representation +# STATISTICS - shows some statistics after having built an automaton +# STOPBIT - changes the format of the automaton, so that there are +# no counters, but for each arc there is a bit that says +# whether it is the last one in the node; this gives smaller +# automata +# TAILS - changes the format of the automaton, allowing for more +# arc sharing, so more compression at the cost of construction +# time +# WEIGHTED - introduces weights on arcs for guessing automata. +# +# PRUNE_ARCS works only with A_TERGO +# GUESS_LEXEMES works only with A_TERGO +# LARGE_DICTIONARIES and FLEXIBLE cannot be specified together +# NUMBERS works only with FLEXIBLE +# STOPBIT works only with FLEXIBLE +# (use FLEXIBLE) +# +# See INSTALL file for info on compile options. +# +# Some versions of g++ (or stdlibc++) are broken - if so, don't use -O2! +# !!! 
If you change these, please do make clean first before each make +CPPFLAGS=-O2 --pedantic -Wall \ + -DFLEXIBLE \ + -DNUMBERS \ + -DA_TERGO \ + -DGENERALIZE \ + -DSORT_ON_FREQ \ + -DSHOW_FILLERS \ + -DSTOPBIT \ + -DNEXTBIT \ + -DMORE_COMPR \ + -DCASECONV \ + -DRUNON_WORDS \ + -DMORPH_INFIX \ + -DPOOR_MORPH \ + -DCHCLASS \ + -DGUESS_LEXEMES -DGUESS_PREFIX \ + -DGUESS_MMORPH \ + -DDUMP_ALL \ + -DSTATISTICS \ + -DPROGRESS \ + -DLOOSING_RPM #-DDMALLOC + + + + +# -pg + +# -DTAILS \ +# -DJOIN_PAIRS \ +# -DPRUNE_ARCS \ +# -DPROGRESS \ +# -DWEIGHTED \ +# -DSTATISTICS \ +# -DSPARSE \ + +# Normally empty +#LDFLAGS=-L/usr/local/lib -ldmallocxx +LDFLAGS= + +# Install directories +PREFIXDIR = /usr/local + +# this is where fsa_build, fsa_spell, etc. should go +BINDIR = ${PREFIXDIR}/bin +# this is where the manuals should be kept +MANDIR = ${PREFIXDIR}/man +# this is where the dictionaries should go; also accent and language files +DICTDIR = ${PREFIXDIR}/lib +# this is where emacs lisp files go +LISPDIR = /usr/lib/emacs/site-lisp +# this is where tcl scripts go (also perl scripts used in tclmacq) +TCLMACQBINDIR = ${BINDIR} +# this is where tclmacq support files (help, language) go +TCLMACQDIR = ${PREFIXDIR}/lib +# The following should be empty if man fconfigure shows -encoding option, +# and set to \# otherwise. In other words, if your Tcl version is 8.0, +# you should set it to \#, and if it is 8.2 or higher -- leave it empty. 
+PREP_FCONF = \# +#PREP_FCONF +# to which man section man pages for fsa belong +MANSECT1 = 1 +MANSECT5 = 5 + +######################################################################## + +# Objects that make particular programs +SPELL_OBJECTS = common.o spell.o nstr.o ${TEXT_IO} spell_main.o +ACCENT_OBJECTS = common.o nstr.o ${TEXT_IO} accent_main.o accent.o +FSA_B_OBJECTS = build_fsa.o nnode.o nindex.o nstr.o +FSA_S_OBJECTS = builds_fsa.o snode.o +FSA_U_OBJECTS = buildu_fsa.o unode.o +PREFIX_OBJECTS = common.o nstr.o one_word_io.o prefix.o prefix_main.o +GUESS_OBJECTS = common.o nstr.o ${TEXT_IO} guess.o guess_main.o +HASH_OBJECTS = common.o nstr.o ${TEXT_IO} hash.o hash_main.o +MORPH_OBJECTS = common.o nstr.o ${TEXT_IO} morph.o morph_main.o +VISUAL_OBJECTS = common.o nstr.o ${TEXT_IO} visualize.o visual_main.o +ALL_PROGS = fsa_spell fsa_build fsa_accent fsa_prefix fsa_guess fsa_hash \ + fsa_morph fsa_ubuild fsa_visual +SKL_SCRIPTS = jspell jaccent jmorph jguess +TCL_SCRIPTS = tclmacq.tcl filesel.tcl +ALL_SCRIPTS = ${SKL_SCRIPTS} chkmorph.pl deguess.pl demorph.pl \ + find_irregular.pl gendata.pl mmorph23c.pl morph_data.pl morph_infix.pl \ + morph_prefix.pl prep_atg.pl prep_ati.pl prep_atl.pl prep_atp.pl \ + putinplace.pl simplify.pl sortatt.pl sortondesc.pl tclmacq.tcl filesel.tcl +# Note that awk scripts are not portable +AWK_SCRIPTS = de_morph_data.awk de_morph_infix.awk deguess.awk demorph.awk \ + find_irregular.awk mmorph23c.awk morph_data.awk morph_infix.awk \ + morph_prefix.awk prep_atg.awk prep_ati.awk prep_atl.awk prep_atp.awk +TCL_SUPP_FILES = tclmacq-help.txt tclmacq-lang.txt + +ALL_OBJ = common.o spell.o nstr.o spell_main.o \ + accent_main.o accent.o build_fsa.o nnode.o nindex.o prefix.o prefix_main.o \ + guess.o guess_main.o hash.o hash_main.o morph.o morph_main.o builds_fsa.o \ + buildu_fsa.o unode.o snode.o visualize.o visual_main.o + + +all: ${ALL_PROGS} + + +fsa_spell: ${SPELL_OBJECTS} + ${CXX} ${CPPFLAGS} ${SPELL_OBJECTS} ${LDFLAGS} -o fsa_spell + 
+fsa_accent: ${ACCENT_OBJECTS} + ${CXX} ${CPPFLAGS} ${ACCENT_OBJECTS} ${LDFLAGS} -o fsa_accent + +fsa_build: ${FSA_B_OBJECTS} ${FSA_S_OBJECTS} + ${CXX} ${CPPFLAGS} ${FSA_B_OBJECTS} ${FSA_S_OBJECTS} ${LDFLAGS} -o fsa_build + +fsa_ubuild: ${FSA_B_OBJECTS} ${FSA_U_OBJECTS} + ${CXX} ${CPPFLAGS} ${FSA_B_OBJECTS} ${FSA_U_OBJECTS} ${LDFLAGS} -o fsa_ubuild + + +fsa_prefix: ${PREFIX_OBJECTS} + ${CXX} ${CPPFLAGS} ${PREFIX_OBJECTS} ${LDFLAGS} -o fsa_prefix + +fsa_guess: ${GUESS_OBJECTS} + ${CXX} ${CPPFLAGS} ${GUESS_OBJECTS} ${LDFLAGS} -o fsa_guess + +fsa_hash: ${HASH_OBJECTS} + ${CXX} ${CPPFLAGS} ${HASH_OBJECTS} ${LDFLAGS} -o fsa_hash + +fsa_morph: ${MORPH_OBJECTS} + ${CXX} ${CPPFLAGS} ${MORPH_OBJECTS} ${LDFLAGS} -o fsa_morph + +fsa_visual: ${VISUAL_OBJECTS} + ${CXX} ${CPPFLAGS} ${VISUAL_OBJECTS} ${LDFLAGS} -o fsa_visual + +fsa_dump: dump.cc + ${CXX} ${CPPFLAGS} dump.cc ${LDFLAGS} -o fsa_dump + +common.o: common.cc fsa.h nstr.h common.h + ${CXX} ${CPPFLAGS} -c common.cc + +spell.o: spell.cc fsa.h nstr.h spell.h common.h + ${CXX} ${CPPFLAGS} -c spell.cc + +nstr.o: nstr.cc nstr.h + ${CXX} ${CPPFLAGS} -c nstr.cc + +build_fsa.o: build_fsa.cc nnode.h nindex.h nstr.h fsa.h fsa_version.h mkindex.cc + ${CXX} ${CPPFLAGS} -c build_fsa.cc + +builds_fsa.o: builds_fsa.cc nnode.h nindex.h nstr.h fsa.h fsa_version.h mkindex.cc compile_options.h + ${CXX} ${CPPFLAGS} -c builds_fsa.cc + +buildu_fsa.o: buildu_fsa.cc nnode.h unode.h nindex.h nstr.h fsa.h fsa_version.h mkindex.cc compile_options.h + ${CXX} ${CPPFLAGS} -c buildu_fsa.cc + +nnode.o: nnode.cc nnode.h nstr.h fsa.h nindex.h + ${CXX} ${CPPFLAGS} -c nnode.cc + +unode.o: unode.cc unode.h nnode.h nstr.h fsa.h nindex.h + ${CXX} ${CPPFLAGS} -c unode.cc + +snode.o: snode.cc nnode.h nstr.h fsa.h nindex.h + ${CXX} ${CPPFLAGS} -c snode.cc + +nindex.o: nindex.cc nindex.h nnode.h + ${CXX} ${CPPFLAGS} -c nindex.cc + +one_word_io.o: one_word_io.cc fsa.h common.h + ${CXX} ${CPPFLAGS} -c one_word_io.cc + +text_io.o: text_io.cc common.h fsa.h + ${CXX} 
${CPPFLAGS} -c text_io.cc + +spell_main.o: spell_main.cc common.h spell.h fsa_version.h compile_options.h + ${CXX} ${CPPFLAGS} -c spell_main.cc + +accent_main.o: accent_main.cc comm... [truncated message content] |