in lttoolbox/fst_processor.cc:763 and on we have
    FSTProcessor::compoundAnalysis(wstring input_word, bool uppercase, bool firstupper) {
      const int MAX_COMBINATIONS = 500;
      …
      if(current_state.size() > MAX_COMBINATIONS) {
Was this limit picked out of the air, or was it well tested? Computers are presumably faster now than when this was first written, so it may be ripe for increasing. We should compare time and memory usage with e.g. 500 vs 1000 vs 2000 vs 4000 on a large corpus and with several different large analysers (nob, deu, others?), and possibly increase the limit to a value that doesn't hurt too much.
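Something like the following could serve as a test harness. This is only an untested sketch: since MAX_COMBINATIONS is a compile-time constant, each value means patching fst_processor.cc and rebuilding lt-proc, and dewiki.txt / deu.automorf.bin are placeholder names for whatever corpus and compiled analyser you actually use.

    # Sketch only: run from the top of a checked-out lttoolbox tree.
    for n in 500 1000 2000 4000; do
        # patch the constant and rebuild lt-proc
        sed -i "s/MAX_COMBINATIONS = [0-9]*/MAX_COMBINATIONS = $n/" lttoolbox/fst_processor.cc
        make
        # wall-clock time and peak memory ("Maximum resident set size" in GNU time's -v output);
        # adjust the lt-proc path to wherever your build puts the binary
        /usr/bin/time -v lttoolbox/lt-proc -e deu.automorf.bin < dewiki.txt > /dev/null
    done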
I looked through the source code, and what I understood is that this function (compoundAnalysis) is called when lt-proc -e is invoked. So what this task requires us to do is run lt-comp on a standard dictionary and lt-proc on a large corpus (cat large_corpus | lt-proc -e bin_file_generated.bin), and compare the time and memory usage for different values of the constant MAX_COMBINATIONS. I don't understand how analysers play a role. Can you please elaborate on that?

Last edit: Venkat Parthasarathy 2017-03-13
Your file_generated.bin is a morphological analyser compiled as a finite state transducer. They're typically named things like deu.automorf.bin (for apertium-deu).

So, we consider large analysers by compiling those dictionaries (like apertium-deu.deu.dix or apertium-nob.nob.dix)?

apertium-get apertium-deu (or -nob) will give you that (if you have apertium-all-dev installed).

Last edit: Kevin Brubeck Unhammer 2017-03-13
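Concretely, something along these lines should give you a large analyser to test against. Untested sketch: it assumes apertium-get and the lttoolbox tools are on your PATH, and reuses the file names mentioned above.

    # fetch and build the German language data (needs apertium-get / apertium-all-dev)
    apertium-get apertium-deu
    cd apertium-deu
    # the package build normally produces deu.automorf.bin already, but it can
    # also be compiled by hand from the monolingual dictionary:
    lt-comp lr apertium-deu.deu.dix deu.automorf.bin
    # quick sanity check with compound analysis enabled
    echo "Hausschuhe" | lt-proc -e deu.automorf.bin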
I routinely run dewiki through apertium-deu, if you need something to test with.
I did some unscientific testing with apertium-deu & dewiki, and it seems to me that MAX_COMBINATIONS is not a bottleneck regardless of how high it is set.
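One quick way to check that claim would be to diff the output of two builds with different limits. In this sketch, lt-proc-500 and lt-proc-4000 are placeholder names for lt-proc binaries compiled with MAX_COMBINATIONS set to 500 and 4000 respectively.

    # if the outputs are identical, raising the limit made no difference on this corpus
    lt-proc-500  -e deu.automorf.bin < dewiki.txt > out.500
    lt-proc-4000 -e deu.automorf.bin < dewiki.txt > out.4000
    diff -q out.500 out.4000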
Fixed?
It seems to have been scientifically tested.