
#311 hfst-optimized-lookup should fail on long words if requested

Milestone: future
Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2015-10-30
Created: 2015-09-11
Creator: Anonymous
Private: No

hfst-optimized-lookup and hfst-lookup should have an option not to analyze words longer than a user-specified limit.

Sometimes input text contains very long words. Looking them up in a transducer with a compounding mechanism can require a lot of memory and take a long time. It may be preferable not to analyze such words at all.

The user should be able to specify an upper limit, in terms of character count, on the length of words which are analyzed.
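
A minimal sketch of the requested behaviour (the option name, the +? marker and the surrounding loop are illustrative assumptions, not the actual hfst-lookup code):

    // Skip tokens longer than a user-specified limit instead of analyzing them.
    // Note: token.size() counts bytes, not UTF-8 characters; a real
    // implementation would count characters.
    #include <cstddef>
    #include <iostream>
    #include <string>

    int main()
    {
        // Hypothetical value of a --max-input-length style option.
        const std::size_t max_input_length = 100;

        std::string token;
        while (std::getline(std::cin, token)) {
            if (token.size() > max_input_length) {
                // Emit the token unanalyzed rather than spending unbounded
                // time and memory on it.
                std::cout << token << "\t" << token << "+?\n\n";
                continue;
            }
            // ... normal transducer lookup would happen here ...
        }
        return 0;
    }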

Discussion

  • Flammie Pirinen

    Flammie Pirinen - 2015-09-11

    Just FYI, I've run into the same problem: omorfi-segment with interjection compounding grinds to a halt if you feed it a string consisting of 120*"ha", which is a real-world example from our internet corpus. It happens to be an incidentally near-pathological case, since the interjections ha, hah, and ah ensure plenty of ambiguity.

    Other ways to work around this, and other, even infinite, lookups would be cycle counting (do not visit the same state more than N times) or timers (output all gathered paths and ignore the rest after T seconds); a sketch of the cycle-counting idea appears at the end of this thread.

     
  • Jussi Piitulainen

    The single token
    "isoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoisoäitisi",
    occurring once in two billion tokens' worth of a discussion forum, had both the Turku dependency parser and our own finer run out of memory. I think a similar but shorter token was responsible when another job took a long time and had to be re-run. (I'm using the batch system in taito.) Both tools contain a version of omorfi and use some version of hfst lookup.

    The problem is not the length of the token but the number of analyses, I think (in this case the large number of different lemmas where there may or may not be a hyphen at each word boundary in the lemma, but that's beside the point - the next bomb token will be something else altogether).

    And it's quite a nuisance when a large job fails due to a single token. Please fix, and make the fix the default operating mode.

     
  • Sam Hardwick

    Sam Hardwick - 2015-10-19

    hfst-optimized-lookup, meaning the C++ standalone programs in /trunk/hfst3/tools/src/hfst-optimized-lookup.cc and /trunk/hfst-optimized-lookup/hfst-optimized-lookup, do actually enforce a 5000-character input limit. However, I don't think that's really optimal, because, as Jussi says, it's really a matter of "bomb tokens" that produce a very large amount of ambiguity without being input-epsilon-cyclic.

    Cycle counting is implemented in the library optimized-lookup code (used by hfst-lookup), but it's only used for infinitely ambiguous cases.

    Another thing that is implemented in the library optimized-lookup code is a hard fail-early number-of-results limitation, but for some reason that is not used by hfst-lookup, which currently collects all results, of which only the n best are actually printed. So one possibility would be to modify hfst-lookup to use the limit parameter when calling optimized-lookup transducers. The downside is that the result limitation code doesn't guarantee the n best results - that can of course be done, but it becomes much less reliable at not spending a large amount of time when looking for possibly better results.

    Erik, is there some reason not to use it?
    Jussi and Tommi, how do you feel about having a real n-best guarantee vs. avoiding extreme slowness?

     
  • Flammie Pirinen

    Flammie Pirinen - 2015-10-19

    I think a good solution is not easy for our use cases. The problem is that this really affects one in a hundred million tokens; downgrading the whole lookup run to get n random results instead of all-paths or n-best is probably not a good thing.

    There's a solution used by some versions of hunspell, a timeout, that's a bit tedious to program but guarantees good results for all non-pathological cases and whatever has been gathered for the few problem ones. That is, setting up an n-second callback/interrupt/whatever that aborts traversal and prints the current n-best. But that requires significant changes in the algorithm.

    I guess cycle counting will give no results for either of our examples, since they involve a high number of cycles.

    It may also be possible to write a greedy traversal algorithm, if the existing one isn't greedy already; combined with fail-early, that should usually give good enough results.

     
  • Sam Hardwick

    Sam Hardwick - 2015-10-30

    I couldn't come up with a really reliable default setting that would protect against bombs, but I implemented a time cutoff parameter for hfst-lookup in optimized-lookup mode.

    The cutoff is not completely hard, in the sense that some backing-off can still happen after the time limit has been exceeded, but not so much that it would ruin a batch job.
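
A rough sketch of the time-cutoff idea, assuming a hypothetical traversal function and types (this is not the actual hfst-lookup change): the traversal checks a wall-clock deadline at each step and, once it has passed, stops expanding new paths and returns whatever analyses have been gathered so far.

    #include <chrono>
    #include <string>
    #include <vector>

    struct State { /* transducer state, input position, output so far ... */ };

    std::vector<std::string> lookup_with_cutoff(const State& start,
                                                double cutoff_seconds)
    {
        using clock = std::chrono::steady_clock;
        const auto deadline = clock::now()
            + std::chrono::duration<double>(cutoff_seconds);

        std::vector<std::string> results;
        std::vector<State> agenda{start};

        while (!agenda.empty()) {
            if (clock::now() > deadline) {
                break;  // soft cutoff: keep the analyses gathered so far
            }
            State current = agenda.back();
            agenda.pop_back();
            (void)current;
            // ... if current is final, append its output to results;
            //     otherwise push its successor states onto the agenda ...
        }
        return results;
    }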
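
For reference, a minimal sketch of the cycle-counting workaround mentioned earlier in the thread ("do not visit the same state more than N times"); the toy transition table and names are illustrative assumptions, not the optimized-lookup internals.

    #include <cstddef>
    #include <vector>

    struct Arc { std::size_t target; };                // toy transition
    using Transducer = std::vector<std::vector<Arc>>;  // state -> outgoing arcs

    // Depth-first traversal that prunes any path visiting a state more than
    // max_visits times, so compounding loops cannot be followed forever.
    void traverse(const Transducer& t, std::size_t state,
                  std::vector<std::size_t>& visits, std::size_t max_visits)
    {
        if (visits[state] >= max_visits) {
            return;                                    // cycle limit reached
        }
        ++visits[state];
        // ... emit output / check finality for this state here ...
        for (const Arc& a : t[state]) {
            traverse(t, a.target, visits, max_visits);
        }
        --visits[state];                               // backtrack
    }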