#314 hfst-proc stuck in a loop on very long analyses

Milestone: future

Status: open

Owner: nobody

Labels: proc, hfst-proc

Priority: 1

Updated: 2015-09-21

Created: 2015-09-21

Creator: Kevin Brubeck Unhammer

Private: No

This command:

echo '0226 04 0-6 0–7–90–90 07 1177 1182 11 12-14 12 13 13 14-15 148 1-4 1.-4 1500- 1600 15 1600 17 1814 1840 1848 1-8 18 1900 1902 1924 1956 1959 1963 1967 1970 1972-1982 1973 1975 1976 1979 1980-81 1983 1985 19870612-056 1987 1988 1991 1997-4 19980717061 1998 1999 1 2000 2002 20050617-064 2005 2007-13 2007 2009-2013 200 2011 30 2012-06 2013 2014 2016 204 207 2.2 25 267 2.-6 27 752 000 29 2 30 32 350 3'|hfst-proc smj-sme.automorf-untrimmed.hfst |wc -l

makes hfst-proc eat RAM and never output anything (well, I didn't try letting it eat my swap), while hfst-lookup seems to work (40961 analyses after 17s of CPU time; Xerox lookup somehow takes <2s).
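To reproduce this without risking swap, the command can be wrapped in a memory cap and a time limit. A minimal sketch, assuming a Linux shell where `ulimit -v` is enforced and GNU `timeout` is installed; `run_limited` and the limits shown are made up for illustration and are not part of HFST:

```shell
# Hypothetical helper (not part of HFST): run a command under a
# virtual-memory cap and a wall-clock limit.
# Usage: run_limited MAX_KIB SECONDS command [args...]
# The function body is a subshell, so the limit does not leak into
# the calling shell.
run_limited() (
    max_kib="$1"; secs="$2"; shift 2
    ulimit -v "$max_kib" || exit 125   # cap address space for this subshell
    exec timeout "$secs" "$@"          # GNU timeout exits 124 on expiry
)

# Example (assumed limits: about 4 GiB and 60 seconds):
#   echo '0226 04 0-6' | run_limited 4000000 60 hfst-proc smj-sme.automorf-untrimmed.hfst | wc -l
```

An exit status of 124 then means the time limit was hit rather than that the process finished on its own.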

1 Attachments

smj-sme.automorf-untrimmed.hfst

Discussion

hfst-optimised-lookup is much faster than Xerox lookup, but skips most analyses (which might be OK):

$ time echo '0226 04 0-6 0–7–90–90 07 1177 1182 11 12-14 12 13 13 14-15 148 1-4 1.-4 1500- 1600 15 1600 17 1814 1840 1848 1-8 18 1900 1902 1924 1956 1959 1963 1967 1970 1972-1982 1973 1975 1976 1979 1980-81 1983 1985 19870612-056 1987 1988 1991 1997-4 19980717061 1998 1999 1 2000 2002 20050617-064 2005 2007-13 2007 2009-2013 200 2011 30 2012-06 2013 2014 2016 204 207 2.2 25 267 2.-6 27 752 000 29 2 30 32 350 3'| lookup -q build/xerox/src/analyser-gt-norm.xfst | wc -l
   40961

real    0m1.803s
user    0m1.809s
sys     0m0.050s
$ time echo '0226 04 0-6 0–7–90–90 07 1177 1182 11 12-14 12 13 13 14-15 148 1-4 1.-4 1500- 1600 15 1600 17 1814 1840 1848 1-8 18 1900 1902 1924 1956 1959 1963 1967 1970 1972-1982 1973 1975 1976 1979 1980-81 1983 1985 19870612-056 1987 1988 1991 1997-4 19980717061 1998 1999 1 2000 2002 20050617-064 2005 2007-13 2007 2009-2013 200 2011 30 2012-06 2013 2014 2016 204 207 2.2 25 267 2.-6 27 752 000 29 2 30 32 350 3'| hfst-optimised-lookup -q build/hfst/src/analyser-gt-norm.hfstol | wc -l
       8

real    0m0.207s
user    0m0.101s
sys     0m0.056s

The issue in hfst-proc might be that it tokenizes the input while it is doing the analysis. Because of all the spaces, the tokenization is potentially very ambiguous.
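To get a feel for why the spaces could hurt: if any run of adjacent words were allowed to form one multiword token (an assumption for illustration only; a real analyser accepts just some of them), n words would have 2^(n-1) possible segmentations. A toy sketch, where `segmentations` is a hypothetical helper unrelated to hfst-proc's actual code:

```shell
# Toy illustration, not hfst-proc's algorithm: list every way to group
# space-separated words into tokens, assuming any run of adjacent words
# may form a single multiword token.
# Usage: segmentations "" word1 word2 ...
segmentations() {
    local prefix="$1" token=""
    shift
    if [ "$#" -eq 0 ]; then
        # No words left: the prefix is one complete segmentation.
        printf '%s\n' "$prefix"
        return
    fi
    while [ "$#" -gt 0 ]; do
        # Grow the current token by one word, then segment the rest.
        token="${token:+$token }$1"
        shift
        segmentations "${prefix:+$prefix }[$token]" "$@"
    done
}

segmentations "" 1 2 3
# [1] [2] [3]
# [1] [2 3]
# [1 2] [3]
# [1 2 3]
```

Three words already give four segmentations, and the count doubles with each extra word, so under this worst-case assumption an input with dozens of space-separated numbers is far beyond what can be enumerated.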

So the 40961 lines have something to do with the Apertium version of the analyser:

$ hfst-fst2fst -O -i analyser-mt-gt-desc.hfst -o tmp.hfstol 
$ time echo '0226 04 0-6 0–7–90–90 07 1177 1182 11 12-14 12 13 13 14-15 148 1-4 1.-4 1500- 1600 15 1600 17 1814 1840 1848 1-8 18 1900 1902 1924 1956 1959 1963 1967 1970 1972-1982 1973 1975 1976 1979 1980-81 1983 1985 19870612-056 1987 1988 1991 1997-4 19980717061 1998 1999 1 2000 2002 20050617-064 2005 2007-13 2007 2009-2013 200 2011 30 2012-06 2013 2014 2016 204 207 2.2 25 267 2.-6 27 752 000 29 2 30 32 350 3'|hfst-optimised-lookup tmp.hfstol |wc -l     
   40961

real    0m1.877s
user    0m1.652s
sys     0m0.285s

I guess it makes sense that hfst-proc's tokenisation can take a lot of memory if there are 40961 end states along with all those spaces.
