When calculating match statistics, the fuzzy match threshold seems to affect the result. Specifically, any match below the threshold is generally counted as "No match" instead of the real percentage. Please find a sample project attached.
From what I can see in https://sourceforge.net/p/omegat/feature-requests/1450/, the threshold was added under the condition that it did not affect the calculation of statistics, because it has a direct impact on work estimates (money and time) and that should be independent from user configuration. This seems to be no longer the case, maybe due to a regression at some point.
Results with a threshold of 30%:
| | Segments | Words | Characters (w/o spaces) | Characters (w/ spaces) |
|---|---|---|---|---|
| Repetitions: | 425 | 2171 | 12988 | 14577 |
| Exact match: | 0 | 0 | 0 | 0 |
| 95%-100%: | 1444 | 10162 | 53617 | 62188 |
| 85%-94%: | 23 | 334 | 1693 | 1993 |
| 75%-84%: | 44 | 545 | 3144 | 3587 |
| 50%-74%: | 563 | 5213 | 26966 | 31404 |
| No match: | 568 | 6160 | 33210 | 38242 |
| Total: | 3067 | 24585 | 131618 | 151991 |
Results with a threshold of 70%:
| | Segments | Words | Characters (w/o spaces) | Characters (w/ spaces) |
|---|---|---|---|---|
| Repetitions: | 425 | 2171 | 12988 | 14577 |
| Exact match: | 0 | 0 | 0 | 0 |
| 95%-100%: | 1444 | 10162 | 53617 | 62188 |
| 85%-94%: | 23 | 334 | 1693 | 1993 |
| 75%-84%: | 48 | 560 | 3238 | 3692 |
| 50%-74%: | 71 | 800 | 4208 | 4876 |
| No match: | 1056 | 10558 | 55874 | 64665 |
| Total: | 3067 | 24585 | 131618 | 151991 |
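As a quick arithmetic check on the two tables, the sketch below uses only the segment counts reported above (the variable names are mine). It confirms that both runs add up to the same total and shows which rows change when the threshold goes from 30% to 70%.

```python
# Segment counts copied from the two tables above (30% and 70% threshold).
ROWS = ["Repetitions", "Exact match", "95%-100%", "85%-94%",
        "75%-84%", "50%-74%", "No match"]
SEGMENTS_30 = [425, 0, 1444, 23, 44, 563, 568]
SEGMENTS_70 = [425, 0, 1444, 23, 48, 71, 1056]
TOTAL = 3067

# Both runs must account for every segment in the project.
assert sum(SEGMENTS_30) == TOTAL
assert sum(SEGMENTS_70) == TOTAL

# Per-row difference when raising the threshold from 30% to 70%.
delta = {row: b - a for row, a, b in zip(ROWS, SEGMENTS_30, SEGMENTS_70)}
changed = {row: d for row, d in delta.items() if d != 0}
print(changed)
```

In these numbers, 492 segments leave the 50%-74% row; 488 of them end up in "No match" and 4 in 75%-84%. The rows from 85% upwards are identical in both runs.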
This is on OmegaT 6.0.0 under Arch Linux (up to date) with Java 11 (OpenJDK).
Thank you very much in advance.
Can you reproduce the behavior in 5.7?
Yes, the numbers are slightly different in 5.7.1, but there is the same big change in the statistics when changing the threshold.
Marc, could you confirm that the problem exists in 5.3 and not in 5.2?
I can confirm the problem exists in 5.3 and is not present in 5.2, where there is no option to adjust the match threshold.
The next step is to make this reproducible in a test.
There is no good test data with expected results yet, because the reported case is too large to be integrated into the source code.
Could you provide smaller but effective test data and expectations?
https://github.com/omegat-org/omegat/pull/871
A fix has now been proposed.
See my comment in the Git pull request
Personally, I would not agree with that: it would mean that a segment is sometimes counted as a 70% match by the statistics but does not appear in the matches pane, so the translator will not see it and will not be able to auto-insert it, yet according to the statistics he will receive less money for this segment?
No, what I would do
I agree with you that the issue has deeper implications, and I also find it strange that a user could be paid less because the statistics do not match the threshold. I initially raised the issue because I found a change of behaviour that was not documented anywhere, and I did not find anything in the documentation implying that the threshold would affect statistics.
Personally, I find two acceptable solutions:
1. Decouple statistics from the TM threshold (legacy behaviour). Add a note both in the user manual and in that specific part of the settings UI to raise awareness. In the match statistics window, warn the user that the TM threshold may hide certain matches. If possible, in the "Fuzzy matches" pane, show a warning when there are fewer than 5 matches because certain matches have been filtered out.
2. Keep statistics coupled to the TM threshold (current behaviour). Add a note both in the user manual and in that specific part of the settings UI to raise awareness. In the match statistics window, warn the user that the TM threshold may affect the calculation of statistics. If possible, in the "Fuzzy matches" pane, show a warning when there are fewer than 5 matches because certain matches have been filtered out.
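To make the difference between the decoupled (legacy) and coupled (current) behaviour concrete, here is a minimal sketch. It is not OmegaT's code: the function names, the `threshold` parameter and the sample scores are made up for illustration, and it ignores the "Repetitions" and "Exact match" rows. It only assumes the percentage bands shown in the statistics tables.

```python
from collections import Counter

# Bands as they appear in the match statistics tables; lower bound inclusive.
BANDS = [(95, "95%-100%"), (85, "85%-94%"), (75, "75%-84%"), (50, "50%-74%")]


def band(score):
    """Return the statistics row for a best-match percentage."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "No match"


def count_bands(best_scores, threshold=None):
    """Count segments per band.

    threshold=None models the decoupled behaviour (statistics ignore the
    setting); a number models the coupled behaviour, where any score below
    the threshold is counted as "No match".
    """
    counts = Counter()
    for score in best_scores:
        if threshold is not None and score < threshold:
            counts["No match"] += 1
        else:
            counts[band(score)] += 1
    return dict(counts)


scores = [98, 88, 72, 65, 55, 40]  # made-up best-match percentages
decoupled = count_bands(scores)
coupled = count_bands(scores, threshold=70)
print(decoupled)
print(coupled)
```

With a threshold of 70, the two hypothetical segments at 65% and 55% move from the 50%-74% row to "No match", which is the same kind of shift seen in the reported tables.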
With either solution, I think users can make an informed decision regarding the threshold. In my opinion, the current issue is more a lack of clarity than a lack of functionality. This whole problem came up after a project management session with my students, where the statistics were different for many people even though, apparently, everyone had followed the same steps.
Thanks!
The threshold has nothing to do with the statistics. It is only there to set which matches are relevant in the match pane. It is just a convenience setting.