#1236 Match statistics are affected by fuzzy match threshold

6.1
open-fixed
None
5
2023-12-28
2023-12-17
Marc Riera
No

When calculating match statistics, the fuzzy match threshold seems to affect the result. Specifically, any match below the threshold is generally counted as "No match" instead of the real percentage. Please find a sample project attached.

From what I can see in https://sourceforge.net/p/omegat/feature-requests/1450/, the threshold was added under the condition that it did not affect the calculation of statistics, because it has a direct impact on work estimates (money and time) and that should be independent from user configuration. This seems to be no longer the case, maybe due to a regression at some point.
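To make the suspected mechanism concrete, here is a minimal Python sketch. It is not OmegaT's actual code: the band boundaries are taken from the statistics rows in this report, while the function names, the `threshold` parameter and the sample scores are hypothetical. It assumes each segment's best fuzzy score is bucketed into one row, and shows how discarding scores below the threshold before bucketing would push segments into "No match".

```python
from collections import Counter

# Band lower bounds taken from the statistics rows in this report.
BANDS = [(95, "95%-100%"), (85, "85%-94%"), (75, "75%-84%"), (50, "50%-74%")]


def band(score):
    """Return the statistics row for a best-match score (0-100)."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "No match"


def statistics(best_scores, threshold=None):
    """Count segments per row.

    threshold=None models the decoupled behaviour: every score is bucketed.
    A numeric threshold models the suspected coupling (an assumption): scores
    below it are discarded before bucketing and land in "No match".
    """
    counts = Counter()
    for score in best_scores:
        if threshold is not None and score < threshold:
            counts["No match"] += 1
        else:
            counts[band(score)] += 1
    return dict(counts)


# Hypothetical best-match scores for six segments.
scores = [98, 88, 72, 65, 55, 40]

print(statistics(scores))                # decoupled
print(statistics(scores, threshold=30))  # same result: nothing above 50 is hidden
print(statistics(scores, threshold=70))  # 65 and 55 move to "No match"
```

In this toy model, a 70% threshold leaves only scores of 70 to 74 in the 50%-74% row, which would be consistent with that row shrinking sharply in the reported figures. The model does not explain the small change in the 75%-84% row.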

Results with a threshold of 30%:

              Segments  Words  Characters (w/o spaces)  Characters (w/ spaces)
Repetitions:       425   2171                    12988                   14577
Exact match:         0      0                        0                       0
95%-100%:         1444  10162                    53617                   62188
85%-94%:            23    334                     1693                    1993
75%-84%:            44    545                     3144                    3587
50%-74%:           563   5213                    26966                   31404
No match:          568   6160                    33210                   38242
Total:            3067  24585                   131618                  151991

Results with a threshold of 70%:

              Segments  Words  Characters (w/o spaces)  Characters (w/ spaces)
Repetitions:       425   2171                    12988                   14577
Exact match:         0      0                        0                       0
95%-100%:         1444  10162                    53617                   62188
85%-94%:            23    334                     1693                    1993
75%-84%:            48    560                     3238                    3692
50%-74%:            71    800                     4208                    4876
No match:         1056  10558                    55874                   64665
Total:            3067  24585                   131618                  151991
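As a quick check on the two result sets above, the following Python sketch compares the segment counts row by row. The numbers are copied from this report; the variable and function names are only illustrative.

```python
ROWS = ["Repetitions", "Exact match", "95%-100%", "85%-94%",
        "75%-84%", "50%-74%", "No match"]

# Segment counts copied from the results above (Total row excluded).
segments_at_30 = dict(zip(ROWS, [425, 0, 1444, 23, 44, 563, 568]))
segments_at_70 = dict(zip(ROWS, [425, 0, 1444, 23, 48, 71, 1056]))


def row_changes(before, after):
    """Return {row: after - before} for every row whose count changed."""
    return {row: after[row] - before[row]
            for row in before if after[row] != before[row]}


changes = row_changes(segments_at_30, segments_at_70)
print(changes)
# {'75%-84%': 4, '50%-74%': -492, 'No match': 488}

# Both sums equal the reported total, so segments only move between rows.
print(sum(segments_at_30.values()), sum(segments_at_70.values()))
# 3067 3067
```

Raising the threshold from 30% to 70% removes 492 segments from the 50%-74% row; 488 of them reappear under "No match" and 4 under 75%-84%. The rows from 85% upwards are unchanged.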

This is on OmegaT 6.0.0 under Arch Linux (up to date) with Java 11 (OpenJDK).

Thank you very much in advance.

1 Attachments

Discussion

  • Jean-Christophe Helary

    Can you reproduce the behavior in 5.7?

     
    • Marc Riera

      Marc Riera - 2023-12-18

      Yes, the numbers are slightly different in 5.7.1, but there is the same big change in the statistics when changing the threshold.

       
      • Jean-Christophe Helary

        Marc, could you confirm that the problem exists in 5.3 and not in 5.2?

         
        • Marc Riera

          Marc Riera - 2023-12-19

          I can confirm the problem exists in 5.3 and is not present in 5.2, where there is no option to adjust the match threshold.

           
  • Hiroshi Miura

    Hiroshi Miura - 2023-12-22

    I have started making the issue reproducible in a test.
    There is no good test data with expected results yet, because the reported case is too large to be integrated into the source code.

    Could you provide a smaller but still effective set of test data and expected results?

    https://github.com/omegat-org/omegat/pull/871

     
    • Hiroshi Miura

      Hiroshi Miura - 2023-12-22

      A fix has now been proposed.

       
      • Thomas CORDONNIER

        See my comment in the Git pull request

         
  • Thomas CORDONNIER

    From what I can see in https://sourceforge.net/p/omegat/feature-requests/1450/, the threshold was added under the condition that it did not affect the calculation of statistics, because it has a direct impact on work estimates (money and time)

    Personally, I would not agree with that: it would mean that a segment is sometimes counted as a 70% match by the statistics but does not appear in the matches pane, so the translator will not see it and will not be able to auto-insert it, yet according to the statistics he will receive less money for this segment?

    and that should be independent from user configuration.

    No. What I would do instead:

    • if the translator is internal to the same company as the manager, force him to have the same configuration as the manager, at least for such sensitive parameters. This can be done by making the config file read-only
    • if the translator is external, then the value of the parameter should be part of the contract, so the translator is invited to set the same value (if he does not, it becomes his problem: the requesting company will apply the rules of the contract)
     
    • Marc Riera

      Marc Riera - 2023-12-22

      I agree with you that the issue has deeper implications, and I also find it strange that a user could be paid less due to statistics not matching the threshold. I initially raised the issue because I found a change of behaviour that was not documented anywhere, and I did not find anything in the documentation implying that the threshold would affect statistics.

      Personally, I find two acceptable solutions:

      1. Decouple statistics from the TM threshold (legacy behaviour). Add a note both in the user manual and in that specific part of the settings UI to raise awareness. In the match statistics window, warn the user that the TM threshold may hide certain matches. If possible, in the "Fuzzy matches" pane, show a warning when there are fewer than 5 matches because certain matches have been filtered out.

      2. Keep statistics coupled to the TM threshold (current behaviour). Add a note both in the user manual and in that specific part of the settings UI to raise awareness. In the match statistics window, warn the user that the TM threshold may affect the calculation of statistics. If possible, in the "Fuzzy matches" pane, show a warning when there are fewer than 5 matches because certain matches have been filtered out.

      With either solution, I find that users can make an informed decision regarding the threshold. In my opinion, the current issue is more a lack of clarity than a lack of functionality. This whole problem came up after a project management session with my students, where statistics differed for many people even though, apparently, everyone had followed the same steps.

      Thanks!

       
  • Jean-Christophe Helary

    The threshold has nothing to do with the statistics. It is only there to set which matches are considered relevant in the match pane. It is just a convenience setting.

     
  • Hiroshi Miura

    Hiroshi Miura - 2023-12-26
    • assigned_to: Hiroshi Miura
     
  • Hiroshi Miura

    Hiroshi Miura - 2023-12-28
    • status: open --> open-fixed
     
