OmegaT - multiplatform CAT tool / Bugs / #1236 Match statistics are affected by fuzzy match threshold

	Segments	Words	Characters (w/o spaces)	Characters (w/ spaces)
Repetitions:	425	2171	12988	14577
Exact match:	0	0	0	0
95%-100%:	1444	10162	53617	62188
85%-94%:	23	334	1693	1993
75%-84%:	44	545	3144	3587
50%-74%:	563	5213	26966	31404
No match:	568	6160	33210	38242
Total:	3067	24585	131618	151991

	Segments	Words	Characters (w/o spaces)	Characters (w/ spaces)
Repetitions:	425	2171	12988	14577
Exact match:	0	0	0	0
95%-100%:	1444	10162	53617	62188
85%-94%:	23	334	1693	1993
75%-84%:	48	560	3238	3692
50%-74%:	71	800	4208	4876
No match:	1056	10558	55874	64665
Total:	3067	24585	131618	151991
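
As a quick check on the two tables above, the following sketch compares their Segments columns row by row. It is not OmegaT code; the names `first`, `second` and `row_deltas` are made up for illustration, and the numbers are copied from the tables. Both totals are identical, and only the three lowest rows change.

```python
# Segment counts copied from the "Segments" column of the two tables above.
ROWS = ["Repetitions", "Exact match", "95%-100%", "85%-94%",
        "75%-84%", "50%-74%", "No match"]
first = [425, 0, 1444, 23, 44, 563, 568]
second = [425, 0, 1444, 23, 48, 71, 1056]


def row_deltas(before, after):
    """Return {row name: after - before} for rows whose count changed."""
    return {
        name: a - b
        for name, b, a in zip(ROWS, before, after)
        if a != b
    }


deltas = row_deltas(first, second)
print(sum(first), sum(second))  # both sums equal the reported total
print(deltas)
```

The changes cancel out (+4, -492, +488), which suggests that segments are being moved between the low bands and "No match" rather than gained or lost.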

Jean-Christophe Helary - 2023-12-18

Can you reproduce the behavior in 5.7?

- Marc Riera - 2023-12-18
  
  Yes, the numbers are slightly different in 5.7.1, but there is the same big change in the statistics when changing the threshold.
  
  - Jean-Christophe Helary - 2023-12-19
    
    Marc, could you confirm that the problem exists in 5.3 and not in 5.2?
    
    - Marc Riera - 2023-12-19
      
      I can confirm the problem exists in 5.3 and is not present in 5.2, where there is no option to adjust the match threshold.
      

Hiroshi Miura - 2023-12-22

I have started making the issue reproducible in a test.
There is no good test data with expected results yet, because the reported case is too large to be integrated into the source code.

Could you provide smaller but effective test data and expectations?

https://github.com/omegat-org/omegat/pull/871

- Hiroshi Miura - 2023-12-22
  
  A fix has now been proposed.
  
  - Thomas CORDONNIER - 2023-12-22
    
    See my comment in the Git pull request
    

Thomas CORDONNIER - 2023-12-22

From what I can see in https://sourceforge.net/p/omegat/feature-requests/1450/, the threshold was added under the condition that it did not affect the calculation of statistics, because it has a direct impact on work estimates (money and time).

Personally, I would not agree with that: it would mean that a segment is sometimes counted as a 70% match by the statistics but does not appear in the matches pane, so the translator will not see it and will not be able to auto-insert it, yet according to the statistics he will receive less money for this segment?

and that should be independent from user configuration.

No, what I would do:

If the translator is internal to the same company as the manager, force him to have the same configuration as the manager, at least for such sensitive parameters. This can be done by making the configuration file read-only.

If the translator is external, then the value of the parameter should be part of the contract, so the translator is invited to set the same parameter (if he does not, then it becomes his problem, and the requesting company will apply the rules of the contract).
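
A minimal sketch of the read-only idea, assuming the threshold lives in a plain file on disk. The thread does not say which file OmegaT stores it in, so the path is left to the caller, and `make_read_only` is a hypothetical helper, not an OmegaT feature. Whether OmegaT copes gracefully with a configuration file it cannot write is also not covered here.

```python
import os
import stat


def make_read_only(path):
    """Clear every write-permission bit on `path`, keeping the other bits."""
    mode = os.stat(path).st_mode
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)
```

A user with sufficient rights can of course restore write access, so this only guards against accidental changes; it does not enforce anything.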
- Marc Riera - 2023-12-22
  
  I agree with you that the issue has deeper implications, and I also find it strange that a user could be paid less because the statistics do not match the threshold. I initially raised the issue because I found a change of behaviour that was not documented anywhere, and I did not find anything in the documentation implying that the threshold would affect statistics.
  
  Personally, I find two acceptable solutions:
  
  Decouple statistics from the TM threshold (legacy behaviour). Add a note both in the user manual and in that specific part of the settings UI to raise awareness. In the match statistics window, warn the user that the TM threshold may hide certain matches. If possible, in the "Fuzzy matches" pane, show a warning when there are fewer than 5 matches because certain matches have been filtered out.
  
  Keep statistics coupled to the TM threshold (current behaviour). Add a note both in the user manual and in that specific part of the settings UI to raise awareness. In the match statistics window, warn the user that the TM threshold may affect the calculation of statistics. If possible, in the "Fuzzy matches" pane, show a warning when there are fewer than 5 matches because certain matches have been filtered out.
  
  With either solution, I find that users can make an informed decision regarding the threshold. In my opinion, the current issue is more a lack of clarity than a lack of functionality. This whole problem came up after a project management session with my students, where statistics were different for many people even though, apparently, everyone had followed the same steps.
  
  Thanks!
  

Jean-Christophe Helary - 2023-12-22

The threshold has nothing to do with the statistics. It is only there to set which matches are relevant in the match pane. It is just a convenience setting.
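
To illustrate the intended separation, here is a rough sketch, not OmegaT's actual code. It assumes each candidate match has an integer similarity score from 0 to 100, and the names `statistics_band` and `matches_for_pane` are invented for this example. The band labels follow the tables in the report; repetitions and exact matches are left out for brevity.

```python
# Lower bounds of the fuzzy bands, highest first, as labelled in the report.
BANDS = [(95, "95%-100%"), (85, "85%-94%"), (75, "75%-84%"), (50, "50%-74%")]


def statistics_band(score):
    """Band a score for statistics; takes no threshold argument at all."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "No match"


def matches_for_pane(scores, threshold):
    """Only the display side uses the user's threshold."""
    return [s for s in scores if s >= threshold]


# A 70% match with the threshold set to 80: hidden in the pane,
# but still counted in the 50%-74% band.
print(matches_for_pane([70], threshold=80))
print(statistics_band(70))
```

Because `statistics_band` never receives the threshold, two users with different settings would get the same statistics for the same project.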


Hiroshi Miura - 2023-12-26

assigned_to: Hiroshi Miura

Hiroshi Miura - 2023-12-28

status: open --> open-fixed
