The free computer aided translation (CAT) tool for professionals

#1260 Parallelize match statistics calculations

Milestone: 4.0

Status: closed-fixed

Owner: Aaron Madlon-Kay

Labels: None

Priority: 5

Updated: 2016-09-06

Created: 2016-07-07

Creator: Aaron Madlon-Kay

Private: No

With the move to Java 1.8, we now have access to new Java features such as parallel streams, which let you easily perform certain kinds of calculations in parallel for potentially large speed-ups. The most obvious application in OmegaT is match statistics calculations (Tools > Match Statistics), specifically the portion that calculates the best match for untranslated segments.

An initial naive implementation (merely forEach-ing a parallel stream with the rest of the code path mostly unchanged) gave a 100-200% boost on a 4-core processor (2.3 GHz Core i7-4850HQ). Refactoring this to use a proper merging Collector gave an additional boost, but I have not benchmarked it.

The processing is performed in parallel only if more than one processor is available.
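
As a rough illustration of the approach described above (not the code committed to OmegaT), the following sketch collects per-segment best-match scores into a tally using a merging Collector, and only switches to a parallel stream when more than one processor is reported. The class `ParallelStatsSketch`, the `Tally` bands, and the toy `similarity` function are all hypothetical stand-ins; the real best-match calculation is far more expensive.

```java
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

class ParallelStatsSketch {
    /** Hypothetical mutable tally of segments per match band (not OmegaT's own class). */
    static final class Tally {
        int exact; // best score == 100
        int fuzzy; // 50 <= best score < 100
        int none;  // best score < 50

        void add(int bestScore) {
            if (bestScore == 100) {
                exact++;
            } else if (bestScore >= 50) {
                fuzzy++;
            } else {
                none++;
            }
        }

        /** Combines two partial tallies; this is what makes the collector "merging". */
        Tally merge(Tally other) {
            exact += other.exact;
            fuzzy += other.fuzzy;
            none += other.none;
            return this;
        }
    }

    /** Toy similarity (assumption): shared prefix length as a percentage of the longer string. */
    static int similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 100;
        }
        int min = Math.min(a.length(), b.length());
        int i = 0;
        while (i < min && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return 100 * i / max;
    }

    /** Stand-in for the expensive "best match for this segment" lookup. */
    static int bestScore(String segment, List<String> memory) {
        int best = 0;
        for (String entry : memory) {
            best = Math.max(best, similarity(segment, entry));
        }
        return best;
    }

    /** Uses a parallel stream only when more than one processor is available. */
    static Tally compute(List<String> segments, List<String> memory, int processors) {
        Stream<String> stream = processors > 1 ? segments.parallelStream() : segments.stream();
        return stream
                .map(s -> bestScore(s, memory))
                // Each worker fills its own Tally; partial tallies are merged at the end.
                .collect(Collector.of(Tally::new, Tally::add, Tally::merge));
    }
}
```

A caller would typically pass `Runtime.getRuntime().availableProcessors()` as the `processors` argument. Because each worker accumulates into its own `Tally` and the partial results are merged afterwards, no shared mutable state needs locking, which is the likely reason a merging Collector beats a plain `forEach` over a parallel stream.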

Discussion

Aaron Madlon-Kay - 2016-07-07

I also tried valiantly to parallelize the fuzzy match calculations (FindMatches.java), but the result was slightly slower than the current implementation, except when doing "separate segment matching" (when the project is not segmented and we fuzzy match subsegments of a segment), which was 10–20% faster.

Overall it didn't seem worthwhile, so I have not committed the changes to trunk. However, the code is available in git on the branch topic/aaron/parallel-matching.

Why it was slower:

Our fuzzy matching algorithm is stateful in that it keeps a running list of the five best matches encountered so far. An expensive calculation (tokenization + Levenshtein distance) is performed three times (stemmed, not stemmed, verbatim) on each candidate string, but by comparing against the list we can give up early and skip some of the calculations.

When run in parallel, each thread has its own list of the best matches it has encountered so far, but that list does not represent the global best matches, so each thread early-outs less often and thus does more of the expensive calculations. This, plus the parallelization overhead, is probably what makes it slower overall.
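
The effect can be shown with a small, deliberately simplified sketch. It is not the FindMatches.java logic: candidates are reduced to precomputed first-stage scores, the early-out rule (skip the remaining two stages when the first score is below the weakest kept match) is an assumption, and "threads" are simulated by splitting the candidates into chunks that each keep their own top-five list. The names `EarlyOutSketch`, `BestList`, and `countCalls` are hypothetical.

```java
import java.util.PriorityQueue;

class EarlyOutSketch {
    static final int KEEP = 5;   // size of the running best-match list
    static final int STAGES = 3; // stemmed, not stemmed, verbatim

    /** Running list of the KEEP best scores seen so far; the heap head is the weakest one kept. */
    static final class BestList {
        private final PriorityQueue<Integer> best = new PriorityQueue<>();
        int expensiveCalls;

        void consider(int firstStageScore) {
            expensiveCalls++; // the first stage is always computed
            if (best.size() == KEEP && firstStageScore < best.peek()) {
                return; // early-out: cannot beat the current list, skip the other stages
            }
            expensiveCalls += STAGES - 1; // remaining stages are computed
            best.add(firstStageScore);
            if (best.size() > KEEP) {
                best.poll(); // drop the weakest so only KEEP entries remain
            }
        }
    }

    /** Counts expensive calls when the candidates are split into chunks, one list per chunk. */
    static int countCalls(int[] scores, int chunks) {
        int total = 0;
        int size = (scores.length + chunks - 1) / chunks;
        for (int start = 0; start < scores.length; start += size) {
            BestList list = new BestList(); // each simulated thread starts with an empty list
            int end = Math.min(start + size, scores.length);
            for (int i = start; i < end; i++) {
                list.consider(scores[i]);
            }
            total += list.expensiveCalls;
        }
        return total;
    }
}
```

For the scores `{90, 80, 70, 60, 50, 10, 20, 30, 40, 45, 5, 15}`, `countCalls(scores, 1)` gives 22 expensive calls, while `countCalls(scores, 2)` gives 34: the second chunk never sees the strong matches from the first, so its threshold stays low and it rarely gives up early.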

Last edit: Aaron Madlon-Kay 2016-07-07

Didier Briel - 2016-09-06

status: open-fixed --> closed-fixed

Didier Briel - 2016-09-06

Implemented in the released version 4.0 of OmegaT.

Didier
