
#1260 Parallelize match statistics calculations

4.0
closed-fixed
None
5
2016-09-06
2016-07-07
No

With the move to Java 1.8, we now have access to new Java features such as parallel streams, which let you easily perform certain kinds of calculations in parallel for potentially large speed-ups. The most obvious application in OmegaT is match statistics calculations (Tools > Match Statistics), specifically the portion that calculates the best match for untranslated segments.

An initial naive implementation (merely forEach-ing a parallel stream with the rest of the code path mostly unchanged) gave a 100–200% speed-up on a 4-core processor (2.3 GHz Core i7-4850HQ). Refactoring this to use a proper merging Collector gave an additional speed-up, but I have not benchmarked it.

The processing is performed in parallel only if more than one processor is available.
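To illustrate the approach described above, here is a minimal sketch of a merging Collector that runs on a parallel stream only when more than one processor is available. It is not the code from the OmegaT source: the class ParallelStatsSketch, the Counts accumulator, and the word-counting logic are hypothetical stand-ins for the real match statistics calculation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class ParallelStatsSketch {

    /** Hypothetical accumulator (not an OmegaT class): counts segments and words. */
    static final class Counts {
        int segments;
        int words;

        /** Folds one segment into this accumulator. */
        void add(String segment) {
            segments++;
            String trimmed = segment.trim();
            if (!trimmed.isEmpty()) {
                words += trimmed.split("\\s+").length;
            }
        }

        /** Combines the partial result of another thread into this one. */
        Counts merge(Counts other) {
            segments += other.segments;
            words += other.words;
            return this;
        }
    }

    static Counts count(List<String> segments) {
        // Go parallel only if more than one processor is available.
        boolean parallel = Runtime.getRuntime().availableProcessors() > 1;
        Stream<String> stream = parallel ? segments.parallelStream() : segments.stream();
        // Each thread fills its own Counts; merge() combines them at the end,
        // so no shared mutable state is touched during accumulation.
        return stream.collect(Collector.of(Counts::new, Counts::add, Counts::merge));
    }

    public static void main(String[] args) {
        Counts c = count(Arrays.asList("Hello world", "One two three", ""));
        System.out.println(c.segments + " segments, " + c.words + " words");
    }
}
```

Compared with forEach-ing over a parallel stream and updating shared totals, a Collector of this shape keeps each thread's partial result private until the merge step, which avoids synchronization on the accumulator.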

Discussion

  • Aaron Madlon-Kay

    I also tried valiantly to parallelize the fuzzy match calculations (FindMatches.java), but the result was slightly slower than the current implementation, except when doing "separate segment matching" (when the project is not segmented and we fuzzy match subsegments of a segment), which was 10–20% faster.

    Overall it didn't seem worthwhile, so I have not committed the changes to trunk. However, the code is available in git on the branch topic/aaron/parallel-matching.

    Why it was slower:

    Our fuzzy matching algorithm is stateful in that it keeps a running list of the five best matches encountered so far. An expensive calculation (tokenization + Levenshtein distance) is performed three times (stemmed, not stemmed, verbatim) on each candidate string, but by comparing against the list we can give up early and skip some of the calculations.

    When run in parallel, each thread has a list of the best matches that thread has encountered so far, but that list does not represent the global best matches, so each thread gives up early less often and thus does more of the expensive calculations. This, plus the parallelization overhead, is probably what makes it slower overall.
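The effect can be reproduced with a small sketch. This is not the FindMatches.java algorithm: the class EarlyOutSketch, the length-difference lower bound used as the early-out test, and the sample data are all assumptions made for illustration, and only one Levenshtein calculation is done per candidate rather than three. Per-thread lists are simulated by processing fixed chunks independently, which keeps the result deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class EarlyOutSketch {
    static final int MAX_MATCHES = 5;

    /** Counts how often the expensive calculation actually ran. */
    static int expensiveCalls = 0;

    /** Plain Levenshtein distance; stands in for the expensive calculation. */
    static int levenshtein(String a, String b) {
        expensiveCalls++;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Keeps the MAX_MATCHES smallest distances, skipping hopeless candidates. */
    static PriorityQueue<Integer> bestDistances(String query, List<String> candidates) {
        // Max-heap: peek() is the worst distance currently kept.
        PriorityQueue<Integer> best = new PriorityQueue<>(Collections.<Integer>reverseOrder());
        for (String c : candidates) {
            // The distance can never be smaller than the length difference.
            int lowerBound = Math.abs(query.length() - c.length());
            if (best.size() == MAX_MATCHES && lowerBound >= best.peek()) {
                continue; // early-out: cannot beat the worst kept match
            }
            int d = levenshtein(query, c);
            if (best.size() < MAX_MATCHES) {
                best.add(d);
            } else if (d < best.peek()) {
                best.poll();
                best.add(d);
            }
        }
        return best;
    }

    static int callsSequential(String query, List<String> candidates) {
        expensiveCalls = 0;
        bestDistances(query, candidates);
        return expensiveCalls;
    }

    /** Simulates per-thread lists: each chunk starts with an empty best list. */
    static int callsChunked(String query, List<String> candidates, int chunks) {
        expensiveCalls = 0;
        int size = (candidates.size() + chunks - 1) / chunks;
        for (int start = 0; start < candidates.size(); start += size) {
            int end = Math.min(start + size, candidates.size());
            bestDistances(query, candidates.subList(start, end));
        }
        return expensiveCalls;
    }

    /** Five close candidates followed by seven long, unrelated ones. */
    static List<String> sampleCandidates() {
        List<String> list = new ArrayList<>(
                Arrays.asList("abcde", "abcdx", "abxde", "xbcde", "abcxx"));
        String unrelated = new String(new char[20]).replace('\0', 'x');
        for (int i = 0; i < 7; i++) list.add(unrelated);
        return list;
    }

    public static void main(String[] args) {
        List<String> candidates = sampleCandidates();
        System.out.println("sequential: " + callsSequential("abcde", candidates));
        System.out.println("chunked:    " + callsChunked("abcde", candidates, 2));
    }
}
```

With this sample data the sequential pass runs the expensive step 5 times, while the two-chunk pass runs it 11 times: the second chunk never sees the close matches, so its own best list is too weak to reject the unrelated candidates.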

     

    Last edit: Aaron Madlon-Kay 2016-07-07
  • Didier Briel

    Didier Briel - 2016-09-06
    • status: open-fixed --> closed-fixed
     
  • Didier Briel

    Didier Briel - 2016-09-06

    Implemented in the released version 4.0 of OmegaT.

    Didier

     
