
#136 Case sensitivity for segmentation rules

1.6
closed-fixed
None
8
2006-01-31
2005-12-30
Gabix
No

The default segmentation rules for English suggest no
segment break after "M.". Unfortunately, this rule
causes OmegaT (1.6 RC5) to join two sentences into one
segment if the first one ends with a word that in turn
ends with "m", even if it's NOT the "M." abbreviation.
For example, I translated a text with lots of usages of
"foam", many of them at sentence ends. As a result, I
got lots of segments containing 2-3 sentences and had to
move that segmentation rule to the end of the list, thus
in fact disabling it.

My suggestion is to make segmentation rules case
sensitive to avoid such situations.

Discussion

  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    I think the point here is to _not_ segment at (space)M.(space), but because the
    first space is omitted, the rule matches any string containing "m. ".

    Add the first space to the definition and I think you'll notice a difference. As
    for all the other definitions, the relevant context should be added at the
    beginning of the "before" string.
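
The effect of the leading space can be sketched in Java as follows. This is only an illustration, not OmegaT's actual code: the class `BeforeContextSketch` and the helper `endsWith` are hypothetical names, and the sketch assumes the "before" pattern is tested against the end of the text preceding a candidate break.

```java
import java.util.regex.Pattern;

final class BeforeContextSketch {
    // Hypothetical helper: true if the text before a candidate break
    // ends with the given "before" pattern.
    static boolean endsWith(String textBeforeBreak, String beforePattern, int flags) {
        return Pattern.compile("(?:" + beforePattern + ")\\z", flags)
                .matcher(textBeforeBreak)
                .find();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        // Without the leading space, a case-insensitive "M." also matches "foam."
        System.out.println(endsWith("Test the foam.", "M\\.", ci));    // true
        // With the leading whitespace, "foam." no longer matches...
        System.out.println(endsWith("Test the foam.", "\\sM\\.", ci)); // false
        // ...while a standalone "M." still does.
        System.out.println(endsWith("See Dr. M.", "\\sM\\.", ci));     // true
    }
}
```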

     
  • Gabix

    Gabix - 2005-12-30

    Logged In: YES
    user_id=1311251

    This also makes sense, but I think case sensitivity will
    make segmentation more flexible.

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    Maybe add a check box next to the rule. There is a flag in Java to switch case
    sensitivity so I suppose that would be easy enough.

    As for the rules as they are written now, they need to be checked one last time
    before stable release. I think what you mention should be taken care of as soon
    as possible.
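
A per-rule check box could map onto the standard `Pattern.CASE_INSENSITIVE` flag roughly like this. It is a minimal sketch under assumptions: `RuleFlagSketch` and `compileRule` are hypothetical names, and nothing here is taken from the OmegaT sources.

```java
import java.util.regex.Pattern;

final class RuleFlagSketch {
    // Hypothetical rule model: a regex plus the state of a
    // per-rule "case sensitive" check box.
    static Pattern compileRule(String regex, boolean caseSensitive) {
        int flags = caseSensitive
                ? 0
                : (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        Pattern strict = compileRule("M\\.", true);
        Pattern loose = compileRule("M\\.", false);
        System.out.println(strict.matcher("foam.").find()); // false
        System.out.println(loose.matcher("foam.").find());  // true
    }
}
```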

     
  • Gabix

    Gabix - 2005-12-30

    Logged In: YES
    user_id=1311251

    This also makes sense. I feel like Hoca Nasreddin :-) From
    a coding point of view, though, plain case sensitivity is
    simpler than a check box [you just change
    String.equalsIgnoreCase() to String.equals()]. It's up to
    the developers.

     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    Actually, it should not.
    And it does not.
    The test sentence "Test the foam. Here it is!" gets broken.

    So by default all rules are case-sensitive; if you *want*
    your rule to be case-insensitive, add the flag "(?i)"
    before the pattern (see the rule for not breaking on "e.g.").
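
The described behaviour can be tried out with a toy segmenter. This is a rough sketch, not OmegaT's implementation: the class `InlineFlagSketch`, the method `segment`, and the simplified break rule (whitespace after a full stop) are all assumptions made for illustration. Only the `(?i)` inline flag itself is standard `java.util.regex` syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineFlagSketch {
    // Simplified candidate break: whitespace that follows a full stop.
    private static final Pattern CANDIDATE = Pattern.compile("(?<=\\.)\\s+");

    // Splits at each candidate break unless the text before the break
    // ends with the exception ("before") pattern.
    static List<String> segment(String text, String exceptionBefore) {
        Pattern exception = Pattern.compile("(?:" + exceptionBefore + ")\\z");
        List<String> segments = new ArrayList<>();
        int start = 0;
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String before = text.substring(0, m.start());
            if (!exception.matcher(before).find()) {
                segments.add(text.substring(start, m.start()));
                start = m.end();
            }
        }
        if (start < text.length()) {
            segments.add(text.substring(start));
        }
        return segments;
    }

    public static void main(String[] args) {
        String test = "Test the foam. Here it is!";
        // Case-sensitive rule: "foam." is not "M.", so the text is split.
        System.out.println(segment(test, "M\\."));     // [Test the foam., Here it is!]
        // With the inline (?i) flag the rule also matches "m.", so no split.
        System.out.println(segment(test, "(?i)M\\.")); // [Test the foam. Here it is!]
    }
}
```

With a case-insensitive exception such as `(?i)e\\.g\\.`, both "e.g." and "E.g." would suppress the break, which is presumably why that default rule carries the flag.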

     
  • Maxym Mykhalchuk

    • assigned_to: nobody --> mihmax
    • status: open --> closed-works-for-me
     
  • Gabix

    Gabix - 2006-01-16

    Screenshot of segmentation with default rules

     
  • Gabix

    Gabix - 2006-01-16

    Logged In: YES
    user_id=1311251

    I'm sorry, but I have to reopen this. I attach two
    screenshots with a real-world example (not just some toy).
    The first one with the default rules -- you'll see three
    sentences grouped into one segment. The second one is with
    the "M." rule brought to the end of the list and thus
    disabled -- you'll see that each sentence now becomes a
    separate segment.

     
  • Gabix

    Gabix - 2006-01-16
    • status: closed-works-for-me --> open-works-for-me
     
  • Gabix

    Gabix - 2006-01-16

    Logged In: YES
    user_id=1311251

    Here's the second screenshot with "M." actually disabled.

     
  • Maxym Mykhalchuk

    • status: open-works-for-me --> open-accepted
     
  • Maxym Mykhalchuk

    • priority: 5 --> 8
     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    Yes, I confirm this is a bug, at least I can reproduce it.

     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    Sorry that I previously reported this as working...
    I'm actually fixing it now, so the fix will appear in
    1.6.RC6 (to be released soon).

     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    fixed in 1.6.RC6, closing...

     
  • Maxym Mykhalchuk

    • status: open-accepted --> closed-fixed
     
