#136 Case sensitivity for segmentation rules

Milestone: 1.6

Status: closed-fixed

Owner: Maxym Mykhalchuk

Labels: None

Priority: 8

Updated: 2006-01-31

Created: 2005-12-30

Creator: Gabix

Private: No

The default segmentation rules for English suggest no
segment break after "M.". Unfortunately, this rule
makes OmegaT (1.6 RC5) join two sentences into one
segment whenever the first one ends with a word that
ends in "m", even if it's NOT the "M." abbreviation.
For example, I translated a text with many occurrences
of "foam", many of them at the ends of sentences. As a
result, I got lots of segments containing 2-3 sentences
and had to move that segmentation rule to the end of
the list, in effect disabling it.

My suggestion is to make segmentation rules case
sensitive to avoid such situations.

Discussion

Jean-Christophe Helary - 2005-12-30

Logged In: YES
user_id=915082

I think the point here is to _not_ segment at (space)M.(space), but because the
first space is omitted, the rule matches any string containing "m. ".

Add the first space to the definition and I think you'll notice a difference. The
same goes for all the other definitions: the relevant context should be added
at the beginning of the "before" string.
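
A minimal sketch of the suggestion above, assuming (as the report implies) that the "before" pattern is being matched without regard to case. The class and the helper `endsWith` are hypothetical and are not taken from the OmegaT sources; they only show how a leading whitespace in the pattern changes what matches.

```java
import java.util.regex.Pattern;

class BeforeContextSketch {
    // Hypothetical helper: true if the text to the left of a candidate
    // break ends with the rule's "before" pattern.
    static boolean endsWith(String beforeRegex, String leftText) {
        return Pattern.compile("(?:" + beforeRegex + ")$").matcher(leftText).find();
    }

    public static void main(String[] args) {
        // Assumption: case-insensitive matching, simulated here with (?i).
        // Without the leading space, "foam." looks like the "M." abbreviation.
        System.out.println(endsWith("(?i)M\\.", "Test the foam."));    // true
        // With the leading whitespace, only a standalone "M." matches.
        System.out.println(endsWith("(?i)\\sM\\.", "Test the foam.")); // false
        System.out.println(endsWith("(?i)\\sM\\.", "Dear M."));        // true
    }
}
```

One limitation of this form: a "before" pattern that starts with `\s` will not match an "M." at the very beginning of the text.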

Gabix - 2005-12-30

Logged In: YES
user_id=1311251

This also makes sense, but I think case sensitivity will
make segmentation more flexible.

Jean-Christophe Helary - 2005-12-30

Logged In: YES
user_id=915082

Maybe add a check box next to the rule. There is a flag in Java to switch case
sensitivity so I suppose that would be easy enough.

As for the rules as they are written now, they need to be checked one last time
before stable release. I think what you mention should be taken care of as soon
as possible.
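
For illustration, the Java flag mentioned above is presumably `Pattern.CASE_INSENSITIVE`. The sketch below shows how a per-rule check box could map onto it; `compileRule` and its `caseSensitive` parameter are hypothetical names, not part of OmegaT.

```java
import java.util.regex.Pattern;

class RuleFlagSketch {
    // Hypothetical: compile a rule's pattern, honoring a per-rule
    // "case sensitive" check box.
    static Pattern compileRule(String regex, boolean caseSensitive) {
        int flags = caseSensitive
                ? 0
                : (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        String text = "Test the foam. Here it is!";
        // Box checked: "M\." does not match the lowercase "m." in "foam.".
        System.out.println(compileRule("M\\.", true).matcher(text).find());  // false
        // Box unchecked: the same rule now matches "m." as well.
        System.out.println(compileRule("M\\.", false).matcher(text).find()); // true
    }
}
```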

Gabix - 2005-12-30

Logged In: YES
user_id=1311251

This also makes sense. I feel like Hoca Nasreddin :-) But
from a coding point of view, plain case sensitivity is
simpler than a check box [you just change
String.equalsIgnoreCase() to String.equals()]. Still, it's up to
the developers.

Maxym Mykhalchuk - 2006-01-13

Logged In: YES
user_id=488500

Actually, it should not.
And it does not.
The test sentence "Test the foam. Here it is!" gets broken into two segments.

So by default all rules are case-sensitive. If you *want*
your rule to be case-insensitive, add the flag "(?i)"
before the pattern (see the rule for not breaking on "e.g.").
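
The behaviour described here matches how `java.util.regex` treats patterns in general: matching is case-sensitive unless the embedded flag `(?i)` is present. A small sketch, where the helper `found` and the sample strings other than the test sentence are assumptions made for the example:

```java
import java.util.regex.Pattern;

class InlineFlagSketch {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Case-sensitive by default: "M\." does not match "foam.".
        System.out.println(found("M\\.", "Test the foam. Here it is!")); // false
        // Without the flag, "e\.g\." misses the capitalized form.
        System.out.println(found("e\\.g\\.", "E.g. this one"));          // false
        // With (?i) in front, both "e.g." and "E.g." match.
        System.out.println(found("(?i)e\\.g\\.", "E.g. this one"));      // true
    }
}
```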

Maxym Mykhalchuk - 2006-01-13

assigned_to: nobody --> mihmax

status: open --> closed-works-for-me

Gabix - 2006-01-16

Screenshot of segmentation with default rules

default_segmentation_rules.png

Gabix - 2006-01-16

Logged In: YES
user_id=1311251

I'm sorry, but I have to reopen this. I attach two
screenshots with a real-world example (not just some toy).
The first one with the default rules -- you'll see three
sentences grouped into one segment. The second one is with
the "M." rule brought to the end of the list and thus
disabled -- you'll see that each sentence now becomes a
separate segment.

Gabix - 2006-01-16

status: closed-works-for-me --> open-works-for-me

Gabix - 2006-01-16

Logged In: YES
user_id=1311251

Here's the second screenshot with "M." actually disabled.

Maxym Mykhalchuk - 2006-01-25

status: open-works-for-me --> open-accepted

Maxym Mykhalchuk - 2006-01-25

priority: 5 --> 8

Maxym Mykhalchuk - 2006-01-25

Logged In: YES
user_id=488500

Yes, I confirm this is a bug; at least I can reproduce it.

Maxym Mykhalchuk - 2006-01-25

Logged In: YES
user_id=488500

Sorry that I previously reported that it works.
I'm fixing it now, so the fix will appear in 1.6.RC6 (to be
released soon).

Maxym Mykhalchuk - 2006-01-31

Logged In: YES
user_id=488500

Fixed in 1.6.RC6, closing...

Maxym Mykhalchuk - 2006-01-31

status: open-accepted --> closed-fixed