The default segmentation rules for English specify no
segment break after "M.". Unfortunately, this rule
makes OmegaT (1.6 RC5) join two sentences into one
segment whenever the first one ends with a word that
ends in "m", even if it's NOT the "M."
abbreviation. For example, I translated a text with
many occurrences of "foam", and accordingly many of
them at the ends of sentences. As a result, I got lots of
segments containing 2-3 sentences and had to move that
segmentation rule to the end of the list, in effect
disabling it.
My suggestion is to make segmentation rules
case-sensitive to avoid such situations.
Logged In: YES
user_id=915082
I think the point here is to _not_ segment at (space)M.(space), but because the first
space is omitted, the rule matches any string containing "m. ".
Add the first space to the definition and I think you'll notice a difference. The same
goes for all the other definitions: the relevant context should be added
at the beginning of the "before" string.
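
To illustrate the leading-space suggestion, here is a minimal sketch using plain java.util.regex. The class name, the two patterns, and the endsWith helper are hypothetical and not taken from OmegaT's rule set; the sketch also assumes case-insensitive matching, since that is the behaviour being described.

```java
import java.util.regex.Pattern;

class LeadingSpaceDemo {
    // Hypothetical "before" patterns, anchored to the end of the text that
    // precedes a candidate break. Case-insensitive matching is an assumption.
    static final Pattern WITHOUT_SPACE =
            Pattern.compile("M\\.$", Pattern.CASE_INSENSITIVE);
    static final Pattern WITH_SPACE =
            Pattern.compile("\\sM\\.$", Pattern.CASE_INSENSITIVE);

    // True if the text before the candidate break ends with the pattern,
    // i.e. the exception rule would suppress the break.
    static boolean endsWith(Pattern p, String textBeforeBreak) {
        return p.matcher(textBeforeBreak).find();
    }

    public static void main(String[] args) {
        // Without the leading space, "foam." is mistaken for the abbreviation.
        System.out.println(endsWith(WITHOUT_SPACE, "Test the foam."));
        // With the leading space, only a standalone "M." matches.
        System.out.println(endsWith(WITH_SPACE, "Test the foam."));
        System.out.println(endsWith(WITH_SPACE, "Ask M."));
    }
}
```

Note that a pattern requiring a leading space would miss an "M." at the very start of the text, so a real rule might need to allow for that case as well.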
Logged In: YES
user_id=1311251
This also makes sense, but I think case sensitivity will
make segmentation more flexible.
Logged In: YES
user_id=915082
Maybe add a check box next to the rule. There is a flag in Java to switch case
sensitivity so I suppose that would be easy enough.
As for the rules as they are written now, they need to be checked one last time
before the stable release. I think what you mention should be taken care of as soon
as possible.
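
A per-rule check box could map onto the standard Pattern.CASE_INSENSITIVE flag roughly as follows. This is only a sketch: the class, the compileRule helper, and the caseSensitive parameter are hypothetical, not OmegaT code.

```java
import java.util.regex.Pattern;

class RuleFlagDemo {
    // Hypothetical helper: compiles a rule's regex, honouring a per-rule
    // "case sensitive" setting such as a check box might provide.
    static Pattern compileRule(String regex, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        String textBeforeBreak = "Test the foam.";
        // Case-sensitive: the lowercase "m." does not match the rule.
        System.out.println(compileRule("M\\.$", true).matcher(textBeforeBreak).find());
        // Case-insensitive: the same rule now matches "m." as well.
        System.out.println(compileRule("M\\.$", false).matcher(textBeforeBreak).find());
    }
}
```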
Logged In: YES
user_id=1311251
This also makes sense. I feel like Hoca Nasreddin :-) From
a coding point of view, though, simple case sensitivity is
simpler than a check box [you just change
String.equalsIgnoreCase() to String.equals()]; it's up to
the developers.
Logged In: YES
user_id=488500
Actually, it should not.
And it does not.
The test sentence "Test the foam. Here it is!" gets split into
two segments. By default all rules are case-sensitive; if you
*want* your rule to be case-insensitive, add the flag "(?i)"
before the pattern (see the rule for not breaking on "e.g.").
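
The effect of the "(?i)" prefix can be checked with plain java.util.regex. The sketch below is illustrative only: the patterns and the suppressesBreak helper are hypothetical and are not copied from the default rule set.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    // Hypothetical helper: true if the "before" pattern matches the text
    // preceding a candidate break, i.e. the rule would suppress the break.
    static boolean suppressesBreak(String beforePattern, String textBeforeBreak) {
        return Pattern.compile(beforePattern).matcher(textBeforeBreak).find();
    }

    public static void main(String[] args) {
        // Java patterns are case-sensitive unless told otherwise.
        System.out.println(suppressesBreak("M\\.$", "Test the foam."));
        System.out.println(suppressesBreak("M\\.$", "Ask M."));
        // The embedded flag (?i) switches on case-insensitive matching.
        System.out.println(suppressesBreak("e\\.g\\.$", "See E.G."));
        System.out.println(suppressesBreak("(?i)e\\.g\\.$", "See E.G."));
    }
}
```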
Screenshot of segmentation with default rules
Logged In: YES
user_id=1311251
I'm sorry, but I have to reopen this. I attach two
screenshots with a real-world example (not just some toy).
The first one with the default rules -- you'll see three
sentences grouped into one segment. The second one is with
the "M." rule brought to the end of the list and thus
disabled -- you'll see that each sentence now becomes a
separate segment.
Logged In: YES
user_id=1311251
Here's the second screenshot with "M." actually disabled.
Logged In: YES
user_id=488500
Yes, I confirm this is a bug, at least I can reproduce it.
Logged In: YES
user_id=488500
Sorry that I previously reported that it works.
I'm fixing it now, so the fix will appear in 1.6.RC6 (to be
released soon).
Logged In: YES
user_id=488500
fixed in 1.6.RC6, closing...