Thread: sentence-level versus paragraph-level segmenting

Hi,
How much effort would it take to have OmegaT segment by sentence?
And maybe add a preference setting for sentence-level or paragraph-level
segmentation?
Cheerio,
Dierk

Hello Dierk,

In your letter written on 2004-10-05, 20:09, you wrote:

> How much effort would it take to have OmegaT segment by sentence?

There's an OpenOffice macro for that. Check out OmegaT's website, the
link to the macro is there.

HTH,

-- 
Take care,
Krzysiek Drozdowski

> There's an OpenOffice macro for that. Check out OmegaT's website, the link
> to the macro is there.

And a Tcl/Tk utility.

It would probably not be too much work to add the function within OmegaT. The
problem, though, is that sentence-level segmenting makes a segment split/merge
function necessary, whereas we can get by without one more easily with
paragraph-level segmenting. A segment split/merge function is, I think, a lot
harder to implement.

Suggestion: try using Ben's macros or my scripts, and make suggestions - they 
could serve as prototypes for an integral function.

Marc

Marc,
Thanks, I've used Ben's macros, and they work OK, but they are very
bare-bones and not customizable :-)
I found your Tcl/Tk kit, but not your scripts - are the algorithms the
same as in the macros?
In my experience and environment, sentence-level segmentation is a
must to get decent reusability, YMMV.
Cheerio,
Dierk

Hi Dierk,

My Tcl/Tk script for sentence-level segmenting ("Sentseg") is in the forum 
library. Access the forum from the web, then Files > 3-OmegaTk > 
sentseg-0.0.6.zip.

The algorithm is ultra-simple: Sentseg searches for any lower-case character, 
followed by a full stop, question mark or exclamation mark, followed by any 
number of spaces, followed by any upper-case character. It inserts a dummy 
paragraph break (recognized by OmegaT, but not by OOo) before the upper-case 
character. That's it. 
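
The rule described above can be sketched in a few lines. This is a Python illustration, not the Sentseg Tcl/Tk code itself. The `DUMMY_BREAK` string and the `segment` function are hypothetical names, since the actual marker Sentseg writes is not given in this thread, and the character classes are assumed to be plain ASCII.

```python
import re

# Hypothetical stand-in for the dummy paragraph break; the marker that
# Sentseg actually inserts is not specified in this thread.
DUMMY_BREAK = "<dummy-break/>"

# A lower-case letter, then one of . ? !, then any number of spaces,
# then an upper-case letter (ASCII only, as a simplifying assumption).
SENTENCE_BOUNDARY = re.compile(r"([a-z][.?!])( *)([A-Z])")


def segment(text, marker=DUMMY_BREAK):
    """Insert `marker` immediately before the upper-case character."""
    return SENTENCE_BOUNDARY.sub(
        lambda m: m.group(1) + m.group(2) + marker + m.group(3), text
    )


print(segment("It works. Next one! And more? Yes."))
```

A rule this simple also splits after abbreviations: `segment("See Dr. Smith.")` places a break before "Smith", which is the kind of case an abbreviation list would have to catch.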

Ben's algorithm is much the same, IIRC. 

I'm refining it as and when I have time. (Version 0.0.5 added the lower-case 
character element, version 0.0.6 the alternative punctuation marks ? and !). 
The next modification I have in mind will be handling of a list of 
abbreviations. 

A few points to note about Sentseg:

- You need the Java JDK (or SDK), not just the JRE. This is needed in order to 
extract and replace the file content.xml from the OOo zip archive. You need 
to edit the Tcl/Tk script with the paths of the extraction utility (not 
difficult, details are in the readme file supplied with the script). It 
doesn't actually have to be the JDK; any command-line zip utility (such as 
pkzip & pkunzip) should do. I opted for the JDK because it's available on 
several platforms, and because being Java, it was the solution OmegaT users 
were most likely to be familiar with, purely with regard to installation.
- You need Tcl/Tk itself. This is easy to obtain and install.
- Besides editing the zip/unzip line, you'll also need to edit the script if
you use Windows. Tcl/Tk is essentially cross-platform, but Samuel Murray
informs me that the directory handling mechanism in my script didn't work on
his Windows machine. If you decide to try Sentseg on Windows, get back to me
and I'll help you with the changes you need to make. It's not difficult, but
I don't have a Windows machine so I can't make the changes myself.
- The regular expression syntax in Tcl/Tk was changed with version 8.1. Most
of the changes were backward-compatible enhancements, but there were a
couple of changes that might throw you if you are using documentation for an
earlier version, so my advice is to make sure that both the version and the
documentation you are using are 8.1 or later.
- A "desegmenter" script, which removes the dummy paragraph break from the OOo 
file, is included in the package, but it isn't actually needed, since opening 
the file and saving it in OOo does the same thing.
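
For readers who want to see what the extract-and-replace step amounts to, here is a minimal sketch. It swaps in Python's standard `zipfile` module for the JDK or pkzip tools mentioned above, so it is not how Sentseg itself does it. The function name `rewrite_content_xml` and the `transform` callback are hypothetical, and content.xml is assumed to be UTF-8.

```python
import zipfile


def rewrite_content_xml(src_path, dst_path, transform):
    """Copy an OOo zip archive, passing content.xml through `transform`.

    `transform` is any function that takes the XML as a string and returns
    the modified string (for example, a sentence segmenter). All other
    archive members are copied unchanged and in their original order.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                # Assumption: content.xml is encoded as UTF-8.
                data = transform(data.decode("utf-8")).encode("utf-8")
            # Reusing the original ZipInfo keeps each member's name and
            # compression method.
            dst.writestr(item, data)
```

Called as `rewrite_content_xml("in.sxw", "out.sxw", some_function)` (file names are only examples), this writes a new archive and leaves the original untouched, so a faulty transform cannot damage the source document.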

> In my experience and environment, sentence-level segmentation is a must to
> get decent reusability, YMMV.

The texts I translate don't tend to be that repetitive; I use OmegaT mainly 
for the "Find" function. I've been using Sentseg routinely for several months 
now. I prefer sentence-level segmenting not because of the greater 
reusability, but because some of my texts have very long paragraphs that are 
tiresome to handle at once. In fact, OmegaT often fails to find existing 
high-similarity matches for very short segments, but I presume this is 
related to the bug Maxym is currently working on. 

As I mentioned earlier, the big issue with sentence-level segmenting is the 
greater need for a merge/split facility.

Marc

A couple more points on Sentseg, now that I've started playing around with it 
again:

- It only works on OOo files, but it could easily be adapted to work on (say) 
HTML files. In fact, it would probably be a lot simpler on HTML, since there 
would be no need for the zip extraction routine; I don't know what a dummy 
paragraph marker element would look like in HTML, though.
- The "any number of spaces" part of the regular expression appears to be 
superfluous, since as far as I can see, OmegaT replaces multiple spaces with 
a tag.

Marc

Marc Prior wrote:
> I don't know what a dummy paragraph marker element would look like in
> HTML, though.

You could use the "class" attribute: <p class="OmegaT-dummy">, or 
something like that.
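
A rough sketch of how that might look, in Python: the sentence boundary closes the current paragraph and opens a new one carrying the class, and desegmenting simply removes that pair again. The function names are hypothetical, the class name is just the one suggested above, and the regular expression is naive (ASCII only, and it does not check whether a match falls inside a tag or attribute).

```python
import re

# Assumed marker: close the paragraph and open a dummy one at each boundary.
HTML_BREAK = '</p><p class="OmegaT-dummy">'

# Lower-case letter, then . ? or !, optional spaces, then an upper-case letter.
BOUNDARY = re.compile(r"([a-z][.?!] *)([A-Z])")


def segment_html(html):
    """Insert the dummy paragraph break before each sentence-initial capital."""
    return BOUNDARY.sub(lambda m: m.group(1) + HTML_BREAK + m.group(2), html)


def desegment_html(html):
    """Remove the dummy paragraph breaks, restoring the original paragraphs."""
    return html.replace(HTML_BREAK, "")


original = "<p>First one. Second one.</p>"
segmented = segment_html(original)
print(segmented)
print(desegment_html(segmented) == original)
```

Unlike the OOo case, the dummy paragraphs would be visible to a browser, so the desegmenting step would actually be needed here.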

Henry

Ah, found it.  Unfortunately, can't download it at work (proxy issues)
- could you e-mail it to me, please?
Glad to see that it's being actively maintained :-)
I'm on Windows at work and want to try it out, any help is appreciated.
Maybe if more people see the need for sentence segmentation they might
ask Maxym to look into it :-)
Thanks,
Dierk

Dierk Seeburg wrote:

> How much effort would it take to have OmegaT segment by sentence?

A detailed technical and non-technical answer to this question would be 
an excellent candidate for inclusion in the FAQ.

Hi Dierk,

Well, to tell the truth, I don't know.
I will take a look at it after the 1.4.4 release.
You can monitor
http://sf.net/tracker/index.php?func=detail&aid=1053692&group_id=68187&atid=520350
to be informed when I evaluate the issue.

ciao
Maxym
