|
From: Dierk S. <dierk_seeburg@...> - 2004-10-05 18:09:27
|
Hi,
What amount of effort would it take to have OmegaT segment by sentence? And
maybe create a preference setting for sentence-level or paragraph-level
segmentation?
Cheerio,
Dierk
|
|
From: Krzysiek D. <dwukwiat72@...> - 2004-10-05 19:51:04
|
Hello Dierk,
In your letter written on 2004-10-05, 20:09, you wrote:
> What amount of effort would it take to have OmegaT segment by sentence?
There's an OpenOffice macro for that. Check out OmegaT's website; the link to
the macro is there.
HTH,
--
Take care,
Krzysiek Drozdowski
|
|
From: Marc P. <mail@...> - 2004-10-05 20:03:10
|
> There's an OpenOffice macro for that. Check out OmegaT's website, the link
> to the macro is there.
And a Tcl/Tk utility.
It would probably not be too much work to add the function within OmegaT; the
problem, though, is that sentence-level segmenting makes a segment split/merge
function necessary, whereas we can get by without it more easily with
paragraph-level segmenting. A segment split/merge function is, I think, a lot
harder to implement.
Suggestion: try using Ben's macros or my scripts, and make suggestions - they
could serve as prototypes for an integral function.
Marc
|
|
From: dierkseeburg <dierk_seeburg@...> - 2004-10-24 07:32:48
|
Marc,
Thanks, I've used Ben's macros, and they work OK, but they are very bare-bones
and not customizable :-)
I found your Tcl/Tk kit, but not your scripts - are the algorithms the same as
in the macros?
In my experience and environment, sentence-level segmentation is a must to get
decent reusability, YMMV.
Cheerio,
Dierk
|
|
From: Marc P. <mail@...> - 2004-10-25 06:50:33
|
Hi Dierk,
My Tcl/Tk script for sentence-level segmenting ("Sentseg") is in the forum
library. Access the forum from the web, then Files > 3-OmegaTk >
sentseg-0.0.6.zip.
The algorithm is ultra-simple: Sentseg searches for any lower-case character,
followed by a full stop, question mark or exclamation mark, followed by any
number of spaces, followed by any upper-case character. It inserts a dummy
paragraph break (recognized by OmegaT, but not by OOo) before the upper-case
character. That's it.
Ben's algorithm is much the same, IIRC.
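As a rough illustration of the rule described above, here is a minimal Python
sketch (not the Tcl/Tk code in Sentseg itself). The marker string DUMMY_BREAK
and the function name segment are made up for the example; the actual dummy
paragraph break written into the OOo file is not shown in this thread. The
character classes are ASCII-only, which is a simplification.

```python
import re

# Hypothetical placeholder for the dummy paragraph break; the real markup
# that Sentseg inserts is not given in this thread.
DUMMY_BREAK = "<BREAK>"

# A lower-case letter, then a full stop, question mark or exclamation mark,
# then any number of spaces, then an upper-case letter. The upper-case
# letter is captured separately so the break can be placed in front of it.
BOUNDARY = re.compile(r"([a-z][.?!] *)([A-Z])")


def segment(text: str) -> str:
    """Insert DUMMY_BREAK before each upper-case letter that follows
    sentence-ending punctuation."""
    return BOUNDARY.sub(lambda m: m.group(1) + DUMMY_BREAK + m.group(2), text)


print(segment("It works. Does it? Yes! ok. fine"))
# The rule has no knowledge of abbreviations, so this is split as well:
print(segment("See Dr. Smith"))
```

The second call shows why a list of abbreviations is the obvious next
refinement: "Dr." followed by a capitalized name satisfies the same pattern
as a genuine sentence boundary.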
I'm refining it as and when I have time. (Version 0.0.5 added the lower-case
character element, version 0.0.6 the alternative punctuation marks ? and !).
The next modification I have in mind will be handling of a list of
abbreviations.
A few points to note about Sentseg:
- You need the Java JDK (or SDK), not just the JRE. This is needed in order to
extract and replace the file content.xml from the OOo zip archive. You need
to edit the Tcl/Tk script with the paths of the extraction utility (not
difficult, details are in the readme file supplied with the script). It
doesn't actually have to be the JDK; any command-line zip utility (such as
pkzip & pkunzip) should do. I opted for the JDK because it's available on
several platforms, and because being Java, it was the solution OmegaT users
were most likely to be familiar with, purely with regard to installation.
- You need Tcl/Tk itself. This is easy to obtain and install.
- Besides editing the zip/unzip line, you'll also need to edit the script if
you use Windows. Tcl/Tk is essentially cross-platform, but Samuel Murray
informs me that the directory handling mechanism in my script didn't work on
his Windows machine. If you decide to try Sentseg on Windows, get back to me
and I'll help you with the changes you need to make. It's not difficult, but
I don't have a Windows machine so I can't make the changes myself.
- The regular expression syntax in Tcl/Tk was changed with version 8.1. Most
of the changes were backward-compatible enhancements, but a couple of them
might throw you if you are using documentation for an earlier version, so my
advice is to make sure that both the Tcl/Tk version and the documentation you
are using are 8.1 or later.
- A "desegmenter" script, which removes the dummy paragraph break from the OOo
file, is included in the package, but it isn't actually needed, since opening
the file and saving it in OOo does the same thing.
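The extract-and-replace step in the first point can be sketched with any zip
tool. The following uses Python's standard zipfile module rather than the JDK
utility that Sentseg calls, so it is an illustration of the idea, not of the
script. The function name rewrite_content and the transform parameter are
assumptions made for the example, and content.xml is assumed to be UTF-8.

```python
import zipfile
from typing import Callable


def rewrite_content(src_path: str, dst_path: str,
                    transform: Callable[[str], str]) -> None:
    """Copy an OOo archive to dst_path, passing content.xml through
    `transform` and leaving every other entry unchanged."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w") as dst:
        # Entries are copied in their original order, so a leading
        # "mimetype" entry stays first in the new archive.
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                data = transform(data.decode("utf-8")).encode("utf-8")
            # Passing the ZipInfo keeps each entry's name and compression type.
            dst.writestr(item, data)
```

Writing to a separate output file, instead of modifying the archive in place,
keeps the original document intact if the transformation goes wrong.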
> In my experience and environment, sentence-level segmentation is a must to
> get decent reusability, YMMV.
The texts I translate don't tend to be that repetitive; I use OmegaT mainly
for the "Find" function. I've been using Sentseg routinely for several months
now. I prefer sentence-level segmenting not because of the greater
reusability, but because some of my texts have very long paragraphs that are
tiresome to handle at once. In fact, OmegaT often fails to find existing
high-similarity matches for very short segments, but I presume this is
related to the bug Maxym is currently working on.
As I mentioned earlier, the big issue with sentence-level segmenting is the
greater need for a merge/split facility.
Marc
|
|
From: Marc P. <mail@...> - 2004-10-25 07:07:51
|
A couple more points on Sentseg, now that I've started playing around with it
again:
- It only works on OOo files, but it could easily be adapted to work on (say)
HTML files. In fact, it would probably be a lot simpler on HTML, since there
would be no need for the zip extraction routine; I don't know what a dummy
paragraph marker element would look like in HTML, though.
- The "any number of spaces" part of the regular expression appears to be
superfluous, since as far as I can see, OmegaT replaces multiple spaces with a
tag.
Marc
|
|
From: Henry P. <henry.pijffers@...> - 2004-10-25 07:37:18
|
Marc Prior wrote:
> I don't know what a dummy
> paragraph marker element would look like in HTML, though.
You could use the "class" attribute: <p class="OmegaT-dummy">, or something
like that.
Henry
|
|
From: dierkseeburg <dierk_seeburg@...> - 2004-10-26 20:33:04
|
Ah, found it. Unfortunately, can't download it at work (proxy issues)
- could you e-mail it to me, please?
Glad to see that it's being actively maintained :-)
I'm on Windows at work and want to try it out, any help is appreciated.
Maybe if more people see the need for sentence segmentation they might
ask Maxym to look into it :-)
Thanks,
Dierk
|
|
From: Samuel M. <leuce@...> - 2004-10-10 07:27:43
|
Dierk Seeburg wrote:
> What amount of effort would it take to have OmegaT segment by sentence?
A detailed technical and non-technical answer to this question would be an
excellent candidate for inclusion in the FAQ.
|
|
From: Maxym M. <mihmax@...> - 2004-10-25 12:42:17
|
Hi Dierk,
Well, I don't know, to tell the truth. I will take a look at it after the
1.4.4 release.
You can monitor
http://sf.net/tracker/index.php?func=detail&aid=1053692&group_id=68187&atid=520350
to be informed when I evaluate the issue.
ciao
Maxym
--- Dierk Seeburg <dierk_seeburg@...> wrote:
> Hi,
> What amount of effort would it take to have OmegaT segment by sentence?
> And maybe create a preference setting for sentence-level or paragraph-level
> segmentation?
> Cheerio,
> Dierk
|