From: Per T. <per...@op...> - 2012-09-23 14:47:36
|
Hi,
existing one-to-many translations I've run into by accident:

inte - ikke
icke - ikke

fru - kone
fru - frue
fru - fru

vid - bred
vid - rummelig

stad - by
stad - stad

New ones I've happened to add:

utdelning - fordeling (usage: distribution of e.g. bread, is this correct?)
utdelning - uddeling (usage: share - information technology)

fungera - fungere (I don't know the difference in Danish)
fungera - virke

godkännande - accept (normal usage)
accept - accept (usage: finance/economy and legal)

And there are a lot more you can find by searching for RL and LR.

I have some more questions below.

Yours,
Per Tunedal

On Wed, Sep 12, 2012, at 16:24, Francis Tyers wrote:
> On Wed 12 Sep 2012 at 16:04 +0200, Per Tunedal wrote:
> > Hi again,
> > there are already some double translations in the pair Swedish (sv) - Danish (da).
> > (In the new direction, from Danish to Swedish.) I cannot remove them, can I?
> > Or should I comment them out?
>
> If they are valid translations, leave them. I've updated the modes file
> to use the -b mode, which in the case of ambiguity picks the first
> translation.

Seems like a convenient way to treat the problem. Which one is "the first"? Simply the one that comes first in the bidix, reading from top to bottom, or something else?

> Also, could you give examples? It's easier to work things out with
> examples...

See above.

> > How should I prepare for one-to-many translations? Or at least not put
> > any obstacles in the way of using Francis Tyers' new solution.
>
> Leave them in.

OK. I might change the order, though, if that would improve the translations. Should I remove the original r="RL" and r="LR"? And the ones I've added?

> > I don't fully understand what you wrote, Francis:
> > > > Regarding the bilingual dictionaries, if you would like to use the
> > > > module in the future (please use it, it's cool!), then the most
> > > > important things are:
> > > >
> > > > * add as many reliable translations as possible per word.
> > > > * do not use LR/RL for lexical selection -- only for grammatical stuff.
> > > > It is better to use i="yes", or slr="..." with the for translations
> > ------------------------------------------------------------------------
> > > > which you don't want to pick. If you use slr, you'll need the
> > > > lexchoicebil.xsl script.
> > > >
> > > > You can choose to mark the default translation or not. In the case that
> > > > you don't mark it, it can be learnt.
> >
> > I cannot use LR/RL in the wordlists? What can I do instead?
>
> You can use LR/RL, but not for translation divergence, just for
> grammatical divergence.
>
> > > > It is better to use i="yes", or slr="..." with the for translations
> >
> > with the for translations? With the what?
>
> Leave them in.

Please explain the tags i="yes" and slr="...". I'm curious by nature. That's my way of learning things. Sometimes it might lead to a new solution for an old problem.

> Fran
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Apertium-stuff mailing list
> Ape...@li...
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff |
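[For reference, the -b (biltrans) mode discussed above outputs every translation the bidix offers for a word, and downstream modules keep the first target listed. Roughly, for the "vid" example above (the exact tags here are illustrative, not taken from the sv-da pair):

```
^vid<adj>/bred<adj>/rummelig<adj>$
```

Which target comes first is determined by the compiled transducer, not by the textual order of the bidix entries, which is the point Fran makes in the next message.]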
From: Per T. <per...@op...> - 2012-09-24 12:10:25
|
Hi,
please see my questions below.
Yours,
Per Tunedal

On Sun, Sep 23, 2012, at 17:10, Francis Tyers wrote:
> On Sun 23 Sep 2012 at 16:47 +0200, Per Tunedal wrote:
--snip--
> > Seems like a convenient way to treat the problem.
> > Which one is "the first"? Simply the one that comes first in the bidix,
> > reading from top to bottom, or something else?
>
> It's not predictable from the order in the bidix because of the
> compilation process.

"the -b mode which in the case of ambiguity picks the first translation" Thus the result is unpredictable? Such a solution doesn't appeal to me.

> > OK. I might change the order, though, if that would improve the
> > translations.
>
> Changing the order would not change the translation.
>
> > Should I remove the original r="RL" and r="LR"? And the ones I've added?
>
> You can do.

OK, but which one of the translations will be used?

> > > > I don't fully understand what you wrote, Francis:
> > > > > > Regarding the bilingual dictionaries ...
--snip--
> > > > > > * do not use LR/RL for lexical selection -- only for grammatical stuff.
> > > > > > It is better to use i="yes", or slr="..." with the for translations
> > > > > > which you don't want to pick.

Is a word missing above? Do you mean something like: with the LINES/ENTRIES for translations which you don't want to pick?

> > > > > > If you use slr, you'll need the lexchoicebil.xsl script.

I haven't found any documentation of the lexchoicebil.xsl script in the Wiki.

--snip--

> > Please explain the tags i="yes" and slr="...". I'm curious by nature.
> > That's my way of learning things. Sometimes it might lead to a new
> > solution for an old problem.
>
> i="yes" means that when the dictionary is compiled, the entry is
> ignored.
>
> slr="" is a way of marking distinct "senses" for words. In your examples
> above, that's probably what I'd use.
>
> Fran

I suppose the tags should be within the <e> tag? Like this: <e slr="XXX">.

Senses? Is there any standard for that? Above I indicated domains; that's easy. Senses are a bit more complicated to explain briefly. Do I set codes for senses in the lexchoicebil.xsl script, or what?

Practical considerations:

1. Finally, when correcting the pair Swedish (sv) - Danish (da), what shall I do? Simply strip the RL and LR tags, or add an slr tag instead?

2. Obviously, the sme-nob solution <e slr="1">, and using a CG rule to pick the alternative when needed, is out of the question for the Swedish (sv) - Danish (da) pair. But what's the best solution for the "pair" Norwegian (nn/nb) - Swedish (sv)? I cannot entirely avoid the problem, as there are already two-to-one relations from the very start. I have to treat them when I meet them.
|
From: Francis T. <ft...@pr...> - 2012-09-24 12:24:16
|
On Mon 24 Sep 2012 at 14:10 +0200, Per Tunedal wrote:
> Hi,
> please see my questions below.
> Yours,
> Per Tunedal
--snip--
> "the -b mode which in the case of ambiguity picks the first
> translation" Thus the result is unpredictable? Such a solution doesn't
> appeal to me.

Yes. Then write a lexical selection rule, or use the lexchoicebil.xsl script.

> > > Should I remove the original r="RL" and r="LR"? And the ones I've added?
> >
> > You can do.
>
> OK, but which one of the translations will be used?

The first one as output by lt-proc -b.

--snip--

> Is a word missing above? Do you mean something like:
> with the LINES/ENTRIES for translations which you don't want to pick?
>
> I haven't found any documentation of the lexchoicebil.xsl script in the
> Wiki.

Grab it from one of the pairs that use it. Suggestions: br-fr, mk-en, sme-nob.

--snip--

> > > Please explain the tags i="yes" and slr="...". I'm curious by nature.
> > > That's my way of learning things. Sometimes it might lead to a new
> > > solution for an old problem.
> >
> > i="yes" means that when the dictionary is compiled, the entry is
> > ignored.
> >
> > slr="" is a way of marking distinct "senses" for words. In your examples
> > above, that's probably what I'd use.
> >
> > Fran
>
> I suppose the tags should be within the <e> tag? Like this: <e slr="XXX">.

Exactly.

> Senses? Is there any standard for that? Above I indicated domains;
> that's easy. Senses are a bit more complicated to explain briefly. Do I
> set codes for senses in the lexchoicebil.xsl script, or what?

No, just use numbers.

> Practical considerations:
>
> 1. Finally, when correcting the pair Swedish (sv) - Danish (da), what
> shall I do? Simply strip the RL and LR tags, or add an slr tag instead?

Depends on what the difference is between the translations, as I've said above.

> 2. Obviously, the sme-nob solution <e slr="1">, and using a CG rule to
> pick the alternative when needed, is out of the question for the Swedish
> (sv) - Danish (da) pair. But what's the best solution for the "pair"
> Norwegian (nn/nb) - Swedish (sv)?

That's the old solution, which we're not using any more. We're moving to using real SELECT/REMOVE rules, the interface being the same for apertium-lex-tools and CG: the output of lt-proc -b.

> I cannot entirely avoid the problem as there are already two-to-one
> relations from the very start. I have to treat them when I meet them.

Ok. You'd probably save yourself a lot of time if you came on IRC like everyone is asking. Things there take about 1/50th of the time to explain as on the mailing list.

Fran
|
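[To make the slr="..." discussion above concrete, bidix entries for the fru example might look roughly like this. This is a sketch only: the tag sets are simplified, and the exact convention for the sense labels (numbers, per Fran's advice) should be checked against a pair that already uses lexchoicebil.xsl, such as br-fr or sme-nob:

```xml
<!-- Hypothetical sv-da entries; slr marks distinct "senses" of "fru".
     Tags are simplified to <s n="n"/> for illustration. -->
<e slr="1"><p><l>fru<s n="n"/></l><r>kone<s n="n"/></r></p></e>
<e slr="2"><p><l>fru<s n="n"/></l><r>frue<s n="n"/></r></p></e>
<!-- i="yes": this entry is ignored when the dictionary is compiled. -->
<e i="yes"><p><l>fru<s n="n"/></l><r>fru<s n="n"/></r></p></e>
```

Whether slr="1" counts as the marked default, or no default is marked and one is learnt, is the choice Fran describes above.]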
From: Trosterud T. <tro...@ui...> - 2012-08-09 13:35:23
|
Kevin Brubeck Unhammer wrote on 9 Aug 2012 at 14:54:
> Francis Tyers <ft...@pr...> writes:
>> On Thu 09 Aug 2012 at 10:35 +0200, Per Tunedal wrote:
>>> I consider Apertium suitable for translating the pair Swedish - Norwegian

Yes.

>>> 3. You might use a level 1 translation (without constraint grammar),
>>> like the pair Swedish - Danish. In that case, you could make the
>>> translation usable for a wide audience by adding the pair to Apertium
>>> Caffeine and the new OmegaT plug-in.
>>
>> In any case there is no free constraint grammar of Swedish currently available.

The lack of a CG for Swedish is a problem. My suggestion would be to write one. To be a bit more specific: to write the 100-or-so rules needed for removing the gross majority, say 80(?)%, of the ambiguity.

> What you're describing is gisting/translation for understanding; I can't
> imagine gisting MT would be very useful for sv-nb/nn (and I suspect
> people would use Google for that anyway).

From the Norwegian side, we cannot imagine the need for a sv-nb/nn gisting system. The maximum help we would need is, in rare cases, a dictionary translating a small number of hard words.

How hard Norwegian is for Swedes is of course up to the Swedes to judge. But the competition will be between understanding the Norwegian text and understanding (sic) the MT output.

> But with these closely related
> languages, it's possible to get to a standard good enough for
> post-editing (pre-publishing), e.g. with OmegaT as you mentioned, and in
> that case the users definitely know which language it is already.

Yes, a production system (say, I want to translate a sv article to nn on Wikipedia) is a different matter. My experience from nn-nb translation is that the time saving from post-editing, as compared to rewriting/translating, lies around 80%. So yes, that can be a good idea. __But__ the nb-nn lexicon and orthographic principles are the same, so more often than not unknown words will come out as free rides. For sv-nn/nb that will __not__ be the case (to the same extent), since both vocabulary and orthography deviate more. So, fewer free rides for unknown words. This implies that the transfer lexicon must be __much__ bigger than the nb-nn one in order to get the same good results as we have for nb-nn. The good news is that the making of such an enlarged transfer lexicon can in part be done automatically, and then manually post-edited.

>> (3) You make the two translators in the one pair. For this, you could
>> have the same Swedish dictionary, but would need different nb and nn
>> dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb
>> and sv-nn transfer rules.

> (3) sounds best to me too.

I agree.

> Perhaps you could even do with one bidix, and
> just use the alt="nn" vs alt="nb" attribute; a rough and dirty count
> shows that the majority of entries in the nn-nb bidix carry over the
> same lemma/tag:

This could very well be the case, yes (cf. my experiences with free rides).

> That said, I would pick one first and get the system up and running,
> then expand to both later on.

This is also a possibility, yes. But the expansion to both languages should be taken into account in the setup phase.

> https://en.wikipedia.org/wiki/Language_identification
> Using a library like that makes it general (you can use it for lots of
> languages) and is a *lot* faster than translating everything twice (or
> thrice or …).

Yes, language identification.

> http://www.nb.no/spraakbanken/tilgjengelege-ressursar/tekstressursar has
> more frequency lists (they also taunt you with this enormous corpus, but
> it's currently "in beta", very messy, and best avoided for now).

The best resource is the NoWaC corpus; it also has frequency lists, both for lemmata and for word forms.

My final comment would be that the work will be

1. in the analysis/generation of Swedish
2. ... and in the bidix.

As for 1, we should look around in the Swedish language technology landscape and look for open resources, e.g. in Gothenburg (Aarne Ranta, also Språkbanken). As for 2, Lexin might be one resource. I am at Euralex in Oslo right now, and will ask around.

Trond.
|
From: <ke...@ke...> - 2012-08-09 17:55:32
|
On Thu, Aug 09, 2012 at 02:54:27PM +0200, Kevin Brubeck Unhammer wrote:
> Francis Tyers <ft...@pr...> writes:
>
> What you're describing is gisting/translation for understanding; I can't
> imagine gisting MT would be very useful for sv-nb/nn (and I suspect
> people would use Google for that anyway). But with these closely related
> languages, it's possible to get to a standard good enough for
> post-editing (pre-publishing), e.g. with OmegaT as you mentioned, and in
> that case the users definitely know which language it is already.
>
> > There are three possibilities.
> >
> > (1) You can make an sv-nb (or sv-nn) translator, and then include a
> > subset of the nn-nb translator in it, piping the output of sv-nb into
> > sv-nn. (Here you would have an sv-nb dictionary and an nb-nn dictionary.)
> >
> > (2) You make two translators in parallel.
> >
> > (3) You make the two translators in the one pair. For this, you could
> > have the same Swedish dictionary, but would need different nb and nn
> > dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb
> > and sv-nn transfer rules.
> >
> > I think that (3) is probably best, but would like input from others
> > (e.g. Unhammer or Trond).
>
> (3) sounds best to me too. Perhaps you could even do with one bidix, and
> just use the alt="nn" vs alt="nb" attribute; a rough and dirty count
> shows that the majority of entries in the nn-nb bidix carry over the
> same lemma/tag:
>
> $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1==$2' | wc -l
> 71628
> $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1!=$2' | wc -l
> 11365

As Danish is a kind of old Norwegian bokmål, maybe we could include that language too. Then all three languages could benefit from the combined work.

> >> B. I have looked in the repository and found that some work has been
> >> done on the following dictionaries:
> >>
> >> Danish (da) - Norwegian Bokmål (nb) - nursery
> >> Swedish (sv) - Norwegian Bokmål (nb) - incubator
> >>
> >> Tihomir told me he's working on Swedish-Icelandic and has expanded the
> >> Swedish monolingual dictionary from sv-da. But which is the most
> >> complete Norwegian Bokmål (nb) monolingual dictionary? The one from the
> >> pair Norwegian Bokmål (nb) - Norwegian Nynorsk (nn)?
> >
> > Yes, I would take the Swedish dictionary from sv-is and the Norwegian
> > dictionar(y,ies) from nn-nb.

I have also been working on Swedish nouns from SALDO. I was working on a scheme that could remove about 60% of the ambiguities.

Best regards
keld
|
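[Kevin's single-bidix idea could be sketched like this. The entries are hypothetical (sv "flicka" → nb "pike" / nn "jente" is just an illustration, with tags simplified), and the exact attribute conventions should be checked against the existing nn-nb dictionaries:

```xml
<!-- Shared entry: identical target lemma/tags in nb and nn, so no alt needed. -->
<e><p><l>hus<s n="n"/></l><r>hus<s n="n"/></r></p></e>
<!-- Divergent entries, one per target variant, restricted with alt. -->
<e alt="nb"><p><l>flicka<s n="n"/></l><r>pike<s n="n"/></r></p></e>
<e alt="nn"><p><l>flicka<s n="n"/></l><r>jente<s n="n"/></r></p></e>
```

At compile time each variant would then be built with its own alt value, so the shared entries are written once and only the divergent ones are duplicated.]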
From: Per T. <per...@op...> - 2012-08-09 18:21:46
|
Hi,
thank you all for the useful comments!

On Thu, Aug 9, 2012, at 15:35, Trosterud Trond wrote:
> Kevin Brubeck Unhammer wrote on 9 Aug 2012 at 14:54:
> > Francis Tyers <ft...@pr...> writes:
--snip--
> >>> 3. You might use a level 1 translation (without constraint grammar),
> >>> like the pair Swedish - Danish. In that case, you could make the
> >>> translation usable for a wide audience by adding the pair to Apertium
> >>> Caffeine and the new OmegaT plug-in.
> >>
> >> In any case there is no free constraint grammar of Swedish currently available.
>
> The lack of a CG for Swedish is a problem. My suggestion would be to write
> one. To be a bit more specific: to write the 100-or-so rules needed for
> removing the gross majority, say 80(?)%, of the ambiguity.

Tihomir has said before that he plans to start developing a constraint grammar for Swedish.

> > What you're describing is gisting/translation for understanding; I can't
> > imagine gisting MT would be very useful for sv-nb/nn (and I suspect
> > people would use Google for that anyway).
>
> From the Norwegian side, we cannot imagine the need for a sv-nb/nn gisting
> system. The maximum help we would need is, in rare cases, a dictionary
> translating a small number of hard words.
>
> How hard Norwegian is for Swedes is of course up to the Swedes to judge.
> But the competition will be between understanding the Norwegian text and
> understanding (sic) the MT output.

You are right, of course; I should have thought of that.

> > But with these closely related
> > languages, it's possible to get to a standard good enough for
> > post-editing (pre-publishing), e.g. with OmegaT as you mentioned, and in
> > that case the users definitely know which language it is already.
>
> Yes, a production system (say, I want to translate a sv article to nn on
> Wikipedia) is a different matter. My experience from nn-nb translation
> is that the time saving from post-editing, as compared to
> rewriting/translating, lies around 80%.

Yes, that was the scenario I first had in mind. But it would break if there is a need for a constraint grammar, wouldn't it? And then there won't be any use left for the Apertium translation.

> So yes, that can be a good idea. __But__ the nb-nn lexicon and orthographic
> principles are the same, so more often than not unknown words will come
> out as free rides. For sv-nn/nb that will __not__ be the case (to the
> same extent), since both vocabulary and orthography deviate more. So,
> fewer free rides for unknown words. This implies that the transfer lexicon
> must be __much__ bigger than the nb-nn one in order to get the same good
> results as we have for nb-nn. The good news is that the making of such an
> enlarged transfer lexicon can in part be done automatically, and then
> manually post-edited.

What do you have in mind? Please tell me more about how to generate the bidix automatically!

And for the manual part: Keld once told me there is a list of "false friends" for da/sv/nb. Where do I find that list of problematic words?

> >> (3) You make the two translators in the one pair. For this, you could
> >> have the same Swedish dictionary, but would need different nb and nn
> >> dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb
> >> and sv-nn transfer rules.
>
> > (3) sounds best to me too.
>
> I agree.
>
> > Perhaps you could even do with one bidix, and
> > just use the alt="nn" vs alt="nb" attribute; a rough and dirty count
> > shows that the majority of entries in the nn-nb bidix carry over the
> > same lemma/tag:
>
> This could very well be the case, yes (cf. my experiences with free
> rides).
>
> > That said, I would pick one first and get the system up and running,
> > then expand to both later on.
>
> This is also a possibility, yes. But the expansion to both languages
> should be taken into account in the setup phase.

How to proceed? Say that I go for (3) with one bidix and start with bokmål (nb).

BTW, on my hand cream I can read "N/D Intensivt mykgjørende/blødgørende og pleiende håndkrem/håndcreme." Looks like a similar approach for Norsk bokmål (nb) and Danish (da)! That's why I thought of reusing the Danish-Swedish transfer rules.

--snip--

> > http://www.nb.no/spraakbanken/tilgjengelege-ressursar/tekstressursar has
> > more frequency lists (they also taunt you with this enormous corpus, but
> > it's currently "in beta", very messy, and best avoided for now).
>
> The best resource is the NoWaC corpus; it also has frequency lists, both
> for lemmata and for word forms.
>
> My final comment would be that the work will be
>
> 1. in the analysis/generation of Swedish
> 2. ... and in the bidix.
>
> As for 1, we should look around in the Swedish language technology
> landscape and look for open resources, e.g. in Gothenburg (Aarne Ranta,
> also Språkbanken).

What kind of resources do I need?

> As for 2, Lexin might be one resource. I am at Euralex in Oslo right now,
> and will ask around.

Fine! Besides, what's Lexin?

> Trond.
|
From: Trosterud T. <tro...@ui...> - 2012-08-09 21:23:52
|
Per Tunedal wrote on 9 Aug 2012 at 20:21:
> Tihomir has said before that he plans to start developing a constraint
> grammar for Swedish.

Good. Again:
- Are there open resources?
- Could something be ported from Norwegian? (Perhaps only indirectly.)

>> Yes, a production system (say, I want to translate a sv article to nn on
>> Wikipedia) (…)
>
> Yes, that was the scenario I first had in mind. But it would break if
> there is a need for a constraint grammar, wouldn't it? And then there
> won't be any use left for the Apertium translation.

Well. Since a handful of rules will remove most ambiguities, what is left will be partly disambiguated. And how bad this is for MT remains to be seen. So it will not break. It will only be more problematic, and the result will be poorer.

>> The good news is that the making of such an
>> enlarged transfer lexicon can in part be done automatically, and then
>> manually post-edited.
>
> What do you have in mind? Please tell me more about how to generate the
> bidix automatically!

a. via a parallel corpus (of course)
b. by
--- taking a sv list of words
--- running it through a sv2no orthographical + lexical transfer
--- analyzing the output, and picking the recognized matches (input N Sg -> collect all N Sg output)
--- going through the result manually

About the transducer:

Lexical changes: samhälle > samfunn, the prefix o- > u-, stad > by (when these occur in compounds)
Suffixes: -tion > -sjon
The obvious things: ö > ø, ä > æ, x > ks

See e.g. associationsrikedom, variationsrikedom, situationsrikedom, infektionssjukdom, kombinationsslalom, informationsergonom, nationalekonom, sundströmnationalekonom, konsumtionsboom, kommunikationsform, organisationsform, notationsform, injektionsform, portionsform, distributionsform, nationalsocialism, ationalism, nationalism, smygnationalism, multinationalism, hypernationalism, internationalism, vänsternationalism, naturnationalism, hägnainossnationalism, statsnationalism, rationalism, sensationalism, traditionalism, funktionalism, exceptionalism, koncentrationskapitalism, mutationsmekanism, isolationism, exhibitionism, perfektionism, protektionism, interventionism

This is a list of -tion- words. They shall all have -sjon- in nb and nn. In addition: c > s, rikedom > rikdom, sjuk > syk (nb only), ekonom > økonom, social > sosial, -ism > -isme, xc > ks. Thus a long row of small changes is needed to turn such loanword strings into Norwegian. In a recent frequency corpus from Svenska språkbanken I found 365,000 unique word forms; of these, 7,700 contained -tion, and thus need the ruleset above.

> And for the manual part:
> Keld once told me there is a list of "false friends" for da/sv/nb.
> Where do I find that list of problematic words?

In paper dictionaries and in the textbooks used in the universities for learning your neighbouring language.

>> 1. in the analysis/generation of Swedish
>> 2. ... and in the bidix.
>>
>> As for 1, we should look around in the Swedish language technology
>> landscape and look for open resources, e.g. in Gothenburg (Aarne Ranta,
>> also Språkbanken).
>
> What kind of resources do I need?

For 1: swetwol :-) But it seems there are resources in Gothenburg:

http://www.cse.chalmers.se/alumni/markus/FM/
http://www.cse.chalmers.se/alumni/markus/FM/download/swedish.lexicon

This might even work.

>> As for 2, Lexin might be one resource. I am at Euralex in Oslo right now,
>> and will ask around.
>
> Fine! Besides, what's Lexin?

Lexikon för invandrare: http://lexin.nada.kth.se/lexin/

Trond.
|
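[A first cut of the loanword rules above can even be prototyped as a plain sed pipeline before committing to a real transducer. The rule set and ordering here are only a sketch covering a few of the rules named above (tion > sjon, rikedom > rikdom, ekonom > økonom, -ism > -isme, ö > ø, ä > æ); a real sv2no transfer needs many more rules plus context conditions:

```shell
# Tiny prototype of the sv -> nb orthographic transfer sketched above.
# Rules are applied in order; this set only covers the examples below.
sv2no() {
  sed -e 's/tion/sjon/g' \
      -e 's/rikedom/rikdom/g' \
      -e 's/ekonom/økonom/g' \
      -e 's/ism$/isme/' \
      -e 's/ö/ø/g' \
      -e 's/ä/æ/g'
}

echo 'nationalekonom' | sv2no   # nasjonaløkonom
echo 'nationalism'    | sv2no   # nasjonalisme
```

Running the 365,000-form frequency list through such a function and keeping only the outputs the nb analyser recognises would give the candidate list Trond describes.]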
From: Per T. <per...@op...> - 2012-08-10 08:08:23
|
Hi Keld, On Thu, Aug 9, 2012, at 19:55, ke...@ke... wrote: > On Thu, Aug 09, 2012 at 02:54:27PM +0200, Kevin Brubeck Unhammer wrote: > > Francis Tyers <ft...@pr...> writes: --snip-- > > > > > > (3) You make the two translators in the one pair. For this, you could > > > have the same Swedish dictionary, but would need different nb and nn > > > dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb > > > and sv-nn transfer rules. > > > > > > I think that (3) is probably best, but would like input from others > > > (e.g. Unhammer or Trond). > > > > (3) sounds best to me too. Perhaps you could even do with one bidix, and > > just use the alt="nn" vs alt="nb" attribute; a rough and dirty count > > shows that the majority of entries in the nn-nb bidix carry over the > > same lemma/tag: > > > > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1==$2'|wc -l > > 71628 > > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1!=$2'|wc -l > > 11365 > > Some one who can tell the easiest way to add the "alt-tags" to the dictionnaries, before merging them? Maybe one can have an easy procedure to add new entries when the included languages are updated? > > As Danish is a kind of old Norwegian bokmaal, maybe we could inlude that > language too. > Then all three languages could benefit from the combined work. You're right! My hand cream example comes to mind. I quote again: "On my hand cream I can read "N/D Intensivt mykgjørende/blødgørende og pleiende håndkrem/håndcreme."" Just for fun I tried to translate a text in Norwegian bokmål (nb) to Swedish with the da-sv pair and just about half the words where marked as unknown. Your suggestion would be a very radical solution! Very non-ortodox. A practical solution, rather than a solution founded in theoretical linguistic considerations. Might be very fruitful! The new "languge pair" Swedish (sv) - Danish (da)/Norwegian bokmål (nb)/Norwegian nynorsk (nn)! > > > >> B. 
I have looked in the repository and found that some work has been > > >> done on the following dictionaries: > > >> > > >> Danish (da) - Norwegian Bokmål (nb) - nursery > > >> Swedish (sv) - Norwegian Bokmål (nb) - incubator > > >> > > >> Tihomir told me he's working on Swedish-Icelandic and has expanded the > > >> Swedish monolingual dictionary from sv-da. But which is the most > > >> complete Norwegian Bokmål (nb) monolingual dictionary? The one from the > > >> pair Norwegian Bokmål (nb) - Norwegian Nynorsk (nn)? > > > > > > Yes, I would take the Swedish dictionary from sv-is and the Norwegian > > > dictionar(y,ies) from nn-nb. > > > I have also been working on Swedish nouns from SALDO. > I was working on a scheme that could remove about 60 % of the > ambiguities. What kind of work? A constraint grammar or what? > > Best regards > keld > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Apertium-stuff mailing list > Ape...@li... > https://lists.sourceforge.net/lists/listinfo/apertium-stuff |
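[Editor's note] Kevin's single-bidix idea with alt attributes could look something like the sketch below. This is an illustrative fragment only, using the standard dix entry syntax with made-up entries, not actual entries from any released pair: entries shared by Bokmål and Nynorsk carry no alt attribute, while divergent translations are duplicated with alt="nb" / alt="nn", and one variant's entries are selected when the dictionary is compiled.

```xml
<!-- Hypothetical entries for a merged sv-no bidix (not from a real pair). -->
<section id="main" type="standard">
  <!-- same translation in nb and nn: one plain entry suffices -->
  <e><p><l>samhälle<s n="n"/></l><r>samfunn<s n="n"/></r></p></e>
  <!-- divergent: sv "inte" is "ikke" in Bokmål but "ikkje" in Nynorsk -->
  <e alt="nb"><p><l>inte<s n="adv"/></l><r>ikke<s n="adv"/></r></p></e>
  <e alt="nn"><p><l>inte<s n="adv"/></l><r>ikkje<s n="adv"/></r></p></e>
</section>
```

Since the counts above show that most nn-nb entries carry over the same lemma and tags, most of a merged file would consist of plain entries like the first one.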
From: Per T. <per...@op...> - 2012-08-10 08:17:49
|
Hi, On Thu, Aug 9, 2012, at 23:23, Trosterud Trond wrote: > > Per Tunedal kirjoitti 9. aug. 2012 kello 20:21: > > Tihomir has told before that he plans to start developing a constraint > > grammar for Swedish. > > Good. Again: > - Are there open resources? > - Could something be ported from Norwegian? (perhaps only indirectly). > > >> Yes, a production system (say, I want to translate a sv article to nn on > >> Wikipedia) (…) > > > Yes, that was the scenario I first had in mind. But it would break if > > there is a need for a constraint grammar, wouldn't it? And then there > > won't be any use left for the Apertium-translation. > > Well. Since a handful of rules will remove most ambiguities, what is left > will be partly disambiguated. And how bad this is for MT needs to be > seen. So it will not break. It will only be more problematic, and the > result will be poorer. Mikel Artetxe has explained that the OmegaT plug-in doesn't work for language pairs that depend on programs that aren't a part of lttoolbox-java. Six language pairs depend on the Constraint Grammar package and are thus excluded; one of them is apertium-nn-nb. But sv-da doesn't use any constraint grammar, so I concluded that sv-nb (Norsk bokmål) wouldn't need one either. And it would come to real use, by real translators, using OmegaT. If the pair cannot be used, I don't see any need to develop it. > > >> The good news is that the making of such an > >> enlarged transfer lexicon in part can be done automatically, and then > >> manually post edited. > > What do you have in mind? Please tell me more about how to generate the > > bidix automatically! > > a. via a parallel corpus (of course) I thought I could do without this and only work with monolingual data. And, of course, existing dictionaries and rules in other language pairs involving Swedish (sv) and Norwegian (nb/nn). > b. by > --- 1 taking a sv list of words > --- run it through a sv2no orthographical + lexical transfer Tools for this? 
Any documentation? > --- analyze the output, and pick the recognized matches (input N Sg -> > collect all N Sg output) > --- go through the result manually > > About the transducer: > Lexical changes: samhälle > samfunn, the prefix o- -> u-, stad -> by (when > these occur in compounds) > suffixes: -tion -> -sjon, > The obvious things: ö>ø, ä>æ, x>ks > > See e.g. > associationsrikedom, variationsrikedom, situationsrikedom, > infektionssjukdom, kombinationsslalom, informationsergonom, > nationalekonom, sundströmnationalekonom, konsumtionsboom, > kommunikationsform, organisationsform, notationsform, injektionsform, > portionsform, distributionsform, nationalsocialism, ationalism, > nationalism, smygnationalism, multinationalism, hypernationalism, > internationalism, vänsternationalism, naturnationalism, > hägnainossnationalism, statsnationalism, rationalism, sensationalism, > traditionalism, funktionalism, exceptionalism, koncentrationskapitalism, > mutationsmekanism, isolationism, exhibitionism, perfektionism, > protektionism, interventionism > > This is a list of -tion- words. They should all have -sjon- in nb, nn. > In addition: c > s, rikedom > rikdom, sjuk > syk (nb only), ekonom > > økonom, social > sosial, -ism > -isme, xc > ks, > > Thus a long row of small changes is needed to turn such loanword > strings into Norwegian. In a recent frequency corpus from Svenska > språkbanken I found 365000 unique word forms; of these, 7700 contained > -tion, and thus need the ruleset above. > > > And for the manual part: > > Keld once told me there is a list of "false friends" for da/sv/nb. > > Where do I find that list of problematic words? > > In paper dictionaries and textbooks used in the universities for learning > your neighboring language. > > >> > >> 1 in the analysis/generation of Swedish > >> 2 … and in the bidix. > >> > >> As for 1, we should look around in the Swedish language technology > >> landscape and look for open resources, e.g. 
in Gothenburg (Aarne Ranta, > >> also Språkbanken). > > > > What kind of resources do I need? > > For 1: swetwol :-) But it seems there are resources in Gothenburg: > > http://www.cse.chalmers.se/alumni/markus/FM/ > http://www.cse.chalmers.se/alumni/markus/FM/download/swedish.lexicon > > This might even work. As an input for transfer rules or for a potential constraint grammar? > > >> As for 2, Lexin might be one resource. I am on Euralex in Oslo right now, > >> and will ask around. > > Fine! Besides, what's Lexin? > > Lexikon för invandrare ("dictionary for immigrants"), http://lexin.nada.kth.se/lexin/ As a native Swede, I don't see any need for this. > > Trond. |
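[Editor's note] Trond's sketch of an orthographic + lexical sv→nb transducer could be prototyped as a crude ordered substitution list. This is a toy illustration only (the function and rule ordering are mine, not an Apertium tool), and as Trond describes, its output is just a candidate list that must then be filtered against a Norwegian analyser:

```python
# Ordered sv -> nb candidate rewrites, taken from the rules in the mail
# above. More specific strings come first, so "tion" is rewritten before
# the single-character rules get a chance to touch it.
RULES = [
    ("tion", "sjon"),
    ("rikedom", "rikdom"),
    ("ö", "ø"),
    ("ä", "æ"),
    ("x", "ks"),
]

def sv_to_nb_candidate(word):
    """Turn a Swedish word form into a Norwegian bokmål *candidate*.

    The result is only a guess; keep it only if an nb analyser
    recognises it (Trond's 'pick the recognized matches' step)."""
    for pattern, replacement in RULES:
        word = word.replace(pattern, replacement)
    return word

print(sv_to_nb_candidate("variationsrikedom"))  # variasjonsrikdom
```

Running the 365000-form frequency list through such a function and keeping only the forms an nb analyser accepts would give a first, manually checkable draft of the bidix.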
From: Trosterud T. <tro...@ui...> - 2012-08-10 09:17:48
|
Per Tunedal kirjoitti 10. aug. 2012 kello 10:17: > > Mikel Artetxe has explained that the OmegaT plug-in doesn't work for > language pairs that depends on programs that aren't a part of > lttoolbox-java. Six language pairs depend on the Constraint Grammar > package and are thus excluded, one of them is apertium-nn-nb. But sv-da > doesn't use any constraint grammar, thus I concluded that sv-nb (Norsk > bokmål) wouldn't need one either. And would come to real use, by real > translators, using OmegaT. If the pair cannot be used, I don't see any > need to develop it. Aha, I did not know of that limitation. Also, lttoolbox has disambiguation; it is just statistically based (if I understand it correctly). So, in that case you should take some big Swedish tagged corpora (e.g. from Språkbanken) and train your Apertium disambiguator on them (documentation on wiki.apertium.org; I never did this myself). >> a. via a parallel corpus (of course) > I thought I could do without this and only work with monolingual data. Kimmo Koskenniemi once said: "Don't guess if you know". So, if you have parallel corpora, use them. >> b. by >> --- 1 taking a sv list of words >> --- run it through a sv2no orthographical + lexical transfer > Tools for this? Any documentation? No, no tools made. The rest of my posting was a sketch of how to make it. (…) >> For 1: swetwol :-) But it seems there are resources in Gothenburg: >> http://www.cse.chalmers.se/alumni/markus/FM/download/swedish.lexicon >> This might even work. > As an input for transfer rules or for a potential constraint grammar? As a basis for your Swedish morphology (which may be in place already, in sv-da?). The sv morphology will then be input to CG, or, as I now hear about the OmegaT-imposed limitations, as input to the Apertium statistical tagger/disambiguator. >>>> Lexicon för invandrare, http://lexin.nada.kth.se/lexin/ > As a native Swede, I don't see any need for this. 
By "Lexin" I did not mean to imply that you were an immigrant in need of a dictionary, but that you might be interested in the lemma list of 30,000 words that Lexin offers. I have been told that it is now available under an open license, but looking at their webpage I do not find a download link. You will in any case need a Swedish lemma list to apply the sv2no script to. Trond |
From: Kevin B. U. <unh...@fs...> - 2012-08-10 08:30:04
|
Per Tunedal <per...@op...> writes: > Hi Keld, > > > On Thu, Aug 9, 2012, at 19:55, ke...@ke... wrote: >> On Thu, Aug 09, 2012 at 02:54:27PM +0200, Kevin Brubeck Unhammer wrote: >> > Francis Tyers <ft...@pr...> writes: > --snip-- >> > > >> > > (3) You make the two translators in the one pair. For this, you could >> > > have the same Swedish dictionary, but would need different nb and nn >> > > dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb >> > > and sv-nn transfer rules. >> > > >> > > I think that (3) is probably best, but would like input from others >> > > (e.g. Unhammer or Trond). >> > >> > (3) sounds best to me too. Perhaps you could even do with one bidix, and >> > just use the alt="nn" vs alt="nb" attribute; a rough and dirty count >> > shows that the majority of entries in the nn-nb bidix carry over the >> > same lemma/tag: >> > >> > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1==$2'|wc -l >> > 71628 >> > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1!=$2'|wc -l >> > 11365 >> > > > Some one who can tell the easiest way to add the "alt-tags" to the > dictionnaries, before merging them? > Maybe one can have an easy procedure to add new entries when the > included languages are updated? You wouldn't be adding alt-tags until after you add the third language (ie. Nynorsk if you start with Swedish-Bokmål) to the pair, so it's not something to worry about yet. -- Kevin Brubeck Unhammer Sent using my emacs |
From: Kevin B. U. <unh...@fs...> - 2012-08-10 08:44:04
|
Per Tunedal <per...@op...> writes: > Hi, > > On Thu, Aug 9, 2012, at 23:23, Trosterud Trond wrote: >> >> Per Tunedal kirjoitti 9. aug. 2012 kello 20:21: >> > Tihomir has told before that he plans to start developing a constraint >> > grammar for Swedish. >> >> Good. Again: >> - Are there open resources? >> - Could something be ported from Norwegian? (perhaps only indirectly). >> >> >> Yes, a production system (say, I want to translate a sv article to nn on >> >> Wikipedia) (…) >> >> > Yes, that was the scenario I first had in mind. But it would break if >> > there is a need for a constraint grammar, wouldn't it? And then there >> > wont be any use left for the Apertium-translation. >> >> Well. Since a handful of rules will remove most ambiguities, what is left >> will be partly disambiguated. And how bad this is for MT needs to be >> seen. So it will not break. It will only be more problematic, and the >> result will be poorer. > > Mikel Artetxe has explained that the OmegaT plug-in doesn't work for > language pairs that depends on programs that aren't a part of > lttoolbox-java. Six language pairs depend on the Constraint Grammar > package and are thus excluded, one of them is apertium-nn-nb. But sv-da > doesn't use any constraint grammar, thus I concluded that sv-nb (Norsk > bokmål) wouldn't need one either. And would come to real use, by real > translators, using OmegaT. If the pair cannot be used, I don't see any > need to develop it. In any case a CG could be added later, as an option for those who aren't using OmegaT or Android. [...] >> > What kind of resources do I need? >> >> For 1: swetwol :-) But it seems there are resources in Gothenburg: >> >> http://www.cse.chalmers.se/alumni/markus/FM/ >> http://www.cse.chalmers.se/alumni/markus/FM/download/swedish.lexicon >> >> This might even work. > > As an input for transfer rules or for a potential constraint grammar? The .lexicon file might be used to enlarge sv.dix. 
However, is-sv.sv.dix should already be big enough to get a pair started. (What's the license on that FM stuff anyway?) >> >> As for 2, Lexin might be one resource. I am on Euralex in Oslo right now, >> >> and will ask around. >> > Fine! Besides, what's Lexin? >> >> Lexicon för invandrare, http://lexin.nada.kth.se/lexin/ > > As a native Swede, I don't see any need for this. Your machine translation system, however, is not a native Swede, and might have a need to know that e.g. "katt" is a noun. But it doesn't seem that Lexin is free software: Går lexikonen att ladda ner? Nej. Däremot kan man ladda ner Folkets lexikon, som ersätter engelska Lexin, men enbart i xml-format. ("Can the dictionaries be downloaded? No. However, you can download Folkets lexikon, which replaces the English Lexin, but only in XML format.") http://lexin.nada.kth.se/lexin/#about=1;main=3; -- Kevin Brubeck Unhammer Sent using my emacs |
From: Per T. <per...@op...> - 2012-08-10 18:11:08
|
Hi, On Fri, Aug 10, 2012, at 10:29, Kevin Brubeck Unhammer wrote: > Per Tunedal <per...@op...> > writes: > > > Hi Keld, > > > > > > On Thu, Aug 9, 2012, at 19:55, ke...@ke... wrote: > >> On Thu, Aug 09, 2012 at 02:54:27PM +0200, Kevin Brubeck Unhammer wrote: > >> > Francis Tyers <ft...@pr...> writes: > > --snip-- > >> > > > >> > > (3) You make the two translators in the one pair. For this, you could > >> > > have the same Swedish dictionary, but would need different nb and nn > >> > > dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb > >> > > and sv-nn transfer rules. > >> > > > >> > > I think that (3) is probably best, but would like input from others > >> > > (e.g. Unhammer or Trond). > >> > > >> > (3) sounds best to me too. Perhaps you could even do with one bidix, and > >> > just use the alt="nn" vs alt="nb" attribute; a rough and dirty count > >> > shows that the majority of entries in the nn-nb bidix carry over the > >> > same lemma/tag: > >> > > >> > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1==$2'|wc -l > >> > 71628 > >> > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1!=$2'|wc -l > >> > 11365 > >> > > > > > Some one who can tell the easiest way to add the "alt-tags" to the > > dictionnaries, before merging them? > > Maybe one can have an easy procedure to add new entries when the > > included languages are updated? > > You wouldn't be adding alt-tags until after you add the third language > (ie. Nynorsk if you start with Swedish-Bokmål) to the pair, so it's not > something to worry about yet. > > > -- > Kevin Brubeck Unhammer > Well, I took the advice of Trond: "... the expansion to both languages should be taken into account in the setup phase." It would be much more complicated to add the tags afterwards. Besides, will this alternative work "out of the box", i.e. without any modifications to Apertium or the Apertium plug-in for OmegaT? 
Will it actually be presented as different pairs to the users (sv-nb, sv-nn) or how will it work? Yours, Per Tunedal |
From: Francis T. <ft...@pr...> - 2012-08-10 18:22:11
|
El dv 10 de 08 de 2012 a les 20:11 +0200, en/na Per Tunedal va escriure: > Hi, > > > On Fri, Aug 10, 2012, at 10:29, Kevin Brubeck Unhammer wrote: > > Per Tunedal <per...@op...> > > writes: > > > > > Hi Keld, > > > > > > > > > On Thu, Aug 9, 2012, at 19:55, ke...@ke... wrote: > > >> On Thu, Aug 09, 2012 at 02:54:27PM +0200, Kevin Brubeck Unhammer wrote: > > >> > Francis Tyers <ft...@pr...> writes: > > > --snip-- > > >> > > > > >> > > (3) You make the two translators in the one pair. For this, you could > > >> > > have the same Swedish dictionary, but would need different nb and nn > > >> > > dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb > > >> > > and sv-nn transfer rules. > > >> > > > > >> > > I think that (3) is probably best, but would like input from others > > >> > > (e.g. Unhammer or Trond). > > >> > > > >> > (3) sounds best to me too. Perhaps you could even do with one bidix, and > > >> > just use the alt="nn" vs alt="nb" attribute; a rough and dirty count > > >> > shows that the majority of entries in the nn-nb bidix carry over the > > >> > same lemma/tag: > > >> > > > >> > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1==$2'|wc -l > > >> > 71628 > > >> > $ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1!=$2'|wc -l > > >> > 11365 > > >> > > > > > > > Some one who can tell the easiest way to add the "alt-tags" to the > > > dictionnaries, before merging them? > > > Maybe one can have an easy procedure to add new entries when the > > > included languages are updated? > > > > You wouldn't be adding alt-tags until after you add the third language > > (ie. Nynorsk if you start with Swedish-Bokmål) to the pair, so it's not > > something to worry about yet. > > > > > > -- > > Kevin Brubeck Unhammer > > > Well, I took the advice of Trond: "... the expansion to both languages > should be taken into account in the setup phase." It would be much more > complicated to add the tags afterwards. 
> > Besides, will this alternative work "out of the box", i.e. without any > modifications to Apertium or the Apertium plug-in for OmegaT? Will it > actually be presented as different pairs to the users (sv-nb, sv-nn) or > how will it work? Get comfortable with Apertium development, then try to think about how you want to deal with it. Trying to work it out without experience of the tools won't make much sense. You can count becoming familiar with the tools as part of the "setup phase". F. |
From: Mikel A. <art...@gm...> - 2012-08-10 20:17:07
|
> > Mikel Artetxe has explained that the OmegaT plug-in doesn't work for > language pairs that depends on programs that aren't a part of > lttoolbox-java. Six language pairs depend on the Constraint Grammar > package and are thus excluded, one of them is apertium-nn-nb. But sv-da > doesn't use any constraint grammar, thus I concluded that sv-nb (Norsk > bokmål) wouldn't need one either. And would come to real use, by real > translators, using OmegaT. If the pair cannot be used, I don't see any > need to develop it. > While what you state is true right now (i.e. the OmegaT plug-in doesn't work with language pairs that use CG), that doesn't necessarily mean that it will never work with them (and the same applies for any program based on lttoolbox-java such as Apertium Caffeine or the Android app). I mean, all this stuff is new... the OmegaT plug-in wasn't even in my initial plan for GSoC! I think (and hope) that this will keep evolving in the future. It seems that, from your point of view, it wouldn't make sense to work on a language pair that wouldn't be supported by the current OmegaT plug-in. I announced the plug-in practically yesterday so, from that perspective, all the work done so far would have been, in principle, completely useless at the moment it was conceived... If all the Apertium developers had thought that way, nothing would have been done so far, so the OmegaT plug-in itself would be inconceivable! All in all, I have absolutely no idea about the linguistic facts that are under discussion here, but I really think that you shouldn't base your decision on what the OmegaT plug-in does right now. That being said, I've been thinking about how we could support Constraint Grammar in lttoolbox-java (and, consequently, in the OmegaT plug-in), and I think that we could consider 3 different alternatives: 1) Invoke it as an external program. Note that lttoolbox-java can already deal with that. 
In fact, apertium-viewer is now built on top of lttoolbox-java, and it still works with language pairs with external dependencies. As for the OmegaT plug-in, it works with resources compressed in a JAR, so directly invoking cg-proc wouldn't work for it. We would need to extract the required resources to a temporary directory and invoke it with the correct parameters, which is not currently supported, but it shouldn't be too difficult to implement. The disadvantage of this approach is that the user would need to install the Constraint Grammar package on his machine as usual, so the solution wouldn't be portable at all. For instance, this wouldn't work under Android. 2) Create a Java interface for CG using JNI. That should guarantee a certain level of portability (it could even work in Android). It would reuse all the source code from CG, and the user wouldn't need to install any external program to make it work. That sounds good, but I think that it might turn out to be more complex and problematic than it seems. For instance, just looking at the installation instructions I see that it depends on some external libraries, so things start getting more complex... 3) Develop a Java port of it. Probably the best solution but, obviously, the hardest one to implement... If people think that it would be useful, I could implement solution 1) quite easily. That would serve to make the OmegaT plug-in work with language pairs that depend on CG (although the user would have to install CG manually). Solution 2) and, especially, 3), would require much more work. I might work on it some day, but don't expect anything (it is definitely not in my plans for the near future)! But, who knows, somebody else could also work on it, and perhaps it might be an interesting project for future GSoCs... Besides, will this alternative work "out of the box", i.e. without any > modifications to Apertium or the Apertium plug-in for OmegaT? 
> The Apertium plug-in for OmegaT is built on top of lttoolbox-java. So if what you propose would work with lttoolbox-java, it would work with the plug-in as well. lttoolbox-java is a Java port of lttoolbox and apertium, so if you don't go beyond that, your work would be compatible with it without any problem. But I insist that, from my point of view, you shouldn't base your decision on this... |
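[Editor's note] Alternative 1) above, shelling out to cg-proc, amounts to a wrapper like the following. This is a sketch in Python rather than Java for brevity; the function names and the grammar filename are made up, and it assumes a cg-proc binary from the CG-3 package is installed on the PATH:

```python
import shutil
import subprocess

def build_cg_command(grammar_bin):
    """Command line for piping analysed text through a compiled CG grammar.

    cg-proc takes the compiled grammar as its positional argument and
    reads the Apertium stream format on stdin."""
    return ["cg-proc", grammar_bin]

def run_cg(text, grammar_bin):
    """Disambiguate `text` with cg-proc, raising if CG-3 is not installed."""
    if shutil.which("cg-proc") is None:
        raise RuntimeError("cg-proc not found; install the CG-3 package")
    result = subprocess.run(
        build_cg_command(grammar_bin),
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The JAR-packaging issue Mikel mentions would sit on top of this: the compiled grammar would first have to be extracted to a temporary file before a path like the hypothetical `sv-nb.rlx.bin` could be passed to the subprocess.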
From: Jacob N. <jac...@gm...> - 2012-08-10 23:40:04
|
2012/8/10 Mikel Artetxe <art...@gm...> > If all the Apertium developers had thought that way, nothing would have >> been done so far, so the OmegaT plug-in itself would be inconceivable! All >> in all, I have absolutely no idea about the linguistic facts that are under >> discussion here, but I really think that you shouldn't base your decision >> on what the OmegaT plug-in does right now. > > I agree :-) > > That being said, I've been thinking about how we could support Constraint > Grammar in lttoolbox-java (and, consequently, in the OmegaT plug-in), and I > think that we could consider 3 different alternatives: > > 1) Invoke it as an external program. Note that lttoolbox-java can already > deal with that. In fact, apertium-viewer is now built on top of > lttoolbox-java, and it still works with language pairs with external > dependencies. As for the OmegaT plug-in, it works with resources compressed > in a JAR, so directly invoking cg-proc wouldn't work for it. We would need > to extract the required resources to a temporary directory and invoke it > with the correct parameters, which is not currently supported, but it > shouldn't be too difficult to implement. The disadvantage of this approach > is that the user would need to install the Constraint Grammar package in > his machine as usual, so the solution wouldn't be portable at all. For > instance, this wouldn't work under Android. > Invoking external programs works fine in Android. The only difference in Android is that we would have to have the CG programs in the app's directory. And... it seems it's not so easy to compile programs using ICU on Android. If people think that it would be useful, I could implement solution 1) > quite easily. That would serve to make the OmegaT plug-in work with > language pairs that depend on CG (although the user would have to install > CG manually). > Having the user install CG manually on his/her desktop machine is fine. 
If implemented I'd suggest that we expand the 'language-pairs' file in builds/ with a keyword saying that CG is required. -- Jacob Nordfalk <http://profiles.google.com/jacob.nordfalk> javabog.dk Androidudvikler og -underviser på IHK<http://cv.ihk.dk/diplomuddannelser/itd/vf/MAU>og Lund&Bendsen <https://www.lundogbendsen.dk/undervisning/beskrivelse/LB1809/> |
From: Francis T. <ft...@pr...> - 2012-08-10 21:18:21
|
El dv 10 de 08 de 2012 a les 22:16 +0200, en/na Mikel Artetxe va escriure: > Mikel Artetxe has explained that the OmegaT plug-in doesn't > work for > language pairs that depends on programs that aren't a part of > lttoolbox-java. Six language pairs depend on the Constraint > Grammar > package and are thus excluded, one of them is apertium-nn-nb. > But sv-da > doesn't use any constraint grammar, thus I concluded that > sv-nb (Norsk > bokmål) wouldn't need one either. And would come to real use, > by real > translators, using OmegaT. If the pair cannot be used, I don't > see any > need to develop it. > > While what you state is true right now (i.e. the OmegaT plug-in > doesn't work with language pairs that use CG), that doesn't > necessarily mean that it will never work with them (and the same > applies for any program based on lttoolbox-java such as Apertium > Caffeine or the Android app). I mean, all this stuff is new... the > OmegaT plug-in wasn't even in my initial plan for GSoC! I think (and > hope) that this will keep evolving in the future. > > It seems that, from your point of view, it wouldn't make sense to work > on a language pair that wouldn't be supported by the current OmegaT > plug-in. I announced the plug-in practically yesterday so, from that > perspective, all the work done so far would have been, in principle, > completely useless in the moment it was conceived... If all the > Apertium developers had thought that way, nothing would have been done > so far, so the OmegaT plug-in itself would be inconceivable! All in > all, I have absolutely no idea about the linguistic facts that are > under discussion here, but I really think that you shouldn't base your > decision on what the OmegaT plug-in does right now. 
> > Being said that, I've been thinking about how we could support > Constraint Grammar in lttoolbox-java (and, consequently, in the OmegaT > plug-in), and I think that we could consider 3 different alternatives: > > 1) Invoke it as an external program. Note that lttoolbox-java can > already deal with that. In fact, apertium-viewer is now built on top > of lttoolbox-java, and it still works with language pairs with > external dependencies. As for the OmegaT plug-in, it works with > resources compressed in a JAR, so directly invoking cg-proc wouldn't > work for it. We would need to extract the required resources to a > temporary directory and invoke it with the correct parameters, which > is not currently supported, but it shouldn't be too difficult to > implement. The disadvantage of this approach is that the user would > need to install the Constraint Grammar package in his machine as > usual, so the solution wouldn't be portable at all. For instance, this > wouldn't work under Android. > > 2) Create a Java interface for CG using JNI. That should guarantee > certain level of portability (it could even work in Android). It would > reuse all the source code from CG, and the user wouldn't need to > install any external program to make it work. That sounds good, but I > think that it might happen to be more complex and problematic than > what it seems. For instance, just looking at the installation > instructions I see that it depends on some external libraries, so > things start getting more complex... As far as I know, it mostly just depends on Boost and ICU. ICU is certainly available for Java. And Boost is probably widely used too. > 3) Develop a Java port of it. Probably the best solution but, > obviously, the hardest one to implement... I believe there was some work on this last year in GSOC: http://www.languagetool.org/gsoc2011/ Fran |
From: Mikel A. <art...@gm...> - 2012-08-11 14:14:02
|
> > > > 3) Develop a Java port of it. Probably the best solution but, > > obviously, the hardest one to implement... > > I believe there was some work on this last year in GSOC: > > http://www.languagetool.org/gsoc2011/ > > I didn't know about it. But it seems that what they developed was a conversion tool and not a Java port: http://languagetool.wikidot.com/using-the-rule-converter-gui And there it says that the conversion for CG is still pretty buggy... Anyway, that would still be viable, I guess. It would involve converting CG rules with that tool and reusing code from LanguageTool to work with them. |
From: Jacob N. <jac...@gm...> - 2012-08-10 22:53:51
|
2012/8/10 Per Tunedal <per...@op...> > > > > As Danish is a kind of old Norwegian bokmaal, maybe we could include that > > language too. > > Then all three languages could benefit from the combined work. > > You're right! My hand cream example comes to mind. I quote again: > > "On my hand cream I can read "N/D Intensivt mykgjørende/blødgørende > og pleiende håndkrem/håndcreme."" > :-) > > Just for fun I tried to translate a text in Norwegian bokmål (nb) to > Swedish with the da-sv pair and just about half the words were marked > as unknown. > Note that da->sv hasn't been developed or released at all. WRT whether CG is available for OmegaT or not, I think it is too early to say anything but that the problem can be solved, and there are many ways it could be solved: - a CG Java port, - a Java port through LanguageTool, - using a locally installed CG (binaries exist for Windows - see http://beta.visl.sdu.dk/cg3/single/#windows) as an external program, - making OmegaT use the Apertium web service. Apart from that, Apertium's built-in ruleset might be satisfactory. It might also be that Francis' work on lexical disambiguation rules could be applied (which I like because it has a much more Apertium-developer friendly syntax than CG). We are near the end of the project period of Google Summer of Code, and therefore I think the right advice to give to Per is: Get started, and we can promise that something will be ready to make it usable from OmegaT when you need it. |
From: Tino D. <tin...@gm...> - 2012-08-11 09:56:05
|
On Fri, Aug 10, 2012 at 10:16 PM, Mikel Artetxe <art...@gm...> wrote: > 1) Invoke it as an external program. > Probably the easiest to get working, but does add a silly text generation and parsing step. > 2) Create a Java interface for CG using JNI. ... For instance, just > looking at the installation instructions I see that it depends on some > external libraries, so things start getting more complex... > Boost is header-only, so doesn't add any files to the distribution. libtcmalloc is optional. ICU is the heavy one. I've looked at removing ICU and making a UTF-8-only version of CG-3, since everyone uses just UTF-8 these days. The key problem with that is regular expressions: I pass regex off to ICU's very nice Unicode character class (e.g. \p{Katakana}) capable regex engine. From what I could find, the only C++ engines capable of UTF-8 and Unicode character classes are ICU and PCRE, so that would be trading one library for another less capable one. And I'm open to making the library version easier to use, or just adding an easier-to-use API. The current API is for those who want almost total control. > 3) Develop a Java port of it. Probably the best solution but, obviously, > the hardest one to implement... > Haven't really looked into that as I consider JNI a better solution. But, it's all hash maps and hash sets, so maybe not that hard to convert. Again, regex is a significant feature and apparently only Java 7 and newer get that right. > If people think that it would be useful, I could implement solution 1) > quite easily. That would serve to make the OmegaT plug-in work with > language pairs that depend on CG (although the user would have to install > CG manually). > It's definitely possible to distribute CG-3 alongside the pairs that need it, or even one shared CG-3 package. ICU is what adds weight to that, but ideally the ICU extra files are only needed on Windows and Mac. -- Tino Didriksen |
From: Per T. <per...@op...> - 2012-08-11 13:04:16
|
Hi,

On Sat, Aug 11, 2012, at 00:53, Jacob Nordfalk wrote:

> 2012/8/10 Per Tunedal <per...@op...>
>
> > > As Danish is a kind of old Norwegian bokmaal, maybe we could include that
> > > language too. Then all three languages could benefit from the combined work.
> >
> > You're right! My hand cream example comes to mind. I quote again:
> >
> > "On my hand cream I can read "N/D Intensivt mykgjørende/blødgørende
> > og pleiende håndkrem/håndcreme.""
> >
> > :-)
> >
> > Just for fun I tried to translate a text in Norwegian bokmål (nb) to
> > Swedish with the da-sv pair and just about half the words were marked
> > as unknown.
>
> Note that da->sv hasn't been developed nor released at all.

Yet it's in Apertium-caffeine! And quite fun to play around with. What's the problem with the direction da-sv compared to sv-da? Why hasn't it been officially released?

> WRT whether CG is available for OmegaT or not, I think it is too early
> to say anything but that the problem can be solved, and there are many ways
> it could be solved:
> - a CG Java port,
> - a Java port through LanguageTool,
> - using a locally installed CG (binaries exist for Windows - see
> http://beta.visl.sdu.dk/cg3/single/#windows) as an external program,
> - making OmegaT use the Apertium web service.
>
> Apart from that, Apertium's built-in ruleset might be satisfactory.
> It might also be that Francis' work on lexical disambiguation rules could
> be applied (which I like because it has a much more Apertium-developer-friendly
> syntax than CG).

I have interpreted the Constraint Grammar discussion as an indication that there are a lot of ambiguities to resolve when translating from Swedish to Norwegian, and thus that the translation quality might be poor if CG isn't included in the project. As the Apertium Wiki wasn't accessible yesterday, I haven't studied how far I can get with Apertium's built-in rule set. What about the translation in the other direction, nb/nn to sv?
Is there the same need for a Constraint Grammar? Or can I do without it? My original plan was to start developing translation in that direction. As a native Swede I would find it much easier to translate from Norwegian to Swedish: I wouldn't have to check that much in dictionaries. Besides, professional translators always translate into their mother tongue.

Another conclusion from the discussion is that I need to create very large dictionaries, to compensate for the fact that there are far fewer words that are exactly the same in Norwegian and Swedish, compared to Norwegian bokmål (nb) and Norwegian nynorsk (nn). On the other hand, someone wrote that for comprehension only a short list of difficult words is needed. My own conclusion is that a "pair" from Norwegian (nb/nn) to Swedish (sv) containing the most frequent words plus words that are known to cause difficulties (including "false friends") might turn out to be very useful. Does anyone know how to figure out which words to include in the latter list? Collect personal experiences from experts like you?

BTW, I ran a few words from the nb frequency wordlist through Apertium-caffeine, translating with the da-sv pair. I expected to get a very low percentage of unknown words, given my experience with the hand cream translation. Unfortunately I got as much as 40 % unknown words. I planned to translate, say, the first 1500 or so words on the frequency list, to get the most important unknown words to work with for a start. I expected to get only a few hundred of them; now I'm not so sure any longer.

What about word order? I found a translation on my sun cream that has different word order for Danish (da) and Norwegian bokmål (nb). Is that just a coincidence, or a fact to take into account?

Another concern of mine: will the solution (3) with separate monolingual dictionaries and a common bilingual dictionary work "out of the box" with Apertium, Apertium-caffeine and the OmegaT plug-in? Or does this solution imply some changes to the code?
Apertium would have to find out somehow which monolingual dictionary to look into, wouldn't it? I intend to start playing around with Swedish (sv), Danish (da) and Norwegian bokmål (nb): can I test drive my dictionaries and rules?

> We are near the end of the project period of Google Summer of Code and
> therefore I think the right advice to give to Per is: get started, and we
> can promise that something will be ready to make it usable from OmegaT
> when you need it.

Sounds great! But I'm a very impatient person. I might ponder over some problem for a long time, occasionally even for years, but as soon as I find a feasible solution I want to implement it immediately!

Yours,
Per Tunedal
From: Mikel A. <art...@gm...> - 2012-08-11 13:24:20
|
> > Note that da->sv hasn't been developed nor released at all.
>
> Yet it's in Apertium-caffeine! And quite fun to play around with.

It was my mistake to include da->sv in the first release; Jacob told me about it and I updated the pair to remove that mode. Anyway, this is a good occasion to test whether the updating feature of Apertium Caffeine (and the OmegaT plug-in) is working correctly or not... Doesn't it tell you that an update is available for that language pair when you launch it? After you update it, da->sv shouldn't be available anymore...
From: Mikel A. <art...@gm...> - 2012-08-11 14:23:21
|
On Sat, Aug 11, 2012 at 11:55 AM, Tino Didriksen <tin...@gm...> wrote:

> On Fri, Aug 10, 2012 at 10:16 PM, Mikel Artetxe <art...@gm...> wrote:
>
>> 1) Invoke it as an external program.
>
> Probably the easiest to get working, but does add a silly text generation
> and parsing step.

I've implemented it at revision 40279 <http://apertium.svn.sourceforge.net/viewvc/apertium?view=revision&revision=40279>. My code is quite clumsy and needs more work (it assumes that all the parameters are paths to existing files and always tries to extract them, it doesn't deal with whitespace, it unnecessarily copies files to a temporary directory even if the file was directly accessible...), but it does the trick. So now it is possible to use Apertium Caffeine or the OmegaT plug-in with language pairs that depend on CG, as long as you have CG installed on your machine. I haven't created packages for those language pairs (and I think that we shouldn't do it), so you will need to create the packages yourself and install them manually.

>> 2) Create a Java interface for CG using JNI. ... For instance, just
>> looking at the installation instructions I see that it depends on some
>> external libraries, so things start getting more complex...
>
> Boost is header-only, so doesn't add any files to the distribution.
> libtcmalloc is optional.
> ICU is the heavy one. I've looked at removing ICU and making a UTF-8-only
> version of CG-3, since everyone uses just UTF-8 these days. The key problem
> with that is regular expressions: I pass regex off to ICU's very nice
> Unicode character class (e.g. \p{Katakana}) capable regex engine.
> From what I could find, the only C++ engines capable of UTF-8 and Unicode
> character classes are ICU and PCRE, so that would be trading one library
> for another less capable one.

I don't have much experience with JNI, but I would say that it would probably be trickier than it might seem... libcg3.so takes 900 KB on my machine.
The JARs would need to include a library for, at least, Linux, Windows and OS X, so they would be, at least, about 3 MB bigger (and that's without taking ICU into account). At the same time, we would need to make it part of lttoolbox-java (or the programs that are based on it), which would require compiling CG targeting different platforms and using the NDK for Android. All this is certainly doable, but I think it would considerably increase the complexity of our current approach, making it harder to maintain. And then there is ICU...

All in all, I think that this solution goes against one of the main advantages of Java: portability. It would require compiling CG for each platform we would be supporting, and embedding the right binaries for all of them. And we would probably have problems making it work under restricted environments like Java Web Start...

>> 3) Develop a Java port of it. Probably the best solution but, obviously,
>> the hardest one to implement...
>
> Haven't really looked into that as I consider JNI a better solution.
> But, it's all hash maps and hash sets, so maybe not that hard to convert.
> Again, regex is a significant feature and apparently only Java 7 and newer
> gets that right.

As far as I know, java.util.regex has been available since early versions of Java (you can look here <http://docs.oracle.com/javase/tutorial/essential/regex/> for more details about what it offers). But perhaps Java 7 introduces some significant improvements in this field, I don't know...
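[Editor's note: for readers wondering what "invoke it as an external program" amounts to in Java, here is a minimal sketch. It is not Mikel's actual code from r40279; the `cg-proc` command name and its grammar argument are assumptions to be checked against your CG-3 install.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ExternalCg {
    // Pipe text through an arbitrary external command and return its stdout.
    // NOTE: for large inputs, writing stdin and reading stdout should happen on
    // separate threads to avoid a pipe-buffer deadlock; this sketch keeps it simple.
    public static String pipeThrough(List<String> command, String input) throws Exception {
        Process p = new ProcessBuilder(command).start();
        try (Writer w = new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8)) {
            w.write(input);
        }
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) out.append(line).append('\n');
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // For CG this would be something like ("cg-proc", "grammar.bin") --
        // here "cat" just demonstrates the round trip of Apertium stream text.
        System.out.print(pipeThrough(Arrays.asList("cat"), "^word/word<n>$\n"));
    }
}
```

This is the "silly text generation and parsing step" Tino mentioned: the tagged stream is serialized, handed to the child process, and parsed back.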
From: Tino D. <tin...@gm...> - 2012-08-11 14:37:49
|
On Sat, Aug 11, 2012 at 4:23 PM, Mikel Artetxe <art...@gm...> wrote:

> >> 3) Develop a Java port of it. Probably the best solution but, obviously,
> >> the hardest one to implement...
> >
> > Haven't really looked into that as I consider JNI a better solution.
> > But, it's all hash maps and hash sets, so maybe not that hard to convert.
> > Again, regex is a significant feature and apparently only Java 7 and newer
> > gets that right.
>
> As far as I know, java.util.regex has been available since early versions of
> Java (you can look here <http://docs.oracle.com/javase/tutorial/essential/regex/>
> for more details about what it offers). But perhaps Java 7 introduces some
> significant improvements in this field, I don't know...

Java regex before Java 7 worked with ASCII-only semantics. The newer regex engine has many more Unicode features and a new flag, UNICODE_CHARACTER_CLASS <http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS>, to enable proper Unicode handling. Only the newer Java 7 engine with the flag would be ICU-compatible, from what I could read.

--
Tino Didriksen
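[Editor's note: a small sketch, using only standard java.util.regex, of what the Java 7 UNICODE_CHARACTER_CLASS flag changes in practice. The example words are illustrative, not from the thread.]

```java
import java.util.regex.Pattern;

public class UnicodeRegexDemo {
    public static void main(String[] args) {
        // Default semantics: \w is ASCII-only, so a word containing å does not match.
        System.out.println(Pattern.compile("\\w+")
                .matcher("håndkrem").matches());               // false

        // Java 7+: UNICODE_CHARACTER_CLASS makes \w cover Unicode word characters.
        System.out.println(Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher("håndkrem").matches());               // true

        // Script classes like ICU's \p{Katakana} are written \p{IsKatakana} in Java 7.
        System.out.println(Pattern.compile("\\p{IsKatakana}+")
                .matcher("カタカナ").matches());                // true
    }
}
```

Porting CG-3 rules that rely on ICU script classes would therefore only be feasible on Java 7 or newer, as Tino says.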
From: Per T. <per...@op...> - 2012-08-11 15:22:59
|
Hi,

On Sat, Aug 11, 2012, at 16:23, Mikel Artetxe wrote:

> On Sat, Aug 11, 2012 at 11:55 AM, Tino Didriksen <tin...@gm...> wrote:
>
> > On Fri, Aug 10, 2012 at 10:16 PM, Mikel Artetxe <art...@gm...> wrote:
> >
> >> 1) Invoke it as an external program.
> >
> > Probably the easiest to get working, but does add a silly text generation
> > and parsing step.
>
> I've implemented it at revision 40279
> <http://apertium.svn.sourceforge.net/viewvc/apertium?view=revision&revision=40279>.
> My code is quite clumsy and needs more work (it assumes that all the
> parameters are paths to existing files and always tries to extract them, it
> doesn't deal with whitespace, it unnecessarily copies files to a temporary
> directory even if the file was directly accessible...), but it does the
> trick. So now it is possible to use Apertium Caffeine or the OmegaT plug-in
> with language pairs that depend on CG as long as you have CG installed on
> your machine. I haven't created packages for those language pairs (and I
> think that we shouldn't do it), so you will need to create the packages
> yourself and install them manually.

As this would be very useful for me, would you please point me to some documentation and/or give a short description of how to do it? I've got one Windows and two Debian installations to play with.

Yours,
Per Tunedal