---------------------------------------------------------
README: Resources for Closely Related Languages Convertor
---------------------------------------------------------
File name: Convertor.1.2.0.zip
Full name: Closely Related Languages Convertor
Version: 1.2.0
Size: 433 KB
URL: http://rcrl.sourceforge.net (previously: http://d2ac-a2dc.sourceforge.net)
Language: Language independent
-------------------------------------------------------------------------------------
When using this, please cite:
Van Huyssteen, G. and Pilon, S. 2009. Rule-based Conversion of Closely-related Languages: A Dutch-to-Afrikaans Convertor.
20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Stellenbosch, South Africa. pp 23-28.
http://www.prasa.org/index.php/proceedings.html
Copyright 2011 CTexT
Full licence agreements in zipped file (COPYING-CODE and COPYING-DATA).
Developers: GB van Huyssteen, S Pilon (linguists), MJ Puttkammer and M Schlemmer (programmers), assisted by various students and colleagues, including Liesbeth Augustinus, Kirsten Arnauts, Veronique de Gres, Shanna Pettens, Carla-Mari van den Heever, and Daan Wissing.
-------------------------------------------------------------------------------------
What is the Closely Related Languages Convertor?
************************************************
The Closely Related Languages Convertor is a rule-based convertor that converts text from one language to another, closely related language. The code is language independent and relies on language-specific data (lexicons and conversion rules) to perform the conversion. This could be useful for machine translation, recycling of language technologies between closely related languages, etc.
Installation and Quick Start
****************************
(1) Download Convertor.1.2.0.zip
(2) Unzip
(3) Open the Convertor.1.2.0 folder; all the files needed to run the program are in there.
Running the Convertor
*********************
The example below converts a list of words from Dutch to Afrikaans. Note, however, that the code is language independent: any language can serve as input language (Dutch in our example) and any closely related language as output language (Afrikaans in our example), provided that the LEXICONS and CONVERSION RULES (see below) have been adapted accordingly.
(1) Open the file Input.LangIn.txt
+ The input file Input.LangIn.txt should contain a list of source language (e.g. Dutch) tokens to be converted, each on a separate line.
+ No proper names, acronyms, abbreviations or spelling errors should be included in the list.
(2) Run the Perl script Convertor.1.2.0.pl
(3) Open the file Output.LangOut.txt
+ The converted tokens of the input file can be found in the output file Output.LangOut.txt.
+ The output file is a list with each token printed on a separate line, together with a tag to indicate the nature of the output.
+ The following tags are used:
  <Untranslated>          the word could not be translated
  <Translated>            the word is translated using the rules in MorphRules.txt and/or G2GRules.txt
  <Lex.LangOut>           the word is translated as an identical cognate using Lex.LangOut.txt
  <Lex.LangIn-LangOut>    the word is translated as a non-cognate or false friend using Lex.LangIn-LangOut.txt
+ Example:
  Input file: Input.LangIn.txt    Output file: Output.LangOut.txt
  eieren                          eieren<Untranslated>
  kindje                          kindjie<Translated>
  boom                            boom<Lex.LangOut>
  banaan                          piesang<Lex.LangIn-LangOut>
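For post-processing, the tagged output lines can be split into token and tag. The following is only a minimal sketch in Python (not part of the Convertor distribution); it assumes that every output line has the shape shown in the example above, i.e. the converted token immediately followed by one tag in angle brackets. The function names are hypothetical.

```python
import re

# Assumed line shape (taken from the example above): the converted token
# immediately followed by exactly one tag in angle brackets.
LINE_RE = re.compile(r"^(?P<token>.*?)<(?P<tag>[^<>]+)>\s*$")


def parse_output_line(line):
    """Split one output line into (token, tag); tag is None if no tag is found."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return line.strip(), None
    return match.group("token").strip(), match.group("tag")


def parse_output(lines):
    """Parse all non-empty lines of an output file."""
    return [parse_output_line(line) for line in lines if line.strip()]


# The four output lines from the example above.
sample = [
    "eieren<Untranslated>",
    "kindjie<Translated>",
    "boom<Lex.LangOut>",
    "piesang<Lex.LangIn-LangOut>",
]

for token, tag in parse_output(sample):
    print(token, tag)
```

To process a real output file, pass the lines of Output.LangOut.txt to parse_output() instead of the sample list.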
Files in the Convertor folder
*****************************
PERL SCRIPT
Convertor.1.2.0.pl
+ The script has to be run after the input file has been created, in order to translate the source language (e.g. Dutch) tokens into target language (e.g. Afrikaans) tokens.
CONVERSION MODULES
The conversion rules are executed by the following conversion modules:
GenMods.pm
+ Generates MORPHModule.pm and G2GModule.pm
MORPHModule.pm
+ Generated from MorphRules.txt by GenMods.pm
+ This module handles systematic differences between Afrikaans and Dutch that can be handled on the morphosyntactic level.
+ The regular expressions in this module are based on the rules defined in MorphRules.txt
G2GModule.pm
+ Generated from G2GRules.txt by GenMods.pm
+ This module converts Dutch graphemes to Afrikaans graphemes in a systematic way.
+ The regular expressions in this module are based on the rules included in G2GRules.txt
VARIABLES
Clusters.txt
+ In this text file, customized variables can be defined.
+ For the current implementation, variables for diphthongs, consonants, vowels, fricatives, plosives, liquids and nasals are defined.
INPUT AND OUTPUT FILES
Input.LangIn.txt
+ The input file is a list of tokens from the source language (in this case Dutch) to be translated, each on a separate line.
+ No proper names, acronyms, abbreviations or spelling errors should be included in the list.
Output.LangOut.txt
+ The output file is a list with each token printed on a separate line, together with a tag to indicate the nature of the output.
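If the source material is running text rather than a word list, the one-token-per-line format of Input.LangIn.txt can be produced with a small helper. The following Python sketch is not part of the Convertor; the function names, the tokenisation pattern and the UTF-8 encoding are assumptions. It does not remove proper names, acronyms, abbreviations or spelling errors, which still have to be filtered out by hand.

```python
import re

# Assumption: a token is a run of letters, optionally joined by hyphens or
# apostrophes (e.g. "zee-egel"). Digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def tokens_from_text(text):
    """Return the word tokens of `text` in their original order."""
    return TOKEN_RE.findall(text)


def write_input_file(tokens, path="Input.LangIn.txt"):
    """Write one token per line. UTF-8 is an assumption, not a documented requirement."""
    with open(path, "w", encoding="utf-8") as handle:
        for token in tokens:
            handle.write(token + "\n")


print(tokens_from_text("de boom en de banaan."))
```

Calling write_input_file(tokens_from_text(text)) then creates the input file in the current folder.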
LEXICONS
Note that the Convertor.1.2.0 folder only contains some example lexicons for Afrikaans and Dutch, which can be replaced by lexicons for any other pair of closely related languages. Updated lexicons for Afrikaans and Dutch can be downloaded from the Afr-DutchListsAndRules folder in the SourceForge project.
Lex.LangIn-LangOut.txt
+ A bilingual list of source language (e.g. Dutch) tokens, each on a separate line and followed by a tab and its potential target language (e.g. Afrikaans) translation equivalents.
+ The lexicon aims mainly at covering false friends and non-cognates (but also some cases of non-identical cognates).
+ Translation alternatives are separated by two forward slashes, e.g. boerderij<tab>boerdery//plaas
Lex.LangOut.txt
+ A list consisting of target language (e.g. Afrikaans) tokens, which is used for look-up purposes in order to identify identical cognates.
+ Since false friends are covered by Lex.LangIn-LangOut.txt, look-up in Lex.LangOut.txt follows only after translation from Lex.LangIn-LangOut.txt, e.g. koei
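The lexicon format and the look-up order described above can be illustrated as follows. This is only a sketch in Python, not the code of Convertor.1.2.0.pl: the function names are hypothetical, the example entry for banaan is built from the example output shown earlier, and joining the alternatives with two forward slashes in the result is an assumption. Only the two lexicon look-ups are shown; the conversion rules are left out.

```python
def parse_bilingual_line(line):
    """Split a line such as boerderij<tab>boerdery//plaas into source and alternatives."""
    source, _, targets = line.rstrip("\n").partition("\t")
    alternatives = [t.strip() for t in targets.split("//") if t.strip()]
    return source.strip(), alternatives


def load_bilingual(lines):
    """Build a dictionary from the lines of a Lex.LangIn-LangOut.txt style file."""
    lexicon = {}
    for line in lines:
        if line.strip():
            source, alternatives = parse_bilingual_line(line)
            lexicon[source] = alternatives
    return lexicon


def lexicon_lookup(token, bilingual, target_words):
    """Consult the bilingual lexicon first, then the target language lexicon.

    Returns (translation, tag), or None if neither lexicon contains the token.
    """
    if token in bilingual:
        # Assumption: alternatives are joined with "//" in the result.
        return "//".join(bilingual[token]), "Lex.LangIn-LangOut"
    if token in target_words:
        return token, "Lex.LangOut"
    return None


bilingual = load_bilingual(["boerderij\tboerdery//plaas", "banaan\tpiesang"])
target_words = {"boom"}  # stands in for the contents of Lex.LangOut.txt

for word in ["banaan", "boom", "boerderij", "kindje"]:
    print(word, lexicon_lookup(word, bilingual, target_words))
```

Checking the bilingual lexicon first ensures that a false friend which also happens to occur in the target language lexicon is not mistaken for an identical cognate.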
RULES FILES
Note that the Convertor.1.2.0 folder only contains some example rules files for Afrikaans and Dutch, which can be replaced by rules files for any other pair of closely related languages. Updated rules files for Afrikaans and Dutch can be downloaded from the Afr-DutchListsAndRules folder in the SourceForge project.
MorphRules.txt
+ Contains morpheme rules, written as regular expressions.
+ In principle, only morphs and allomorphs are included.
+ In order to facilitate making adaptations and changes, all regular expressions are entered in the format <SearchString><tab><SubstitutionString>.
+ Use standard Perl regular expression syntax, e.g. ^niet<tab>nie
G2GRules.txt
+ Contains grapheme rules, written as regular expressions.
+ In principle, only rules that apply on a submorphemic level are included, such as clusters of vowels or consonants.
+ In order to facilitate making adaptations and changes, all regular expressions are entered in the format <SearchString><tab><SubstitutionString>.
+ Use standard Perl regular expression syntax, e.g. ven$<tab>we
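To illustrate how rules in the <SearchString><tab><SubstitutionString> format behave, the following Python sketch reads such lines and applies them to a token. It is not the generated MORPHModule.pm or G2GModule.pm code. The function names are hypothetical, applying every rule in file order is an assumption, and the two example rules from MorphRules.txt and G2GRules.txt are combined in one list purely for illustration. Python's re module also differs from Perl in places (for instance, a backreference written $1 in a Perl substitution is \1 in Python), so only simple rules carry over unchanged.

```python
import re


def parse_rules(lines):
    """Read <SearchString><tab><SubstitutionString> lines into compiled rules."""
    rules = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        search, _, substitution = line.partition("\t")
        rules.append((re.compile(search), substitution))
    return rules


def apply_rules(token, rules):
    """Apply every rule in order (an assumption about rule ordering)."""
    for pattern, substitution in rules:
        token = pattern.sub(substitution, token)
    return token


# The two example rules given above.
rules = parse_rules(["^niet\tnie", "ven$\twe"])

print(apply_rules("niet", rules))
print(apply_rules("leven", rules))
```

With these two rules, niet becomes nie and leven becomes lewe; a token that matches no rule is returned unchanged.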
Tags generated by the Convertor (in Output.LangOut.txt)
*******************************************************
<Lex.LangIn-LangOut>
+ A token receives this tag if it is translated as a false friend or non-cognate (by means of the Lex.LangIn-LangOut.txt lexicon).
<Lex.LangOut>
+ A token receives this tag if it is translated as an identical cognate (by means of the Lex.LangOut.txt lexicon).
<Translated>
+ A token receives this tag if it is translated by means of the grapheme or morpheme rules (using the rules included in MorphRules.txt and/or G2GRules.txt).
<Untranslated>
+ A token receives this tag if the input token could not be converted by the program.
Changes from previous versions
******************************
Tags/files used in the current version    Tags/files used in Van Huyssteen & Pilon (2009)
Input.LangIn.txt                          List.D2AC.Ndl.txt
Output.LangOut.txt                        List.D2AC.Afr.txt
<Lex.LangIn-LangOut>                      <<D2ALex>>
<Lex.LangOut>                             <<AfrLex>>
<Translated>                              <<Translated>>
<Untranslated>                            <<Untranslated>>
Lex.LangIn-LangOut.txt                    D2ALex.txt
Lex.LangOut.txt                           AfrLex.txt
Clusters.txt                              Variables.txt
MORPHModule.pm                            MorphModule.pm
G2GModule.pm                              G2GModule.pm
MorphRules.txt                            MorphRules.txt
G2GRules.txt                              G2GRules.txt
References
**********
Van Huyssteen, G. and Pilon, S. 2009. Rule-based Conversion of Closely-related Languages: A Dutch-to-Afrikaans Convertor.
20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Stellenbosch, South Africa. pp 23-28.
http://www.prasa.org/index.php/proceedings.html
Pilon, S., Van Huyssteen, G.B. and Augustinus, L. 2010. Converting Afrikaans to Dutch for technology recycling.
Proceedings of the 2010 Conference of the Pattern Recognition Association of South Africa (PRASA). 22-23 November. Stellenbosch, South Africa. ISBN: 978-0-7992-2470-2. pp 219-224.