Resources for Closely Related Languages - Browse /Convertor/Convertor.2.0.0 at SourceForge.net

Name	Modified	Size	Downloads / Week
README.md	2015-12-29	7.7 kB	0
Convertor.2.0.0.zip	2015-12-18	3.7 MB	0
Totals: 2 Items		3.7 MB	0

What is this?

A rule-based convertor that converts text from one language to another, closely related language. The code is language independent and relies on language-specific data for the conversion. This could be useful for machine translation, for recycling language technologies between closely related languages, etc.

Availability

File name: Convertor.2.0.0.zip
Full name: Closely Related Languages Convertor
Version: 2.0.0
Size: 3,582 Kb
URL: http://rcrl.sourceforge.net

Language: Language independent

Developers

GB van Huyssteen, S Pilon (linguists)
MJ Puttkammer, M Schlemmer and DR van Niekerk (programmers)
Liesbeth Augustinus, Kirsten Arnauts, Veronique de Gres, Shanna Pettens, Carla-Mari van den Heever, and Daan Wissing (others)

Funding

2008-10: National Research Foundation (GUN: FA2007041600015)

2014-5: Research within this project was financed by the Nederlandse Taalunie and the Department of Arts and Culture of the Government of the Republic of South Africa as part of their cooperation on language and speech technology.

Acknowledgement

When using this, please cite:

Van Niekerk, DR, Van Huyssteen, GB & Puttkammer, MJ. 2015. Closely related languages convertor v2.0.0. Potchefstroom: Centre for Text Technology (CTexT), North-West University.

Software requirements

Python 2.7 (required)
OpenFST with Python bindings (optional)
Apache web server (optional)

Source description

.
convertor.py
COPYING-CODE
COPYING-DATA
decompound.py
g2p.py
LICENCE
README.md
tokenizer.py
data_af2nl
  |-- decompmorphmap.json
  |-- decompwordlist.txt
  |-- defref_gnulls.g2g
  |-- defref_rules.g2g
  |-- lexmap.json
  |-- outlexfreqs.json
  `-- outlex.txt
data_nl2af
  |-- decompmorphmap.json
  |-- decompwordlist.txt
  |-- defref_gnulls.g2g
  |-- defref_rules.g2g
  |-- lexmap.json
  |-- outlexfreqs.json
  `-- outlex.txt
doc
  `-- README.html
examples
  |-- testinput
  `-- webdemo

The root directory contains the main implementation, written in Python. The convertor can be run using the convertor.py script (and tokenizer.py can be used to pre-process input if required), as described in the next section. The other Python source files (decompound.py and g2p.py) are not standalone scripts, but modules used by convertor.py. The data_* directories contain the language resource files in the correct formats; these need to be compiled as described below. The examples sub-directory contains test input and an example setup for a web-served demo (with a CGI script that wraps the output as JSON). The doc sub-directory contains this README in HTML format.

Installation and running

To use the convertor, the language resources need to be prepared after extracting the source code. Depending on the language data to be used (Afrikaans to Dutch and Dutch to Afrikaans reside in data_af2nl and data_nl2af respectively, as shown in the directory listing above), the following command can be used to do this:

python convertor.py compile DATA_DIRECTORY

This should create a file, convertor.pickle, which is loaded from the script's working directory during conversion. The script reads text from standard input and writes the converted text to standard output. Note that input text needs to be pre-tokenised, as for example in examples/testinput.

The following test cases can be used to test the Afrikaans to Dutch conversion:

python convertor.py convert metainfo < examples/testinput

This should produce the following output:

Kwaggaijzer CompoundWordlookup
zijn    Wordlookup
een Wordlookup
interessante    Wordlookup
Gruiszandweg    CompoundWordlookup
.   UNCONVERTED

Persoonlijkheidsverwantschap    CompoundWordlookup
schoonmoedergelegenheid CompoundWordlookup
vooral  Wordlookup
pijpe   G2GRewrites
.   UNCONVERTED

An additional script (tokenizer.py) is provided to perform simple tokenisation as required by the convertor. The following:

echo "Hierdie is 'n toetssin." | python tokenizer.py | python convertor.py convert

should produce:

Dit
zijn
een
toetszin
.
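The pre-tokenised format is not specified here beyond the examples; judging from the output shown above, it appears to be one token per line. The following is a minimal sketch under that assumption. It is not the logic of tokenizer.py, and the names `TOKEN_RE`, `tokenise` and `to_convertor_input` are hypothetical.

```python
import re
from typing import List

# Assumed token pattern (not taken from tokenizer.py): a word, optionally
# with a leading apostrophe as in Afrikaans "'n", or a single punctuation
# character.
TOKEN_RE = re.compile(r"'?\w+|[^\w\s]")


def tokenise(text: str) -> List[str]:
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def to_convertor_input(text: str) -> str:
    """Format tokens one per line (the assumed input format)."""
    return "\n".join(tokenise(text)) + "\n"


tokens = tokenise("Hierdie is 'n toetssin.")
```

A pattern this simple will mishandle quoted words and abbreviations, so the provided tokenizer.py remains the safer choice for real input.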

Furthermore, some wrapper code implementing a simple web demo is provided in the examples/webdemo sub-directory, with an HTML form for input and a CGI script providing JSON output. This implementation was tested on standard Apache web servers in a UserDir environment (note: when setting up Apache, remember to enable CGI script execution in the relevant directory); however, it should be easily adaptable to any web-server software.

Development quick start

The convertor implements a number of WordConvertor modules and a simple SentenceConvertor. The SentenceConvertor takes an ordered list of WordConvertors and runs each one in turn until one of the modules returns valid output. It also defines a "score word" method that uses 1-gram frequency to select between multiple possible word translations (these word frequencies are contained in the JSON hash/dictionary file outlexfreqs.json). In future development, the simple SentenceConvertor implementation may be replaced with a more sophisticated one that collects all possible word translations and selects words based on the larger sentence context (e.g. using an N-gram language model).
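The cascade described above can be sketched roughly as follows. This is an illustration only, not the code in convertor.py: the function names, the callable-based convertor type and the toy data are all assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: a word convertor returns candidate translations,
# or an empty list when it cannot convert the word.
WordConvertor = Callable[[str], List[str]]


def score_word(word: str, unigram_freqs: Dict[str, int]) -> int:
    """Score a candidate by its 1-gram frequency (0 if unseen)."""
    return unigram_freqs.get(word, 0)


def convert_word(
    word: str,
    convertors: Sequence[Tuple[str, WordConvertor]],
    unigram_freqs: Dict[str, int],
) -> Tuple[str, str]:
    """Try each convertor in order; return (output word, module name)."""
    for name, convertor in convertors:
        candidates = convertor(word)
        if candidates:
            # The first module with output wins; frequency breaks ties
            # between its candidates.
            best = max(candidates, key=lambda c: score_word(c, unigram_freqs))
            return best, name
    return word, "UNCONVERTED"


# Toy data for illustration only (not taken from the shipped resources).
toy_lexmap = {"is": ["is", "zijn"], "huis": ["huis"]}
toy_freqs = {"is": 50, "zijn": 10}
convertors = [
    ("Wordlookup", lambda w: toy_lexmap.get(w, [])),
    ("G2GRewrites", lambda w: [w.replace("y", "ij")] if "y" in w else []),
]

result = [convert_word(w, convertors, toy_freqs) for w in ["is", "yster", "."]]
```

Because the first module that returns output ends the search, the order of the list determines which module's answer is used.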

The conversion process is briefly illustrated in the following diagram:

[Figure: diagram of the conversion process]

Developers wishing to add new language pairs need to implement at least one of the WordConvertor modules and prepare data as described below (developers may also edit the compile_sentconverter function in convertor.py according to their needs). The currently implemented WordConvertor modules and corresponding data formats are as documented below.

Note: All text files are in UNIX UTF-8 format.

`Wordlookup`

The Wordlookup module tries to convert words using a simple lookup process. It uses a "word mapping" and a list of words that are part of the lexicon of the output language. The word mapping file (lexmap.json) is a JSON file containing a hash/dictionary of lists (each of which may contain multiple possible word translations). The output word list (outlex.txt) is a simple text file.
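As a rough illustration of these two data formats, the sketch below parses toy stand-ins for lexmap.json and outlex.txt and performs a lookup. How the module actually combines the two resources is not stated here; passing a word through unchanged when it already occurs in the output lexicon is an assumption, as are the function names and the assumption that outlex.txt holds one word per line.

```python
import json
from typing import Dict, List, Set

# Toy stand-ins for lexmap.json and outlex.txt (contents are invented).
LEXMAP_JSON = '{"hierdie": ["deze", "dit"], "sin": ["zin"]}'
OUTLEX_TXT = "deze\ndit\nzin\nwater\n"


def load_lexmap(text: str) -> Dict[str, List[str]]:
    """Parse the word mapping: source word -> list of translations."""
    return json.loads(text)


def load_outlex(text: str) -> Set[str]:
    """Parse the output-language word list (assumed one word per line)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def word_lookup(word: str, lexmap: Dict[str, List[str]],
                outlex: Set[str]) -> List[str]:
    """Return candidate translations, or [] if the word is unknown."""
    if word in lexmap:
        return list(lexmap[word])
    # Assumption: a word already in the output lexicon is kept as-is.
    if word in outlex:
        return [word]
    return []


lexmap = load_lexmap(LEXMAP_JSON)
outlex = load_outlex(OUTLEX_TXT)
```

With this toy data, `word_lookup("hierdie", lexmap, outlex)` yields two candidates, which is the situation the 1-gram scoring is meant to resolve.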

`CompoundWordlookup`

The CompoundWordlookup module tries to split compound words using an algorithm that takes a word list and "morpheme map" to find possible constituents and applies Wordlookup on the result. The wordlist is a simple text file (decompwordlist.txt) and the morpheme map is a JSON hash/dictionary mapping string to string (decompmorphmap.json).
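The splitting algorithm itself is not described here, so the following is only a sketch of one plausible approach: a recursive left-to-right split into known words, optionally separated by linking morphemes. It assumes that the morpheme map sends a source-language linking element (such as the "s" in "staatsdiens") to its output-language form; the shipped decompmorphmap.json may encode something different. All names and data below are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Part = Tuple[str, bool]  # (text, is_linking_morpheme)


def split_compound(word: str, wordlist: Set[str], morphmap: Dict[str, str],
                   min_len: int = 3) -> Optional[List[Part]]:
    """Return parts covering `word`, or None if no split is found."""
    if word in wordlist:
        return [(word, False)]
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        if head not in wordlist:
            continue
        # First try the rest as-is, then with a linking morpheme removed.
        tail = split_compound(rest, wordlist, morphmap, min_len)
        if tail is not None:
            return [(head, False)] + tail
        for link in morphmap:
            if rest.startswith(link) and len(rest) - len(link) >= min_len:
                tail = split_compound(rest[len(link):], wordlist,
                                      morphmap, min_len)
                if tail is not None:
                    return [(head, False), (link, True)] + tail
    return None


def convert_compound(word: str, wordlist: Set[str], morphmap: Dict[str, str],
                     lexmap: Dict[str, List[str]]) -> Optional[str]:
    """Split a compound, translate each part, and rejoin the result."""
    parts = split_compound(word, wordlist, morphmap)
    if parts is None or len(parts) < 2:
        return None  # not a compound; leave it to the other modules
    out = []
    for text, is_link in parts:
        if is_link:
            out.append(morphmap[text])
        elif lexmap.get(text):
            out.append(lexmap[text][0])  # take the first candidate
        else:
            return None
    return "".join(out)


# Toy data for illustration only.
wordlist = {"staat", "diens", "toets", "sin"}
morphmap = {"s": "s"}
lexmap = {"staat": ["staat"], "diens": ["dienst"],
          "toets": ["toets"], "sin": ["zin"]}
```

This sketch returns the first split it finds and takes the first translation of each part; a fuller implementation would presumably rank alternative splits and candidates.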

`G2GRewrites`

The G2GRewrites module converts words by applying grapheme-to-grapheme rewrite rules. These rules can be extracted from parallel word lists using the Default&Refine algorithm. The rules are contained in two "semicolon format" text files (defref_gnulls.g2g and defref_rules.g2g) generated by the Default&Refine software.
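To illustrate the general idea of context-dependent grapheme rewriting, the sketch below tries an ordered list of rules per grapheme, from specific contexts to a context-free default, and applies the first one that matches. It does not parse the .g2g files or model the graphemic nulls in defref_gnulls.g2g; the in-memory rule representation, the "#" boundary marker and the toy rules are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical rule: (left context, right context, output).
# An empty context string matches anything.
Rule = Tuple[str, str, str]


def rewrite(word: str, rules: Dict[str, List[Rule]]) -> str:
    """Rewrite each grapheme using the first rule whose context matches."""
    padded = "#" + word + "#"  # '#' marks word boundaries (an assumption)
    out = []
    for i in range(1, len(padded) - 1):
        grapheme = padded[i]
        left, right = padded[:i], padded[i + 1:]
        for lctx, rctx, output in rules.get(grapheme, []):
            if left.endswith(lctx) and right.startswith(rctx):
                out.append(output)
                break
        else:
            # No rule matched: copy the grapheme unchanged.
            out.append(grapheme)
    return "".join(out)


# Toy rules for illustration only, ordered from specific to default.
toy_rules = {
    "y": [("", "", "ij")],
    "s": [("#", "i", "z"), ("", "", "s")],
}

converted = [rewrite(w, toy_rules) for w in ["ys", "sin"]]
```

Here the word-initial "s" before "i" is rewritten to "z", while any other "s" falls through to the default rule and is kept.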

Source: README.md, updated 2015-12-29