Closely Related Languages Convertor
What is this?
A rule-based convertor for converting text from one language to another, closely related language. The code itself is language independent and relies on language-specific data files for each language pair. This could be useful for machine translation, for recycling language technologies between closely related languages, etc.
Availability
File name: Convertor.2.0.0.zip
Full name: Closely Related Languages Convertor
Version: 2.0.0
Size: 3,582 kB
URL: http://rcrl.sourceforge.net
Language: Language independent
Developers
- GB van Huyssteen, S Pilon (linguists)
- MJ Puttkammer, M Schlemmer and DR van Niekerk (programmers)
- Liesbeth Augustinus, Kirsten Arnauts, Veronique de Gres, Shanna Pettens, Carla-Mari van den Heever, and Daan Wissing (others)
Funding
2008-10: National Research Foundation (GUN: FA2007041600015)
2014-15: Research within this project was financed by the Nederlandse Taalunie and the Department of Arts and Culture of the Government of the Republic of South Africa as part of their cooperation on language and speech technology.
Acknowledgement
When using this software, please cite:
Van Niekerk, DR, Van Huyssteen, GB & Puttkammer, MJ. 2015. Closely related languages convertor v2.0.0. Potchefstroom: Centre for Text Technology (CTexT), North-West University.
Copyright 2015 CTexT. Full licence agreement can be found in
COPYING-CODE and COPYING-DATA.
Software requirements
- Python 2.7 (required)
- OpenFST with Python bindings (optional)
- Apache web server (optional)
Source description
.
|-- convertor.py
|-- COPYING-CODE
|-- COPYING-DATA
|-- decompound.py
|-- g2p.py
|-- LICENCE
|-- README.md
|-- tokenizer.py
|-- data_af2nl
|   |-- decompmorphmap.json
|   |-- decompwordlist.txt
|   |-- defref_gnulls.g2g
|   |-- defref_rules.g2g
|   |-- lexmap.json
|   |-- outlexfreqs.json
|   `-- outlex.txt
|-- data_nl2af
|   |-- decompmorphmap.json
|   |-- decompwordlist.txt
|   |-- defref_gnulls.g2g
|   |-- defref_rules.g2g
|   |-- lexmap.json
|   |-- outlexfreqs.json
|   `-- outlex.txt
|-- doc
|   `-- README.html
`-- examples
    |-- testinput
    `-- webdemo
The root directory contains the main implementation, written in
Python. The convertor is run via the convertor.py script
(tokenizer.py can be used to pre-process input if required), as
described in the next section. The other Python source files
(decompound.py and g2p.py) are not standalone scripts, but modules
used by convertor.py. The data_* directories contain the language
resource files in the correct formats; these need to be compiled as
described below. The examples sub-directory contains test input and
an example setup for a web-served demo (with a CGI script providing
a JSON output wrapper). The doc sub-directory contains this README
in HTML format.
Installation and running
To use the convertor, first extract the source code, then prepare
the language resources. Depending on the language pair to be used
(Afrikaans-to-Dutch and Dutch-to-Afrikaans data reside in
data_af2nl and data_nl2af respectively), this is done with the
following command:
python convertor.py compile DATA_DIRECTORY
This should create a file, convertor.pickle, which is loaded from the
script's working directory during conversion. The script converts
text via standard input and output. Note that input text needs to be
pre-tokenised, as for example in examples/testinput.
The following command can be used to test the Afrikaans-to-Dutch conversion:
python convertor.py convert metainfo < examples/testinput
This should produce the following output:
Kwaggaijzer CompoundWordlookup
zijn Wordlookup
een Wordlookup
interessante Wordlookup
Gruiszandweg CompoundWordlookup
. UNCONVERTED
Persoonlijkheidsverwantschap CompoundWordlookup
schoonmoedergelegenheid CompoundWordlookup
vooral Wordlookup
pijpe G2GRewrites
. UNCONVERTED
An additional script (tokenizer.py) is provided to perform simple
tokenisation as required by the convertor. The following:
echo "Hierdie is 'n toetssin." | python tokenizer.py | python convertor.py convert
should produce:
Dit
zijn
een
toetszin
.
Furthermore, some wrapper code is provided in the examples/webdemo
sub-directory, which implements a simple web demo with an HTML form
for input and a CGI script providing JSON output. This implementation
was tested on standard Apache web servers in a UserDir environment
(note: when setting up Apache, remember to enable CGI script execution
in the relevant directory), but it should be easily adaptable to other
web-server software.
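For orientation, the JSON-wrapping part of such a CGI script could be as simple as the following sketch. This is hypothetical code, not the shipped script in examples/webdemo; the function name and JSON field names are invented here.

```python
#!/usr/bin/env python
# Hypothetical sketch of a JSON output wrapper for a CGI demo script;
# the actual code shipped in examples/webdemo may differ.
import json

def json_response(input_text, converted_tokens):
    """Build the JSON body that the CGI script returns to the HTML form."""
    return json.dumps({"input": input_text,
                       "output": " ".join(converted_tokens)})

# In the CGI script itself, the HTTP header would be printed first:
#   print("Content-Type: application/json")
#   print()
#   print(json_response(text, tokens))
```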
Development quick start
The convertor implements a number of WordConvertor modules and a
simple SentenceConvertor. The SentenceConvertor implemented here
takes an ordered list of WordConvertors and runs each one until one
of the modules returns valid output. It also defines a "score word"
method using 1-gram frequency to select between multiple possible word
translations (these word frequencies are contained in the JSON
hash/dictionary file outlexfreqs.json). In future development, the
simple SentenceConvertor implementation may be replaced with a more
sophisticated one that collects all possible word translations and
selects words based on the larger sentence context (e.g. using an
N-gram language model).
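Sketched in Python, the strategy described above might look as follows. All names here are illustrative, not the actual convertor.py API: each word convertor is modelled as a function returning a (possibly empty) list of candidate translations.

```python
# Minimal sketch of the SentenceConvertor strategy: run an ordered list
# of word convertors, keep the first module that yields candidates, and
# break ties between candidates by unigram frequency (as would be
# loaded from outlexfreqs.json).

def make_sentence_convertor(word_convertors, unigram_freqs):
    def score(word):
        # "score word" method: 1-gram frequency of the candidate
        return unigram_freqs.get(word, 0)

    def convert(tokens):
        out = []
        for token in tokens:
            converted = None
            for convertor in word_convertors:
                candidates = convertor(token)  # list of translations, or empty
                if candidates:
                    converted = max(candidates, key=score)
                    break
            # an unconverted token is passed through unchanged
            out.append(converted if converted is not None else token)
        return out

    return convert

# Usage with toy data (invented entries, not from the released files):
lookup = {"zijn": ["zijn"], "is": ["zijn", "is"]}.get
convert = make_sentence_convertor(
    [lambda w: lookup(w, [])],
    {"zijn": 10, "is": 3},
)
```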
The conversion process is briefly illustrated in the following diagram:
Developers wishing to add new language pairs need to implement at
least one of the WordConvertor modules and prepare data as described
below (developers may also edit the compile_sentconverter function
in convertor.py according to their needs). The currently implemented
WordConvertor modules and their corresponding data formats are
documented below.
Note: All text files are in UNIX UTF-8 format.
Wordlookup
The Wordlookup module tries to convert words using a simple lookup
process. It uses a "word mapping" and a list of words that are part
of the lexicon of the output language. The word mapping file
(lexmap.json) is a JSON file containing a hash/dictionary of lists
(each list may contain multiple possible word translations). The
output word list (outlex.txt) is a simple text file.
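As a concrete illustration, a minimal lookup over these two resources might work as follows. The data entries and the function are invented for this sketch; the actual Wordlookup module may behave differently.

```python
import json

# Toy illustration of the two Wordlookup resources, inlined here
# (invented entries, not taken from the released data files).
lexmap = json.loads('{"een": ["een"], "baie": ["veel", "erg"]}')  # lexmap.json
outlex = set("een\nveel\nerg".splitlines())                       # outlex.txt

def word_lookup(word):
    """Return candidate translations that occur in the output lexicon."""
    return [w for w in lexmap.get(word, []) if w in outlex]
```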
CompoundWordlookup
The CompoundWordlookup module tries to split compound words, using an
algorithm that takes a word list and a "morpheme map" to find possible
constituents, and then applies Wordlookup to the result. The word list
is a simple text file (decompwordlist.txt) and the morpheme map is a
JSON hash/dictionary mapping string to string (decompmorphmap.json).
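The splitting idea can be sketched naively as follows. The real algorithm in decompound.py (and the follow-up Wordlookup step) is more involved; this only illustrates the data flow through the two resource files.

```python
# Naive sketch of decompounding: find a split of the input into known
# constituents (decompwordlist.txt), then rewrite each constituent via
# the morpheme map (decompmorphmap.json). Illustrative only.

def decompound(word, wordlist):
    """Return a list of constituents if `word` splits into wordlist entries."""
    if word in wordlist:
        return [word]
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in wordlist:
            rest = decompound(tail, wordlist)
            if rest is not None:
                return [head] + rest
    return None

def convert_compound(word, wordlist, morphmap):
    """Split `word` and rewrite each constituent via the morpheme map."""
    parts = decompound(word, wordlist)
    if parts is None:
        return None
    return "".join(morphmap.get(p, p) for p in parts)
```

For example, with the toy word list {"toets", "sin"} and morpheme map {"sin": "zin"}, the Afrikaans compound "toetssin" splits into "toets" + "sin" and comes out as "toetszin".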
G2GRewrites
The G2GRewrites module converts words by applying
grapheme-to-grapheme rewrite rules. These rules can be extracted
from parallel word lists using the Default&Refine algorithm. The
rules are contained in two "semicolon format" text files
(defref_gnulls.g2g and defref_rules.g2g) generated by that software.
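The .g2g rule format is defined by the Default&Refine tooling, and real Default&Refine rules carry left/right context and an ordering. As a rough illustration of grapheme-to-grapheme rewriting only, a toy version with unconditional single-grapheme mappings could look like this (the rule format here is invented):

```python
# Illustrative only: apply unconditional per-grapheme rewrite rules
# left to right. Real Default&Refine rules are context-sensitive and
# ordered; this toy version just shows the general idea.

def apply_g2g(word, rules):
    """Rewrite `word` grapheme by grapheme using a dict of rules."""
    return "".join(rules.get(ch, ch) for ch in word)
```

For instance, apply_g2g("pype", {"y": "ij"}) turns Afrikaans "pype" into "pijpe", as seen in the test output earlier in this README.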