# Tree-to-Tree (t2t) Alignment Pipe - Programming Project
# University of Zurich, Institute of Computational Linguistics
# Course: Introduction to multi-lingual text analysis
# README.TXT
# Author: Markus Killer (mki) <mki5600@gmail.com>
# January 2011
# Licensed under the GNU GPLv2
# Release 1.0 (Beta)
* 2011-01-24-t2t-pipe-1.0-small.zip 3.7MB (scripts and docs without pre-computed GIZA++-Dictionaries and test files)
* 2011-01-24-t2t-pipe-1.0-complete 222.9MB (scripts and docs including pre-computed GIZA++-Dictionaries for German-French and test files)
The *Tree-to-Tree (t2t) Alignment Pipe* is a small collection of Python scripts
that co-ordinate the automatic alignment of parallel treebanks from
plain text or xml files. Most of the actual processing is done
by a number of freely available NLP software programmes. Once these
third-party programmes have been installed and the system- and corpus-specific
details have been updated, the pipe is designed to produce automatically aligned
parallel treebanks with a single programme call from a Unix command line.
Currently, German, French and English are fully supported by the scripts and
the programmes called by the scripts.
The generated ``TIGERxml`` files can be easily imported
into the graphical interface of the :program:`Stockholm TreeAligner`. The
second supported output format is ``TMX``. These files can be used as
translation memories in current translation memory systems (tested with
:program:`OmegaT`).
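To illustrate what a ``TMX`` translation memory looks like, the following is a
minimal sketch that writes aligned sentence pairs to a TMX 1.4 tree using only
the Python standard library. It is not the code used by the pipe: the function
name ``build_tmx``, the header values and the example sentences are assumptions
made for illustration.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute (serialized as "xml:lang").
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="de", tgt_lang="fr"):
    """Build a minimal TMX 1.4 tree from (source, target) sentence pairs.

    Hypothetical helper for illustration; header values are placeholders.
    """
    root = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(root, "header", {
        "creationtool": "t2t-pipe-sketch",   # placeholder name
        "creationtoolversion": "0.0",        # placeholder version
        "segtype": "sentence",
        "o-tmf": "plain text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(root, "body")
    for src, tgt in pairs:
        # One translation unit (tu) per aligned sentence pair,
        # with one variant (tuv) per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(root)
```

A tree built this way can be saved with
``build_tmx(pairs).write("out.tmx", encoding="utf-8", xml_declaration=True)``.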
As I am relatively new to NLP and Python programming, there will be a number
of inconsistencies and rather clumsy solutions in the pipe. I am very grateful
for any suggestions on how to improve the programme. Please report bugs to
<mki5600@gmail.com>.
**************************************************************************
* FILES
**************************************************************************
docs/:
* documentation in html, latex and pdf format (incl. source files)
src/:
* config – Configuration File (executable)
* errors – Error messages
* extract_corpus – Extract Corpus
* get_files – Get Files
* info – Programme Information File
* prepare_corpus – Prepare Corpus
* run_parser – Statistical Phrase Structure Parsing
* run_preprocessor – Tokenization Module
* run_snt_align – Sentence Alignment
* run_t2t_align – Tree-to-Tree Alignment
* run_word_align – Word Alignment
* save_output – Save Output Files (executable)
* t2t_pipe – Tree-to-Tree (t2t) Alignment Pipe Main Module (executable)
src/resources/:
* resources.autodoc – Sphinx Autodoc generator (executable)
* resources.combine_autodicts – Combine Hunalign Autodict Files (executable)
* resources.dictcc_extract – Extract entries from dict.cc files (executable)
* resources.sta_alignment_stats – Count Alignment Types in sta.xml-files
* resources.tagsets – POS-Tagset Dictionaries
* resources.brackparser.brackparser – Parse bracketed sentences (Penn)
* resources.brackparser.nodes – Extract Terminals and Nonterminals (Penn)
test/:
* output files of test runs and evaluation files
(including two pictures of good alignment results)
**************************************************************************
* SYSTEM USED
**************************************************************************
OS: Linux (Ubuntu 10.10 - x64)
IDE: Eclipse Helios SR1 with PyDev 1.6.4
Programming Language: Python 2.6
**************************************************************************
* THIRD PARTY SOFTWARE USED IN THIS PROGRAMME (order of appearance in pipe)
*
* # NOT INCLUDED in downloads #
**************************************************************************
- Python NLTK Toolkit, Version: 2.0b8 (Ubuntu 10.10 X64 - Repository),
http://www.nltk.org
- Hunalign, Version: 1.1 (2010), http://mokk.bme.hu/resources/hunalign
- Microsoft Bilingual Sentence Aligner, Version: 1.0 (2003),
http://research.microsoft.com/en-us/downloads
- Vanilla Aligner, Version: 1.0 (1997), http://nl.ijs.si/telri/Vanilla
- GIZA++, Version: 1.0.5 (31.10.2010), http://code.google.com/p/giza-pp
- MOSES, Version: SVN Snapshot of 19.12.2010, http://www.statmt.org/moses
- Berkeley Parser, Version: 1.1 (Sep 2009),
http://code.google.com/p/berkeleyparser
- Stanford Parser, Version: 1.6.5 (30.11.2010),
http://nlp.stanford.edu/software/lex-parser.shtml
- TreeAligner, Version: 2.8.6 (22.03.2009),
http://www.ventsislavzhechev.eu/Home/Software/Software.html
- Stockholm TreeAligner, Version: 1.2.90 (02.06.2010),
http://kitt.cl.uzh.ch/kitt/treealigner/wiki/TreeAlignerDownload
- OmegaT, Version: 2.0.5 Update 4 (02.06.2010), http://www.omegat.org
**************************************************************************
* USAGE
**************************************************************************
There are two ways of starting the pipe:
* update and run src/config.py (from any directory on your system)
or
* update src/config.py in src/ and run:
src/t2t_pipe.py [-1 FIRST_STEP (default=1)] [-2 LAST_STEP (default=7)]
1 extract parallel corpus / add article boundaries
2 tokenize parallel corpus
3 align sentences
4 get word-alignment probabilities
5 parse corpus
6 get tree2tree alignments
7 save output files
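The step selection described above can be sketched as follows. This is an
illustrative sketch only, not the code in src/t2t_pipe.py: the names ``STEPS``,
``select_steps`` and ``parse_args`` are assumptions, and ``argparse`` is used
here although it is not part of Python 2.6.

```python
import argparse

# Step numbers and descriptions as listed in this README.
STEPS = {
    1: "extract parallel corpus / add article boundaries",
    2: "tokenize parallel corpus",
    3: "align sentences",
    4: "get word-alignment probabilities",
    5: "parse corpus",
    6: "get tree2tree alignments",
    7: "save output files",
}


def select_steps(first=1, last=7):
    """Return the step numbers from first to last (inclusive)."""
    if not (1 <= first <= last <= 7):
        raise ValueError("expected 1 <= FIRST_STEP <= LAST_STEP <= 7")
    return list(range(first, last + 1))


def parse_args(argv=None):
    """Parse -1/-2 options mirroring the usage line (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="t2t pipe step selection (sketch)")
    parser.add_argument("-1", dest="first_step", type=int, default=1,
                        metavar="FIRST_STEP")
    parser.add_argument("-2", dest="last_step", type=int, default=7,
                        metavar="LAST_STEP")
    return parser.parse_args(argv)
```

For example, ``parse_args(["-1", "3", "-2", "5"])`` yields ``first_step=3`` and
``last_step=5``, and ``select_steps(3, 5)`` then selects sentence alignment,
word alignment and parsing.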
**************************************************************************
* DOCUMENTATION (including mini evaluation of `Zhechev TreeAligner`)
**************************************************************************
Recommended:
HTML-Documentation (including links to source code):
Open in browser: docs/html/index.html
Other formats:
PDF docs/t2t-pipe-manual.pdf
LaTeX docs/latex
all source files docs/source
**************************************************************************
* CHANGE LOG
**************************************************************************
1.1.beta (not released yet):
- change order of steps in pipe: parse first and then compute word-alignment
probabilities (Berkeley Aligner can make use of parse trees when computing
word-alignment probabilities)
- include support for Gargantua (sentence aligner)