# Tree-to-Tree (t2t) Alignment Pipe - Programming Project
# University of Zurich, Institute of Computational Linguistics
# Course: Introduction to multi-lingual text analysis
# README.TXT
# Author: Markus Killer (mki) <mki5600@gmail.com>
# December 2013
# Licensed under the GNU GPLv2
# Project Homepage: http://sourceforge.net/projects/t2t-pipe
# See t2t-pipe-demo.ogv for a demo screen capture of a complete run (4 min 18 sec)
* The t2t-pipe is introduced in Killer/Sennrich/Volk (2011) - see BibTeX entry below
# Release 1.4 (Beta - in development)
* see SVN-Repository for latest version (change log at the bottom of this file)
(http://t2t-pipe.svn.sourceforge.net/viewvc/t2t-pipe)
# Release 1.3 (Beta)
* 2011-10-14-t2t-pipe-1.3.zip 4.7MB (scripts and docs without pre-computed
GIZA++-Dictionaries and test files)
* For a complete snapshot of all project files, download GNU Tarball (approx. 220MB)
from SVN Repository (http://t2t-pipe.svn.sourceforge.net/viewvc/t2t-pipe)
The *Tree-to-Tree (t2t) Alignment Pipe* is a small collection of Python scripts
that coordinate the automatic alignment of parallel treebanks from
plain-text or XML files. The heavy lifting is done by a number of freely
available NLP software programs. Once these third-party programs have been
installed and the system- and corpus-specific details have been configured,
the pipe produces automatically aligned parallel treebanks with a single
program call from a Unix command line.
Currently, German, French and English are fully supported by the scripts and
the programs they call.
The generated ``TIGERxml`` files can be easily imported
into the graphical interface of the :program:`Stockholm TreeAligner`. The
second supported output format is ``TMX``. These files can be used as
translation memories in current translation memory systems (tested with
:program:`OmegaT`).
As I am relatively new to NLP and Python programming, the pipe is likely to
contain some inconsistencies and rather clumsy solutions. I am very grateful
for any suggestions on how to improve the program. Please report bugs to
<mki5600@gmail.com>.
**************************************************************************
* FILES
**************************************************************************
docs/:
* documentation in html, latex and pdf format (incl. source files)
src/:
* config – Configuration File (executable)
* demo – Demo (executable)
* errors – Error messages
* extract_corpus – Extract Corpus
* get_files – Get Files
* info – Program Information File
* prepare_corpus – Prepare Corpus
* run_parser – Statistical Phrase Structure Parsing
* run_preprocessor – Tokenization Module
* run_snt_align – Sentence Alignment
* run_t2t_align – Tree-to-Tree Alignment
* run_word_align – Word Alignment
* save_output – Save Output Files (executable)
* t2t_pipe – Tree-to-Tree (t2t) Alignment Pipe Main Module (executable)
src/resources/:
* resources.autodoc – Sphinx Autodoc generator (executable)
* resources.build_sac_web_corpus – Build SAC Web Corpus (executable)
* resources.combine_autodicts – Combine Hunalign Autodict Files (executable)
* resources.convert_files – Convert Files PDF->XML, etc. (executable)
* resources.dictcc_extract – Extract entries from dict.cc files (executable)
* resources.download_files – Download Files from Web-Server (executable)
* resources.sta_alignment_stats – Count Alignment Types in sta.xml-files
* resources.tagsets – POS-Tagset Dictionaries
* resources.brackparser.brackparser – Parse bracketed sentences (Penn)
* resources.brackparser.nodes – Extract Terminals and Nonterminals (Penn)
* null.dic (empty file to run hunalign without dictionary)
* very_short_words.txt (list of words to be excluded from OCR debris removal)
[FILES NOT INCLUDED IN ZIP RELEASES -
see SVN Repository (http://sourceforge.net/p/t2t-pipe/code/HEAD/tree/)]
src/resources/:
* eparl_96-09_model.zip
* eparl_tub_07-09_57-82_model.zip
* eparl_tub_96-09_57-82_model.zip
* tub_57-82_model.zip
test/:
* output files of test runs and evaluation files
(including two pictures of good alignment results)
**************************************************************************
* SYSTEM USED
**************************************************************************
OS: Linux (Xubuntu 12.04 x64)
IDE: Wing IDE PRO v5 (http://wingware.com)
[free open source development license]
Programming Language: Python 2.7 (from Version 1.4)
Dependencies (ubuntu repositories):
for /src:
python2.6 or python2.7
python-nltk (and nltk_data)
python-lxml
for /docs:
python-sphinx
python-simpleparse
for /bin (from Version 1.4):
libboost-regex1.42.0 (sub-tree-aligner)
known issue: version 1.42 is no longer in the
12.04 LTS repositories -> workaround:
install libboost-regex1.46.1 and symlink
it to the older name:
sudo apt-get install libboost-regex1.46.1
cd /usr/lib
sudo ln -s libboost_regex.so.1.46.1 libboost_regex.so.1.42.0
**************************************************************************
* THIRD PARTY SOFTWARE USED IN THIS PROGRAMME (order of appearance in pipe)
*
* # NOT INCLUDED in downloads #
**************************************************************************
- Python NLTK Toolkit, Version: 2.0.4 (python-nltk),
http://www.nltk.org
- Hunalign, Version: 1.1 (2010), http://mokk.bme.hu/resources/hunalign
- Microsoft Bilingual Sentence Aligner, Version: 1.0 (2003),
http://research.microsoft.com/en-us/downloads
- Vanilla Aligner, Version: 1.0 (1997), http://nl.ijs.si/telri/Vanilla
- GIZA++, Version: 1.0.5 (31.10.2010), http://code.google.com/p/giza-pp
- MOSES, Version: SVN snapshot of 04.10.2011 (Revision 4295), http://www.statmt.org/moses
- Berkeley Aligner, Version: 2.1 (Sep 2009)
http://code.google.com/p/berkeleyaligner
- Berkeley Parser, Version: 1.1 (Sep 2009),
http://code.google.com/p/berkeleyparser
- Stanford Parser, Version: 1.6.5 (30.11.2010) as "stanford_old_de"
and Version: 1.6.9 (14.09.2011) for new projects
http://nlp.stanford.edu/software/lex-parser.shtml
- Sub-Tree Aligner, Version: 2.8.6 (22.03.2009),
http://www.ventsislavzhechev.eu/Home/Software/Entries/2009/3/22_Sub-Tree_Aligner_v2.8.6_files/tree_aligner.v2.8.6.tbz
(Note 21/08/2011: Links on Webpage seem to be broken - use this direct link instead)
- Stockholm TreeAligner, Version: 1.2.90 (02.06.2010),
http://kitt.cl.uzh.ch/kitt/treealigner/wiki/TreeAlignerDownload
- OmegaT, Version: 2.3.0 Update 1 (April 2012), http://www.omegat.org
**************************************************************************
* USAGE
**************************************************************************
There are two ways of starting the pipe:
* edit and run src/config.py (from any directory on your system)
or
* edit src/config.py and run:
src/t2t_pipe.py [-1 FIRST_STEP (default=1)] [-2 LAST_STEP (default=7)]
1 extract parallel corpus / add article boundaries
2 tokenize parallel corpus
3 align sentences
4 parse corpus
5 get word-alignment probabilities
6 get tree2tree alignments
7 save output files
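The FIRST_STEP/LAST_STEP switches simply select a contiguous range of the seven steps above. A minimal sketch of that coordination logic follows; the step functions are placeholders, not the pipe's actual API - the real work is done by the src/ modules listed under FILES.

```python
# Sketch of the -1/-2 step-range logic: run a contiguous subset of the
# seven pipeline steps in order. Step functions are placeholders only.

def run_pipe(first, last, steps):
    """Run steps numbered first..last (1-based, inclusive) in order."""
    if not 1 <= first <= last <= len(steps):
        raise ValueError("invalid step range: %d-%d" % (first, last))
    executed = []
    for number, (name, func) in enumerate(steps, start=1):
        if first <= number <= last:
            func()              # in the real pipe: run the step module
            executed.append(name)
    return executed

STEPS = [
    ("extract_corpus",   lambda: None),  # 1 extract parallel corpus
    ("run_preprocessor", lambda: None),  # 2 tokenize parallel corpus
    ("run_snt_align",    lambda: None),  # 3 align sentences
    ("run_parser",       lambda: None),  # 4 parse corpus
    ("run_word_align",   lambda: None),  # 5 word-alignment probabilities
    ("run_t2t_align",    lambda: None),  # 6 tree-to-tree alignments
    ("save_output",      lambda: None),  # 7 save output files
]

# e.g. `t2t_pipe.py -1 3 -2 5` corresponds to:
print(run_pipe(3, 5, STEPS))
# -> ['run_snt_align', 'run_parser', 'run_word_align']
```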
**************************************************************************
* DOCUMENTATION (including mini evaluation of the `Sub-Tree Aligner v2.8.6`)
**************************************************************************
Introduced in:
@inproceedings{killer-sennrich-volk:2011,
booktitle = {Multilingual Resources and Multilingual Applications.
Proceedings of the Conference of the German Society for Computational
Linguistics and Language Technology (GSCL 2011)},
month = {September},
title = {{F}rom {M}ultilingual {W}eb-{A}rchives to {P}arallel {T}reebanks in {F}ive {M}inutes},
author = {Markus Killer and Rico Sennrich and Martin Volk},
year = {2011},
pages = {57--62},
abstract = {The Tree-to-Tree (t2t) Alignment Pipe is a collection of Python scripts,
generating automatically aligned parallel treebanks from multilingual web resources
or existing parallel corpora. The pipe contains wrappers for a number of freely
available NLP software programs. Once these third party programs have been
installed and the system and corpus specific details have been updated,
the pipe is designed to generate a parallel treebank with a single program
call from a unix command line. We discuss alignment quality on a fully
automatically processed parallel corpus.}
}
Recommended:
HTML-Documentation (including links to source code):
Open in browser: docs/html/index.html
Other formats:
PDF: docs/latex/t2t-pipe-manual.pdf
LaTeX sources: docs/latex/
all source files: docs/source/
**************************************************************************
* CHANGE LOG
**************************************************************************
1.4 (in development - not released yet):
- changed standard python executable from python2.6
to the more generic python2 for better compatibility with Ubuntu 12.04 LTS
- include all necessary third party binaries in download (64 bit) [work in progress]
- adding support for the new Stanford Parser API v2 [work in progress]
(currently working if you manually extract grammar files from ST-Models.jar
into STANFORD_ROOT/grammar/[grammar-files])
- fixed subprocess call of berkeley aligner for English texts in run_parser.py
(- include support for Gargantua (sentence aligner))
(- improve support for Berkeley Aligner)
(- fix downloader for sac corpus - archives are no longer free)
1.3 - 2011-10-14:
- added module to build parallel corpus from xml files produced by xpdf's pdftohtml
- demo.py builds a parallel treebank from SAC web archives
- improved parallel article selection in src/resources/build_sac_web_corpus.py
- added support for the new Stanford Parser API (enabling French)
- fixed issue with POS tags/CAT labels containing '-'
1.2 - 2011-02-21:
- fixed and improved processing of TXT files (e.g. remove mid-sentence newlines)
- improved tokenization module (no more changes needed in the NLTK files); fewer
dangling right brackets and closing double quotes - but at the moment all
hyphenated words are split ...
- added support for bracket-labels -LRB-/-RRB- in save_output.py
- updated docs to reflect changes and deleted obsolete warning boxes
- added module to download multiple files from web server (included function to download
the whole SAC Archive 2001-present)
- added module to convert multiple pdf files to xml
(allowing text extraction on the basis of font size)
- use relative path to src directory in subfolder modules
1.1 - 2011-02-12:
- fix computation of STA alignment stats in src/save_output.py
- change order of steps in pipe: parse first and then compute word-alignment probs
(Berkeley Aligner can make use of parse trees when computing word alignment probs)
- add basic support for Berkeley Aligner (there are still some problems with HMM_SYNTACTIC)
1.0 - 2011-01-24:
- first release / project handed in