# Tree-to-Tree (t2t) Alignment Pipe - Programming Project 
# University of Zurich, Institute of Computational Linguistics
# Course: Introduction to multi-lingual text analysis 
# README.TXT
# Author: Markus Killer (mki) <mki5600@gmail.com> 
# December 2013
# Licensed under the GNU GPLv2

Project Homepage: http://sourceforge.net/projects/t2t-pipe

# See t2t-pipe-demo.ogv for a demo screen capture of a complete run (4 min 18 sec)
    * The t2t-pipe is introduced in Killer/Sennrich/Volk (2011) - see BibTeX entry below

# Release 1.4 (Beta - in development)
    * see the SVN repository for the latest version (change log at the bottom of this file)
        (http://t2t-pipe.svn.sourceforge.net/viewvc/t2t-pipe)

# Release 1.3 (Beta)
    * 2011-10-14-t2t-pipe-1.3.zip 4.7MB (scripts and docs without pre-computed 
        GIZA++ dictionaries and test files)
    * For a complete snapshot of all project files, download GNU Tarball (approx. 220MB) 
        from SVN Repository (http://t2t-pipe.svn.sourceforge.net/viewvc/t2t-pipe)

The *Tree-to-Tree (t2t) Alignment Pipe* is a small collection of Python scripts
that coordinates the automatic alignment of parallel treebanks from
plain text or XML files. The main work and the more advanced processing are done
by a number of freely available NLP software programs. Once these
third-party programs have been installed and the system- and corpus-specific
details have been updated, the pipe produces automatically aligned
parallel treebanks with a single program call from a Unix command line.
Currently, German, French and English are fully supported by the scripts and
by the programs they call.

The generated ``TIGERxml`` files can be easily imported 
into the graphical interface of the :program:`Stockholm TreeAligner`. The 
second supported output format is ``TMX``. These files can be used as
translation memories in current translation memory systems (tested with
:program:`OmegaT`).
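
A minimal sketch (not part of the pipe) of how such a ``TMX`` file could be
inspected with lxml (python-lxml, see the dependencies below); the file name
and the language codes are only placeholders:

    # print the aligned segments of a TMX translation memory (Python 2)
    from lxml import etree

    XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'  # TMX marks languages with xml:lang on <tuv>

    tree = etree.parse('corpus_de_en.tmx')     # placeholder file name
    for tu in tree.iter('tu'):                 # one translation unit per sentence pair
        segs = dict((tuv.get(XML_LANG), tuv.findtext('seg')) for tuv in tu.iter('tuv'))
        print segs.get('de'), '|||', segs.get('en')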

As I am relatively new to NLP and Python programming, there will be a number
of inconsistencies and rather clumsy solutions in the pipe. I am very grateful 
for any suggestions on how to improve the program. Please report bugs to
<mki5600@gmail.com>.

**************************************************************************
* FILES
**************************************************************************

    docs/:
        * documentation in html, latex and pdf format (incl. source files)

    src/:
        * config – Configuration File (executable)
        * demo - Demo (executable)
        * errors – Error messages
        * extract_corpus – Extract Corpus
        * get_files – Get Files
        * info – Program Information File
        * prepare_corpus – Prepare Corpus
        * run_parser – Statistical Phrase Structure Parsing
        * run_preprocessor – Tokenization Module
        * run_snt_align – Sentence Alignment
        * run_t2t_align – Tree-to-Tree Alignment
        * run_word_align – Word Alignment
        * save_output – Save Output Files (executable)
        * t2t_pipe – Tree-to-Tree (t2t) Alignment Pipe Main Module (executable)
    src/resources/:
        * resources.autodoc – Sphinx Autodoc generator (executable)
        * resources.build_sac_web_corpus – Build SAC Web Corpus (executable)
        * resources.combine_autodicts – Combine Hunalign Autodict Files (executable)
        * resources.convert_files – Convert Files PDF->XML, etc. (executable)
        * resources.dictcc_extract – Extract entries from dict.cc files (executable)
        * resources.download_files – Download Files from Web-Server (executable)
        * resources.sta_alignment_stats – Count Alignment Types in sta.xml-files
        * resources.tagsets – POS-Tagset Dictionaries
        * resources.brackparser.brackparser – Parse bracketed sentences (Penn)
        * resources.brackparser.nodes – Extract Terminals and Nonterminals (Penn)
        * null.dic (empty file to run hunalign without dictionary)
        * very_short_words.txt (list of words to be excluded from OCR debris removal)

[FILES NOT INCLUDED IN ZIP-RELEASES]
see SVN repository (http://sourceforge.net/p/t2t-pipe/code/HEAD/tree/)

    src/resources/:
        * eparl_96-09_model.zip
        * eparl_tub_07-09_57-82_model.zip
        * eparl_tub_96-09_57-82_model.zip
        * tub_57-82_model.zip

    test/:
        * output files of test runs and evaluation files 
            (including two pictures of good alignment results)


**************************************************************************
* SYSTEM USED
**************************************************************************

OS:                     Linux (Xubuntu 12.04 x64)
IDE:                    Wing IDE PRO v5 (http://wingware.com)
                        [free open source development license]
Programming Language:   Python 2.7 (from Version 1.4)

Dependencies (ubuntu repositories):

        for /src:       
                        python2.6 or python2.7
                        python-nltk (and nltk_data)
                        python-lxml

        for /docs:
                        python-sphinx
                        python-simpleparse
                        
        for /bin (from Version 1.4):
                        libboost-regex1.42.0 (sub-tree-aligner)
                        
                        known issue: version 1.42 is no longer 
                        in the 12.04 LTS repositories -> workaround:

                        sudo apt-get install libboost-regex1.46.1

                        and create a symlink to the newer version:

                        cd /usr/lib
                        sudo ln -s libboost_regex.so.1.46.1 libboost_regex.so.1.42.0
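
All of the Python packages listed above come from the Ubuntu repositories, so
they can, for example, be installed in one go (nltk_data still has to be
fetched separately through NLTK's downloader):

        sudo apt-get install python-nltk python-lxml python-sphinx python-simpleparse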

**************************************************************************
* THIRD PARTY SOFTWARE USED IN THIS PROGRAM (order of appearance in pipe)
*
* # NOT INCLUDED in downloads # 
**************************************************************************

- Python NLTK Toolkit, Version: 2.0.4 (python-nltk), 
    http://www.nltk.org
- Hunalign, Version: 1.1 (2010), http://mokk.bme.hu/resources/hunalign
- Microsoft Bilingual Sentence Aligner, Version: 1.0 (2003), 
    http://research.microsoft.com/en-us/downloads
- Vanilla Aligner, Version: 1.0 (1997), http://nl.ijs.si/telri/Vanilla
- GIZA++, Version: 1.0.5 (31.10.2010), http://code.google.com/p/giza-pp
- MOSES, Version: SVN snapshot of 04.10.2011 (Revision 4295), http://www.statmt.org/moses
- Berkeley Aligner, Version: 2.1 (Sep 2009)
    http://code.google.com/p/berkeleyaligner
- Berkeley Parser, Version: 1.1 (Sep 2009), 
    http://code.google.com/p/berkeleyparser
- Stanford Parser, Version: 1.6.5 (30.11.2010) as "stanford_old_de"
    and Version: 1.6.9 (14.09.2011) for new projects
    http://nlp.stanford.edu/software/lex-parser.shtml
- Sub-Tree Aligner, Version: 2.8.6 (22.03.2009),
    http://www.ventsislavzhechev.eu/Home/Software/Entries/2009/3/22_Sub-Tree_Aligner_v2.8.6_files/tree_aligner.v2.8.6.tbz
    (Note 21/08/2011: the links on the webpage seem to be broken - use this direct link instead)
- Stockholm TreeAligner, Version: 1.2.90 (02.06.2010), 
    http://kitt.cl.uzh.ch/kitt/treealigner/wiki/TreeAlignerDownload
- OmegaT, Version: 2.3.0 Update 1 (April 2012), http://www.omegat.org


**************************************************************************
* USAGE
**************************************************************************

There are two ways of starting the pipe:

* update and run src/config.py (from any directory on your system)

or  

* update src/config.py in src/ and run:

    src/t2t_pipe.py [-1 FIRST_STEP (default=1)] [-2 LAST_STEP (default=7)]

    1    extract parallel corpus / add article boundaries
    2    tokenize parallel corpus
    3    align sentences
    4    parse corpus
    5    get word-alignment probabilities
    6    get tree2tree alignments
    7    save output files
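
For example, assuming src/config.py has been updated and the output of steps 1-2
is already in place, the alignment-related steps (3-5) can be re-run on their own with:

    src/t2t_pipe.py -1 3 -2 5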

**************************************************************************
* DOCUMENTATION (including mini evaluation of the `Sub-Tree Aligner v2.8.6`)
**************************************************************************

Introduced in:

@inproceedings{killer-sennrich-volk:2011,
    booktitle = {Multilingual Resources and Multilingual Applications.  
        Proceedings of the Conference of the German Society for Computational 
        Linguistics and Language Technology (GSCL 2011)},
    month = {September},
    title = {{F}rom {M}ultilingual {W}eb-{A}rchives to {P}arallel {T}reebanks in {F}ive {M}inutes},
    author = {Markus Killer and Rico Sennrich and Martin Volk},
    year = {2011},
    pages = {57--62},
    abstract = {The Tree-to-Tree (t2t) Alignment Pipe is a collection of Python scripts, 
        generating automatically aligned parallel treebanks from multilingual web resources 
        or existing parallel corpora. The pipe contains wrappers for a number of freely 
        available NLP software programs. Once these third party programs have been 
        installed and the system and corpus specific details have been updated, 
        the pipe is designed to generate a parallel treebank with a single program 
        call from a unix command line. We discuss alignment quality on a fully 
        automatically processed parallel corpus.}
            }

Recommended:

HTML documentation (including links to source code): 
Open in browser: docs/html/index.html

Other formats:

PDF               docs/latex/t2t-pipe-manual.pdf
LATEX             docs/latex
all source files  docs/source

**************************************************************************
* CHANGE LOG
**************************************************************************
1.4 (in development - not released yet):
- changed standard python executable from python2.6 
  to the more generic python2 for better compatibility with Ubuntu 12.04 LTS
- include all necessary third party binaries in download (64 bit) [work in progress]
- adding support for new Stanford Parser API v2 [work in progress]
  (currently working if you manually extract grammar files from ST-Models.jar 
  into STANFORD_ROOT/grammar/[grammar-files])
- fixed subprocess call of berkeley aligner for English texts in run_parser.py
(- include support for Gargantua (sentence aligner))
(- improve support for Berkeley Aligner)
(- fix downloader for sac corpus - archives are no longer free)

1.3 - 2011-10-14:
- added module to build parallel corpus from xml files produced by xpdf's pdftohtml
- demo.py builds a parallel treebank from SAC web archives
- improved parallel article selection in src/resources/build_sac_web_corpus.py
- added support for new Stanford Parser API (enabling French)
- fixed issue with POS tags/CAT labels containing '-'

1.2 - 2011-02-21:
- fixed and improved processing of TXT-files (e.g. remove mid-sentence newlines)
- improved tokenization module (no more changes needed in NLTK files): fewer dangling 
    right brackets and closing double quotes - but at the moment all hyphenated words 
    are separated ...
- added support for bracket-labels -LRB-/-RRB- in save_output.py
- updated docs to reflect changes and deleted obsolete warning boxes
- added module to download multiple files from web server (included function to download 
    the whole SAC Archive 2001-present)
- added module to convert multiple pdf files to xml 
    (allowing text extraction on the basis of font size)
- use relative path to src directory in subfolder modules

1.1 - 2011-02-12:
- fix computation of STA alignment stats in src/save_output.py
- change order of steps in pipe: parse first and then compute word-alignment probs 
    (Berkeley Aligner can make use of parse trees when computing word alignment probs)
- add basic support for Berkeley Aligner (there are still some problems with HMM_SYNTACTIC)

1.0 - 2011-01-24:
- first release / project handed in