Download Latest Version it-20120608.pars.bz2 (447.6 MB)
Email in envelope

Get an email when there's a new version of Hermes Natural Language Processing

Home / tools
Name Modified Size InfoDownloads / Week
Parent folder
SingleLineTokens.py 2012-07-07 273 Bytes
Lowercase.py 2012-07-07 156 Bytes
README 2012-07-07 557 Bytes
WikiExtractor.py 2012-07-05 21.3 kB
TrainPunkt.py 2012-07-05 1.6 kB
RawFilter.py 2012-07-05 1.3 kB
Punkt.py 2012-07-05 1.1 kB
Totals: 7 Items   26.3 kB 0
wget -c http://dumps.wikimedia.org/glwiki/20120625/glwiki-20120625-pages-articles.xml.bz2
mkdir glwiki
bzcat glwiki-20120625-pages-articles.xml.bz2 | ./WikiExtractor.py -o glwiki
find glwiki -exec cat {} \; > corpus/gl.raw
rm -rf glwiki
cat corpus/gl.raw | ./RawFilter.py > corpus/gl.pars
cat corpus/gl.pars | ./TrainPunkt.py > corpus/gl.punkt
cat corpus/gl.pars | ./Punkt.py corpus/gl.punkt > corpus/gl.sent
cd corpus
bzip2 gl.*
cat corpus/gl.tokens | ./SingleLineTokens.py | ./Lowercase.py | /usr/local/irstlm/add-start-end.sh > corpus/gl-20120625.tokens
Source: README, updated 2012-07-07