The corpus files can be found in the 'corpora' folder; the tools used to create them (along with some very basic instructions) are in the 'tools' folder.

Different databases are offered for each language, among them:
  * 'pars', which contains paragraphs as extracted from Wikipedia
  * 'sent', which contains sentences extracted from the corpus above, using the TrainPunkt and Punkt scripts (which wrap the NLTK Punkt module) - this is probably what you want if you are not sure
  * 'punkt', which contains the trained model from which the corpus above was generated (it can be used with the Punkt script in the tools folder, but remember that this algorithm was designed for unsupervised learning from the text it is expected to be applied to, not for generalized sentence segmentation - i.e., you might want to use the 'sent' corpus to train your own supervised tokenizer; see the sentence-splitting sketch after this list)
  * 'tokens', which contains the paragraph corpus tokenized and lowercased, with sentence markers ("<s>" and "</s>") and, where applicable, with exceptionally long words and sentences filtered out
  * 'lm3', which contains an ARPA-format language model based on 3-grams
  * 'lm5', which contains an ARPA-format language model based on 5-grams; both language models can be queried as in the scoring sketch after this list
  * 'dict', which contains a sorted, lowercased vocabulary in the format "token count" (this type of file usually contains noise at the bottom, among the low-count tokens; see the vocabulary-reading sketch after this list)
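
If you want to apply the trained model from the 'punkt' files yourself, the sentence-splitting sketch below shows one way to do it with NLTK. It assumes the file is a pickled PunktSentenceTokenizer (the object the NLTK Punkt module works with); the filename and the example paragraph are made up, so adjust them to the actual files.

    import pickle

    # Minimal sketch, assuming the 'punkt' file is a pickled NLTK
    # PunktSentenceTokenizer; the filename below is hypothetical.
    # NLTK must be installed for unpickling to succeed.
    with open("xxwiki_punkt.pickle", "rb") as handle:
        tokenizer = pickle.load(handle)

    paragraph = "This is the first sentence. This is the second one."

    # tokenize() splits the paragraph into a list of sentence strings.
    for sentence in tokenizer.tokenize(paragraph):
        print(sentence)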
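
The 'lm3' and 'lm5' files are plain ARPA n-gram models, so any toolkit that reads the ARPA format can load them. As an illustration (not necessarily the toolkit the models were built with), the scoring sketch below queries one of them with the kenlm Python bindings; the filename is hypothetical.

    import kenlm  # requires the kenlm Python module

    # Hypothetical filename; any of the ARPA files should load the same way.
    model = kenlm.Model("xxwiki_lm3.arpa")

    sentence = "this is an example sentence"

    # Log10 probability of the sentence, with <s> and </s> added automatically.
    print(model.score(sentence, bos=True, eos=True))

    # Per-sentence perplexity is often easier to compare across models.
    print(model.perplexity(sentence))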
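
The 'dict' files are plain text with one "token count" pair per line, so pruning the low-count noise mentioned above is straightforward. The vocabulary-reading sketch below is minimal; the filename, the UTF-8 encoding and the cut-off of 5 are arbitrary assumptions.

    # Minimal sketch for reading a 'dict' file; the filename and the
    # cut-off of 5 are arbitrary, and UTF-8 encoding is assumed.
    vocabulary = {}
    with open("xxwiki_dict.txt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            token, count = line.rsplit(maxsplit=1)
            vocabulary[token] = int(count)

    # Drop the low-count tokens, which are mostly noise.
    vocabulary = {tok: cnt for tok, cnt in vocabulary.items() if cnt >= 5}
    print(len(vocabulary), "tokens kept")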

The first two letters indicate the language code, as used by Wikipedia (ISO 639-1). The date indicates the date of the Wikipedia dump (as available from http://dumps.wikimedia.org/backup-index.html), not the date the corpus was generated.

Language-specific information:
  * For Georgian ('ka'), the corpus contains noise (such as HTML formatting and English text). The sentences were split with a standard Punkt training and probably contain a number of errors.
  * For Galician ('gl'), the tokens were obtained with the FreeLing tokenization module.
  * For Italian ('it'), the tokens and the sentences were obtained with the FreeLing tokenization modules.

Please remember that the files do not include the entire Wikipedia contents in the given language; in particular, I have removed very short paragraphs. However, I tried to make the corpora as general as possible (for example, the paragraph and sentence corpora are not even tokenized or lowercased), so that you can do pretty much whatever you want with them. Please write if you need any help (especially if you need a particular language; I will do my best to add it), and also if you use these files. I would love to know how they are being used; mail: <tresoldi@gmail.com>.
