Hi all,
I have recently been experimenting with using Wikipedia as a text corpus for
creating language models. If anyone is interested in this process and the
results, I've written up some details on a new blog:
http://trulymadlywordly.blogspot.com/2011/03/creating-text-corpus-from-wikipedia.html
Regards
Stephen
Very interesting, thanks a lot!
Hi Stephen,
Thank you for your post!
It is very interesting that you posted this now, because I wanted to use
Wikipedia to train a model too :)
Yesterday I downloaded a Wikipedia corpus from
http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html.
They also have a Usenet corpus, but judging from the sample I've seen, it is
mostly garbage.
The Wikipedia corpus seems to contain the newline-separated, plain-text
articles from a slightly older version of Wikipedia (1.7 GB compressed, 6 GB
uncompressed). I guess I can use something like your pattern from
Wikipedia2Txt to separate the sentences; a rough sketch of what I mean is below.
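For context, this is roughly the kind of splitting I have in mind. It is only a minimal sketch with a naive boundary regex, not the actual pattern from Wikipedia2Txt, and wikicorpus.txt is just a placeholder file name:

import re

# Naive sentence splitter: break on ., ! or ? followed by whitespace
# and an uppercase letter. Abbreviations like "Dr." will still trip it up.
SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def split_sentences(text):
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]

# Each line of the corpus file is assumed to be one plain-text article.
with open('wikicorpus.txt', encoding='utf-8') as f:
    for article in f:
        for sentence in split_sentences(article):
            print(sentence)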
Nickolay, I know you have experience with various text-to-speech systems ...
AFAIK text normalization is one of the problems these engines must solve.
Do you know if there is a library resulting from these projects that can be
used for punctuation removal, for converting e.g. numbers and email addresses
to words, and so on?
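To make concrete what I mean by normalization, here is a rough sketch using only the Python standard library. The token choices and the digit-by-digit reading of numbers are just my own assumptions, not taken from any existing TTS front end:

import re
import string

ONES = ['zero', 'one', 'two', 'three', 'four', 'five',
        'six', 'seven', 'eight', 'nine']

def spell_number(match):
    # Read digits out one by one, e.g. "42" -> "four two".
    return ' '.join(ONES[int(d)] for d in match.group(0))

EMAIL = re.compile(r'\S+@\S+\.\S+')
NUMBER = re.compile(r'\d+')

def normalize(sentence):
    s = sentence.lower()
    s = EMAIL.sub(' email address ', s)   # replace email addresses with a token
    s = NUMBER.sub(spell_number, s)       # spell out digit sequences
    s = s.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(s.split())

print(normalize("Contact me at foo@example.com, room 42!"))
# -> contact me at email address room four two

A real TTS front end would of course need much smarter rules (ordinals, dates, currency), which is why I'm asking about an existing library.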
Hmm... if things are going that way, why not use literature books themselves? E.g.
http://public-library.narod.ru/Satirikon/Universal_History_v1.htm or
http://ilibrary.ru/ (or whatever the equivalent is for English...)