Hi all,
I have recently been experimenting with using Wikipedia as a text corpus for
creating language models. If anyone is interested in this process and the
results, I've written up some details on a new blog:
http://trulymadlywordly.blogspot.com/2011/03/creating-text-corpus-from-wikipedia.html
Regards
Stephen
Very interesting, thanks a lot!
Hi Stephen,
Thank you for your post!
It is very interesting that you posted this now, because I wanted to use
Wikipedia to train a model too :)
Yesterday I downloaded a Wikipedia corpus from
http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html.
They also have a Usenet corpus, but judging from the sample I've seen, it is
mostly garbage.
The Wikipedia corpus seems to contain the newline-separated, plain-text
articles from a slightly older version of Wikipedia (1.7 GB compressed, 6 GB
uncompressed). I guess I can use something like your pattern from
Wikipedia2Txt to separate the sentences; a rough sketch of what I mean is below.
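For context, this is roughly the kind of splitting I have in mind. It is only a minimal sketch with a naive boundary regex, not the actual pattern from Wikipedia2Txt, and wikicorpus.txt is just a placeholder file name:

import re

# Naive sentence splitter: break on ., ! or ? followed by whitespace
# and an uppercase letter. Abbreviations like "Dr." will still trip it up.
SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def split_sentences(text):
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]

# Each line of the corpus file is assumed to be one plain-text article.
with open('wikicorpus.txt', encoding='utf-8') as f:
    for article in f:
        for sentence in split_sentences(article):
            print(sentence)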
Nickolay, I know you have experience with various text-to-speech systems ...
AFAIK text normalization is one of the problems these engines must solve.
Do you know if there is a library resulting from these projects that can be
used for punctuation removal, for converting e.g. numbers and email addresses
to words, and so on?
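To make concrete what I mean by normalization, here is a rough sketch using only the Python standard library. The token choices and the digit-by-digit reading of numbers are just my own assumptions, not taken from any existing TTS front end:

import re
import string

ONES = ['zero', 'one', 'two', 'three', 'four', 'five',
        'six', 'seven', 'eight', 'nine']

def spell_number(match):
    # Read digits out one by one, e.g. "42" -> "four two".
    return ' '.join(ONES[int(d)] for d in match.group(0))

EMAIL = re.compile(r'\S+@\S+\.\S+')
NUMBER = re.compile(r'\d+')

def normalize(sentence):
    s = sentence.lower()
    s = EMAIL.sub(' email address ', s)   # replace email addresses with a token
    s = NUMBER.sub(spell_number, s)       # spell out digit sequences
    s = s.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(s.split())

print(normalize("Contact me at foo@example.com, room 42!"))
# -> contact me at email address room four two

A real TTS front end would of course need much smarter rules (ordinals, dates, currency), which is why I'm asking about an existing library.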
Hmm... if things are going that way, why not use literature books themselves? E.g.
http://public-library.narod.ru/Satirikon/Universal_History_v1.htm or
http://ilibrary.ru/ (or whatever the equivalent is for English...)