Creating language models from Wikipedia

2011-03-15
2012-09-22
  • Stephen Marquard

    Hi all,

    I have recently been experimenting with using Wikipedia as a text corpus for
    creating language models. If anyone is interested in this process and the
    results, I've written up some details on a new blog:

    http://trulymadlywordly.blogspot.com/2011/03/creating-text-corpus-from-wikipedia.html

    Regards
    Stephen

     
  • Nickolay V. Shmyrev

    Very interesting, thanks a lot!

     
  • Vassil Panayotov

    Hi Stephen,

    Thank you for your post!

    It is very interesting that you posted this now, because I wanted to use
    Wikipedia to train a model too :)
    Yesterday I downloaded a Wikipedia corpus from
    http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html.
    They also have a Usenet corpus, but judging from the sample I've seen, it is
    mostly garbage.
    The Wikipedia corpus seems to contain newline-separated, plain-text articles
    from a slightly older version of Wikipedia (1.7 GB compressed, 6 GB
    uncompressed). I guess I can use something like your pattern from
    Wikipedia2Txt to separate the sentences.

    Nickolay, I know you have experience with various text-to-speech systems ...
    AFAIK text normalization is one of the problems these engines must solve.
    Do you know if there is a library resulting from these projects that can be
    used to remove punctuation, convert e.g. numbers and email addresses to
    strings, and so on?

     
