What is considered a good corpus to train a language model?

Shunyi Xu
2013-12-02
2013-12-06
  • Shunyi Xu
    2013-12-02

    I have been trying to design a good text corpus that covers the words useful in my application. (I am using lmtool.)
    The sentences that usually appear have the following form:

    navigate me from [a place name] to [another place name]

    To cover this, I have put a few example sentences in my text corpus, such as:

    navigate me from new york to hong kong
    show me the route from broadway to central, hong kong
    what is the best path from paris to tokyo.
    Then, so that as many places as possible can fit into the [place name] slot, I have listed place names in the rest of the training corpus, each on its own line:

    Statue of Liberty
    museum
    great wall
    Carnegie Mellon University
    ....

    So my questions are:
    1. Does the above example make good training text, and is it flexible enough that any place name from the lower part can fit into the [place name] position? From my observation, it does not work well: in the generated 3-gram model, "new york" is much more likely to follow "from", so it is hard for a place name from the lower part to replace "new york".
    2. Since I cannot list every place name and put each one into the sentence positions where it belongs, is there any trick in writing the corpus that gives greater coverage of the pattern I intend?
    3. Can I break the sentences down and put each meaningful word on its own line? Is this good practice? That way, every word would be as free as possible to combine with other words. Is that right?
    4. Any general suggestions on how to make a good text corpus?

    Your suggestions are highly appreciated!

     
    Last edit: Shunyi Xu 2013-12-02
  • Does the above example make good training text, and is it flexible enough that any place name from the lower part can fit into the [place name] position?

    No

    Since I cannot list every place name and put each one into the sentence positions where it belongs, is there any trick in writing the corpus that gives greater coverage of the pattern I intend?

    It is easily possible if you know how to write code. You can generate the required corpus with your favorite scripting language.
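
    As a rough illustration of that idea (a minimal sketch, not a prescribed method), the script below fills the place-name slots of a few carrier sentences with every ordered pair from a place list. The templates and places are taken from the question; the names TEMPLATES, PLACES and generate_corpus are hypothetical, and the {src}/{dst} slot syntax is just Python string formatting.

```python
import itertools

# Hypothetical carrier templates; {src} and {dst} mark the place-name slots.
TEMPLATES = [
    "navigate me from {src} to {dst}",
    "show me the route from {src} to {dst}",
    "what is the best path from {src} to {dst}",
]

# Place names taken from the question; a real list would be much longer.
PLACES = [
    "new york",
    "hong kong",
    "statue of liberty",
    "great wall",
    "carnegie mellon university",
]


def generate_corpus(templates, places):
    """Yield one lowercase sentence per (template, source, destination)."""
    for template in templates:
        # permutations(..., 2) gives ordered pairs with src != dst.
        for src, dst in itertools.permutations(places, 2):
            yield template.format(src=src, dst=dst).lower()


if __name__ == "__main__":
    # Print one sentence per line, ready to be saved as a corpus file.
    for sentence in generate_corpus(TEMPLATES, PLACES):
        print(sentence)
```

    Redirecting the output to a text file gives a corpus in which every place name appears in both slots. With a long place list the number of pairs grows quadratically, so sampling a subset of pairs instead of using all of them may be more practical.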

    Can I break the sentences down and put each meaningful word on its own line? Is this good practice?

    No

    Any general suggestions on how to make a good text corpus?

    The best corpus is one collected from real users of the application in real-life use.