
Better large vocab language models

James
2006-04-19
2012-09-21
  • James

    James - 2006-04-19

    Hi-

    Not sure if this makes any sense, but would it be useful to use the boom in audio books (audible.com) to train a better language model (and, for that matter, a better dictionary)?

    One could use Orwell's 1984:

    http://www.audible.com/adbl/site/products/ProductDetail.jsp?productID=BK_BLAK_000117&BV_UseBVCookie=Yes

    And the text:

    http://www.online-literature.com/orwell/1984/1/

    This would of course produce a lot of training data, and I assume it would yield more accurate large-vocabulary results.

    BTW, I do believe it is possible to convert Audible files to wav (or to mp3, then wav):

    http://forums.afterdawn.com/thread_view.cfm/5/103313

    Would this be a useful thing to explore?

    James

     
    • Robbie

      Robbie - 2006-04-19

      The audio will not be helpful in training language models, since those are trained on text only. It can't help with the dictionary as-is, either, because the dictionary is a mapping from words to their corresponding phonemes, which you can't obtain from an audio sample or the transcription alone (though you could use a speech recognizer to try, which would admittedly be a fun project).

      However, the textual transcription can be used to train a language model. As novels are not very representative of conversational speech, your language models may not be top-notch, but they should certainly be better than ones built from WSJ text.

      The audio+transcription can be used to train an acoustic model. One caveat is that lossy compression schemes alter the spectral qualities of the waveform, so unless you plan on your target audio (i.e. from the mic, or whatever) being in the same compression format as the original (i.e. Audible's format), you will have an acoustic model mismatch. It won't be severe, but it is definitely not ideal.

      In summary, if you train a language model from book transcriptions and an acoustic model from transcript+audio from Audible, you will create a very good audiobook transcriber, but perhaps not the best speech recognizer.

       
      • James

        James - 2006-04-19

        Hmmm...

        I am a bit confused. I was thinking that with all the text in a transcript, one could define a new, larger dictionary using automated phoneme tools (like the post in the help forum about adding words to the dictionary automatically). I would have thought a larger dictionary would be better for general-purpose speech recognition (where there is no specific vocabulary you can narrow down to).

        Language models are built on n-grams, right? Which is really just counting up how many times a certain n-word sequence appears in a corpus, right? (Note: I have done some work with Information Retrieval and NLP, but I am out of my league here.) Why couldn't you do this with the transcript of a book?
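To make that counting idea concrete, here is a minimal sketch in Python (the toy sentence and the function name are just for illustration, not from any real toolkit):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-word sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy "corpus" standing in for a book transcript.
tokens = "the clock struck thirteen and the clock struck again".split()

bigrams = ngram_counts(tokens, 2)
print(bigrams[("the", "clock")])  # the bigram "the clock" occurs twice
```

A real LM toolkit then turns such counts into smoothed probabilities, but the raw statistic is exactly this kind of count.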

        I agree that spoken English and what is written in a novel are probably different, but a general-purpose speech recognition system should attempt to handle those infrequent words that might rarely occur in spoken English but would occur in a novel.

        Besides all of this, I am thinking in the machine-learning mentality: give your system an overwhelming amount of data to train on and it will work well. I was thinking of more than a single book; one could do hundreds of them, hoping (well, probably being statistically assured) that the variations will smooth out over the large sample.

        I am probably off my rocker on this one but I am just thinking out loud.

        James

         
        • Robbie

          Robbie - 2006-04-20

          You are not off your rocker. In my previous reply, I mentioned that you can train n-gram LMs off of novels, and that it isn't ideal, but it would probably work out fine.

          I also mentioned that it might be possible to automatically extract pronunciations for words; I was thinking you meant from the audio + transcript. There are tools for converting text to phonemes (one such is available on speech.nist.gov), which I believe work with around 98% accuracy, so I suppose you are right about that. I have even read about some work on using audio+transcript to get a more accurate phonetic transcription that captures variation in pronunciation (e.g. "dh uh" and "d uh" for "the").

          Google has proven time and time again that the differences between models become less relevant when there are tons and tons of data. The biggest issue is memory and speed. For a vocab size of V, a trigram LM has V^3 possible trigrams + V^2 possible bigrams + V possible unigrams (not all of which occur in English). For a 50,000-word vocab that's just not computationally feasible to store in full, so you end up throwing out words that occur fewer than some threshold number of times in training. In other words, you will still miss some infrequent words, no matter how much data you give it.
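For a sense of scale, that arithmetic at V = 50,000 works out to roughly 1.25e14 possible n-grams:

```python
# Possible n-grams for a trigram LM over a 50,000-word vocabulary.
V = 50_000
possible = V**3 + V**2 + V  # trigrams + bigrams + unigrams
print(f"{possible:,}")      # 125,002,500,050,000
```

Of course only a tiny fraction of these sequences ever occur in real text, which is why count cutoffs work at all.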

          I think it is important to distinguish between the acoustic model and the language model. You train the language model on pure text--you never need audio samples--and there is far more text available than audio (you could always pull text off of public newsgroups, blogs, etc.).

          The acoustic model is trained off of speech (which must be transcribed, at least to seed the initial model). For each window of frames, the acoustic model "suggests" the most probable phonemes. The decoder combines these most probable phonemes along with the most probable words sequences from the LM to provide a list of most probable words.
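That combining step can be sketched as scoring each candidate word sequence by its acoustic log-likelihood plus a weighted LM log-probability. This is only an illustrative sketch: the candidate strings, scores, and the `language_weight` value are made up, though real decoders (Sphinx included) do use a language-weight parameter to balance the two models.

```python
import math

def combined_score(acoustic_logprob, lm_logprob, language_weight=9.5):
    """Decoder-style score: acoustic log-likelihood + weighted LM log-prob."""
    return acoustic_logprob + language_weight * lm_logprob

# (acoustic log-likelihood, LM log-probability) for two hypotheses.
candidates = {
    "recognize speech": (-120.0, math.log(1e-4)),
    "wreck a nice beach": (-118.0, math.log(1e-7)),
}
best = max(candidates, key=lambda w: combined_score(*candidates[w]))
print(best)  # the LM outweighs the slightly better acoustic score
```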

          Hence, I'm assuming that since you mentioned using audio from audible.com, you were also proposing a way to train an acoustic model. Don't forget my previous warning about the acoustic models & compression: you want the incoming signal to match the original format of the training signal as closely as possible. If you convert from audible's format to mp3 to .wav, you will have to convert the microphone (or other audio source) to .mp3 and then to audible when using the recognizer or there will be some mismatch at the acoustic level. I don't know how big of a problem this is in terms of accuracy, but I would guess an acoustic model mismatch like this could drop accuracy in the ballpark of 5%-10% (that's just a guess, for all I know the drop would be minimal, or it could degrade accuracy even more).

          Bottom line: if one were to undertake such a project as you describe, I think the results would be satisfactory (it should definitely be better than the WSJ stuff, and perhaps even better than HUB4--or at least comparable--both of which have served me quite well).

          FYI, I read a note on the HTK website the other day saying that the latest research in speech recognition deals with hundreds of hours of data (I believe it was broadcast news). You may want to follow up on that line of research.

           
          • The Grand Janitor

            Hi Jost and Robert,

            What Robert said is not incorrect; to rephrase him: the more data, the better the estimates you can get.

            Though in practice, if you don't have enough data, what sort of data you have is more important. In our experience, in-domain data usually matters more.

            -a

             
    • James

      James - 2006-04-23

      Hi Arthur and Robbie-

      Thanks for the responses, I figured that more data = better results was probably true, I just wanted to make sure I understood everything fully before I tried anything.

      So I decided to do a little test and outlaid a whopping $30 :) of my own money to get Audible's 1984 book. I found some software to convert it to wav and then ran it through using the HUB4 data. I got decent results, so evidently the compression isn't too bad for Sphinx to do something with.

      I did all this because even if I don't train any new models from the data, I can use it to evaluate the HUB4 model (since I don't have access to the LDC corpora).
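For anyone scoring such an evaluation, the standard metric is word error rate: the word-level edit distance between the reference transcript and the recognizer output, divided by the reference length. A minimal sketch (the sample sentences are made up):

```python
def word_error_rate(reference, hypothesis):
    """WER = Levenshtein distance over words / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("it was a bright cold day in april",
                      "it was a bright gold day in april")
print(wer)  # 0.125 -- one substitution out of eight reference words
```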

      Anyway, I just thought I would post that this can be done, whether anyone wants to do it or not. There is also perhaps a copyright issue, although I think training an SR system is probably fair use...

      As for me, I am not sure I have the time to start this project...

      One more question, though: will adding more words to the dictionary increase the accuracy of SR while using the HUB4 model? Meaning, has anyone tested where the accuracy is lost (and what % in each): in the acoustic model, or in the fact that many words are not represented in the dictionary (where is the biggest bang for the buck)? I suppose this could be tested by alternating WSJ and HUB4 with smaller and bigger dictionaries...

      James

       
