Language model with POS tag

2017-01-30
  • Tania Mendonca

    Tania Mendonca - 2017-01-30

    Is there any tool for building a language model from a Kannada (Indian language) corpus that is already POS-tagged?

    The data looks something like this:
    ರಂಗಭೂಮಿ NN
    ಶ್ರೀಲಂಕಾ NNP
    ಪದವನ್ನು NN
    ಬೇರೆ JJ
    ಲೇಖನಗಳಲ್ಲಿ NN
    ಹುಡುಕಿ VM
    ಹೆಚ್ಚಾಗಿ RB
    ಅವರು PRP
    ನಿರ್ದೇಶನದಲ್ಲಿ NN
    ತಮ್ಮ PRP
    ಹೆಚ್ಚಿನ QF
    ಕೊಡುಗೆಯನ್ನು NN
    ಕೊಟ್ಟಿದ್ದಾರೆ VM

    Should I build one language model for the plain text (without POS tags) and a separate one for the POS tags?
    Or can I combine the text with the POS tags and build a single language model?
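
Both options can be prepared from the same tagged file. Below is a minimal sketch, assuming one `word TAG` pair per line and a blank line between sentences. The sentence grouping in `sample` and the `word|TAG` joint-token format are assumptions for illustration, not something a particular toolkit requires.

```python
def read_tagged(lines):
    """Group 'word TAG' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:                      # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) != 2:
            raise ValueError(f"expected 'word TAG', got: {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def to_streams(sentences):
    """Return three parallel corpora: words only, tags only, joint word|TAG."""
    words = [" ".join(w for w, _ in s) for s in sentences]
    tags = [" ".join(t for _, t in s) for s in sentences]
    joint = [" ".join(f"{w}|{t}" for w, t in s) for s in sentences]
    return words, tags, joint


# Illustrative input; the split into two sentences is assumed.
sample = [
    "ಅವರು PRP", "ಹೆಚ್ಚಿನ QF", "ಕೊಡುಗೆಯನ್ನು NN", "ಕೊಟ್ಟಿದ್ದಾರೆ VM",
    "",
    "ಬೇರೆ JJ", "ಲೇಖನಗಳಲ್ಲಿ NN", "ಹುಡುಕಿ VM",
]
words, tags, joint = to_streams(read_tagged(sample))
```

Each list can then be written out one sentence per line and passed to whatever n-gram toolkit is in use: `words` and `tags` for two separate models, or `joint` for a single model over combined tokens.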

     

    Last edit: Tania Mendonca 2017-01-30
    • Nickolay V. Shmyrev

      I would use TensorFlow; neural-network language models are the most accurate anyway. I'm not aware of a ready-to-use implementation, though.

      You can consult this paper for details:
      http://ii.tudelft.nl/sites/default/files/i12_1664.pdf

       
    • Arseniy Gorin

      Arseniy Gorin - 2017-01-30

      Sorry, but could you clarify why you need this? For ASR, POS tags are unlikely to help (I'm almost sure they will not). But yes, for things like machine translation you build two LMs and use both in the decoder: http://www.statmt.org/moses/?n=Moses.FactoredTutorial

      Anyway, I think it is not straightforward to implement in Sphinx.

       
      • Tania Mendonca

        Tania Mendonca - 2017-01-30

        In my project I'm trying to recognise out-of-vocabulary words. Kannada is a highly inflected language, and most sentences have the form subject-object-verb.

        For example, this is one of the training transcriptions:
        avalu manege hoguthale
        (She is going home)

        The dictionary has these words with their respective pronunciations:
        avanu
        avalu
        manege
        hoguthale

        and the test audio I give is:
        avanu manege hoguthane
        (He is going home)
        So here hoguthane is out-of-vocabulary.

        Since hoguthale is already in my dictionary, the recognizer will choose it as the closest match for the word "hoguthane" in my test audio.

        So I thought that, using POS tags (context-dependent), the recognizer could see that the subject in the test audio is avanu (he, male) and apply rules to change the inflectional ending, so that hoguthale becomes hoguthane.

        Please tell me whether this idea will work for out-of-vocabulary words.

         
        • Arseniy Gorin

          Arseniy Gorin - 2017-01-30

          Yep, I remember we discussed this at some point. Implementing POS tags in the LM is difficult and unlikely to bring you much (some NNLM work does this, but it has nothing to do with your project).

          Even if your LM predicts the correct form of the words, you will need to find a way to construct a list of expanded pronunciations on the fly (or correct the recognizer output in second-pass processing).

          That said, I think the easiest thing for you to experiment with would be decoding with a joint word-subword language model. If you have a small vocabulary and many OOVs, that can make some difference.

          Even easier is to add these variants to your dictionary and let the acoustic model decide. I imagine it could be implemented as second-pass decoding with a small LM containing the words from the first pass plus alternative inflected forms for some of them (you could target words with low confidence, though I'm not sure confidences are reliable in Sphinx).
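
The variant-generation step for such a second pass could look roughly like the sketch below. It assumes a hand-written table of interchangeable verb endings; `ENDING_GROUPS` and the function names are hypothetical, and only the -ale/-ane pair is taken from the example in this thread.

```python
# Hypothetical groups of endings that differ only in agreement.
# A real table would have to be written by someone who knows the morphology.
ENDING_GROUPS = [("ale", "ane")]


def variants(word):
    """Return the word itself plus forms with an interchangeable ending swapped in."""
    out = [word]
    for group in ENDING_GROUPS:
        for ending in group:
            if word.endswith(ending):
                stem = word[: -len(ending)]
                out.extend(stem + other for other in group if other != ending)
    return out


def second_pass_vocab(first_pass_words):
    """Collect first-pass words and their variants, keeping order, no duplicates."""
    vocab = []
    for w in first_pass_words:
        for v in variants(w):
            if v not in vocab:
                vocab.append(v)
    return vocab


hyp = ["avalu", "manege", "hoguthale"]   # first-pass output for the test audio
vocab = second_pass_vocab(hyp)
print(vocab)
```

Every generated form still needs a pronunciation in the dictionary before the acoustic model can choose it, so this only covers the vocabulary side of the second pass.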

           
          • Tania Mendonca

            Tania Mendonca - 2017-02-03

            Thank you, sir.
            To build the word-subword language model, I plan to keep all nouns as whole words and split the verbs into morphemes.
            When preparing a hybrid language model, must word frequency be taken into account? (In Dr. Long Qin's paper, frequent words are kept as words and infrequent words are split into subwords.)
            Or is my approach fine, where only the verbs are split into subwords (morphemes)?

            Just another clarification:
            if I have a word such as
            hOguttAne (going; -utt- is the present tense marker), according to morphology it becomes hogu (go, root word) + Ane (suffix, 2nd order plural male).

            So in my corpus, should I replace hoguttAne with hogu Ane or with hogutt Ane when building the language model?

            I assume the transcriptions should align with the acoustics during training.
            So if I leave out the -utt- in the transcription, will my acoustic model be able to handle it?

             
            • Arseniy Gorin

              Arseniy Gorin - 2017-02-06

              "Is it a must while preparing a hybrid language model frequency of words needs to be taken into consideration"

              I think taking frequency into account is a good idea. In any case, it is worth trying both approaches.
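
To compare the two setups on the same corpus, something like the sketch below may help. It assumes the corpus is available as lists of (word, tag) pairs; `toy_segment`, the `min_count` threshold and the tags given to the romanized example words are placeholders for illustration, not a real Kannada morphological analyser.

```python
from collections import Counter


def toy_segment(word):
    """Placeholder segmenter: cut off the last three characters."""
    return [word[:-3], word[-3:]] if len(word) > 3 else [word]


def by_frequency(sentences, segment, min_count):
    """Keep frequent words whole; split words seen fewer than min_count times."""
    counts = Counter(w for s in sentences for w, _ in s)
    return [
        [m for w, _ in s for m in (segment(w) if counts[w] < min_count else [w])]
        for s in sentences
    ]


def by_pos(sentences, segment, split_tags=("VM",)):
    """Split only words whose tag is in split_tags (here: verbs)."""
    return [
        [m for w, t in s for m in (segment(w) if t in split_tags else [w])]
        for s in sentences
    ]


corpus = [
    [("avalu", "PRP"), ("manege", "NN"), ("hoguthale", "VM")],
    [("avanu", "PRP"), ("manege", "NN"), ("hoguthane", "VM")],
]
freq_corpus = by_frequency(corpus, toy_segment, min_count=2)
pos_corpus = by_pos(corpus, toy_segment)
```

Training one LM on each output and decoding the same test set with both gives a direct comparison of the two policies.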

              "So in my corpus should i replace hoguttAne as hogu Ane or as hogutt Ane and make the language model?
              I assume the transcriptions should align with the acoustics during training
              So if i miss the -utt- in the transcription will my acoustics be able to handle it?"

              You can try adding "tt" for the language model (distinct count for same word form but different POS). Again, you can try both. But in any case, keep in mind that the acoustics are determined by the lexicon, not the LM. In other words, you can keep "hogutt Ane" or "hogu Ane", but the pronunciations of "hogutt" and "hogu" should be the same in your dictionary.
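
Producing both corpus variants is mostly bookkeeping. Here is a sketch, assuming a lookup table from whole words to their candidate splits; the `SEGMENTATIONS` table, the scheme names and the example sentence are made up for illustration, and only the two splits of hoguttAne come from the question.

```python
# Hypothetical table: whole word -> candidate segmentations, keyed by scheme.
SEGMENTATIONS = {
    "hoguttAne": {
        "root": ["hogu", "Ane"],          # tense marker dropped from the unit
        "root+tense": ["hogutt", "Ane"],  # tense marker kept on the stem
    },
}


def rewrite(sentence, scheme):
    """Replace each word that has a segmentation under `scheme`; keep the rest."""
    out = []
    for w in sentence.split():
        out.extend(SEGMENTATIONS.get(w, {}).get(scheme, [w]))
    return " ".join(out)


def units_needing_pronunciations(sentences, scheme):
    """List every LM unit that must have an entry in the dictionary."""
    return sorted({u for s in sentences for u in rewrite(s, scheme).split()})


transcripts = ["avanu manege hoguttAne"]
root_corpus = [rewrite(s, "root") for s in transcripts]
tense_corpus = [rewrite(s, "root+tense") for s in transcripts]
root_units = units_needing_pronunciations(transcripts, "root")
```

The unit list is what has to be checked against the dictionary: whichever scheme is used for the LM text, each unit needs a pronunciation entry there.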

               
