I have a few doubts regarding how to approach LM training. Are there any best practices that are followed that give the maximum benefit for speech recognition applications?
For cases like possessive nouns and plural nouns, should we retain the possessive and plural forms of the words or stem them as in other applications? E.g. 'car, cars, car's' vs. 'car'. If we retain the different forms, then we need to have those different forms in the dictionary, which means three times as many noun forms. Or, if we stem them, how do we recover the different forms?
With regard to dates (years in particular), how do we represent them in the LM corpus? E.g. Should we expand the year '2015' to 'twenty fifteen' or 'two thousand fifteen'?
For cases like proper nouns and abbreviations (among others), should we retain the capitalized forms and all-caps forms like for information retrieval and machine translation purposes or make all lower case? Which is best for speech recognition? My understanding is that keeping the case intact retains some semantic info albeit small.
E.g. 'General Motors' vs 'general motors'
E.g. 'US' vs 'us'
> For cases like possessive nouns and plural nouns, should we retain the possessive and plural forms of the words or stem them as in other applications? E.g. 'car, cars, car's' vs. 'car'. If we retain the different forms, then we need to have those different forms in the dictionary, which means three times as many noun forms.
You need to keep all forms in the vocabulary. The forms are actually very important because they help to predict the next word properly and increase the quality of the LM.
> Or, if we stem them, how do we recover the different forms?
It is not recommended to stem the words.
> With regard to dates (years in particular), how do we represent them in the LM corpus? E.g. Should we expand the year '2015' to 'twenty fifteen' or 'two thousand fifteen'?
Yes, expand it into words; the tutorial says that.
> For cases like proper nouns and abbreviations (among others), should we retain the capitalized forms and all-caps forms like for information retrieval and machine translation purposes or make all lower case? Which is best for speech recognition? My understanding is that keeping the case intact retains some semantic info albeit small.
The tutorial recommends all lower case.
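For anyone following along, here is a minimal sketch of the two recommendations above taken together: lowercase everything, but keep the inflected forms (car, cars, car's) as separate vocabulary entries rather than stemming. The function name `normalize_line` and the token pattern are my own assumptions, not something taken from the tutorial. Digits are simply dropped here, so numbers would need to be expanded into words before this step.

```python
import re

# Assumption: a "word" is a run of letters, optionally containing internal
# apostrophes, so "car's" survives as one token distinct from "car" and "cars".
_TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def normalize_line(line):
    """Lowercase a line and drop punctuation, keeping inflected forms intact."""
    return " ".join(_TOKEN.findall(line.lower()))


if __name__ == "__main__":
    print(normalize_line("Car, cars, car's"))
    # -> car cars car's
    print(normalize_line("General Motors is based in the US."))
    # -> general motors is based in the us
```

Note that the second example shows the cost discussed in this thread: after lowercasing, "US" and "us" become the same token.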
Thanks for replying. Still have some doubts. I am not sure I have conveyed my questions in a proper way.
With regard to years, e.g. 2015, the doubt is not whether to expand it into word form, but how to expand it: should it be expanded to "twenty fifteen" or to "two thousand fifteen", since both spoken forms exist, and possibly more?
With regard to abbreviations and capitalized forms in the corpus, let's take the example of "US" vs. "us". The two have different meanings that can only be told apart when the case is preserved. If we lowercase everything in the text, the "US" which represents "United States" becomes "us", which is a different word with a different meaning. An example that I think might illustrate this is given below:
"The economic situation leaves US in the middle of a crisis"
vs
"the economic situation leaves us in the middle of a crisis"
So retaining capitalized forms and abbreviations retains some amount of inherent meaning in the text, which is the practice in IR and MT. So my question is: what are the best recommended practices for text preparation in the above cases, as well as other special cases, that will maximize the accuracy of the output? Besides the tutorial, any pointers or papers on these and other best practice recommendations would be most helpful.
Once again, thanks in advance.
> With regard to years, e.g. 2015, the doubt is not whether to expand it into word form, but how to expand it: should it be expanded to "twenty fifteen" or to "two thousand fifteen", since both spoken forms exist, and possibly more?
You can mix both, each with a certain probability.
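A rough sketch of what "mix both with a certain probability" could look like when preparing the corpus. Everything here is an assumption for illustration: the helper names (`year_readings`, `expand_years`), the `p_pairwise` parameter, the supported range of 1000 to 2099, and the rule that any four-digit number in that range is a year. A real corpus would need a better way to tell years from other numbers.

```python
import random
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_readings(year):
    """Return the spoken forms this sketch knows for a year (1000..2099)."""
    if not 1000 <= year <= 2099:
        raise ValueError("year out of the range handled by this sketch")
    century, rest = divmod(year, 100)
    readings = []
    # Pairwise reading: "twenty fifteen", "nineteen oh five", "nineteen hundred".
    if rest == 0:
        if century % 10 != 0:
            readings.append(f"{two_digits(century)} hundred")
    elif rest < 10:
        readings.append(f"{two_digits(century)} oh {ONES[rest]}")
    else:
        readings.append(f"{two_digits(century)} {two_digits(rest)}")
    # "Thousand" reading, only where it is natural: 10xx and 20xx.
    if century % 10 == 0:
        thousands = f"{ONES[century // 10]} thousand"
        readings.append(thousands if rest == 0 else f"{thousands} {two_digits(rest)}")
    return readings


# Assumption: every standalone 4-digit number from 1000 to 2099 is a year.
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def expand_years(text, rng, p_pairwise=0.5):
    """Replace years with a spoken form, choosing between readings at random."""
    def repl(match):
        readings = year_readings(int(match.group()))
        if len(readings) == 1:
            return readings[0]
        return readings[0] if rng.random() < p_pairwise else readings[1]
    return YEAR.sub(repl, text)


if __name__ == "__main__":
    print(year_readings(2015))
    # -> ['twenty fifteen', 'two thousand fifteen']
    rng = random.Random(0)  # fixed seed so a corpus build is repeatable
    for _ in range(3):
        print(expand_years("the crisis of 2015 began in 1999", rng))
```

Running the expansion over a large corpus with, say, `p_pairwise=0.7` would leave both spoken forms in the LM, roughly in that proportion. The right proportion is something you would have to estimate from real transcripts.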
> With regard to abbreviations and capitalized forms in the corpus, let's take the example of "US" vs. "us". The two have different meanings that can only be told apart when the case is preserved. If we lowercase everything in the text, the "US" which represents "United States" becomes "us", which is a different word with a different meaning. An example that I think might illustrate this is given below:
It is OK to mix them. Moreover, in most real-life texts, such as texts collected from the web, "us" will be in lowercase and you will not be able to repair it.
> Besides the tutorial, any pointers or papers on these and other best practice recommendations would be most helpful.
Such things are usually not covered in papers; it is more a matter of engineering practice. It is also of more interest to TTS researchers than to ASR researchers.
I agree this is more related to engineering practice. I have given the sparrowhawk package a cursory look. I have already been trying to implement on my own some of what it covers (although with lots of bugs), but I think a package like that would be more helpful.
I also agree that this is of more interest to TTS researchers than ASR researchers. However, the text preparation or pre-processing part of LM training is the reverse of text processing in TTS, especially if the text corpus for the LM does not come from a transcript source but rather from the web or other sources. An example that comes to mind immediately is the dollar representation. In text form it is written "$20", which is what we find in a web source, whereas in spoken form it is "twenty dollars." So some kind of processing is inevitable if we need to match the LM with the real-life spoken equivalents.
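To make the "$20" example concrete, this is roughly the kind of rewrite I have in mind. It is only a sketch under my own assumptions: the names `number_to_words` and `expand_dollars` are made up for illustration, it handles whole-dollar amounts from 0 to 999 only, and anything else (cents, thousands separators, larger amounts) is left untouched for a fuller normalizer to deal with.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


# Match "$" followed by 1-3 digits, but not amounts such as "$20.50",
# "$1,000" or "$2000", which this sketch deliberately leaves alone.
DOLLARS = re.compile(r"\$(\d{1,3})\b(?![.,]\d)")


def expand_dollars(text):
    """Rewrite written dollar amounts into the form a speaker would say."""
    def repl(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    return DOLLARS.sub(repl, text)


if __name__ == "__main__":
    print(expand_dollars("it costs $20 today"))
    # -> it costs twenty dollars today
    print(expand_dollars("a $1 coin and a $125 fee"))
    # -> a one dollar coin and a one hundred twenty five dollars fee
```

The second example also shows a limit of such a simple rule: before a noun, a speaker would say "a one hundred twenty five dollar fee", which is exactly the sort of context-dependent case that a dedicated package handles better than ad hoc regular expressions.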
Once again, thanks for pointing me to that package and all the valuable inputs. Really appreciate that.
Last edit: Vickie 2016-05-09
You can check https://github.com/google/sparrowhawk
You can get it on scihub http://sci-hub.bz/10.1017/s1351324914000175