feddybear - 2018-12-05

Let's assume a piecewise (non-end-to-end) system with individual acoustic model, language model, pronunciation dictionary, and decoder settings.

Let's further assume that the acoustic model has already been used in another task, where it was shown to perform well (i.e. it is well trained).

For the statistical language model, assume there is a pre-defined vocabulary set and non-overlapping training and test sets. The vocabulary covers the training set completely, and the test set has an OOV rate below 1%.

Now my question: what is the significance, in terms of evaluating the performance of a speech recognition system, if, given the best acoustic model, we create a biased language model? That is, we estimate the model from the test-set counts alone, then apply smoothing so that all other words in the pre-defined vocabulary (those not seen in the test set) still receive non-zero probability. We then optimize the decoding parameters. No assumption is made about the quality of the pronunciation dictionary; it can be bad, good enough, or good.
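To make the "biased LM" concrete, here is a minimal sketch of what I mean, assuming a unigram model with add-alpha smoothing (the function name and toy data are my own; any n-gram order and smoothing scheme would illustrate the same point):

```python
from collections import Counter

def biased_unigram_lm(test_sentences, vocab, alpha=1.0):
    """Unigram probabilities estimated from *test-set* counts only,
    with add-alpha smoothing over a fixed, pre-defined vocabulary."""
    counts = Counter(w for sent in test_sentences for w in sent.split())
    total = sum(counts.values())
    # Smoothing gives non-zero mass to every vocabulary word,
    # including those never seen in the test set.
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab))
            for w in vocab}

vocab = {"the", "cat", "sat", "dog"}
lm = biased_unigram_lm(["the cat sat"], vocab)
# "dog" is unseen in the test set but still gets probability mass
```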

We are having an argument about whether this can be considered an upper bound, or an indicator of the acoustic model's performance (assuming the pronunciation lexicon is perfect). But my intuition keeps coming back to how even WER evaluation leaves room for interpretation: word accuracy (100% minus WER) can go negative when insertions push WER above 100%, which gives the evaluation extra degrees of freedom. Also, given the complexity and interdependence of each component, I cannot really make any conclusive statement. I wonder if anyone has any substantiated thoughts about this issue.
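As a small illustration of the WER point, here is a standard Levenshtein-based WER computation (my own sketch, not tied to any particular toolkit) showing how insertions alone can push WER above 100%:

```python
def wer(ref, hyp):
    """Word error rate: edit distance (substitutions, deletions,
    insertions) divided by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

e = wer("hello world", "oh hello there big world now")  # 4 insertions / 2 ref words = 2.0
# WER is 200%, so word accuracy (1 - WER) is -100%
```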


Last edit: feddybear 2018-12-05