If I am right, for isolated-word recognition Sphinx4 uses one whole word as a unit. For LVCSR, does Sphinx4 use triphones as units? If so, how many triphones does it use?
Thanks a lot!
--Larry
Sphinx-4 can use arbitrarily sized contexts in recognition. The current linguists used by Sphinx-4 look for a single left and a single right surrounding context. The type of unit used is defined by the acoustic model. For instance, for TIDIGITS the units are 'phone within a word' units:
AX_one
AY_five
AY_nine
EH_seven
EY_eight
E_seven
F_five
F_four
II_three
II_zero
I_six
K_six
N_nine
N_nine_2
N_one
N_seven
OO_two
OW_four
OW_oh
OW_zero
R_four
R_three
R_zero
SIL
S_seven
S_six
S_six_2
TH_three
T_eight
T_two
V_five
V_seven
W_one
Z_zero
There are 35 of these units. The TIDIGITS acoustic model defines about 350 context-dependent units.
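To make the 'phone within a word' naming concrete, here is a minimal sketch that splits a few of the unit names above into their parts and groups them by word. It assumes the names follow a PHONE_word pattern and that a trailing number (as in N_nine_2) marks a repeated occurrence of the same phone within that word; that reading is inferred from the list, not taken from the model documentation. The helper names split_unit and units_by_word are made up for this example.

```python
from collections import defaultdict

# A subset of the TIDIGITS unit names listed above.
UNITS = ["AY_nine", "I_six", "K_six", "N_nine", "N_nine_2",
         "SIL", "S_six", "S_six_2"]


def split_unit(name):
    """Split a unit name into (phone, word, occurrence).

    Assumed convention: PHONE_word or PHONE_word_N. A name without
    an underscore (such as SIL) is not tied to any word.
    """
    parts = name.split("_")
    phone = parts[0]
    word = parts[1] if len(parts) > 1 else None
    occurrence = int(parts[2]) if len(parts) > 2 else 1
    return phone, word, occurrence


def units_by_word(units):
    """Group unit names by the word they belong to, keeping list order."""
    grouped = defaultdict(list)
    for unit in units:
        _, word, _ = split_unit(unit)
        if word is not None:
            grouped[word].append(unit)
    return dict(grouped)
```

For example, units_by_word(UNITS)["six"] collects I_six, K_six, S_six and S_six_2, which shows that each unit models a phone only as it appears in one particular digit.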
The acoustic models used for general speech recognition (WSJ, RM1, HUB4) use about 40 phonemes and about 30,000 triphones.
If you are interested in looking closer at this, unpack one of the acoustic models and take a look at the file whose name ends in ".mdef". This file contains the information about the units, including the context-independent units and the context-dependent (triphone) units.
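If you would rather not read the file by eye, the counts can be pulled out with a few lines of script. The sketch below assumes a text-format .mdef whose header contains lines of the form "<count> n_base" and "<count> n_tri"; check that against the file you actually unpack, since the layout is an assumption here. The function name count_units and the sample contents (including the numbers in it) are made up for illustration.

```python
def count_units(mdef_lines):
    """Return the n_base and n_tri counts found in an .mdef header.

    Assumes header lines look like '<integer> n_base' and
    '<integer> n_tri'; any other line is ignored.
    """
    counts = {}
    for line in mdef_lines:
        fields = line.split()
        if (len(fields) == 2 and fields[0].isdigit()
                and fields[1] in ("n_base", "n_tri")):
            counts[fields[1]] = int(fields[0])
        if len(counts) == 2:
            break  # both header values found, no need to read further
    return counts


# Made-up sample in the assumed layout, not taken from a real model.
SAMPLE = """\
# hypothetical model definition
3 n_base
2 n_tri
"""

# Typical use on a real file (the path is a placeholder):
#   with open("path/to/model.mdef") as f:
#       print(count_units(f))
```

Here n_base would be the number of context-independent units and n_tri the number of triphones, so for the general-purpose models you should see values in the neighborhood of the figures quoted above.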
paul