I am trying to use Sphinx's forced aligner to synchronize lyrics with songs. As expected, I didn't get great results with the default en-us acoustic model. I'm planning to put some effort into improving the results and need advice on the following:
Assuming I have enough corpus data, should I try adapting the existing en-us model, or am I better off training a new model from scratch suited to my application?
I am using professional-grade software (Audionamix) to extract vocals from songs. The vocal extraction process suppresses the music quite well (~80%), but it also distorts the voice a little. My question is: should I train/adapt the acoustic model with the vocal-extracted audio, or just use the songs as they are (with the music in them)?
It is better to train from scratch, and it is better to train on audio cleaned from music.
It is better to use KAML: http://www.loria.fr/~aliutkus/kaml/
You can also get some ideas from the previous research on the subject:
http://www.asmp.eurasipjournals.com/content/pdf/1687-4722-2010-546047.pdf
https://www.comp.nus.edu.sg/~kanmy/papers/p1568934817-wang.pdf
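Whichever separation tool you end up with, note that the standard CMUSphinx broadband models expect 16 kHz, 16-bit, mono audio, so the extracted vocals should be converted to that format before training. A minimal sketch using only the Python standard library (file names are placeholders), assuming WAV input:

```
import audioop
import wave

# Convert an extracted-vocals WAV file to 16 kHz, 16-bit, mono PCM,
# the format the standard CMUSphinx broadband models expect.
with wave.open("vocals.wav", "rb") as src:
    n_channels = src.getnchannels()
    samp_width = src.getsampwidth()
    frame_rate = src.getframerate()
    frames = src.readframes(src.getnframes())

if n_channels == 2:
    frames = audioop.tomono(frames, samp_width, 0.5, 0.5)  # average L/R

if samp_width != 2:
    frames = audioop.lin2lin(frames, samp_width, 2)  # force 16-bit samples
    samp_width = 2

if frame_rate != 16000:
    frames, _ = audioop.ratecv(frames, samp_width, 1, frame_rate, 16000, None)

with wave.open("vocals-16k.wav", "wb") as dst:
    dst.setnchannels(1)
    dst.setsampwidth(2)
    dst.setframerate(16000)
    dst.writeframes(frames)
```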
I have done some initial experiments. I trained an acoustic model on roughly 5 hours of audio (~100 songs from a single artist) and tested with 10 songs from the same artist. I also created a language model from the lyrics of these ~110 songs (about 1,400 words in the vocabulary). I got the following WER with the decoder:
SENTENCE ERROR: 98.0% (192/196)
WORD ERROR RATE: 88.9% (2132/2398)
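(For context, a decoding run of this kind can be sketched with the pocketsphinx Python bindings; the model, language model, dictionary, and audio paths below are illustrative placeholders, not the exact setup used here:)

```
from pocketsphinx import Decoder

# Placeholder paths: the trained acoustic model, the lyrics language
# model, the pronunciation dictionary, and one test recording.
config = Decoder.default_config()
config.set_string("-hmm", "model_parameters/lyrics.cd_cont_200")
config.set_string("-lm", "lyrics.lm")
config.set_string("-dict", "lyrics.dic")

decoder = Decoder(config)
decoder.start_utt()
with open("song-16k.raw", "rb") as f:  # 16 kHz, 16-bit, mono raw PCM
    decoder.process_raw(f.read(), False, True)  # no_search=False, full_utt=True
decoder.end_utt()

hyp = decoder.hyp()
print("hypothesis:", hyp.hypstr if hyp else "(no result)")

# Word timings for lyric synchronization; frames are 10 ms steps by default.
for seg in decoder.seg():
    print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)
```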
I'm determined to improve these results and have larger training data available. I have two questions:
For training with a larger data set, should I choose "similar" training data, i.e. songs from a single genre, a set of similar artists, etc., or should I train a generic music model with all the corpus I have? As you suggested, I'm using KAML for vocal extraction.
I'm ultimately interested in forced alignment accuracy. I am assuming that the lower the WER of the trained model, the higher its accuracy will be when used for forced alignment. Is this assumption correct?
Specifically, is it possible that a trained acoustic model with high WER still performs very well on the forced alignment task for the same test data?
Looking forward to your insights. Thanks.
For training with a larger data set, should I choose "similar" training data?
Yes.
I am assuming that the lower the WER of the trained model, the higher its accuracy will be when used for forced alignment. Is this assumption correct?
Yes.
Specifically, is it possible that a trained acoustic model with high WER still performs very well on the forced alignment task for the same test data?
No.
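To check that directly, alignment accuracy can also be measured independently of WER by comparing the aligner's word onsets against hand-labeled times. A minimal sketch; the hand-labeled reference is a hypothetical input you would prepare yourself:

```
def alignment_error(reference, hypothesis):
    """Mean and worst absolute word-onset error, in seconds.

    Both arguments are lists of (word, onset_seconds) pairs with the
    same words in the same order: one hand-labeled, one from the aligner.
    """
    errors = []
    for (ref_word, ref_t), (hyp_word, hyp_t) in zip(reference, hypothesis):
        assert ref_word == hyp_word, "forced alignment keeps the word sequence fixed"
        errors.append(abs(ref_t - hyp_t))
    return sum(errors) / len(errors), max(errors)

# Example with made-up numbers:
ref = [("twinkle", 1.20), ("twinkle", 1.85), ("little", 2.40), ("star", 2.95)]
hyp = [("twinkle", 1.32), ("twinkle", 1.80), ("little", 2.61), ("star", 3.10)]
mean_err, max_err = alignment_error(ref, hyp)
print("mean onset error: %.3f s, worst: %.3f s" % (mean_err, max_err))
```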
Thanks for the quick reply. A couple more questions to help with training:
Approximately how much training data is needed for the application I have in mind? A ballpark estimate is fine.
What's an acceptable WER that gives usable results for forced alignment? In other words, what WER should I be aiming for?
The acoustic model training tutorial provides numbers.
WER depends a lot on vocabulary size. For a single speaker, WER must be lower than 10%.
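For completeness, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words, which is why a small, constrained vocabulary pulls it down. A minimal scoring sketch (a standalone helper, not sphinxtrain's own scorer):

```
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[len(ref)][len(hyp)] / float(len(ref))

print(wer("twinkle twinkle little star", "twinkle little star bar"))  # 0.5
```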