Lyric alignment for popular music: Building new vs adapting existing acoustic model

Amit Rathi
2015-12-04
2016-01-21
  • Amit Rathi

    Amit Rathi - 2015-12-04

    I am trying to use Sphinx's forced aligner to synchronize lyrics for songs. As expected, I didn't get great results with the default en-us acoustic model. I plan to put effort into improving the results and need advice on the following:

    1. Assuming I have enough corpus data, should I adapt the existing en-us model, or am I better off training a new model from scratch for my application?

    2. I am using professional-grade software (Audionamix) to extract vocals from songs. The vocal extraction process suppresses the music quite well (~80%), but it also distorts the voice a little. Should I train/adapt the acoustic model on the vocal-extracted audio, or use the songs as they are (with music in them)?

     
  • Amit Rathi

    Amit Rathi - 2016-01-15

    I have done some initial experiments. I trained an acoustic model on ~5 hours of audio (~100 songs from a single artist) and tested it on 10 songs from the same artist. I also created a language model from the lyrics of these ~110 songs (about 1400 words in the vocabulary). I got the following WER with the decoder:

    SENTENCE ERROR: 98.0% (192/196)
    WORD ERROR RATE: 88.9% (2132/2398)
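
    For reference, the percentages above follow directly from the counts in parentheses. The sketch below assumes the usual definition of WER (word-level edit distance divided by the number of reference words); the function name `word_errors` and the sample sentences are illustrative only, and this is not the scoring script that produced the figures above.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit_errors, reference_word_count) for one utterance.

    Errors are the minimum number of word substitutions, deletions and
    insertions needed to turn the hypothesis into the reference.
    """
    ref = reference.split()
    hyp = hypothesis.split()

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (classic Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


# Hypothetical sample pair: one substitution plus one deletion.
errors, total = word_errors("the cat sat on the mat", "the cat sit on mat")
print(errors, total)  # 2 6

# The figures reported in the post, recomputed from their raw counts.
print(f"WORD ERROR RATE: {100 * 2132 / 2398:.1f}%")     # 88.9%
print(f"SENTENCE ERROR:  {100 * 192 / 196:.1f}%")       # 98.0%
```

    Summing the error and reference-word counts over all test utterances before dividing gives a corpus-level WER, which is how a single figure such as 2132/2398 is normally obtained.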

    I'm determined to improve these results and have a larger training set available. I have 2 questions:

    1. For training with a larger data set, should I be choosing "similar" training data, as in songs from a single genre or a set of similar artists, or should I train a generic music model with all the corpus I have? As you suggested, I'm using KAML for vocal extraction.

    2. I'm ultimately interested in forced alignment accuracy. I am assuming that the lower the WER of the trained model, the higher its accuracy will be when used for forced alignment. Is this assumption correct?
      Specifically, is it possible that a trained acoustic model with high WER still performs very well on the forced alignment task for the same test data?

    Look forward to your insights. Thanks.

     
    • Nickolay V. Shmyrev

      For training with a larger data set, should I be choosing "similar" training data?

      Yes

      I'm ultimately interested in forced alignment accuracy. I am assuming that the lower the WER of the trained model, the higher its accuracy will be when used for forced alignment. Is this assumption correct?

      Yes

      Specifically, is it possible that a trained acoustic model with high WER still performs very well on the forced alignment task for the same test data?

      No

       
  • Amit Rathi

    Amit Rathi - 2016-01-18

    Thanks for the quick reply. A couple more questions to help with training:

    1. Approximately how much training data is needed for the application I have in mind? A ballpark estimate is fine.

    2. What's an acceptable WER that gives usable results for forced alignment? In other words, what WER should I be aiming for?

     
    • Nickolay V. Shmyrev

      Approximately how much training data is needed for the application I have in mind? A ballpark estimate is fine.

      The acoustic model training tutorial provides numbers.

      What's an acceptable WER that gives usable results for forced alignment? In other words, what WER should I be aiming for?

      WER depends a lot on vocabulary size. For a single speaker, WER must be lower than 10%.

       
