CMU Sphinx / Forums / Help: Unicode: Do case match?

Alexander... - 2014-01-09

During a sphinxtrain run, I get:

WARNING: This word: ï»¿ was in the transcript file, but is not in the dictionary (ï»¿ திருச்செந்தூர் அருள்மிகு சுப்பிரமணிய சுவாமி திருக்கோவிலில் புதன்கிழமையன்று ஆவணித்திருவிழா தேரோட்டம் நடைபெற்றது ). Do cases match?

I am using Unicode characters in the dictionary. Is there any solution?
It works very well with Roman script. It gives me 200% accuracy.


Nickolay V. Shmyrev - 2014-01-09

First of all, please avoid posting the same question to multiple old threads. Doing that decreases your chance of getting a good answer and is simply not polite.

Second, sphinxtrain supports one specific Unicode encoding: UTF-8. Please make sure you are not using UTF-16 or something like that, and that all input files are encoded in UTF-8.
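
As a rough way to check the encoding of the input files, here is a minimal Python sketch. It is not part of sphinxtrain; the function names `describe_encoding` and `describe_file` are hypothetical, and the check only looks for a byte order mark (BOM) and then tries to decode the bytes as UTF-8.

```python
import codecs

# Byte order marks that identify common Unicode encodings.
# UTF-32 LE is listed before UTF-16 LE because it begins with the same two bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8 with BOM"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def describe_encoding(raw: bytes) -> str:
    """Return a rough label for the encoding of raw file contents."""
    for bom, label in BOMS:
        if raw.startswith(bom):
            return label
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid utf-8"
    return "utf-8"


def describe_file(path: str) -> str:
    """Read a file in binary mode and describe its encoding."""
    with open(path, "rb") as handle:
        return describe_encoding(handle.read())
```

Calling `describe_file` on the dictionary and on each transcription file should return `"utf-8"`. Any other label, including `"utf-8 with BOM"`, is worth investigating before training.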

If you still have trouble, please share your training folder. Providing your data greatly increases the chance of getting a solution; posting the same question to threads from 2005 does not.

- Alexander... - 2014-01-17
  
  Sorry for posting the same question to multiple old threads. I won't do it again. I am using only UTF-8, but I still get the error.
  Here are my files, which are in UTF-8.
  
  words_tamil (2).txt
  
  words_tamil.dic
  
  words_tamil.dic~
  
  words_tamil.html
  
  words_tamil_test.transcription~
  
  words_tamil_train.transcription~
  

Nickolay V. Shmyrev - 2014-01-17

It's better to attach files in a single archive, not as ten links. You also attached backup files with ~ at the end; I doubt you are using them. It's probably better to share the whole folder.

It doesn't seem like your dictionary has a phonetic transcription for the words; it only contains a list of words in random order. Make sure your dictionary has one word per line with a phonetic transcription.
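
To illustrate the "one word per line with a phonetic transcription" layout, here is a minimal Python sketch that flags lines containing a word but no phones. The function name `check_dictionary_lines` is hypothetical, and the phone symbols in the sample are illustrative only, not a recommended Tamil phone set.

```python
def check_dictionary_lines(lines):
    """Return (line_number, text) pairs for lines that lack a transcription.

    A well-formed entry is assumed to be: WORD PHONE [PHONE ...]
    separated by whitespace.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # blank lines are ignored by this check
        if len(fields) < 2:
            # Only a word is present, with no phones after it.
            problems.append((number, line.strip()))
    return problems


sample = [
    "hello HH AH L OW",           # word followed by its phones
    "தேரோட்டம் T EE R OO TT A M",   # phones here are made up for illustration
    "world",                      # word only: reported as a problem
]
print(check_dictionary_lines(sample))
```

A plain word list, like the one described above, would have every non-blank line reported by this check.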

- Alexander... - 2014-01-20
  
  https://drive.google.com/folderview?id=0B5fhKPTbTJ4Pck9TTENoVmxKZnM&usp=sharing
  
  This is my whole project. Please help me out.
  
  Last edit: Alexander... 2014-01-22
  
  - Nickolay V. Shmyrev - 2014-01-29
    
    You still didn't fix the main problem I told you about:
    
    It doesn't seem like your dictionary has a phonetic transcription for the words; it only contains a list of words in random order. Make sure your dictionary has one word per line with a phonetic transcription. See the acoustic model training tutorial for details.
    
    - Alexander... - 2014-02-03
      
      Hello. I have edited my dictionary to add phonetic transcriptions. It is attached here. Is it correct?
      
      words_tamil2.dic
      

Nickolay V. Shmyrev - 2014-02-03

Is it correct?

No, the file is not correct:

1) The encoding is UTF-16 instead of UTF-8.

2) It doesn't contain a single word with a transcription per line. There are lines which do not describe words.
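
For the first point, one possible way to re-encode a UTF-16 file as UTF-8 is sketched below in Python. This is an assumption about tooling, not something prescribed in this thread; the function names are hypothetical, and any text editor or converter that can save as UTF-8 without a byte order mark would do the same job.

```python
def utf16_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 bytes and re-encode them as UTF-8 without a BOM."""
    # The "utf-16" codec reads the byte order mark to pick the byte order
    # and removes it from the decoded text.
    text = raw.decode("utf-16")
    # The plain "utf-8" codec does not write a byte order mark.
    return text.encode("utf-8")


def convert_file(source_path: str, target_path: str) -> None:
    """Convert a UTF-16 file on disk to a UTF-8 file."""
    with open(source_path, "rb") as source:
        converted = utf16_to_utf8(source.read())
    with open(target_path, "wb") as target:
        target.write(converted)
```

Writing to a separate target path keeps the original file intact in case the source turns out not to be UTF-16 after all.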

- Alexander... - 2014-02-03
  
  Hmmm... Let me edit again
  

Alexander... - 2014-02-07

Hello
I have discussed this with a linguistics expert, who said that for Tamil script (which my project uses) the phonetic transcription is the same as the words, as in the dictionary file I posted. I am using only UTF-8 in my dictionary, but I still get the word-matching error:
Phase 6: Checking that all the words in the transcript are in the dictionary
Words in dictionary: 88
Words in filler dictionary: 3
WARNING: This word: was in the transcript file, but is not in the dictionary ( திருச்செந்தூர் அருள்மிகு சுப்பிரமணிய சுவாமி திருக்கோவிலில் புதன்கிழமையன்று ஆவணித்திருவிழா தேரோட்டம் நடைபெற்றது ). Do cases match?
I have rechecked every aspect, namely:
1. whether the words in the transcript are in the dictionary
2. whether the case matches where they appear
3. whether words in the transcript are misspelled
4. whether the dictionary file is properly sorted

For Roman script, Phase 6 passed. For Unicode (UTF-8), Phase 6 FAILED.
When I run Sphinx, all the files (mixtures, means, variances, result files such as align and match, feat.params, mdef, mixture_weights, noisedict, transition_matrices) are created even though Phase 6 fails. With those files I am getting accuracy of nearly 70-80. I am struggling to get accuracy above 90. Please help me out.
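
One detail in the first warning of this thread may be relevant: the reported word, ï»¿, is what the UTF-8 byte order mark (bytes EF BB BF) looks like when displayed as Latin-1. That suggests, though it is not confirmed anywhere in this thread, that the transcript file begins with a BOM, which would then be read as a "word" that is not in the dictionary. The Python sketch below illustrates the effect under stated assumptions: the function name, the utterance id `utt_001`, and the two-word dictionary are hypothetical, and the word check is a simplified imitation of Phase 6, not the actual sphinxtrain code.

```python
BOM = "\ufeff"  # byte order mark as a Unicode character


def words_missing_from_dictionary(transcript_lines, dictionary_words):
    """Return transcript tokens absent from the dictionary, in order seen."""
    missing = []
    for line in transcript_lines:
        for token in line.split():
            if token in ("<s>", "</s>"):
                continue  # sentence markers, assumed not to need entries
            if token.startswith("(") and token.endswith(")"):
                continue  # trailing utterance id, e.g. (utt_001)
            if token not in dictionary_words and token not in missing:
                missing.append(token)
    return missing


dictionary = {"தேரோட்டம்", "நடைபெற்றது"}
raw = (BOM + " தேரோட்டம் நடைபெற்றது (utt_001)\n").encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM, which then counts as an unknown word.
with_bom = raw.decode("utf-8").splitlines()
print(words_missing_from_dictionary(with_bom, dictionary) == [BOM])

# Decoding as "utf-8-sig" drops a leading BOM, so nothing is reported.
without_bom = raw.decode("utf-8-sig").splitlines()
print(words_missing_from_dictionary(without_bom, dictionary))
```

If this is the cause, saving the transcript and dictionary as UTF-8 without a BOM would be the thing to try.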


Alexander... - 2014-02-10

Could anyone please reply to help solve this issue?

