
Merging dictionaries

2016-09-13
2020-07-07
  • Soren Ebbesen

    Soren Ebbesen - 2016-09-13
    1. Is there a tool to easily merge two dictionaries? I have a feeling that the G2P tool can be used, but I haven't figured out how.

    2. If I manually add a word to my dictionary, do I also have to add it manually to my language model (the .lm file)? I mean, if the word doesn't exist in the language model, the probability of getting picked is zero I suppose? Assuming I have to add it to the LM, would you suggest just to add it to the list of 1-grams and give it the same statistics as a "similar" word?

    Thanks,
    Soren

     
    • Nickolay V. Shmyrev

      Is there a tool to easily merge two dictionaries?

      Yes, the Python scripting language.

      I have a feeling that the G2P tool can be used, but I haven't figured out how.

      No, unlikely

      If I manually add a word to my dictionary, do I also have to add it manually to my language model (the .lm file)?

      The language model is not easily editable; you need to use the LM tools to update it.

      I mean, if the word doesn't exist in the language model, the probability of getting picked is zero I suppose?

      Yes

        Assuming I have to add it to the LM, would you suggest just to add it to the list of 1-grams and give it the same statistics as a "similar" word?
      

      You can check http://cmusphinx.sourceforge.net/wiki/tutoriallmadvanced

       
  • Daniel Wolf

    Daniel Wolf - 2016-12-28

    I'm in a similar situation. I want my application to support both a built-in dictionary and a "user dictionary". If a user dictionary is specified, it should be merged into the built-in dictionary.

    For example:

    Built-in dictionary:

    [...]
    biscuit B IH S K AH T
    bishop B IH SH AH P
    [...]
    

    User dictionary:

    biscuit B IH S K IH T
    biscuit(2) B IH S K UH IY
    thimbleweed TH IH M B AH L W IY D
    

    Merged dictionary:

    [...]
    biscuit B IH S K AH T
    biscuit(2) B IH S K IH T
    biscuit(3) B IH S K UH IY
    bishop B IH SH AH P
    thimbleweed TH IH M B AH L W IY D
    [...]
    

    So by merge, I mean that

    • any words not in the built-in dictionary should be added;
    • any words that already exist should be added as alternative pronunciations;
    • numeric suffixes should be auto-increased as needed.
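The three rules above can be sketched in a few lines of Python. This is only an illustration, not an existing CMUSphinx tool: the function names `parse_entries` and `merge_dictionaries` are made up for this example, entries are assumed to be one "word phones" pair per line, and skipping identical pronunciations is an assumption that the rules do not spell out.

```python
import re

# Matches an alternate-pronunciation suffix such as "(2)" at the end of a word.
_SUFFIX = re.compile(r"\(\d+\)$")


def parse_entries(lines):
    """Yield (base_word, pronunciation) pairs, skipping blank or malformed lines."""
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        word, phones = parts
        # Drop any existing numeric suffix; suffixes are reassigned on output.
        yield _SUFFIX.sub("", word), " ".join(phones.split())


def merge_dictionaries(builtin_lines, user_lines):
    """Return merged dictionary lines, built-in pronunciations first."""
    merged = {}
    for source in (builtin_lines, user_lines):
        for word, phones in parse_entries(source):
            pronunciations = merged.setdefault(word, [])
            # Assumption: an identical pronunciation is not added twice.
            if phones not in pronunciations:
                pronunciations.append(phones)

    output = []
    for word in sorted(merged):
        for index, phones in enumerate(merged[word], start=1):
            # The first pronunciation has no suffix; later ones get (2), (3), ...
            label = word if index == 1 else "%s(%d)" % (word, index)
            output.append("%s %s" % (label, phones))
    return output


builtin = ["biscuit B IH S K AH T", "bishop B IH SH AH P"]
user = [
    "biscuit B IH S K IH T",
    "biscuit(2) B IH S K UH IY",
    "thimbleweed TH IH M B AH L W IY D",
]
merged_lines = merge_dictionaries(builtin, user)
print("\n".join(merged_lines))
```

Sorting is done on the base word, so all pronunciations of one word stay together and are numbered in the order they were read.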

    I don't want to merge dictionary files, creating a new file. Instead, I'd prefer to leave the two files unchanged and merge the entries in memory, using C/C++.

    I understand that there is no built-in functionality for doing this. So here's a rough outline of what I'm planning to do.

    1. Load the built-in dictionary using the -dict option
    2. Read the user dictionary manually, splitting each line into word string and pronunciation string
    3. For each pair: manually strip any numeric suffix, then call ps_add_word.
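Steps 2 and 3 of this outline might look roughly like the following. It is written in Python for brevity, although the same loop carries over to C/C++. The `add_word` callback is a hypothetical stand-in for whatever wraps `ps_add_word`; the sketch does not call the decoder itself and makes no claim about how `ps_add_word` treats duplicates.

```python
import re

# Matches an alternate-pronunciation suffix such as "(2)" at the end of a word.
_SUFFIX = re.compile(r"\(\d+\)$")


def load_user_entries(lines, add_word):
    """Pass each user-dictionary entry to add_word(word, phones).

    add_word is a placeholder callback (an assumption of this sketch);
    in a real application it would forward to ps_add_word.
    Returns the number of entries passed on.
    """
    count = 0
    for line in lines:
        # Step 2: split the line into the word and the pronunciation string.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        word, phones = parts
        # Step 3: strip any numeric suffix before handing the word over.
        word = _SUFFIX.sub("", word)
        add_word(word, " ".join(phones.split()))
        count += 1
    return count


# Usage: record the calls instead of talking to a decoder.
calls = []
user_lines = [
    "biscuit B IH S K IH T",
    "biscuit(2) B IH S K UH IY",
    "thimbleweed TH IH M B AH L W IY D",
]
added = load_user_entries(user_lines, lambda w, p: calls.append((w, p)))
print(added, calls)
```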

    This approach only works if ps_add_word does the following:

    • If the word already exists, a new pronunciation is added
    • If the same word with the same pronunciation already exists, this is a no-op.

    So I wonder: Does this approach make sense? Is there a better way that relies on existing functionality?

     
    • Nickolay V. Shmyrev

      If the same word with the same pronunciation already exists, this is a no-op.

      Not sure what it does currently, but we can modify the code to ignore such a case.

      So I wonder: Does this approach make sense?

      Looks ok

       
  • larabrian

    larabrian - 2020-07-07

    PEP 448 (Python 3.5 and later) also expanded the abilities of ** by allowing this operator to be used for dumping key/value pairs from one dictionary into a new dictionary. Using the unpacking operator inside a dictionary literal, you can merge the two dictionaries in a single expression.

    dict1 = {1:'one' , 2:'two'}
    dict2 = {3:'three', 4:'four'}
    fDict = {**dict1, **dict2}
    print(fDict)
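One caveat when applying a `**` unpacking merge to pronunciation dictionaries: when both inputs share a key, the value from the right-hand dictionary replaces the earlier one instead of being kept as an alternative. A small sketch, assuming each dictionary maps a word to a single pronunciation string (the variable names are illustrative only):

```python
# Assumed representation: word -> single pronunciation string.
builtin = {"biscuit": "B IH S K AH T", "bishop": "B IH SH AH P"}
user = {"biscuit": "B IH S K IH T", "thimbleweed": "TH IH M B AH L W IY D"}

# On a key clash, the later (right-hand) dictionary wins.
merged = {**builtin, **user}

print(merged["biscuit"])
print(sorted(merged))
```

Here the built-in pronunciation of "biscuit" is lost, so this kind of merge only suits cases where overriding is the intended behaviour.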

     
