Re: [Dictionarymaker-devel] DM: import dictionary into project

Hi there, 

I preferred Andries's approach, with the addition that 3 only applies
to verified words (which I sort of assumed).  In general, this process
will be used to combine two dictionaries that may come from very
different sources - it won't just be used to add a dictionary to a
bootstrapped project. Also, before adding a new dictionary, I would
encourage a user to export the current dictionary and rather create a
new project, which means both source dictionaries will continue to exist
separate from the new combined version.  

For example,  after creating dictionary A through bootstrapping I find
dictionary B on the Internet, and would like to combine the two. A quick
script changes dict B into the correct format. Now I create a new
project, initialize it with dict A and select to import dictionary B (or
vice versa!)  Any of the dictionaries may have mistakes in them - even
if marked valid. Whenever they disagree, I would like to make a choice.
If I'm more sure of one dictionary than the other, I can also
multiselect all words and reject or ignore en masse.

My suggestion (from the user's perspective):
> 1. File browse. Pick dictionary file (verify that it is in a *.dict
>    format, whatever it is called)
> 2. Add all words that do not already exist in dictionary (check for
>    missing graphemes)
> 3. Show list of conflict words+phonemes+status, where conflict words
>    are only those that have been verified on both sides, and conflict.
>    If both are unverified, then just add the word (the pronunciation
>    will be created through prediction based on the rule set, as
>    usual). If only one is verified, import that verdict, whatever it
>    is. Also import the pronunciation if verdict==correct (check for
>    missing phonemes). Only for words that have conflicting verdicts,
>    continue to 4.
> 4. single/multi select word+phoneme+status and select Replace or
>    ignore
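
The four steps above could be sketched roughly as follows. This is only
an illustration under stated assumptions: the entry layout (a
(phonemes, status) tuple), the status strings, and the function name
classify are hypothetical, and "conflict" is taken to mean that both
sides are verified and differ in phonemes or status.

```python
def classify(old, new):
    """Decide what to do with one imported word.

    old and new are (phonemes, status) tuples; old is None when the
    word is not in the current dictionary. All names are hypothetical.
    Returns "add", "import", "keep" or "conflict".
    """
    if old is None:
        return "add"          # step 2: word does not exist yet
    old_verified = old[1] != "unverified"
    new_verified = new[1] != "unverified"
    if old_verified and new_verified:
        # Step 3: only words verified on both sides can conflict.
        return "conflict" if old != new else "keep"
    if new_verified:
        return "import"       # only the imported side carries a verdict
    return "keep"             # nothing new to learn from the import


def conflicts(current, incoming):
    """Return the words that would be shown to the user in step 4."""
    return sorted(
        word for word, entry in incoming.items()
        if classify(current.get(word), entry) == "conflict"
    )
```

Only the words returned by conflicts() would reach the Replace/ignore
selection; everything else is resolved without user input.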

Cheers
Marelie 

>>> Thomas Fogwill <tfo...@us...> 08/01/06 3:23 PM
>>> 
> >>> avrensbu <avr...@cs...> 08/01/06 11:06 AM >>> 
> I want to confirm functionality for importing dictionary into
> project.

My proposal follows. Please comment.

> 3. Show list of conflict words+phonemes+status
> 4. single/multi select word+phoneme+status and select Replace or
> ignore

This seems like a lot of work for the user. I think the system should
do most of this work (the whole "sane default system behaviour"
philosophy). The way I see it, we always want to preserve (in our
current dictionary) any information that was explicitly provided by
the user. This is important because the dictionary file being imported
is not altered, but the file for the current dictionary is. If we
overwrite any information, it is thus lost.

Throughout the rest of this email, OW refers to the Old_Word (i.e. the
word in our current dictionary), and NW refers to the New_Word (i.e.
the word in the dictionary being imported). These are the cases to
consider:
     1. If OW == null (i.e. this word is not in our dictionary), we
        import NW verbatim (checking graphemes and phonemes - see
        below)
     2. If OW != null (i.e. exists in current dict) && OW.status ==
        CORRECT then skip NW completely (user marked this
        pronunciation as VALID, so we shouldn't mess with it)
     3. OW != null && OW.status == UNVERIFIED: if NW.status ==
        UNVERIFIED, skip NW, otherwise import NW verbatim (checking
        phonemes - see below). In this case, the user provided no
        info for OW, and the only info we have for this word is
        (possibly) a list of phonemes as predicted by the system. If
        NW has more info, we should use that info.
     4. OW != null && OW.status == INVALID or AMBIGUOUS: the user has
        specified that this word is invalid or ambiguous. We must not
        lose this info, so we skip NW.
     5. OW != null && OW.status == UNCERTAIN: the user said they
        didn't know what the correct pronunciation for this word was.
        If NW.status is "stronger" (i.e. CORRECT, INVALID, or
        AMBIGUOUS) then we import NW verbatim, otherwise we skip it.

The above implies 3 levels for word statuses:
High: CORRECT, AMBIGUOUS, INVALID
Medium: UNCERTAIN
Lowest: UNVERIFIED
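
As a rough sketch, the five cases and the three status levels above
reduce to a single comparison. The names Status, LEVEL and
should_import are made up for illustration and are not taken from the
DictionaryMaker code base.

```python
from enum import Enum


class Status(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INVALID = "invalid"
    UNCERTAIN = "uncertain"
    UNVERIFIED = "unverified"


# Status levels as listed above: High = 2, Medium = 1, Lowest = 0.
LEVEL = {
    Status.CORRECT: 2,
    Status.AMBIGUOUS: 2,
    Status.INVALID: 2,
    Status.UNCERTAIN: 1,
    Status.UNVERIFIED: 0,
}


def should_import(old_status, new_status):
    """Return True if NW should be imported, False if it is skipped.

    old_status is None when the word is not in the current dictionary
    (case 1). Otherwise NW wins only when its level is strictly higher,
    so equal levels keep the existing word (cases 2 to 5).
    """
    if old_status is None:
        return True
    return LEVEL[new_status] > LEVEL[old_status]
```

For example, should_import(Status.UNCERTAIN, Status.CORRECT) is True,
while should_import(Status.INVALID, Status.CORRECT) is False.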

Existing words are only altered when the new word has a higher status
level. When both words have the same status level, there are 3
possible approaches:
      * keep the existing word as is (my proposed approach, described
        above)
      * replace the existing word with the new one (I would not
        recommend this)
      * prompt user to make a selection (similar to what Andries
        proposed; I prefer the first approach, as it is less
        burdensome/tedious for the user)

Whenever we add a word that was not previously in the dict, we must
check that all graphemes are valid (and import any new ones).

We must also check the phoneme list of any new words to be added, to
ensure that all phonemes are valid. If there are invalid phonemes, we
should either:
      * skip that word 
      * import the phoneme (prompting for sound files, etc.) 

I'm not sure which approach we should follow. Thoughts?

The phoneme list also needs to be checked when existing words are
changed.
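
A minimal sketch of the phoneme check, assuming a word's pronunciation
is a list of phoneme symbols and the project keeps a set of known
phonemes. It implements the "skip that word" option but also reports
which phonemes were missing, so the other option (importing them)
stays open. Function and parameter names are hypothetical.

```python
def unknown_phonemes(phonemes, known_phonemes):
    """Return the phonemes not in known_phonemes, in order, no repeats."""
    missing = []
    for p in phonemes:
        if p not in known_phonemes and p not in missing:
            missing.append(p)
    return missing


def split_importable(words, known_phonemes):
    """Split candidate words into importable and skipped.

    words maps each word to its list of phonemes. Skipped words map to
    the phonemes that caused the skip, so the user could be prompted
    to import those phonemes instead.
    """
    importable, skipped = {}, {}
    for word, phonemes in words.items():
        missing = unknown_phonemes(phonemes, known_phonemes)
        if missing:
            skipped[word] = missing
        else:
            importable[word] = phonemes
    return importable, skipped
```

The same check would run both for newly added words and for existing
words whose pronunciation is replaced.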

m2c
--  
Thomas Fogwill <tfo...@us...>
