Re: [Dictionarymaker-devel] DM: import dictionary into project
From: Marelie D. <md...@cs...> - 2006-08-02 06:39:50
Hi there,

I preferred Andries's approach, with the addition that 3 only applies to
verified words (which I sort of assumed). In general, this process will be
used to combine two dictionaries that may come from very different sources -
it won't just be used to add a dictionary to a bootstrapped project. Also,
before adding a new dictionary, I would encourage a user to export the
current dictionary and rather create a new project, which means both source
dictionaries will continue to exist separately from the new combined version.

For example, after creating dictionary A through bootstrapping I find
dictionary B on the Internet, and would like to combine the two. A quick
script changes dict B into the correct format. Now I create a new project,
initialize it with dict A and select to import dictionary B (or vice versa!).
Either dictionary may have mistakes in it - even in entries marked valid.
Whenever they disagree, I would like to make a choice. If I'm more sure of
one dictionary than the other, I can also multiselect all words and replace
or ignore en masse.

My suggestion (from the user's perspective):

> 1. File browse. Pick dictionary file (verify that it is in a *.dict
>    format, whatever it is called)
> 2. Add all words that do not already exist in dictionary (check for
>    missing graphemes)
> 3. Show list of conflict words+phonemes+status,

where conflict words are only those that have been verified on both sides
and whose verdicts conflict.
   - If both are unverified, then just add the word (the pronunciation will
     be created through prediction based on the rule set, as usual).
   - If only one is verified, import that verdict, whatever it is. Also
     import the pronunciation if verdict==correct (check for missing
     phonemes).
   - Only for words that have conflicting verdicts, continue to 4.

> 4. single/multi select word+phoneme+status and select Replace or
>    ignore

Cheers
Marelie

>>> Thomas Fogwill <tfo...@us...> 08/01/06 3:23 PM >>>

> >>> avrensbu <avr...@cs...> 08/01/06 11:06 AM >>>
> I want to confirm functionality for importing dictionary into
> project. My proposal follows. Please comment.

> 3. Show list of conflict words+phonemes+status
> 4. single/multi select word+phoneme+status and select Replace or
>    ignore

This seems like a lot of work for the user. I think the system should do
most of this work (the whole "sane default system behaviour" philosophy).

The way I see it, we always want to preserve (in our current dictionary) any
information that was explicitly provided by the user. This is important
because the dictionary file being imported is not altered, but the file for
the current dictionary is. If we overwrite any information, it is thus lost.

Throughout the rest of this email, OW refers to the Old_Word (i.e. the word
in our current dictionary), and NW refers to the New_Word (i.e. the word in
the dictionary being imported).

These are the cases to consider:

1. If OW == null (i.e. this word is not in our dictionary), we import NW
   verbatim (checking graphemes and phonemes - see below)
2. If OW != null (i.e. exists in current dict) && OW.status == CORRECT then
   skip NW completely (user marked this pronunciation as VALID, so we
   shouldn't mess with it)
3. OW != null && OW.status == UNVERIFIED: if NW.status == UNVERIFIED, skip
   NW, otherwise import NW verbatim (checking phonemes - see below). In this
   case, the user provided no info for OW, and the only info we have for
   this word is (possibly) a list of phonemes as predicted by the system. If
   NW has more info, we should use that info.
4. OW != null && OW.status == INVALID or AMBIGUOUS: the user has specified
   that this word is invalid or ambiguous. We must not lose this info, so we
   skip NW.
5. OW != null && OW.status == UNCERTAIN: the user said they didn't know what
   the correct pronunciation for this word was. If NW.status is "stronger"
   (i.e. CORRECT, INVALID, or AMBIGUOUS) then we import NW verbatim,
   otherwise we skip it.

The above implies 3 levels for word statuses:

   High:   CORRECT, AMBIGUOUS, INVALID
   Medium: UNCERTAIN
   Lowest: UNVERIFIED

Existing words are only altered when the new word has a higher status level.
When both words have the same status level, there are 3 possible approaches:

* keep the existing word as is (my proposed approach, described above)
* replace the existing word with the new one (I would not recommend this)
* prompt user to make a selection (similar to what Andries proposed; I
  prefer the first approach, as it is less burdensome/tedious for the user)

Whenever we add a word that was not previously in the dict, we must check
that all graphemes are valid (and import any new ones).

We must also check the phoneme list of any new words to be added, to ensure
that all phonemes are valid. If there are invalid phonemes, we should either:

* skip that word
* import the phoneme (prompting for sound files, etc.)

I'm not sure which approach we should follow. Thoughts? The phoneme list
also needs to be checked when existing words are changed.

m2c

--
Thomas Fogwill <tfo...@us...>
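
For reference, Thomas's five cases collapse into one comparison once the three status levels are written down: import NW only when OW is missing or NW sits on a strictly higher level. The Python below is only a sketch of that reading, not DictionaryMaker code. The names `Status`, `Entry`, `LEVEL`, `should_import` and `merge` are all made up for illustration, and skipping words with unknown phonemes is just one of the two options the thread leaves open (the other being to import the phoneme). The grapheme check is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Set


class Status(Enum):
    UNVERIFIED = "unverified"
    UNCERTAIN = "uncertain"
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INVALID = "invalid"


# The three levels from the email: lowest (0), medium (1), high (2).
LEVEL = {
    Status.UNVERIFIED: 0,
    Status.UNCERTAIN: 1,
    Status.CORRECT: 2,
    Status.AMBIGUOUS: 2,
    Status.INVALID: 2,
}


@dataclass
class Entry:
    """Hypothetical dictionary entry: a word, its phonemes and its status."""
    word: str
    phonemes: List[str]
    status: Status


def should_import(old: Optional[Entry], new: Entry) -> bool:
    """True if NW should replace OW under the 'keep existing on a tie' rule."""
    if old is None:
        # Case 1: the word is not in the current dictionary.
        return True
    # Cases 2-5: only a strictly higher status level overrides the old word.
    return LEVEL[new.status] > LEVEL[old.status]


def merge(current: Dict[str, Entry], incoming: List[Entry],
          known_phonemes: Set[str]) -> List[str]:
    """Merge incoming entries into current (in place).

    Returns the words that qualified for import but were skipped because
    they use phonemes outside known_phonemes (an assumed policy).
    """
    skipped = []
    for new in incoming:
        if not should_import(current.get(new.word), new):
            continue
        if not set(new.phonemes) <= known_phonemes:
            skipped.append(new.word)
            continue
        current[new.word] = new
    return skipped
```

Swapping `>` for `>=` in `should_import` would give the "replace the existing word" tie-break, and collecting the ties into a list instead of dropping them would give the prompt-the-user variant that Andries and Marelie prefer.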