From: Silas S. B. <ss...@ca...> - 2009-01-20 10:23:56
Added a Native 5 to the data below.

If we treat "wǒ xiǎng" and "gěi nǐ" as single 2-syllable words (as per Native 4's previous suggestion), and then use the existing selection algorithm but re-start it after every 2-syllable word (i.e. treat the end of a 2-syllable word as a phrase break for the purposes of tone selection), then we'll have acceptable renditions of everything, except case 7 (because 可有 isn't really a word, but for this case I wonder if the natives' responses would have been different if I hadn't hyphenated it - that hyphen was my mistake, but the natives might have thought I'd got it from some book and let it override their intuition).

A variant would be to do the above but additionally check for the case of a 2-syllable 2-third-tones word followed by a single 3rd tone, and if so then set both syllables of the 2-syllable word to tone 2. This catches the case described in Wikipedia's "Tone sandhi rules at a glance", and also some of the "alternative" versions that some of the natives chose below. However, I'm not 100% convinced that this more complex variant is worth coding, given that the simpler version above is also acceptable (and is probably so in more cases).

We will need to know about all 2-syllable words that end in a 3rd tone, even the ones that use their component characters' default pronunciations. There are over 6000 of these in the version of CEDICT on my hard disk. If we're getting hanzi input then we could just hack this into zh_listx (along with the special extra cases "wǒ xiǎng" and "gěi nǐ"), and assume the resulting pinyin is appropriately word-spaced for our purposes. But if we're getting pinyin input then things get more difficult - sometimes pinyin is written with a space after EVERY syllable, and other times two or more words are strung together into one.
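To make the simpler scheme concrete (re-starting the selection after every 2-syllable word), here is a rough Python sketch. It is not eSpeak code. The base rule in `sandhi_block` is only an assumed stand-in for the existing selection algorithm: it scans right to left and changes a tone 3 to tone 2 when the syllable after it is still tone 3. The function names and the input format (each word as a list of numeric-tone pinyin syllables, already word-spaced) are likewise made up for illustration, and the more complex variant is not covered.

```python
def tone_of(syllable):
    """Return the tone digit of e.g. 'xiang3'; no digit is taken as neutral (5)."""
    return int(syllable[-1]) if syllable[-1:].isdigit() else 5


def sandhi_block(syllables):
    """ASSUMED base rule, standing in for the existing selection algorithm:
    scan right to left, and change a tone 3 to tone 2 whenever the
    following syllable is (still) tone 3."""
    out = list(syllables)
    for i in range(len(out) - 2, -1, -1):
        if tone_of(out[i]) == 3 and tone_of(out[i + 1]) == 3:
            out[i] = out[i][:-1] + "2"
    return out


def apply_sandhi(words):
    """words: list of words, each a list of numeric-tone pinyin syllables.
    The end of every 2-syllable word is treated as a phrase break, so the
    base rule is re-started there."""
    result, block = [], []

    def flush():
        # Run the base rule over the current block, then regroup into words.
        flat = sandhi_block([s for w in block for s in w])
        pos = 0
        for w in block:
            result.append(flat[pos:pos + len(w)])
            pos += len(w)
        block.clear()

    for word in words:
        block.append(word)
        if len(word) == 2:  # 2-syllable word: close the block here
            flush()
    flush()  # whatever is left after the last 2-syllable word
    return result


# Case (1), with "wo3 xiang3" treated as a single 2-syllable word:
print(apply_sandhi([["wo3", "xiang3"], ["qing3"], ["nin2"]]))
# prints [['wo2', 'xiang3'], ['qing3'], ['nin2']]
```

Under these assumptions, case (1) comes out as wó xiǎng qǐng nín rather than wǒ xiáng qǐng nín, which is what the same base rule would give without the re-start.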
OK, there is some reduction in the word count when you're using only pinyin, due to the fact that there's more than one way to write some pinyin words as hanzi, but we're still looking at over 5600 pinyin compounds (2800 if we count just the ones whose first syllable is tone 3 or tone 4, and I'm not sure that this is correct). So I guess either eSpeak is going to need a special data file for this, or we'll need to say that pinyin input should be spaced right if you want it to do correct 3rd-tone-sandhi blocking (and we'd still need to look out for "wo3 xiang3" and "gei3 ni3", and their equivalents like "wǒ xiǎng" and "gěi nǐ", because we can't rely on the user to remember NOT to space these).

What do people think? I could easily change my zh_listx-generating script to make sure the relevant compounds come out with appropriate spacing, but I might need some help getting eSpeak to use it.

Thanks.

Silas

(1) 我想请您 wǒ xiǎng qǐng nín
    Natives 1 and 2: wó xiáng qǐng nín
    Natives 3, 4, 5 and SinoVoice: wó xiǎng qǐng nín

(2) 美语补习班 Měiyǔ bǔxíbān
    Natives 1, 3, 4, 5 and SinoVoice: Méiyǔ bǔxíbān (4 says Měiyú bǔxíbān is also acceptable)
    Native 2: Méiyú bǔxíbān

(3) 可以讨论 kěyǐ tǎolùn
    Natives 1, 3, 4, 5 and SinoVoice: kéyǐ tǎolùn

(4) 次力量给你保持忍耐 ..gěi nǐ bǎochí
    Natives 1, 3, 4, 5 and SinoVoice: ..géi nǐ bǎochí

(5) 教导你使你得益处 jiàodǎo nǐ shǐ nǐ dé yìchu
    Natives 1, 3, 4, 5 and SinoVoice (and previous group): jiàodǎo nǐ shí nǐ dé yìchu

(6) 只有少数人 zhǐyǒu shǎoshùrén
    Natives 1, 3, 4, 5 and SinoVoice: zhíyǒu shǎoshùrén

(7) 可有可无 kěyǒu-kěwú
    Native 1 and SinoVoice: kéyóu-kěwú
    Natives 3, 4 and 5: kéyǒu-kěwú

(8) 至少有两个 zhìshǎo yǒu liǎng ge
    Natives 1, 3, 4, 5 and SinoVoice: zhìshǎo yóu liǎng ge

(9) 令人难以理解的 lìngrén nányǐ lǐjiě de
    Natives 1, 4 and SinoVoice: lìngrén nányǐ líjiě de
    Natives 3 and 5: lìngrén nányí líjiě de

(10) 可以怎样效法 kěyǐ zěnyàng
    Native 1: kéyí zěnyàng
    Natives 3, 4, 5 and SinoVoice: kéyǐ zěnyàng

(11) 什么方法可以改善 shénme fāngfǎ kěyǐ gǎishàn
    Natives 1, 3, 4 and 5 (and SinoVoice although it seems to fault on the "me"): shénme fāngfǎ kéyǐ gǎishàn
    Native 2 and previous group: shénme fāngfǎ kéyí gǎishàn

(12) 可以改善 kěyǐ gǎishàn
    Native 1: kéyí gǎishàn
    Natives 3, 4 and SinoVoice: kéyǐ gǎishàn

--
Silas S Brown http://people.pwf.cam.ac.uk/ssb22