Re: [Palmkit-users-jp] language model
From: Akinori I. <ai...@ma...> - 2006-11-27 11:52:11
Hello,

> I think I will be able to tag the words successfully following your
> advice. Now, after completion of tagging, is it possible to calculate
> the probabilities of my proposed LM using Palmkit? If so, can you tell
> me about the process (or commands, in particular) of doing it?

Are you OK creating an ordinary bigram/trigram model from a corpus?
I assume you are already familiar with basic use of the palmkit commands.

You have to create two kinds of LMs: an ordinary trigram (for function
words) and a content-word trigram (for content words). To create the
content-word trigram, you have to create a corpus without function words.
For example, if your corpus contains the following sentence
("Today's meal is ramen"):

  今日+名詞 の+助詞 食事+名詞 は+助詞 ラーメン+名詞 だ+助動詞

then what you have to create is a corpus that looks like this:

  今日+名詞 食事+名詞 ラーメン+名詞

Creating a trigram from the content-word-only corpus gives you a
content-word LM P_C(c|a,b). Let P(c|a,b) be the ordinary trigram
estimated from the entire corpus. Then you predict the probability of a
function word w_f by

  P(w_f|a,b)

and that of a content word w_c by

  P_C(w_c|a',b') * (1 - Σ_{w∈F} P(w|a,b))

where F is the set of function words, <a,b> is the two-word context
before w_c, and <a',b'> is the two-content-word context before w_c.
Since P_C(w|a',b') sums to one over all content words, you have to
multiply by the probability that the next word is a content word
(that is, 1 - Σ_{w∈F} P(w|a,b)).

-----Original Message-----
From: pal...@li... [mailto:pal...@li...] On Behalf Of Khan Sakeb
Sent: Monday, November 27, 2006 6:29 PM
To: Akinori Ito
Cc: pal...@li...
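[Note by a list reader: the function-word stripping step above can be sketched roughly as below. `CONTENT_POS` is a hypothetical stand-in set; in practice it would hold the top-level content-word POS tags from the class lists quoted later in this thread, and the corpus is assumed to be one sentence per line with `word+POS` tokens, as in the example sentence.]

```python
# Sketch: derive a content-word-only corpus from a word+POS tagged corpus.
# CONTENT_POS is a stand-in; fill it with the real content-word classes
# (nouns, verbs, adjectives, ...) chosen for your LM.
CONTENT_POS = {"名詞", "動詞", "形容詞", "副詞"}

def strip_function_words(line):
    """Keep only the word+POS tokens whose top-level POS is a content class."""
    kept = []
    for token in line.split():
        word, _, pos = token.partition("+")
        top = pos.split("-")[0]  # compare on the top-level POS category only
        if top in CONTENT_POS:
            kept.append(token)
    return " ".join(kept)

sentence = "今日+名詞 の+助詞 食事+名詞 は+助詞 ラーメン+名詞 だ+助動詞"
print(strip_function_words(sentence))  # 今日+名詞 食事+名詞 ラーメン+名詞
```

Running this over every line of the tagged corpus produces the input for the content-word trigram; the original corpus trains the ordinary trigram.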
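[Note by a list reader: numerically, the two-model combination works as in the toy sketch below. `P` and `P_C` are stand-ins for lookups into the two trigrams (in practice, the models Palmkit estimates); all probability values here are made up purely for illustration.]

```python
# Toy sketch of the two-LM prediction (Isotani & Matsunaga style).
# P: ordinary trigram over all words; P_C: trigram over content words only.
# Both tables below are fabricated toy values, not real model output.

FUNCTION_WORDS = {"の", "は", "だ"}  # F: the function-word set

def P(w, a, b):
    """Ordinary trigram P(w | a, b) -- toy values for illustration."""
    table = {("食事", "は"): {"の": 0.2, "は": 0.1, "ラーメン": 0.4, "だ": 0.1}}
    return table.get((a, b), {}).get(w, 0.05)

def P_C(w, a2, b2):
    """Content-word trigram P_C(w | a', b') -- toy values."""
    table = {("今日", "食事"): {"ラーメン": 0.6}}
    return table.get((a2, b2), {}).get(w, 0.01)

def predict(w, context, content_context):
    a, b = context            # last two words of any kind
    if w in FUNCTION_WORDS:   # function word: ordinary trigram directly
        return P(w, a, b)
    # Content word: content trigram scaled by the probability that the
    # next word is a content word, i.e. 1 - Σ_{f∈F} P(f | a, b).
    mass_f = sum(P(f, a, b) for f in FUNCTION_WORDS)
    a2, b2 = content_context  # last two *content* words
    return P_C(w, a2, b2) * (1.0 - mass_f)

# Probability of ラーメン after "... 食事 は", with content context 今日 食事:
# mass_f = 0.2 + 0.1 + 0.1 = 0.4, so p = 0.6 * (1 - 0.4) = 0.36
p = predict("ラーメン", ("食事", "は"), ("今日", "食事"))
```

Note how the function-word mass is subtracted before the content-word model is applied, so the two branches together still form a proper distribution.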
Subject: Re: [Palmkit-users-jp] language model

Dear Sir,

Thank you very much for your kind response and detailed explanation. In
fact, I don't have that much command of the Japanese language, and I hope
your advice will be an immense help for me in distinguishing between
content words and function words. I have also downloaded the paper you
referred to, and I am going to consult it carefully.

I think I will be able to tag the words successfully following your
advice. Now, after completion of tagging, is it possible to calculate the
probabilities of my proposed LM using Palmkit? If so, can you tell me
about the process (or commands, in particular) of doing it?

Sorry for bothering you so many times. But your advice has really helped
me a lot, and let me express my heartiest gratitude to you for spending
your valuable time on me. Thank you once again. I'm looking forward to
your reply.

With Regards

--- Akinori Ito <ai...@fw...> wrote:
> Hello,
>
> I'm sorry for the late reply.
>
> I read your Word file. The model you are going to make is exactly the
> same as the one proposed by Isotani and Matsunaga in 1994 [1]. If you
> haven't read their paper, you had better consult it.
>
> [1] R. Isotani and S. Matsunaga, "A Stochastic Language Model for
> Speech Recognition Integrating Local and Global Constraints," Proc.
> ICASSP 94, vol. II, pp. 5-8, 1994.
>
> Now, we have a couple of ways to distinguish content words from
> function words. If you are using ChaSen, the easiest way is to make
> lists of content and function words. You can get a list of all parts
> of speech with the "chasen -lp" command. Then we can split the POS
> tags into the following classes. (The treatment of the "others"
> classes depends on the purpose of the LM.)
>
> Content words:
>  1 名詞 (noun)
>  2 名詞-一般 (general)
>  3 名詞-固有名詞 (proper noun)
>  4 名詞-固有名詞-一般
>  5 名詞-固有名詞-人名 (person name)
>  6 名詞-固有名詞-人名-一般
>  7 名詞-固有名詞-人名-姓 (surname)
>  8 名詞-固有名詞-人名-名 (given name)
>  9 名詞-固有名詞-組織 (organization)
> 10 名詞-固有名詞-地域 (region)
> 11 名詞-固有名詞-地域-一般
> 12 名詞-固有名詞-地域-国 (country)
> 13 名詞-代名詞 (pronoun)
> 14 名詞-代名詞-一般
> 15 名詞-代名詞-縮約 (contraction)
> 16 名詞-副詞可能 (adverbial noun)
> 17 名詞-サ変接続 (suru-verb stem)
> 18 名詞-形容動詞語幹 (adjectival-noun stem)
> 19 名詞-数 (numeral)
> 40 名詞-ナイ形容詞語幹 (nai-adjective stem)
> 46 動詞 (verb)
> 47 動詞-自立 (independent verb)
> 50 形容詞 (adjective)
> 51 形容詞-自立 (independent adjective)
> 54 副詞 (adverb)
> 55 副詞-一般
> 56 副詞-助詞類接続 (particle-like connection)
> 57 連体詞 (adnominal)
> 58 接続詞 (conjunction)
> 75 感動詞 (interjection)
> 81 記号-アルファベット (symbol: alphabet)
>
> Function words:
> 20 名詞-非自立 (dependent noun)
> 21 名詞-非自立-一般
> 22 名詞-非自立-副詞可能
> 23 名詞-非自立-助動詞語幹 (auxiliary-verb stem)
> 24 名詞-非自立-形容動詞語幹
> 25 名詞-特殊 (special noun)
> 26 名詞-特殊-助動詞語幹
> 27 名詞-接尾 (noun suffix)
> 28 名詞-接尾-一般
> 29 名詞-接尾-人名
> 30 名詞-接尾-地域
> 31 名詞-接尾-サ変接続
> 32 名詞-接尾-助動詞語幹
> 33 名詞-接尾-形容動詞語幹
> 34 名詞-接尾-副詞可能
> 35 名詞-接尾-助数詞 (counter)
> 36 名詞-接尾-特殊
> 37 名詞-接続詞的 (conjunctive noun)
> 38 名詞-動詞非自立的 (verb-like dependent noun)
> 41 接頭詞 (prefix)
> 42 接頭詞-名詞接続
> 43 接頭詞-動詞接続
> 44 接頭詞-形容詞接続
> 45 接頭詞-数接続
> 48 動詞-非自立 (dependent verb)
> 49 動詞-接尾 (verb suffix)
> 52 形容詞-非自立 (dependent adjective)
> 53 形容詞-接尾 (adjective suffix)
> 59 助詞 (particle)
> 60 助詞-格助詞 (case particle)
> 61 助詞-格助詞-一般
> 62 助詞-格助詞-引用 (quotation)
> 63 助詞-格助詞-連語 (compound)
> 64 助詞-接続助詞 (conjunctive particle)
> 65 助詞-係助詞 (binding particle)
> 66 助詞-副助詞 (adverbial particle)
> 67 助詞-間投助詞 (interjectory particle)
> 68 助詞-並立助詞 (coordinating particle)
> 69 助詞-終助詞 (sentence-final particle)
> 70 助詞-副助詞／並立助詞／終助詞
> 71 助詞-連体化 (adnominalization)
> 72 助詞-副詞化 (adverbialization)
> 73 助詞-特殊 (special particle)
> 74 助動詞 (auxiliary verb)
>
> Others (not a word):
> 84 その他 (other)
> 85 その他-間投 (interjection)
> 86 フィラー (filler)
> 87 非言語音 (non-linguistic sound)
> 88 語断片 (word fragment)
>
> Others (they have no speech form):
>  0 BOS/EOS
> 39 名詞-引用文字列 (quoted string)
> 76 記号 (symbol)
> 77 記号-一般
> 78 記号-句点 (period)
> 79 記号-読点 (comma)
> 80 記号-空白 (whitespace)
> 82 記号-括弧開 (open bracket)
> 83 記号-括弧閉 (close bracket)
>
> Khan Sakeb wrote:
> > Dear Sir,
> > Thank you very much for your kind and prompt response. Let me apologize at
> > first for my reply being late. I could successfully make a class list
> > using the "ctext2class" command. But can you please tell me which
> > order the words follow when they appear in the class list?
> >
> > Now, let me focus on my research topic. I'm attaching an MS Word file
> > along with this mail which describes the equations of my proposed
> > language models. I'm confused about which approach to take for
> > calculating the probabilities. At first, I just want to determine the
> > probability of the next word being 自立語 (a content word, Ci=1) or
> > 付属語 (a function word, Ci=0) in a trigram model. I mean
> > P(Ci=1 | Wi-2 Wi-1) or P(Ci=0 | Wi-2 Wi-1).
> >
> > You mentioned in your mail to use some kind of tagger to distinguish
> > between 自立語 and 付属語. Right now, I'm using ChaSen for
> > morphological analysis. Can you please give me an idea about using
> > ChaSen effectively to distinguish between 自立語 and 付属語? I think
> > then I can use palmkit to generate more specific class lists.
> >
> > Thank you very much once again. I will be highly grateful if you
> > kindly reply to my mail at your convenient time.
> >
> > With Regards
> > upal1660
> >
> > _______________________________________________
> > Palmkit-users-jp mailing list Pal...@li...
> > https://lists.sourceforge.net/lists/listinfo/palmkit-users-jp
>
> --
> 伊藤 彰則 東北大学 大学院工学研究科
> Akinori Ito, Assoc. Prof.
> Graduate School of Engineering, Tohoku Univ.
> TEL: 022-795-7084 E-mail: ai...@fw...