Re: [Palmkit-users-jp] language model
From: Khan S. <upa...@ya...> - 2006-11-28 05:15:27
Dear Sir,

Sorry for bothering you again. I have a few more questions to ask. As I mentioned before, I'm quite new to this field, so the questions may seem funny to you.

> Then you can predict a probability of a function word w_f by
>
>   P(w_f|a,b)
>
> and that of a content word w_c by
>
>   P_C(w_c|a',b')(1-Σ_{w in F} P(w|a,b))
>
> where F is a set of function words. <a,b> is the two-word context
> before w_c and <a',b'> is the two-content-word context before w_c.
> As P_C(w|a,b) becomes one by summing up over all content words, you
> have to multiply by the probability that the next word is a content
> word (that is, 1-Σ_{w in F} P(w|a,b)).

--- Akinori Ito <ai...@ma...> wrote:

> Hello,
>
> > I think I will be able to tag the words successfully
> > following your advice. Now, after completion of tagging, is
> > it possible to calculate the probabilities of my proposed
> > LM using Palmkit? If so, can you tell me about the
> > process (or commands, in particular) of doing it?
>
> Are you OK creating an ordinary bigram/trigram model from a corpus?
> I assume you are already familiar with basic use of the palmkit
> commands.
>
> You have to create two kinds of LMs: an ordinary trigram (for function
> words) and a content-word trigram (for content words). To create a
> content-word trigram, you have to create a corpus without function
> words. For example, if you have a corpus consisting of the following
> sentence
>
>   今日+名詞 の+助詞 食事+名詞 は+助詞 ラーメン+名詞 だ+助動詞
>
> then what you have to create is a corpus that looks like this:
>
>   今日+名詞 食事+名詞 ラーメン+名詞
>
> By creating a trigram from the content-word-only corpus, you can create
> a content-word LM P_C(c|a,b). Suppose the ordinary trigram from the
> entire corpus is P(c|a,b). Then you can predict a probability of a
> function word w_f by
>
>   P(w_f|a,b)
>
> and that of a content word w_c by
>
>   P_C(w_c|a',b')(1-Σ_{w in F} P(w|a,b))
>
> where F is a set of function words.
> <a,b> is the two-word context before w_c and <a',b'> is the
> two-content-word context before w_c. As P_C(w|a,b) becomes one by
> summing up over all content words, you have to multiply by the
> probability that the next word is a content word
> (that is, 1-Σ_{w in F} P(w|a,b)).
>
> -----Original Message-----
> From: pal...@li... [mailto:pal...@li...] On Behalf Of Khan Sakeb
> Sent: Monday, November 27, 2006 6:29 PM
> To: Akinori Ito
> Cc: pal...@li...
> Subject: Re: [Palmkit-users-jp] language model
>
> Dear Sir,
> Thank you very much for your kind response and detailed explanation.
> In fact, I don't have that much command of the Japanese language, and I
> hope your advice will be an immense help for me to distinguish between
> content words and function words. I have also downloaded the paper
> you referred to and I'm going to consult it carefully.
>
> I think I will be able to tag the words successfully following your
> advice. Now, after completion of tagging, is it possible to calculate
> the probabilities of my proposed LM using Palmkit? If so, can you tell
> me about the process (or commands, in particular) of doing it?
>
> Sorry for bothering you so many times. But your advice has really
> helped me a lot, and let me express my heartiest gratitude to you for
> spending your valuable time on me. Thank you once again.
>
> I'm looking forward to your reply.
>
> With Regards
>
> --- Akinori Ito <ai...@fw...> wrote:
>
> > Hello,
> >
> > I'm sorry for the late reply.
> >
> > I read your Word file. The model you are going to make is exactly the
> > same one that was proposed by Isotani and Matsunaga in 1994 [1]. If
> > you haven't read their paper, you had better consult it.
> >
> > [1] R. Isotani and S. Matsunaga, "A Stochastic Language Model for
> > Speech Recognition Integrating Local and Global Constraints,"
> > Proc. ICASSP94, vol. II, pp. 5-8, 1994.
> >
> > Now, we have a couple of ways to distinguish content words and
> > function words. If you are using Chasen, the easiest way is to make
> > lists of content and function words. You can get a list of all parts
> > of speech with the "chasen -lp" command. Then we can split the POS
> > into the following classes.
> > (The treatment of "others" classes depends on the purpose of the LM.)
> >
> > Content words:
> > 1 名詞
> > 2 名詞-一般
> > 3 名詞-固有名詞
> > 4 名詞-固有名詞-一般
> > 5 名詞-固有名詞-人名
> > 6 名詞-固有名詞-人名-一般
> > 7 名詞-固有名詞-人名-姓
> > 8 名詞-固有名詞-人名-名
> > 9 名詞-固有名詞-組織
> > 10 名詞-固有名詞-地域
> > 11 名詞-固有名詞-地域-一般
> > 12 名詞-固有名詞-地域-国
> > 13 名詞-代名詞
> > 14 名詞-代名詞-一般
> > 15 名詞-代名詞-縮約
> > 16 名詞-副詞可能
> > 17 名詞-サ変接続
> > 18 名詞-形容動詞語幹
> > 19 名詞-数
> > 40 名詞-ナイ形容詞語幹
> > 46 動詞
> > 47 動詞-自立
> > 50 形容詞
> > 51 形容詞-自立
> > 54 副詞
> > 55 副詞-一般
> > 56 副詞-助詞類接続
> > 57 連体詞
> > 58 接続詞
> > 75 感動詞
> > 81 記号-アルファベット
> >
> > Function words:
> > 20 名詞-非自立
> > 21 名詞-非自立-一般
> > 22 名詞-非自立-副詞可能
> > 23 名詞-非自立-助動詞語幹
> > 24 名詞-非自立-形容動詞語幹
> > 25 名詞-特殊
> > 26 名詞-特殊-助動詞語幹
> > 27 名詞-接尾
> > 28 名詞-接尾-一般
> > 29 名詞-接尾-人名
> > 30 名詞-接尾-地域
> > 31 名詞-接尾-サ変接続
> > 32 名詞-接尾-助動詞語幹
> > 33 名詞-接尾-形容動詞語幹
> > 34 名詞-接尾-副詞可能
> > 35 名詞-接尾-助数詞
> > 36 名詞-接尾-特殊
> > 37 名詞-接続詞的
> > 38 名詞-動詞非自立的
> > 41 接頭詞
> > 42 接頭詞-名詞接続
> > 43 接頭詞-動詞接続
> > 44 接頭詞-形容詞接続
> > 45 接頭詞-数接続
> > 48 動詞-非自立
> > 49 動詞-接尾
> > 52 形容詞-非自立
> > 53 形容詞-接尾
> > 59 助詞
> > 60 助詞-格助詞
> > 61 助詞-格助詞-一般
> > 62 助詞-格助詞-引用
>
=== The rest of the message has been omitted ===
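
The corpus filtering and the combined probability described in the thread can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not Palmkit code: `CONTENT_POS` is a hypothetical excerpt of the content-word classes, the `word+POS` token format is taken from the example sentence, and `p_all` and `p_content` are assumed callables standing in for lookups into the ordinary trigram P and the content-word trigram P_C. Padding short histories with `<s>` is also an assumption of this sketch.

```python
from typing import Callable, Iterable, Sequence, Set, Tuple

# Hypothetical excerpt of the content-word POS classes listed in the thread.
CONTENT_POS: Set[str] = {
    "名詞", "名詞-一般", "動詞", "動詞-自立", "形容詞", "形容詞-自立", "副詞",
}

# Assumed model interface: (w, a, b) -> P(w | a, b).
Trigram = Callable[[str, str, str], float]


def pos_of(token: str) -> str:
    """Return the POS part of a 'word+POS' token ('' if there is no tag)."""
    parts = token.rsplit("+", 1)
    return parts[1] if len(parts) == 2 else ""


def content_only(line: str, content_pos: Set[str] = CONTENT_POS) -> str:
    """Keep only the tokens whose POS tag is in content_pos."""
    return " ".join(t for t in line.split() if pos_of(t) in content_pos)


def last_two(tokens: Iterable[str], pad: str = "<s>") -> Tuple[str, str]:
    """Return the last two tokens, padding on the left when fewer exist."""
    padded = [pad, pad] + list(tokens)
    return padded[-2], padded[-1]


def combined_prob(
    word: str,
    history: Sequence[str],
    p_all: Trigram,
    p_content: Trigram,
    function_words: Set[str],
) -> float:
    """Probability of `word` given `history` under the two-model scheme."""
    a, b = last_two(history)
    if word in function_words:
        # Function word: use the ordinary trigram P(w_f | a, b).
        return p_all(word, a, b)
    # Content word: the context <a', b'> is the last two content words.
    a_c, b_c = last_two(t for t in history if t not in function_words)
    # Probability that the next word is a content word:
    # 1 - sum over function words w of P(w | a, b).
    p_next_is_content = 1.0 - sum(p_all(f, a, b) for f in function_words)
    return p_content(word, a_c, b_c) * p_next_is_content
```

Applied to the example sentence, `content_only` keeps `今日+名詞 食事+名詞 ラーメン+名詞`. Multiplying P_C by the probability that the next word is a content word is what makes the function-word and content-word probabilities sum to one for a given history.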