Re: [Palmkit-users-jp] language model
From: Akinori I. <ai...@ma...> - 2006-11-27 11:52:11
Hello,

> I think I will be able to tag the words successfully following your
> advice. Now, after completion of tagging, is it possible to calculate
> the probabilities of my proposed LM using Palmkit? If so, can you tell
> me about the process (or commands, in particular) of doing it?

Are you OK creating an ordinary bigram/trigram model from a corpus?
I assume you are already familiar with basic use of the palmkit commands.

You have to create two kinds of LMs: an ordinary trigram (for function
words) and a content-word trigram (for content words). To create the
content-word trigram, you have to create a corpus without function words.
For example, if your corpus contains the following sentence
("Today's meal is ramen"):

  今日+名詞 の+助詞 食事+名詞 は+助詞 ラーメン+名詞 だ+助動詞

then what you have to create is a corpus that looks like this:

  今日+名詞 食事+名詞 ラーメン+名詞

Creating a trigram from the content-word-only corpus gives you a
content-word LM P_C(c|a,b). Let P(c|a,b) be the ordinary trigram
estimated from the entire corpus. Then you predict the probability of a
function word w_f by

  P(w_f|a,b)

and that of a content word w_c by

  P_C(w_c|a',b') * (1 - Σ_{w∈F} P(w|a,b))

where F is the set of function words, <a,b> is the two-word context
before w_c, and <a',b'> is the two-content-word context before w_c.
Since P_C(w|a',b') sums to one over all content words, you have to
multiply by the probability that the next word is a content word
(that is, 1 - Σ_{w∈F} P(w|a,b)).

-----Original Message-----
From: pal...@li... [mailto:pal...@li...] On Behalf Of Khan Sakeb
Sent: Monday, November 27, 2006 6:29 PM
To: Akinori Ito
Cc: pal...@li...
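[Note by a list reader: the function-word stripping step above can be sketched roughly as below. `CONTENT_POS` is a hypothetical stand-in set; in practice it would hold the top-level content-word POS tags from the class lists quoted later in this thread, and the corpus is assumed to be one sentence per line with `word+POS` tokens, as in the example sentence.]

```python
# Sketch: derive a content-word-only corpus from a word+POS tagged corpus.
# CONTENT_POS is a stand-in; fill it with the real content-word classes
# (nouns, verbs, adjectives, ...) chosen for your LM.
CONTENT_POS = {"名詞", "動詞", "形容詞", "副詞"}

def strip_function_words(line):
    """Keep only the word+POS tokens whose top-level POS is a content class."""
    kept = []
    for token in line.split():
        word, _, pos = token.partition("+")
        top = pos.split("-")[0]  # compare on the top-level POS category only
        if top in CONTENT_POS:
            kept.append(token)
    return " ".join(kept)

sentence = "今日+名詞 の+助詞 食事+名詞 は+助詞 ラーメン+名詞 だ+助動詞"
print(strip_function_words(sentence))  # 今日+名詞 食事+名詞 ラーメン+名詞
```

Running this over every line of the tagged corpus produces the input for the content-word trigram; the original corpus trains the ordinary trigram.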
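[Note by a list reader: numerically, the two-model combination works as in the toy sketch below. `P` and `P_C` are stand-ins for lookups into the two trigrams (in practice, the models Palmkit estimates); all probability values here are made up purely for illustration.]

```python
# Toy sketch of the two-LM prediction (Isotani & Matsunaga style).
# P: ordinary trigram over all words; P_C: trigram over content words only.
# Both tables below are fabricated toy values, not real model output.

FUNCTION_WORDS = {"の", "は", "だ"}  # F: the function-word set

def P(w, a, b):
    """Ordinary trigram P(w | a, b) -- toy values for illustration."""
    table = {("食事", "は"): {"の": 0.2, "は": 0.1, "ラーメン": 0.4, "だ": 0.1}}
    return table.get((a, b), {}).get(w, 0.05)

def P_C(w, a2, b2):
    """Content-word trigram P_C(w | a', b') -- toy values."""
    table = {("今日", "食事"): {"ラーメン": 0.6}}
    return table.get((a2, b2), {}).get(w, 0.01)

def predict(w, context, content_context):
    a, b = context            # last two words of any kind
    if w in FUNCTION_WORDS:   # function word: ordinary trigram directly
        return P(w, a, b)
    # Content word: content trigram scaled by the probability that the
    # next word is a content word, i.e. 1 - Σ_{f∈F} P(f | a, b).
    mass_f = sum(P(f, a, b) for f in FUNCTION_WORDS)
    a2, b2 = content_context  # last two *content* words
    return P_C(w, a2, b2) * (1.0 - mass_f)

# Probability of ラーメン after "... 食事 は", with content context 今日 食事:
# mass_f = 0.2 + 0.1 + 0.1 = 0.4, so p = 0.6 * (1 - 0.4) = 0.36
p = predict("ラーメン", ("食事", "は"), ("今日", "食事"))
```

Note how the function-word mass is subtracted before the content-word model is applied, so the two branches together still form a proper distribution.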
Subject: Re: [Palmkit-users-jp] language model

Dear Sir,

Thank you very much for your kind response and detailed explanation. In
fact, I don't have that much command of the Japanese language, and I hope
your advice will be an immense help for me in distinguishing between
content words and function words. I have also downloaded the paper you
referred to, and I am going to consult it carefully.

I think I will be able to tag the words successfully following your
advice. Now, after completion of tagging, is it possible to calculate the
probabilities of my proposed LM using Palmkit? If so, can you tell me
about the process (or commands, in particular) of doing it?

Sorry for bothering you so many times. But your advice has really helped
me a lot, and let me express my heartiest gratitude to you for spending
your valuable time on me. Thank you once again. I'm looking forward to
your reply.

With Regards

--- Akinori Ito <ai...@fw...> wrote:
> Hello,
>
> I'm sorry for the late reply.
>
> I read your Word file. The model you are going to make is exactly the
> same as the one proposed by Isotani and Matsunaga in 1994 [1]. If you
> haven't read their paper, you had better consult it.
>
> [1] R. Isotani and S. Matsunaga, "A Stochastic Language Model for
> Speech Recognition Integrating Local and Global Constraints," Proc.
> ICASSP 94, vol. II, pp. 5-8, 1994.
>
> Now, we have a couple of ways to distinguish content words from
> function words. If you are using ChaSen, the easiest way is to make
> lists of content and function words. You can get a list of all parts
> of speech with the "chasen -lp" command. Then we can split the POS
> tags into the following classes. (The treatment of the "others"
> classes depends on the purpose of the LM.)
>
> Content words:
>  1 名詞 (noun)
>  2 名詞-一般 (general)
>  3 名詞-固有名詞 (proper noun)
>  4 名詞-固有名詞-一般
>  5 名詞-固有名詞-人名 (person name)
>  6 名詞-固有名詞-人名-一般
>  7 名詞-固有名詞-人名-姓 (surname)
>  8 名詞-固有名詞-人名-名 (given name)
>  9 名詞-固有名詞-組織 (organization)
> 10 名詞-固有名詞-地域 (region)
> 11 名詞-固有名詞-地域-一般
> 12 名詞-固有名詞-地域-国 (country)
> 13 名詞-代名詞 (pronoun)
> 14 名詞-代名詞-一般
> 15 名詞-代名詞-縮約 (contraction)
> 16 名詞-副詞可能 (adverbial noun)
> 17 名詞-サ変接続 (suru-verb stem)
> 18 名詞-形容動詞語幹 (adjectival-noun stem)
> 19 名詞-数 (numeral)
> 40 名詞-ナイ形容詞語幹 (nai-adjective stem)
> 46 動詞 (verb)
> 47 動詞-自立 (independent verb)
> 50 形容詞 (adjective)
> 51 形容詞-自立 (independent adjective)
> 54 副詞 (adverb)
> 55 副詞-一般
> 56 副詞-助詞類接続 (particle-like connection)
> 57 連体詞 (adnominal)
> 58 接続詞 (conjunction)
> 75 感動詞 (interjection)
> 81 記号-アルファベット (symbol: alphabet)
>
> Function words:
> 20 名詞-非自立 (dependent noun)
> 21 名詞-非自立-一般
> 22 名詞-非自立-副詞可能
> 23 名詞-非自立-助動詞語幹 (auxiliary-verb stem)
> 24 名詞-非自立-形容動詞語幹
> 25 名詞-特殊 (special noun)
> 26 名詞-特殊-助動詞語幹
> 27 名詞-接尾 (noun suffix)
> 28 名詞-接尾-一般
> 29 名詞-接尾-人名
> 30 名詞-接尾-地域
> 31 名詞-接尾-サ変接続
> 32 名詞-接尾-助動詞語幹
> 33 名詞-接尾-形容動詞語幹
> 34 名詞-接尾-副詞可能
> 35 名詞-接尾-助数詞 (counter)
> 36 名詞-接尾-特殊
> 37 名詞-接続詞的 (conjunctive noun)
> 38 名詞-動詞非自立的 (verb-like dependent noun)
> 41 接頭詞 (prefix)
> 42 接頭詞-名詞接続
> 43 接頭詞-動詞接続
> 44 接頭詞-形容詞接続
> 45 接頭詞-数接続
> 48 動詞-非自立 (dependent verb)
> 49 動詞-接尾 (verb suffix)
> 52 形容詞-非自立 (dependent adjective)
> 53 形容詞-接尾 (adjective suffix)
> 59 助詞 (particle)
> 60 助詞-格助詞 (case particle)
> 61 助詞-格助詞-一般
> 62 助詞-格助詞-引用 (quotation)
> 63 助詞-格助詞-連語 (compound)
> 64 助詞-接続助詞 (conjunctive particle)
> 65 助詞-係助詞 (binding particle)
> 66 助詞-副助詞 (adverbial particle)
> 67 助詞-間投助詞 (interjectory particle)
> 68 助詞-並立助詞 (coordinating particle)
> 69 助詞-終助詞 (sentence-final particle)
> 70 助詞-副助詞／並立助詞／終助詞
> 71 助詞-連体化 (adnominalization)
> 72 助詞-副詞化 (adverbialization)
> 73 助詞-特殊 (special particle)
> 74 助動詞 (auxiliary verb)
>
> Others (not a word):
> 84 その他 (other)
> 85 その他-間投 (interjection)
> 86 フィラー (filler)
> 87 非言語音 (non-linguistic sound)
> 88 語断片 (word fragment)
>
> Others (they have no speech form):
>  0 BOS/EOS
> 39 名詞-引用文字列 (quoted string)
> 76 記号 (symbol)
> 77 記号-一般
> 78 記号-句点 (period)
> 79 記号-読点 (comma)
> 80 記号-空白 (whitespace)
> 82 記号-括弧開 (open bracket)
> 83 記号-括弧閉 (close bracket)
>
> Khan Sakeb wrote:
> > Dear Sir,
> > Thank you very much for your kind and prompt response. Let me apologize at
> > first for my reply being late. I could successfully make a class list
> > using the "ctext2class" command. But can you please tell me which
> > order the words follow when they appear in the class list?
> >
> > Now, let me focus on my research topic. I'm attaching an MS Word file
> > along with this mail which describes the equations of my proposed
> > language models. I'm confused about which approach to take for
> > calculating the probabilities. At first, I just want to determine the
> > probability of the next word being 自立語 (a content word, Ci=1) or
> > 付属語 (a function word, Ci=0) in a trigram model. I mean
> > P(Ci=1 | Wi-2 Wi-1) or P(Ci=0 | Wi-2 Wi-1).
> >
> > You mentioned in your mail to use some kind of tagger to distinguish
> > between 自立語 and 付属語. Right now, I'm using ChaSen for
> > morphological analysis. Can you please give me an idea about using
> > ChaSen effectively to distinguish between 自立語 and 付属語? I think
> > then I can use palmkit to generate more specific class lists.
> >
> > Thank you very much once again. I will be highly grateful if you
> > kindly reply to my mail at your convenient time.
> >
> > With Regards
> > upal1660
> >
> > _______________________________________________
> > Palmkit-users-jp mailing list Pal...@li...
> > https://lists.sourceforge.net/lists/listinfo/palmkit-users-jp
>
> --
> 伊藤 彰則 東北大学 大学院工学研究科
> Akinori Ito, Assoc. Prof.
> Graduate School of Engineering, Tohoku Univ.
> TEL: 022-795-7084 E-mail: ai...@fw...