From: Amine Z. <zak...@gm...> - 2007-12-12 23:22:10
Hi,
Does this mean that I have to stop developing the unification module
until we discuss all the points you talked about, or should I use the
current formalism and develop other versions as the formalism changes?
Anyway, there is no tagger for now that specifies chunks in any way.
Regards
Amine Zaki
06 78 97 63 95
On 12 Dec 2007 at 23:34, Thomas Lebarbé wrote:
> Hi all,
>
> (for those who have never heard of me, I'm Agnès' PhD supervisor)
> To be honest, I'm a bit bothered by the XML formalism chosen for
> LanguageTool.
> I'm not taking anything away from the value of Daniel's or Agnès'
> work, far from it.
> However, if we wish to tend towards an as-multilingual-as-possible
> tool, we have to take into account that categorization and
> subcategorization vary a lot.
> Having different elements for each kind of graphic word and for each
> kind of chunk is not the most generic approach to language.
> The only sure thing we know is that in most languages we have
> graphic words (not to be confused with the linguistic notion of
> word), that consecutive words can form a chunk and that consecutive
> chunks form a sentence. Or the other way round, a sentence splits
> into consecutive chunks, which themselves split into graphic words
> (usually called tokens in NLP).
>
> Therefore, if I were to advise some sort of DTD for a tagged corpus
> (and maybe it's time to think about it seriously before the
> developments go any further), it would look something like this (the
> tags are inspired by the ABU tagset):
>
> <!ELEMENT sentence (chk+)>
> <!ATTLIST sentence type (nominal|verbal) #REQUIRED>
> <!ELEMENT chk (tok+)>
> <!ATTLIST chk category (nominal|prepositional|verbal|coordination|
>     subordination|non_final_punctuation|final_punctuation) #REQUIRED>
> <!ATTLIST chk subcat (MasSin|MasPlur|FemSin|FemPlur|...) #REQUIRED>
> <!ELEMENT tok (#PCDATA)>
> <!ATTLIST tok category (Ver|Adj|Det|Pro|ProDem|...) #REQUIRED>
> <!ATTLIST tok subcat (MasSin|MasPlur|FemSin|FemPlur|...) #REQUIRED>
>
> Using such a formalism would make it possible to adapt the processes
> to a) different languages and b) different kinds of tagsets
> (depending on the tagger used).
>
> Hence, the sentence in Agnès' report would be tagged as follows:
> <sentence type="verbal">
>   <chk category="nominal" subcat="MasPlur">
>     <tok category="Det" subcat="InvPlur">Les</tok>
>     <tok category="Nom" subcat="MasPlur">enfants</tok>
>   </chk>
>   <chk category="subordination" subcat="subject">
>     <tok category="Pro" subcat="relative-subject">qui</tok>
>   </chk>
>   <chk category="verbal" subcat="IP3S">
>     <tok category="Ver" subcat="IP3S">échoue</tok>
>   </chk>
>   <chk category="prepositional" subcat="FemSin">
>     <tok category="prep" subcat="inv">à</tok>
> ...
> ...
>
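A minimal sketch of how a sentence/chk/tok document could be read, using Python's standard xml.etree.ElementTree. The SAMPLE string and the read_chunks helper are hypothetical: the sample keeps only the chunks that are complete in the example, takes chunk categories from the DTD's list, and adds the closing tag that the example elides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the complete chunks of the example are kept,
# and the closing </sentence> tag is added so the document is well-formed.
SAMPLE = """
<sentence type="verbal">
  <chk category="nominal" subcat="MasPlur">
    <tok category="Det" subcat="InvPlur">Les</tok>
    <tok category="Nom" subcat="MasPlur">enfants</tok>
  </chk>
  <chk category="subordination" subcat="subject">
    <tok category="Pro" subcat="relative-subject">qui</tok>
  </chk>
  <chk category="verbal" subcat="IP3S">
    <tok category="Ver" subcat="IP3S">échoue</tok>
  </chk>
</sentence>
"""


def read_chunks(xml_text):
    """Return one (category, subcat, tokens) triple per chk element.

    tokens is a list of (text, category, subcat) triples, in document order.
    """
    root = ET.fromstring(xml_text)
    chunks = []
    for chk in root.iter("chk"):
        toks = [
            (tok.text, tok.get("category"), tok.get("subcat"))
            for tok in chk.iter("tok")
        ]
        chunks.append((chk.get("category"), chk.get("subcat"), toks))
    return chunks


chunks = read_chunks(SAMPLE)
for category, subcat, toks in chunks:
    # Element names never change; only attribute values depend on the tagset.
    print(category, subcat, [text for text, _, _ in toks])
```

The reader only ever looks up the three element names; everything language-specific stays in attribute values, which is the point of the proposal.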
>
> Well, all this has to be discussed and debated. Such a change might
> be costly right now but will pay off in the long term, since
> developing checkers will simply mean adapting unification rules to
> the tagset (and therefore to values of attributes, not names of
> elements and attributes). Furthermore, using the sentence/chunk/token
> structure as I just did avoids mixing the different levels of
> abstraction (for example, the token "qui" is easier to manipulate as
> a chunk for linking groups together and working out inconsistencies,
> as relative pronouns are often composed of several tokens
> ("à qui", "pour lequel"...)).
>
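To illustrate what "adapting unification rules to the tagset" could look like, here is a rough sketch in which only a decoding table depends on the tagset. The FEATURES table, the reading of "Inv" as "gender left unspecified", and the unify / unify_subcats helpers are all assumptions made for illustration, not part of the proposal.

```python
# Hypothetical decoding table: maps subcat values to feature structures.
# "InvPlur" is read here as "plural, gender unspecified" (an assumption).
# Supporting another tagset would mean replacing this table only.
FEATURES = {
    "MasSin": {"gender": "m", "number": "s"},
    "MasPlur": {"gender": "m", "number": "p"},
    "FemSin": {"gender": "f", "number": "s"},
    "FemPlur": {"gender": "f", "number": "p"},
    "InvPlur": {"number": "p"},
}


def unify(a, b):
    """Merge two feature dicts; return None if a shared feature disagrees."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def unify_subcats(subcats):
    """Unify the decoded features of a sequence of subcat values."""
    result = {}
    for subcat in subcats:
        result = unify(result, FEATURES[subcat])
        if result is None:
            return None  # agreement failure somewhere in the sequence
    return result


# "Les enfants": determiner and noun agree.
print(unify_subcats(["InvPlur", "MasPlur"]))
# A plural determiner and adjective followed by a singular noun:
# the number features clash, so unification fails.
print(unify_subcats(["InvPlur", "FemPlur", "FemSin"]))
```

The unification logic itself never mentions a tag name, so a checker for another language would reuse it unchanged.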
> Have a good night,
>
> Thomas
>
> On 12 Dec 2007 at 22:40, Amine Zaki wrote:
>
>> Hi all
>> I'm developing the unification module using the new formalism
>> proposed by Agnès. For that, I need some examples of tagged
>> documents. As there is not yet any tagger that specifies chunks, I
>> first used the examples in Agnès' report, but I don't know whether
>> the final tagger will produce such XML files. Actually, in these
>> examples, it's a bit difficult to determine the line number, the
>> column number and other important information needed to locate
>> the error for easier correction.
>> Here is an example of Agnès' tagged test texts.
>> Original sentence: "Les enfants qui échoue à l'école ont des
>> grandes capacité linguistique"
>> <sentence>
>>   <SN genre="e" nombre="p">
>>     <D genre="e" nombre="p">Les</D>
>>     <N genre="e" nombre="p">enfants</N>
>>   </SN>
>>   <R type="rel" genre="e" nombre="p">qui</R>
>>   <SV type="rel">
>>     <V mode="ind" temps="pres" pers="3" nombre="s">échoue</V>
>>   </SV>
>>   <SP>
>>     <P>à</P>
>>     <D genre="e" nombre="s">l'</D>
>>     <N genre="f" nombre="s">école</N>
>>   </SP>
>>   <SV type="ppal">
>>     <V mode="ind" temps="pres" pers="3" nombre="p">ont</V>
>>   </SV>
>>   <SN>
>>     <D genre="e" nombre="p">des</D>
>>     <J genre="f" nombre="p">grandes</J>
>>     <N genre="f" nombre="s">capacité</N>
>>     <J genre="e" nombre="s">linguistique</J>
>>   </SN>
>> </sentence>
>>
>> Any further information about how to get the line number, column
>> number, etc., or about what the XML tagged file would look like,
>> would be helpful.
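One possible workaround while the tagged files carry no position information, sketched under a strong assumption: the tokens occur verbatim and in order in the original text, so each one can be found by a forward search. The locate_tokens helper is hypothetical, and the line break in the sample text is added only to show line counting.

```python
def locate_tokens(text, tokens):
    """Return (token, line, column) triples, both 1-based.

    Assumes every token appears verbatim in text, in the given order.
    Columns are counted in characters from the start of the line.
    """
    positions = []
    offset = 0
    for token in tokens:
        start = text.find(token, offset)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {offset}")
        line = text.count("\n", 0, start) + 1
        line_start = text.rfind("\n", 0, start) + 1  # 0 when on the first line
        positions.append((token, line, start - line_start + 1))
        offset = start + len(token)  # never match the same span twice
    return positions


# The line break is an assumption made for the example.
text = "Les enfants qui échoue\nà l'école"
tokens = ["Les", "enfants", "qui", "échoue", "à", "l'", "école"]
for token, line, column in locate_tokens(text, tokens):
    print(f"{token}: line {line}, column {column}")
```

This breaks as soon as the tagger normalizes tokens (case, apostrophes, contractions), in which case the tagger itself would have to emit the offsets.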
>> Regards.
>> Amine
>> Amine Zaki
>> 06 78 97 63 95
>>
>> -------------------------------------------------------------------------
>> SF.Net email is sponsored by:
>> Check out the new SourceForge.net Marketplace.
>> It's the best place to buy or sell services
>> for just about anything Open Source.
>> http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
>> _______________________________________________
>> Languagetool-devel mailing list
>> Lan...@li...
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>
> ---
> Since the IT resource centre of Université Stendhal - Grenoble 3
> cannot guarantee reliable delivery of email, please write to me at
> my Free or Gmail address. Thank you.
> ---
> Thomas LEBARBE
> * Lecturer in the Département d'Informatique Pédagogique
> * Researcher at the LIDILEM laboratory - Université Stendhal - Grenoble 3
> Email: tho...@fr..., tho...@gm...
> Web: http://www.u-grenoble3.fr/lebarbe
> Post: BP 25 - 38040 Grenoble Cx 9
> ---
> "Indeed, laziness is an obstacle in the struggle against itself,
> for it costs the lazy a great deal of effort to think about
> ceasing to be lazy."
> - Miguel Albero - Les Perdants Héroïques.