From: Agnes S. <as...@nu...> - 2007-10-29 10:58:13
On 25 Oct 07 at 16:39, r.j.baars wrote:

> Agnes,
>
> Is French a language that uses compounding? (I only know a very little
> bit of French, but I thought it doesn't.)
> If it doesn't, the number of words is probably rather limited (as
> opposed to compounding languages, where it is infinite).
> It is rather easy to get these mistakes and their threshold of use.
>
> Having a set of words, it is rather easy to program the most common
> mistakes. (Letter swap, key on (French) keyboard next to it...)
> Then, using Google or a different search engine, it is very easy to
> count the usage of that item.
>
> Another good source for getting words and mistakes is the Wikipedia
> dump.
>
> I understand this is not the path you would prefer; nevertheless, I
> wanted to contribute this option.

Well, it's not my favorite path, but it is an interesting path I had
never thought about ;-)

Agnès

> Ruud
>
> Marcin Miłkowski wrote:
>>
>> Hi Agnes,
>>
>> if you use some statistics, then you'll have the list of the most
>> frequent spelling mistakes. And that could be important for grammar
>> checking (I don't know how frequent spelling mistakes are in French;
>> it seems they should be pretty frequent, as you have a one-to-many
>> relationship between pronunciation and spelling).
>>
>> Infrequent mistakes aren't really important for grammar checking. Of
>> course, all context-dependent mistakes should be absolutely ruled
>> out - this kind of context dependence should be used in LT rules to
>> detect common spelling mistakes.
>>
>> Regards,
>> Marcin
>>
>> Agnes SOUQUE wrote:
>>
>>> Hi,
>>> I'm a bit sceptical concerning such an approach for spell-checking...
>>> All mistakes can't be listed (well, we could say the same for
>>> correct words...).
>>> So not all misspelled words of a text may be found, and
>>> there are probably misspelled words left that can prevent some
>>> grammar and tagging rules from applying in LT. Then, anyway, it is
>>> necessary to use the tagged lexicon to find other potential
>>> misspelled words. So I don't really see the interest in using a list
>>> of misspelled words...
>>> Best
>>> Agnes
>>>
>>> On 23 Oct 07 at 15:39, Marcin Miłkowski wrote:
>>>
>>>> It could be pretty easy to make such a list for French - grab a
>>>> large corpus, use a spell checker, and filter out misspelled words
>>>> along with suggestions. Or simply use a large autocorrect list to
>>>> make such a list.
>>>>
>>>> It might be worth trying.
>>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>> r.j.baars wrote:
>>>>
>>>>> Especially for reasons like these, Alpino has a dictionary
>>>>> containing lots of incorrectly spelled, but often used words.
>>>>>
>>>>> It might be worth the trouble trying to find the most commonly
>>>>> misspelled words and perform this trick. It might be useful for
>>>>> non-OOo use.
>>>>>
>>>>> Ruud
>>>>>
>>>>> Marcin Miłkowski wrote:
>>>>>
>>>>>> Laurent Godard wrote:
>>>>>>
>>>>>>> Hi Agnes, Hi all
>>>>>>>
>>>>>>>> Concerning the spellchecking, our priority now would be to
>>>>>>>> verify if all words are in the lexicon or not. Just after the
>>>>>>>> tokenization and before the tagging, the user should be
>>>>>>>> informed that a given word is unknown and that it is probably
>>>>>>>> misspelled. The user should then have the possibility to
>>>>>>>> replace the word with a corrected one. For the moment, we
>>>>>>>> don't want the system to make suggestions to correct unknown
>>>>>>>> words, for example.
>>>>>>>>
>>>>>>> I agree with you that it is a prerequisite for
>>>>>>> disambiguation and later unification
>>>>>>>
>>>>>>> We can't grammatically analyze a sentence if it contains unknown
>>>>>>> words
>>>>>>>
>>>>>> Yes, you can, and you do it all the time yourself. You cannot
>>>>>> know all possible proper names, for example. You can silently
>>>>>> ignore those words and even use a simple heuristic (if it's
>>>>>> capitalized and has UNKNOWN postag, it's a proper name).
>>>>>>
>>>>>>>> Of course, if LT is used in OOo, a complete spellchecking can
>>>>>>>> be realized by OOo...
>>>>>>>>
>>>>>>> The API to be used to obtain the suggestions is quite easy (
>>>>>>> http://api.openoffice.org/docs/common/ref/com/sun/star/linguistic2/SpellChecker.html
>>>>>>> http://api.openoffice.org/docs/DevelopersGuide/OfficeDev/OfficeDev.xhtml#1_2_3_Linguistics
>>>>>>> )
>>>>>>>
>>>>>>> But it requires UNO and OOo so, if implemented, it should be
>>>>>>> optional
>>>>>>>
>>>>>> LT will underline in green with the new API that is now being
>>>>>> prepared, and probably the old way (the dialog box way) is not
>>>>>> going to be the most popular anyway.
>>>>>>
>>>>>>> Moreover, this would imply that the OOo lexicons and the LT one
>>>>>>> are the same, which is not the case (for the moment)
>>>>>>>
>>>>>>> So let's keep it simple and mark the sentence as grammatically
>>>>>>> incorrect if a spellchecking error is detected, and avoid doing
>>>>>>> further processing in this case
>>>>>>>
>>>>>> This is wrong - not all unknown words are misspelled, see above,
>>>>>>
>>>>>> Regards,
>>>>>> Marcin
>>>>>>
>>>>>>> Laurent
>>>>>>>
>>>>>> -------------------------------------------------------------------------
>>>>>> This SF.net email is sponsored by: Splunk Inc.
>>>>>> Still grepping through log files to find problems? Stop.
>>>>>> Now Search log events and configuration files using AJAX and a
>>>>>> browser.
>>>>>> Download your FREE copy of Splunk now >> http://get.splunk.com/
>>>>>> _______________________________________________
>>>>>> Languagetool-devel mailing list
>>>>>> Lan...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>
>>>
>>> --
>>> Agnès SOUQUE
>>> as...@nu...
>>> http://blogs.nuxeo.com/sections/blogs/agnes-souque
>>> --

--
Agnès SOUQUE
as...@nu...
http://blogs.nuxeo.com/sections/blogs/agnes-souque
--
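
Ruud's suggestion earlier in this thread (generate likely typos by letter swaps and adjacent-key substitutions, then count how often each one occurs) could be sketched roughly as follows. This is only an illustration under assumptions: the AZERTY map covers same-row neighbours of unaccented lowercase letters only, a local token list stands in for the Google or Wikipedia counts mentioned in the thread, and the names candidate_misspellings and frequent_misspellings are hypothetical, not part of LT. Candidates that are themselves in the lexicon are skipped, since those are the context-dependent cases Marcin says belong in LT rules instead.

```python
from collections import Counter

# Assumption: a simplified AZERTY layout, same-row neighbours only,
# no accented keys. A real keyboard model would be more complete.
AZERTY_ROWS = ["azertyuiop", "qsdfghjklm", "wxcvbn"]


def neighbours(ch):
    """Return the keys directly left and right of ch on its row."""
    for row in AZERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []  # character not in the simplified map


def candidate_misspellings(word):
    """Generate typos by adjacent letter swap and neighbour-key substitution."""
    out = set()
    # Swap each pair of adjacent letters.
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    # Replace each letter by a neighbouring key.
    for i, ch in enumerate(word):
        for n in neighbours(ch):
            out.add(word[:i] + n + word[i + 1:])
    out.discard(word)  # swapping identical letters reproduces the word
    return out


def frequent_misspellings(lexicon, corpus_tokens, threshold=2):
    """Map each frequent typo to the lexicon words it may come from.

    corpus_tokens stands in for whatever usage counts are available
    (the thread mentions a search engine or a Wikipedia dump).
    """
    counts = Counter(corpus_tokens)
    result = {}
    for word in lexicon:
        for cand in candidate_misspellings(word):
            if cand in lexicon:
                continue  # a real word: context-dependent, not listed here
            if counts[cand] >= threshold:
                result.setdefault(cand, set()).add(word)
    return result


if __name__ == "__main__":
    lexicon = {"maison", "chat"}
    tokens = "la masion est grande la masion chat cgat".split()
    print(frequent_misspellings(lexicon, tokens))
```

The threshold parameter plays the role of the "threshold of use" mentioned in the thread: with the default of 2, the sample keeps "masion" (seen twice) and drops "cgat" (seen once).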