From: Hèctor A. i F. <hec...@gm...> - 2025-10-15 15:58:10
I have been able to look at how this all worked. It was only a test, and it did not achieve the expected results in the time we were able to devote to it.

First, we selected a number of texts from Wikipedia, dividing them into Languedocian, Gascon, and Aranese. These are the _raw.txt files. Then we generated the _vislcg.txt files, as explained in the README.md file. Next came the most tedious part: manually disambiguating a few of them. We mostly disambiguated the Languedocian texts, because that was our task. Finally, the Makefile generates the prob file.

I don't really think that the generated prob files are necessarily worse than the one we took from French. They should be quite a bit better, however small their training corpora may be. The thing is that a lot of work has gone into patching the errors produced by the French prob with CG rules. These 'à la carte' disambiguation rules in CG are probably not as effective with the new prob files, which probably produce fewer errors, but some of those errors are different. At first glance, the expected improvement does not seem to materialize. For this reason, we eventually set this issue aside to focus on other things that seemed more productive.

Best,

Hèctor

Message from Hèctor Alòs i Font <hec...@gm...> on Wed, 15 Oct 2025 at 16:05:

> I found this in the documentation:
> https://wiki.apertium.org/wiki/Paire_Occitan-Fran%C3%A7ais#D.C3.A9sambigu.C3.AFsateur_statistique_2
>
> Message from Hèctor Alòs i Font <hec...@gm...> on Wed, 15 Oct 2025 at 16:02:
>
>> Hi Aure,
>>
>> The texts for training the tagger, if I remember correctly, were
>> something we tried back with Claudi Balaguer, but I don't think we
>> managed to get a pos-tagger that worked better than the one that
>> already existed. Consequently, we didn't use them, and simply left
>> them in case they might be useful to someone in the future.
>> I don't have access to Apertium stuff right now. I'll try to look
>> into it tonight.
>>
>> Best,
>>
>> Hèctor
>>
>> Message from Aure Séguier <a.s...@lo...> on Wed, 15 Oct 2025 at 15:39:
>>
>>> Hi,
>>>
>>> I reorganized the Occitan words to merge those that are variants
>>> across varieties. For instance, « veire » (oci), « véser »
>>> (oci@gascon), « véder » (oci@gascon) and « veir » (oci@aran) are
>>> now merged into a single verb, « véser ».
>>>
>>> The apertium-oci repository has a « texts » subdirectory with
>>> pos-tagged .vislcg.txt texts. As I understand it, these texts are
>>> used to fine-tune the pos-tagger with statistical techniques. I
>>> corrected these texts so that they reflect the new organization of
>>> verbs in the monodix.
>>>
>>> But now I have no idea what to do with these texts. How do I use
>>> them to fine-tune the pos-tagger? I found this page on the Apertium
>>> wiki: https://wiki.apertium.org/wiki/Tagger_training. But it
>>> doesn't mention any vislcg texts. Where can I find the procedure
>>> for fine-tuning the pos-tagger again with the corrected texts?
>>>
>>> Thanks
>>> --
>>> Aure SÉGUIER
>>>
>>> Head of the IT department
>>>
>>> Congrès permanent de la lenga occitana
>>>
>>> +33 (0)5 32 00 00 64
>>> www.locongres.org
>>> La Ciutat - Creem! , 5-7 rue de la Fontaine, 64000 Pau
>>>
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> Ape...@li...
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff