Name | Modified | Size | Downloads / Week |
---|---|---|---|
ma-2015-02-05-prim-8-publ-inf_freq.dict | 2020-03-21 | 6.8 MB | |

SimpleLemmatizer v.0.9
Copyright (c) 2020 Ján Mojžiš

Use this program and its parts freely for non-commercial use.

You need a model in order to use this program. The program is supplied with one model, for the public style domain.

The model is created from the following sources:
Dictionary source: https://korpus.sk/attachments/morphology_database/ma-2014-10-20.txt.bz2
Frequency model source for the public style: https://korpus.sk/files/prim-8.0/prim-8.0-public-inf-lemma_frequency.txt.bz2

Conflicts between lemmas are handled by the frequency model, which is also supplied in this base model. Descriptions of these files are given at https://korpus.sk/structure1.html

Both the public style dictionary and the frequency model are parts of the Slovak National Corpus, copyright by Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.

The base model provided with this program is for lemmatization of Slovak texts only.

Known problems with lemmatization accuracy:
Conflicts between lemmas are resolved by the frequency model: of two conflicting lemmas, the one with the higher frequency is used.
Words are processed as unigrams only; no bigram or n-gram processing is available, so the surrounding context of a word is ignored. This has a negative impact on overall accuracy.

For example, for the text "prišli pri študovaní krvi tých ľudí" it outputs "prísť pri študovaný krv to človek", whereas the correct output is "prísť pri študovať krv ten človek".
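
The frequency-based conflict handling described above can be illustrated with a minimal Python sketch. It is not the program's actual code and does not read the supplied .dict file. The names `CANDIDATES`, `FREQUENCY`, `lemmatize_token` and `lemmatize` are hypothetical, and the frequency counts are invented so that the toy data reproduces the output quoted in the known problems.

```python
# Hypothetical toy dictionary: word form -> candidate lemmas.
# This is NOT the format of the supplied model file.
CANDIDATES = {
    "prišli": ["prísť"],
    "pri": ["pri"],
    "študovaní": ["študovať", "študovaný"],
    "krvi": ["krv"],
    "tých": ["ten", "to"],
    "ľudí": ["človek"],
}

# Invented lemma frequencies (assumption), chosen only to mirror
# the behaviour described in the README.
FREQUENCY = {
    "prísť": 900,
    "pri": 5000,
    "študovať": 120,
    "študovaný": 150,
    "krv": 300,
    "ten": 4000,
    "to": 9000,
    "človek": 2500,
}


def lemmatize_token(token):
    """Return the most frequent candidate lemma for a single word form."""
    candidates = CANDIDATES.get(token.lower())
    if not candidates:
        # Unknown word form: leave it unchanged.
        return token
    # Unigram decision: neighbouring words are never consulted.
    return max(candidates, key=lambda lemma: FREQUENCY.get(lemma, 0))


def lemmatize(text):
    """Lemmatize whitespace-separated tokens one at a time."""
    return " ".join(lemmatize_token(token) for token in text.split())


print(lemmatize("prišli pri študovaní krvi tých ľudí"))
```

Because each token is resolved on its own, "študovaní" and "tých" receive whichever candidate lemma has the higher count, regardless of the surrounding words. Choosing "študovať" and "ten" here would require context, for example bigram statistics, which this approach does not use.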