| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| ma-2015-02-05-prim-8-publ-inf_freq.dict | 2020-03-21 | 6.8 MB | |
SimpleLemmatizer
SimpleLemmatizer v.0.9
Copyright (c) 2020 Ján Mojžiš
This program and its parts may be used freely for non-commercial purposes.
A model is required in order to use this program. By default,
the program is supplied with one model for the public style domain.
The model was created from a dictionary.
Source of the dictionary:
https://korpus.sk/attachments/morphology_database/ma-2014-10-20.txt.bz2
Source of the frequency model for the public style:
https://korpus.sk/files/prim-8.0/prim-8.0-public-inf-lemma_frequency.txt.bz2
Conflicts between lemmas are resolved by the frequency model, which is also supplied with this base model.
Descriptions of these files are given at
https://korpus.sk/structure1.html
Both the public style dictionary and the frequency model are part of the Slovak National Corpus, copyright Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.
The base model provided with this program is for lemmatization of Slovak texts only.
Known problems with accuracy
During lemmatization, conflicts between candidate lemmas are resolved by the frequency model: of two conflicting lemmas, the one with
the higher frequency is used.
Words are processed as unigrams only; no bigram or n-gram processing is available,
so the context of a word is ignored, which has a negative impact on overall accuracy. For example, for the text "prišli pri študovaní krvi tých ľudí"
the program outputs "prísť pri študovaný krv to človek", whereas the correct output is "prísť pri študovať krv ten človek".
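
The behaviour described above can be illustrated with a minimal Python sketch. This is not the SimpleLemmatizer code and does not read the real model file; the names CANDIDATES, LEMMA_FREQ, pick_lemma and lemmatize are hypothetical, and the frequency values are invented toy numbers chosen only to reproduce the output reported above. Each word is looked up on its own and, where several lemmas are possible, the one with the higher frequency wins regardless of the neighbouring words.

```python
# Toy stand-in for the dictionary: word form -> candidate lemmas.
# Hypothetical data, not taken from the real model file.
CANDIDATES = {
    "prišli": ["prísť"],
    "pri": ["pri"],
    "študovaní": ["študovaný", "študovať"],
    "krvi": ["krv"],
    "tých": ["to", "ten"],
    "ľudí": ["človek"],
}

# Toy stand-in for the frequency model: lemma -> frequency.
# The numbers are invented for illustration only.
LEMMA_FREQ = {
    "študovaný": 120,
    "študovať": 80,
    "to": 900,
    "ten": 700,
}


def pick_lemma(form):
    """Return the candidate lemma with the highest frequency.

    Unknown word forms are returned unchanged (an assumption of this sketch).
    """
    candidates = CANDIDATES.get(form, [form])
    return max(candidates, key=lambda lemma: LEMMA_FREQ.get(lemma, 0))


def lemmatize(text):
    """Lemmatize word by word (unigrams); context is never consulted."""
    return " ".join(pick_lemma(token) for token in text.lower().split())


print(lemmatize("prišli pri študovaní krvi tých ľudí"))
```

Because pick_lemma sees only a single word, it cannot tell that "študovaní" follows the preposition "pri" or that "tých" modifies "ľudí", so the more frequent lemma is chosen even where the context calls for the other one.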