Model file: ma-2015-02-05-prim-8-publ-inf_freq.dict (6.8 MB, 2020-03-21)
				SimpleLemmatizer

SimpleLemmatizer v.0.9
Copyright (c) 2020 Ján Mojžiš
  
This program and its parts may be used freely for non-commercial purposes.
You need a model in order to use this program. By default,
the program is supplied with one model for the public style domain.
The model is created from a dictionary.
   
Source of the dictionary:
https://korpus.sk/attachments/morphology_database/ma-2014-10-20.txt.bz2
    
   
Source of the frequency model for public style:
https://korpus.sk/files/prim-8.0/prim-8.0-public-inf-lemma_frequency.txt.bz2
   
Conflicts between lemmas are resolved by the frequency model, which is also included in this base model.
A description of these files is given at
https://korpus.sk/structure1.html
   
Both the public style dictionary and the frequency model are part of the Slovak National Corpus, copyright by Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences
   
The base model provided with this program is for lemmatization of Slovak texts only


Known problems with accuracy
During lemmatization, conflicts between lemmas are resolved by the frequency model: of two conflicting lemmas,
the one with the higher frequency is used.
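
The frequency rule can be sketched in Python as follows. This is only a minimal
illustration and not the program's actual code: the function name pick_lemma, the
candidate list and the frequency counts are all made up for the example.

```python
def pick_lemma(candidates, lemma_freq):
    """Return the candidate lemma with the highest frequency.

    Lemmas missing from the frequency table count as 0;
    on a tie, the first candidate in the list is kept.
    """
    return max(candidates, key=lambda lemma: lemma_freq.get(lemma, 0))


# Illustrative counts only, NOT taken from the Slovak National Corpus.
lemma_freq = {"to": 900, "ten": 500}

# Two conflicting lemmas for one word form; the more frequent one wins.
print(pick_lemma(["ten", "to"], lemma_freq))  # prints: to
```

The choice depends only on the global counts, so the same word form always
receives the same lemma.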

Words are processed as unigrams only; no bigram or n-gram processing is available.
This has a negative impact on overall accuracy. For example, for the text "prišli pri študovaní krvi tých ľudí"
the program outputs "prísť pri študovaný krv to človek", whereas the correct output is "prísť pri študovať krv ten človek".
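
The effect of unigram processing can be reproduced with a small Python sketch.
It assumes a hypothetical table FORM_TO_LEMMA that maps each word form to the
single lemma already chosen for it; the entries are copied from the example
output above, and the function name lemmatize is an assumption, not part of
the program.

```python
# Hypothetical form -> lemma table; entries mirror the example output above.
FORM_TO_LEMMA = {
    "prišli": "prísť",
    "pri": "pri",
    "študovaní": "študovaný",
    "krvi": "krv",
    "tých": "to",
    "ľudí": "človek",
}


def lemmatize(text):
    """Lemmatize each token on its own; unknown tokens are passed through."""
    tokens = text.lower().split()
    return " ".join(FORM_TO_LEMMA.get(tok, tok) for tok in tokens)


print(lemmatize("prišli pri študovaní krvi tých ľudí"))
# prints: prísť pri študovaný krv to človek
```

Because each token is looked up alone, "tých" maps to the same lemma no matter
which noun follows it.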
Source: readme.txt, updated 2020-03-21