Name         Modified     Size
src          2020-03-22
model        2020-03-21
sample       2020-03-21
bin          2020-03-21
include      2020-03-21
readme.txt   2020-03-21   1.6 kB
_classpath   2020-03-21   347 Bytes
_project     2020-03-21   392 Bytes
				SimpleLemmatizer

SimpleLemmatizer v.0.9
Copyright (c) 2020 Ján Mojžiš
  
This program and its parts may be used freely for non-commercial purposes.
A model is required in order to use this program. The program
is supplied with one model, for the public style domain.
The model is created from a dictionary.
   
Source of the dictionary:
https://korpus.sk/attachments/morphology_database/ma-2014-10-20.txt.bz2
    
   
Source of the frequency model for the public style:
https://korpus.sk/files/prim-8.0/prim-8.0-public-inf-lemma_frequency.txt.bz2
   
Conflicts between lemmas are resolved by the frequency model, which is also supplied as part of this base model.
A description of these files is given at
https://korpus.sk/structure1.html
   
Both the public style dictionary and the frequency model are parts of the Slovak National Corpus, copyright by Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences
   
The base model provided with this program is for lemmatization of Slovak texts only.


Known problems with accuracy
During lemmatization, conflicts between lemmas are resolved by the frequency model: of two conflicting lemmas, the one with
the higher frequency is used.
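
The frequency rule above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function name pick_lemma and the counts in toy_freq are hypothetical and are not taken from the supplied model. The tie-breaking behaviour (first candidate wins) is likewise an assumption.

```python
from typing import Dict, Iterable


def pick_lemma(candidates: Iterable[str], lemma_freq: Dict[str, int]) -> str:
    """Return the candidate lemma with the highest frequency.

    Lemmas missing from the frequency table are treated as frequency 0.
    On a tie, the first candidate in the given order is returned.
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate lemmas")
    return max(candidates, key=lambda lemma: lemma_freq.get(lemma, 0))


# Hypothetical counts, invented for illustration only.
toy_freq = {"študovaný": 120, "študovať": 90}

chosen = pick_lemma(["študovať", "študovaný"], toy_freq)
print(chosen)  # the lemma with the higher (toy) frequency
```

Because the choice depends only on the frequency table, the same word form always receives the same lemma, whatever sentence it appears in.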

Words are processed one at a time (as unigrams); no bigram or n-gram processing is available.
This has a negative impact on overall accuracy. For example, for the text "prišli pri študovaní krvi tých ľudí"
the program outputs "prísť pri študovaný krv to človek", whereas the correct output would be "prísť pri študovať krv ten človek".
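
The effect of word-by-word processing on this example can be reproduced with a small Python sketch. Everything in it is an assumption made for illustration: the tables FORM_TO_LEMMAS and LEMMA_FREQ are tiny hand-made stand-ins with invented counts, and lemmatize_unigram is a hypothetical name, not a function of this program.

```python
from typing import Dict, List

# Hypothetical toy dictionary: word form -> candidate lemmas.
FORM_TO_LEMMAS: Dict[str, List[str]] = {
    "prišli": ["prísť"],
    "pri": ["pri"],
    "študovaní": ["študovať", "študovaný"],  # ambiguous form
    "krvi": ["krv"],
    "tých": ["ten", "to"],                   # ambiguous form
    "ľudí": ["človek"],
}

# Hypothetical lemma frequencies (invented numbers).
LEMMA_FREQ: Dict[str, int] = {
    "študovaný": 120,
    "študovať": 90,
    "to": 500,
    "ten": 400,
}


def lemmatize_unigram(text: str) -> str:
    """Lemmatize each token on its own, ignoring neighbouring words."""
    lemmas = []
    for token in text.lower().split():
        # Unknown forms are passed through unchanged.
        candidates = FORM_TO_LEMMAS.get(token, [token])
        # Ambiguity is resolved by frequency alone; no context is consulted.
        best = max(candidates, key=lambda lemma: LEMMA_FREQ.get(lemma, 0))
        lemmas.append(best)
    return " ".join(lemmas)


result = lemmatize_unigram("prišli pri študovaní krvi tých ľudí")
print(result)
```

With these toy tables the sketch yields the same output as quoted above. Since each token is resolved in isolation, the preceding "pri" cannot steer "študovaní" toward "študovať"; that would require bigram or n-gram context.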
Source: readme.txt, updated 2020-03-21