## Training upgrades
- It should now be possible to train all annotators on Windows: https://github.com/stanfordnlp/stanza-train/issues/20 https://github.com/stanfordnlp/stanza/issues/1439 The issue was twofold: a shell call to a Perl script (Perl could actually be installed on Windows, but was an annoyance for non-Perl users) and an overreliance on temp files, which can be opened a second time while still open on Unix but not on Windows. Fixed in https://github.com/stanfordnlp/stanza/commit/2677e7789394225c7da09d857a6de15bcb62180b https://github.com/stanfordnlp/stanza/commit/d5c7b7ffee4089f43bc712c3910ae573ed8e686e https://github.com/stanfordnlp/stanza/pull/1514
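The Windows temp-file limitation mentioned above is a general Python portability issue, independent of Stanza's actual code: a `tempfile.NamedTemporaryFile` cannot be opened a second time by name on Windows while the original handle is still open. A minimal sketch of the portable pattern:

```python
import os
import tempfile

# Portable pattern: create the file with delete=False, close the handle,
# then reopen it by name. Opening the file a second time while the first
# handle is still open works on Unix but fails on Windows.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
try:
    tmp.write("training data\n")
    tmp.close()  # release the handle before reopening
    with open(tmp.name) as fin:
        print(fin.read(), end="")  # training data
finally:
    os.unlink(tmp.name)  # clean up explicitly, since delete=False
```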
## Model upgrades
- The tokenizer can now use the pretrained charlm. This significantly improves MWT performance on Hebrew, for example. https://github.com/stanfordnlp/stanza/pull/1511
- Building tokenizers with the pretrained charlm exposed a possible issue where the tokenizer included spaces when an MWT is split across two words. The effect occurred in Hebrew, but an English example would be `wo n't` tokenized as a single token with an embedded space. Augmenting the training data to enforce word splits across those spaces fixed the issue. https://github.com/stanfordnlp/stanza/pull/1511/commits/52cea783431c85af68227c0f00dc4022a36ea7f4
- Use PackedSequence for the tokenizer. This is slower, but results are stable when using inputs of different lengths: https://github.com/stanfordnlp/stanza/commit/4433e83542a34e9ef121d17db84695d9d359d5f1 https://github.com/stanfordnlp/stanza/issues/1472
- If a tokenizer training set consistently has spaces between the ends of words and punctuation, the resulting trained model may not properly recognize the same text with the punctuation attached directly to the end of the word. For example, `this is a test .` vs `this is a test.` Reported in https://github.com/stanfordnlp/stanza/issues/1504 Fixed for VI by https://github.com/stanfordnlp/stanza/commit/6878d8e6405441ee1d14de3d96f8a786ccc599ed
- Coref now includes a zeros predictor, which predicts when a mention in certain datasets (such as Spanish) is a pro-drop mention. This is represented by adding an empty node to the sentence. It can be disabled by passing the `coref_use_zeros=False` flag to the Pipeline. https://github.com/stanfordnlp/stanza/pull/1502
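The punctuation-spacing issue above is the kind of mismatch typically addressed with training-data augmentation. A minimal sketch of the idea, not Stanza's actual code (the function name and `rate` parameter are hypothetical):

```python
import random

def augment_final_punct(sentence, rate=0.5):
    """Randomly remove the space before sentence-final punctuation, so the
    tokenizer sees both 'word .' and 'word.' forms during training."""
    if sentence.endswith(" .") and random.random() < rate:
        return sentence[:-2] + "."
    return sentence

# With rate=1.0 the augmentation always fires, so the result is deterministic.
print(augment_final_punct("this is a test .", rate=1.0))  # this is a test.
```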
## Model improvements
- Sindhi pipeline based on the ISRA UD dataset, published at SyntaxFest 2025, with annotation support from MLtwist: https://aclanthology.org/2025.udw-1.11/
- Tamil coreference model from KBC
- update the English lemmatizer with more verbs and ADJ from Prof. Lapalme
- also, French lemmatizer changes with corrections from Prof. Lapalme
- create a German lemmatizer using GSD data and a set of ADJ from Wiktionary
- add GRC models trained on the data mixed with a copy of the data with the diacritics stripped. Because those work worse on GRC with diacritics, the originals are still the default: https://github.com/stanfordnlp/stanza/commit/5beca58c054c404b8ab6552fbcf61dee5b33a7e9
- add a Thai TUD dataset from https://github.com/nlp-chula/TUD (not yet included in UD): https://github.com/stanfordnlp/stanza/commit/bca078cda2b10e74e04abde2667ddf6b896d7efb
- NER model for ANG: https://github.com/stanfordnlp/stanza/commit/68a56aa51e013631c2a5cfbc044b41d23fe63780 https://github.com/dmetola/Old_English-OEDT/tree/main
- NER models for Hindi, Telugu, and Urdu: https://github.com/stanfordnlp/stanza/issues/1469, model built from https://github.com/ltrc/IL-NER, added in https://github.com/stanfordnlp/stanza/commit/a4902dfbf14164cda6ae0d82ff393264cc3a347d
## Other interface improvements
- fix conparser SyntaxWarning: https://github.com/stanfordnlp/stanza/pull/1513 thanks to @orenl
- improve efficiency of reading CoNLL-U documents: https://github.com/stanfordnlp/stanza/commit/f15f0bc56ccea285cd5278ff75207e63ca9178b7
- sort CoNLL-U features when outputting a doc, as is standard: https://github.com/stanfordnlp/stanza/commit/aa20fbb27f8e402595723e3609f2d9ae0dd452b1
- semgrex interface improvements: search all files, only output failed matches, process all documents at once
- turn the coref `max_train_len` into a parameter: https://github.com/stanfordnlp/stanza/commit/1f98d8f55f1b537a688141f181c055506d3eeb1b https://github.com/stanfordnlp/stanza/issues/1465
- allow combined depparse models with multiple training files in a zip file (easier to mix training data): https://github.com/stanfordnlp/stanza/commit/be94ac6f1af6c210cd82e841abeaad6ff31b0fb1
- lemmatizer can skip blank lemmas (useful when training with partially complete lemma data): https://github.com/stanfordnlp/stanza/commit/7c34714d8bfa9c4cbb92b50ea4fa8fc6257f5451
- if using pretokenized text in the NER, try to use the token text to extract the text (previously this would crash): https://github.com/stanfordnlp/stanza/commit/ab249f6f425795dad898c0c878e8fa0c84f3fdfa
- don't retokenize pretokenized sentences: https://github.com/stanfordnlp/stanza/pull/1466 https://github.com/stanfordnlp/stanza/issues/1464
- remove stray test output files: https://github.com/stanfordnlp/stanza/pull/1493/commits/2e4735a358ca535284be92b72b340623389b2637 thanks to @otakutyrant
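The CoNLL-U format sorts each word's features alphabetically (case-insensitively) by feature name. A minimal sketch of that sorting on a raw FEATS string, independent of Stanza's actual implementation:

```python
def sort_feats(feats):
    """Sort a CoNLL-U FEATS string alphabetically, case-insensitively,
    as the CoNLL-U format specifies. '_' means no features."""
    if feats == "_":
        return feats
    return "|".join(sorted(feats.split("|"), key=str.lower))

print(sort_feats("Number=Sing|Case=Nom|Gender=Masc"))
# Case=Nom|Gender=Masc|Number=Sing
```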
## Constituency parser
- relative attention layer, similar to the one used in https://aclanthology.org/2023.findings-emnlp.25/ https://github.com/stanfordnlp/stanza/pull/1474
- output some basic analysis of errors: https://github.com/stanfordnlp/stanza/commit/5503c4c83f4d9403cf98960e72da243c580a522f
- current best conparser published at SyntaxFest 2025: https://aclanthology.org/2025.iwpt-1.4/
## Package dependency updates
- remove `verbose` from ReduceLROnPlateau: https://github.com/stanfordnlp/stanza/commit/1015b6bc916cda9ae1a532680aae55376a8fad82 thanks to @otakutyrant
- update usage of xml.etree.ElementTree to match the updated Python interface: https://github.com/stanfordnlp/stanza/commit/7ca875019993628f0d01df833e63e28a3eb497cb thanks to @otakutyrant
- suppress a jieba warning; the package has not been updated in many years and is not likely to be updated to fix deprecation errors any time soon: https://github.com/stanfordnlp/stanza/commit/0afdb611955d9bb0522af367c159a65d53b3b839 thanks to @otakutyrant
- drop support for Python 3.8: https://github.com/stanfordnlp/stanza/commit/6420c3da5f20435c1b3761cb5bd4c0a9846c9c4a thanks to @otakutyrant
- update the tomli version requirement: https://github.com/stanfordnlp/stanza/pull/1444 thanks to @BLKSerene
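As an aside on the ElementTree update: a typical instance of this migration (I have not checked which calls the commit actually touched) is replacing `Element.getchildren()`, which was deprecated and then removed in Python 3.9, with direct iteration over the element:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><sent>one</sent><sent>two</sent></doc>")

# Element.getchildren() was removed in Python 3.9; iterating the element
# (or calling list() on it) is the modern equivalent.
children = list(root)
print([child.text for child in children])  # ['one', 'two']
```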