## Training upgrades
- It should now be possible to train all annotators on Windows: https://github.com/stanfordnlp/stanza-train/issues/20 https://github.com/stanfordnlp/stanza/issues/1439 The issue was twofold: a shell call to a Perl script (Perl could actually be installed on Windows, but was an annoyance for non-Perl users) and an overreliance on temp files, which can be opened a second time while still open on Unix but not on Windows. Fixed in https://github.com/stanfordnlp/stanza/commit/2677e7789394225c7da09d857a6de15bcb62180b https://github.com/stanfordnlp/stanza/commit/d5c7b7ffee4089f43bc712c3910ae573ed8e686e https://github.com/stanfordnlp/stanza/pull/1514
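The Windows temp-file limitation mentioned above is a general Python portability issue, independent of Stanza's actual code: a `tempfile.NamedTemporaryFile` cannot be opened a second time by name on Windows while the original handle is still open. A minimal sketch of the portable pattern:

```python
import os
import tempfile

# Portable pattern: create the file with delete=False, close the handle,
# then reopen it by name. Opening the file a second time while the first
# handle is still open works on Unix but fails on Windows.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
try:
    tmp.write("training data\n")
    tmp.close()  # release the handle before reopening
    with open(tmp.name) as fin:
        print(fin.read(), end="")  # training data
finally:
    os.unlink(tmp.name)  # clean up explicitly, since delete=False
```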
## Model upgrades
- The tokenizer can now use the pretrained charlm. This significantly improves MWT performance on Hebrew, for example. https://github.com/stanfordnlp/stanza/pull/1511
- Building tokenizers with the pretrained charlm exposed a possible issue where the tokenizer included spaces when an MWT is split across two words. The effect occurred in Hebrew, but an English example would be `wo n't` tokenized as a single token with an embedded space. Augmenting the training data to enforce word splits across those spaces fixed the issue. https://github.com/stanfordnlp/stanza/pull/1511/commits/52cea783431c85af68227c0f00dc4022a36ea7f4
- Use PackedSequence for the tokenizer. This is slower, but results are stable when using inputs of different lengths: https://github.com/stanfordnlp/stanza/commit/4433e83542a34e9ef121d17db84695d9d359d5f1 https://github.com/stanfordnlp/stanza/issues/1472
- If a tokenizer training set consistently has spaces between the ends of words and punctuation, the resulting trained model may not properly recognize the same text with the punctuation attached directly to the end of the word. For example, `this is a test .` vs `this is a test.` Reported in https://github.com/stanfordnlp/stanza/issues/1504 Fixed for VI by https://github.com/stanfordnlp/stanza/commit/6878d8e6405441ee1d14de3d96f8a786ccc599ed
- Coref now includes a zeros predictor, which predicts when a mention in certain datasets (such as Spanish) is a pro-drop mention. This is represented by adding an empty node to the sentence. It can be disabled by passing the `coref_use_zeros=False` flag to the Pipeline. https://github.com/stanfordnlp/stanza/pull/1502
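The punctuation-spacing issue above is the kind of mismatch typically addressed with training-data augmentation. A minimal sketch of the idea, not Stanza's actual code (the function name and `rate` parameter are hypothetical):

```python
import random

def augment_final_punct(sentence, rate=0.5):
    """Randomly remove the space before sentence-final punctuation, so the
    tokenizer sees both 'word .' and 'word.' forms during training."""
    if sentence.endswith(" .") and random.random() < rate:
        return sentence[:-2] + "."
    return sentence

# With rate=1.0 the augmentation always fires, so the result is deterministic.
print(augment_final_punct("this is a test .", rate=1.0))  # this is a test.
```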
## Model improvements
- Sindhi pipeline based on the ISRA UD dataset, published at SyntaxFest 2025, with annotation support from MLtwist: https://aclanthology.org/2025.udw-1.11/
- Tamil coreference model from KBC
- update the English lemmatizer with more verbs and ADJ from Prof. Lapalme
- also, French lemmatizer changes with corrections from Prof. Lapalme
- create a German lemmatizer using GSD data and a set of ADJ from Wiktionary
- add GRC models trained on the data mixed with a copy of the data with the diacritics stripped. Because those work worse on GRC with diacritics, the originals are still the default: https://github.com/stanfordnlp/stanza/commit/5beca58c054c404b8ab6552fbcf61dee5b33a7e9
- add a Thai TUD dataset from https://github.com/nlp-chula/TUD (not yet included in UD): https://github.com/stanfordnlp/stanza/commit/bca078cda2b10e74e04abde2667ddf6b896d7efb
- NER model for ANG: https://github.com/stanfordnlp/stanza/commit/68a56aa51e013631c2a5cfbc044b41d23fe63780 https://github.com/dmetola/Old_English-OEDT/tree/main
- NER models for Hindi, Telugu, and Urdu: https://github.com/stanfordnlp/stanza/issues/1469, model built from https://github.com/ltrc/IL-NER, added in https://github.com/stanfordnlp/stanza/commit/a4902dfbf14164cda6ae0d82ff393264cc3a347d
## Other interface improvements
- fix conparser SyntaxWarning: https://github.com/stanfordnlp/stanza/pull/1513 thanks to @orenl
- improve efficiency of reading CoNLL-U documents: https://github.com/stanfordnlp/stanza/commit/f15f0bc56ccea285cd5278ff75207e63ca9178b7
- sort CoNLL-U features when outputting a doc, as is standard: https://github.com/stanfordnlp/stanza/commit/aa20fbb27f8e402595723e3609f2d9ae0dd452b1
- semgrex interface improvements: search all files, only output failed matches, process all documents at once
- turn the coref `max_train_len` into a parameter: https://github.com/stanfordnlp/stanza/commit/1f98d8f55f1b537a688141f181c055506d3eeb1b https://github.com/stanfordnlp/stanza/issues/1465
- allow combined depparse models with multiple training files in a zip file (easier to mix training data): https://github.com/stanfordnlp/stanza/commit/be94ac6f1af6c210cd82e841abeaad6ff31b0fb1
- lemmatizer can skip blank lemmas (useful when training with partially complete lemma data): https://github.com/stanfordnlp/stanza/commit/7c34714d8bfa9c4cbb92b50ea4fa8fc6257f5451
- if using pretokenized text in the NER, try to use the token text to extract the text (previously this would crash): https://github.com/stanfordnlp/stanza/commit/ab249f6f425795dad898c0c878e8fa0c84f3fdfa
- don't retokenize pretokenized sentences: https://github.com/stanfordnlp/stanza/pull/1466 https://github.com/stanfordnlp/stanza/issues/1464
- remove stray test output files: https://github.com/stanfordnlp/stanza/pull/1493/commits/2e4735a358ca535284be92b72b340623389b2637 thanks to @otakutyrant
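The CoNLL-U format sorts each word's features alphabetically (case-insensitively) by feature name. A minimal sketch of that sorting on a raw FEATS string, independent of Stanza's actual implementation:

```python
def sort_feats(feats):
    """Sort a CoNLL-U FEATS string alphabetically, case-insensitively,
    as the CoNLL-U format specifies. '_' means no features."""
    if feats == "_":
        return feats
    return "|".join(sorted(feats.split("|"), key=str.lower))

print(sort_feats("Number=Sing|Case=Nom|Gender=Masc"))
# Case=Nom|Gender=Masc|Number=Sing
```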
## Constituency parser
- relative attention layer, similar to the one used in https://aclanthology.org/2023.findings-emnlp.25/ https://github.com/stanfordnlp/stanza/pull/1474
- output some basic analysis of errors: https://github.com/stanfordnlp/stanza/commit/5503c4c83f4d9403cf98960e72da243c580a522f
- current best conparser published at SyntaxFest 2025: https://aclanthology.org/2025.iwpt-1.4/
## Package dependency updates
- remove `verbose` from ReduceLROnPlateau: https://github.com/stanfordnlp/stanza/commit/1015b6bc916cda9ae1a532680aae55376a8fad82 thanks to @otakutyrant
- update usage of xml.etree.ElementTree to match the updated Python interface: https://github.com/stanfordnlp/stanza/commit/7ca875019993628f0d01df833e63e28a3eb497cb thanks to @otakutyrant
- suppress a jieba warning; the package has not been updated in many years and is not likely to be updated to fix deprecation errors any time soon: https://github.com/stanfordnlp/stanza/commit/0afdb611955d9bb0522af367c159a65d53b3b839 thanks to @otakutyrant
- drop support for Python 3.8: https://github.com/stanfordnlp/stanza/commit/6420c3da5f20435c1b3761cb5bd4c0a9846c9c4a thanks to @otakutyrant
- update the tomli version requirement: https://github.com/stanfordnlp/stanza/pull/1444 thanks to @BLKSerene
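As an aside on the ElementTree update: a typical instance of this migration (I have not checked which calls the commit actually touched) is replacing `Element.getchildren()`, which was deprecated and then removed in Python 3.9, with direct iteration over the element:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><sent>one</sent><sent>two</sent></doc>")

# Element.getchildren() was removed in Python 3.9; iterating the element
# (or calling list() on it) is the modern equivalent.
children = list(root)
print([child.text for child in children])  # ['one', 'two']
```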