StringZilla - Browse /v4.4.0 at SourceForge.net

Name	Modified	Size	Downloads / Week
stringzilla_bare_windows_x64_4.4.0.tar	2025-11-29	128.0 kB	0
stringzilla_bare_linux_arm64_4.4.0.deb	2025-11-29	86.8 kB	0
stringzilla_bare_linux_arm64_4.4.0.so	2025-11-29	177.1 kB	0
stringzilla_bare_linux_amd64_4.4.0.deb	2025-11-29	112.6 kB	0
stringzilla_bare_linux_amd64_4.4.0.so	2025-11-29	321.2 kB	0
stringzilla_shared_macos_arm64_4.4.0.zip	2025-11-29	94.1 kB	0
README.md	2025-11-29	2.9 kB	0
v4.4_ Case-Folding UTF-8 in AVX-512 source code.tar.gz	2025-11-29	729.1 kB	0
v4.4_ Case-Folding UTF-8 in AVX-512 source code.zip	2025-11-29	787.8 kB	0
Totals: 9 Items		2.4 MB	0

To my knowledge, this is the first ever properly vectorized case-folding (aka .to_lower()) implementation compliant with Unicode (v17) and using SIMD (AVX-512 for Intel Ice Lake and newer). The results are remarkable across most languages, but it wasn't trivial to achieve. Unlike dense linear algebra workloads, such as in SimSIMD, no shared logic holds across all languages and code points here. After all, Unicode began in 1989 and covers languages and writing systems that took thousands of years to develop and decades to be organized into a standardized set of rules.

This implementation focuses on locale-independent conversion. It covers each of the 1000+ case-folding rules in CaseFolding.txt of the Unicode spec, including:

simple cases, like ASCII English letters: 'A' → 'a'.
complex Latin extensions, where one codepoint expands into multiple characters: 'ẞ' → "ss".
ligatures and mathematical symbols, like 'ﬃ' → "ffi".
less common bicameral alphabets, including Armenian, Georgian, Vietnamese, and others.
fast memcpy-like paths for unicameral scripts, like Chinese, Japanese, and Korean.
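
For readers who want to see these rule classes in action, here is a minimal reference sketch. It uses Python's built-in `str.casefold()`, which applies locale-independent Unicode full case folding, and not StringZilla's own API. The `SAMPLES` table and the `folding_mismatches` helper are hypothetical names introduced only for this illustration, and the result depends on the Unicode version bundled with your Python build.

```python
# Reference behavior only: str.casefold() is Python's built-in Unicode full
# case folding. It is NOT StringZilla's API; names below are illustrative.
SAMPLES = {
    "A": "a",            # simple ASCII fold
    "\u1E9E": "ss",      # 'ẞ' LATIN CAPITAL LETTER SHARP S expands to two characters
    "\uFB03": "ffi",     # 'ﬃ' LATIN SMALL LIGATURE FFI expands to three characters
    "\u0531": "\u0561",  # Armenian capital AYB folds to small AYB
    "\u6C49\u5B57": "\u6C49\u5B57",  # Chinese is unicameral, so folding is a no-op
}

def folding_mismatches(cases):
    """Return the inputs whose casefold() result differs from the expected fold."""
    return [src for src, expected in cases.items() if src.casefold() != expected]

def utf8_lengths(text):
    """Return (bytes before folding, bytes after folding) in UTF-8."""
    return len(text.encode("utf-8")), len(text.casefold().encode("utf-8"))

if __name__ == "__main__":
    print("mismatches:", folding_mismatches(SAMPLES))
    # Folding can change the byte length, so it cannot always be done in place.
    print("sharp S bytes:", utf8_lengths("\u1E9E"))
    print("ffi ligature bytes:", utf8_lengths("\uFB03"))
```

The byte-length helper hints at why this is harder than ASCII lowercasing: one-to-many folds change the UTF-8 length of the string, so input and output offsets drift apart.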

To benchmark all of those, I've extended the StringWars benchmarks with new bench_unicode.rs and bench_unicode.py scripts and a bench_unicode.md report covering two dozen datasets pulled from the Leipzig Wikipedia corpora. On most languages the performance is great, except for Georgian and Vietnamese for now:

Language	Serial	Ice Lake AVX-512	Speedup
English	550.93 MiB/s	6.87 GiB/s	12.8x
German	482.14 MiB/s	2.54 GiB/s	5.4x
Russian	518.44 MiB/s	2.14 GiB/s	4.2x
Greek	255.05 MiB/s	960.11 MiB/s	3.8x
Chinese	526.30 MiB/s	1.00 GiB/s	1.95x
Vietnamese	346.69 MiB/s	353.04 MiB/s	1.02x
Georgian	519.07 MiB/s	517.61 MiB/s	0.997x
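
The Speedup column mixes MiB/s and GiB/s, so it is easiest to check after converting everything to one unit. The sketch below recomputes the ratios from the figures in the table, assuming 1 GiB = 1024 MiB; the `ROWS` structure and the helper names are hypothetical and are not part of StringWars.

```python
# Recompute the Speedup column from the throughput figures quoted above.
# Assumption: binary units, 1 GiB/s = 1024 MiB/s.
MIB_PER_GIB = 1024

# language -> ((serial value, unit), (AVX-512 value, unit)), copied from the table
ROWS = {
    "English":    ((550.93, "MiB/s"), (6.87, "GiB/s")),
    "German":     ((482.14, "MiB/s"), (2.54, "GiB/s")),
    "Russian":    ((518.44, "MiB/s"), (2.14, "GiB/s")),
    "Greek":      ((255.05, "MiB/s"), (960.11, "MiB/s")),
    "Chinese":    ((526.30, "MiB/s"), (1.00, "GiB/s")),
    "Vietnamese": ((346.69, "MiB/s"), (353.04, "MiB/s")),
    "Georgian":   ((519.07, "MiB/s"), (517.61, "MiB/s")),
}

def to_mib_per_s(value, unit):
    """Normalize a throughput figure to MiB/s."""
    if unit == "GiB/s":
        return value * MIB_PER_GIB
    if unit == "MiB/s":
        return value
    raise ValueError(f"unknown unit: {unit}")

def speedups(rows):
    """Return language -> AVX-512 throughput divided by serial throughput."""
    return {
        lang: to_mib_per_s(*simd) / to_mib_per_s(*serial)
        for lang, (serial, simd) in rows.items()
    }

if __name__ == "__main__":
    for lang, ratio in speedups(ROWS).items():
        print(f"{lang:<12}{ratio:.3f}x")
```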

For a complete comparison, go to StringWars 😉

Minor

Add: Fast path for Georgian case-folding (fa7422c)
Add: Case-insensitive ops for Python (d88e30a)
Add: Dispatch case-insensitive search (4ae91c0)
Add: Serial case-insensitive find & compare (4b18f05)

Patch

Fix: Eszett hex parsing warnings in Clang (8b27080)
Fix: Avoid __builtin missing on MSVC (fdc95f3)
Fix: Uninitialized values warning (b84c83e)
Improve: Safer & faster case-folding on Ice Lake (bcd5d16)
Improve: Case-folding on Ice Lake (bb23b60)
Fix: Move Ice Lake kernels out of Haswell scope (b7cc2c4)
Improve: Rename functions towards utf8_case* (44fbb92)
Improve: Faster serial Unicode folding (aa1b21b)
Improve: Re-group folding by char-length (c3586e2)
Docs: Avoid locale-specific Unicode rules (333a778)
Docs: Emoji-free doc section titles (#284) (dc11b40)

Source: README.md, updated 2025-11-29

StringZilla Files

10x faster string search, split, sort, and shuffle for long strings
