| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| stringzilla_bare_windows_x64_4.4.0.tar | 2025-11-29 | 128.0 kB | |
| stringzilla_bare_linux_arm64_4.4.0.deb | 2025-11-29 | 86.8 kB | |
| stringzilla_bare_linux_arm64_4.4.0.so | 2025-11-29 | 177.1 kB | |
| stringzilla_bare_linux_amd64_4.4.0.deb | 2025-11-29 | 112.6 kB | |
| stringzilla_bare_linux_amd64_4.4.0.so | 2025-11-29 | 321.2 kB | |
| stringzilla_shared_macos_arm64_4.4.0.zip | 2025-11-29 | 94.1 kB | |
| README.md | 2025-11-29 | 2.9 kB | |
| v4.4_ Case-Folding UTF-8 in AVX-512 source code.tar.gz | 2025-11-29 | 729.1 kB | |
| v4.4_ Case-Folding UTF-8 in AVX-512 source code.zip | 2025-11-29 | 787.8 kB | |
| Totals: 9 Items | | 2.4 MB | 0 |
To my knowledge, this is the first properly vectorized case-folding (aka `.to_lower()`) implementation that is both compliant with Unicode 17 and built on SIMD (AVX-512, for Intel Ice Lake and newer). The results are remarkable across most languages, but they weren't trivial to achieve. Unlike dense linear algebra workloads, such as those in SimSIMD, no shared logic holds across all languages and code points here. After all, Unicode began in 1989 and covers languages and writing systems that took thousands of years to develop and decades to organize into a standardized set of rules.
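To illustrate why one kernel cannot serve every script, here is a minimal Python sketch of per-chunk dispatch. It is not StringZilla's code: the `classify` and `fold_utf8` names, the 64-byte chunk size, and the choice of lead bytes are assumptions made for illustration, and Python's `str.casefold()` stands in for the general path.

```python
# Hypothetical sketch of per-chunk dispatch; not StringZilla's implementation.
CHUNK = 64  # assumed chunk size: one AVX-512 register holds 64 bytes


def classify(chunk: bytes) -> str:
    """Pick the cheapest path that is still correct for this chunk."""
    if chunk.isascii():
        return "ascii"  # only 'A'..'Z' can change
    for b in chunk:
        is_upper = 0x41 <= b <= 0x5A
        is_continuation = 0x80 <= b <= 0xBF
        # Lead bytes 0xE4..0xE9 encode U+4000..U+9FFF (mostly CJK
        # ideographs), a range without case-folding rules.
        is_caseless_lead = 0xE4 <= b <= 0xE9
        if is_upper or (b >= 0x80 and not (is_continuation or is_caseless_lead)):
            return "general"
    return "copy"


def fold_utf8(data: bytes) -> bytes:
    """Case-fold valid UTF-8 input chunk by chunk."""
    out = bytearray()
    start = 0
    while start < len(data):
        end = min(start + CHUNK, len(data))
        # Never split a character: back up to the start of the character
        # that straddles the boundary (assumes valid UTF-8 input).
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunk = data[start:end]
        path = classify(chunk)
        if path == "ascii":
            out += chunk.lower()  # bytes.lower() only touches ASCII letters
        elif path == "copy":
            out += chunk  # nothing to fold, memcpy-like
        else:
            # General path: defer to Python's full Unicode case folding.
            out += chunk.decode("utf-8").casefold().encode("utf-8")
        start = end
    return bytes(out)


print(fold_utf8("Hello, 世界".encode("utf-8")).decode("utf-8"))
```

Chunk-wise folding matches whole-string folding here because `str.casefold()` maps each code point independently of its neighbors.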
This implementation focuses on locale-independent conversion. It covers every one of the 1000+ character-folding rules in `CaseFolding.txt` of the Unicode spec, including:
- simple cases, like ASCII English letters: 'A' → 'a'.
- complex Latin extensions, where one codepoint expands into multiple characters: 'ẞ' → "ss".
- ligatures and mathematical symbols, like 'ﬃ' → "ffi".
- less common bicameral alphabets, including Armenian, Georgian, Vietnamese, and others.
- fast `memcpy`-like paths for unicameral scripts, like Chinese, Japanese, and Korean.
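The rule categories above can be checked against Python's built-in `str.casefold()`, which applies Unicode full case folding. This is a reference sketch using only the standard library, not StringZilla's API; the `EXAMPLES` table and the `utf8_lengths` helper are names made up for this illustration.

```python
# Reference sketch using Python's standard library, not StringZilla.
EXAMPLES = {
    "A": "a",            # simple ASCII
    "\u1E9E": "ss",      # 'ẞ' expands into two letters
    "\uFB03": "ffi",     # 'ﬃ' ligature expands into three letters
    "\u0531": "\u0561",  # Armenian 'Ա' → 'ա'
    "\u1C90": "\u10D0",  # Georgian Mtavruli 'Ა' → Mkhedruli 'ა'
    "\u1EBE": "\u1EBF",  # Vietnamese 'Ế' → 'ế'
    "\u6F22": "\u6F22",  # '漢' has no case and is copied unchanged
}


def utf8_lengths(text: str) -> tuple[int, int]:
    """Return the UTF-8 byte length before and after case folding."""
    return len(text.encode("utf-8")), len(text.casefold().encode("utf-8"))


for source, expected in EXAMPLES.items():
    assert source.casefold() == expected
    before, after = utf8_lengths(source)
    print(f"U+{ord(source):04X}: {before} -> {after} UTF-8 bytes")

# Folding is not the same as lowercasing: 'ß' is already lowercase,
# yet it still folds to "ss".
assert "ß".lower() == "ß"
assert "ß".casefold() == "ss"
```

The byte lengths show why the output cannot always be written in place: 'ẞ' shrinks from 3 bytes to 2, while other rules expand their input.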
To benchmark all of those, I've extended the StringWars benchmarks with new `bench_unicode.rs` and `bench_unicode.py` scripts, plus a `bench_unicode.md` report covering two dozen datasets pulled from the Leipzig Wikipedia corpora. On most languages the performance is great, except for Georgian and Vietnamese, which for now show no gain over the serial baseline:
| Language | Serial | Ice Lake AVX-512 | Speedup |
|---|---|---|---|
| English | 550.93 MiB/s | 6.87 GiB/s | 12.8x |
| German | 482.14 MiB/s | 2.54 GiB/s | 5.4x |
| Russian | 518.44 MiB/s | 2.14 GiB/s | 4.2x |
| Greek | 255.05 MiB/s | 960.11 MiB/s | 3.8x |
| Chinese | 526.30 MiB/s | 1.00 GiB/s | 1.95x |
| Vietnamese | 346.69 MiB/s | 353.04 MiB/s | 1.02x |
| Georgian | 519.07 MiB/s | 517.61 MiB/s | 0.997x |
For a complete comparison, go to StringWars 😉
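The table mixes MiB/s and GiB/s, so the speedup column is easiest to verify after converting both throughputs to one unit. A small sketch, assuming 1 GiB = 1024 MiB; the `to_mib_per_s` and `speedup` helpers are illustrative names and not part of StringWars.

```python
# Illustrative helpers for checking the speedup column; not part of StringWars.
UNIT_TO_MIB = {"MiB/s": 1.0, "GiB/s": 1024.0}


def to_mib_per_s(text: str) -> float:
    """Parse a throughput such as '6.87 GiB/s' into MiB/s."""
    value, unit = text.split()
    return float(value) * UNIT_TO_MIB[unit]


def speedup(serial: str, simd: str) -> float:
    """Ratio of SIMD throughput to serial throughput."""
    return to_mib_per_s(simd) / to_mib_per_s(serial)


# Throughputs copied from the table above: (serial, Ice Lake AVX-512).
ROWS = {
    "English": ("550.93 MiB/s", "6.87 GiB/s"),
    "German": ("482.14 MiB/s", "2.54 GiB/s"),
    "Russian": ("518.44 MiB/s", "2.14 GiB/s"),
    "Greek": ("255.05 MiB/s", "960.11 MiB/s"),
    "Chinese": ("526.30 MiB/s", "1.00 GiB/s"),
    "Vietnamese": ("346.69 MiB/s", "353.04 MiB/s"),
    "Georgian": ("519.07 MiB/s", "517.61 MiB/s"),
}

for language, (serial, simd) in ROWS.items():
    print(f"{language}: {speedup(serial, simd):.3f}x")
```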
Minor
- Add: Fast path for Georgian case-folding (fa7422c)
- Add: Case-insensitive ops for Python (d88e30a)
- Add: Dispatch case-insensitive search (4ae91c0)
- Add: Serial case-insensitive find & compare (4b18f05)
Patch
- Fix: Eszett hex parsing warnings in Clang (8b27080)
- Fix: Avoid `__builtin` missing on MSVC (fdc95f3)
- Fix: Uninitialized values warning (b84c83e)
- Improve: Safer & faster case-folding on Ice Lake (bcd5d16)
- Improve: Case-folding on Ice Lake (bb23b60)
- Fix: Move Ice Lake kernels out of Haswell scope (b7cc2c4)
- Improve: Rename functions towards `utf8_case*` (44fbb92)
- Improve: Faster serial Unicode folding (aa1b21b)
- Improve: Re-group folding by char-length (c3586e2)
- Docs: Avoid locale-specific Unicode rules (333a778)
- Docs: Emoji-free doc section titles (#284) (dc11b40)