StringZilla - Browse /v4.3.0 at SourceForge.net

Name	Modified	Size	Downloads / Week
stringzilla_shared_macos_arm64_4.3.0.zip	< 13 hours ago	68.5 kB	0
stringzilla_bare_windows_x64_4.3.0.tar	< 13 hours ago	117.8 kB	0
stringzilla_bare_linux_amd64_4.3.0.deb	< 13 hours ago	81.7 kB	0
stringzilla_bare_linux_amd64_4.3.0.so	< 13 hours ago	181.7 kB	0
stringzilla_bare_linux_arm64_4.3.0.deb	< 13 hours ago	67.9 kB	0
stringzilla_bare_linux_arm64_4.3.0.so	< 13 hours ago	115.4 kB	0
README.md	< 13 hours ago	6.0 kB	0
v4.3_ Tokenizing UTF-8 with SIMD (Zhu) source code.tar.gz	< 13 hours ago	668.4 kB	0
v4.3_ Tokenizing UTF-8 with SIMD (Zhu) source code.zip	< 13 hours ago	725.9 kB	0
Totals: 9 Items		2.0 MB	0

On AMD Zen5 Turin CPUs, StringZilla delivers the following throughput when splitting text around whitespace and newline characters, measured on datasets in 5 vastly different languages. Chinese and Korean texts, for example, both consist mostly of 3-byte characters, but Korean uses many whitespace characters as separators, while Chinese uses none. French and English both use many single-byte whitespace characters, but French also uses many accented letters that are 2 bytes long in UTF-8.
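To make the byte-width and whitespace-density differences concrete, here is a minimal sketch using only the Rust standard library. It does not call StringZilla; the helper names `utf8_profile` and `whitespace_tokens` and the sample sentences are hypothetical and chosen only for illustration.

```rust
/// Returns (average UTF-8 bytes per character, share of whitespace characters).
/// Hypothetical helper, not part of StringZilla.
fn utf8_profile(text: &str) -> (f64, f64) {
    let chars = text.chars().count();
    if chars == 0 {
        return (0.0, 0.0);
    }
    let spaces = text.chars().filter(|c| c.is_whitespace()).count();
    (text.len() as f64 / chars as f64, spaces as f64 / chars as f64)
}

/// Splits around Unicode whitespace with the standard library, dropping empty tokens.
fn whitespace_tokens(text: &str) -> Vec<&str> {
    text.split(char::is_whitespace)
        .filter(|token| !token.is_empty())
        .collect()
}

fn main() {
    // Illustrative sentences only; these are not the benchmark datasets.
    let samples = [
        ("English", "the quick brown fox"),
        ("French", "déjà vu à l'été"),
        ("Chinese", "我喜欢喝茶"),
        ("Korean", "나는 차를 좋아해요"),
    ];
    for (language, text) in samples {
        let (bytes_per_char, whitespace_share) = utf8_profile(text);
        println!(
            "{language}: {bytes_per_char:.2} bytes/char, {:.0}% whitespace, {} tokens",
            whitespace_share * 100.0,
            whitespace_tokens(text).len()
        );
    }
}
```

Texts with more multi-byte characters and fewer separators give a splitter longer runs to skip between matches, which is one plausible reason the per-language numbers differ so much.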

Library	English	Chinese	Arabic	French	Korean
Split around 8 newline combinations:
`stringzilla::utf8_newline_splits`	15.45 GiB/s	16.65 GiB/s	18.34 GiB/s	14.52 GiB/s	16.71 GiB/s
`stdlib::split(char::is_unicode_newline)`	1.90 GiB/s	1.93 GiB/s	1.82 GiB/s	1.78 GiB/s	1.81 GiB/s

Split around 25 whitespace characters:
`stringzilla::utf8_whitespace_splits`	0.82 GiB/s	2.40 GiB/s	2.40 GiB/s	0.92 GiB/s	1.88 GiB/s
`stdlib::split(char::is_whitespace)`	0.77 GiB/s	1.87 GiB/s	1.04 GiB/s	0.72 GiB/s	0.98 GiB/s
`icu::WhiteSpace`	0.11 GiB/s	0.16 GiB/s	0.15 GiB/s	0.12 GiB/s	0.15 GiB/s

On Apple M2 Pro:

Library	English	Chinese	Arabic	French	Korean
Split around 8 newline combinations:
`stringzilla::utf8_newline_splits`	5.69 GiB/s	6.24 GiB/s	6.58 GiB/s	6.70 GiB/s	6.29 GiB/s
`stdlib::split(char::is_unicode_newline)`	1.12 GiB/s	1.11 GiB/s	1.11 GiB/s	1.11 GiB/s	1.13 GiB/s

Split around 25 whitespace characters:
`stringzilla::utf8_whitespace_splits`	0.57 GiB/s	2.45 GiB/s	1.18 GiB/s	0.61 GiB/s	0.92 GiB/s
`stdlib::split(char::is_whitespace)`	0.59 GiB/s	1.16 GiB/s	0.99 GiB/s	0.63 GiB/s	0.89 GiB/s
`icu::WhiteSpace`	0.10 GiB/s	0.16 GiB/s	0.14 GiB/s	0.11 GiB/s	0.14 GiB/s
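As a scalar reference for what "splitting around newline combinations" means, the sketch below uses only the Rust standard library. The set of 8 delimiters is an assumption, since these notes do not list it: LF, VT, FF, CR, NEL (U+0085), LS (U+2028), PS (U+2029), and the two-character CR LF sequence treated as a single break. The function names are hypothetical and this is not StringZilla's SIMD implementation.

```rust
/// Assumed single-character newlines; CR LF is handled separately as one break.
fn is_newline(c: char) -> bool {
    matches!(
        c,
        '\n' | '\u{000B}' | '\u{000C}' | '\r' | '\u{0085}' | '\u{2028}' | '\u{2029}'
    )
}

/// Splits `text` into lines, consuming CR LF as a single delimiter.
fn newline_splits(text: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut iter = text.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        if !is_newline(c) {
            continue;
        }
        lines.push(&text[start..i]);
        let mut end = i + c.len_utf8();
        // Merge a CR immediately followed by LF into one line break.
        if c == '\r' && matches!(iter.peek(), Some(&(_, '\n'))) {
            iter.next();
            end += 1;
        }
        start = end;
    }
    lines.push(&text[start..]);
    lines
}

fn main() {
    let lines = newline_splits("a\r\nb\u{2028}c\n");
    println!("{lines:?}");
}
```

A trailing newline yields a final empty slice here; whether to keep or skip empty tokens is a design choice that differs between split APIs.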

Minor

Add: UTF-8 case-folding placeholders (15bcc43)
Add: UTF-8 serial case-folding (65b652f)
Add: SVE2 kernels for UTF-8 (d4504be)
Add: Skip-ahead UTF-8 iterator interface (958be10)
Add: NEON UTF-8 tokenization kernels (0259f58)
Add: try_replace_all for Rust (35ed227)
Add: NEON UTF-8 placeholders (f1fcdc5)
Add: Lazy UTF-8 views for Rust (c08dc0c)
Add: sz_utf8_unpack_upto64 for iterators (3ea1857)
Add: UTF-8 length counting 15x faster (49d9da0)
Add: utf8.h for new valid and find_nth interfaces (e0465d5)
Add: UTF-8 bound checks for Rust (e7b4b9e)
Add: UTF-8 boundary detection (f1e5318)
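Several of the items above concern UTF-8 boundary detection, length counting, and finding the n-th character. A common scalar formulation, sketched below with only the Rust standard library, relies on the fact that continuation bytes have the bit pattern `10xxxxxx`, so reinterpreted as signed integers they are exactly the values below -64. The function names are hypothetical and the code is not taken from StringZilla, whose kernels are vectorized.

```rust
/// True if `byte` starts a code point, i.e. it is not a 10xxxxxx continuation byte.
/// Continuation bytes 0x80..=0xBF map to -128..=-65 when viewed as i8.
fn is_utf8_boundary(byte: u8) -> bool {
    (byte as i8) >= -0x40
}

/// Counts code points in valid UTF-8 by counting the bytes that start one.
fn utf8_count(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| is_utf8_boundary(b)).count()
}

/// Byte offset of the n-th (zero-based) code point, if it exists.
fn utf8_find_nth(bytes: &[u8], n: usize) -> Option<usize> {
    bytes
        .iter()
        .enumerate()
        .filter(|&(_, &b)| is_utf8_boundary(b))
        .nth(n)
        .map(|(offset, _)| offset)
}

fn main() {
    let text = "héllo 我";
    println!("code points: {}", utf8_count(text.as_bytes()));
    println!("offset of 3rd code point: {:?}", utf8_find_nth(text.as_bytes(), 2));
}
```

Because the check is a single comparison per byte with no branching on character width, it maps naturally onto SIMD compare-and-popcount instructions. The sketch assumes the input is already valid UTF-8.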

Patch

Make: SZ_ENFORCE_SVE_OVER_NEON=0 by default (da5687d)
Improve: Fewer loads in SVE2 and no fast paths (a06583a)
Make: Bump macOS-13 → 15 in CI (98b8802)
Improve: Fewer registers for e280xx masks in SVE2 (5434ebf)
Improve: Faster SVE2 & Neon logic (bd9ddf5)
Fix: NEON whitespace & newline equivalence (016c44a)
Improve: UTF-8 equivalence checks (786a322)
Fix: Missing i8 greater-than in AVX2 (dd4c4b0)
Fix: MSVC-compatible uint8x16_t init (97cf851)
Improve: Consistent var. names in UTF-8 tokenizers (5c6a32a)
Fix: Aligned state compilation in NEON (31e4c8b)
Fix: Missing svcompact_u8 in SVE2 (302af92)
Improve: Include SVE2 benchmarks (4f558e1)
Fix: Incorrect literal bound for test input (5e0f3ea)
Improve: skip_empty arg for Python compatibility (0279383)
Improve: Consistent split-iterator across languages (07c4d1c)
Improve: Case-folding bump from Unicode 16 to 17 (9daa2a7)
Fix: UBSAN issues in hash.h (36fa527)
Docs: On complexity of case-insensitive substring search (ac5cb2f)
Make: Bump Rust deps & drop ICU (ebc4296)
Improve: New case-folding ABI (82528a7)
Make: Separate file for UTF-8 unpacking (567cf17)
Improve: Check UTF-8 case-folding (bf0ff0d)
Make: Deprecate current UTF-32 unpacking code (b2b96f4)
Fix: Misplaced UTF-8 skip in StringZilla (b838127)
Fix: svmatch-ing zero characters in SVE2 kernels (6f045aa)
Improve: Use fewer registers in SVE2 code (e52f4a1)
Fix: short implicit casts (00bacfc)
Improve: Test CRLF corner cases (0edc81f)
Improve: Faster utf8_count_neon w/out u64 unpacking in loop (b583fa8)
Improve: Fast path for 1-byte whitespace in NEON (73da441)
Fix: Compile-time AES/SHA dispatch for Apple (8c34baf)
Improve: More UTF-8 whitespace tokenization tests (8bb0324)
Fix: no_std builds and doctests (bb699e9)
Improve: Test UTF-8 decoding ops (849bff2)
Fix: Out of bounds access in sz_sha256_*_ice (2bceb8d)
Make: Correct env fields for .vscode/tasks.json (dda7704)
Improve: Unlimited chunk size for UTF-8 iterators (aad09a4)
Make: Tune Rust analyzer to use less RAM (ced9636)
Fix: Skip U+001C, U+001D, U+001E (aca0473)
Improve: Avoid optimization in more benchmarks (f979ed9)
Improve: Fast path for UTF-8 whitespaces (a3c407f)
Make: Build just 1 target for VS Code debug (26b0074)
Fix: Signed comparisons for UTF-8 boundaries (f532ea2)
Make: Redefining SZ_DEBUG=0 in CMake (febbdac)

Source: README.md, updated 2025-11-26

StringZilla Files

10x faster string search, split, sort, and shuffle for long strings
