| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| stringzilla_shared_macos_arm64_4.3.0.zip | < 13 hours ago | 68.5 kB | |
| stringzilla_bare_windows_x64_4.3.0.tar | < 13 hours ago | 117.8 kB | |
| stringzilla_bare_linux_amd64_4.3.0.deb | < 13 hours ago | 81.7 kB | |
| stringzilla_bare_linux_amd64_4.3.0.so | < 13 hours ago | 181.7 kB | |
| stringzilla_bare_linux_arm64_4.3.0.deb | < 13 hours ago | 67.9 kB | |
| stringzilla_bare_linux_arm64_4.3.0.so | < 13 hours ago | 115.4 kB | |
| README.md | < 13 hours ago | 6.0 kB | |
| v4.3_ Tokenizing UTF-8 with SIMD (Zhu) source code.tar.gz | < 13 hours ago | 668.4 kB | |
| v4.3_ Tokenizing UTF-8 with SIMD (Zhu) source code.zip | < 13 hours ago | 725.9 kB | |
| Totals: 9 Items | | 2.0 MB | 0 |
On AMD Zen5 Turin CPUs, StringZilla reaches the following throughput when splitting text around newline and whitespace characters, measured on datasets in 5 very different languages. Chinese and Korean texts, for example, both consist mostly of 3-byte characters, but Korean uses many whitespace characters to separate words, while Chinese doesn't use any. French and English both use many single-byte whitespace characters, but French also uses many accented letters that are 2 bytes long in UTF-8.
| Library | English | Chinese | Arabic | French | Korean |
|---|---|---|---|---|---|
| Split around 8 newline combinations: | | | | | |
| `stringzilla::utf8_newline_splits` | 15.45 GiB/s | 16.65 GiB/s | 18.34 GiB/s | 14.52 GiB/s | 16.71 GiB/s |
| `stdlib::split(char::is_unicode_newline)` | 1.90 GiB/s | 1.93 GiB/s | 1.82 GiB/s | 1.78 GiB/s | 1.81 GiB/s |
| Split around 25 whitespace characters: | | | | | |
| `stringzilla::utf8_whitespace_splits` | 0.82 GiB/s | 2.40 GiB/s | 2.40 GiB/s | 0.92 GiB/s | 1.88 GiB/s |
| `stdlib::split(char::is_whitespace)` | 0.77 GiB/s | 1.87 GiB/s | 1.04 GiB/s | 0.72 GiB/s | 0.98 GiB/s |
| `icu::WhiteSpace` | 0.11 GiB/s | 0.16 GiB/s | 0.15 GiB/s | 0.12 GiB/s | 0.15 GiB/s |
On Apple M2 Pro:
| Library | English | Chinese | Arabic | French | Korean |
|---|---|---|---|---|---|
| Split around 8 newline combinations: | | | | | |
| `stringzilla::utf8_newline_splits` | 5.69 GiB/s | 6.24 GiB/s | 6.58 GiB/s | 6.70 GiB/s | 6.29 GiB/s |
| `stdlib::split(char::is_unicode_newline)` | 1.12 GiB/s | 1.11 GiB/s | 1.11 GiB/s | 1.11 GiB/s | 1.13 GiB/s |
| Split around 25 whitespace characters: | | | | | |
| `stringzilla::utf8_whitespace_splits` | 0.57 GiB/s | 2.45 GiB/s | 1.18 GiB/s | 0.61 GiB/s | 0.92 GiB/s |
| `stdlib::split(char::is_whitespace)` | 0.59 GiB/s | 1.16 GiB/s | 0.99 GiB/s | 0.63 GiB/s | 0.89 GiB/s |
| `icu::WhiteSpace` | 0.10 GiB/s | 0.16 GiB/s | 0.14 GiB/s | 0.11 GiB/s | 0.14 GiB/s |
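To make the byte-width differences between the benchmarked languages concrete, here is a minimal Python sketch. It measures the average UTF-8 bytes per character of a few short phrases and splits text around the 25 code points that carry the Unicode `White_Space` property, which is the set Rust's `char::is_whitespace` tests for. The sample phrases and the helper names `avg_bytes_per_char` and `split_whitespace` are illustrative assumptions. This is a plain scalar reference, not StringZilla's SIMD kernels or its API.

```python
# The 25 code points with the Unicode White_Space property.
WHITE_SPACE = frozenset(
    "\u0009\u000A\u000B\u000C\u000D"  # tab, LF, VT, FF, CR
    "\u0020\u0085\u00A0\u1680"        # space, NEL, no-break space, Ogham space
    "\u2000\u2001\u2002\u2003\u2004\u2005"
    "\u2006\u2007\u2008\u2009\u200A"  # en quad through hair space
    "\u2028\u2029\u202F\u205F\u3000"  # line/paragraph separators, narrow/math/ideographic spaces
)
assert len(WHITE_SPACE) == 25


def avg_bytes_per_char(text):
    """Average number of UTF-8 bytes per code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


def split_whitespace(text):
    """Split `text` around White_Space code points, dropping empty tokens.

    Hypothetical helper for illustration only, not a StringZilla function.
    """
    tokens, start = [], 0
    for i, ch in enumerate(text):
        if ch in WHITE_SPACE:
            if i > start:
                tokens.append(text[start:i])
            start = i + 1
    if start < len(text):
        tokens.append(text[start:])
    return tokens


if __name__ == "__main__":
    for sample in ["hello world", "été chaud", "你好世界", "안녕 세상"]:
        print(sample, avg_bytes_per_char(sample), split_whitespace(sample))
```

A scalar loop like this inspects one character at a time, so text with few delimiters and text with many cost roughly the same per character. The throughput spread across languages in the tables above reflects how often a vectorized kernel has to stop and emit a token.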
Minor
- Add: UTF-8 case-folding placeholders (15bcc43)
- Add: UTF-8 serial case-folding (65b652f)
- Add: SVE2 kernels for UTF-8 (d4504be)
- Add: Skip-ahead UTF-8 iterator interface (958be10)
- Add: NEON UTF-8 tokenization kernels (0259f58)
- Add: `try_replace_all` for Rust (35ed227)
- Add: NEON UTF-8 placeholders (f1fcdc5)
- Add: Lazy UTF-8 views for Rust (c08dc0c)
- Add: `sz_utf8_unpack_upto64` for iterators (3ea1857)
- Add: UTF-8 length counting 15x faster (49d9da0)
- Add: `utf8.h` for new `valid` and `find_nth` interfaces (e0465d5)
- Add: UTF-8 bound checks for Rust (e7b4b9e)
- Add: UTF-8 boundary detection (f1e5318)
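Several of the entries above concern UTF-8 length counting and boundary detection. The changelog does not describe how the kernels do this. A common scalar formulation, sketched here in Python as an assumption rather than as StringZilla's implementation, relies on the fact that every continuation byte has the bit pattern `10xxxxxx`, so counting the bytes that are not continuation bytes counts the code points in well-formed input. The helper names `is_utf8_boundary`, `utf8_count`, and `nth_code_point_offset` are hypothetical.

```python
def is_utf8_boundary(byte):
    """True if `byte` starts a code point, i.e. is not a 10xxxxxx continuation byte."""
    return (byte & 0xC0) != 0x80


def utf8_count(data):
    """Count code points in well-formed UTF-8 by counting non-continuation bytes."""
    return sum(1 for b in data if is_utf8_boundary(b))


def nth_code_point_offset(data, n):
    """Byte offset of the n-th (0-based) code point, or -1 if there are fewer.

    Hypothetical helper; assumes `data` is well-formed UTF-8.
    """
    seen = 0
    for offset, b in enumerate(data):
        if is_utf8_boundary(b):
            if seen == n:
                return offset
            seen += 1
    return -1


if __name__ == "__main__":
    data = "été 你好".encode("utf-8")
    print(len(data), utf8_count(data), nth_code_point_offset(data, 4))
```

In C, the same test is often written as a signed comparison: continuation bytes `0x80` to `0xBF` are the values -128 to -65 when read as `int8_t`. That makes the signedness of the comparison easy to get wrong.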
Patch
- Make: `SZ_ENFORCE_SVE_OVER_NEON=0` by default (da5687d)
- Improve: Fewer loads in SVE2 and no fast paths (a06583a)
- Make: Bump macOS-13 → 15 in CI (98b8802)
- Improve: Fewer registers for `e280xx` masks in SVE2 (5434ebf)
- Improve: Faster SVE2 & Neon logic (bd9ddf5)
- Fix: NEON whitespace & newline equivalence (016c44a)
- Improve: UTF-8 equivalence checks (786a322)
- Fix: Missing `i8` greater-than in AVX2 (dd4c4b0)
- Fix: MSVC-compatible `uint8x16_t` init (97cf851)
- Improve: Consistent var. names in UTF-8 tokenizers (5c6a32a)
- Fix: Aligned state compilation in NEON (31e4c8b)
- Fix: Missing `svcompact_u8` in SVE2 (302af92)
- Improve: Include SVE2 benchmarks (4f558e1)
- Fix: Incorrect literal bound for test input (5e0f3ea)
- Improve: `skip_empty` arg for Python compatibility (0279383)
- Improve: Consistent split-iterator across languages (07c4d1c)
- Improve: Case-folding bump from Unicode 16 to 17 (9daa2a7)
- Fix: UBSAN issues in `hash.h` (36fa527)
- Docs: On complexity of case-insensitive substring search (ac5cb2f)
- Make: Bump Rust deps & drop ICU (ebc4296)
- Improve: New case-folding ABI (82528a7)
- Make: Separate file for UTF-8 unpacking (567cf17)
- Improve: Check UTF-8 case-folding (bf0ff0d)
- Make: Deprecate current UTF-32 unpacking code (b2b96f4)
- Fix: Misplaced UTF-8 skip in StringZilla (b838127)
- Fix: `svmatch`-ing zero characters in SVE2 kernels (6f045aa)
- Improve: Use fewer registers in SVE2 code (e52f4a1)
- Fix: `short` implicit casts (00bacfc)
- Improve: Test CRLF corner cases (0edc81f)
- Improve: Faster `utf8_count_neon` w/out u64 unpacking in loop (b583fa8)
- Improve: Fast path for 1-byte whitespace in NEON (73da441)
- Fix: Compile-time AES/SHA dispatch for Apple (8c34baf)
- Improve: More UTF-8 whitespace tokenization tests (8bb0324)
- Fix: `no_std` builds and doctests (bb699e9)
- Improve: Test UTF-8 decoding ops (849bff2)
- Fix: Out of bounds access in `sz_sha256_*_ice` (2bceb8d)
- Make: Correct `env` fields for `.vscode/tasks.json` (dda7704)
- Improve: Unlimited chunk size for UTF-8 iterators (aad09a4)
- Make: Tune Rust analyzer to use less RAM (ced9636)
- Fix: Skip U+001C, U+001D, U+001E (aca0473)
- Improve: Avoid optimization in more benchmarks (f979ed9)
- Improve: Fast path for UTF-8 whitespaces (a3c407f)
- Make: Build just 1 target for VS Code debug (26b0074)
- Fix: Signed comparisons for UTF-8 boundaries (f532ea2)
- Make: Redefining `SZ_DEBUG=0` in CMake (febbdac)