StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets. StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms. The Str is designed to replace long Python str strings and wrap our C-level API. On the other hand, the File memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Features

  • Collection-Level Operations
  • Low-Level Python API
  • String libraries, splitting, sorting, and shuffling large textual dataset
  • JavaScript docs
  • Python docs
  • Substring Search

Project Samples

Project Activity

See All Activity >

Categories

JSON

License

Apache License V2.0

Follow StringZilla

StringZilla Web Site

Other Useful Business Software
Go From AI Idea to AI App Fast Icon
Go From AI Idea to AI App Fast

One platform to build, fine-tune, and deploy ML models. No MLOps team required.

Access Gemini 3 and 200+ models. Build chatbots, agents, or custom models with built-in monitoring and scaling.
Try Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of StringZilla!

Additional Project Details

Programming Language

C++

Related Categories

C++ JSON Software

Registered

2023-10-18