StringZilla is the Godzilla of string libraries, built for splitting, sorting, and shuffling large textual datasets. It uses a heuristic so simple it's almost stupid... but it works: it matches the first few letters of words with hyper-scalar code to reach memcpy speeds. The implementation fits into a single C99 header file and uses different SIMD flavors, falling back to SWAR on older platforms.

In Python, the Str class is designed to replace long Python str strings and wraps the library's C-level API. The File class, on the other hand, memory-maps a file from persistent storage without loading a copy into RAM. The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case is to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
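
The map-then-split workflow described above can be sketched with Python's standard library alone. This is a minimal illustration, not StringZilla's API: it uses the built-in mmap module in place of the File class, and the helper names (split_ranges, count_newlines, count_newlines_parallel) are hypothetical. Each worker maps the same file read-only and counts newline bytes in its own byte range, so no process loads a private copy of the data.

```python
import mmap
import os
from multiprocessing import Pool


def split_ranges(size, parts):
    """Split [0, size) into at most `parts` contiguous half-open byte ranges."""
    if size == 0:
        return []
    step = -(-size // parts)  # ceiling division
    return [(lo, min(lo + step, size)) for lo in range(0, size, step)]


def count_newlines(task):
    """Worker: map the file read-only and count newline bytes in [start, end)."""
    path, start, end = task
    with open(path, "rb") as f:
        # ACCESS_READ keeps the mapping immutable; the OS can share its pages
        # between all processes that map the same file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            pos = view.find(b"\n", start, end)
            while pos != -1:
                count += 1
                pos = view.find(b"\n", pos + 1, end)
            return count


def count_newlines_parallel(path, workers=4):
    """Split the file into byte ranges and count newlines in child processes."""
    size = os.path.getsize(path)
    if size == 0:
        return 0  # mmap cannot map an empty file
    tasks = [(path, lo, hi) for lo, hi in split_ranges(size, workers)]
    with Pool(workers) as pool:
        return sum(pool.map(count_newlines, tasks))
```

Counting single bytes makes the ranges independent, because every newline falls into exactly one range. A job that works on whole lines or words would also need to align each range boundary to the nearest delimiter. On platforms that spawn rather than fork child processes, call count_newlines_parallel from under an `if __name__ == "__main__":` guard.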

Features

  • Collection-Level Operations
  • Low-Level Python API
  • Splitting, sorting, and shuffling large textual datasets
  • JavaScript docs
  • Python docs
  • Substring Search
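
The "first few letters" heuristic behind the substring search can be illustrated in a few lines of Python. This is only a sketch of the idea under stated assumptions, not the library's implementation, which is written in C with SIMD and SWAR. The function name find_with_prefix_filter and the four-byte prefix width are choices made for this example. The first bytes of the needle are packed into one integer, each window of the haystack is compared against that integer, and the rest of the needle is checked only when the prefix matches.

```python
def find_with_prefix_filter(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1 if absent."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    k = min(m, 4)  # prefix width in bytes (an assumption for this sketch)
    prefix = int.from_bytes(needle[:k], "little")
    for i in range(n - m + 1):
        # Cheap filter: one integer comparison rejects most positions.
        if int.from_bytes(haystack[i:i + k], "little") != prefix:
            continue
        # Full check of the remaining bytes, only on a prefix match.
        if haystack[i + k:i + m] == needle[k:]:
            return i
    return -1
```

In a compiled implementation the prefix comparison maps onto a single machine-word or vector comparison, which is where the speed comes from. In Python the sketch only shows the control flow.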

Categories

JSON

License

Apache License V2.0

Additional Project Details

Programming Language

C++

Registered

2023-10-18