Open Source OCR Engine
Scalable data preprocessing and curation toolkit for LLMs
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm
Utility for efficiently grouping files and folders together
Unicode XML TEI text analysis platform
Unsupervised text tokenizer focused on computational efficiency
Romanization of 9 Indian languages (Unicode) into the English alphabet