Open Source OCR Engine
Scalable data preprocessing and curation toolkit for LLMs
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm
Utility for efficiently grouping files and folders together
Unsupervised text tokenizer focused on computational efficiency
Romanization of 9 Indian languages (Unicode) into the English alphabet