Minimal, clean code for the Byte Pair Encoding (BPE) algorithm
Unsupervised text tokenizer for Neural Network-based text generation
Apple Silicon (MLX) port of Karpathy's autoresearch
tiktoken is a fast BPE tokeniser for use with OpenAI's models
Open-source pre-training implementation of Google's LaMDA in PyTorch
Unsupervised text tokenizer focused on computational efficiency
A Windows library that merges standard PDFs into a single PDF
Real-time big data tool for bit strings up to 2^63, based on an AVL forest
A forensic file identification tool using neural networks
Robust BERT-based model for English with improved MLM training