A small library for converting tokenized PHP source code into XML
Unsupervised text tokenizer for Neural Network-based text generation
tiktoken is a fast BPE tokeniser for use with OpenAI's models
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm
Tokenizer-Free TTS for Multilingual Speech Generation
TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
This repo contains the code for a 1D tokenizer and generator
Long-form streaming TTS system for multi-speaker dialogue generation
PostgreSQL extension for full-text search of Chinese-language text
Python library and CLI tool to interface with Google Translate
The best ChatGPT that $100 can buy
LLM-based Reinforcement Learning audio editing model
A Foundation Model for the Language of Financial Markets
Audiocraft is a library for audio processing and generation
Pre-trained Neural Network models in Axon
The official PyTorch implementation of Google's Gemma models
Qwen3-Coder is the code version of Qwen3
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Large Language Model Principles and Practice Tutorial from Scratch
Unified Multimodal Understanding and Generation Models
Data loaders and abstractions for text and NLP
A plugin that integrates the Lucene IK analyzer into Elasticsearch
Code for the paper "Language Models are Unsupervised Multitask Learners"
The regex-centric, fast lexical analyzer generator for C++