A small library for converting tokenized PHP source code into XML
Unsupervised text tokenizer for Neural Network-based text generation
tiktoken is a fast BPE tokeniser for use with OpenAI's models
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm
Tokenizer-Free TTS for Multilingual Speech Generation
TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
This repo contains the code for a 1D tokenizer and generator
Long-form streaming TTS system for multi-speaker dialogue generation
PostgreSQL extension for full-text search of Chinese-language text
Python library and CLI tool to interface with Google Translate
The best ChatGPT that $100 can buy
LLM-based Reinforcement Learning audio editing model
A Foundation Model for the Language of Financial Markets
Large Language Model Principles and Practice Tutorial from Scratch
Audiocraft is a library for audio processing and generation
Pre-trained Neural Network models in Axon
Qwen3-Coder is the code version of Qwen3
The official PyTorch implementation of Google's Gemma models
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Unified Multimodal Understanding and Generation Models
Data loaders and abstractions for text and NLP
A plugin that integrates the Lucene IK analyzer into Elasticsearch
Code for the paper "Language Models are Unsupervised Multitask Learners"
The regex-centric, fast lexical analyzer generator for C++