SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing. Purely data driven, sentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required. SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.

Features

  • Multiple subword algorithms
  • Subword regularization
  • Fast and lightweight
  • Self-contained
  • Direct vocabulary id generation
  • NFKC-based normalization

Project Samples

Project Activity

See All Activity >

Categories

Machine Learning

License

Apache License V2.0

Follow SentencePiece

SentencePiece Web Site

Other Useful Business Software
Ship Agents Faster Icon
Ship Agents Faster

Transform your applications and workflows into powerful agentic systems at global scale.

Gemini Enterprise Agent Platform lets you rapidly build, scale, govern and optimize production-ready agents grounded in your organization's data. The platform enables developers to build custom or pre-built agents for virtually any use case. New customers get $300 in free credits.
Get Started Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of SentencePiece!

Additional Project Details

Operating Systems

Mac

Programming Language

C++

Related Categories

C++ Machine Learning Software

Registered

2021-10-06