SentencePiece is an unsupervised text tokenizer and detokenizer, intended mainly for neural network-based text generation systems in which the vocabulary size is fixed before the neural model is trained. It implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) with the extension of direct training from raw sentences. This makes it possible to build a purely end-to-end system that does not depend on language-specific pre- or postprocessing. SentencePiece is purely data-driven: it trains tokenization and detokenization models from sentences, so pre-tokenization (Moses tokenizer, MeCab, KyTea) is not always required. Sentences are treated simply as sequences of Unicode characters, with no language-dependent logic.
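To make "direct training from raw sentences" concrete, here is a minimal toy sketch of BPE-style training in plain Python. It is not the SentencePiece implementation or its API. The function names (`escape`, `train_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical, and the sketch assumes whitespace is escaped to the meta symbol U+2581 so that detokenization is plain concatenation. As a simplification, it also lets merges cross the whitespace symbol, which the real library may restrict.

```python
from collections import Counter

SPACE = "\u2581"  # meta symbol standing in for whitespace (assumed not present in the input)


def escape(sentence):
    """Treat whitespace as an ordinary symbol so no pre-tokenizer is needed."""
    return SPACE + sentence.replace(" ", SPACE)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules from raw, untokenized sentences."""
    # Each sentence starts as a sequence of single Unicode characters.
    corpus = [list(escape(s)) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(sentence, merges):
    """Apply the learned merges, in training order, to a new sentence."""
    symbols = list(escape(sentence))
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(pieces):
    """Detokenize: concatenate pieces, restore spaces, drop the leading marker."""
    return "".join(pieces).replace(SPACE, " ")[1:]


if __name__ == "__main__":
    rules = train_bpe(["hello world", "hello there"], num_merges=10)
    pieces = encode("hello world", rules)
    print(pieces, "->", decode(pieces))
```

Because whitespace is kept as a symbol inside the pieces, `decode(encode(s, rules))` returns `s` unchanged in this sketch, with no language-specific rules involved.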

Features

  • Multiple subword algorithms
  • Subword regularization
  • Fast and lightweight
  • Self-contained
  • Direct vocabulary id generation
  • NFKC-based normalization
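Two of the features above, NFKC-based normalization and direct vocabulary id generation, can be illustrated with a short standard-library sketch. This is an assumption-laden illustration, not the library's code: plain `unicodedata` NFKC stands in for SentencePiece's NFKC-based rules, and `normalize`, `build_vocab`, and `pieces_to_ids` are hypothetical helper names.

```python
import unicodedata


def normalize(text):
    """Plain Unicode NFKC, used here as a stand-in for NFKC-based normalization."""
    return unicodedata.normalize("NFKC", text)


def build_vocab(pieces):
    """Assign consecutive integer ids to pieces; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for piece in pieces:
        vocab.setdefault(piece, len(vocab))
    return vocab


def pieces_to_ids(pieces, vocab):
    """Map pieces straight to ids, falling back to the unknown id."""
    return [vocab.get(piece, vocab["<unk>"]) for piece in pieces]


if __name__ == "__main__":
    # Full-width letters and the "fi" ligature fold to their ASCII equivalents.
    print(normalize("ＡＢＣ ﬁ"))
    vocab = build_vocab(["\u2581he", "llo", "\u2581he"])
    print(pieces_to_ids(["\u2581he", "zzz", "llo"], vocab))
```

Normalizing before segmentation means visually equivalent inputs map to the same pieces, and emitting ids directly removes the need for a separate vocabulary lookup step downstream.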

Categories

Machine Learning

License

Apache License V2.0

Additional Project Details

Operating Systems

Mac

Programming Language

C++

Related Categories

C++ Machine Learning Software

Registered

2021-10-06