SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is fixed before the neural model is trained. It implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) and extends them with direct training from raw sentences. This makes it possible to build a purely end-to-end system that does not depend on language-specific pre- or postprocessing. SentencePiece is purely data driven: it trains tokenization and detokenization models from sentences, and pre-tokenization (Moses tokenizer, MeCab, KyTea) is not always required. It treats sentences simply as sequences of Unicode characters, with no language-dependent logic.
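To make the "sequence of Unicode characters" idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation or API. It only assumes the convention that whitespace is escaped as the meta symbol "▁" (U+2581), so that detokenization becomes a plain string join. The functions `tokenize` and `detokenize`, the greedy longest-match strategy, and `toy_vocab` are hypothetical, chosen for illustration; a real model learns its vocabulary with BPE or the unigram language model.

```python
WS = "\u2581"  # "▁", the meta symbol used here to mark whitespace


def tokenize(text, vocab):
    """Greedy longest-match segmentation over a toy vocabulary (illustration only)."""
    # Escape spaces and add a leading dummy space so word-initial pieces look alike.
    escaped = WS + text.replace(" ", WS)
    pieces, i = [], 0
    while i < len(escaped):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in vocab or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces


def detokenize(pieces):
    """Join the pieces, restore spaces, and drop the leading dummy space."""
    return "".join(pieces).replace(WS, " ")[1:]


# Hypothetical hand-written vocabulary, not a trained one.
toy_vocab = {WS + "Hello", WS + "wor", "ld", "."}
pieces = tokenize("Hello world.", toy_vocab)
restored = detokenize(pieces)
print(pieces)
print(restored)
```

Because whitespace is carried inside the pieces themselves, no language-specific rule is needed to decide where spaces go when the text is reassembled.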

Features

Multiple subword algorithms
Subword regularization
Fast and lightweight
Self-contained
Direct vocabulary id generation
NFKC-based normalization
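As a rough illustration of what NFKC-based normalization does to input text, the sketch below uses Python's standard `unicodedata` module rather than SentencePiece itself. The helper name `normalize` and the sample strings are assumptions made for this example, and the library's own normalizer may apply additional rules on top of NFKC.

```python
import unicodedata


def normalize(text):
    """Apply Unicode NFKC normalization (standard library, not SentencePiece)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters fold to their ASCII counterparts.
fullwidth = normalize("\uff28\uff49")  # full-width "H" and "i"
# The "fi" ligature expands to two separate letters.
ligature = normalize("\ufb01")
# The ideographic space becomes an ordinary space.
space = normalize("\u3000")

print(repr(fullwidth), repr(ligature), repr(space))
```

Folding such variants before tokenization keeps visually equivalent strings from occupying separate vocabulary entries.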

License

Apache License V2.0

Additional Project Details

Registered

2021-10-06
