Fast State-of-the-art tokenizers, optimized for both research and production. Tokenizers provides an implementation of today’s most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in Transformers. Train new vocabularies and tokenize, using today’s most used tokenizers. Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU. Easy to use, but also extremely versatile. Designed for both research and production. Full alignment tracking. Even with destructive normalization, it’s always possible to get the part of the original sentence that corresponds to any token. Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.

Features

  • Train new vocabularies and tokenize, using today’s most used tokenizers
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU
  • Easy to use, but also extremely versatile
  • Designed for both research and production
  • Full alignment tracking
  • Truncation, Padding, add the special tokens your model needs

Project Samples

Project Activity

See All Activity >

License

Apache License V2.0

Follow Tokenizers

Tokenizers Web Site

You Might Also Like
Achieve perfect load balancing with a flexible Open Source Load Balancer Icon
Achieve perfect load balancing with a flexible Open Source Load Balancer

Take advantage of Open Source Load Balancer to elevate your business security and IT infrastructure with a custom ADC Solution.

Boost application security and continuity with SKUDONET ADC, our Open Source Load Balancer, that maximizes IT infrastructure flexibility. Additionally, save up to $470 K per incident with AI and SKUDONET solutions, further enhancing your organization’s risk management and cost-efficiency strategies.
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of Tokenizers!

Additional Project Details

Programming Language

Rust

Related Categories

Rust Artificial Intelligence Software, Rust Machine Learning Software

Registered

2023-03-23