text-dedup is a Python library for deduplicating large text corpora. It uses MinHash and other probabilistic techniques to detect near-duplicate content, which matters for NLP tasks where duplicated training data can skew model performance. text-dedup scales to billions of documents and offers tools for chunking, hashing, and comparing text with low memory usage. It supports Jaccard similarity thresholding, parallel execution, and flexible deduplication strategies, which makes it well suited to cleaning web-scraped data, language model training datasets, and document archives.
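To make Jaccard similarity thresholding concrete, here is a minimal standard-library sketch. It is not text-dedup's API: the function names `shingles`, `jaccard`, and `dedup`, the word 3-gram chunking, and the 0.5 threshold are all assumptions chosen for illustration. It compares every document against every kept document exactly, which is only practical for small inputs.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text. Hypothetical helper."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle (or an empty set if blank).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(docs, threshold=0.5, n=3):
    """Keep a document only if it is below the threshold against all kept ones."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about text corpora",
]
print(jaccard(shingles(docs[0]), shingles(docs[1])))  # 0.75 (6 shared of 8 total shingles)
print(dedup(docs))  # the first and third documents are kept
```

Raising the threshold keeps more near-duplicates, and lowering it removes more aggressively. The right value depends on the corpus.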
Features
- Fast and scalable near-duplicate detection
- Uses MinHash and Jaccard similarity for fuzzy matching
- Designed for web-scale datasets with billions of documents
- Supports customizable deduplication thresholds
- Multi-threaded and memory-efficient processing
- Hashing-based representation of text chunks
- Optional GPU acceleration for faster computation
- Suitable for cleaning NLP and LLM training data
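The MinHash idea named in the feature list can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not text-dedup's implementation: `shingles`, `minhash_signature`, and `estimate_jaccard` are hypothetical names, and the 128 salted BLAKE2b hashes stand in for whatever hash family the library actually uses. The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams in text. Hypothetical helper."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: the minimum hash value per salted hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt simulates a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
print(estimate_jaccard(a, b))  # an estimate of the true similarity, 0.75
```

Signatures have a fixed size regardless of document length, so they are cheap to store and compare. At larger scale, MinHash is commonly paired with locality-sensitive hashing so that only documents likely to match are compared at all.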
Categories
Stream Processing
License
Apache License V2.0