text-dedup

text-dedup is a Python library for deduplicating large text corpora. It uses MinHash and other probabilistic techniques to detect near-duplicate content, which is especially useful for NLP tasks where duplicated training data can skew model performance. It is designed to scale to billions of documents and provides tools for chunking, hashing, and comparing text with low memory usage. It supports Jaccard similarity thresholding, parallel execution, and flexible deduplication strategies, which suits it to cleaning web-scraped data, language model training datasets, and document archives.
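To make Jaccard similarity thresholding concrete, here is a minimal sketch in plain Python. It is not text-dedup's API: the function names, the word-level 3-gram shingling, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle; empty texts an empty set.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(x, y, threshold=0.5, n=3):
    """Flag a pair of texts whose shingle sets meet the threshold."""
    return jaccard(shingles(x, n), shingles(y, n)) >= threshold


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
print(jaccard(shingles(doc_a), shingles(doc_b)))  # 6 shared of 8 total -> 0.75
print(is_near_duplicate(doc_a, doc_b))            # True at threshold 0.5
```

Comparing every pair exactly like this costs quadratic time in the number of documents, which is why large corpora are usually handled with approximate methods such as MinHash.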

Features

Fast and scalable near-duplicate detection
Uses MinHash and Jaccard similarity for fuzzy matching
Designed for web-scale datasets with billions of documents
Supports customizable deduplication thresholds
Multi-threaded and memory-efficient processing
Hashing-based representation of text chunks
Optional GPU acceleration for faster computation
Suitable for cleaning NLP and LLM training data
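
The MinHash idea behind the fuzzy-matching feature can be sketched with the standard library alone. This is an illustrative toy, not the library's implementation: the helper names, the seeded BLAKE2b hashing, and the choice of 128 hash functions are assumptions.

```python
import hashlib


def word_shingles(text, n=3):
    """Set of word n-grams; hypothetical helper for this sketch."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(token, seed):
    """64-bit hash of token under a given seed (stands in for one permutation)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """Keep the minimum hash value of the set under each seeded hash function."""
    if not tokens:
        raise ValueError("cannot compute a signature for an empty set")
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates Jaccard."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature(word_shingles("the quick brown fox jumps over the lazy dog"))
sig_b = minhash_signature(word_shingles("the quick brown fox jumps over the lazy cat"))
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the true value, 0.75
```

Each signature has a fixed length regardless of document size, so documents can be compared, or bucketed with locality-sensitive hashing, without holding full shingle sets in memory.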

License

Apache License V2.0

Additional Project Details

Programming Language

Python

Related Categories

Python Stream Processing Tool

Registered

2025-04-08
