text-dedup

text-dedup is a Python library that enables efficient deduplication of large text corpora by using MinHash and other probabilistic techniques to detect near-duplicate content. This is especially useful for NLP tasks where duplicated training data can skew model performance. text-dedup scales to billions of documents and offers tools for chunking, hashing, and comparing text efficiently with low memory usage. It supports Jaccard similarity thresholding, parallel execution, and flexible deduplication strategies, making it ideal for cleaning web-scraped data, language model training datasets, or document archives.

Features

Fast and scalable near-duplicate detection
Uses MinHash and Jaccard similarity for fuzzy matching
Designed for web-scale datasets with billions of documents
Supports customizable deduplication thresholds
Multi-threaded and memory-efficient processing
Hashing-based representation of text chunks
Optional GPU acceleration for faster computation
Suitable for cleaning NLP and LLM training data

Project Samples

Project Activity

See All Activity >

License

Apache License V2.0

Follow text-dedup

text-dedup Web Site

Other Useful Business Software

$300 Free Credits for Your Google Cloud Projects

Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.

Start Free Trial

Rate This Project

User Reviews

Be the first to post a review of text-dedup!

Additional Project Details

Programming Language

Python

Related Categories

Python Stream Processing Tool

Registered

2025-04-08

Similar Business Software

Ably

Ably is the definitive realtime experience platform. We power more WebSocket connections than any other pub/sub platform, serving over a billion devices monthly. Businesses like HubSpot, NASCAR and Webflow trust us to power their critical applications - reliably, securely and at serious...

See Software
RudderStack

RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by pulling analysis from your data warehouse to trigger enrichment and activation in customer tools for identity stitching and other advanced use cases. Start...

See Software
groundcover

Cloud-based observability solution that helps businesses track and manage workload and performance on a unified dashboard. Monitor everything you run in your cloud without compromising on cost, granularity, or scale. groundcover is a full stack cloud-native APM platform designed to make...

See Software
Aiven

Aiven manages your open source data infrastructure in the cloud - so you don't have to. Developers can do what they do best: create applications. We do what we do best: manage cloud data infrastructure. All solutions are open source. You can also freely move data between clouds or create...

See Software
Nussknacker

Nussknacker is a low-code visual tool for domain experts to define and run real-time decisioning algorithms instead of implementing them in the code. It serves where real-time actions on data have to be made: real-time marketing, fraud detection, Internet of Things, Customer 360, and Machine...

See Software
PubNub

Innovate with Realtime Features: We take care of realtime communication infrastructure so you can focus on your app. Our Platform for Realtime Communication: A platform to build and operate real-time interactivity for web, mobile, AI/ML, IoT, and Edge computing applications Faster &...

See Software

Report inappropriate content

text-dedup

All-in-one text de-duplication

Get an email when there's a new version of text-dedup

Features

Project Samples

Project Activity

Categories

License

Follow text-dedup

User Reviews

Additional Project Details

Programming Language

Related Categories

Registered