Redundancy introduced by copy-and-paste operations in text biases machine learning models for NLP.
This module takes a directory and returns a list containing a subset of its files, chosen so that the similarity between any two files in the subset stays below a given upper bound.

Features

  • Identifies copy-and-paste redundancy in a document corpus
  • Input: a folder of text documents and a similarity threshold
  • Output (a): a list of non-redundant documents (a non-redundant subset of the corpus)
  • Output (b): a list of document pairs found to be redundant, with the amount of redundancy for each pair
  • Python script (2.6), tested on various Linux flavours and Windows XP/7
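
The input and the two outputs listed above can be illustrated with a minimal sketch. This is not the project's own code: the similarity measure (overlap of word 5-grams), the greedy selection order, and the names shingles, redundancy, select_non_redundant and read_folder are all assumptions made for illustration, and the sketch targets Python 3 rather than the 2.6 the script was written for.

```python
import os
from itertools import combinations


def shingles(text, n=5):
    """Return the set of word n-grams in text (assumed similarity unit)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def redundancy(a, b):
    """Share of the smaller n-gram set that also occurs in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def select_non_redundant(texts, threshold):
    """texts maps document name to content; returns (kept, pairs).

    kept  : names whose pairwise redundancy never exceeds threshold
    pairs : (name_a, name_b, score) for every pair above threshold
    """
    sets = {name: shingles(body) for name, body in texts.items()}

    # Output (b): every redundant pair with its score.
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = redundancy(sets[x], sets[y])
        if score > threshold:
            pairs.append((x, y, score))

    # Output (a): greedily keep a document only if it is not redundant
    # with any document already kept (hypothetical selection strategy).
    kept = []
    for name in sorted(sets):
        if all(redundancy(sets[name], sets[k]) <= threshold for k in kept):
            kept.append(name)
    return kept, pairs


def read_folder(folder):
    """Read every regular file in folder into a name -> text mapping."""
    texts = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8", errors="replace") as handle:
                texts[name] = handle.read()
    return texts
```

Under these assumptions, calling select_non_redundant(read_folder("corpus"), 0.3) on a hypothetical folder named corpus would return the retained file names together with the flagged pairs. Comparing all pairs is quadratic in the number of documents, which is acceptable for a sketch but may be too slow for a large corpus.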

License

GNU General Public License version 3.0 (GPLv3)

Corpus redundancy manager Web Site

Additional Project Details

Intended Audience

Science/Research

User Interface

Console/Terminal

Programming Language

Python

Related Categories

Python Linguistics Software, Python Natural Language Processing (NLP) Tool

Registered

2011-05-09