Redundancy introduced by copy-and-paste operations in text biases machine learning models for NLP.
This module takes a directory and returns a list containing a subset of its files, chosen so that the similarity between any two files in the subset does not exceed a given upper bound.

Features

  • Identifies copy-and-paste redundancy in a document corpus
  • Input: a folder of text documents and a similarity threshold
  • Output (a): a list of non-redundant documents (a non-redundant subset of the corpus)
  • Output (b): a list of document pairs found to be redundant, with the amount of redundancy for each pair
  • Python script (2.6), tested on various Linux flavours and Windows XP/7
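
The input/output behaviour listed above can be illustrated with a minimal sketch. It is not the project's actual code: the listing does not say which similarity measure the script uses, so this example assumes Jaccard similarity over word trigrams, and the names shingles, similarity, filter_redundant and read_folder are hypothetical. It is written for Python 3 rather than the 2.6 mentioned above.

```python
import os


def shingles(text, n=3):
    """Return the set of word n-grams in text (assumed representation)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a, b):
    """Jaccard similarity of two shingle sets, in the range [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_redundant(docs, threshold):
    """Greedily select documents whose pairwise similarity stays <= threshold.

    docs maps a document name to its text. Returns (kept, redundant_pairs),
    where redundant_pairs holds (kept_name, rejected_name, score) tuples.
    Each candidate is compared only against documents already kept.
    """
    sets = {name: shingles(text) for name, text in docs.items()}
    kept = []
    redundant_pairs = []
    for name in sorted(sets):
        is_redundant = False
        for other in kept:
            score = similarity(sets[name], sets[other])
            if score > threshold:
                redundant_pairs.append((other, name, score))
                is_redundant = True
        if not is_redundant:
            kept.append(name)
    return kept, redundant_pairs


def read_folder(folder):
    """Read every regular file in folder into a name -> text dict."""
    docs = {}
    for fname in sorted(os.listdir(folder)):
        path = os.path.join(folder, fname)
        if os.path.isfile(path):
            with open(path, encoding="utf-8", errors="replace") as fh:
                docs[fname] = fh.read()
    return docs
```

With this sketch, filter_redundant(read_folder("corpus"), 0.5) would return the two outputs described above for a hypothetical folder named corpus. The greedy pass depends on file order, so a different ordering can keep a different (but still non-redundant) subset.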

License

GNU General Public License version 3.0 (GPLv3)

Additional Project Details

Intended Audience

Science/Research

User Interface

Console/Terminal

Programming Language

Python

Related Categories

Python Linguistics Software, Python Natural Language Processing (NLP) Tool

Registered

2011-05-09