EntropyGuard is a local-first CLI tool designed to sanitize RAG and LLM training datasets. It prevents model collapse by eliminating duplicate and low-entropy data without sending sensitive files to the cloud.

Unlike scripts that load a whole file into memory and crash on large inputs, EntropyGuard uses Polars LazyFrames to process datasets significantly larger than available RAM (e.g., 100GB+ on a standard laptop) without OOM errors.
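
The out-of-core idea can be illustrated without Polars. The sketch below is not EntropyGuard's code and does not use LazyFrames; it is a standard-library stand-in showing how reading fixed-size batches keeps memory bounded by the batch size rather than the file size. The function name `iter_batches` and the `batch_size` parameter are hypothetical.

```python
import io
from typing import Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size non-empty records.

    Only one batch is held in memory at a time, so the input can be
    far larger than RAM as long as it can be read sequentially.
    """
    batch: List[str] = []
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue  # skip blank lines
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


# Tiny in-memory stand-in for a large file on disk.
sample = io.StringIO("alpha\nbeta\n\ngamma\n")
batches = list(iter_batches(sample, batch_size=2))
```

With a real file, passing the open file handle instead of `io.StringIO` gives the same behavior, since file objects iterate line by line.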

Hybrid Architecture:

Exact Dedup: Uses xxHash on the CPU to quickly remove identical records (the project cites roughly 60% of noise stripped at this stage).

Semantic Dedup: Uses local AI embeddings (SentenceTransformers + FAISS) to detect fuzzy duplicates (e.g., "Hello world" vs "Hi world").
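
The two stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not EntropyGuard's implementation: `hashlib.blake2b` stands in for xxHash, and token-set Jaccard similarity stands in for SentenceTransformers embeddings searched with FAISS. The function names and the 0.8 threshold are hypothetical. Token overlap only catches near-identical wording; synonym-level pairs such as "Hello world" vs "Hi world" are the case that real embeddings are meant to handle.

```python
import hashlib
from typing import List


def exact_dedup(records: List[str]) -> List[str]:
    """Stage 1: keep the first occurrence of each identical record."""
    seen = set()
    kept = []
    for text in records:
        # blake2b is a stand-in here; the project description names xxHash.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a crude proxy for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_dedup(records: List[str], threshold: float = 0.8) -> List[str]:
    """Stage 2: drop records too similar to one already kept (O(n^2) scan)."""
    kept: List[str] = []
    for text in records:
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",   # exact duplicate, removed in stage 1
    "the cat sat on a mat",     # near duplicate, removed in stage 2
    "stock prices fell sharply today",
]
stage1 = exact_dedup(docs)
stage2 = fuzzy_dedup(stage1)
```

Running the cheap exact pass first shrinks the input to the expensive similarity pass, which is the usual motivation for a hybrid design like this.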

Key Features:

Zero API Costs: No external calls or privacy risks.

Fault Tolerant: Built-in checkpoint/resume system.

Pipe-friendly: Integrates seamlessly with Unix pipelines.

Open Source: MIT Licensed.
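
The checkpoint/resume and pipe-friendly features listed above can be pictured with a small stream filter. This is a hedged sketch only: EntropyGuard's actual checkpoint format and CLI behavior are not described here, so the JSON layout, the function names, and the `every` parameter are all assumptions, and `hashlib.blake2b` again stands in for xxHash.

```python
import hashlib
import io
import json
import os
import tempfile


def load_checkpoint(path):
    """Return (lines_done, seen_hashes), or a fresh state if no file exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            state = json.load(fh)
        return state["lines_done"], set(state["seen"])
    return 0, set()


def save_checkpoint(path, lines_done, seen):
    """Write state to a temp file, then rename so a crash never leaves half a file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"lines_done": lines_done, "seen": sorted(seen)}, fh)
    os.replace(tmp, path)


def dedup_stream(src, dst, checkpoint_path, every=1000):
    """Copy unique lines from src to dst, skipping lines a previous run handled."""
    lines_done, seen = load_checkpoint(checkpoint_path)
    for index, line in enumerate(src):
        if index < lines_done:
            continue  # already processed before the interruption
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).hexdigest()
        if digest not in seen:
            seen.add(digest)
            dst.write(line)
        lines_done = index + 1
        if lines_done % every == 0:
            save_checkpoint(checkpoint_path, lines_done, seen)
    save_checkpoint(checkpoint_path, lines_done, seen)


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "dedup.ckpt.json")
    first = io.StringIO()
    dedup_stream(io.StringIO("a\nb\na\n"), first, ckpt, every=2)
    # Second run sees the same input plus two new lines; it resumes after line 3.
    second = io.StringIO()
    dedup_stream(io.StringIO("a\nb\na\nc\nb\n"), second, ckpt, every=2)
    first_out, second_out = first.getvalue(), second.getvalue()
```

Because the function only needs an iterable of lines and a writable object, passing `sys.stdin` and `sys.stdout` would turn it into a Unix-style filter. Lines written after the last saved checkpoint could be emitted again after a crash, so a production design would also need to reconcile the output file on resume.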

Stop training your models on garbage data.

Install: pip install entropyguard

Additional Project Details

Registered

2025-12-27