EntropyGuard is a local-first CLI tool designed to sanitize RAG and LLM training datasets. It prevents model collapse by eliminating duplicate and low-entropy data without sending sensitive files to the cloud.
Unlike scripts that crash on large files, EntropyGuard uses Polars LazyFrames to process datasets significantly larger than available RAM (e.g., 100GB+ on a standard laptop) without OOM errors.
Hybrid Architecture:
Exact Dedup: Uses xxHash to instantly strip ~60% of identical noise (CPU-based).
Semantic Dedup: Uses local AI embeddings (SentenceTransformers + FAISS) to detect fuzzy duplicates (e.g., "Hello world" vs "Hi world").
Key Features:
Zero API Costs: No external calls or privacy risks.
Fault Tolerant: Built-in checkpoint/resume system.
Pipe-friendly: Integrates seamlessly with Unix pipelines.
Open Source: MIT Licensed.
Stop training your models on garbage data. pip install entropyguard
entropyguard
Local-first semantic deduplication for datasets larger than RAM.
Brought to you by:
damiansiuta
Downloads:
0 This Week