# 🛡️ EntropyGuard v1.22.1

**The Unbreakable RAG Data Cleaner**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://www.docker.com/) [![Production Ready](https://img.shields.io/badge/status-production--ready-green.svg)](https://github.com/DamianSiuta/entropyguard)

**Enterprise-grade semantic data deduplication and sanitization engine for LLM training data.**

[Features](#-key-features) • [Quick Start](#-quick-start) • [Installation](#-installation) • [Documentation](#-documentation)

## Why EntropyGuard?

### The Problem: Dirty Data = Hallucinations & Wasted Money

Training Large Language Models on contaminated, redundant, or low-quality data leads to:

- Model Collapse — Degraded performance from duplicate content
- Hallucinations — Inaccurate outputs from poor training data
- Wasted Compute — Paying to process duplicate data multiple times
- Compliance Risks — PII and sensitive data in training sets

### The Solution: Local CPU Processing with Hybrid Deduplication

EntropyGuard runs 100% locally on your CPU—no data ever leaves your machine. Perfect for:

- Air-gapped environments (no cloud dependencies)
- Privacy compliance (GDPR, HIPAA, SOC 2)
- Cost efficiency (no API calls, no cloud fees)
- Enterprise security (complete data sovereignty)

## ✨ Key Features

### 🛡️ Fault Tolerant

- Checkpoint/Resume System — Automatic recovery from failures
- Memory Safety — Chunked processing prevents OOM errors
- Graceful Shutdown — SIGINT/SIGTERM handling (Windows + Unix)
- Error Recovery — Automatic retry with exponential backoff
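
To illustrate the checkpoint/resume idea (this is a sketch, not EntropyGuard's actual implementation; the `next_row` JSON file layout is invented for the example):

```python
import json
import os

def process_with_checkpoint(rows, checkpoint_path):
    """Resume-aware processing sketch: persist the index of the next
    unprocessed row so a rerun after a crash skips completed work."""
    start = 0
    if os.path.exists(checkpoint_path):  # auto-resume if a checkpoint exists
        with open(checkpoint_path) as f:
            start = json.load(f)["next_row"]
    processed = []
    for i in range(start, len(rows)):
        processed.append(rows[i].upper())  # stand-in for the real pipeline work
        with open(checkpoint_path, "w") as f:
            json.dump({"next_row": i + 1}, f)  # commit progress after each row
    return processed
```

A rerun against the same checkpoint directory picks up where the crashed run left off instead of starting over.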

### 🚀 High Performance

- Hybrid Engine — Hash-based exact dedup + AI semantic similarity
- Unix Pipes Support — Stream processing for data engineering workflows
- Lazy Evaluation — Polars LazyFrame for datasets larger than RAM
- Optimized Memory — Pre-materialization checks prevent OOM
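
A rough sketch of the two-stage hybrid idea (a toy bag-of-words vector stands in for the real sentence-transformer embedding; all names here are illustrative, not EntropyGuard's API):

```python
import hashlib
import math
from collections import Counter

def exact_key(text: str) -> str:
    # Stage 1: hash-based exact dedup on whitespace-normalized text.
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real engine uses sentence-transformers.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_dedup(texts, threshold=0.95):
    seen_hashes, kept, kept_vecs = set(), [], []
    for text in texts:
        if exact_key(text) in seen_hashes:
            continue                         # exact duplicate
        vec = embed(text)
        if any(cosine(vec, v) >= threshold for v in kept_vecs):
            continue                         # stage 2: semantic near-duplicate
        seen_hashes.add(exact_key(text))
        kept.append(text)
        kept_vecs.append(vec)
    return kept

rows = ["The cat sat.", "the  cat sat.", "The cat sat down today.", "Dogs bark."]
print(hybrid_dedup(rows))
```

The cheap hash pass filters byte-identical rows before any embedding work, which is why the hybrid approach stays fast on heavily duplicated corpora.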

### 📉 Memory Safe

- Chunked Processing — Process datasets larger than available RAM
- Memory Profiling — Track memory usage per pipeline stage
- Resource Guards — Disk space and memory checks before operations

### 📊 Observability

- Prometheus Metrics — Export pipeline metrics for monitoring
- Structured Logging — JSON logs with correlation IDs
- Progress Tracking — Real-time ETA and throughput estimation
- Audit Logs — Complete audit trail of all operations
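
What "JSON logs with correlation IDs" looks like in practice can be sketched with the standard library alone (EntropyGuard itself uses structlog, so this is a stand-in, not its actual logging setup):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying a correlation id."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("entropyguard.demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Tag every log line from one pipeline run with the same id, so log
# aggregation systems can group all events belonging to that run.
logger.info("dedup stage complete", extra={"correlation_id": str(uuid.uuid4())})
```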

### 🔒 Enterprise Ready

- Standard Exit Codes — sysexits.h compliant for automation
- Type Safety — Full type hints (MyPy strict compatible)
- Configuration Validation — Pydantic-based schema validation
- Input Validation — Format detection and consistency checks

## ⚡ Quick Start

### The "Magic" Command

```bash
# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
```
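
Conceptually, pipe mode is line-oriented NDJSON filtering. A minimal stand-in (exact-match dedup only; the `text` field name is an assumption for the example):

```python
import hashlib
import json

def stream_dedup(lines, key="text"):
    """Yield each NDJSON line whose text field has not been seen before."""
    seen = set()
    for line in lines:
        record = json.loads(line)
        fingerprint = hashlib.sha256(record[key].encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield json.dumps(record)

# In a real pipe you would feed sys.stdin in and print each surviving line out.
rows = ['{"text": "hello"}', '{"text": "world"}', '{"text": "hello"}']
print(list(stream_dedup(rows)))
```

Because records stream through one line at a time, a filter like this composes with other Unix tools (`grep`, `jq`, `head`) on either side of the pipe.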

### Basic Usage

```bash
# File-to-file processing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --text-column text \
  --dedup-threshold 0.95

# With custom settings
entropyguard \
  --input data.ndjson \
  --output cleaned.ndjson \
  --text-column content \
  --min-length 100 \
  --dedup-threshold 0.9 \
  --chunk-size 500
```

### Advanced: Checkpoint & Resume

```bash
# Enable automatic checkpoint recovery
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --text-column text

# Resume from checkpoint manually
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --resume \
  --text-column text
```

## 📦 Installation

### Option 1: Install from PyPI

```bash
pip install entropyguard
```

Requirements:

- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)

### Option 2: Install from Git

```bash
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
```

Requirements:

- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
- git available on your system

### Option 3: Docker

```bash
# Build image
docker build -t entropyguard:latest .

# Run container
docker run -v $(pwd):/data entropyguard:latest \
  --input /data/input.jsonl \
  --output /data/output.jsonl \
  --text-column text
```

### Option 4: Development Setup

```bash
git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install
```

## 📋 CLI Flags Reference

Complete reference for all available flags:

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| **Input/Output** | | | |
| `--input` | string | `-` (stdin) | Path to input file (CSV, JSON, NDJSON). Use `-` for stdin |
| `--output` | string | `-` (stdout) | Path to output file (NDJSON). Use `-` for stdout |
| `--text-column` | string | auto-detect | Name of the text column to process. Auto-detects the first string column if omitted |
| `--required-columns` | string | None | Comma-separated list of required columns (optional schema validation) |
| **Processing Options** | | | |
| `--min-length` | int | 50 | Minimum text length after sanitization (characters) |
| `--dedup-threshold` | float | 0.95 | Similarity threshold for semantic deduplication (0.0-1.0). Higher = stricter |
| `--model-name` | string | `all-MiniLM-L6-v2` | Sentence-transformers model for embeddings. Use `paraphrase-multilingual-MiniLM-L12-v2` for multilingual data |
| `--batch-size` | int | 10000 | Batch size for embedding processing. Reduce for low-memory systems |
| **Chunking (RAG)** | | | |
| `--chunk-size` | int | None | Chunk size (characters) for splitting long texts. Disabled if not set |
| `--chunk-overlap` | int | 50 | Overlap (characters) between consecutive chunks. Only used with `--chunk-size` |
| `--separators` | list | default | Custom separators for chunking (space-separated). Use `\n` for newline, `\t` for tab |
| **Checkpoint & Resume** | | | |
| `--checkpoint-dir` | string | None | Directory to save checkpoints for error recovery |
| `--resume` | flag | false | Resume from the last checkpoint if available. Requires `--checkpoint-dir` |
| `--no-auto-resume` | flag | false | Disable automatic checkpoint recovery (requires explicit `--resume`) |
| **Logging & Output** | | | |
| `--verbose` | flag | false | Enable verbose logging (INFO level) |
| `--debug` | flag | false | Enable debug mode (DEBUG level + full tracebacks). Implies `--verbose` |
| `--demo` | flag | false | Demo mode: hide INFO logs, show only progress bars and the final summary |
| `--quiet` | flag | false | Disable progress bars (useful for CI/CD) |
| `--json` | flag | false | Output results as JSON (machine-readable format) |
| `--json-logs` | flag | false | Output logs as JSON (for log aggregation systems) |
| **Monitoring & Profiling** | | | |
| `--profile-memory` | flag | false | Enable memory profiling. Tracks usage at each pipeline stage |
| `--memory-report-path` | string | None | Path to save the memory profiling report (JSON). Requires `--profile-memory` |
| `--metrics-port` | int | None | Start a Prometheus metrics HTTP server on the specified port |
| `--audit-log` | string | None | Path to a JSON file for the audit log of dropped/duplicate rows |
| **Configuration** | | | |
| `--config` | string | auto-detect | Path to a config file (JSON/YAML/TOML). Auto-detects `.entropyguardrc` in the current/home dir |
| **Utility** | | | |
| `--dry-run` | flag | false | Simulate processing without expensive operations. Shows statistics only |
| `--version` | flag | - | Show the version number and exit |

### Flag Categories Explained

Input/Output: Control where data comes from and goes to. Supports Unix pipes (- for stdin/stdout).

Processing Options: Core deduplication settings. --dedup-threshold controls how similar texts must be to be considered duplicates (0.95 = 95% similarity).

Chunking (RAG): For Retrieval-Augmented Generation workflows. Splits long texts into smaller chunks with configurable overlap.
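
The overlap behavior can be pictured with a simple character window (the real splitter is recursive and separator-aware, so treat this as an approximation of what `--chunk-size` and `--chunk-overlap` do):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Sliding-window chunking: each chunk repeats the last
    `chunk_overlap` characters of the previous one, so no sentence
    is cut off without context at a chunk boundary."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# A 1200-character text with --chunk-size 500 --chunk-overlap 50
# yields three chunks of 500, 500, and 300 characters.
parts = chunk_text("a" * 1200, chunk_size=500, chunk_overlap=50)
```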

Checkpoint & Resume: Fault tolerance features. Automatically saves progress and can resume from failures.

Logging & Output: Control verbosity and output format. --demo is perfect for video demonstrations.

Monitoring & Profiling: Production observability. Memory profiling helps debug OOM issues, Prometheus metrics enable monitoring.

Configuration: Use config files to avoid repeating flags. CLI arguments override config file values.
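
The precedence rule (built-in defaults, overridden by the config file, overridden by CLI flags) can be sketched as follows; the merge logic is illustrative, not EntropyGuard's actual loader:

```python
import json

def effective_config(config_path, cli_args):
    """Illustrative precedence only: defaults < config file < CLI flags."""
    settings = {"min_length": 50, "dedup_threshold": 0.95}  # documented defaults
    if config_path:
        with open(config_path) as f:
            settings.update(json.load(f))                   # config file layer
    # CLI flags that were actually passed (not None) win over everything.
    settings.update({k: v for k, v in cli_args.items() if v is not None})
    return settings
```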


## 🏢 Enterprise / Advanced Usage

### Configuration File (.entropyguardrc.json)

Create a configuration file in your home directory or project root:

```json
{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "chunk_size": 500,
  "chunk_overlap": 50,
  "remove_pii": true,
  "normalize_text": true,
  "show_progress": true
}
```

Then run:

```bash
entropyguard --input data.jsonl --output clean.jsonl
```

### Monitoring & Observability

```bash
# Enable Prometheus metrics
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --metrics-port 9090 \
  --text-column text

# Enable memory profiling
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --profile-memory \
  --text-column text

# JSON logs for machine parsing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --json-logs \
  --text-column text
```

### Exit Codes

EntropyGuard follows the sysexits.h standard:

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid arguments) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (internal bug) |
| 130 | Process interrupted (SIGINT/Ctrl+C) |
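
In automation you would branch on these codes after each run. A sketch of that pattern (the `raise SystemExit(65)` subprocess stands in for a real `entropyguard` invocation, and the messages follow the table above):

```python
import subprocess
import sys

# Map the documented exit codes to human-readable outcomes.
MESSAGES = {
    0: "pipeline succeeded",
    2: "usage error (invalid arguments)",
    65: "input file error",
    130: "interrupted (SIGINT)",
}

def describe(cmd):
    """Run a command and translate its exit code for automation logs."""
    code = subprocess.run(cmd).returncode
    return MESSAGES.get(code, f"failed with exit code {code}")

print(describe([sys.executable, "-c", "raise SystemExit(65)"]))
```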

## 📊 Comparison

| Feature | EntropyGuard | Basic Scripts | Vector DBs |
|---------|--------------|---------------|------------|
| Exact Deduplication | ✅ Hash-based (fast) | ⚠️ Manual | |
| Semantic Deduplication | ✅ AI-powered | | |
| Local Processing | ✅ 100% local | | ⚠️ Requires DB |
| Memory Safety | ✅ Chunked processing | ⚠️ Manual | ⚠️ Depends on DB |
| Fault Tolerance | ✅ Checkpoint/Resume | | ⚠️ Depends on DB |
| Unix Pipes | ✅ Native support | ⚠️ Manual | |
| Observability | ✅ Metrics + Logs | | ⚠️ Depends on DB |
| Configuration | ✅ Pydantic validation | | ⚠️ DB-specific |
| Type Safety | ✅ Full type hints | ⚠️ Depends on language | |

## 🛠️ Tech Stack

- Core: Python 3.10+, Polars (LazyFrame)
- AI/ML: PyTorch (CPU), FAISS, Sentence-Transformers
- Validation: Pydantic v2
- Logging: structlog (optional)
- Metrics: Prometheus Client (optional)
- Infrastructure: Poetry, Docker-ready

## 📋 Edition Comparison

EntropyGuard is available in two editions:

| Feature | Community (Open Source) | Enterprise |
|---------|-------------------------|------------|
| CLI Tool | ✅ Full-featured | ✅ Full-featured |
| Semantic Deduplication | ✅ Unlimited | ✅ Unlimited |
| PII Removal | ✅ Unlimited | ✅ Unlimited |
| Data Formats | ✅ All formats | ✅ All formats |
| Docker Support | ✅ Yes | ✅ Yes |
| Audit Logs | ✅ Yes | ✅ Enhanced |
| Web Dashboard | ❌ | ✅ Professional Analytics Platform |
| Real-time Monitoring | ❌ | ✅ Live telemetry & metrics |
| Alert System | ❌ | ✅ Custom alert rules (Watchtower) |
| API Access | ❌ | ✅ RESTful API |
| SSO Integration | ❌ | ✅ SAML 2.0, OAuth 2.0 |
| Support | Community | Priority support with SLA |
| License | MIT License | Commercial license required |

> **📌 Legal Notice:** Enterprise features (Control Plane, Dashboard, API, Alerting System) are proprietary software covered by a commercial license. These components are NOT included in the Open Source release and are NOT subject to the MIT license terms.


## 📚 Documentation


## 🤝 Contributing

Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.


## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


## 🙏 Acknowledgments

Built with ❤️ by the EntropyGuard Team

Special thanks to:


**[⬆ Back to Top](#-entropyguard-v1221)**

Made with ❤️ for the LLM community