# 🛡️ EntropyGuard v1.22.1

**The Unbreakable RAG Data Cleaner**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://www.docker.com/) [![Production Ready](https://img.shields.io/badge/status-production--ready-green.svg)](https://github.com/DamianSiuta/entropyguard)

**Enterprise-grade semantic data deduplication and sanitization engine for LLM training data.**

[Features](#-key-features) • [Quick Start](#-quick-start) • [Installation](#-installation) • [Documentation](#-documentation)

## Why EntropyGuard?

### The Problem: Dirty Data = Hallucinations & Wasted Money

Training Large Language Models on contaminated, redundant, or low-quality data leads to:

- Model Collapse — Degraded performance from duplicate content
- Hallucinations — Inaccurate outputs from poor training data
- Wasted Compute — Paying to process duplicate data multiple times
- Compliance Risks — PII and sensitive data in training sets

### The Solution: Local CPU Processing with Hybrid Deduplication

EntropyGuard runs 100% locally on your CPU—no data ever leaves your machine. Perfect for:

- Air-gapped environments (no cloud dependencies)
- Privacy compliance (GDPR, HIPAA, SOC 2)
- Cost efficiency (no API calls, no cloud fees)
- Enterprise security (complete data sovereignty)

## ✨ Key Features

### 🛡️ Fault Tolerant

- Checkpoint/Resume System — Automatic recovery from failures
- Memory Safety — Chunked processing prevents OOM errors
- Graceful Shutdown — SIGINT/SIGTERM handling (Windows + Unix)
- Error Recovery — Automatic retry with exponential backoff
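
To illustrate the checkpoint/resume idea (this is a sketch, not EntropyGuard's actual implementation; the `next_row` JSON file layout is invented for the example):

```python
import json
import os

def process_with_checkpoint(rows, checkpoint_path):
    """Resume-aware processing sketch: persist the index of the next
    unprocessed row so a rerun after a crash skips completed work."""
    start = 0
    if os.path.exists(checkpoint_path):  # auto-resume if a checkpoint exists
        with open(checkpoint_path) as f:
            start = json.load(f)["next_row"]
    processed = []
    for i in range(start, len(rows)):
        processed.append(rows[i].upper())  # stand-in for the real pipeline work
        with open(checkpoint_path, "w") as f:
            json.dump({"next_row": i + 1}, f)  # commit progress after each row
    return processed
```

A rerun against the same checkpoint directory picks up where the crashed run left off instead of starting over.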

### 🚀 High Performance

- Hybrid Engine — Hash-based exact dedup + AI semantic similarity
- Unix Pipes Support — Stream processing for data engineering workflows
- Lazy Evaluation — Polars LazyFrame for datasets larger than RAM
- Optimized Memory — Pre-materialization checks prevent OOM
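
A rough sketch of the two-stage hybrid idea (a toy bag-of-words vector stands in for the real sentence-transformer embedding; all names here are illustrative, not EntropyGuard's API):

```python
import hashlib
import math
from collections import Counter

def exact_key(text: str) -> str:
    # Stage 1: hash-based exact dedup on whitespace-normalized text.
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real engine uses sentence-transformers.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_dedup(texts, threshold=0.95):
    seen_hashes, kept, kept_vecs = set(), [], []
    for text in texts:
        if exact_key(text) in seen_hashes:
            continue                         # exact duplicate
        vec = embed(text)
        if any(cosine(vec, v) >= threshold for v in kept_vecs):
            continue                         # stage 2: semantic near-duplicate
        seen_hashes.add(exact_key(text))
        kept.append(text)
        kept_vecs.append(vec)
    return kept

rows = ["The cat sat.", "the  cat sat.", "The cat sat down today.", "Dogs bark."]
print(hybrid_dedup(rows))
```

The cheap hash pass filters byte-identical rows before any embedding work, which is why the hybrid approach stays fast on heavily duplicated corpora.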

### 📉 Memory Safe

- Chunked Processing — Process datasets larger than available RAM
- Memory Profiling — Track memory usage per pipeline stage
- Resource Guards — Disk space and memory checks before operations

### 📊 Observability

- Prometheus Metrics — Export pipeline metrics for monitoring
- Structured Logging — JSON logs with correlation IDs
- Progress Tracking — Real-time ETA and throughput estimation
- Audit Logs — Complete audit trail of all operations
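
What "JSON logs with correlation IDs" looks like in practice can be sketched with the standard library alone (EntropyGuard itself uses structlog, so this is a stand-in, not its actual logging setup):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying a correlation id."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("entropyguard.demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Tag every log line from one pipeline run with the same id, so log
# aggregation systems can group all events belonging to that run.
logger.info("dedup stage complete", extra={"correlation_id": str(uuid.uuid4())})
```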

### 🔒 Enterprise Ready

- Standard Exit Codes — sysexits.h compliant for automation
- Type Safety — Full type hints (MyPy strict compatible)
- Configuration Validation — Pydantic-based schema validation
- Input Validation — Format detection and consistency checks

## ⚡ Quick Start

### The "Magic" Command

```bash
# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
```
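
Conceptually, pipe mode is line-oriented NDJSON filtering. A minimal stand-in (exact-match dedup only; the `text` field name is an assumption for the example):

```python
import hashlib
import json

def stream_dedup(lines, key="text"):
    """Yield each NDJSON line whose text field has not been seen before."""
    seen = set()
    for line in lines:
        record = json.loads(line)
        fingerprint = hashlib.sha256(record[key].encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield json.dumps(record)

# In a real pipe you would feed sys.stdin in and print each surviving line out.
rows = ['{"text": "hello"}', '{"text": "world"}', '{"text": "hello"}']
print(list(stream_dedup(rows)))
```

Because records stream through one line at a time, a filter like this composes with other Unix tools (`grep`, `jq`, `head`) on either side of the pipe.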

### Basic Usage

```bash
# File-to-file processing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --text-column text \
  --dedup-threshold 0.95

# With custom settings
entropyguard \
  --input data.ndjson \
  --output cleaned.ndjson \
  --text-column content \
  --min-length 100 \
  --dedup-threshold 0.9 \
  --chunk-size 500
```

### Advanced: Checkpoint & Resume

```bash
# Enable automatic checkpoint recovery
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --text-column text

# Resume from checkpoint manually
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --resume \
  --text-column text
```

## 📦 Installation

### Option 1: Install from PyPI

```bash
pip install entropyguard
```

Requirements:

- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)

### Option 2: Install from Git

```bash
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
```

Requirements:

- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
- git available on your system

### Option 3: Docker

```bash
# Build image
docker build -t entropyguard:latest .

# Run container
docker run -v $(pwd):/data entropyguard:latest \
  --input /data/input.jsonl \
  --output /data/output.jsonl \
  --text-column text
```

### Option 4: Development Setup

```bash
git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install
```

## 📋 CLI Flags Reference

Complete reference for all available flags:

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| **Input/Output** | | | |
| `--input` | string | `-` (stdin) | Path to input file (CSV, JSON, NDJSON). Use `-` for stdin |
| `--output` | string | `-` (stdout) | Path to output file (NDJSON). Use `-` for stdout |
| `--text-column` | string | auto-detect | Name of the text column to process. Auto-detects the first string column if omitted |
| `--required-columns` | string | None | Comma-separated list of required columns (optional schema validation) |
| **Processing Options** | | | |
| `--min-length` | int | 50 | Minimum text length after sanitization (characters) |
| `--dedup-threshold` | float | 0.95 | Similarity threshold for semantic deduplication (0.0-1.0). Higher = stricter |
| `--model-name` | string | `all-MiniLM-L6-v2` | Sentence-transformers model for embeddings. Use `paraphrase-multilingual-MiniLM-L12-v2` for multilingual data |
| `--batch-size` | int | 10000 | Batch size for embedding processing. Reduce for low-memory systems |
| **Chunking (RAG)** | | | |
| `--chunk-size` | int | None | Chunk size (characters) for splitting long texts. Disabled if not set |
| `--chunk-overlap` | int | 50 | Overlap (characters) between consecutive chunks. Only used with `--chunk-size` |
| `--separators` | list | default | Custom separators for chunking (space-separated). Use `\n` for newline, `\t` for tab |
| **Checkpoint & Resume** | | | |
| `--checkpoint-dir` | string | None | Directory to save checkpoints for error recovery |
| `--resume` | flag | false | Resume from the last checkpoint if available. Requires `--checkpoint-dir` |
| `--no-auto-resume` | flag | false | Disable automatic checkpoint recovery (requires explicit `--resume`) |
| **Logging & Output** | | | |
| `--verbose` | flag | false | Enable verbose logging (INFO level) |
| `--debug` | flag | false | Enable debug mode (DEBUG level + full tracebacks). Implies `--verbose` |
| `--demo` | flag | false | Demo mode: hide INFO logs, show only progress bars and the final summary |
| `--quiet` | flag | false | Disable progress bars (useful for CI/CD) |
| `--json` | flag | false | Output results as JSON (machine-readable format) |
| `--json-logs` | flag | false | Output logs as JSON (for log aggregation systems) |
| **Monitoring & Profiling** | | | |
| `--profile-memory` | flag | false | Enable memory profiling. Tracks usage at each pipeline stage |
| `--memory-report-path` | string | None | Path to save the memory profiling report (JSON). Requires `--profile-memory` |
| `--metrics-port` | int | None | Start a Prometheus metrics HTTP server on the specified port |
| `--audit-log` | string | None | Path to a JSON file for the audit log of dropped/duplicate rows |
| **Configuration** | | | |
| `--config` | string | auto-detect | Path to a config file (JSON/YAML/TOML). Auto-detects `.entropyguardrc` in the current/home dir |
| **Utility** | | | |
| `--dry-run` | flag | false | Simulate processing without expensive operations. Shows statistics only |
| `--version` | flag | - | Show the version number and exit |

### Flag Categories Explained

Input/Output: Control where data comes from and goes to. Supports Unix pipes (- for stdin/stdout).

Processing Options: Core deduplication settings. --dedup-threshold controls how similar texts must be to be considered duplicates (0.95 = 95% similarity).

Chunking (RAG): For Retrieval-Augmented Generation workflows. Splits long texts into smaller chunks with configurable overlap.
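
The overlap behavior can be pictured with a simple character window (the real splitter is recursive and separator-aware, so treat this as an approximation of what `--chunk-size` and `--chunk-overlap` do):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Sliding-window chunking: each chunk repeats the last
    `chunk_overlap` characters of the previous one, so no sentence
    is cut off without context at a chunk boundary."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# A 1200-character text with --chunk-size 500 --chunk-overlap 50
# yields three chunks of 500, 500, and 300 characters.
parts = chunk_text("a" * 1200, chunk_size=500, chunk_overlap=50)
```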

Checkpoint & Resume: Fault tolerance features. Automatically saves progress and can resume from failures.

Logging & Output: Control verbosity and output format. --demo is perfect for video demonstrations.

Monitoring & Profiling: Production observability. Memory profiling helps debug OOM issues, Prometheus metrics enable monitoring.

Configuration: Use config files to avoid repeating flags. CLI arguments override config file values.
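
The precedence rule (built-in defaults, overridden by the config file, overridden by CLI flags) can be sketched as follows; the merge logic is illustrative, not EntropyGuard's actual loader:

```python
import json

def effective_config(config_path, cli_args):
    """Illustrative precedence only: defaults < config file < CLI flags."""
    settings = {"min_length": 50, "dedup_threshold": 0.95}  # documented defaults
    if config_path:
        with open(config_path) as f:
            settings.update(json.load(f))                   # config file layer
    # CLI flags that were actually passed (not None) win over everything.
    settings.update({k: v for k, v in cli_args.items() if v is not None})
    return settings
```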


## 🏢 Enterprise / Advanced Usage

### Configuration File (.entropyguardrc.json)

Create a configuration file in your home directory or project root:

```json
{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "chunk_size": 500,
  "chunk_overlap": 50,
  "remove_pii": true,
  "normalize_text": true,
  "show_progress": true
}
```

Then run:

```bash
entropyguard --input data.jsonl --output clean.jsonl
```

### Monitoring & Observability

```bash
# Enable Prometheus metrics
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --metrics-port 9090 \
  --text-column text

# Enable memory profiling
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --profile-memory \
  --text-column text

# JSON logs for machine parsing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --json-logs \
  --text-column text
```

### Exit Codes

EntropyGuard follows the sysexits.h standard:

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid arguments) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (internal bug) |
| 130 | Process interrupted (SIGINT/Ctrl+C) |
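
In automation you would branch on these codes after each run. A sketch of that pattern (the `raise SystemExit(65)` subprocess stands in for a real `entropyguard` invocation, and the messages follow the table above):

```python
import subprocess
import sys

# Map the documented exit codes to human-readable outcomes.
MESSAGES = {
    0: "pipeline succeeded",
    2: "usage error (invalid arguments)",
    65: "input file error",
    130: "interrupted (SIGINT)",
}

def describe(cmd):
    """Run a command and translate its exit code for automation logs."""
    code = subprocess.run(cmd).returncode
    return MESSAGES.get(code, f"failed with exit code {code}")

print(describe([sys.executable, "-c", "raise SystemExit(65)"]))
```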

## 📊 Comparison

| Feature | EntropyGuard | Basic Scripts | Vector DBs |
|---------|--------------|---------------|------------|
| Exact Deduplication | ✅ Hash-based (fast) | ⚠️ Manual | |
| Semantic Deduplication | ✅ AI-powered | | |
| Local Processing | ✅ 100% local | | ⚠️ Requires DB |
| Memory Safety | ✅ Chunked processing | ⚠️ Manual | ⚠️ Depends on DB |
| Fault Tolerance | ✅ Checkpoint/Resume | | ⚠️ Depends on DB |
| Unix Pipes | ✅ Native support | ⚠️ Manual | |
| Observability | ✅ Metrics + Logs | | ⚠️ Depends on DB |
| Configuration | ✅ Pydantic validation | | ⚠️ DB-specific |
| Type Safety | ✅ Full type hints | ⚠️ Depends on language | |

## 🛠️ Tech Stack

- Core: Python 3.10+, Polars (LazyFrame)
- AI/ML: PyTorch (CPU), FAISS, Sentence-Transformers
- Validation: Pydantic v2
- Logging: structlog (optional)
- Metrics: Prometheus Client (optional)
- Infrastructure: Poetry, Docker-ready

## 📋 Edition Comparison

EntropyGuard is available in two editions:

| Feature | Community (Open Source) | Enterprise |
|---------|-------------------------|------------|
| CLI Tool | ✅ Full-featured | ✅ Full-featured |
| Semantic Deduplication | ✅ Unlimited | ✅ Unlimited |
| PII Removal | ✅ Unlimited | ✅ Unlimited |
| Data Formats | ✅ All formats | ✅ All formats |
| Docker Support | ✅ Yes | ✅ Yes |
| Audit Logs | ✅ Yes | ✅ Enhanced |
| Web Dashboard | ❌ | ✅ Professional Analytics Platform |
| Real-time Monitoring | ❌ | ✅ Live telemetry & metrics |
| Alert System | ❌ | ✅ Custom alert rules (Watchtower) |
| API Access | ❌ | ✅ RESTful API |
| SSO Integration | ❌ | ✅ SAML 2.0, OAuth 2.0 |
| Support | Community | Priority support with SLA |
| License | MIT License | Commercial license required |

> **📌 Legal Notice:** Enterprise features (Control Plane, Dashboard, API, Alerting System) are proprietary software covered by a commercial license. These components are NOT included in the Open Source release and are NOT subject to the MIT license terms.


## 📚 Documentation


## 🤝 Contributing

Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.


## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


## 🙏 Acknowledgments

Built with ❤️ by the EntropyGuard Team

Special thanks to:


**[⬆ Back to Top](#-entropyguard-v1221)**

Made with ❤️ for the LLM community