Training Large Language Models on contaminated, redundant, or low-quality data leads to:

- wasted compute on near-duplicate examples,
- degraded model quality from noise and contamination,
- increased risk of memorizing and leaking sensitive content.
EntropyGuard runs 100% locally on your CPU, so no data ever leaves your machine. Perfect for:

- sensitive or proprietary datasets,
- air-gapped or compliance-constrained environments,
- teams that cannot send data to third-party APIs.
```bash
# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
```
```bash
# File-to-file processing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --text-column text \
  --dedup-threshold 0.95
```
```bash
# With custom settings
entropyguard \
  --input data.ndjson \
  --output cleaned.ndjson \
  --text-column content \
  --min-length 100 \
  --dedup-threshold 0.9 \
  --chunk-size 500
```
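For RAG-style chunking you can also tune the overlap and split points; a sketch using the documented `--chunk-overlap` and `--separators` flags (the separator values here are illustrative):

```bash
# Chunk long documents with 50-character overlap, splitting on
# paragraph breaks first, then single newlines, then spaces
entropyguard \
  --input docs.ndjson \
  --output chunks.ndjson \
  --text-column content \
  --chunk-size 500 \
  --chunk-overlap 50 \
  --separators "\n\n" "\n" " "
```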
```bash
# Enable automatic checkpoint recovery
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --text-column text
```
```bash
# Resume from checkpoint manually
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --resume \
  --text-column text
```
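Automatic recovery can also be disabled so a run never silently resumes; a sketch using the documented `--no-auto-resume` flag:

```bash
# Always start fresh unless --resume is passed explicitly
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --no-auto-resume \
  --text-column text
```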
```bash
pip install entropyguard
```
Requirements:
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
Requirements: `git` available on your system

```bash
# Build image
docker build -t entropyguard:latest .

# Run container
docker run -v $(pwd):/data entropyguard:latest \
  --input /data/input.jsonl \
  --output /data/output.jsonl \
  --text-column text
```
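Because the image's entrypoint is the entropyguard CLI (as in the run example above), you can also stream through the container instead of mounting a volume; a sketch using the documented `-` stdin/stdout convention:

```bash
# Pipe data through the container over stdin/stdout
cat data.jsonl | docker run -i entropyguard:latest \
  --input - \
  --output - > clean.jsonl
```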
```bash
git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install
```
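To verify the development install, run the CLI through Poetry; a minimal check using the documented `--version` flag:

```bash
poetry run entropyguard --version
```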
Complete reference for all available flags:
| Flag | Type | Default | Description |
|---|---|---|---|
| **Input/Output** | | | |
| `--input` | string | `-` (stdin) | Path to input file (CSV, JSON, NDJSON). Use `-` for stdin |
| `--output` | string | `-` (stdout) | Path to output file (NDJSON). Use `-` for stdout |
| `--text-column` | string | auto-detect | Name of text column to process. Auto-detects first string column if omitted |
| `--required-columns` | string | None | Comma-separated list of required columns (optional schema validation) |
| **Processing Options** | | | |
| `--min-length` | int | `50` | Minimum text length after sanitization (characters) |
| `--dedup-threshold` | float | `0.95` | Similarity threshold for semantic deduplication (0.0-1.0). Higher = stricter |
| `--model-name` | string | `all-MiniLM-L6-v2` | Sentence-transformers model for embeddings. Use `paraphrase-multilingual-MiniLM-L12-v2` for multilingual |
| `--batch-size` | int | `10000` | Batch size for embedding processing. Reduce for low-memory systems |
| **Chunking (RAG)** | | | |
| `--chunk-size` | int | None | Chunk size (characters) for splitting long texts. Disabled if not set |
| `--chunk-overlap` | int | `50` | Overlap size (characters) between consecutive chunks. Only used with `--chunk-size` |
| `--separators` | list | default | Custom separators for chunking (space-separated). Use `\n` for newline, `\t` for tab |
| **Checkpoint & Resume** | | | |
| `--checkpoint-dir` | string | None | Directory to save checkpoints for error recovery |
| `--resume` | flag | false | Resume from last checkpoint if available. Requires `--checkpoint-dir` |
| `--no-auto-resume` | flag | false | Disable automatic checkpoint recovery (requires explicit `--resume`) |
| **Logging & Output** | | | |
| `--verbose` | flag | false | Enable verbose logging (INFO level) |
| `--debug` | flag | false | Enable debug mode (DEBUG level + full tracebacks). Implies `--verbose` |
| `--demo` | flag | false | Demo mode: hide INFO logs, show only progress bars and final summary |
| `--quiet` | flag | false | Disable progress bars (useful for CI/CD) |
| `--json` | flag | false | Output results as JSON (machine-readable format) |
| `--json-logs` | flag | false | Output logs as JSON (for log aggregation systems) |
| **Monitoring & Profiling** | | | |
| `--profile-memory` | flag | false | Enable memory profiling. Tracks usage at each pipeline stage |
| `--memory-report-path` | string | None | Path to save memory profiling report (JSON). Requires `--profile-memory` |
| `--metrics-port` | int | None | Start Prometheus metrics HTTP server on specified port |
| `--audit-log` | string | None | Path to JSON file for audit log of dropped/duplicate rows |
| **Configuration** | | | |
| `--config` | string | auto-detect | Path to config file (JSON/YAML/TOML). Auto-detects `.entropyguardrc` in current/home dir |
| **Utility** | | | |
| `--dry-run` | flag | false | Simulate processing without expensive operations. Shows statistics only |
| `--version` | flag | - | Show version number and exit |
- **Input/Output:** Control where data comes from and goes to. Supports Unix pipes (`-` for stdin/stdout); see the combined example after this list.
- **Processing Options:** Core deduplication settings. `--dedup-threshold` controls how similar two texts must be to count as duplicates (0.95 = 95% similarity).
- **Chunking (RAG):** For Retrieval-Augmented Generation workflows. Splits long texts into smaller chunks with configurable overlap.
- **Checkpoint & Resume:** Fault tolerance. Progress is saved automatically and processing can resume after failures.
- **Logging & Output:** Control verbosity and output format. `--demo` is designed for video demonstrations.
- **Monitoring & Profiling:** Production observability. Memory profiling helps debug OOM issues; Prometheus metrics enable monitoring.
- **Configuration:** Use config files to avoid repeating flags. CLI arguments override config file values.
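Putting a few of these together: a sketch of a piped run that tightens the threshold and enables chunking, using only the flags documented above (file names are illustrative):

```bash
# Read from stdin, chunk for RAG, dedupe at 90% similarity, write to stdout
cat raw.jsonl | entropyguard \
  --input - \
  --output - \
  --dedup-threshold 0.9 \
  --chunk-size 500 \
  --chunk-overlap 50 > clean.jsonl
```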
Create a configuration file (`.entropyguardrc.json`) in your home directory or project root:
```json
{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "chunk_size": 500,
  "chunk_overlap": 50,
  "remove_pii": true,
  "normalize_text": true,
  "show_progress": true
}
```
Then run:
```bash
entropyguard --input data.jsonl --output clean.jsonl
```
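Because CLI arguments override config file values, shared defaults can live in `.entropyguardrc.json` while individual runs adjust them; for example:

```bash
# --dedup-threshold 0.9 overrides the 0.95 set in the config file
entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.9
```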
```bash
# Enable Prometheus metrics
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --metrics-port 9090 \
  --text-column text
```
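With the metrics server running, it can be scraped like any other Prometheus target; a quick check, assuming the conventional `/metrics` path:

```bash
curl http://localhost:9090/metrics
```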
```bash
# Enable memory profiling
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --profile-memory \
  --text-column text
```
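To keep the results, the documented `--memory-report-path` flag saves the profile as JSON; for example:

```bash
# Write the per-stage memory profile to a JSON report
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --profile-memory \
  --memory-report-path memory_report.json \
  --text-column text
```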
```bash
# JSON logs for machine parsing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --json-logs \
  --text-column text
```
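Structured logs are easy to post-process; a sketch that pretty-prints them with `jq`, assuming logs are emitted on stderr as is conventional:

```bash
# Merge stderr into the pipe and pretty-print each JSON log line
entropyguard --input data.jsonl --output clean.jsonl --json-logs \
  --text-column text 2>&1 | jq .
```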
EntropyGuard's exit codes are modeled on the sysexits.h convention:
| Code | Meaning |
|---|---|
| `0` | Success |
| `1` | General error |
| `2` | Usage error (invalid arguments) |
| `64` | Data format error |
| `65` | Input file error |
| `66` | Output file error |
| `70` | Software error (internal bug) |
| `130` | Process interrupted (SIGINT/Ctrl+C) |
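These codes make failure handling in scripts straightforward; a sketch for a CI step that reacts to the documented usage-error code:

```bash
# Surface a specific message for usage errors, then propagate the exit code
entropyguard --input data.jsonl --output clean.jsonl --quiet
status=$?
if [ "$status" -eq 2 ]; then
  echo "entropyguard: invalid arguments" >&2
fi
exit "$status"
```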
| Feature | EntropyGuard | Basic Scripts | Vector DBs |
|---|---|---|---|
| Exact Deduplication | ✅ Hash-based (fast) | ⚠️ Manual | ❌ |
| Semantic Deduplication | ✅ AI-powered | ❌ | ✅ |
| Local Processing | ✅ 100% local | ✅ | ⚠️ Requires DB |
| Memory Safety | ✅ Chunked processing | ⚠️ Manual | ⚠️ Depends on DB |
| Fault Tolerance | ✅ Checkpoint/Resume | ❌ | ⚠️ Depends on DB |
| Unix Pipes | ✅ Native support | ⚠️ Manual | ❌ |
| Observability | ✅ Metrics + Logs | ❌ | ⚠️ Depends on DB |
| Configuration | ✅ Pydantic validation | ❌ | ⚠️ DB-specific |
| Type Safety | ✅ Full type hints | ❌ | ⚠️ Depends on language |
EntropyGuard is available in two editions:
| Feature | Community (Open Source) | Enterprise |
|---|---|---|
| CLI Tool | ✅ Full-featured | ✅ Full-featured |
| Semantic Deduplication | ✅ Unlimited | ✅ Unlimited |
| PII Removal | ✅ Unlimited | ✅ Unlimited |
| Data Formats | ✅ All formats | ✅ All formats |
| Docker Support | ✅ Yes | ✅ Yes |
| Audit Logs | ✅ Yes | ✅ Enhanced |
| Web Dashboard | ❌ | ✅ Professional Analytics Platform |
| Real-time Monitoring | ❌ | ✅ Live telemetry & metrics |
| Alert System | ❌ | ✅ Custom alert rules (Watchtower) |
| API Access | ❌ | ✅ RESTful API |
| SSO Integration | ❌ | ✅ SAML 2.0, OAuth 2.0 |
| Support | Community | Priority support with SLA |
| License | MIT License | Commercial license required |
📌 Legal Notice: Enterprise features (Control Plane, Dashboard, API, Alerting System) are proprietary software covered by a commercial license. These components are NOT included in the Open Source release and are NOT subject to the MIT license terms.
Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ by the EntropyGuard Team
Special thanks to: