director-ai Files

Real-time LLM hallucination guardrail — NLI + RAG fact-checking

Brought to you by: anulum

Name	Modified	Size	Downloads / Week
director_ai-3.9.4-py3-none-any.whl	2026-03-20	394.9 kB	0
director_ai-3.9.4.tar.gz	2026-03-20	576.6 kB	0
sbom.json	2026-03-20	72.8 kB	0
README.md	2026-03-20	1.9 kB	0
v3.9.4 source code.tar.gz	2026-03-20	10.9 MB	0
v3.9.4 source code.zip	2026-03-20	11.3 MB	0
Totals: 6 Items		23.3 MB	0

v3.9.4 — Verified Claims Release

Every claim in README, docs, and notebooks is now backed by measured data.

Domain Profile Thresholds (measured 2026-03-20, GTX 1060 6GB, NLI on CUDA)

Profile	Old Threshold	New Threshold	Basis
medical	0.75	0.30	PubMedQA, 500 samples: F1=59.9%, catch rate=77.3%
finance	0.70	0.30	FinanceBench, 150 samples: 0% FPR at t≤0.30
legal	0.68	0.30	Aligned with the other profiles, not measured (CUAD runs out of memory on 6 GB; needs a ≥16 GB GPU)

In these measurements, CoherenceScorer produced scores in [0.25, 0.55] regardless of domain, so the old thresholds (0.68 to 0.75) rejected every response.
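The sketch below shows why the old presets rejected everything. It is plain Python, not director-ai's API. It assumes a response is rejected when its score falls below the threshold, which is consistent with the old thresholds rejecting every response. The scores are hypothetical values spread across the measured [0.25, 0.55] range, and `rejected_fraction` is a hypothetical helper.

```python
# Hypothetical scores spread across the measured CoherenceScorer range
# [0.25, 0.55]; these are illustrative values, not real measurements.
scores = [0.25, 0.32, 0.41, 0.48, 0.55]


def rejected_fraction(scores, threshold):
    """Share of responses rejected, assuming rejection when score < threshold."""
    # Each comparison yields True/False; sum() counts the True values.
    return sum(score < threshold for score in scores) / len(scores)


# Old per-domain presets, the DirectorConfig value, and the new 0.30 preset.
thresholds = {
    "medical (old)": 0.75,
    "finance (old)": 0.70,
    "legal (old)": 0.68,
    "DirectorConfig": 0.60,
    "new presets / guard()": 0.30,
}

for name, threshold in thresholds.items():
    share = rejected_fraction(scores, threshold)
    # Old presets and 0.60 reject 100%; 0.30 rejects 20% (only the 0.25 score).
    print(f"{name:<22} t={threshold:.2f}  rejected={share:.0%}")
```

Under the same assumption, any threshold above 0.55 rejects every score in the measured range. That includes the DirectorConfig value of 0.6, which is one reason the guard()=0.3 vs DirectorConfig=0.6 inconsistency is worth noting.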

Provider Integrations Tested

guard() + OpenAI (gpt-4o-mini): ✅
guard() + Anthropic (claude-haiku): ✅ (correctly rejects hallucinations)
score() standalone: ✅

Docker

CPU Dockerfile builds and runs (health, review, source endpoints verified)
GPU Dockerfile included (not tested locally — needs NVIDIA runtime)

Documentation (25 files updated)

Model attribution: FactCG-DeBERTa-v3-Large credited with paper link
Domain presets: "tuned" → "preset" (starting points, not validated)
Docker: removed dead Docker Hub links; added local build instructions
HF Spaces: "Live Demo" → "Demo" (space sleeps due to inactivity)
Summarization FPR: 2.0% → 10.5% (matches measured data)
Version references: all updated to 3.9.4
Threshold inconsistency documented: guard()=0.3 vs DirectorConfig=0.6
All cookbooks, guides, notebooks aligned to measured thresholds

Known Limitations (honest)

Domain profiles are starting points — tune on your own data
Long documents (legal contracts, SEC filings) OOM on 6GB VRAM
NLI-only end-to-end catch rate: 46.7%; hybrid mode is needed to reach 90.7%
Summarization FPR: 10.5% (not 2.0% as previously claimed)
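Because the domain profiles are only starting points, you will need to tune a threshold on your own labelled data. The following is a minimal sketch of one way to do that. It assumes you already have a score for each response, however your pipeline obtains it, plus a hallucination label. It also assumes a response is flagged when its score is below the threshold. The metrics use the release notes' terms: catch rate is the share of hallucinations flagged, and FPR is the share of faithful responses flagged. `Metrics`, `evaluate`, `tune_threshold`, the toy data, and the `max_fpr` budget are all hypothetical and are not part of director-ai.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    threshold: float
    catch_rate: float  # share of hallucinations flagged (recall)
    fpr: float         # share of faithful responses wrongly flagged
    f1: float


def evaluate(scores, is_hallucination, threshold):
    """Confusion-matrix metrics, assuming a response is flagged when score < threshold."""
    tp = fp = fn = tn = 0
    for score, bad in zip(scores, is_hallucination):
        flagged = score < threshold
        if bad and flagged:
            tp += 1
        elif bad:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    catch = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * catch / (precision + catch) if precision + catch else 0.0
    return Metrics(threshold, catch, fpr, f1)


def tune_threshold(scores, is_hallucination, candidates, max_fpr=0.10):
    """Highest catch rate among candidates whose FPR stays within max_fpr.

    max_fpr is an example budget, not a project default. Ties go to the
    lower threshold, which flags fewer responses. Returns None if no
    candidate meets the FPR budget.
    """
    results = [evaluate(scores, is_hallucination, t) for t in candidates]
    allowed = [m for m in results if m.fpr <= max_fpr]
    if not allowed:
        return None
    return max(allowed, key=lambda m: (m.catch_rate, -m.threshold))


# Toy labelled data (hypothetical): True marks a known hallucination.
scores = [0.26, 0.28, 0.31, 0.35, 0.38, 0.42, 0.45, 0.50, 0.53, 0.55]
is_hallucination = [True, True, True, False, True, False, False, False, False, False]

best = tune_threshold(scores, is_hallucination, [0.28, 0.30, 0.32, 0.36, 0.40])
print(best)  # on this toy data: threshold=0.32, catch_rate=0.75, fpr=0.0
```

The sketch caps FPR first and then maximizes catch rate, because the release notes report both of those numbers. F1 is computed so you can compare against a figure like the PubMedQA result, but the sketch does not use it for selection.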

Full Changelog: https://github.com/anulum/director-ai/compare/v3.9.2...v3.9.4

Source: README.md, updated 2026-03-20
