Local Model Evaluation

Don't run what you haven't scanned. Don't bench what'll freeze your Mac.

A security-first lifecycle for every AI model you download — scan it for malware, size-gate it before it touches your disk, bench it under a memory-pressure throttle that physically cannot let it freeze your machine, and auto-offload the weights if it loses. Every model in your local pool is safe, tested, and earns its disk space. Includes a printable setup PDF.

Replaces: ad-hoc model testing, and the RAM/storage upgrade you almost bought after the last crash

What you get

🛡️
6-check security scan
picklescan, safetensors header, ClamAV, JSON config audit, Jinja template audit, and Ollama blob hash — every model is scanned before it ever runs.
🧯
Anti-freeze bench throttle
Benching local models is RAM-intensive — and the #1 way to lock or crash a Mac. The throttle enforces one bench at a time, refuses to start under memory pressure, automatically unloads idle models to free RAM, suspends the bench if pressure spikes, and kills it cleanly before your machine can freeze. Built after the developer's own Mac Studio crashed twice in two days.
⚖️
Pre-download size gate
A hard gate, not a suggestion — the pull is blocked when weights would exceed your machine's RAM or disk floor, and a BLOCKED-too-large verdict is recorded so the same model is never re-attempted. No more 70B downloads on a 16 GB machine.
🗑️
Auto-offload on DECLINE
When a model loses the bench, the weights are reclaimed automatically and the verdict is recorded — you never pay disk for a model you've already declined, and future scouts skip the redownload.
📊
Speed + quality bench
3-run median tok/s plus task-representative quality prompts with pass/fail bars per task type. Verdict: DEPLOY, WATCH, or DECLINE.
🔀
Automatic routing
Every downloaded model is classified by type and routed to the right domain evaluator — vision, TTS, code, reasoning. Nothing sits in cache unreviewed.
📋
Daily gap report
Scheduled audit runs at your preferred time and surfaces any model in your pool that hasn't been scanned or benched.
📄
Printable setup PDF
A 2-page setup guide covers install, the 6-step lifecycle, throttle exit codes, and where every log lives. Ship it to your team or stick it on the fridge.
💻
Cross-platform
Works on macOS, Ubuntu, Fedora, and Windows. Auto-detects tools on each platform.
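
To picture the anti-freeze throttle described above, here is a minimal sketch of its decision step, assuming the watchdog periodically samples the fraction of memory still free. The `throttle_action` name and both thresholds are hypothetical illustrations, not the tool's actual values.

```python
from enum import Enum


class Action(Enum):
    RUN = "run"          # enough headroom, let the bench continue
    SUSPEND = "suspend"  # pressure is rising, pause the bench
    KILL = "kill"        # pressure is critical, end the bench cleanly


def throttle_action(free_fraction: float,
                    suspend_below: float = 0.15,   # hypothetical threshold
                    kill_below: float = 0.05) -> Action:  # hypothetical threshold
    """Map a memory-free fraction (0.0 to 1.0) to a watchdog action."""
    if not 0.0 <= free_fraction <= 1.0:
        raise ValueError("free_fraction must be between 0.0 and 1.0")
    if free_fraction < kill_below:
        return Action.KILL
    if free_fraction < suspend_below:
        return Action.SUSPEND
    return Action.RUN


for sample in (0.40, 0.10, 0.02):
    print(sample, throttle_action(sample).value)
```

On POSIX systems a process can be paused and resumed with the SIGSTOP and SIGCONT signals; whether the tool uses that mechanism is not stated here.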

How it works

1
Scout

Check your evaluation history for prior verdicts. Verify disk headroom. Confirm no duplicate already installed.

2
Size check

Queries the model's size before downloading. Blocks the pull when the weights would exceed your machine's RAM or disk floor.
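
A minimal sketch of the gate logic, assuming the model's size in bytes is already known before the pull. The `size_gate` function, the `GateResult` type, the `disk_floor_bytes` parameter, and the "OK" verdict string are hypothetical; only the BLOCKED-too-large verdict comes from the description above.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class GateResult:
    allowed: bool
    verdict: str  # "OK" (hypothetical) or "BLOCKED-too-large"
    reason: str


def size_gate(model_bytes: int, ram_bytes: int,
              disk_free_bytes: int, disk_floor_bytes: int) -> GateResult:
    """Decide whether a pull may start, before any bytes are transferred."""
    if model_bytes > ram_bytes:
        return GateResult(False, "BLOCKED-too-large", "weights exceed RAM")
    if disk_free_bytes - model_bytes < disk_floor_bytes:
        return GateResult(False, "BLOCKED-too-large",
                          "pull would leave less than the disk floor free")
    return GateResult(True, "OK", "fits in RAM and leaves the disk floor free")


# A 40 GiB model on a 16 GiB machine is refused up front.
result = size_gate(40 * GIB, 16 * GIB, 500 * GIB, 50 * GIB)
print(result.verdict, "-", result.reason)
```

In practice the free-disk figure could come from `shutil.disk_usage(path).free`; total RAM has no portable standard-library call, so a third-party package such as psutil would be one option.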

3
Pull

ollama pull or huggingface-cli download: Gemma, Kimi, Qwen, Llama, Mistral, and others.

4
Scan

6 security checks run automatically. Non-zero exit = stop.
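
As an illustration of one of the six checks, here is a sketch of a safetensors header check. It relies on the published file layout (an 8-byte little-endian header length followed by a JSON header); what the tool's own check verifies is not stated, and the function names and the size cap below are assumptions.

```python
import json
import struct
from pathlib import Path

MAX_HEADER_BYTES = 100 * 1024 * 1024  # illustrative cap, an assumption


def check_safetensors_header(path: Path) -> bool:
    """Return True if the file starts with a well-formed safetensors header."""
    size = path.stat().st_size
    with path.open("rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            return False
        (header_len,) = struct.unpack("<Q", prefix)  # little-endian uint64
        if header_len > MAX_HEADER_BYTES or 8 + header_len > size:
            return False
        try:
            header = json.loads(f.read(header_len).decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            return False
    if not isinstance(header, dict):
        return False
    required = {"dtype", "shape", "data_offsets"}
    for name, entry in header.items():
        if name == "__metadata__":  # optional free-form metadata block
            continue
        if not isinstance(entry, dict) or not required <= entry.keys():
            return False
    return True


def exit_code(path: Path) -> int:
    """Translate the check into a process exit code: non-zero means stop."""
    return 0 if check_safetensors_header(path) else 1
```

A wrapper script could pass `exit_code(...)` to `sys.exit` so that a failed check halts the lifecycle before the model is ever loaded.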

5
Bench

Speed bench (3-run median tok/s) + 3 quality prompts per task type. RAM-intensive — run when you're not actively using the computer; the throttle watchdog pauses or aborts the bench under memory pressure so it can't freeze your machine.
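
The 3-run median can be sketched as follows, assuming a callable that runs one generation and returns the number of tokens produced. `fake_generate` is a stand-in for a real model call, and all names here are hypothetical.

```python
import statistics
import time
from typing import Callable


def median_tok_per_s(generate: Callable[[], int], runs: int = 3) -> float:
    """Time `generate` over several runs and return the median tokens/second."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(tokens / elapsed)
    return statistics.median(rates)


def fake_generate() -> int:
    # Stand-in for a real model call: sleeps briefly and reports 50 tokens.
    time.sleep(0.01)
    return 50


print(f"{median_tok_per_s(fake_generate):.0f} tok/s")
```

Taking the median rather than the mean keeps a single outlier run, such as a cold first load, from skewing the reported speed.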

6
Decide

DEPLOY: wire into router. DECLINE: auto-offload weights. WATCH: re-evaluate in 30 days.
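
One way to picture the decision step, as a sketch only: the rule below (any failed quality prompt means DECLINE, passing quality but missing the speed bar means WATCH) and the `min_tok_per_s` threshold are assumptions for illustration. Only the three verdict names and the 30-day re-evaluation come from the description above.

```python
from datetime import date, timedelta
from typing import Optional


def decide(tok_per_s: float, passed: int, total: int,
           min_tok_per_s: float = 20.0,          # hypothetical speed bar
           today: Optional[date] = None) -> dict:
    """Turn bench results into a verdict and, for WATCH, a re-evaluation date."""
    today = today or date.today()
    if passed < total:
        return {"verdict": "DECLINE", "recheck_on": None}
    if tok_per_s < min_tok_per_s:
        recheck = today + timedelta(days=30)
        return {"verdict": "WATCH", "recheck_on": recheck.isoformat()}
    return {"verdict": "DEPLOY", "recheck_on": None}


print(decide(35.0, passed=3, total=3))
print(decide(8.0, passed=3, total=3))
print(decide(35.0, passed=2, total=3))
```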

7
Record

Log the verdict — including bytes reclaimed by offload — so future evaluations skip re-testing the same model.
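
A minimal sketch of a verdict log, assuming an append-only JSON Lines file. The file layout, field names, and both function names are hypothetical; the tool's real log format is not described here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def record_verdict(history: Path, model: str, verdict: str,
                   bytes_reclaimed: int = 0) -> None:
    """Append one verdict entry to the history file."""
    entry = {
        "model": model,
        "verdict": verdict,
        "bytes_reclaimed": bytes_reclaimed,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with history.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def last_verdict(history: Path, model: str) -> Optional[dict]:
    """Return the most recent entry for `model`, or None if it was never evaluated."""
    if not history.exists():
        return None
    latest = None
    with history.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            if entry["model"] == model:
                latest = entry
    return latest
```

A scout step could call `last_verdict` first and skip the pull whenever the stored verdict is DECLINE or BLOCKED-too-large.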

8
Offload

Reclaim disk on DECLINE / BLOCKED-too-large. The verdict persists, the weights don't — you never pay disk for a model you've already declined.

Who it's for

Bench big models on a working machine without bricking it — the throttle pauses, unloads, and resumes around your real workload instead of crashing it
Refuse downloads you cannot possibly run — the size gate blocks a pull before a single byte transfers when weights exceed your RAM or disk floor
Stop paying disk for declined models — auto-offload + persistent verdict history means you never accidentally redownload a loser
Catch malicious models before they execute — picklescan, ClamAV, and template audits run on every pull from HuggingFace or Ollama
Enforce one consistent bench protocol across vision, TTS, code, and reasoning models — no more ad-hoc 'I think this one was faster'
Audit your local pool at a glance — daily gap report flags anything unscanned or unbenched

What's new

v1.1.0 · Updated 2026-06-16
v1.1.0 (2026-06-16)
  • Baseline version recorded.