Inside the Simon slug benchmark harness
How we score candidate models on professional tasks, measure latency under load, and run regression checks before a slug upgrade ships.
Every simon-says-* slug passes through a fixed harness before it reaches production. We run domain-specific prompts drawn from real customer workloads (financial analysis, legal drafting, code review) and score the outputs against frontier baselines.
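To make the baseline comparison concrete, here is a minimal sketch of a per-domain regression check. It is an illustration under assumptions, not the harness itself: the `DomainResult` type, the 0 to 1 score scale, the `tolerance` value, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainResult:
    """Mean scores for one domain (hypothetical 0..1 scale)."""
    domain: str
    candidate: float  # candidate model's mean score on this domain
    baseline: float   # baseline's mean score on the same prompts


def regressions(results, tolerance=0.02):
    """Return domains where the candidate trails the baseline by more than `tolerance`."""
    return [r.domain for r in results if r.baseline - r.candidate > tolerance]


def passes_bar(results, tolerance=0.02):
    """A candidate passes only if no domain regresses beyond the tolerance."""
    return not regressions(results, tolerance)


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    sample = [
        DomainResult("financial_analysis", 0.81, 0.80),
        DomainResult("legal_drafting", 0.70, 0.78),
        DomainResult("code_review", 0.77, 0.78),
    ]
    print(regressions(sample))
    print(passes_bar(sample))
```

Checking each domain separately, rather than one averaged score, keeps a strong result in one area from hiding a drop in another.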
Latency matters as much as quality. Candidates run under concurrent load on lab hardware so we catch slowdowns before auto-update rolls a candidate out. Safety spot checks cover jailbreak resistance and hallucination patterns on high-risk prompts.
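A concurrent-load measurement of this kind could look roughly like the sketch below. Everything in it is assumed for illustration: `fake_model` stands in for a real model request, and the concurrency level and nearest-rank percentile are arbitrary choices rather than the harness's actual settings.

```python
import asyncio
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


async def measure_latency(call, prompts, concurrency=8):
    """Await `call(prompt)` for every prompt with at most `concurrency` in flight.

    Returns per-request latencies in seconds. Timing starts once a slot is
    acquired, so time spent waiting for a slot is not included.
    """
    gate = asyncio.Semaphore(concurrency)

    async def timed(prompt):
        async with gate:
            start = time.perf_counter()
            await call(prompt)
            return time.perf_counter() - start

    return await asyncio.gather(*(timed(p) for p in prompts))


async def fake_model(prompt):
    # Hypothetical stand-in for a model request: sleeps instead of calling anything.
    await asyncio.sleep(0.001)
    return prompt.upper()


if __name__ == "__main__":
    latencies = asyncio.run(measure_latency(fake_model, ["hello"] * 50))
    print(f"p50={percentile(latencies, 50):.4f}s p95={percentile(latencies, 95):.4f}s")
```

Reporting a tail percentile such as p95 alongside the median matters here, because slowdowns under load tend to show up in the slowest requests first.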
Results are versioned per slug. When auto-update promotes a new upstream model, telemetry shows exactly which version served each request, so teams can audit upgrades or roll back if a release misses the bar.
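The audit and rollback flow can be pictured with a small in-memory sketch. The `SlugRegistry` class, its method names, the slug `simon-says-example`, and the version labels are hypothetical, chosen only to show the idea of recording which version served each request.

```python
class SlugRegistry:
    """Toy in-memory model of per-slug version history and request telemetry."""

    def __init__(self):
        self._history = {}  # slug -> promoted versions, oldest first
        self._served = {}   # request id -> (slug, version that served it)

    def promote(self, slug, version):
        """Make `version` the current upstream model for `slug`."""
        self._history.setdefault(slug, []).append(version)

    def current(self, slug):
        """Return the version currently serving `slug`."""
        return self._history[slug][-1]

    def rollback(self, slug):
        """Drop the latest version and return the one now serving `slug`."""
        versions = self._history[slug]
        if len(versions) < 2:
            raise ValueError(f"no earlier version to roll back to for {slug}")
        versions.pop()
        return versions[-1]

    def record(self, request_id, slug):
        """Log which version served this request."""
        self._served[request_id] = (slug, self.current(slug))

    def served_by(self, request_id):
        """Look up the (slug, version) pair that served a past request."""
        return self._served[request_id]


if __name__ == "__main__":
    registry = SlugRegistry()
    registry.promote("simon-says-example", "model-v1")
    registry.record("req-1", "simon-says-example")
    registry.promote("simon-says-example", "model-v2")
    registry.record("req-2", "simon-says-example")
    print(registry.served_by("req-1"))
    print(registry.rollback("simon-says-example"))
```

Because each request stores the version at serve time, a rollback changes what future requests get without rewriting the record of past ones.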