Inside the Simon slug benchmark harness

How we score candidate models on professional tasks, measure latency under load, and run regression checks before a slug upgrade ships.

Every simon-says-* slug passes through a fixed harness before it reaches production. We run domain-specific prompts drawn from real customer workloads (financial analysis, legal drafting, code review) and score the outputs against frontier baselines.
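To make the comparison concrete, here is a minimal sketch of per-domain scoring against a baseline. It assumes each prompt has already been graded to a score between 0 and 1 for both the candidate and the baseline; the function name, the data layout, and the choice of win rate as the metric are illustrative assumptions, not a description of the actual harness.

```python
def win_rate_by_domain(candidate, baseline):
    """Per domain, return the share of prompts where the candidate
    scores at least as high as the baseline on the same prompt.

    Both arguments are hypothetical dicts: domain name -> list of
    per-prompt scores, in the same prompt order.
    """
    rates = {}
    for domain, cand_scores in candidate.items():
        base_scores = baseline[domain]
        if len(cand_scores) != len(base_scores):
            raise ValueError(f"prompt count mismatch in domain {domain!r}")
        if not cand_scores:
            raise ValueError(f"no prompts scored in domain {domain!r}")
        # Count ties as wins: the candidate only has to match the baseline.
        wins = sum(1 for c, b in zip(cand_scores, base_scores) if c >= b)
        rates[domain] = wins / len(cand_scores)
    return rates


# Made-up scores, for illustration only.
candidate_scores = {"legal": [0.8, 0.6], "code": [0.9, 0.5, 0.7, 0.4]}
baseline_scores = {"legal": [0.7, 0.7], "code": [0.9, 0.6, 0.6, 0.5]}
rates = win_rate_by_domain(candidate_scores, baseline_scores)
```

Reporting a rate per domain rather than one global number keeps a strong result in code review from hiding a weak one in legal drafting.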

Latency matters as much as quality. Candidates run under concurrent load on lab hardware so we catch slowdowns before auto-update rolls them out. Safety spot checks cover jailbreak resistance and hallucination patterns on high-risk prompts.
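A latency check under concurrent load could look roughly like the sketch below. The model call is a stand-in that just sleeps, and the concurrency level, the percentile method (nearest rank), and all function names are assumptions chosen for the example rather than details of the real harness.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def fake_model_call(prompt):
    """Stand-in for a real model request (hypothetical): sleeps briefly."""
    time.sleep(0.01)
    return prompt.upper()


def timed_call(call, prompt):
    """Return the wall-clock seconds one call takes."""
    start = time.perf_counter()
    call(prompt)
    return time.perf_counter() - start


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def measure_latency(call, prompts, concurrency):
    """Run all prompts with `concurrency` worker threads and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda p: timed_call(call, p), prompts))
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": latencies[-1],
    }


stats = measure_latency(fake_model_call, ["hello"] * 20, concurrency=4)
```

Tail percentiles such as p95 are usually more informative than the mean here, because a slowdown under load tends to show up in the slowest requests first.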

Results are versioned per slug. When auto-update promotes a new upstream model, telemetry shows exactly which version served each request, so teams can audit upgrades or roll back if a release misses the bar.
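As an illustration of how versioned results can drive a promote-or-roll-back decision, the sketch below keeps one record per upstream version and serves the newest version that does not regress against the last accepted one. The record fields, the slug name, the thresholds, and the version labels are all hypothetical; the real bar and data model are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    slug: str              # hypothetical slug name
    upstream_version: str  # hypothetical version label
    quality: float         # aggregate quality score, higher is better
    p95_latency_s: float   # 95th percentile latency in seconds


def meets_bar(candidate, previous, max_quality_drop=0.02, max_latency_ratio=1.10):
    """True if the candidate does not regress beyond the (assumed) tolerances."""
    quality_ok = candidate.quality >= previous.quality - max_quality_drop
    latency_ok = candidate.p95_latency_s <= previous.p95_latency_s * max_latency_ratio
    return quality_ok and latency_ok


def version_to_serve(history):
    """Given results ordered oldest to newest, return the newest accepted one.

    A version that misses the bar is skipped, which is equivalent to
    rolling back to the last accepted version.
    """
    accepted = history[0]
    for result in history[1:]:
        if meets_bar(result, accepted):
            accepted = result
    return accepted


# Made-up history, for illustration only.
history = [
    BenchmarkResult("simon-says-example", "v1", quality=0.80, p95_latency_s=1.00),
    BenchmarkResult("simon-says-example", "v2", quality=0.85, p95_latency_s=1.05),
    BenchmarkResult("simon-says-example", "v3", quality=0.70, p95_latency_s=0.90),
]
serving = version_to_serve(history)
```

In this made-up history, v3 is faster but drops too far in quality, so the slug keeps serving v2.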
