DualGauge-Bench
Leaderboard
Joint security-functionality evaluation of 10 LLMs and 3 agentic coding systems across 307 specification-only tasks.
| # | Model | Org | pass@1 | secure@1 | secure-pass@1 | PR | SPR | Type |
|---|---|---|---|---|---|---|---|---|
| 1 | Codex (GPT-5.4) | OpenAI | 46.7% | 50.7% | 24.3% | 71.7% | 78.5% | Agent |
| 2 | GPT-5 Medium | OpenAI | 38.6% | 34.5% | 14.8% | 73.2% | 72.7% | Proprietary |
| 3 | Claude Opus 4.7 (think-on) | Anthropic | 33.9% | 24% | 10.2% | 70.3% | 59.6% | Proprietary |
| 4 | OpenHands (GPT-5.4) | OpenHands | 27.3% | 29.6% | 9.5% | 49.4% | 63.2% | Agent |
| 5 | GPT-4.1 | OpenAI | 31.3% | 23.4% | 8.7% | 65.2% | 68.4% | Proprietary |
| 6 | Claude Sonnet 4.5 (think-off) | Anthropic | 31.3% | 20.2% | 8.4% | 68.5% | 57.1% | Proprietary |
| 7 | Qwen2.5 Coder 32B Instruct | Alibaba | 29.5% | 20.3% | 7.2% | 64.2% | 56.5% | Open |
| 8 | Gemma 3 27B IT | Google | 27.3% | 19.6% | 6.9% | 61.5% | 56.6% | Open |
| 9 | Claude Haiku 4.5 | Anthropic | 17.8% | 18.8% | 6.5% | 40.8% | 55.2% | Proprietary |
| 10 | Qwen3 14B | Alibaba | 27.4% | 17.5% | 5.6% | 60.2% | 56.5% | Open |
| 11 | Llama 3.1 8B Instruct (bf16) | Meta | 22.5% | 14.4% | 5.1% | 48.5% | 49.2% | Open |
| 12 | Claude Code (Opus 4.7) | Anthropic | 21.7% | 19.4% | 4.9% | 40.5% | 53.2% | Agent |
| 13 | Codestral 22B v0.1 | Mistral | 27.5% | 14.8% | 4.5% | 63.7% | 49.8% | Open |
13 models · 307 tasks · k=1 · Sorted by secure-pass@1 (descending)
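
The leaderboard does not spell out how the three @1 metrics are computed, so the following is only a minimal sketch under stated assumptions: each task yields a single sample (k=1), that sample either passes or fails a functional check and a security check, and secure-pass@1 counts tasks where both checks pass. The `TaskResult` type, its field names, and the `rate` and `summarize` helpers are hypothetical and are not taken from the benchmark's code. PR and SPR are left out because nothing here indicates how they are defined.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of the single (k=1) sample generated for one task.

    Both fields are assumptions made for illustration only.
    """
    passed_functional: bool  # assumed: the sample passed the functional check
    passed_security: bool    # assumed: the sample passed the security check


def rate(flags):
    """Return the share of True values as a percentage with one decimal."""
    flags = list(flags)
    if not flags:
        return 0.0
    return round(100.0 * sum(flags) / len(flags), 1)


def summarize(results):
    """Compute the three @1 metrics over a list of per-task results."""
    return {
        "pass@1": rate(r.passed_functional for r in results),
        "secure@1": rate(r.passed_security for r in results),
        # Joint metric: a task counts only when both checks succeed.
        "secure-pass@1": rate(
            r.passed_functional and r.passed_security for r in results
        ),
    }


# Four toy tasks covering every combination of outcomes.
results = [
    TaskResult(True, True),
    TaskResult(True, False),
    TaskResult(False, True),
    TaskResult(False, False),
]
print(summarize(results))
# {'pass@1': 50.0, 'secure@1': 50.0, 'secure-pass@1': 25.0}
```

Under this reading the joint rate can never exceed either individual rate, which is consistent with every row of the table above, where secure-pass@1 is lower than both pass@1 and secure@1.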