DualGauge-Bench

Leaderboard

Joint security-functionality evaluation of 10 LLMs and 3 agentic coding systems across 307 specification-only tasks.

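| # | Model | Org | pass@1 | secure@1 | secure-pass@1 | PR | SPR | Type |
|---|-------|-----|--------|----------|---------------|----|-----|------|
| 1 | Codex (GPT-5.4) | OpenAI | 46.7% | 50.7% | 24.3% | 71.7% | 78.5% | Agent |
| 2 | GPT-5 Medium | OpenAI | 38.6% | 34.5% | 14.8% | 73.2% | 72.7% | Proprietary |
| 3 | Claude Opus 4.7 (think-on) | Anthropic | 33.9% | 24% | 10.2% | 70.3% | 59.6% | Proprietary |
| 4 | OpenHands (GPT-5.4) | OpenHands | 27.3% | 29.6% | 9.5% | 49.4% | 63.2% | Agent |
| 5 | GPT-4.1 | OpenAI | 31.3% | 23.4% | 8.7% | 65.2% | 68.4% | Proprietary |
| 6 | Claude Sonnet 4.5 (think-off) | Anthropic | 31.3% | 20.2% | 8.4% | 68.5% | 57.1% | Proprietary |
| 7 | Qwen2.5 Coder 32B Instruct | Alibaba | 29.5% | 20.3% | 7.2% | 64.2% | 56.5% | Open |
| 8 | Gemma 3 27B IT | Google | 27.3% | 19.6% | 6.9% | 61.5% | 56.6% | Open |
| 9 | Claude Haiku 4.5 | Anthropic | 17.8% | 18.8% | 6.5% | 40.8% | 55.2% | Proprietary |
| 10 | Qwen3 14B | Alibaba | 27.4% | 17.5% | 5.6% | 60.2% | 56.5% | Open |
| 11 | Llama 3.1 8B Instruct (bf16) | Meta | 22.5% | 14.4% | 5.1% | 48.5% | 49.2% | Open |
| 12 | Claude Code (Opus 4.7) | Anthropic | 21.7% | 19.4% | 4.9% | 40.5% | 53.2% | Agent |
| 13 | Codestral 22B v0.1 | Mistral | 27.5% | 14.8% | 4.5% | 63.7% | 49.8% | Open |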
13 models · 307 tasks · k=1 · Sorted by secure-pass@1 (descending)
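
As a rough illustration of how the three @1 columns could relate, the sketch below computes them from per-task outcomes with one sample per task (k=1). The definitions used here are assumptions, not taken from the benchmark: pass@1 is treated as the fraction of tasks whose sample passes the functional tests, secure@1 as the fraction whose sample passes the security tests, and secure-pass@1 as the fraction passing both. The names `TaskResult` and `at_one_rates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of a single generated sample for one task (hypothetical record)."""
    functional_ok: bool  # sample passed the functional tests
    secure_ok: bool      # sample passed the security tests


def at_one_rates(results: list[TaskResult]) -> dict[str, float]:
    """Compute the three @1 rates under the ASSUMED definitions above.

    With k=1 there is exactly one sample per task, so each rate is a
    plain fraction of tasks.
    """
    n = len(results)
    if n == 0:
        raise ValueError("at least one task result is required")
    passed = sum(r.functional_ok for r in results)
    secure = sum(r.secure_ok for r in results)
    both = sum(r.functional_ok and r.secure_ok for r in results)
    return {
        "pass@1": passed / n,
        "secure@1": secure / n,
        "secure-pass@1": both / n,
    }


if __name__ == "__main__":
    demo = [
        TaskResult(True, True),
        TaskResult(True, False),
        TaskResult(False, True),
        TaskResult(False, False),
    ]
    for name, value in at_one_rates(demo).items():
        print(f"{name}: {value:.1%}")
```

Under these assumed definitions the joint rate can never exceed either individual rate, which matches the pattern in every row of the table: secure-pass@1 is always below both pass@1 and secure@1.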
