AI BENCHY

AI Benchmark Leaderboard

Name: AI BENCHY Model Benchmark Results
Creator: AI BENCHY
License: https://aibenchy.com/methodology/

Last updated at: 2026-04-21
Models Evaluated: 99

Rank	Model	Score	Company	Total Cost	Response Time (avg)
#1	Gemini 3 Flash Preview (medium)	10.0	Google	$0.314	17.60s
Total Tests: 18; Wrong Tests: 0; Attempt pass rate: 100.0%; Flaky tests: 0; Output Tokens: 2,072; Reasoning Tokens: 97,041; Response time: avg 17.60s · total 193.57s · max 79.71s; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 10.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#2	Gemini 3.1 Pro Preview (medium)	9.6	Google	$0.578	15.96s
Total Tests: 18; Wrong Tests: 1; Attempt pass rate: 94.4%; Flaky tests: 0; Output Tokens: 1,932; Reasoning Tokens: 40,542; Response time: avg 15.96s · total 175.52s · max 40.61s; Wrong answer: 1; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 9.5; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#3	Claude Opus 4.7 (medium)	9.2	Anthropic	$0.447	3.53s
Total Tests: 18; Wrong Tests: 2; Attempt pass rate: 88.9%; Flaky tests: 0; Output Tokens: 5,375; Reasoning Tokens: 1,341; Response time: avg 3.53s · total 60.03s · max 21.45s; Timed out: 1; Wrong answer: 1; Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#4	Claude Opus 4.7 (none)	9.2	Anthropic	$0.505	3.13s
Total Tests: 18; Wrong Tests: 2; Attempt pass rate: 88.9%; Flaky tests: 0; Output Tokens: 6,326; Reasoning Tokens: 0; Response time: avg 3.13s · total 56.33s · max 18.27s; Wrong answer: 2; Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 9.5; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#5	Gemini 3 Flash Preview (low)	8.8	Google	$0.091	6.01s
Total Tests: 18; Wrong Tests: 3; Attempt pass rate: 85.2%; Flaky tests: 1; Output Tokens: 2,018; Reasoning Tokens: 23,273; Response time: avg 6.01s · total 108.12s · max 14.72s; Wrong answer: 3; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 10.0; Tool Calling: 10.0
#6	Seed-2.0-Lite (medium)	8.6	Bytedance Seed	$0.121	30.37s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 83.3%; Flaky tests: 3; Output Tokens: 3,257; Reasoning Tokens: 52,042; Response time: avg 30.37s · total 546.72s · max 168.71s; Wrong answer: 3; Did not follow instructions: 2; Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 6.7; Instructions following: 10.0; Puzzle Solving: 9.0; Tool Calling: 10.0
#7	GPT-5.3-Codex (medium)	8.6	OpenAI	$0.573	15.38s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 83.3%; Flaky tests: 3; Output Tokens: 2,279; Reasoning Tokens: 35,179; Response time: avg 15.38s · total 276.91s · max 100.93s; Wrong answer: 3; Did not follow instructions: 2; Anti-AI Tricks: 8.7; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 4.6; Instructions following: 10.0; Puzzle Solving: 9.0; Tool Calling: 10.0
#8	Qwen3.5 Plus 2026-02-15 (medium)	8.5	Qwen	$0.220	46.56s
Total Tests: 18; Wrong Tests: 4; Attempt pass rate: 83.3%; Flaky tests: 2; Output Tokens: 2,121; Reasoning Tokens: 111,889; Response time: avg 46.56s · total 512.20s · max 120.91s; Timed out: 2; Wrong answer: 2; Anti-AI Tricks: 8.2; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.7; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#9	Qwen3.6 Plus Preview (medium)	8.5	Qwen	$0.000	13.94s
Total Tests: 17; Wrong Tests: 4; Attempt pass rate: 76.5%; Flaky tests: 0; Output Tokens: 1,756; Reasoning Tokens: 77,213; Response time: avg 13.94s · total 237.01s · max 43.55s; Wrong answer: 3; Did not follow instructions: 1; Anti-AI Tricks: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 5.1; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#10	Qwen3.5-27B (medium)	8.4	Qwen	$0.497	53.03s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 81.5%; Flaky tests: 3; Output Tokens: 2,500; Reasoning Tokens: 242,500; Response time: avg 53.03s · total 954.46s · max 163.96s; Did not follow instructions: 2; Extra formatting: 1; Timed out: 1; Wrong answer: 1; Anti-AI Tricks: 8.7; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0
#11	Gemini 3.1 Flash Lite Preview (high)	8.4	Google	$2.310	68.83s
Total Tests: 16; Wrong Tests: 4; Attempt pass rate: 77.1%; Flaky tests: 1; Output Tokens: 1,283; Reasoning Tokens: 1,533,310; Response time: avg 68.83s · total 1101.32s · max 280.52s; Wrong answer: 3; Did not follow instructions: 1; Anti-AI Tricks: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 7.9; Puzzle Solving: 7.7; Tool Calling: 10.0
#12	Gemini 3 PRO Preview (medium)	8.4	Google	$0.197	9.06s
Total Tests: 18; Wrong Tests: 4; Attempt pass rate: 77.8%; Flaky tests: 0; Output Tokens: 1,508; Reasoning Tokens: 10,084; Response time: avg 9.06s · total 90.58s · max 26.24s; Wrong answer: 3; API error: 1; Anti-AI Tricks: 10.0; Coding: 3.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.8; Puzzle Solving: 10.0; Tool Calling: 10.0
#13	GLM 5 (medium)	8.4	Z.ai	$0.155	23.34s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 85.2%; Flaky tests: 4; Output Tokens: 20,163; Reasoning Tokens: 58,337; Response time: avg 23.34s · total 233.40s · max 79.09s; Wrong answer: 2; Did not follow instructions: 1; No answer: 1; Timed out: 1; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 7.1; Domain specific: 3.5; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#14	Gemma 4 31B (medium)	8.3	Google	$0.018	24.88s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 79.6%; Flaky tests: 2; Output Tokens: 12,734; Reasoning Tokens: 27,950; Response time: avg 24.88s · total 398.13s · max 70.97s; API error: 2; Did not follow instructions: 1; Timed out: 1; Wrong answer: 1; Anti-AI Tricks: 10.0; Coding: 4.7; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 8.8; Tool Calling: 3.0
#15	Gemini 2.5 Flash (medium)	8.2	Google	$0.319	12.12s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 75.9%; Flaky tests: 1; Output Tokens: 1,898; Reasoning Tokens: 122,273; Response time: avg 12.12s · total 218.12s · max 95.48s; Wrong answer: 4; Did not follow instructions: 1; Anti-AI Tricks: 8.4; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 4.8; Instructions following: 9.8; Puzzle Solving: 7.7; Tool Calling: 10.0
#16	GPT-5.4 (medium)	8.2	OpenAI	$0.832	18.63s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 79.6%; Flaky tests: 3; Output Tokens: 2,169; Reasoning Tokens: 48,732; Response time: avg 18.63s · total 335.26s · max 100.41s; Wrong answer: 3; Did not follow instructions: 2; Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.7; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0
#17	Gemini 3.1 Flash Lite Preview (medium)	8.2	Google	$0.055	3.74s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 72.2%; Flaky tests: 0; Output Tokens: 2,168; Reasoning Tokens: 29,030; Response time: avg 3.74s · total 67.31s · max 14.93s; Wrong answer: 4; Did not follow instructions: 1; Anti-AI Tricks: 9.1; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0
#18	GLM 5 Turbo (medium)	8.1	Z.ai	$0.182	17.67s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 77.8%; Flaky tests: 5; Output Tokens: 12,197; Reasoning Tokens: 38,933; Response time: avg 17.67s · total 317.98s · max 194.23s; Wrong answer: 3; Did not follow instructions: 2; Timed out: 1; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 7.3; Tool Calling: 10.0
#19	Qwen3.5-122B-A10B (medium)	8.1	Qwen	$0.528	31.38s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 79.6%; Flaky tests: 3; Output Tokens: 17,635; Reasoning Tokens: 162,668; Response time: avg 31.38s · total 564.84s · max 119.29s; Wrong answer: 3; Timed out: 2; Anti-AI Tricks: 10.0; Coding: 4.7; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 3.4; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#20	Qwen3.6 Plus (medium)	8.1	Qwen	$0.000	15.27s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 74.1%; Flaky tests: 1; Output Tokens: 1,763; Reasoning Tokens: 83,782; Response time: avg 15.27s · total 259.55s · max 43.55s; Wrong answer: 3; API error: 1; Did not follow instructions: 1; Anti-AI Tricks: 10.0; Coding: 3.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 5.1; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#21	Gemini 3 Flash Preview (none)	8.1	Google	$0.021	1.65s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 77.8%; Flaky tests: 2; Output Tokens: 1,840; Reasoning Tokens: 0; Response time: avg 1.65s · total 18.20s · max 3.56s; Wrong answer: 5; Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 4.7; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 6.4; Puzzle Solving: 7.7; Tool Calling: 10.0
#22	Gemini 3.1 Flash Lite Preview (low)	8.1	Google	$0.022	3.22s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 72.2%; Flaky tests: 0; Output Tokens: 2,247; Reasoning Tokens: 8,058; Response time: avg 3.22s · total 58.00s · max 11.91s; Wrong answer: 4; Did not follow instructions: 1; Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#23	MiMo-V2-Pro (medium)	8.1	Xiaomi	$0.159	12.27s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 77.8%; Flaky tests: 3; Output Tokens: 2,360; Reasoning Tokens: 38,320; Response time: avg 12.27s · total 208.56s · max 64.71s; Wrong answer: 3; Extra formatting: 1; Did not follow instructions: 1; Timed out: 1; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 4.7; Data parsing and extraction: 7.3; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 7.0; Tool Calling: 10.0
#24	Gemma 4 26B A4B (medium)	8.0	Google	$0.028	25.03s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 75.9%; Flaky tests: 2; Output Tokens: 15,928; Reasoning Tokens: 44,631; Response time: avg 25.03s · total 425.48s · max 147.47s; Timed out: 2; Wrong answer: 2; Did not follow instructions: 1; Anti-AI Tricks: 10.0; Coding: 2.8; Combined: 9.6; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 7.9; Tool Calling: 10.0
#25	Grok 4.20 Beta (medium)	8.0	X AI	$0.633	9.81s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 74.1%; Flaky tests: 2; Output Tokens: 1,568; Reasoning Tokens: 91,909; Response time: avg 9.81s · total 176.62s · max 31.36s; Did not follow instructions: 3; Wrong answer: 3; Anti-AI Tricks: 8.7; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 8.3; Puzzle Solving: 8.2; Tool Calling: 3.0
#26	Claude Sonnet 4.6 (medium)	8.0	Anthropic	$1.161	12.66s
Total Tests: 18; Wrong Tests: 5; Attempt pass rate: 74.1%; Flaky tests: 1; Output Tokens: 42,068; Reasoning Tokens: 26,784; Response time: avg 12.66s · total 126.62s · max 46.35s; Extra formatting: 2; Wrong answer: 2; Timed out: 1; Anti-AI Tricks: 6.5; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#27	DeepSeek V3.2 (medium)	8.0	DeepSeek	$0.029	46.41s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 79.6%; Flaky tests: 4; Output Tokens: 10,620; Reasoning Tokens: 48,511; Response time: avg 46.41s · total 835.33s · max 180.92s; Wrong answer: 3; Timed out: 2; Did not follow instructions: 1; Anti-AI Tricks: 8.4; Coding: 4.7; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.4; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0
#28	GPT-5.2 Chat (none)	7.9	OpenAI	$0.291	6.84s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 75.9%; Flaky tests: 3; Output Tokens: 17,346; Reasoning Tokens: 0; Response time: avg 6.84s · total 123.17s · max 38.52s; Wrong answer: 5; Did not follow instructions: 1; Anti-AI Tricks: 8.7; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.4; Instructions following: 7.5; Puzzle Solving: 7.7; Tool Calling: 10.0
#29	Gemini 3.1 Flash Lite Preview (none)	7.9	Google	$0.016	1.30s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 70.4%; Flaky tests: 1; Output Tokens: 5,361; Reasoning Tokens: 0; Response time: avg 1.30s · total 23.42s · max 3.39s; Wrong answer: 4; Did not follow instructions: 2; Anti-AI Tricks: 7.5; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0
#30	Step 3.5 Flash (medium)	7.9	Stepfun	$0.000	26.78s
Total Tests: 17; Wrong Tests: 6; Attempt pass rate: 70.6%; Flaky tests: 2; Output Tokens: 71,904; Reasoning Tokens: 155,607; Response time: avg 26.78s · total 294.58s · max 170.45s; Did not follow instructions: 3; Wrong answer: 3; Anti-AI Tricks: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.5; Instructions following: 8.5; Puzzle Solving: 5.3; Tool Calling: 10.0
#31	GLM 5V Turbo (medium)	7.8	Z.ai	$0.291	14.96s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 77.8%; Flaky tests: 6; Output Tokens: 2,351; Reasoning Tokens: 58,941; Response time: avg 14.96s · total 269.32s · max 67.08s; Wrong answer: 3; Did not follow instructions: 2; Invalid tool call: 2; Anti-AI Tricks: 7.2; Coding: 10.0; Combined: 6.9; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 7.7; Tool Calling: 7.0
#32	Qwen3.5-Flash (medium)	7.8	Qwen	$0.080	66.72s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 81.5%; Flaky tests: 6; Output Tokens: 2,073; Reasoning Tokens: 191,899; Response time: avg 66.72s · total 1201.03s · max 234.29s; Timed out: 4; API error: 1; Did not follow instructions: 1; Wrong answer: 1; Anti-AI Tricks: 10.0; Coding: 4.7; Combined: 10.0; Data parsing and extraction: 7.3; Domain specific: 5.3; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 6.4; Tool Calling: 10.0
#33	GLM 5.1 (medium)	7.8	Z.ai	$0.201	24.13s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 75.9%; Flaky tests: 3; Output Tokens: 8,005; Reasoning Tokens: 49,090; Response time: avg 24.13s · total 410.25s · max 118.52s; Wrong answer: 3; Timed out: 2; API error: 1; Anti-AI Tricks: 10.0; Coding: 4.7; Combined: 9.5; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 6.4; Puzzle Solving: 8.2; Tool Calling: 3.0
#34	Kimi K2.6 (medium)	7.7	Moonshot AI	$0.722	45.20s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 74.1%; Flaky tests: 4; Output Tokens: 80,759; Reasoning Tokens: 179,814; Response time: avg 45.20s · total 768.37s · max 215.85s; Did not follow instructions: 3; Timed out: 2; Wrong answer: 2; Anti-AI Tricks: 7.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 5.0; Tool Calling: 10.0
#35	MiMo-V2-Omni (medium)	7.7	Xiaomi	$0.153	16.76s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 61.1%; Flaky tests: 0; Output Tokens: 928; Reasoning Tokens: 72,661; Response time: avg 16.76s · total 301.61s · max 158.78s; Wrong answer: 3; Did not follow instructions: 2; Extra formatting: 1; No answer: 1; Anti-AI Tricks: 10.0; Coding: 4.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 10.0; Instructions following: 8.3; Puzzle Solving: 6.5; Tool Calling: 10.0
#36	GPT-5.3 Chat (none)	7.7	OpenAI	$0.340	5.88s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 68.5%; Flaky tests: 3; Output Tokens: 20,784; Reasoning Tokens: 0; Response time: avg 5.88s · total 105.90s · max 18.33s; Wrong answer: 5; Did not follow instructions: 2; Anti-AI Tricks: 6.7; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.5; General Intelligence: 4.6; Instructions following: 8.3; Puzzle Solving: 10.0; Tool Calling: 10.0
#37	Claude Opus 4.6 (medium)	7.6	Anthropic	$1.446	21.08s
Total Tests: 18; Wrong Tests: 6; Attempt pass rate: 70.4%; Flaky tests: 2; Output Tokens: 29,829; Reasoning Tokens: 18,938; Response time: avg 21.08s · total 231.84s · max 83.40s; Extra formatting: 4; Wrong answer: 2; Anti-AI Tricks: 6.4; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0
#38	GPT-5.4 Nano (medium)	7.6	OpenAI	$0.083	11.21s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 68.5%; Flaky tests: 2; Output Tokens: 2,946; Reasoning Tokens: 58,132; Response time: avg 11.21s · total 201.80s · max 94.06s; Wrong answer: 4; Did not follow instructions: 3; Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 9.8; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 4.5; Instructions following: 9.8; Puzzle Solving: 4.0; Tool Calling: 10.0
#39	Seed-2.0-Mini (medium)	7.5	Bytedance Seed	$0.037	69.70s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 66.7%; Flaky tests: 2; Output Tokens: 2,419; Reasoning Tokens: 79,238; Response time: avg 69.70s · total 1045.47s · max 262.83s; Timed out: 4; Wrong answer: 2; Did not follow instructions: 1; Anti-AI Tricks: 6.6; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 5.1; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0
#40	GPT-5.2 (medium)	7.5	OpenAI	$0.352	14.04s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 72.2%; Flaky tests: 4; Output Tokens: 2,705; Reasoning Tokens: 18,977; Response time: avg 14.04s · total 154.41s · max 77.80s; Did not follow instructions: 3; Wrong answer: 2; No answer: 1; Timed out: 1; Anti-AI Tricks: 6.5; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 3.7; Instructions following: 9.9; Puzzle Solving: 7.7; Tool Calling: 4.7
#41	MiMo-V2-Flash (medium)	7.5	Xiaomi	$0.038	23.36s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 70.4%; Flaky tests: 3; Output Tokens: 12,387; Reasoning Tokens: 115,182; Response time: avg 23.36s · total 280.34s · max 96.01s; Wrong answer: 3; API error: 1; Extra formatting: 1; Did not follow instructions: 1; Timed out: 1; Anti-AI Tricks: 8.1; Coding: 4.7; Combined: 9.8; Data parsing and extraction: 6.5; Domain specific: 5.9; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0
#42	Claude Sonnet 4.6 (none)	7.4	Anthropic	$0.262	4.98s
Total Tests: 18; Wrong Tests: 7; Attempt pass rate: 64.8%; Flaky tests: 1; Output Tokens: 7,433; Reasoning Tokens: 0; Response time: avg 4.98s · total 54.83s · max 23.84s; Extra formatting: 3; Wrong answer: 3; Did not follow instructions: 1; Anti-AI Tricks: 4.8; Coding: 10.0; Combined: 9.5; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 6.1; Instructions following: 6.5; Puzzle Solving: 7.7; Tool Calling: 10.0
#43	Qwen3.5-35B-A3B (medium)	7.4	Qwen	$0.398	44.51s
Total Tests: 18; Wrong Tests: 8; Attempt pass rate: 79.6%; Flaky tests: 7; Output Tokens: 10,137; Reasoning Tokens: 208,761; Response time: avg 44.51s · total 801.21s · max 106.00s; Timed out: 4; Wrong answer: 2; API error: 1; No answer: 1; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 4.7; Data parsing and extraction: 7.3; Domain specific: 4.1; General Intelligence: 2.8; Instructions following: 10.0; Puzzle Solving: 6.4; Tool Calling: 10.0
#44	GPT-5.4 Mini (medium)	7.3	OpenAI	$0.299	15.22s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 70.4%; Flaky tests: 6; Output Tokens: 2,131; Reasoning Tokens: 59,567; Response time: avg 15.22s · total 273.90s · max 102.91s; Did not follow instructions: 5; Wrong answer: 4; Anti-AI Tricks: 8.6; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 4.1; General Intelligence: 4.5; Instructions following: 7.4; Puzzle Solving: 6.8; Tool Calling: 4.7
#45	GPT-5 Mini (medium)	7.0	OpenAI	$0.128	23.98s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 61.1%; Flaky tests: 3; Output Tokens: 6,379; Reasoning Tokens: 53,482; Response time: avg 23.98s · total 431.56s · max 88.15s; Did not follow instructions: 4; Wrong answer: 4; Timed out: 1; Anti-AI Tricks: 7.1; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 4.5; Instructions following: 8.0; Puzzle Solving: 5.6; Tool Calling: 10.0
#46	Kimi K2.5 (medium)	7.0	Moonshot AI	$0.220	72.43s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 72.2%; Flaky tests: 7; Output Tokens: 42,176; Reasoning Tokens: 84,870; Response time: avg 72.43s · total 796.70s · max 150.77s; Wrong answer: 4; Did not follow instructions: 2; Timed out: 2; No answer: 1; Anti-AI Tricks: 7.3; Coding: 4.7; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.5; General Intelligence: 6.5; Instructions following: 10.0; Puzzle Solving: 5.3; Tool Calling: 10.0
#47	Grok 4.20 (medium)	7.0	X AI	$0.743	10.33s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 66.7%; Flaky tests: 5; Output Tokens: 1,744; Reasoning Tokens: 109,882; Response time: avg 10.33s · total 185.87s · max 29.87s; Did not follow instructions: 4; Wrong answer: 3; API error: 1; Extra formatting: 1; Anti-AI Tricks: 8.2; Coding: 4.3; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.8; Instructions following: 7.3; Puzzle Solving: 6.4; Tool Calling: 3.0
#48	Gemma 4 31B (none)	6.9	Google	$0.003	4.02s
Total Tests: 18; Wrong Tests: 8; Attempt pass rate: 55.6%; Flaky tests: 0; Output Tokens: 1,359; Reasoning Tokens: 0; Response time: avg 4.02s · total 64.33s · max 26.13s; Wrong answer: 5; API error: 2; Did not follow instructions: 1; Anti-AI Tricks: 6.5; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 6.5; Puzzle Solving: 5.5; Tool Calling: 3.0
#49	Qwen3.5 Plus 2026-02-15 (none)	6.8	Qwen	$0.017	2.60s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 53.7%; Flaky tests: 2; Output Tokens: 2,461; Reasoning Tokens: 0; Response time: avg 2.60s · total 31.23s · max 6.65s; Wrong answer: 9; Anti-AI Tricks: 4.8; Coding: 6.3; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.4; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0
#50	Hunter Alpha (medium)	6.7	OpenRouter	$0.000	10.33s
Total Tests: 18; Wrong Tests: 10; Attempt pass rate: 64.8%; Flaky tests: 6; Output Tokens: 4,724; Reasoning Tokens: 17,921; Response time: avg 10.33s · total 175.60s · max 30.53s; Wrong answer: 4; Did not follow instructions: 2; Timed out: 2; API error: 1; Extra formatting: 1; Anti-AI Tricks: 7.3; Coding: 3.0; Combined: 4.7; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 7.0; Instructions following: 9.9; Puzzle Solving: 6.1; Tool Calling: 10.0
#51	Nemotron 3 Super (medium)	6.7	NVIDIA	$0.000	19.06s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 55.6%; Flaky tests: 3; Output Tokens: 11,947; Reasoning Tokens: 29,768; Response time: avg 19.06s · total 305.04s · max 87.80s; Did not follow instructions: 4; Wrong answer: 3; API error: 1; Timed out: 1; Anti-AI Tricks: 10.0; Coding: 3.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 3.8; Instructions following: 7.2; Puzzle Solving: 3.5; Tool Calling: 10.0
#52	Grok 4.1 Fast (medium)	6.7	X AI	$0.056	23.88s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 64.8%; Flaky tests: 6; Output Tokens: 2,010; Reasoning Tokens: 91,298; Response time: avg 23.88s · total 262.66s · max 121.79s; Did not follow instructions: 4; Wrong answer: 3; No answer: 1; Timed out: 1; Anti-AI Tricks: 8.7; Coding: 2.3; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.8; General Intelligence: 4.2; Instructions following: 6.6; Puzzle Solving: 5.3; Tool Calling: 2.8
#53	GLM 5 (none)	6.6	Z.ai	$0.020	4.23s
Total Tests: 18; Wrong Tests: 9; Attempt pass rate: 51.9%; Flaky tests: 1; Output Tokens: 1,959; Reasoning Tokens: 0; Response time: avg 4.23s · total 46.51s · max 11.07s; Wrong answer: 9; Anti-AI Tricks: 4.8; Coding: 5.6; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0
#54	Mercury 2 (medium)	6.5	Inception	$0.047	2.21s
Total Tests: 18; Wrong Tests: 10; Attempt pass rate: 53.7%; Flaky tests: 3; Output Tokens: 3,972; Reasoning Tokens: 48,333; Response time: avg 2.21s · total 37.51s · max 14.63s; Wrong answer: 6; Did not follow instructions: 4; Anti-AI Tricks: 6.9; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 7.3; Domain specific: 2.9; General Intelligence: 4.8; Instructions following: 10.0; Puzzle Solving: 3.9; Tool Calling: 10.0
#55	MiMo-V2-Omni (none)	6.5	Xiaomi	$0.007	1.99s
Total Tests: 18; Wrong Tests: 10; Attempt pass rate: 44.4%; Flaky tests: 0; Output Tokens: 868; Reasoning Tokens: 0; Response time: avg 1.99s · total 35.81s · max 6.81s; Wrong answer: 8; Did not follow instructions: 2; Anti-AI Tricks: 4.8; Coding: 6.6; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.5; Instructions following: 6.5; Puzzle Solving: 8.0; Tool Calling: 10.0
#56	Grok 4.20 Multi Agent Beta (medium)	6.4	X AI	$5.074	9.80s
Total Tests: 18; Wrong Tests: 11; Attempt pass rate: 57.4%; Flaky tests: 6; Output Tokens: 299,034; Reasoning Tokens: 309,670; Response time: avg 9.80s · total 156.75s · max 35.28s; Did not follow instructions: 4; Wrong answer: 3; API error: 2; Extra formatting: 2; Anti-AI Tricks: 6.9; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 5.8; Instructions following: 8.3; Puzzle Solving: 7.2; Tool Calling: 3.0
#57	GPT-5 Nano (medium)	6.3	OpenAI	$0.066	44.13s
Total Tests: 18; Wrong Tests: 11; Attempt pass rate: 59.3%; Flaky tests: 8; Output Tokens: 4,980; Reasoning Tokens: 156,288; Response time: avg 44.13s · total 485.47s · max 204.02s; Wrong answer: 7; Did not follow instructions: 3; Timed out: 1; Anti-AI Tricks: 6.5; Coding: 6.7; Combined: 10.0; Data parsing and extraction: 3.7; Domain specific: 5.2; General Intelligence: 4.1; Instructions following: 8.5; Puzzle Solving: 5.3; Tool Calling: 10.0
#58	GLM 5V Turbo (none)	6.2	Z.ai	$0.044	3.10s
Total Tests: 18; Wrong Tests: 10; Attempt pass rate: 44.4%; Flaky tests: 0; Output Tokens: 1,724; Reasoning Tokens: 0; Response time: avg 3.10s · total 55.87s · max 6.51s; Wrong answer: 8; Did not follow instructions: 2; Anti-AI Tricks: 4.8; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.6; Instructions following: 6.5; Puzzle Solving: 5.3; Tool Calling: 10.0
#59	Qwen3.5-Flash (none)	6.2	Qwen	$0.006	3.25s
Total Tests: 18; Wrong Tests: 10; Attempt pass rate: 46.3%; Flaky tests: 1; Output Tokens: 4,266; Reasoning Tokens: 0; Response time: avg 3.25s · total 58.44s · max 13.73s; Wrong answer: 9; Did not follow instructions: 1; Anti-AI Tricks: 3.5; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 6.3; Puzzle Solving: 3.3; Tool Calling: 10.0
#60	Gemma 4 26B A4B (none)	6.2	Google	$0.005	6.59s
Total Tests: 18; Wrong Tests: 11; Attempt pass rate: 48.2%; Flaky tests: 3; Output Tokens: 1,783; Reasoning Tokens: 0; Response time: avg 6.59s · total 118.61s · max 57.10s; Wrong answer: 7; Did not follow instructions: 3; Timed out: 1; Anti-AI Tricks: 8.3; Coding: 4.7; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 4.0; Instructions following: 4.4; Puzzle Solving: 5.7; Tool Calling: 10.0
#61	Seed-2.0-Lite (none)	6.2	Bytedance Seed	$0.016	2.53s
Total Tests: 18; Wrong Tests: 10; Attempt pass rate: 55.6%; Flaky tests: 5; Output Tokens: 3,129; Reasoning Tokens: 0; Response time: avg 2.53s · total 45.46s · max 6.70s; Wrong answer: 10; Anti-AI Tricks: 3.0; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 5.2; Tool Calling: 10.0
#62	Gemini 2.5 Flash (none)	6.2	Google	$0.013	903ms
Total Tests: 18; Wrong Tests: 11; Attempt pass rate: 44.4%; Flaky tests: 2; Output Tokens: 1,726; Reasoning Tokens: 0; Response time: avg 903ms · total 16.26s · max 4.39s; Wrong answer: 10; Did not follow instructions: 1; Anti-AI Tricks: 3.0; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 5.0; Instructions following: 8.0; Puzzle Solving: 5.7; Tool Calling: 10.0
#63	Qwen3.5-35B-A3B (none)	6.1	Qwen	$0.016	3.82s
Total Tests: 18; Wrong Tests: 11; Attempt pass rate: 50.0%; Flaky tests: 3; Output Tokens: 4,300; Reasoning Tokens: 0; Response time: avg 3.82s · total 68.74s · max 47.43s; Wrong answer: 9; Did not follow instructions: 2; Anti-AI Tricks: 3.4; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 6.5; Instructions following: 6.3; Puzzle Solving: 3.9; Tool Calling: 10.0
#64	DeepSeek V3.2 (none)	6.1	DeepSeek	$0.016	12.09s
Total Tests: 18; Wrong Tests: 11; Attempt pass rate: 50.0%; Flaky tests: 4; Output Tokens: 8,384; Reasoning Tokens: 0; Response time: avg 12.09s · total 217.56s · max 115.89s; Wrong answer: 8; Extra formatting: 2; Invalid tool call: 1; Anti-AI Tricks: 3.2; Coding: 2.4; Combined: 6.5; Data parsing and extraction: 6.3; Domain specific: 3.6; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 8.5; Tool Calling: 10.0
#65	MiMo-V2-Pro (none)	6.0	Xiaomi	$0.043	2.39s
Total Tests: 18; Wrong Tests: 11; Attempt pass rate: 48.2%; Flaky tests: 3; Output Tokens: 2,320; Reasoning Tokens: 0; Response time: avg 2.39s · total 43.06s · max 6.58s; Wrong answer: 9; Did not follow instructions: 2; Anti-AI Tricks: 3.5; Coding: 10.0; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.3; Instructions following: 6.5; Puzzle Solving: 6.0; Tool Calling: 10.0
#66	GPT-5.4 (none)	5.9	OpenAI	$0.104	1.51s
View model card Total Tests: 18 Wrong Tests: 11 Attempt pass rate: 42.6% Flaky tests: 2 Output Tokens: 2,317 Reasoning Tokens: 0 Response time: avg 1.51s · total 27.21s · max 2.95s Wrong answer: 10 Did not follow instructions: 1 Anti-AI Tricks : 3.2 Coding : 10.0 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.4 Instructions following : 6.5 Puzzle Solving : 5.6 Tool Calling : 10.0
#67	Qwen3.5-27B (none)	5.9	Qwen	$0.016	1.74s
View model card Total Tests: 18 Wrong Tests: 12 Attempt pass rate: 38.9% Flaky tests: 2 Output Tokens: 3,545 Reasoning Tokens: 0 Response time: avg 1.74s · total 31.32s · max 9.39s Wrong answer: 10 Did not follow instructions: 2 Anti-AI Tricks : 4.8 Coding : 10.0 Combined : 2.8 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 5.0 Instructions following : 4.8 Puzzle Solving : 6.7 Tool Calling : 10.0
#68	gpt-oss-120b (medium)	5.8	OpenAI	$0.011	16.08s
View model card Total Tests: 18 Wrong Tests: 11 Attempt pass rate: 51.9% Flaky tests: 6 Output Tokens: 13,493 Reasoning Tokens: 36,879 Response time: avg 16.08s · total 176.88s · max 50.92s Wrong answer: 7 Did not follow instructions: 4 Anti-AI Tricks : 6.7 Coding : 4.3 Combined : 10.0 Data parsing and extraction : 6.4 Domain specific : 2.9 General Intelligence : 4.3 Instructions following : 9.9 Puzzle Solving : 3.2 Tool Calling : 9.8
#69	Kimi K2.6 (none)	5.8	Moonshot AI	$0.038	2.05s
View model card Total Tests: 18 Wrong Tests: 11 Attempt pass rate: 42.6% Flaky tests: 2 Output Tokens: 2,973 Reasoning Tokens: 0 Response time: avg 2.05s · total 36.93s · max 6.65s Wrong answer: 8 Did not follow instructions: 3 Anti-AI Tricks : 4.6 Coding : 10.0 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 5.4 Instructions following : 6.5 Puzzle Solving : 3.4 Tool Calling : 10.0
#70	Qwen3.5-122B-A10B (none)	5.7	Qwen	$0.022	3.69s
View model card Total Tests: 18 Wrong Tests: 12 Attempt pass rate: 38.9% Flaky tests: 2 Output Tokens: 3,341 Reasoning Tokens: 0 Response time: avg 3.69s · total 66.50s · max 46.00s Wrong answer: 11 Did not follow instructions: 1 Anti-AI Tricks : 4.8 Coding : 4.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 5.0 Instructions following : 4.5 Puzzle Solving : 5.4 Tool Calling : 10.0
#71	MiniMax M2.5 (medium)	5.7	Minimax	$0.250	39.65s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 57.4% Flaky tests: 10 Output Tokens: 107,044 Reasoning Tokens: 206,422 Response time: avg 39.65s · total 396.47s · max 237.27s Wrong answer: 5 Timed out: 4 Did not follow instructions: 3 Invalid tool call: 1 Anti-AI Tricks : 7.9 Coding : 3.0 Combined : 4.5 Data parsing and extraction : 4.6 Domain specific : 2.9 General Intelligence : 3.8 Instructions following : 8.1 Puzzle Solving : 5.3 Tool Calling : 10.0
#72	Hunter Alpha (none)	5.7	OpenRouter	$0.000	4.58s
View model card Total Tests: 18 Wrong Tests: 12 Attempt pass rate: 46.3% Flaky tests: 4 Output Tokens: 2,278 Reasoning Tokens: 0 Response time: avg 4.58s · total 77.92s · max 15.17s Wrong answer: 9 Did not follow instructions: 2 API error: 1 Anti-AI Tricks : 3.5 Coding : 3.0 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 6.1 Instructions following : 6.4 Puzzle Solving : 5.8 Tool Calling : 10.0
#73	Mistral Small 4 (medium)	5.7	Mistral	$0.034	5.64s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 50.0% Flaky tests: 7 Output Tokens: 15,084 Reasoning Tokens: 39,408 Response time: avg 5.64s · total 101.52s · max 30.49s Wrong answer: 8 Did not follow instructions: 3 API error: 2 Anti-AI Tricks : 5.6 Coding : 6.7 Combined : 3.0 Data parsing and extraction : 7.3 Domain specific : 5.3 General Intelligence : 4.8 Instructions following : 7.3 Puzzle Solving : 3.4 Tool Calling : 10.0
#74	GLM 4.7 Flash (none)	5.6	Z.ai	$0.003	3.35s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 37.0% Flaky tests: 3 Output Tokens: 2,489 Reasoning Tokens: 0 Response time: avg 3.35s · total 36.90s · max 7.05s Wrong answer: 10 Did not follow instructions: 2 Invalid tool call: 1 Anti-AI Tricks : 5.2 Coding : 6.4 Combined : 3.0 Data parsing and extraction : 7.3 Domain specific : 7.7 General Intelligence : 4.0 Instructions following : 6.5 Puzzle Solving : 4.4 Tool Calling : 2.8
#75	GLM 5.1 (none)	5.6	Z.ai	$0.053	4.33s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 37.0% Flaky tests: 4 Output Tokens: 3,720 Reasoning Tokens: 0 Response time: avg 4.33s · total 78.02s · max 32.57s Wrong answer: 10 Did not follow instructions: 2 Invalid tool call: 1 Anti-AI Tricks : 4.0 Coding : 5.1 Combined : 2.8 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 5.0 Instructions following : 8.3 Puzzle Solving : 5.7 Tool Calling : 10.0
#76	Kimi K2.5 (none)	5.5	Moonshot AI	$0.017	13.37s
View model card Total Tests: 18 Wrong Tests: 12 Attempt pass rate: 40.7% Flaky tests: 3 Output Tokens: 2,659 Reasoning Tokens: 0 Response time: avg 13.37s · total 147.05s · max 42.13s Wrong answer: 12 Anti-AI Tricks : 3.6 Coding : 10.0 Combined : 2.8 Data parsing and extraction : 7.3 Domain specific : 5.3 General Intelligence : 10.0 Instructions following : 6.5 Puzzle Solving : 3.1 Tool Calling : 10.0
#77	GLM 5 Turbo (none)	5.5	Z.ai	$0.032	2.94s
View model card Total Tests: 18 Wrong Tests: 12 Attempt pass rate: 37.0% Flaky tests: 2 Output Tokens: 1,775 Reasoning Tokens: 0 Response time: avg 2.94s · total 52.98s · max 8.21s Wrong answer: 10 Did not follow instructions: 2 Anti-AI Tricks : 3.0 Coding : 5.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.2 Instructions following : 6.5 Puzzle Solving : 5.5 Tool Calling : 10.0
#78	Trinity Large Preview (none)	5.3	Arcee AI	$0.000	5.07s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 29.6% Flaky tests: 1 Output Tokens: 1,985 Reasoning Tokens: 0 Response time: avg 5.07s · total 91.23s · max 39.47s Wrong answer: 11 Did not follow instructions: 2 Anti-AI Tricks : 3.0 Coding : 6.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.4 Instructions following : 4.1 Puzzle Solving : 5.4 Tool Calling : 10.0
#79	Grok 4.20 Beta (none)	5.3	X AI	$0.091	1.19s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 29.6% Flaky tests: 2 Output Tokens: 1,591 Reasoning Tokens: 0 Response time: avg 1.19s · total 21.37s · max 6.48s Wrong answer: 10 Did not follow instructions: 3 Invalid tool call: 1 Anti-AI Tricks : 4.0 Coding : 5.5 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 5.0 Instructions following : 4.8 Puzzle Solving : 5.9 Tool Calling : 10.0
#80	MiniMax M2.7 (medium)	5.3	Minimax	$0.091	31.08s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 51.9% Flaky tests: 10 Output Tokens: 4,984 Reasoning Tokens: 62,787 Response time: avg 31.08s · total 528.37s · max 117.04s Did not follow instructions: 6 Wrong answer: 5 Timed out: 2 Invalid tool call: 1 Anti-AI Tricks : 7.9 Coding : 10.0 Combined : 4.7 Data parsing and extraction : 6.3 Domain specific : 3.0 General Intelligence : 3.9 Instructions following : 3.7 Puzzle Solving : 3.8 Tool Calling : 4.7
#81	Elephant (medium)	5.2	OpenRouter	$0.000	1.27s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 29.6% Flaky tests: 1 Output Tokens: 2,596 Reasoning Tokens: 0 Response time: avg 1.27s · total 22.82s · max 3.70s Wrong answer: 9 Did not follow instructions: 3 Invalid tool call: 1 Anti-AI Tricks : 6.6 Coding : 5.1 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 3.0 General Intelligence : 4.3 Instructions following : 9.8 Puzzle Solving : 3.7 Tool Calling : 3.0
#82	Grok 4.20 (none)	5.2	X AI	$0.095	1.11s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 29.6% Flaky tests: 1 Output Tokens: 1,967 Reasoning Tokens: 0 Response time: avg 1.11s · total 20.02s · max 6.04s Wrong answer: 9 Did not follow instructions: 2 Extra formatting: 1 Invalid tool call: 1 Anti-AI Tricks : 4.8 Coding : 3.4 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 4.8 Instructions following : 4.8 Puzzle Solving : 5.3 Tool Calling : 10.0
#83	Mistral Small 4 (none)	5.2	Mistral	$0.006	665ms
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 31.5% Flaky tests: 1 Output Tokens: 2,207 Reasoning Tokens: 0 Response time: avg 665ms · total 11.97s · max 1.72s Wrong answer: 11 Did not follow instructions: 2 Anti-AI Tricks : 3.4 Coding : 4.5 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.0 Instructions following : 6.5 Puzzle Solving : 3.1 Tool Calling : 10.0
#84	gpt-oss-120b (none)	5.2	OpenAI	$0.009	11.96s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 38.9% Flaky tests: 5 Output Tokens: 44,652 Reasoning Tokens: 0 Response time: avg 11.96s · total 179.34s · max 68.97s Wrong answer: 6 Did not follow instructions: 5 API error: 3 Anti-AI Tricks : 6.6 Coding : 4.3 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 3.0 General Intelligence : 4.6 Instructions following : 8.4 Puzzle Solving : 4.5 Tool Calling : 3.0
#85	Elephant (none)	5.2	OpenRouter	$0.000	1.23s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 31.5% Flaky tests: 1 Output Tokens: 2,573 Reasoning Tokens: 0 Response time: avg 1.23s · total 22.16s · max 3.81s Wrong answer: 9 Did not follow instructions: 3 Invalid tool call: 1 Anti-AI Tricks : 6.6 Coding : 6.4 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 3.0 General Intelligence : 4.0 Instructions following : 9.8 Puzzle Solving : 3.3 Tool Calling : 3.0
#86	GPT-5.4 Mini (none)	5.1	OpenAI	$0.032	1.17s
View model card Total Tests: 18 Wrong Tests: 13 Attempt pass rate: 35.2% Flaky tests: 3 Output Tokens: 2,418 Reasoning Tokens: 0 Response time: avg 1.17s · total 21.01s · max 2.52s Wrong answer: 10 Did not follow instructions: 3 Anti-AI Tricks : 3.1 Coding : 10.0 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.5 General Intelligence : 4.8 Instructions following : 6.3 Puzzle Solving : 5.4 Tool Calling : 3.0
#87	Qwen3 Coder Next (none)	5.1	Qwen	$0.008	10.18s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 25.9% Flaky tests: 1 Output Tokens: 3,617 Reasoning Tokens: 0 Response time: avg 10.18s · total 122.13s · max 45.14s Wrong answer: 12 Extra formatting: 1 Did not follow instructions: 1 Anti-AI Tricks : 3.6 Coding : 7.3 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 5.3 General Intelligence : 10.0 Instructions following : 4.8 Puzzle Solving : 3.2 Tool Calling : 10.0
#88	Nemotron 3 Super (none)	5.1	NVIDIA	$0.000	8.54s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 35.2% Flaky tests: 4 Output Tokens: 4,760 Reasoning Tokens: 0 Response time: avg 8.54s · total 153.69s · max 24.97s Wrong answer: 10 Did not follow instructions: 4 Anti-AI Tricks : 4.8 Coding : 3.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.6 General Intelligence : 4.2 Instructions following : 4.9 Puzzle Solving : 5.7 Tool Calling : 4.7
#89	GPT-4o-mini (none)	4.9	OpenAI	$0.005	2.00s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 22.2% Flaky tests: 0 Output Tokens: 1,947 Reasoning Tokens: 0 Response time: avg 2.00s · total 21.99s · max 7.58s Wrong answer: 13 Did not follow instructions: 1 Anti-AI Tricks : 4.8 Coding : 3.0 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 4.0 Instructions following : 4.8 Puzzle Solving : 3.7 Tool Calling : 10.0
#90	Qwen3.5-9B (none)	4.8	Qwen	$0.005	1.47s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 24.1% Flaky tests: 1 Output Tokens: 3,951 Reasoning Tokens: 0 Response time: avg 1.47s · total 26.43s · max 5.91s Wrong answer: 10 Did not follow instructions: 3 Invalid tool call: 1 Anti-AI Tricks : 3.1 Coding : 5.2 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 4.4 Instructions following : 6.5 Puzzle Solving : 3.2 Tool Calling : 10.0
#91	Mercury 2 (none)	4.8	Inception	$0.007	613ms
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 27.8% Flaky tests: 2 Output Tokens: 1,625 Reasoning Tokens: 0 Response time: avg 613ms · total 11.04s · max 1.27s Wrong answer: 13 Did not follow instructions: 1 Anti-AI Tricks : 3.0 Coding : 3.6 Combined : 3.0 Data parsing and extraction : 7.3 Domain specific : 5.3 General Intelligence : 4.8 Instructions following : 6.5 Puzzle Solving : 3.1 Tool Calling : 10.0
#92	Qwen3 Coder Next (medium)	4.7	Qwen	$0.008	10.75s
View model card Total Tests: 18 Wrong Tests: 15 Attempt pass rate: 27.8% Flaky tests: 3 Output Tokens: 3,241 Reasoning Tokens: 0 Response time: avg 10.75s · total 129.01s · max 81.80s Wrong answer: 9 Did not follow instructions: 5 Timed out: 1 Anti-AI Tricks : 3.5 Coding : 4.7 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 5.3 General Intelligence : 6.3 Instructions following : 4.8 Puzzle Solving : 3.1 Tool Calling : 10.0
#93	GLM 4.7 Flash (medium)	4.6	Z.ai	$0.046	32.33s
View model card Total Tests: 18 Wrong Tests: 14 Attempt pass rate: 38.9% Flaky tests: 8 Output Tokens: 39,688 Reasoning Tokens: 72,401 Response time: avg 32.33s · total 355.65s · max 174.55s Wrong answer: 8 Did not follow instructions: 2 No answer: 2 Invalid tool call: 1 Timed out: 1 Anti-AI Tricks : 4.7 Coding : 3.6 Combined : 2.8 Data parsing and extraction : 6.3 Domain specific : 3.5 General Intelligence : 3.6 Instructions following : 6.2 Puzzle Solving : 2.9 Tool Calling : 10.0
#94	MiMo-V2-Flash (none)	4.5	Xiaomi	$0.023	2.79s
View model card Total Tests: 18 Wrong Tests: 15 Attempt pass rate: 27.8% Flaky tests: 5 Output Tokens: 68,522 Reasoning Tokens: 0 Response time: avg 2.79s · total 39.08s · max 19.68s Wrong answer: 12 API error: 1 Extra formatting: 1 Did not follow instructions: 1 Anti-AI Tricks : 3.2 Coding : 6.3 Combined : 3.0 Data parsing and extraction : 2.9 Domain specific : 5.3 General Intelligence : 4.6 Instructions following : 6.5 Puzzle Solving : 3.6 Tool Calling : 10.0
#95	Grok 4.1 Fast (none)	4.5	X AI	$0.009	1.76s
View model card Total Tests: 18 Wrong Tests: 15 Attempt pass rate: 24.1% Flaky tests: 3 Output Tokens: 1,721 Reasoning Tokens: 0 Response time: avg 1.76s · total 19.35s · max 5.51s Wrong answer: 13 Did not follow instructions: 2 Anti-AI Tricks : 3.2 Coding : 5.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.9 General Intelligence : 4.4 Instructions following : 3.0 Puzzle Solving : 3.2 Tool Calling : 2.8
#96	GPT-5.4 Nano (none)	4.5	OpenAI	$0.009	1.40s
View model card Total Tests: 18 Wrong Tests: 16 Attempt pass rate: 31.5% Flaky tests: 7 Output Tokens: 2,762 Reasoning Tokens: 0 Response time: avg 1.40s · total 25.14s · max 3.84s Wrong answer: 13 Did not follow instructions: 3 Anti-AI Tricks : 3.5 Coding : 7.1 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 2.9 General Intelligence : 3.8 Instructions following : 5.0 Puzzle Solving : 3.7 Tool Calling : 10.0
#97	Qwen3.5-9B (medium)	4.4	Qwen	$0.030	73.64s
View model card Total Tests: 18 Wrong Tests: 15 Attempt pass rate: 33.3% Flaky tests: 6 Output Tokens: 24,291 Reasoning Tokens: 172,597 Response time: avg 73.64s · total 1104.60s · max 226.38s Timed out: 11 Did not follow instructions: 2 Extra formatting: 1 Wrong answer: 1 Anti-AI Tricks : 5.1 Coding : 2.6 Combined : 3.0 Data parsing and extraction : 3.6 Domain specific : 3.6 General Intelligence : 2.8 Instructions following : 6.4 Puzzle Solving : 3.1 Tool Calling : 10.0
#98	LFM2-24B-A2B (none)	4.1	Liquid	$0.001	811ms
View model card Total Tests: 16 Wrong Tests: 15 Attempt pass rate: 14.6% Flaky tests: 2 Output Tokens: 1,185 Reasoning Tokens: 0 Response time: avg 811ms · total 11.35s · max 2.88s Wrong answer: 9 API error: 4 Did not follow instructions: 2 Anti-AI Tricks : 3.3 Combined : 3.0 Data parsing and extraction : 3.0 Domain specific : 5.9 General Intelligence : 4.0 Instructions following : 4.8 Puzzle Solving : 4.4 Tool Calling : 3.0
#99	Step 3.5 Flash (none)	3.0	Stepfun	$0.000	0ms
View model card Total Tests: 1 Wrong Tests: 1 Attempt pass rate: 0.0% Flaky tests: 0 Output Tokens: 0 Reasoning Tokens: 0 Response time: avg 0ms · total 0ms · max 0ms API error: 1 Coding : 3.0
