19 models, 12 prompts, ranked and compared
Best overall quality across every prompt category
Close to Opus quality at lower cost
Best value in the entire set: top-tier quality at 50-100x lower cost than Opus
Never empty, never broken, always solid. Excellent cost/quality ratio.
Reliable mid-to-high quality with good explanations
Concise and action-oriented
Strong on code review and Docker, inconsistent depth
Good quality when it works, but returned empty output on 4+ prompts
Verbose but thorough
Very similar to Gemini 2.5 Flash — verbose, correct, noisy
Wildly inconsistent — 7K chars on some prompts, 143 on others
Excellent refactor-review (best of all models), but returned empty output on 3 prompts
Decent when not truncated
Truncated on Docker, mediocre otherwise
Decent when working, but has reliability issues
Mid-quality, with truncation issues
Surprisingly terse for premium pricing
Never wrong, but always the most generic answer
Reasoning model that frequently produces nothing
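
Several of the verdicts above rest on simple counts: how many prompts came back empty, and how far response length swings from one prompt to the next (the "7K chars on some prompts, 143 on others" case). A minimal sketch of how such counts could be tallied is below. The function name `summarize_lengths`, the whitespace-only definition of "empty", and the sample data are all assumptions made for illustration; they are not the method or the results behind this ranking.

```python
from statistics import median


def summarize_lengths(responses):
    """Summarize one model's responses (a list of strings, one per prompt).

    Assumption: a response counts as "empty" if it is blank or
    whitespace-only. Truncation is not detected here.
    """
    lengths = [len(r.strip()) for r in responses]
    non_empty = [n for n in lengths if n > 0]
    return {
        "prompts": len(lengths),
        "empty": len(lengths) - len(non_empty),
        "min_chars": min(non_empty) if non_empty else 0,
        "max_chars": max(non_empty) if non_empty else 0,
        "median_chars": median(non_empty) if non_empty else 0,
    }


# Hypothetical toy data, NOT results from the comparison above.
sample = ["x" * 7000, "y" * 143, "", "   ", "z" * 500]
print(summarize_lengths(sample))
```

A large gap between `min_chars` and `max_chars` would flag the inconsistent-length pattern, and a nonzero `empty` count would flag the reliability pattern. Neither number says anything about answer quality, which still has to be judged by reading the output.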