slopshopper

19 models, 12 prompts, ranked and compared

powered by OpenRouter
Tier 1: Consistently Excellent
1. Claude Opus 4.6 (Anthropic, Tier 1)

Best overall quality across every prompt category

+ Best formatting, nuance, production-grade code. Binary search optimization in rate limiter. 'What Went Well' sections in postmortems.
- Most expensive (10-100x the cost of the others)
2. Claude Sonnet 4.6 (Anthropic, Tier 1)

Close to Opus quality at lower cost

+ Production-ready code, ASCII diagrams, per-user locks in rate limiter. Strong professional writing.
- Still expensive
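
The two entries above credit the rate-limiter answers with a binary search optimization and per-user locks, but the page does not reproduce those answers. As a rough illustration only (not either model's actual output), a sliding-window limiter that combines both ideas might look like the sketch below. The class name, the `allow` method, and its parameters are hypothetical, and the sketch assumes `now` never decreases for a given user so each timestamp list stays sorted.

```python
import bisect
import threading
from collections import defaultdict


class SlidingWindowLimiter:
    """Hypothetical sketch: at most `limit` calls per `window` seconds per user."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(list)             # user -> sorted timestamps
        self._locks = defaultdict(threading.Lock)  # user -> that user's lock
        self._registry_lock = threading.Lock()     # guards the two dicts above

    def allow(self, user: str, now: float) -> bool:
        # Hold the global lock only long enough to look up per-user state.
        with self._registry_lock:
            lock = self._locks[user]
            hits = self._hits[user]
        # Per-user lock: requests from different users do not block each other.
        with lock:
            # Binary search for the first timestamp still inside the window
            # (assumes `now` is non-decreasing, so `hits` is sorted).
            cutoff = bisect.bisect_right(hits, now - self.window)
            del hits[:cutoff]
            if len(hits) >= self.limit:
                return False
            hits.append(now)
            return True
```

In real use a caller would likely pass `time.monotonic()` as `now`; taking it as an argument here just keeps the sketch easy to test.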
3. gpt-oss-120b (OpenAI, Tier 1)

Best value in the entire set: top-tier quality at 50-100x lower cost than Opus

+ Most detailed responses. Excellent documentation, CLI-ready scripts, thorough system design with tables.
- Sometimes hit the 2048-token limit and was truncated
4. DeepSeek V3.2 (DeepSeek, Tier 1)

Never empty, never broken, always solid. Excellent cost/quality ratio.

+ Always gives multiple progressive options. Consistent quality. Among the cheapest.
- Never the absolute best on any single prompt
Tier 2: Strong
5. Gemini 3 Flash (Google, Tier 2)

Reliable mid-to-high quality with good explanations

+ Clean code, good explanations, reliable across all categories.
- Rarely stands out as the best
6. Qwen3.6 Plus (Qwen, Tier 2)

Concise and action-oriented

+ Excellent Dockerfile with size estimates. Pragmatic answers that get to the point.
- Sometimes too brief
7. Grok 4.1 Fast (xAI, Tier 2)

Strong on code review and Docker, inconsistent depth

+ Good code review with clear issue enumeration. Strong Dockerfile.
- SQL-to-mongo answer was only 144 chars. Sometimes too terse to be useful.
8. Step 3.5 Flash (StepFun, Tier 2, 4 empty)

Good quality when it works, but empty on 4+ prompts

+ Good git-rebase flowchart, strong React debug answer.
- Empty responses on cron, refactor, csv-to-chart, system-design. Unreliable.
Tier 3: Decent but Flawed
9. Gemini 2.5 Flash (Google, Tier 3)

Verbose but thorough

+ Detailed explanations with error handling.
- 6-7K char responses with poor signal-to-noise ratio. Over-explains.
10. Gemini 2.5 Flash Lite (Google, Tier 3)

Very similar to Gemini 2.5 Flash: verbose, correct, noisy

+ Good detail at low cost.
- Extremely verbose for simple topics. Repeats concepts.
11. Nemotron 3 Super (NVIDIA, Tier 3)

Wildly inconsistent: 7K chars on some prompts, 143 on others

+ When verbose, produces educational step-by-step breakdowns.
- SQL-to-mongo answer was 143 chars of bare code. System design was 3 sentences. Uses emoji excessively.
12. Kimi K2.5 (Moonshot, Tier 3, 3 empty)

Excellent refactor-review (best of all models), but empty on 3 prompts

+ Outstanding layered code review approach. Good system design.
- Empty on docker, cold-email, git-rebase. Reasoning model burns tokens thinking.
13. GLM 5 (Z.ai, Tier 3)

Decent when not truncated

+ Reasonable quality on completed responses.
- Hit token limits on multiple prompts. Truncated mid-sentence.
14. GLM 5 Turbo (Z.ai, Tier 3)

Truncated on Docker, mediocre otherwise

+ Led with APScheduler on cron (unique pick). Good React debug pedagogy.
- Dockerfile truncated mid-line. Inconsistent.
15. MiniMax M2.7 (MiniMax, Tier 3, 1 empty)

Decent when working, but reliability issues

+ Good code with clear comments.
- Empty on rate-limiter. Chinese text leaked into postmortem. Truncated on system design.
16. MiniMax M2.5 (MiniMax, Tier 3)

Mid-quality, truncation issues

+ Reasonable answers when complete.
- Postmortem truncated mid-word. Inconsistent depth.
Tier 4: Disappointing
17. GPT-5.4 (OpenAI, Tier 4)

Surprisingly terse for premium pricing

+ Correct and concise. Best email regex with RFC-aware lookaheads.
- 747-1430 chars on most prompts. Bare-minimum effort at $0.003-0.009/query.
18. GPT-4o-mini (OpenAI, Tier 4)

Never wrong, but always the most generic answer

+ Cheapest model. Never produces errors or empty responses.
- Reads like paraphrased Wikipedia. No code examples in git-rebase. Template-like cold email.
19. MiMo-V2-Pro (Xiaomi, Tier 4, 2 empty)

Reasoning model that frequently produces nothing

+ Good SQL-to-mongo mapping table when it works.
- Empty on email-regex and debug-react. Burns 2000+ reasoning tokens producing 0 output.