slopshopper

19 models, 12 prompts, ranked and compared

Tier 1: Consistently Excellent

Claude Opus 4.6 (Anthropic, Tier 1)

Best overall quality across every prompt category

+ Best formatting, nuance, production-grade code. Binary search optimization in rate limiter. 'What Went Well' sections in postmortems.

- Most expensive (10-100x the cost of the others)

Claude Sonnet 4.6 (Anthropic, Tier 1)

Close to Opus quality at lower cost

+ Production-ready code, ASCII diagrams, per-user locks in rate limiter. Strong professional writing.

- Still expensive

gpt-oss-120b (OpenAI, Tier 1)

Best value in the entire set — top-tier quality at 50-100x cheaper than Opus

+ Most detailed responses. Excellent documentation, CLI-ready scripts, thorough system design with tables.

- Sometimes hit the 2048-token limit and was truncated

DeepSeek V3.2 (DeepSeek, Tier 1)

Never empty, never broken, always solid. Excellent cost/quality ratio.

+ Always gives multiple progressive options. Consistent quality. Among the cheapest.

- Never the absolute best on any single prompt

Tier 2: Strong

Gemini 3 Flash (Google, Tier 2)

Reliable mid-to-high quality with good explanations

+ Clean code, good explanations, reliable across all categories.

- Rarely stands out as the best

Qwen3.6 Plus (Qwen, Tier 2)

Concise and action-oriented

+ Excellent Dockerfile with size estimates. Pragmatic answers that get to the point.

- Sometimes too brief

Grok 4.1 Fast (xAI, Tier 2)

Strong on code review and Docker, inconsistent depth

+ Good code review with clear issue enumeration. Strong Dockerfile.

- SQL-to-mongo response was only 144 chars. Sometimes too terse to be useful.

Step 3.5 Flash (StepFun, Tier 2, 4 empty)

Good quality when it works, but empty on 4 prompts

+ Good git-rebase flowchart, strong React debug answer.

- Empty responses on cron, refactor, csv-to-chart, system-design. Unreliable.

Tier 3: Decent but Flawed

Gemini 2.5 Flash (Google, Tier 3)

Verbose but thorough

+ Detailed explanations with error handling.

- 6-7K char responses with poor signal-to-noise ratio. Over-explains.

Gemini 2.5 Flash Lite (Google, Tier 3)

Very similar to Gemini 2.5 Flash — verbose, correct, noisy

+ Good detail at low cost.

- Extremely verbose for simple topics. Repeats concepts.

Nemotron 3 Super (NVIDIA, Tier 3)

Wildly inconsistent — 7K chars on some prompts, 143 on others

+ When verbose, produces educational step-by-step breakdowns.

- SQL-to-mongo response was 143 chars of bare code. System design was 3 sentences. Uses emoji excessively.

Kimi K2.5 (Moonshot, Tier 3, 3 empty)

Excellent refactor-review (best of all models), but empty on 3 prompts

+ Outstanding layered code review approach. Good system design.

- Empty on docker, cold-email, git-rebase. Reasoning model burns tokens thinking.

GLM 5 (Z.ai, Tier 3)

Decent when not truncated

+ Reasonable quality on completed responses.

- Hit token limits on multiple prompts. Truncated mid-sentence.

GLM 5 Turbo (Z.ai, Tier 3)

Truncated on Docker, mediocre otherwise

+ Led with APScheduler on cron (unique pick). Good React debug pedagogy.

- Dockerfile truncated mid-line. Inconsistent.

MiniMax M2.7 (MiniMax, Tier 3, 1 empty)

Decent when working, but reliability issues

+ Good code with clear comments.

- Empty on rate-limiter. Chinese text leaked into postmortem. Truncated on system design.

MiniMax M2.5 (MiniMax, Tier 3)

Mid-quality, truncation issues

+ Reasonable answers when complete.

- Postmortem truncated mid-word. Inconsistent depth.

Tier 4: Disappointing

GPT-5.4 (OpenAI, Tier 4)

Surprisingly terse for premium pricing

+ Correct and concise. Best email regex with RFC-aware lookaheads.

- 747-1430 chars on most prompts. Bare-minimum effort at $0.003-0.009/query.

GPT-4o-mini (OpenAI, Tier 4)

Never wrong, but always the most generic answer

+ Cheapest model. Never produces errors or empty responses.

- Reads like paraphrased Wikipedia. No code examples in git-rebase. Template-like cold email.

MiMo-V2-Pro (Xiaomi, Tier 4, 2 empty)

Reasoning model that frequently produces nothing

+ Good SQL-to-mongo mapping table when it works.

- Empty on email-regex and debug-react. Burns 2000+ reasoning tokens producing 0 output.