Technology May 03, 2026 · 10 min read

DEV Community
by Marcus Rowe
Mistral Medium 3.5 Review: A 128B Open-Weight Model With a Coding Agent That Opens PRs For You

TL;DR: Mistral Medium 3.5 is a 128B open-weight model released April 29, 2026, with a 256K context window, configurable reasoning, and native multimodal input. It scores 77.6% on SWE-Bench Verified — close but not ahead of Claude Sonnet 4.6 — and ships alongside Vibe, a cloud coding agent that submits pull requests directly to GitHub without you babysitting it. API pricing is $1.50 per million input tokens. Open weights under a modified MIT license that routes high-revenue enterprises through Mistral's paid channel. It's a serious model with a real killer feature. But the pricing math is complicated.

77.6%. That's Mistral Medium 3.5's score on SWE-Bench Verified.

For context: Claude Sonnet 4.6 scores 79.6% on the same benchmark. So Medium 3.5 is close — very close — but it doesn't actually beat the current best. And it costs $1.50 per million input tokens on Mistral's API, while cheaper alternatives like Qwen 3.6 (27B, free to self-host under Apache 2.0) are nipping at its heels with 72.4% on the same benchmark at a fraction of the model size.

So this is either a strong open-weight option for developers who need local control, or an overpriced mid-tier API play, depending entirely on what you're trying to do. I'll walk through which one it is for your situation, because "it depends" is doing real work here.

What Is Mistral Medium 3.5?

Mistral is a Paris-based AI lab founded in 2023, and they've been steadily building out their model lineup: Small, Medium, Large, with various numbered releases layered on top. Medium 3.5 is their April 2026 flagship — released April 29, 2026.

What's notable about the architecture is that Medium 3.5 consolidates three previously separate Mistral models into a single unified set of weights. You used to have to choose between Mistral Medium 3.1 (instruction-following), Magistral (reasoning), and Devstral 2 (coding). Now it's one model with a toggle.

128B parameters. All dense — meaning all 128 billion parameters activate on every token, not a mixture-of-experts setup where only a fraction fire. Dense models are generally more predictable in output quality, which matters if you're running production workloads.

256K context window. That's a big number. Larger than Claude Sonnet 4.6 (200K) and twice GPT-4o (128K). Whether you'll actually saturate a 256K window in practice is a separate question, but having the headroom is nice when you're feeding it a whole codebase.

Configurable reasoning. You can toggle between fast reply mode and deep reasoning mode per request. This is Mistral's version of extended thinking — the model applies more test-time compute to harder problems when you ask it to. It worked well for Magistral, and the integration here is smoother than flipping between two separate models.
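As a sketch of what the per-request toggle could look like in practice: the payload below follows a generic chat-completions shape, and the `reasoning_mode` field name and model identifier are assumptions for illustration, not Mistral's documented parameters.

```python
# Sketch of flipping Mistral Medium 3.5 between fast and deep modes per
# request. The endpoint shape follows the usual chat-completions layout;
# "reasoning_mode" and the model id are ASSUMPTIONS -- check the official
# docs for the real parameter names.

def build_request(prompt: str, deep_reasoning: bool) -> dict:
    """Build a chat-completion payload with the reasoning toggle set."""
    return {
        "model": "mistral-medium-3.5",  # hypothetical model identifier
        "messages": [{"role": "user", "content": prompt}],
        # Deep mode spends more test-time compute on the same request.
        "reasoning_mode": "deep" if deep_reasoning else "fast",
    }

fast = build_request("Summarize this diff.", deep_reasoning=False)
deep = build_request("Find the race condition in this code.", deep_reasoning=True)
```

The point is that it's one deployment and one set of weights; the old Magistral-vs-Medium decision becomes a request-time flag.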

Multimodal input. Text and images in, text out. The vision encoder was trained from scratch to handle variable image sizes and aspect ratios, which is a technical choice that often shows up as better handling of screenshots and diagrams compared to models that bolt on a fixed-size vision module.
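Multimodal requests typically interleave text and image parts inside a single user message. The content-part layout below mirrors the common OpenAI-style schema that vision-capable chat APIs, including Mistral's, generally accept; treat the exact field shapes as an assumption and verify against docs.mistral.ai.

```python
import base64

def image_message(text: str, image_bytes: bytes) -> dict:
    """User message mixing a text part and an inline base64 image.

    The {"type": ..., "image_url": ...} part schema is the common
    OpenAI-style layout -- an assumption here, not a confirmed spec.
    """
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": f"data:image/png;base64,{b64}"},
        ],
    }

msg = image_message("What does this architecture diagram show?", b"\x89PNG...")
```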

The Vibe Remote Agent — This Is the Actually Interesting Part

OK, so the benchmark numbers are solid-but-not-exceptional. What actually got my attention when I read the launch announcement was Vibe.

Vibe is Mistral's cloud-based coding agent. And the way it works is fundamentally different from local coding assistants.

Standard coding agents — whether it's Cursor, GitHub Copilot, or a local Claude-powered setup — require you to be in the loop. The model suggests, you approve, the model writes, you review, repeat. It's faster than doing it yourself, but you're still the bottleneck. You have to babysit the session.

Vibe's remote agents change that model. You describe the task, hand it to Vibe, and it runs in an isolated cloud sandbox. You can walk away. It can handle long-running tasks — the kind that take hours — without needing you to keep a terminal open. When it's done, it doesn't just output code to a terminal. It opens a pull request on your GitHub repo, ready for review.

I want to be clear about what that means: you can queue up three coding tasks before bed, wake up the next morning, and have three open PRs waiting. You look at the diffs, approve what's good, request changes on what isn't. The edit-review loop compresses dramatically.
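There's no public Vibe SDK referenced in the launch materials, but the queueing pattern itself is simple enough to sketch. Everything below (`VibeQueue`, `submit`, the field names) is invented for illustration, not Vibe's actual API:

```python
# Hypothetical sketch of the "queue tasks before bed" workflow. A real
# client would POST each task to Vibe's cloud sandbox; here we just model
# the queue so the shape of the workflow is concrete.

from dataclasses import dataclass, field

@dataclass
class QueuedTask:
    repo: str
    description: str
    status: str = "queued"  # a real agent would move this to "pr_open"

@dataclass
class VibeQueue:
    tasks: list = field(default_factory=list)

    def submit(self, repo: str, description: str) -> QueuedTask:
        task = QueuedTask(repo=repo, description=description)
        self.tasks.append(task)
        return task

queue = VibeQueue()
for desc in [
    "Migrate the auth middleware to async handlers",
    "Add retry logic to the billing webhook",
    "Fix the flaky pagination test",
]:
    queue.submit("acme/backend", desc)
# Next morning: each finished task corresponds to an open PR to review.
```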

The integrations are solid too. Native GitHub support is table stakes. But Vibe also connects to Linear, Jira, Sentry, Slack, and Microsoft Teams. So you could wire it to your issue tracker and have it automatically pick up tickets, implement them, and open PRs. That's not hypothetical; it's the designed workflow.

One thing worth noting: Vibe requires explicit approval for any "sensitive actions," meaning it surfaces what it's about to do before doing it. That's the right design. Autonomous agents that go quiet and then surprise you are a support ticket waiting to happen.
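The approval gate is easy to model: non-sensitive actions run immediately, while anything on a sensitive list gets surfaced and blocked unless it's explicitly approved. The action names and callback shape below are illustrative, not Vibe's actual interface:

```python
# Minimal sketch of an approval gate for an autonomous coding agent.
# The SENSITIVE set and the approve() callback are invented for
# illustration -- Vibe's real policy surface will differ.

SENSITIVE = {"push_to_main", "delete_branch", "modify_ci_secrets"}

def run_action(action: str, execute, approve=lambda a: False):
    """Run non-sensitive actions directly; require sign-off otherwise."""
    if action in SENSITIVE and not approve(action):
        return f"blocked: {action} requires explicit approval"
    return execute()

result = run_action("open_pull_request", lambda: "PR opened")
blocked = run_action("push_to_main", lambda: "pushed")
```

Defaulting the approval callback to "deny" is the conservative choice: an agent that stalls and asks is recoverable, one that quietly force-pushes is not.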

The Benchmark Reality Check

Here's the honest picture of where Medium 3.5 stands.

SWE-Bench Verified (coding, real GitHub issues):

  • Mistral Medium 3.5: 77.6%
  • Claude Sonnet 4.6: 79.6%
  • Qwen 3.6 (27B): 72.4%

τ³-Telecom (agentic tool-use): 91.4% — a strong number, and this benchmark specifically tests the kind of multi-step tool-calling that Vibe relies on. So the model is well-suited to its headline use case.

What's notably absent from Mistral's launch materials: MMLU, GPQA, AIME, HumanEval, MATH. These are the standard general knowledge and reasoning benchmarks that every other frontier model publishes on release. Mistral hasn't published them. I don't know why.

That gap is actually meaningful. If you're using this model for coding and agentic tasks, the published SWE-Bench number is the one you care about. But if you're trying to evaluate Medium 3.5 for general knowledge queries, document analysis, or reasoning over complex text — you're flying without instruments. The predecessor, Medium 3, scored 92.1% on HumanEval and 57.1% on GPQA Diamond, but those numbers don't carry over automatically to a new architecture.

Developers on Hacker News noticed. The launch thread had a notably critical tone, with one developer summarizing it as: "their new flagship model is basically 'not the best' on any benchmark, yet costs multiple times more than most competitors." Harsh, but not unfair based on the published data.

Running It Locally: The 4-GPU Reality

The "runs on 4 GPUs" headline is accurate. But let's talk about which 4 GPUs.

The recommended production setup is 4× NVIDIA H100 80GB, running the model in FP8 precision. That gives you ~320GB of VRAM total, with FP8 cutting the model weight footprint to roughly 128GB, leaving headroom for the KV cache at the 256K context window.
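The numbers above check out as back-of-envelope arithmetic. The KV-cache share is the soft part of the estimate, since it depends on layer count and attention configuration, neither of which is public here:

```python
# Back-of-envelope VRAM check for the 4x H100 FP8 setup described above.
# Weights dominate; whatever is left over goes to KV cache, activations,
# and framework overhead (the split is an assumption, not a spec).

PARAMS = 128e9        # 128B dense parameters
FP8_BYTES = 1         # FP8 = 1 byte per parameter
GPU_VRAM_GB = 80      # H100 80GB
NUM_GPUS = 4

weights_gb = PARAMS * FP8_BYTES / 1e9      # ~128 GB of model weights
total_vram_gb = GPU_VRAM_GB * NUM_GPUS     # 320 GB across the node
headroom_gb = total_vram_gb - weights_gb   # ~192 GB for KV cache etc.
```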

H100 80GB GPUs cost about $30,000 each new, and rent for roughly $2-3/hour per GPU on cloud providers. So 4-GPU production inference on rented hardware runs $8-12/hour. At that compute cost, the API at $1.50 per million tokens starts looking attractive unless you're doing very high volume.
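Here's the break-even arithmetic behind that claim, using the midpoint of the rental range. Throughput is the big unknown, so this only compares dollars per hour against dollars per token, not whether the hardware can actually sustain that volume:

```python
# When does renting 4x H100 beat paying the API per token?
# $10/hour is the midpoint of the $8-12/hour range quoted above.

API_PRICE_PER_M = 1.50   # $ per million input tokens on Mistral's API
RENTAL_PER_HOUR = 10.0   # midpoint rental cost for 4x H100

# Million input tokens per hour needed before self-hosting is cheaper.
breakeven_m_tokens_per_hour = RENTAL_PER_HOUR / API_PRICE_PER_M  # ~6.67M
```

Below roughly 6.7 million input tokens per hour of sustained load, the API is the cheaper option on raw compute cost alone.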

For teams that already own the hardware — inference labs, larger enterprise ML teams, research groups — self-hosting at this scale is very much a real option. Mistral officially supports vLLM and SGLang for production inference, both of which have well-developed deployment paths for models this size.
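For reference, a vLLM launch for a tensor-parallel FP8 deployment would look roughly like this. The HuggingFace repo id is a guess, so check the actual model page, and exact flag support varies by vLLM version:

```shell
# Sketch of a 4-GPU FP8 vLLM deployment. The model id below is assumed,
# not confirmed; 262144 = the 256K context window.
vllm serve mistralai/Mistral-Medium-3.5 \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 262144
```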

If you want to run it on something more accessible: Q4-quantized versions drop the VRAM requirement to roughly 70GB, which is approaching Mac Studio territory (128GB unified memory, around $3,500 at current pricing). You'll pay a quality penalty versus full precision, but for dev/test workloads it's workable.

For inference optimization, Mistral offers EAGLE speculative decoding via a separate draft head model that adds about 4GB of overhead but delivers 1.41× output throughput and 29% lower latency. Worth enabling in production if you're squeezing performance.

License: Open-Weight, With a Catch for Big Companies

The modified MIT license Mistral ships with isn't quite as clean as the Apache 2.0 on earlier models. The core terms: free to use for individuals, startups, mid-market companies, and most universities. No restrictions on commercial applications within that population.

The catch: companies above a revenue threshold need to negotiate a separate commercial arrangement with Mistral. The threshold isn't published explicitly, but the structure creates a two-tier situation — the model is effectively open for everyone except large enterprise.

For most developers reading this, that's a non-issue. If you're solo or at a startup, the modified MIT terms are plenty permissive. If you're at a Fortune 500, talk to a lawyer before deploying at scale.

How It Stacks Up Against the Competition

Let me just run through the practical comparisons.

vs. Claude Sonnet 4.6: Sonnet 4.6 wins on SWE-Bench (79.6% vs 77.6%) but has a slightly smaller context window (200K vs. Medium 3.5's 256K). Claude is proprietary and API-only; you can't self-host it. If local deployment or open weights matter to you, Medium 3.5 is the obvious choice. If API-only is fine and you want the highest coding benchmark, Sonnet edges it out. Pricing: Medium 3.5 at $1.50/million input vs. Sonnet 4.6 at $3.00/million, half the cost.

vs. GPT-4o: Medium 3.5 has a larger context window (256K vs 128K) and is ~40% cheaper on input tokens. GPT-4o has better published general-knowledge benchmark coverage, and it's API-only. Medium 3.5 wins on the open-weight dimension.

vs. Qwen 3.6 (27B): This is the uncomfortable comparison. Qwen 3.6 is 27B parameters — roughly 1/5 the size — and still hits 72.4% on SWE-Bench Verified, under Apache 2.0, for free. Medium 3.5 beats it by 5.2 points on that benchmark. Whether 5.2 points of SWE-Bench is worth $1.50/million tokens vs. $0 depends on your workload. For a high-volume API use case, the math probably favors Qwen. For local enterprise deployment where you want a single consolidated model with reasoning mode built in — Medium 3.5 has a legitimate case.

vs. Kimi K2.6: Moonshot's K2.6 (our most recent open-weight review) is 1 trillion total parameters in a MoE architecture with 58.6% SWE-Bench Pro — a harder benchmark than Verified. Kimi K2.6 starts at $0.60 per million input tokens and can run 300-agent swarms. Medium 3.5 is considerably smaller (128B dense), more accessible to self-host, and has the Vibe async agent as a practical differentiator. Different use case: K2.6 for maximum agentic scale, Medium 3.5 for single-model unified deployment with coding agents.

Who Should Actually Use This

Self-hosting teams with H100s already in their stack. This is Medium 3.5's strongest use case. If you've got the hardware, you get a consolidated model (no more managing three separate Mistral deployments), a 256K window, configurable reasoning, and multimodal input. The open-weight terms let you fine-tune it on proprietary data. That's a legitimate enterprise play.

Developers who want Vibe's async coding agents. The "queue tasks, get PRs in the morning" workflow is genuinely compelling, and it's built on Medium 3.5. If Vibe's cloud agent pipeline fits how you work, you're getting a capable model as the backend.

Cost-sensitive API users coming from GPT-4o or Claude. At $1.50 input vs. $3.00 (Sonnet) or $2.50 (GPT-4o), there's real savings potential if you don't need the absolute top-of-benchmark model for your task. The performance delta is small for most practical applications.

Who should pass: If you're choosing a pure API model with no interest in self-hosting, and you need the best coding benchmark performance, Sonnet 4.6 still edges it at 79.6% SWE-Bench. If you want the most capability per dollar from an open-weight model, Qwen 3.6's Apache 2.0 license and 72.4% benchmark score make Medium 3.5's price premium hard to justify.

Bottom Line

Mistral Medium 3.5 is a genuinely capable model with a smart consolidation story — three specialized models folded into one dense 128B architecture with a toggle for reasoning. The Vibe async coding agent is the most differentiated thing they shipped alongside it, and the "PR while you sleep" workflow is real and useful.

The friction points are real too. The benchmark coverage gaps matter if you're trying to assess general-purpose performance. The pricing faces legitimate pressure from efficient smaller models. And the enterprise license terms are worth reading carefully before you commit.

But for a European lab staying in the frontier race — open weights, self-hostable on accessible hardware, with a cloud agent product that integrates into actual developer workflows — Mistral Medium 3.5 is a meaningful release. It's not the best model on any individual benchmark. It might be the best practical open-weight model for teams that need the whole package.

Verification sources for this article: Mistral AI official announcement, HuggingFace model page, Mistral official documentation at docs.mistral.ai/models/model-cards/mistral-medium-3-5-26-04.
