Running Qwen3.6-35b-a3b locally on M4 Max 128GB with pi-coding-agent
I’m experimenting with running pi-coding-agent on a local model, and I just set up Qwen3.6-35B-A3B on my M4 Max 128GB. So far it’s promising — fast enough for coding tasks, free, and everything stays on my machine.
Here’s how I got it running.
The model
Qwen3.6-35B-A3B is multimodal (vision encoder included), supports 262K context natively (extensible to ~1M), and is Apache 2.0 licensed. The bf16 weights are ~70GB, so an M4 Max with 128GB has plenty of headroom. At the end of this post you’ll find some more details about the model.
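The ~70GB figure follows directly from the parameter count, since bf16 stores each parameter in 2 bytes:

```python
# Why ~70GB: bf16 is 16 bits (2 bytes) per parameter, so 35B parameters
# come to roughly 70GB of weights. Decimal GB here; actual safetensor file
# sizes vary slightly with metadata and embedding layout.
params = 35e9
bytes_per_param = 2                      # bf16 = 16 bits
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)                        # 70.0
```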
The MLX community has a ready-to-use bf16 conversion with 14 safetensor files, and the model supports the preserve_thinking feature that retains reasoning context across conversation turns — useful for iterative coding tasks.
Running the model server
First install the mlx-lm package via pip:
```shell
pip install mlx-lm
```

See the MLX LM README for more details on the package, including conda installation, Python API usage, and quantization tools.
Then run the server like this:
```shell
mlx_lm.server \
  --model mlx-community/Qwen3.6-35B-A3B-bf16 \
  --trust-remote-code \
  --port 8082 \
  --max-tokens 8192 \
  --chat-template-args '{"enable_thinking":true,"preserve_thinking":true}' \
  --prompt-cache-size 16 \
  --prompt-cache-bytes 12GB \
  --decode-concurrency 4 \
  --prompt-concurrency 2
```

Key flags:
- `--max-tokens 8192` — the default of 512 is far too low for coding agents
- `preserve_thinking: true` — Qwen3.6’s new feature that retains reasoning context across turns, important for iterative coding
- `--prompt-cache-bytes 12GB` — caps KV cache memory. The model weights take ~70GB, so 12GB leaves ~46GB for KV cache and runtime overhead on 128GB RAM; 24GB would be too aggressive and starve the rest of the runtime.
- `--decode-concurrency 4` — lower than the default, for better single-user latency
Note: mlx_lm.server doesn’t yet support --max-kv-size (open issue), so --prompt-cache-bytes is the way to cap memory. With ~70GB for model weights on 128GB RAM, 12GB is a safe sweet spot — enough to cache system prompts and shared context without starving the KV cache.
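Once the server is up, you can sanity-check it with a plain OpenAI-style request before wiring up the agent. This is a minimal sketch assuming the port and model ID chosen above; `mlx_lm.server` exposes an OpenAI-compatible `/v1/chat/completions` endpoint:

```python
# Minimal sanity check against the local mlx_lm.server. The endpoint is
# OpenAI-compatible; the API key is a dummy value, matching the config
# later in this post.
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "mlx-community/Qwen3.6-35B-A3B-bf16",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        "http://localhost:8082/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer mlx"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running: print(ask("Write a haiku about local LLMs."))
```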
Community reports ~91 tok/s with the 4-bit variant on M4 Max. The bf16 version will be slower (~40-60 tok/s) but higher quality.
Configuring pi-coding-agent
Add the model to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-local": {
      "baseUrl": "http://localhost:8082/v1",
      "api": "openai-completions",
      "apiKey": "mlx",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "mlx-community/Qwen3.6-35B-A3B-bf16",
          "name": "Qwen3.6 35B A3B bf16 (MLX local)",
          "reasoning": true,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 8192,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
          "compat": { "thinkingFormat": "qwen-chat-template" }
        }
      ]
    }
  }
}
```

The file reloads on each /model invocation — no restart needed.
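A trailing comma or stray brace in this file is the most common failure mode, so it’s worth a quick syntax check before invoking the agent. A minimal sketch (it embeds a trimmed copy of the config; in practice you’d `json.load` the real `~/.pi/agent/models.json`):

```python
# Quick syntax check of the provider entry. json.loads raises on any
# malformed JSON, and the key lookups catch typos in the structure.
import json

config = json.loads("""
{
  "providers": {
    "mlx-local": {
      "baseUrl": "http://localhost:8082/v1",
      "models": [
        {"id": "mlx-community/Qwen3.6-35B-A3B-bf16", "contextWindow": 262144}
      ]
    }
  }
}
""")
model = config["providers"]["mlx-local"]["models"][0]
print(model["id"], model["contextWindow"])
```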
How it compares
Here’s how it stacks up against the big names, based on published benchmarks:
|  | Qwen3.6-35B-A3B | Claude Sonnet 4 | Claude Opus 4.5 |
|---|---|---|---|
| Aggregate | 64 | 52 ¹ | 80 ² |
| SWE-bench Verified | 73.4 ³ | 72.7 ¹ | 80.9 ⁴ |
| Context | 262K | 200K | 200K |
Qwen3.6 is already ahead of Sonnet 4 on the aggregate score (64 vs 52) and SWE-bench Verified (73.4 vs 72.7), despite being a local model with zero cost. Opus 4.5 is in a different league — but costs $5/$25 per million tokens and sends your code to Anthropic.
For day-to-day coding, Qwen3.6 looks competitive enough that the local advantages (zero cost, instant response, full privacy, unlimited usage) make it worth trying.
I chose Sonnet 4 and Opus 4.5 for comparison because those were the models that felt less like iteration and more like leaps. Sonnet 4 was what got me into Claude Code and triggered my first agentic coding burst in June 2025. Opus 4.5 came at the end of 2025 and felt like another game changer.
So if the comparison holds, less than a year later I can run a local coding model that beats my initial Claude Code + Sonnet 4 experience.
Some early quick benchmarks
I ran mlx_lm.benchmark to measure actual tokens/sec on my M4 Max 128GB with the bf16 model:
| Scenario | Prompt | Gen | Batch | Prompt TPS | Gen TPS | Total Time |
|---|---|---|---|---|---|---|
| Fast chat | 256 | 256 | 1 | ~914 | ~62 | 4.5s |
| Long context (RAG) | 8192 | 128 | 1 | ~1548 | ~59 | 7.4s |
| Heavy gen (code) | 512 | 2048 | 1 | ~1106 | ~59 | 35.5s |
| Batch size 4 | 512 | 512 | 4 | ~1656 | ~86 | 25.1s |
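The rows above are internally consistent: total time should be roughly prompt tokens divided by prompt TPS plus generated tokens divided by generation TPS. Checking the heavy-gen row (the small gap versus the measured 35.5s is warmup and overhead):

```python
# Sanity-check the "Heavy gen (code)" row: estimate total time from the
# two throughput numbers in the table.
prompt_tokens, gen_tokens = 512, 2048
prompt_tps, gen_tps = 1106, 59

est = prompt_tokens / prompt_tps + gen_tokens / gen_tps
print(round(est, 1))   # ~35.2s, close to the measured 35.5s
```

The same arithmetic shows why generation TPS dominates for coding agents: the 8K-token prompt in the RAG row costs ~5s, but 2K generated tokens cost ~35s.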
Here are the exact commands:
Fast chat (short prompt, short response):
```shell
mlx_lm.benchmark --model mlx-community/Qwen3.6-35B-A3B-bf16 -p 256 -g 256 -n 5
```

Long context / RAG (8K prompt):
```shell
mlx_lm.benchmark --model mlx-community/Qwen3.6-35B-A3B-bf16 -p 8192 -g 128 -n 5
```

Heavy generation (code writing, long responses):
```shell
mlx_lm.benchmark --model mlx-community/Qwen3.6-35B-A3B-bf16 -p 512 -g 2048 -n 5
```

Compare batch sizes (batch 1 vs 4):
```shell
mlx_lm.benchmark --model mlx-community/Qwen3.6-35B-A3B-bf16 -p 512 -g 512 -b 1 -n 3
mlx_lm.benchmark --model mlx-community/Qwen3.6-35B-A3B-bf16 -p 512 -g 512 -b 4 -n 3
```

Key flags: `-p` for prompt tokens, `-g` for generation tokens, `-b` for batch size, `-n` for the number of trials. The model downloads automatically on first run if not cached locally.
Wired memory tip
If you see the warning `Generating with a model that requires...`, increase the wired memory limit:

```shell
sudo sysctl iogpu.wired_limit_mb=100000
```

Set this to something larger than the model size in MB but smaller than your total RAM. Requires macOS 15+.
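A rough way to pick that number, as a back-of-envelope: comfortably above the ~70GB of weights plus KV cache, but below total RAM so the OS keeps breathing room. The 30GB cache allowance and 16GB headroom here are my own rule of thumb, not an official guideline:

```python
# Heuristic for iogpu.wired_limit_mb: weights + a KV-cache allowance,
# capped below total RAM. Margins are assumptions, not official guidance.
def wired_limit_mb(model_gb: float, total_ram_gb: float,
                   headroom_gb: float = 16.0) -> int:
    """Suggest a wired memory limit in MB."""
    limit_gb = min(model_gb + 30.0, total_ram_gb - headroom_gb)
    return int(limit_gb * 1024)

print(wired_limit_mb(70, 128))   # 102400, close to the 100000 used above
```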
More details about the model
Qwen3.6-35B-A3B is a sparse Mixture-of-Experts model from Qwen with 35B total parameters but only 3B active per token (model card). Its architecture is a hybrid of Gated DeltaNet (linear attention) and Grouped Query Attention, arranged as 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)) across 40 layers. It has 256 experts, with 8 routed + 1 shared per layer.
Here’s what that actually means:
**Sparse Mixture-of-Experts (35B total / 3B active).** Think of it like a firm with 35 billion employees, but for any given task, only 3 billion actually show up. The model has 35B parameters total, but a “gating” mechanism picks which ones to use for each token. So you get the knowledge capacity of a 35B model but the speed of a 3B model.
**Gated DeltaNet (linear attention).** Standard attention looks at every previous token to understand the current one — that’s why context windows are expensive. DeltaNet is a newer, more efficient approach that tracks a compressed “memory” of what it’s seen so far. It scales linearly instead of quadratically, which is how this model handles 262K context without choking.
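The core of that compressed memory is the “delta rule”: keep a fixed-size matrix state and correct it toward each new key/value pair, rather than storing every token. A toy sketch for intuition, not the model’s actual kernel (and ignoring the gating entirely):

```python
# Toy delta-rule state update behind DeltaNet-style linear attention:
# a d*d matrix S is corrected toward each (key, value) pair, so memory
# cost is constant in sequence length.
def matvec(S, x):
    """Multiply matrix S (list of rows) by vector x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in S]

def delta_update(S, k, v, beta=1.0):
    """Delta rule: S <- S + beta * (v - S k) k^T, with k a unit vector."""
    pred = matvec(S, k)                      # what S currently recalls for k
    for i in range(len(v)):
        err = beta * (v[i] - pred[i])        # correction toward the new value
        for j in range(len(k)):
            S[i][j] += err * k[j]

d = 3
S = [[0.0] * d for _ in range(d)]            # state size fixed at d*d
delta_update(S, k=[1.0, 0.0, 0.0], v=[2.0, 3.0, 4.0])
print(matvec(S, [1.0, 0.0, 0.0]))            # state now recalls v for that key
```

The point of the sketch: reading back the same key retrieves the stored value, and the state never grows with context length — that is the linear-versus-quadratic trade.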
**Grouped Query Attention (GQA).** A middle ground between full multi-head attention and multi-query attention. Instead of each query head maintaining its own key/value pairs (expensive) or all sharing one (fast but lower quality), groups of query heads share a single KV pair. Faster than standard attention, better quality than multi-query.
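The sharing scheme is just an index mapping: consecutive query heads read from the same KV head. A minimal sketch — the head counts are made-up examples, not Qwen3.6’s actual configuration:

```python
# GQA head mapping: with n_q query heads and n_kv KV heads, each group of
# n_q // n_kv consecutive query heads shares one KV head, shrinking the
# KV cache by that same factor.
def kv_head_for(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return which KV head a given query head reads from."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# Example: 32 query heads sharing 8 KV heads (hypothetical numbers).
print([kv_head_for(q, 32, 8) for q in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
print(32 // 8)                                     # KV cache shrinks 4x
```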
**The architecture pattern** `10 × (3 × (DeltaNet → MoE) → 1 × (Attention → MoE))`
The model has 40 layers total, organized in repeating blocks of 4:
- 3 layers use DeltaNet + MoE (fast, good for long context)
- 1 layer uses standard Attention + MoE (more precise, good for complex reasoning)
- Repeat that 10 times → 30 DeltaNet + 10 Attention = 40 layers
The idea: most of the time use the fast DeltaNet, but sprinkle in standard attention at key points for when the model needs to really focus.
**256 experts, 8 routed + 1 shared.** There are 256 different sub-networks (“experts”). For each token, the gate picks 8 of them to process it. Plus there’s 1 “shared” expert that every token goes through regardless. So each token passes through 9 experts total. Different experts specialize in different things — code, math, language, etc.
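The routing step can be sketched in a few lines. The 256 / top-8 numbers follow the model card; the gate here is a stand-in (random scores instead of a learned network):

```python
# Toy top-k expert routing: score all experts, keep the top 8, and
# softmax-normalize their scores into mixing weights. The shared expert
# participates unconditionally on top of these.
import math
import random

def route(gate_logits, top_k=8):
    """Pick the top_k experts and softmax-normalize their scores."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    exps = [math.exp(gate_logits[i]) for i in top]
    z = sum(exps)
    return {i: e / z for i, e in zip(top, exps)}

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]  # stand-in gate scores
weights = route(logits)                            # 8 routed experts, weights sum to 1
print(len(weights))
# Plus the 1 shared expert -> 9 experts active per token, out of 256.
```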
That’s it. A capable local coding agent running on Apple Silicon, no API keys, no rate limits. The MLX ecosystem is maturing fast and this setup is a solid foundation.
I used pi-coding-agent with Qwen3.6-35b-a3b to gather and summarize the benchmarks for this blog post, which is a nice meta loop.
Footnotes
1. BenchLM comparison: Opus 4.7 vs Sonnet 4