No, you won’t get Opus levels of intelligence, but it is still pretty great.
I’ve been running Claude Code with a local model in Ollama, and while it’s not out of this world, it’s still surprising you can run something this (dare I say … this good?) on consumer hardware. You get Haiku-ish levels of intelligence. Given that a year or so ago these models were far too big to run on average laptops, this is a big win.
I’m running this on a standard MacBook Pro M5 with 32GB of RAM, using Ollama and Claude Code.
I’m using Claude Code as it’s a great model harness. I tried writing one for fun a while back and quickly realised that LLMs on their own aren’t all that smart; it’s the harness that gives them their power and ability. A good model with a bad harness is pretty useless. A good model with a great harness is still pretty powerful.
I therefore affectionately call this my openclaude command, tweaked for the M5 32GB MBP:
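openclaude() {
    # Start the Ollama server in the background and discard its output.
    nohup ollama serve > /dev/null 2>&1 &
    # Launch Claude Code against the local model; the function waits here until the session ends.
    CLAUDE_CODE_SKIP_PROMPT_HISTORY=1 OLLAMA_BACKEND=mlx OLLAMA_NUM_CTX=16384 CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192 ollama launch claude --model qwen3.6:35b-a3b-coding-nvfp4
    # Stop the server once Claude Code exits.
    pkill -f "ollama serve"
    echo "ollama stopped"
    sleep 1
}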
The first line starts up Ollama in the background and pumps its output to /dev/null (it’s web scale!).
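Because the server is started in the background, the next command can run before the server is ready to accept connections. Here is a minimal sketch of a readiness check, assuming Ollama is listening on its default address, http://localhost:11434. The helper name `wait_until` is made up for this example and is not part of Ollama or Claude Code.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  local attempts="$1"
  local delay="$2"
  local i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    # Run the probe quietly; a zero exit status means "ready".
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  # Every attempt failed.
  return 1
}
```

Inside the function, this could sit right after the `ollama serve` line as `wait_until 30 0.5 curl -sf http://localhost:11434/`, which polls the server for up to roughly 15 seconds before giving up.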
Then we have a few tweaks to get it working well on our resource-constrained machine:
| Setting | What it does |
|---|---|
| CLAUDE_CODE_SKIP_PROMPT_HISTORY=1 | Tells Claude Code not to persist conversation history to disk, so each session starts fresh (much like sending logs to /dev/null) |
| OLLAMA_BACKEND=mlx | Selects Apple’s MLX inference backend instead of llama.cpp. MLX is optimised for Apple Silicon and uses shared CPU/GPU memory, minimising copies and maximising throughput on the M5’s unified memory |
| OLLAMA_NUM_CTX=16384 | Sets the context window to 16,384 tokens (the maximum input plus output the model can “remember”) |
| CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192 | Caps output at 8,192 tokens per response to prevent runaway generations |
| ollama launch claude | Launches Claude Code through the Ollama interface |
| --model qwen3.6:35b-a3b-coding-nvfp4 | Loads Qwen 3.6: 35B total parameters with about 3B active per token (Mixture of Experts, MoE), coding-optimised, quantised to NVFP4 (a 4-bit floating-point format) |
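To see how the two token limits interact, here is a back-of-the-envelope sketch. It assumes the context window is shared between input and output, as described above, and that a response uses its full output cap. The variable names are just for illustration.

```shell
# Values taken from the openclaude command.
num_ctx=16384      # total context window (input + output), in tokens
max_output=8192    # per-response output cap, in tokens

# Worst case: a response uses its whole output cap, so this much is
# left for everything on the input side (system prompt, files, history).
input_budget=$((num_ctx - max_output))

echo "input budget: ${input_budget} tokens"
```

With these settings, up to half of the window can be taken by a single response, which is one reason a long session fills the context quickly.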
As a side note, because the context window is smaller than normal, you will have to run /clear more often, or set up auto-compaction. I’ve not done that here, as I always run /clear when I’m changing tasks.
It’s no frontier model, but it is good for running things locally or when you have no internet.