
Yes, you can code with local LLMs

No, you won’t get Opus levels of intelligence, but it is still pretty great.

I’ve been running Claude Code with a local model in Ollama, and while it’s not out of this world, it’s still surprising that you can run something this (dare I say it?) good on consumer hardware. You get Haiku-ish levels of intelligence. Given that a year or so ago models like these were far too big to run on an average laptop, this is a big win.

I’m running this on a standard MacBook Pro M5 with 32GB of RAM, using Ollama and Claude Code.

I’m using Claude Code because it’s a great model harness. For fun, I tried to write one a while back, and I quickly realised that LLMs aren’t all that smart on their own; it’s the harness that gives them their power and ability. A good model with a bad harness is pretty useless. A good model with a great harness is still pretty powerful.

I therefore affectionately call this my openclaude command, tweaked for the M5 32GB MBP:

openclaude() {
  nohup ollama serve > /dev/null 2>&1 &
  CLAUDE_CODE_SKIP_PROMPT_HISTORY=1 \
    OLLAMA_BACKEND=mlx \
    OLLAMA_NUM_CTX=16384 \
    CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192 \
    ollama launch claude --model qwen3.6:35b-a3b-coding-nvfp4
  pkill -f "ollama serve"
  echo "ollama stopped"
  sleep 1
}

The first line starts up Ollama in the background and pumps its output to /dev/null (it’s web scale!).
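One thing the function does not do is wait for the server to come up before launching Claude Code. The post doesn’t say whether `ollama launch` handles that itself, so treat the following as an optional sketch rather than a required fix. `wait_until` is a hypothetical helper (not part of Ollama or Claude Code) that retries any command until it succeeds or runs out of attempts:

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <command> [args...]
wait_until() {
  local attempts="$1"
  shift
  while [ "$attempts" -gt 0 ]; do
    # Run the probe quietly; success means we are done.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    attempts=$(( attempts - 1 ))
    # Pause between retries, but not after the final failed attempt.
    if [ "$attempts" -gt 0 ]; then
      sleep 1
    fi
  done
  return 1
}

# Example probe (assumes Ollama's default address, http://localhost:11434):
#   wait_until 10 curl -sf http://localhost:11434/
```

Dropped in between the `ollama serve` line and the launch line, the example probe would give the server up to roughly ten seconds to start answering requests.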

Then we have a few tweaks to get it working well on our resource-constrained machine:

Flag | What it does
CLAUDE_CODE_SKIP_PROMPT_HISTORY=1 | Tells Claude Code not to persist conversation history to disk. Each session starts fresh (similar to /dev/null not logging anything).
OLLAMA_BACKEND=mlx | Selects Apple’s MLX inference backend instead of llama.cpp. MLX is optimised for Apple Silicon and uses shared CPU/GPU memory, minimising copies and maximising throughput on the M5’s unified memory.
OLLAMA_NUM_CTX=16384 | Sets the context window to 16,384 tokens (the maximum input + output the model can “remember”).
CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192 | Caps output at 8,192 tokens per response to prevent runaway generations.
ollama launch claude | Launches Claude Code through the Ollama interface.
--model qwen3.6:35b-a3b-coding-nvfp4 | Loads Qwen 3.6: a 35B-parameter Mixture of Experts (MoE) model with 3B active parameters, coding-optimized, NVFP4 (4-bit floating point) quantized.
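As a rough sanity check on why this fits in 32GB, here is a back-of-the-envelope estimate of the weight footprint. `weights_gb` is a hypothetical helper, and the figure is only a lower bound: it ignores quantization scale factors, the KV cache, and whatever the runtime itself needs. It also assumes all 35B parameters stay resident in memory, even though only about 3B are active per token.

```shell
# Hypothetical helper: rough weight footprint in GB (decimal) for a quantized model.
# $1 = parameters in billions, $2 = bits per weight.
# Ignores quantization scale factors, the KV cache, and runtime overhead.
weights_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

weights_gb 35 4   # prints 17.5
```

That leaves some headroom on a 32GB machine for the context, the OS, and everything else you have open, which is presumably why a 4-bit build is the one to reach for here.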

As a side note, because the context is smaller than normal, you will have to run /clear more often, or set up auto-compaction. I’ve not done that here, as I always run the /clear command when I’m changing tasks.
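To see why the window fills up quickly, here is the arithmetic as a small sketch. `input_budget` is a hypothetical helper, and it assumes the output cap is counted against the same 16,384-token window (as the “input + output” description above suggests):

```shell
# Hypothetical helper: tokens left for the prompt and conversation history
# once the per-response output cap is reserved out of the context window.
# $1 = context window in tokens, $2 = output cap in tokens.
input_budget() {
  if [ "$2" -gt "$1" ]; then
    echo "output cap exceeds context window" >&2
    return 1
  fi
  echo $(( $1 - $2 ))
}

input_budget 16384 8192   # prints 8192
```

Under that assumption, a worst-case response leaves about 8K tokens for the system prompt, tool output, and everything said so far, so clearing between tasks is a sensible habit.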

It’s no frontier model, but it is good for running things locally or when you have no internet.