Yes, you can code with local LLMs

No, you won’t get Opus levels of intelligence, but it is still pretty great.

I’ve been running Claude Code with a local model in Ollama, and while it’s not out of this world, it’s still surprising that you can run something this good (dare I say it?) on consumer hardware. You get Haiku-ish levels of intelligence. Given that a year or so ago these models were far too big to run on average laptops, this is a big win.

I’m running this on a standard MacBook Pro M5 with 32GB of RAM, using Ollama and Claude Code.

I’m using Claude Code because it’s a great model harness. I tried writing one for fun a while back, and I quickly realised that LLMs aren’t all that smart on their own; it’s the harness that gives them their power and ability. A good model with a bad harness is pretty useless. A good model with a great harness is still pretty powerful.

I therefore affectionately call this my openclaude command, tweaked for the M5 32GB MBP:

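openclaude() {
  nohup ollama serve > /dev/null 2>&1 &  # start the Ollama server in the background, discarding its output
  # Launch Claude Code against the local model with these settings
  CLAUDE_CODE_SKIP_PROMPT_HISTORY=1 \
  OLLAMA_BACKEND=mlx \
  OLLAMA_NUM_CTX=16384 \
  CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192 \
  ollama launch claude --model qwen3.6:35b-a3b-coding-nvfp4
  pkill -f "ollama serve"  # stop the server once Claude Code exits
  echo "ollama stopped"
  sleep 1
}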

The first line starts Ollama in the background and pumps its output to /dev/null (it’s web scale!).
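Because the server is started in the background, the launch command on the next line may run before Ollama is ready to accept connections. Below is a minimal sketch of a retry helper that could sit between the two lines. The name `wait_until` is hypothetical and not part of Ollama or Claude Code, and the example URL assumes Ollama is listening on its default port, 11434.

```shell
# wait_until TRIES DELAY COMMAND...
# Hypothetical helper: re-run COMMAND until it succeeds, up to TRIES times
# (at least once), sleeping DELAY seconds between attempts.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the probe command, hiding its output
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Give up once the attempt budget is spent
    if [ "$attempt" -ge "$tries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (assumes the default Ollama address):
#   wait_until 20 0.5 curl -sf http://localhost:11434/ || return 1
```

Polling like this gives up after a bounded number of attempts instead of hanging if the server never comes up.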

Then we have a few tweaks to get it working well on our resource-constrained machine:

Setting	What it does
`CLAUDE_CODE_SKIP_PROMPT_HISTORY=1`	Tells Claude Code not to persist conversation history to disk. Each session starts fresh (much like sending logs to /dev/null)
`OLLAMA_BACKEND=mlx`	Selects Apple’s MLX inference backend instead of llama.cpp. MLX is optimised for Apple Silicon and works directly in the memory shared by the CPU and GPU, minimising copies and maximising throughput on the M5’s unified memory
`OLLAMA_NUM_CTX=16384`	Sets the context window to 16,384 tokens (the maximum input plus output the model can “remember” at once)
`CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192`	Caps output at 8,192 tokens per response to prevent runaway generations
`ollama launch claude`	Launches Claude Code through Ollama
`--model qwen3.6:35b-a3b-coding-nvfp4`	Loads Qwen 3.6, a 35B-parameter Mixture of Experts (MoE) model with about 3B parameters active per token, coding-optimized and quantized to 4 bits (NVFP4)
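To see why a 35B model fits on a 32GB machine at all, here is a rough back-of-the-envelope estimate of the weight size. It is only a sketch: the helper name `weights_gb` is made up for illustration, and the figure ignores quantization scale factors, the KV cache and runtime overhead, so real memory use will be higher. Note that all 35B parameters have to be resident in memory, even though only about 3B are active for any given token.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Hypothetical helper: approximate size of the model weights in GB (decimal),
# computed as parameters * bits per weight / 8 bits per byte.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 35 4    # 35B parameters at 4 bits each: prints 17.5
weights_gb 35 16   # the same model at 16 bits: prints 70.0
```

At 4 bits the weights leave headroom on a 32GB machine; at 16 bits they would not fit.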

As a side note, because the context is smaller than normal, you will have to run /clear more often or set up auto-compaction. I’ve not done that here, as I always run /clear when I change tasks.
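A quick bit of arithmetic shows how tight this is. Assuming the context window counts input and output together (as described in the table), a response that uses its full output cap leaves only the remainder for everything else: the system prompt, tool definitions, file contents and conversation history. The helper name `prompt_budget` is hypothetical.

```shell
# prompt_budget NUM_CTX MAX_OUTPUT
# Hypothetical helper: tokens left for input if a response uses its full cap.
prompt_budget() {
  echo $(( $1 - $2 ))
}

prompt_budget 16384 8192   # prints 8192
```

With these settings, half the window can go to a single long response, which is why clearing the context between tasks matters.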

Another thing to note: if you put your Mac into Low Power Mode, you’ll get fewer tokens per second, but your machine won’t get hot and the fans won’t spin up like crazy.
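If generation feels slower than expected, it can help to check whether Low Power Mode is on. The sketch below assumes that `pmset -g` prints a `lowpowermode` line with a value of 0 or 1, which I believe is the case on recent MacBooks but have not checked on every macOS version. The function name `low_power_enabled` is made up, and it reads the `pmset` output from standard input so it can be tried with sample text.

```shell
# low_power_enabled
# Hypothetical helper: reads `pmset -g` style output on stdin and succeeds
# if it contains a line of the form "lowpowermode 1".
low_power_enabled() {
  grep -Eq '^[[:space:]]*lowpowermode[[:space:]]+1([[:space:]]|$)'
}

# Example on a Mac:
#   pmset -g | low_power_enabled && echo "Low Power Mode is on"
```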

It’s no frontier model, but it is good for running things locally or when you have no internet.