Sometimes when I run out of tokens in Claude on my Mac laptop, I switch to an LLM running on a Mac Studio (M4 Max, 36 GB) that also hosts other software.
The LLM I run on the server is Qwen3.5.
This local solution runs at roughly one third the speed of Claude with Sonnet 4.6.
Here is how I got it working. It is pretty easy, but it took me a while to find the right tools: unlike mlx_lm or mlx_vlm, vllm_mlx has a native Claude-compatible server.
On the server (the Mac Studio):

brew install python@3.14
mkdir mlx; cd mlx
/opt/homebrew/bin/python3 -m venv .
source bin/activate
pip3 install vllm_mlx
vllm-mlx serve mlx-community/Qwen3.5-35B-A3B-4Bit --port 8080 --host 0.0.0.0

On the client (my laptop), where 192.168.1.10 is the address of the server:
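Before pointing Claude Code at the server, it can help to hit the endpoint directly. This is a sketch, assuming vllm-mlx mirrors the Anthropic Messages API at /v1/messages on the port above; the helper function name is mine.

```python
import json
import urllib.request

SERVER = "http://192.168.1.10:8080"  # Mac Studio address from above
MODEL = "mlx-community/Qwen3.5-35B-A3B-4Bit"

def build_messages_request(model, prompt, max_tokens=128):
    """Build a minimal Anthropic-style Messages payload (helper name is mine)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def smoke_test():
    """POST one short prompt to the server and print the raw JSON reply."""
    payload = build_messages_request(MODEL, "Say hello in one word.")
    req = urllib.request.Request(
        f"{SERVER}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json", "x-api-key": "not-needed"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(json.load(resp))

# smoke_test()  # uncomment once the server is running
```

If this returns a JSON body with generated text, the Claude Code client configuration below should work too.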
export ANTHROPIC_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_DEFAULT_HAIKU_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_DEFAULT_OPUS_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_DEFAULT_SONNET_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_API_KEY=not-needed
export ANTHROPIC_BASE_URL=http://192.168.1.10:8080
claude
Explaining a 20 KB Python program took about a minute locally, versus about 20 seconds on regular Claude. The explanations were different but both reasonable.
Update: when running Qwen3.5 under ollama (qwen3.5:35b-a3b) on a local PC with a 5090, I get near-instantaneous performance using the Pi Coding Agent on my workstation.
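For the ollama route, the setup is roughly the sketch below. Ollama serves an OpenAI-compatible API on port 11434 by default; how the Pi Coding Agent is pointed at that endpoint depends on its own configuration, which I don't reproduce here.

```shell
# Pull and serve the model under ollama (sketch)
ollama pull qwen3.5:35b-a3b
ollama run qwen3.5:35b-a3b    # or keep `ollama serve` running in the background
# ollama's OpenAI-compatible endpoint is then at http://localhost:11434/v1
```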