Sometimes when I run out of tokens in Claude on my Mac laptop, I switch to an LLM running on a Mac Studio (M4 Max, 36 GB) that also hosts other software.
The LLM I run on the server is Qwen3.5.
This local solution runs at roughly one third the speed of Claude with Sonnet 4.6.
Here is how I got it working. It is pretty easy, but it took me a while to find the right tools: unlike mlx_lm or mlx_vlm, vllm_mlx has a native Claude-compatible server.
On the server (the Mac Studio):

brew install python@3.14
mkdir mlx; cd mlx
/opt/homebrew/bin/python3 -m venv .
source bin/activate
pip3 install vllm_mlx
vllm-mlx serve mlx-community/Qwen3.5-35B-A3B-4Bit --port 8080 --host 0.0.0.0

On the client (my laptop), where 192.168.1.10 is the address of the server:
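Before pointing Claude Code at the server, it can help to hit the endpoint directly. This is a sketch, assuming vllm-mlx mirrors the Anthropic Messages API at /v1/messages on the port above; the helper function name is mine.

```python
import json
import urllib.request

SERVER = "http://192.168.1.10:8080"  # Mac Studio address from above
MODEL = "mlx-community/Qwen3.5-35B-A3B-4Bit"

def build_messages_request(model, prompt, max_tokens=128):
    """Build a minimal Anthropic-style Messages payload (helper name is mine)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def smoke_test():
    """POST one short prompt to the server and print the raw JSON reply."""
    payload = build_messages_request(MODEL, "Say hello in one word.")
    req = urllib.request.Request(
        f"{SERVER}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json", "x-api-key": "not-needed"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(json.load(resp))

# smoke_test()  # uncomment once the server is running
```

If this returns a JSON body with generated text, the Claude Code client configuration below should work too.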
export ANTHROPIC_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_DEFAULT_HAIKU_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_DEFAULT_OPUS_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_DEFAULT_SONNET_MODEL=mlx-community/Qwen3.5-35B-A3B-4Bit
export ANTHROPIC_API_KEY=not-needed
export ANTHROPIC_BASE_URL=http://192.168.1.10:8080
claude
Explaining a 20 KB Python program took about a minute locally, versus about 20 seconds on regular Claude. The explanations were different but both reasonable.
Update: when running Qwen3.5 under ollama (qwen3.5:35b-a3b) on a local PC with a 5090, I get near-instantaneous performance using the Pi Coding Agent on my workstation.
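For the ollama route, the setup is roughly the sketch below. Ollama serves an OpenAI-compatible API on port 11434 by default; how the Pi Coding Agent is pointed at that endpoint depends on its own configuration, which I don't reproduce here.

```shell
# Pull and serve the model under ollama (sketch)
ollama pull qwen3.5:35b-a3b
ollama run qwen3.5:35b-a3b    # or keep `ollama serve` running in the background
# ollama's OpenAI-compatible endpoint is then at http://localhost:11434/v1
```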