Running Local AI Models on Your Mac — What Actually Works

We've been deep in the trenches this week testing local AI models on Apple Silicon, and I wanted to share the real-world playbook so you don't have to figure this out from scratch.

🔧 THE FRAMEWORK STACK (Start Here)

Not all local AI tools are equal on a Mac. There are 3 main frameworks:

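1. MLX-LM — This is your best friend on Apple Silicon. It's built on MLX, Apple's own machine-learning framework, and runs on the GPU through Metal (plus the Neural Accelerators on chips that have them), which often makes it roughly 20-50% faster than the alternatives. Always reach for this first.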

2. Ollama — Dead simple to install, great for getting started fast. Excellent for connecting tools like ProductiveBot to a local model.

3. llama.cpp — The fallback for specialized use cases. Great for very large models (100B+) and certain model types that don't have clean MLX builds yet.

Framework priority: MLX-LM → Ollama → llama.cpp
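
If you want to see which of the three you already have, here's a minimal Python sketch that applies the priority order above. The detection details are assumptions on my part: that MLX-LM is importable as the `mlx_lm` package, that Ollama puts an `ollama` binary on your PATH, and that your llama.cpp build provides `llama-cli`. Adjust the names if your install differs.

```python
import importlib.util
import shutil

# Priority order taken from the post: MLX-LM first, then Ollama, then llama.cpp.
PRIORITY = ["mlx-lm", "ollama", "llama.cpp"]


def detect_installed():
    """Return the set of frameworks that appear to be installed.

    Assumptions: MLX-LM is importable as `mlx_lm`, Ollama provides an
    `ollama` binary, and llama.cpp provides a `llama-cli` binary.
    """
    found = set()
    if importlib.util.find_spec("mlx_lm") is not None:
        found.add("mlx-lm")
    if shutil.which("ollama"):
        found.add("ollama")
    if shutil.which("llama-cli"):
        found.add("llama.cpp")
    return found


def pick_framework(installed):
    """Return the highest-priority framework in `installed`, or None."""
    for name in PRIORITY:
        if name in installed:
            return name
    return None


# Prints the preferred framework found on this machine, or None.
print(pick_framework(detect_installed()))
```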

---

💾 THE RAM REALITY CHECK

Local models live entirely in your RAM. This is the #1 thing people underestimate:

8 GB RAM → Small models only (Phi-4 Mini, Gemma 2B) — limited but usable

16 GB RAM → Good 7B-13B models (Llama 3.1 8B, Mistral 7B) — solid daily driver

32 GB RAM → Sweet spot — quality 27B-35B models, excellent performance

64 GB+ RAM → Flagship territory — 70B models, frontier-class responses

128 GB RAM → Beast mode — 120B+ models, near-cloud-quality output

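Apple Silicon's unified memory architecture is a massive advantage here: the GPU and CPU share one memory pool, so the GPU can use most of your system RAM for the model instead of being capped by a separate, smaller pool of VRAM. That's why a Mac can often load models that won't fit on the GPU in most Windows laptops.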
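
To make the tiers concrete, here's a small Python sketch that maps a RAM size to the tiers listed above. The `tier_for_ram` and `total_ram_gb` helpers are hypothetical names for illustration, and the RAM detection is best-effort (it relies on `os.sysconf`, which isn't available on every platform), so treat it as a starting point rather than a definitive check.

```python
import os

# (minimum GB, what fits) pairs copied from the tiers above, largest first.
TIERS = [
    (128, "120B+ models"),
    (64, "70B models"),
    (32, "27B-35B models"),
    (16, "7B-13B models"),
    (8, "small models only"),
]


def tier_for_ram(ram_gb):
    """Return the tier description for a machine with `ram_gb` of RAM."""
    for minimum, description in TIERS:
        if ram_gb >= minimum:
            return description
    return "below the smallest tier"


def total_ram_gb():
    """Best-effort total physical RAM in GB, or None if it can't be read."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None
    return pages * page_size / 1024**3


ram = total_ram_gb()
# Round so a machine reporting slightly under its nominal size still
# lands in the tier it was sold as.
print(tier_for_ram(round(ram)) if ram else "could not detect RAM")
```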

---

🔢 QUANTIZATION — THE QUALITY VS. SPEED DIAL

Quantization is how you shrink a model to fit your RAM. Think of it like JPEG compression — you trade some quality for a smaller file size.

Q8 — Best quality, at roughly 1 byte per parameter. A 35B model is about 35GB at Q8, so on 32GB RAM it only works for models well under ~30B

Q6 — Great balance for 40-70B models

Q4 — Gets big models (70B+) running, quality dip is noticeable but still useful

Q2/Q3 — Only go here if you're desperate for RAM

Rule of thumb: Use the highest Q you can while still leaving 2-4GB of RAM free for your Mac to breathe.
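
The arithmetic behind that rule of thumb is simple enough to sketch: weights take roughly (parameters × bits per weight ÷ 8) bytes. The Python below is a rough estimator under stated assumptions: the bit widths are nominal (real quant formats carry a bit of extra per-block overhead), and `overhead_gb` is a placeholder guess for context/KV cache, not a measured number.

```python
# Nominal bits per weight for each quantization level.
BITS = {"Q8": 8, "Q6": 6, "Q4": 4, "Q3": 3, "Q2": 2}


def weights_gb(params_billions, quant):
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BITS[quant] / 8


def highest_quant_that_fits(params_billions, ram_gb, headroom_gb=4, overhead_gb=2):
    """Return the highest quant whose weights fit in the budget, or None.

    headroom_gb: RAM left free for macOS (2-4GB per the rule of thumb).
    overhead_gb: assumed allowance for context/KV cache (a guess).
    """
    budget = ram_gb - headroom_gb - overhead_gb
    for quant in ["Q8", "Q6", "Q4", "Q3", "Q2"]:
        if weights_gb(params_billions, quant) <= budget:
            return quant
    return None


print(weights_gb(70, "Q4"))             # 35.0
print(highest_quant_that_fits(70, 64))  # Q6
print(highest_quant_that_fits(8, 16))   # Q8
```

If it returns None, the model is too big for that machine even at Q2 under these assumptions, so pick a smaller model instead.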

---

🏆 MODELS WORTH TRYING RIGHT NOW

Best small/fast (great on 8-16GB):

• Phi-4 Mini — insanely fast, surprisingly capable

• Gemma 4 E2B — speed demon, good for quick tasks

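Best all-around (16-32GB):

• Qwen3.6-27B — coding powerhouse

Flagship (64GB+):

• Llama 3.3 70B at Q4 — well-rounded, great for general use (roughly 35-40GB of weights, so it won't fit in 32GB)

• Qwen3.6-35B-A3B — ranked #1 on SWE-bench (coding benchmark) at the time of writing

• Qwen3.5-122B — frontier-class; about 61GB of weights at Q4 before any overhead, so it needs more than 64GB in practice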

---

⚡ QUICK START: GET RUNNING IN 5 MINUTES

The fastest path to your first local model:

1. Install Ollama: brew install ollama (or download from ollama.ai)

2. Start it: ollama serve (leave it running and open a second terminal tab for the next steps)

3. Pull a model: ollama pull llama3.2 (for 16GB Mac) or ollama pull qwen2.5:32b (for 32GB+)

4. Chat with it: ollama run llama3.2

That's it. You're running AI locally, 100% private, zero API costs.
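
Once the server is up, you can also talk to it from code. Here's a minimal Python sketch using only the standard library against Ollama's local REST endpoint (`/api/generate` on port 11434, the default). The `build_request` and `ask` helpers are names I made up for illustration, and the model name assumes you pulled `llama3.2` in step 3.

```python
import json
import urllib.request

# Default address of a local Ollama server started with `ollama serve`.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.2"):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3.2", timeout=120):
    """Send the prompt and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (needs `ollama serve` running and the model already pulled):
# print(ask("Explain unified memory in one sentence."))
```

Setting "stream" to False returns one JSON object instead of a stream of chunks, which keeps the example short. The first call can be slow while the model loads into RAM.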

---

💡 PRO TIPS FROM REAL TESTING

✅ Always run models in a dedicated terminal session (tmux is great) so they stay alive

✅ MLX format files from mlx-community on Hugging Face are usually faster than GGUF on Apple Silicon

✅ Don't run other heavy apps while you're using a large model — thermal throttling is real

✅ If a model won't load, your RAM is the issue 99% of the time — try a lower Q first

✅ Keep 2-4GB RAM headroom — macOS needs breathing room or you'll get slowdowns

---

🔌 CONNECTING TO PRODUCTIVEBOT

Once you have Ollama running, you can connect it to ProductiveBot as an additional AI provider. This means you can route certain tasks to your local model — great for private data or when you want zero cloud calls. Full guide coming soon.
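
Until that guide lands, one thing worth doing before you wire up any provider is confirming that Ollama is reachable and seeing which models it will serve. The Python sketch below queries Ollama's `/api/tags` endpoint, which lists locally pulled models. It says nothing about ProductiveBot's own settings, which aren't covered here; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Ollama endpoint that lists locally pulled models (default port).
TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def local_models(url=TAGS_URL, timeout=5):
    """Return the model names Ollama serves, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_names(json.loads(resp.read()))
    except (urllib.error.URLError, OSError):
        return None


# Example: prints a list such as ['llama3.2:latest'], or None if the
# server isn't running.
# print(local_models())
```

Ollama also exposes an OpenAI-compatible API under http://localhost:11434/v1, which is what many tools expect when you add a custom provider. Whether ProductiveBot uses that route is an open question until the full guide is out.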

---

What Mac are you running, and have you tried any local models yet? Drop your setup in the comments — curious what the community is working with! 👇
