We've been deep in the trenches this week testing local AI models on Apple Silicon, and I wanted to share the real-world playbook so you don't have to figure this out from scratch.
🔧 THE FRAMEWORK STACK (Start Here)
Not all local AI tools are equal on a Mac. There are 3 main frameworks:
1. MLX-LM: This is your best friend on Apple Silicon. It uses Apple's Metal GPU shaders and Neural Accelerators, which can make it 20-50% faster than the alternatives. Always reach for this first.
2. Ollama: Dead simple to install and great for getting started fast. Excellent for connecting tools like ProductiveBot to a local model.
3. llama.cpp: The fallback for specialized use cases. Great for very large models (100B+) and for model types that don't have clean MLX builds yet.
Framework priority: MLX-LM → Ollama → llama.cpp
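If you'd rather script that priority order than remember it, here's a minimal Python sketch. It assumes MLX-LM is installed as the Python package `mlx_lm`, and that Ollama and llama.cpp are on your PATH as `ollama` and `llama-cli`. Those names may differ depending on how you installed things, and `pick_framework` is just our own helper name.

```python
import importlib.util
import shutil


def pick_framework(has_module=None, has_binary=None):
    """Return the first available framework in the order MLX-LM -> Ollama -> llama.cpp.

    The two probe arguments are hypothetical hooks so the logic can be checked
    without anything installed. By default they look at the real system.
    """
    if has_module is None:
        has_module = lambda name: importlib.util.find_spec(name) is not None
    if has_binary is None:
        has_binary = lambda name: shutil.which(name) is not None

    if has_module("mlx_lm"):      # assumed package name for MLX-LM
        return "mlx-lm"
    if has_binary("ollama"):      # assumed CLI name for Ollama
        return "ollama"
    if has_binary("llama-cli"):   # assumed CLI name for llama.cpp
        return "llama.cpp"
    return None
```

Calling `pick_framework()` with no arguments checks your own machine and returns `None` if none of the three is found.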
---
💾 THE RAM REALITY CHECK
Local models live entirely in your RAM. This is the #1 thing people underestimate:
8 GB RAM → Small models only (Phi-4 Mini, Gemma 2B) — limited but usable
16 GB RAM → Good 7B-13B models (Llama 3.2, Mistral 7B) — solid daily driver
32 GB RAM → Sweet spot — quality 27B-35B models, excellent performance
64 GB+ RAM → Flagship territory — 70B models, frontier-class responses
128 GB RAM → Beast mode — 120B+ models, near-cloud-quality output
Apple Silicon's unified memory architecture is a massive advantage here: your GPU and CPU share the same memory pool, so a model isn't squeezed into a separate, smaller pool of video memory. That's why Macs run local models so much more efficiently than most Windows laptops.
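Here's a small Python sketch of the tier list above as a lookup, in case you want to check a machine from a script. The tier cutoffs come straight from the list. The helper names (`tier_for`, `total_ram_gb`) are ours, and reading RAM through `os.sysconf` is an assumption that should hold on macOS and Linux but isn't guaranteed everywhere.

```python
import os

# RAM tiers from the list above: (minimum GB, what fits).
RAM_TIERS = [
    (128, "120B+ models"),
    (64, "70B models"),
    (32, "27B-35B models"),
    (16, "7B-13B models"),
    (8, "small models only"),
]


def tier_for(ram_gb):
    """Return the description of the largest tier this much RAM reaches."""
    for min_gb, description in RAM_TIERS:
        if ram_gb >= min_gb:
            return description
    return "below the 8 GB floor in this guide"


def total_ram_gb():
    """Best-effort total RAM in GB via POSIX sysconf (assumed to work on macOS/Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
```

On your own Mac, `tier_for(total_ram_gb())` tells you which row of the list you're in.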
---
🔢 QUANTIZATION — THE QUALITY VS. SPEED DIAL
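Quantization is how you shrink a model to fit your RAM. Think of it like JPEG compression: you trade some quality for a smaller file size. The number after the Q is roughly the bits stored per weight, so Q8 takes about 1GB per billion parameters and Q4 about half that.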
Q8: Best quality. On 32GB RAM, use it for models up to about 27B (a 35B model at Q8 needs roughly 35GB for the weights alone)
Q6: Great balance for 40-70B models on 64GB+
Q4: Gets big models (70B+) running; the quality dip is noticeable but still useful
Q2/Q3: Only go here if you're desperate for RAM
Rule of thumb: Use the highest Q you can while still leaving 2-4GB of RAM free for your Mac to breathe.
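To make the rule of thumb concrete, here's a rough Python sketch of the arithmetic. It treats the Q level as bits per weight (weights ≈ parameters × bits ÷ 8) and reserves 4GB of headroom by default. The function names are ours, and the estimate is a lower bound: real quantized files run somewhat larger, and the context cache needs memory on top.

```python
def weights_gb(params_billion, bits):
    """Approximate size of the weights alone, in GB: parameters x bits / 8."""
    return params_billion * bits / 8


def best_quant(params_billion, ram_gb, headroom_gb=4, levels=(8, 6, 4, 3, 2)):
    """Return the highest bit width whose weights fit with headroom, else None."""
    for bits in levels:
        if weights_gb(params_billion, bits) + headroom_gb <= ram_gb:
            return bits
    return None
```

For example, `best_quant(70, 64)` picks 6 (about 52.5GB of weights), while `best_quant(70, 32)` falls all the way to 3, which is a good sign that a 70B model wants a bigger Mac.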
---
🏆 MODELS WORTH TRYING RIGHT NOW
Best small/fast (great on 8-16GB):
• Phi-4 Mini: insanely fast, surprisingly capable
• Gemma 4 E2B: speed demon, good for quick tasks
Best all-around (16-32GB):
• Qwen3.6-27B: coding powerhouse
Flagship (64GB+):
• Llama 3.3 70B at Q4: well-rounded, great for general use (about 35GB of weights, so it won't fit in 32GB)
• Qwen3.6-35B-A3B: currently ranked #1 on SWE-bench (coding benchmark)
• Qwen3.5-122B: frontier-class; at Q4 the weights alone are about 61GB, so 64GB is too tight and 128GB is the realistic target
---
⚡ QUICK START: GET RUNNING IN 5 MINUTES
The fastest path to your first local model:
1. Install Ollama: brew install ollama (or download from ollama.com)
2. Start it: ollama serve (leave this running and open a second terminal for the next steps)
3. Pull a model: ollama pull llama3.2 (a small 3B model, comfortable on a 16GB Mac) or ollama pull qwen2.5:32b (for 32GB+)
4. Chat with it: ollama run llama3.2
That's it. You're running AI locally, 100% private, zero API costs.
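Once the server is up, you can also talk to it from code. Below is a minimal Python sketch using only the standard library. It assumes Ollama is listening on its default address (`http://localhost:11434`) and that you've already pulled the model you name. `build_request` and `ask` are our own helper names, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model, prompt):
    """Build a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, timeout=120):
    """Send the prompt and return the model's reply text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With `ollama serve` running, `print(ask("llama3.2", "Say hello in five words."))` should print the model's reply. If the connection is refused, the server isn't running.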
---
💡 PRO TIPS FROM REAL TESTING
✅ Always run models in a dedicated terminal session (tmux is great) so they stay alive
✅ MLX-format models from mlx-community on Hugging Face are generally faster than GGUF on Apple Silicon
✅ Don't run other heavy apps while you're using a large model — thermal throttling is real
✅ If a model won't load, your RAM is the issue 99% of the time — try a lower Q first
✅ Keep 2-4GB RAM headroom — macOS needs breathing room or you'll get slowdowns
---
🔌 CONNECTING TO PRODUCTIVEBOT
Once you have Ollama running, you can connect it to ProductiveBot as an additional AI provider. This means you can route certain tasks to your local model — great for private data or when you want zero cloud calls. Full guide coming soon.
---
What Mac are you running, and have you tried any local models yet? Drop your setup in the comments — curious what the community is working with! 👇