Best Local LLMs for Mac (2026): 16 GB, 24 GB, 32 GB & 64 GB Picks
The best local LLMs for Mac in 2026, split by unified memory: practical Qwen3.6, Gemma 4 and Llama 4 choices for 16 GB to 64 GB+ Macs.
- Qwen3.6 35B-A3B — about 24 GB Ollama package, 256K context, text+image, positioned for thinking and coding-agent workflows
- Gemma 4 31B — about 20 GB Ollama package, 256K context, text+image; vendor reports AIME 2026: 89.2%
- Gemma 4 26B A4B — MoE variant with 3.8B active parameters, about 18 GB Ollama package, efficient sweet spot for 24 GB Macs
- Llama 4 Scout (16x17B) — about 67 GB Ollama package, text+image, unsuitable for normal local Macs
- On a 24 GB Mac: start with gemma4:26b or qwen3.6:27b and limit context; 32 GB+ is more comfortable
Graphic based on current Ollama model pages and official model cards. Sources: Ollama Qwen3.6, Ollama Gemma 4, Ollama Llama 4. Verified June 18, 2026.
The State of Open-Weight LLMs — June 2026
Quick answer: For most Mac users in 2026, Gemma 4 and Qwen3.6 are the relevant open-weight choices. gemma4:26b is the cautious first test for 24 GB Macs, while qwen3.6:27b is interesting for coding and agent workflows but needs more headroom with long context. gemma4:31b and qwen3.6:35b-a3b are better suited to 32 GB, 48 GB or more. Llama 4 Scout is impressive on paper, but its roughly 67 GB Ollama package makes it impractical for normal local Macs.
This is not an objective global leaderboard of every open model. It is a Mac-focused shortlist: what runs locally, what needs too much unified memory, and where benchmark numbers are only vendor/model-card signals.
Three model families matter in 2026: Qwen3.6 from Alibaba, Gemma 4 from Google and Llama 4 Scout from Meta. Local observations cover Qwen3.6 and Gemma 4 only. Llama 4 Scout is described from vendor information, because it does not fit in 32 GB of unified memory in the quantizations discussed here and is not a model for everyday Mac use.
Model Overview: Tag, Size, Context, License
| Model | Ollama Tag | Ollama Size | Context | Input | License |
|---|---|---|---|---|---|
| Qwen3.6 27B | qwen3.6:27b | ~17 GB | 256K | Text + Image | Apache 2.0 |
| Qwen3.6 27B MLX | qwen3.6:27b-mlx | ~20 GB | 256K | Text | Apache 2.0 |
| Qwen3.6 35B-A3B | qwen3.6:35b-a3b | ~24 GB | 256K | Text + Image | Apache 2.0 |
| Qwen3.6 35B MLX | qwen3.6:35b-a3b-mlx | ~22 GB | 256K | Text | Apache 2.0 |
| Gemma 4 E2B | gemma4:e2b | 7.2 GB | 128K | Text + Image; audio native on E2B per Google* | Apache 2.0 |
| Gemma 4 E4B | gemma4:e4b | 7.9 GB | 128K | Text + Image; audio native on E4B per Google* | Apache 2.0 |
| Gemma 4 12B | gemma4:12b | 7.6 GB | 256K | Text + Image; audio native per Google, check client* | Apache 2.0 |
| Gemma 4 26B A4B | gemma4:26b | 18 GB | 256K | Text + Image | Apache 2.0 |
| Gemma 4 31B | gemma4:31b | 20 GB | 256K | Text + Image | Apache 2.0 |
| Llama 4 Scout | llama4:16x17b | ~67 GB | 10M in Ollama | Text + Image | Llama 4 Community |
Important: The Ollama size is not the same as total memory use. Context windows, KV cache, macOS, browser, other apps and vision inputs add on top. Larger context windows require significantly more memory.
* Google lists native audio for E2B, E4B and 12B. The 26B A4B and 31B variants are text+image models. Audio support in a specific Ollama tag and client still needs to be checked separately.
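To make the note about package size versus total memory use concrete, here is a minimal sketch that subtracts the Ollama package size from the installed unified memory. The function name `headroom_gb` is hypothetical, the numbers come from the table above, and the result is an upper bound: KV cache, macOS and other apps still have to fit into whatever is left.

```sh
# Rough headroom in whole GB: unified memory minus Ollama package size.
# Ignores KV cache, macOS and other apps, so the real margin is smaller.
headroom_gb() {
    unified_gb=$1   # installed unified memory, e.g. 24
    package_gb=$2   # Ollama package size from the table, e.g. 18
    echo $(( unified_gb - package_gb ))
}

headroom_gb 24 18   # gemma4:26b on a 24 GB Mac    -> 6
headroom_gb 32 20   # gemma4:31b on a 32 GB Mac    -> 12
headroom_gb 64 67   # llama4:16x17b on a 64 GB Mac -> -3 (does not fit)
```

A negative result means the package alone exceeds the memory pool. A small positive result only says the weights fit, not that a long context will.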
What Runs on Your Mac: RAM Recommendations
The tiers below are anchored to your Mac’s unified memory — the shared pool for CPU and GPU on Apple Silicon. The Ollama package size alone is not the total memory cost; context windows, KV cache, macOS and parallel apps all add to it.
| Mac Configuration | Realistic Models | Recommendation |
|---|---|---|
| 8 GB unified memory | gemma4:e2b, gemma4:e4b, smaller Qwen3 models | Light models, short contexts |
| 16 GB unified memory | smaller Qwen3/Gemma models, no huge contexts | Entry-level mid-size models |
| 24 GB unified memory | gemma4:26b, qwen3.6:27b with limited context | gemma4:26b can be a useful first test, but context window, vision inputs and parallel apps must be limited. qwen3.6:27b can run, but is more sensitive to free unified memory and context. |
| 32 GB unified memory | gemma4:31b, qwen3.6:35b-a3b, qwen3.6:27b-mlx | gemma4:31b and qwen3.6:35b-a3b are testable, but not automatically comfortable with a large context. For longer agent runs, 48 GB+ is noticeably more relaxed. |
| 48 GB unified memory | + gemma4:31b with larger context | More relaxed 31B use with a larger context window |
| 64 GB+ unified memory | + qwen3.6:35b-a3b with context | 64 GB+ does not automatically mean Llama 4 Scout. The Ollama package is about 67 GB, plus runtime, KV cache, macOS and apps. |
Llama 4 Scout (~67 GB) is unsuitable for normal local Macs — including a Mac Studio M4 Max with 48 GB.
Model selection: current Ollama tags · verified June 18, 2026
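If you are unsure which row of the table applies, the following sketch maps a memory size to a tier. It assumes macOS, where `sysctl -n hw.memsize` prints the installed memory in bytes. The helper names `bytes_to_gib` and `tier_for_gib` are hypothetical, and rounding down to the nearest listed tier (for example 36 GB to the 32 GB row) is an assumption of this sketch, not a rule from the table.

```sh
# Convert a byte count (as printed by `sysctl -n hw.memsize` on macOS) to whole GiB.
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

# Pick the table tier at or below the given memory size (in GiB).
tier_for_gib() {
    gib=$1
    if   [ "$gib" -ge 64 ]; then echo "64 GB+"
    elif [ "$gib" -ge 48 ]; then echo "48 GB"
    elif [ "$gib" -ge 32 ]; then echo "32 GB"
    elif [ "$gib" -ge 24 ]; then echo "24 GB"
    elif [ "$gib" -ge 16 ]; then echo "16 GB"
    else                         echo "8 GB"
    fi
}
```

On a Mac, `tier_for_gib "$(bytes_to_gib "$(sysctl -n hw.memsize)")"` prints the matching row label.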
Qwen3.6 — Coding and Agent Workflows with 27B Dense and 35B-A3B MoE
Qwen3.6 is a current open-weight Qwen generation for local and agentic workflows, available in 27B dense and 35B-A3B MoE variants. The 35B-A3B variant scores high on multiple benchmarks (see caveat below).
Setup:
# 27B — Text + Image, good local starting point
ollama pull qwen3.6:27b
# 35B-A3B — larger quality variant (from 32 GB+ unified memory).
# Use the explicit A3B tag; `qwen3.6:35b` resolves to the same model.
ollama pull qwen3.6:35b-a3b
# MLX tag — Text-only, not for vision
ollama pull qwen3.6:27b-mlx
# Start
ollama run qwen3.6:35b-a3b
Benchmarks (35B-A3B, per Qwen/Qwen Blog):
Benchmark note: The following values come from vendor pages, model cards or Ollama readmes. They are useful signals, but they are not ai-on-mac.com’s own measurements. Harness, tool use, context length, timeout, prompting, thinking mode and shot count can differ heavily between model families.
| Benchmark | Value | Position |
|---|---|---|
| AIME 2026 | 92.7 % | High score for an open-weight model |
| MMLU-Pro | 85.2 % | Knowledge / reasoning benchmark |
| LiveCodeBench v6 | 80.4 % | Live coding tasks |
| SWE-bench Verified | 73.4 % | Agentic coding with internal scaffold (see caveat) |
| Terminal-Bench 2.0 | 51.5 % | Terminal integration, Harbor/Terminus-2 setup |
Vendor/model-card values. Do not read this as a direct cross-family ranking: harness, prompting, tool use, thinking mode, shot count and evaluation can differ.
Key features:
- Thinking / agent workflows: Qwen3.6 is positioned for longer coding and repository tasks. In normal chat, ask for a brief rationale instead of full reasoning traces.
- Agentic Coding: Repository-level understanding, frontend workflows, terminal integration
- 256K context on 35B-A3B
- A3B = “Active 3 Billion” — only 3B parameters activate per token on the 35B MoE variant
On Mac: qwen3.6:27b is the more practical entry point when text and image input are required. The qwen3.6:27b-mlx tag is text-only. qwen3.6:35b-a3b needs more headroom and is more realistic from 32 GB of unified memory with a limited context window.
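One way to keep the context window limited per request is the `num_ctx` option of Ollama's `/api/generate` endpoint. The sketch below only builds the JSON body. The function name `generate_body`, the prompt and the value 8192 are illustrative assumptions, not recommendations from the model pages.

```sh
# Build a JSON body for Ollama's /api/generate endpoint with an explicit
# context size. No JSON escaping is done, so keep the prompt free of
# quotes and backslashes in this sketch.
generate_body() {
    model=$1; num_ctx=$2; prompt=$3
    printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%s}}\n' \
        "$model" "$prompt" "$num_ctx"
}

generate_body qwen3.6:27b 8192 "Summarize this repository layout"
```

With a local server running on the default port, the body can be piped to the API: `generate_body qwen3.6:27b 8192 "Summarize this repository layout" | curl -s http://localhost:11434/api/generate -d @-`.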
Gemma 4 — Vision and Reasoning across 12B / 26B A4B / 31B
Gemma 4 is Google DeepMind’s fourth Gemma generation and is available in several sizes: E2B, E4B, 12B, 26B A4B (MoE) and 31B (dense).
Setup:
# 26B A4B MoE — good efficiency compromise for more capable Macs
ollama pull gemma4:26b
# 31B Dense — higher Gemma quality, more memory needed
ollama pull gemma4:31b
# E4B: very lightweight, for older Macs and short tasks
ollama pull gemma4:e4b
Benchmarks (31B, Google/Ollama Gemma 4 table for instruction-tuned models):
| Benchmark | Value | Position |
|---|---|---|
| AIME 2026 (no tools) | 89.2 % | Reasoning benchmark without external tools |
| MMLU-Pro | 85.2 % | Knowledge / reasoning benchmark |
| LiveCodeBench v6 | 80.0 % | Live coding tasks |
| Codeforces ELO | 2150 | Competitive programming rating |
| GPQA Diamond | 84.3 % | Domain-specific reasoning |
| MMMU Pro | 76.9 % | Multimodal reasoning performance |
Vendor/model-card values. Do not read this as a direct cross-family ranking: harness, prompting, tool use, thinking mode, shot count and evaluation can differ.
Key features:
- 256K context on 26B A4B and 31B
- Text + Image on all sizes; audio native per Google on E2B, E4B and 12B — check Ollama and client support
- 26B A4B MoE: 25.2B total, 3.8B active per token — more efficient than 31B Dense
- Actively maintained on Ollama
On Mac: gemma4:26b is about 18 GB in Ollama and gemma4:31b about 20 GB. The 26B A4B variant leaves more headroom on a 32 GB Mac; the dense 31B variant is safer with more unified memory or a shorter context.
Llama 4 Scout — 67 GB Specialist Case for Very Large Unified-Memory Setups
Llama 4 Scout is Meta’s 109B MoE model with 17B active parameters. Its Ollama package is about 67 GB, before runtime and context overhead. That makes it unsuitable for 32 GB or 48 GB Macs and a specialist target for very large unified-memory systems or servers.
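The MoE figures quoted in this article can be put side by side as a share of active parameters per token. This is a small arithmetic sketch with a hypothetical helper, `active_pct`, using only the parameter counts stated above. The active share describes compute per token, not memory: the Ollama package sizes listed here cover the full set of weights, which is why Scout is about 67 GB despite 17B active parameters.

```sh
# Share of parameters active per token, as a percentage with one decimal.
active_pct() {
    awk -v active="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * active / total }'
}

active_pct 3 35       # Qwen3.6 35B-A3B  -> 8.6
active_pct 3.8 25.2   # Gemma 4 26B A4B  -> 15.1
active_pct 17 109     # Llama 4 Scout    -> 15.6
```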
Benchmarks: Methodology
- Benchmark harnesses differ: The same benchmark can use different tools, shot counts and configurations, so results are not automatically comparable.
- Thinking vs. non-thinking: Reasoning benchmarks such as AIME are measured in different modes, and tool use changes the results further.
- Benchmark ≠ real-world impression: A model can score high on benchmarks and still be less useful in your specific workflow than a lower-ranked model with better prompt engineering.
Context Windows: Ollama Settings
Ollama sets the default context length based on available unified memory: typically 4K below 24 GiB, 32K between 24 and 48 GiB, and 256K from 48 GiB upward. Larger context windows need significantly more memory, because the KV cache scales with context length, layer count, attention heads and bytes per cached value. On Apple Silicon, unified memory is the relevant pool, but the memory that is actually usable depends on macOS, GPU offload, other apps running in parallel and the model itself.
# Start Ollama with larger context
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
# Check how model, offload and context were loaded
ollama ps
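To see why context length dominates memory use, the following sketch combines the default tiers quoted above with a textbook KV cache estimate for plain full attention. Several things here are assumptions: the helper names `default_ctx` and `kv_cache_mib`, the exact token counts behind 4K, 32K and 256K, and the example architecture (48 layers, 8 KV heads, head dimension 128, 16-bit cache values), which does not describe any specific model in this article. Models with sliding-window attention or a quantized KV cache will need less.

```sh
# Default context length in tokens for a memory size in GiB,
# following the tiers quoted above (4K, 32K, 256K).
default_ctx() {
    gib=$1
    if   [ "$gib" -ge 48 ]; then echo 262144
    elif [ "$gib" -ge 24 ]; then echo 32768
    else                         echo 4096
    fi
}

# KV cache size in MiB for plain full attention:
#   2 (keys and values) * layers * kv_heads * head_dim * bytes_per_value * tokens
kv_cache_mib() {
    layers=$1; kv_heads=$2; head_dim=$3; bytes_per_value=$4; tokens=$5
    per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_value ))
    echo $(( per_token * tokens / 1048576 ))
}

# Hypothetical architecture: 48 layers, 8 KV heads, head_dim 128, 2 bytes per value.
kv_cache_mib 48 8 128 2 "$(default_ctx 16)"   # 4K context   -> 768
kv_cache_mib 48 8 128 2 "$(default_ctx 24)"   # 32K context  -> 6144
kv_cache_mib 48 8 128 2 "$(default_ctx 48)"   # 256K context -> 49152
```

Under these assumptions the cache grows from under 1 GiB at 4K to 48 GiB at 256K, which is why a model that loads comfortably can still run out of memory once the context is raised.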
Quick Start
For Ollama setup on Mac there is a dedicated step-by-step guide. This is the short version for a first test:
# 1. Install Ollama (if not already)
brew install ollama
# 2. Test Qwen3.6 locally
ollama pull qwen3.6:27b
ollama run qwen3.6:27b
# 3. Gemma 4 26B — efficient sweet spot (from 24 GB)
ollama pull gemma4:26b
ollama run gemma4:26b
# 4. Gemma 4 31B: for reasoning-heavy work (testable from 32 GB, more comfortable from 48 GB)
ollama pull gemma4:31b
ollama run gemma4:31b
Choose by Mac configuration and task: gemma4:26b as a cautious all-round test on 24 GB, qwen3.6:35b-a3b for coding agents and longer tasks from 32 GB, and gemma4:e4b for lightweight work on smaller Macs.
Further Reading
On ai-on-mac.com:
- Unified Memory on the Mac explained
- Setting up Ollama on Mac
- LM Studio vs. Ollama for local models
- Apple Intelligence vs. local AI on Mac
- Categories: Local Models and Guides
External primary sources:
- Ollama — Qwen3.6
- Qwen3.6-27B Model Card
- Qwen3.6-35B-A3B Model Card
- Ollama — Gemma 4
- Ollama — Llama 4
- Qwen3 Blog
- Google Gemma 4 Blog
- Google Gemma 4 Model Card
- Google Gemma Releases
- Ollama Context Length Docs
Sources and Date
Verified June 18, 2026. Model sizes and context windows refer to the Ollama tags and official model pages listed at the time of verification. Benchmark values are vendor and model-card claims and are only directly comparable when the same model variant, runtime, harness, tool use, context length and prompting method are used. Apple-Silicon-specific tok/s figures in this article come from community reports and the Ollama / oMLX model page, not from ai-on-mac.com’s own measurements.
Frequently Asked Questions
Which model is best for Mac mini M4 Pro with 24 GB?
For 24 GB, gemma4:26b is the safer first test because the Ollama package is about 18 GB and supports text+image. qwen3.6:27b can also run, but long context and parallel apps leave much less headroom.
Does Llama 4 Scout run on a Mac?
Llama 4 Scout (16x17b) is about 67 GB in Ollama. That is impractical for normal local Mac setups. Very large unified-memory Macs can experiment, but Qwen3.6 and Gemma 4 are the more useful local candidates.
What is Thinking Mode in Qwen3.6?
Qwen3.6 is designed for thinking and agent workflows. For normal use, do not ask for long reasoning traces; ask for a brief rationale and the result.
How much RAM do I need for local open-weight models?
It depends on model and usage: 8 GB works for E2B/E4B variants (e.g. gemma4:e2b), 16 GB for 4B–8B models, 24 GB for 26B models, 32–48 GB for 31B dense models and 64 GB+ for the largest variants. Context windows, KV cache, macOS and other apps add to this.
Are the tok/s values in this article from local measurements?
No. The tok/s figures mentioned come from community reports and the Ollama/oMLX model page, not from ai-on-mac.com’s own measurements.
Does Gemma 4 really have audio support?
Google lists native audio support for E2B, E4B and 12B. The 26B A4B and 31B variants are text+image models. Audio support in a specific Ollama tag and client still needs to be checked separately.