Which model is best for a Mac mini M4 Pro with 24 GB?

For 24 GB, gemma4:26b is the safer first test because the Ollama package is about 18 GB and supports text+image. qwen3.6:27b can also run, but long context and parallel apps leave much less headroom.

Does Llama 4 Scout run on a Mac?

Llama 4 Scout (16x17b) is about 67 GB in Ollama. That is impractical for normal local Mac setups. Macs with very large unified memory can experiment with it, but Qwen3.6 and Gemma 4 are the more useful local candidates.

What is Thinking Mode in Qwen3.6?

Qwen3.6 is designed for thinking and agent workflows. For normal use, do not ask for long reasoning traces; ask for a brief rationale and the result.

How much RAM do I need for local open-weight models?

It depends on model and usage: 8 GB works for E2B/E4B variants (e.g. gemma4:e2b), 16 GB for 4B–8B models, 24 GB for 26B models, 32–48 GB for 31B dense models and 64 GB+ for the largest variants. Context windows, KV cache, macOS and other apps add to this.

Are the tok/s values in this article from local measurements?

No. The tok/s figures mentioned come from community reports and the Ollama/oMLX model page, not from ai-on-mac.com’s own measurements.

Best Local LLMs for Mac (2026): 16 / 24 / 32 / 64 GB Picks

Does Gemma 4 really have audio support?

Google lists native audio support for E2B, E4B and 12B. The 26B A4B and 31B variants are text+image models. Audio support in a specific Ollama tag and client still needs to be checked separately.

Qwen3.6 35B-A3B — about 24 GB Ollama package, 256K context, text+image, positioned for thinking and coding-agent workflows
Gemma 4 31B — about 20 GB Ollama package, 256K context, text+image; vendor reports AIME 2026: 89.2%
Gemma 4 26B A4B — MoE variant with 3.8B active parameters, about 18 GB Ollama package, efficient sweet spot for 24 GB Macs
Llama 4 Scout (16x17B) — about 67 GB Ollama package, text+image, unsuitable for normal local Macs
On a 24 GB Mac: start with gemma4:26b or qwen3.6:27b and limit context; 32 GB+ is more comfortable

What runs on your Mac? RAM Reality Meter for Qwen3.6, Gemma 4 and Llama 4 Scout

Graphic based on current Ollama model pages and official model cards. Sources: Ollama Qwen3.6, Ollama Gemma 4, Ollama Llama 4. Verified June 18, 2026.

The State of Open-Weight LLMs — June 2026

Quick answer: For most Mac users in 2026, Gemma 4 and Qwen3.6 are the relevant open-weight choices. gemma4:26b is the cautious first test for 24 GB Macs, while qwen3.6:27b is interesting for coding and agent workflows but needs more headroom with long context. gemma4:31b and qwen3.6:35b-a3b are better suited to 32 GB, 48 GB or more. Llama 4 Scout is impressive on paper, but its roughly 67 GB Ollama package makes it impractical for normal local Macs.

This is not an objective global leaderboard of every open model. It is a Mac-focused shortlist: what runs locally, what needs too much unified memory, and where benchmark numbers are only vendor/model-card signals.

Three model families are covered here: Qwen3.6 from Alibaba, Gemma 4 from Google and Llama 4 Scout from Meta. Notes on running Qwen3.6 and Gemma 4 locally are kept separate from the vendor information on Llama 4 Scout, because Scout does not fit in 32 GB of unified memory in the quantizations discussed here and is not a model for everyday Mac use.

Model Overview: Tag, Size, Context, License

Model	Ollama Tag	Ollama Size	Context	Input	License
Qwen3.6 27B	`qwen3.6:27b`	~17 GB	256K	Text + Image	Apache 2.0
Qwen3.6 27B MLX	`qwen3.6:27b-mlx`	~20 GB	256K	Text	Apache 2.0
Qwen3.6 35B-A3B	`qwen3.6:35b-a3b`	~24 GB	256K	Text + Image	Apache 2.0
Qwen3.6 35B MLX	`qwen3.6:35b-a3b-mlx`	~22 GB	256K	Text	Apache 2.0
Gemma 4 E2B	`gemma4:e2b`	7.2 GB	128K	Text + Image; audio native per Google, check client*	Apache 2.0
Gemma 4 E4B	`gemma4:e4b`	7.9 GB	128K	Text + Image; audio native per Google, check client*	Apache 2.0
Gemma 4 12B	`gemma4:12b`	7.6 GB	256K	Text + Image; audio native per Google, check client*	Apache 2.0
Gemma 4 26B A4B	`gemma4:26b`	18 GB	256K	Text + Image	Apache 2.0
Gemma 4 31B	`gemma4:31b`	20 GB	256K	Text + Image	Apache 2.0
Llama 4 Scout	`llama4:16x17b`	~67 GB	10M in Ollama	Text + Image	Llama 4 Community

Important: The Ollama size is not the same as total memory use. Context windows, KV cache, macOS, browser, other apps and vision inputs add on top. Larger context windows require significantly more memory.
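
To make the point above concrete, here is a minimal sketch that adds assumed overheads to a package size and compares the sum with unified memory. Only the package sizes come from the overview table; the 1 GB KV-cache figure and the 4 GB reserve for macOS and apps are placeholder assumptions, not measurements, and the function names are hypothetical.

```python
def estimated_total_gb(package_gb, kv_cache_gb=1.0, system_reserve_gb=4.0):
    """Rough total: model package + KV cache + reserve for macOS and apps.

    The two default overheads are illustrative assumptions, not measured values.
    """
    return package_gb + kv_cache_gb + system_reserve_gb


def headroom_gb(unified_memory_gb, package_gb, kv_cache_gb=1.0, system_reserve_gb=4.0):
    """Positive result means spare memory; negative means over budget."""
    total = estimated_total_gb(package_gb, kv_cache_gb, system_reserve_gb)
    return unified_memory_gb - total


# Package sizes in GB, taken from the overview table.
packages = {
    "qwen3.6:27b": 17,
    "gemma4:26b": 18,
    "gemma4:31b": 20,
    "qwen3.6:35b-a3b": 24,
    "llama4:16x17b": 67,
}

for tag, size in packages.items():
    for memory in (24, 32, 64):
        print(f"{tag:16s} on {memory} GB: headroom {headroom_gb(memory, size):+.1f} GB")
```

With these assumptions, gemma4:26b has about 1 GB to spare on a 24 GB Mac, while llama4:16x17b is over budget even on 64 GB. Raising `kv_cache_gb` to model a longer context quickly removes the small margins.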

* Google lists native audio for E2B, E4B and 12B. The 26B A4B and 31B variants are text+image models. Audio support in a specific Ollama tag and client still needs to be checked separately.

What Runs on Your Mac: RAM Recommendations

The tiers below are anchored to your Mac’s unified memory — the shared pool for CPU and GPU on Apple Silicon. The Ollama package size alone is not the total memory cost; context windows, KV cache, macOS and parallel apps all add to it.

Mac Configuration	Realistic Models	Recommendation
8 GB unified memory	gemma4:e2b, gemma4:e4b, smaller Qwen3 models	Light models, short contexts
16 GB unified memory	smaller Qwen3/Gemma models, no huge contexts	Entry-level mid-size models
24 GB unified memory	gemma4:26b, qwen3.6:27b with limited context	`gemma4:26b` can be a useful first test, but context window, vision inputs and parallel apps must be limited. `qwen3.6:27b` can run, but is more sensitive to free unified memory and context.
32 GB unified memory	gemma4:31b, qwen3.6:35b-a3b, qwen3.6:27b-mlx	`gemma4:31b` and `qwen3.6:35b-a3b` are testable, but not automatically comfortable with a large context. For longer agent runs, 48 GB+ is noticeably more relaxed.
48 GB unified memory	+ gemma4:31b with larger context	More relaxed 31B use with a larger context window
64 GB+ unified memory	+ qwen3.6:35b-a3b with context	64 GB+ does not automatically mean Llama 4 Scout. The Ollama package is about 67 GB, plus runtime, KV cache, macOS and apps.

Llama 4 Scout (~67 GB) is unsuitable for normal local Macs — including a Mac Studio M4 Max with 48 GB.
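
The table can also be read as a simple lookup. The sketch below is one hypothetical way to encode it: the tags are copied from the rows above, while the `TIERS` structure and the `candidates` function are illustrative names. The 16 GB row names no specific tags, so it falls back to the 8 GB tier here, and the 48 GB and 64 GB rows add context headroom rather than new tags.

```python
# (minimum unified memory in GB, tags named in that table row)
TIERS = [
    (8, ["gemma4:e2b", "gemma4:e4b"]),
    (24, ["gemma4:26b", "qwen3.6:27b"]),
    (32, ["gemma4:31b", "qwen3.6:35b-a3b", "qwen3.6:27b-mlx"]),
]


def candidates(unified_memory_gb):
    """Return the tags of the highest tier whose minimum fits the given memory."""
    best = []
    for minimum_gb, tags in TIERS:
        if unified_memory_gb >= minimum_gb:
            best = tags
    return best


for memory in (8, 16, 24, 32, 48, 64):
    print(f"{memory} GB: {', '.join(candidates(memory))}")
```

The lookup only names starting points. Context length, vision inputs and parallel apps still decide whether a given tag is comfortable in practice.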

Model fit cards: Qwen3.6, Gemma 4 and Llama 4 Scout — focus, strengths and caveats per model family

Model selection: current Ollama tags · verified June 18, 2026

Qwen3.6 — Coding and Agent Workflows with 27B Dense and 35B-A3B MoE

Qwen3.6 is a current open-weight Qwen generation for local and agentic workflows, available in 27B dense and 35B-A3B MoE variants. The 35B-A3B variant scores high on multiple benchmarks (see caveat below).

Setup:

# 27B — Text + Image, good local starting point
ollama pull qwen3.6:27b

# 35B-A3B — larger quality variant (from 32 GB unified memory).
# Use the explicit A3B tag; `qwen3.6:35b` resolves to the same model.
ollama pull qwen3.6:35b-a3b

# MLX tag — Text-only, not for vision
ollama pull qwen3.6:27b-mlx

# Start
ollama run qwen3.6:35b-a3b

Benchmarks (35B-A3B, per Qwen/Qwen Blog):

Benchmark caveat strip: vendor numbers are not direct model-comparison scores

Benchmark note: The following values come from vendor pages, model cards or Ollama readmes. They are useful signals, but they are not ai-on-mac.com’s own measurements. Harness, tool use, context length, timeout, prompting, thinking mode and shot count can differ heavily between model families.

Benchmark	Value	Position
AIME 2026	92.7 %	High score for an open-weight model
MMLU-Pro	85.2 %	Knowledge / reasoning benchmark
LiveCodeBench v6	80.4 %	Live coding tasks
SWE-bench Verified	73.4 %	Agentic coding with internal scaffold (see caveat)
Terminal-Bench 2.0	51.5 %	Terminal integration, Harbor/Terminus-2 setup

Vendor/model-card values. Do not read this as a direct cross-family ranking: harness, prompting, tool use, thinking mode, shot count and evaluation can differ.

Key features:

Thinking / agent workflows: Qwen3.6 is positioned for longer coding and repository tasks. In normal chat, ask for a brief rationale instead of full reasoning traces.
Agentic Coding: Repository-level understanding, frontend workflows, terminal integration
256K context on 35B-A3B
A3B = “Active 3 Billion” — only 3B parameters activate per token on the 35B MoE variant

On Mac: qwen3.6:27b is the more practical entry point when text and image input are required. The qwen3.6:27b-mlx tag is text-only. qwen3.6:35b-a3b needs more headroom and is more realistic from 32 GB of unified memory with a limited context window.
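
A short calculation shows what the "active parameters" figure means in practice. The sketch uses the parameter counts quoted in this article (nominal 35B total and about 3B active for Qwen3.6 35B-A3B, 25.2B and 3.8B for Gemma 4 26B A4B, 109B and 17B for Llama 4 Scout); `active_share` is a hypothetical helper name.

```python
def active_share(total_billion, active_billion):
    """Fraction of a mixture-of-experts model's parameters used per token."""
    return active_billion / total_billion


# (total parameters, active parameters) in billions, as quoted in this article.
models = {
    "qwen3.6:35b-a3b": (35.0, 3.0),
    "gemma4:26b": (25.2, 3.8),
    "llama4:16x17b": (109.0, 17.0),
}

for tag, (total, active) in models.items():
    share = active_share(total, active)
    print(f"{tag}: {share:.1%} of parameters active per token")
```

A low active share reduces compute per token, but the full set of expert weights still has to be available in memory. That is why the package sizes track total parameters, not active ones.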

Gemma 4 — Vision and Reasoning across 12B / 26B A4B / 31B

Gemma 4 is Google DeepMind’s fourth Gemma generation and is available in several sizes: E2B, E4B, 12B, 26B A4B (MoE) and 31B (Dense).

Setup:

# 26B A4B MoE — good efficiency compromise for more capable Macs
ollama pull gemma4:26b

# 31B Dense — higher Gemma quality, more memory needed
ollama pull gemma4:31b

# E4B — very lightweight, for older Macs and short tasks
ollama pull gemma4:e4b

Benchmarks (31B, Google/Ollama Gemma 4 table for instruction-tuned models):

Benchmark	Value	Position
AIME 2026 (no tools)	89.2 %	Reasoning benchmark without external tools
MMLU-Pro	85.2 %	Knowledge / reasoning benchmark
LiveCodeBench v6	80.0 %	Live coding tasks
Codeforces ELO	2150	Competitive programming rating
GPQA Diamond	84.3 %	Domain-specific reasoning
MMMU Pro	76.9 %	Multimodal reasoning performance

Vendor/model-card values. Do not read this as a direct cross-family ranking: harness, prompting, tool use, thinking mode, shot count and evaluation can differ.

Key features:

256K context on 26B A4B and 31B
Text + Image on all sizes; audio native per Google on E2B, E4B and 12B — check Ollama and client support
26B A4B MoE: 25.2B total, 3.8B active per token — more efficient than 31B Dense
Actively maintained on Ollama

On Mac: gemma4:26b is about 18 GB in Ollama and gemma4:31b about 20 GB. The 26B A4B variant leaves more headroom on a 32 GB Mac; the dense 31B variant is safer with more unified memory or a shorter context.

Llama 4 Scout — 67 GB Specialist Case for Very Large Unified-Memory Setups

Llama 4 Scout is Meta’s 109B MoE model with 17B active parameters. Its Ollama package is about 67 GB, before runtime and context overhead. That makes it unsuitable for 32 GB or 48 GB Macs and a specialist target for very large unified-memory systems or servers.

Benchmarks: Methodology

Benchmark harnesses differ: The same benchmark can use different tools, shot counts and configurations, so results are not automatically comparable.
Thinking vs. Non-Thinking: Reasoning benchmarks such as AIME are measured in different modes, and tool use can shift the results further.
Benchmark ≠ real-world impression: A model can score high on benchmarks and still be less useful in your specific workflow than a lower-ranked model with better prompt engineering.

Context Windows: Ollama Settings

Memory stack: what really fills your unified memory — example 24 GB Mac, qwen3.6:27b, 32K context

Ollama sets the default context length based on available unified memory: typically 4K below 24 GiB, 32K between 24 and 48 GiB, and 256K from 48 GiB upward. Larger context windows need significantly more memory: the KV cache grows linearly with the number of tokens, scaled by the layer count, the number of attention heads and the bytes stored per value. On Apple Silicon, unified memory is the relevant pool, but the memory that is actually usable depends on macOS, GPU offload, other apps running in parallel and the model itself.

# Start Ollama with larger context
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

# Check how model, offload and context were loaded
ollama ps
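
The default tiers and the KV-cache growth described in this section can be sketched in a few lines. The tier boundaries are the ones stated above. The cache formula is the standard estimate of keys plus values per layer and token. The architecture numbers in the loop (48 layers, 8 KV heads, head dimension 128, 2 bytes per value) are hypothetical and chosen only to show the scaling; they are not the real configuration of any model in this article.

```python
def default_context_tokens(memory_gib):
    """Default context tiers as described in this section (4K / 32K / 256K)."""
    if memory_gib < 24:
        return 4 * 1024
    if memory_gib < 48:
        return 32 * 1024
    return 256 * 1024


def kv_cache_gib(context_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size: keys and values for every layer and token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1024**3


# Hypothetical architecture values, used only to illustrate linear growth.
for memory in (16, 24, 48):
    ctx = default_context_tokens(memory)
    size = kv_cache_gib(ctx, layers=48, kv_heads=8, head_dim=128)
    print(f"{memory} GiB -> {ctx} tokens -> about {size:.2f} GiB KV cache")
```

With these example values the cache grows from 0.75 GiB at 4K tokens to 6 GiB at 32K and 48 GiB at 256K. Models that use sliding-window or other cache-saving attention need less than this formula suggests, so treat it as an upper-bound style estimate and confirm the real footprint with `ollama ps`.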

Quick Start

For Ollama setup on Mac there is a dedicated step-by-step guide. This is the short version for a first test:

# 1. Install Ollama (if not already)
brew install ollama

# 2. Test Qwen3.6 locally
ollama pull qwen3.6:27b
ollama run qwen3.6:27b

# 3. Gemma 4 26B — efficient sweet spot (from 24 GB)
ollama pull gemma4:26b
ollama run gemma4:26b

# 4. Gemma 4 31B — reasoning-heavy work (testable from 32 GB, more relaxed from 48 GB)
ollama pull gemma4:31b
ollama run gemma4:31b

Choose by Mac configuration and task: gemma4:26b as a cautious all-round test on 24 GB, qwen3.6:35b-a3b for coding agents and longer tasks from 32 GB, and gemma4:e4b for lightweight work on smaller Macs.
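
Once a model is pulled, it can also be called from a script with an explicit context limit per request. The sketch below assumes a local Ollama server on its default port and uses the `/api/generate` endpoint with the `num_ctx` option. The helper names, the default of 8192 tokens and the example prompt are illustrative choices, not recommendations from the model pages.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_ctx=8192):
    """Build a non-streaming generate request with an explicit context limit."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_ctx=8192):
    """Send the request to the running server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as response:
        return json.loads(response.read())["response"]


# Example call (requires `ollama serve` and a pulled model):
# print(generate("gemma4:26b", "Give a brief rationale and the result: 17 * 23"))
```

Keeping `num_ctx` small on a 24 GB Mac limits the KV cache for that request, which fits the advice above to cap context when headroom is tight.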

Sources and Date

Verified June 18, 2026. Model sizes and context windows refer to the Ollama tags and official model pages listed at the time of verification. Benchmark values are vendor and model-card claims and are only directly comparable when the same model variant, runtime, harness, tool use, context length and prompting method are used. Apple-Silicon-specific tok/s figures in this article come from community reports and the Ollama / oMLX model page, not from ai-on-mac.com’s own measurements.

