Hardware 7 min read

Unified Memory: Why Local LLMs Work on Mac

Unified Memory explained: why Apple Silicon helps local LLMs, where memory bandwidth matters, and when Mac mini M4, M4 Pro or cloud makes sense.

Technical research and editorial review. Original measurements are explicitly identified in the article.

Published: May 8, 2026 Updated: May 29, 2026


If you run local AI on a Mac with Ollama, LM Studio, or MLX, Apple Silicon can work well for small and mid-size models. The reason is not only the GPU. The memory architecture matters: the CPU and integrated GPU access the same unified memory pool. I discovered this firsthand when I started running local LLMs on my Mac mini M4: the performance difference compared to my old Intel Mac was night and day.

Why does this matter more for LLM inference than it first appears?

Unified Memory on Apple Silicon: conventional PC with CPU RAM/GPU VRAM compared with a shared memory pool

Original diagram based on Apple’s M1/Mac mini specifications and the MLX documentation on unified memory. Sources: Apple M1 Newsroom, Apple Mac mini Tech Specs, MLX Unified Memory. Checked May 27, 2026.

The Architecture — Not What You’d Expect

On a conventional PC with a discrete GPU, the CPU sits on the motherboard and the GPU on a separate add-in card. Each has its own memory: the system uses DDR5 modules, and the GPU has dedicated VRAM (GDDR6X on an Nvidia RTX 4090, for example). The two memory pools are separate, and that has consequences.

When model weights fit fully in VRAM, they stay there, and PCIe transfers matter mainly during loading, CPU offloading and other data exchange. If VRAM is insufficient, those transfers can become a bottleneck.

Apple Silicon breaks with this architecture. In a Mac mini with M4 or M4 Pro, CPU, integrated GPU and Neural Engine sit inside the same system-on-a-chip and use the same unified memory pool.

For the integrated GPU, there is no separate VRAM pool that has to be synchronized with CPU RAM over PCIe. Weights, context and KV cache live in one shared memory space. That reduces copies and makes CPU/GPU handoff simpler.

Why This Makes a Massive Difference for LLMs

Transformer models work layer by layer. Each layer reads its own attention and feed-forward weights. Autoregressive generation repeats this pass over the weights for every new token, so at small batch sizes token generation is often limited by memory bandwidth rather than raw compute.

On conventional architecture, this is what happens:

  1. The weight tensor lives in GPU VRAM
  2. For every attention step, the GPU reads the values from VRAM, which is fast
  3. But if the CPU needs to contribute, data must be copied over PCIe
  4. If the model does not fit in VRAM, some layers can be offloaded to the CPU at the cost of speed; without offloading, the allocation fails with an out-of-memory (OOM) error

On Apple Silicon, this is what happens:

  1. The weight tensor lives in Unified Memory
  2. The integrated GPU can directly access the same memory pool
  3. The CPU works on the same memory
  4. Apple lists up to 273 GB/s memory bandwidth for the Mac mini M4 Pro
  5. Less copying between CPU and GPU work

There is a second effect: zero-copy handoff. In a conventional setup, when the CPU prepares a prompt and the GPU then takes over inference, the preprocessed tensor has to be copied into VRAM. On Apple Silicon, CPU and GPU can work on the same memory region, so a runtime built for unified memory, such as MLX, can skip that copy. Latency between preprocessing and inference drops.

For LLMs this is especially relevant because prompt processing, sampling, runtime logic and matrix math stress different parts of the system. Unified Memory makes CPU/GPU handoff easier, but it does not replace enough memory capacity, the right quantization or a fast runtime.

The Numbers — Mac mini M4 Compared

Apple offers several relevant chip classes for local LLM inference. For the Mac mini, M4 and M4 Pro are the relevant ones; M4 Max and M3 Ultra belong to other Mac classes:

| Chip class | Typical Mac class | Max unified memory | Memory bandwidth | Local LLM relevance |
| --- | --- | --- | --- | --- |
| M4 | Mac mini, iMac, MacBook Pro | up to 32 GB in Mac mini | ~120 GB/s | Very good for small to medium models |
| M4 Pro | Mac mini, MacBook Pro | up to 64 GB in Mac mini | 273 GB/s | More headroom in the Mac mini for local AI |
| M4 Max | MacBook Pro, Mac Studio | up to 128 GB | ~546 GB/s | Larger local models, but not a Mac mini |
| M3 Ultra | Mac Studio | up to 512 GB | ~800 GB/s | Very large models/workstations, different class |

Why does memory bandwidth matter so much? Many LLM inference runs are heavily memory-bandwidth-bound: the GPU constantly reads weights and KV cache. More bandwidth helps, but it does not replace enough memory capacity or the right quantization.
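A rough way to see the bandwidth limit is to divide memory bandwidth by the bytes that have to be read per generated token. The sketch below assumes that every weight is read once per token and ignores KV cache reads, compute time and runtime overhead, so the result is a ceiling, not a prediction. The 4.5 GB model size is an illustrative assumption for a 7B/8B model in Q4; the bandwidth values are the M4 and M4 Pro figures from the table.

```python
# Upper bound on decode speed if generation is purely memory-bandwidth-bound.
# Assumption: all model weights are read exactly once per generated token.


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Return the theoretical ceiling in tokens/s (not a measured value)."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 4.5  # hypothetical: roughly a 7B/8B model in Q4

# Nominal memory bandwidth per chip, as listed in the article's table.
BANDWIDTH_GB_S = {"M4": 120.0, "M4 Pro": 273.0}

for chip, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{chip}: at most ~{ceiling:.0f} tok/s for a {MODEL_SIZE_GB} GB model")
```

Real throughput lands below this ceiling, and the gap depends on runtime, context length and quantization format. The estimate is mainly useful for comparing chips or model sizes against each other.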

For comparison: an Nvidia RTX 4090 has about 1 TB/s of memory bandwidth, far more than the M4 Pro. But that figure applies only to the GPU's own GDDR6X VRAM. As soon as data has to be copied between CPU RAM (DDR5) and GPU VRAM, PCIe 4.0 x16 limits throughput to ~32 GB/s per direction. That is roughly 4 times lower than the M4's 120 GB/s and about 8.5 times lower than the M4 Pro's 273 GB/s.
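The comparison between the PCIe link and unified memory bandwidth can be checked with simple arithmetic. The sketch below uses the nominal figures quoted in this article (about 32 GB/s for PCIe 4.0 x16, 120 GB/s for M4, 273 GB/s for M4 Pro) and a hypothetical 4.5 GB payload. Real transfers rarely reach nominal rates, so treat the output as orientation only.

```python
# Compare nominal bandwidths and the time to move a payload across PCIe once.
# All figures are nominal peak values, not measurements.

PCIE4_X16_GB_S = 32.0  # approximate one-direction peak for PCIe 4.0 x16

UNIFIED_MEMORY_GB_S = {"M4": 120.0, "M4 Pro": 273.0}


def transfer_seconds(payload_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move payload_gb once at the given bandwidth."""
    return payload_gb / bandwidth_gb_s


def ratio_to_pcie(bandwidth_gb_s: float) -> float:
    """How many times higher a bandwidth is than the PCIe 4.0 x16 figure."""
    return bandwidth_gb_s / PCIE4_X16_GB_S


PAYLOAD_GB = 4.5  # hypothetical: data that has to cross the bus once

copy_ms = transfer_seconds(PAYLOAD_GB, PCIE4_X16_GB_S) * 1000
print(f"PCIe 4.0 x16: ~{copy_ms:.0f} ms to copy {PAYLOAD_GB} GB once")

for chip, bandwidth in UNIFIED_MEMORY_GB_S.items():
    print(f"{chip}: {ratio_to_pcie(bandwidth):.1f}x the PCIe 4.0 x16 figure")
```

The copy cost only applies when data actually crosses the bus. A model that sits fully in VRAM pays it once at load time, which is why the PCIe figure matters most for offloading setups.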

The relevant number for LLM inference is therefore not just absolute bandwidth, but also copy overhead between compute units. This is where Apple Silicon has a structural advantage that shows up in many local workflows. That does not mean a Mac is automatically faster than a strong discrete GPU; it means Apple’s memory architecture fits local integrated CPU/GPU workflows unusually well.

Practical Recommendation — 16, 24, or 32 GB?

The choice of Mac mini depends directly on memory. Here’s the honest assessment:

16 GB — Small models and short contexts

16 GB works for Whisper, small 1B to 4B models, short chats and simple automation. 7B/8B models can work depending on quantization, but context, macOS and parallel apps leave little reserve. For regular local LLM use, 16 GB is the lower edge.

24 GB — Better entry point

24 GB is a good entry point for serious LLM use. 7B/8B models in Q4 usually fit well, with moderate context and normal desktop apps. The 24 GB Mac mini M4 Pro is interesting when you also need the Pro chip’s CPU/GPU/I/O; for pure LLM use, a standard M4 with 32 GB can be more sensible.

32+ GB — More context and larger models

32 GB gives more context headroom and larger quantizations than 24 GB. A Mac mini M4 Pro with 48 or 64 GB is the more sensible choice for 30B/32B models, vision models, RAG or multiple parallel workflows. If you want to use much larger models such as 70B comfortably, you are looking at a Mac Studio, a MacBook Pro with M4 Max or cloud GPUs, not a standard Mac mini.
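A quick capacity check can make these tiers more concrete. The sketch below estimates weight size from parameter count and bits per weight, adds a KV cache estimate, and compares the total with a memory budget. Every number in it is an assumption for illustration: the layer count, KV head count and head dimension describe a hypothetical 8B-class model with grouped-query attention, 4.5 bits per weight approximates a Q4 format including overhead, and the 6 GB reserve for macOS and other apps is a guess, not an Apple figure.

```python
# Rough memory-fit check for a local LLM. All model and system numbers
# below are illustrative assumptions, not vendor specifications.

GIB = 1024 ** 3


def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys and values (factor 2) for every layer and every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def fits(total_memory_gib: float, reserve_gib: float, needed_bytes: float) -> bool:
    """True if the model fits after reserving memory for the OS and apps."""
    return needed_bytes <= (total_memory_gib - reserve_gib) * GIB


# Hypothetical 8B-class model in a Q4-style format with an 8K context.
weights = weight_bytes(params_billion=8.0, bits_per_weight=4.5)
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
needed = weights + kv

print(f"weights: {weights / GIB:.1f} GiB, KV cache: {kv / GIB:.1f} GiB")
for memory_gib in (16, 24, 32):
    ok = fits(memory_gib, reserve_gib=6.0, needed_bytes=needed)
    print(f"{memory_gib} GB Mac mini: {'fits' if ok else 'too tight'}")
```

A result of "fits" only says the allocation is plausible. It says nothing about speed, and longer contexts or larger quantizations shift the balance quickly, so rerun the check with the values of the model you actually plan to use.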

Conclusion — Why Unified Memory Is More Than a Marketing Term

Unified Memory is not an optimization at the margins. It is a fundamental architectural decision that goes to the heart of LLM inference: every layer has to read its weight tensors for each new token, and every copy operation across an external bus adds latency.

The M4 Pro offers 273 GB/s of memory bandwidth and up to 64 GB of unified memory in the Mac mini, which leaves ample room for 7B and 8B models in Q4. Typical Ollama and MLX LLM workflows use CPU and GPU; Neural Engine use should not be claimed without runtime-specific evidence.

Anyone running a Mac mini M4 Pro as a local AI machine can get quiet inference without ongoing token fees or cloud upload. Unified memory helps, but it does not replace a large Nvidia GPU, a vLLM cluster or cloud GPUs for very large models.

Sources and Status

Status: checked on May 27, 2026. Mac mini M4 and M4 Pro data comes from Apple’s technical specifications. The Unified Memory explanation is based on Apple’s M1 introduction and the MLX documentation. Bandwidth and PCIe comparisons are technical orientation points; actual tok/s still depends on model, quantization, context length, runtime and free memory.

Frequently Asked Questions

Why is Unified Memory better for LLMs than conventional RAM?

On a conventional PC, CPU RAM and GPU VRAM are separate. When data moves between CPU and GPU, copies add latency. On Apple Silicon, the CPU and integrated GPU share one memory pool, which helps local LLMs when model execution switches between CPU and GPU work.

How much RAM do I need for local LLMs on a Mac mini?

16 GB is enough for small models and short contexts. 24 GB is a better entry point for 7B/8B models. 32 GB gives the standard M4 more headroom, while 48/64 GB on M4 Pro is much more comfortable for larger models, vision, RAG and parallel workflows.

What does Unified Memory mean in tok/s numbers?

Unified Memory mainly improves memory access and avoids copies between CPU RAM and GPU VRAM. Actual tok/s still depends heavily on model, quantization, runtime, context length and free memory.

Is the M4 Max fundamentally different from the M4 Pro for LLM inference?

Yes, but not inside the Mac mini. The Mac mini tops out at M4 Pro with up to 64 GB of unified memory and 273 GB/s memory bandwidth. M4 Max and M3 Ultra belong to MacBook Pro or Mac Studio classes and matter for much larger local models.