Last week, Apple announced the M5 Pro and M5 Max. The spec sheet was impressive — up to 4x faster LLM prompt processing, Neural Accelerators in every GPU core, 128GB of unified memory at 614 GB/s bandwidth. But the most telling detail was in the marketing. Apple showed LM Studio — a local LLM interface — running on stage. The press materials said "run advanced LLMs on device" and "train custom models locally."

A year ago, Apple's AI messaging was about Siri and photo features. Now they're talking about token generation speeds. That shift tells you everything about where this is heading.

The Pattern

Every transformative technology follows the same arc. First, the race pushes upward — bigger, faster, more powerful, more expensive. Then the race reverses. The question stops being "how powerful can we make this?" and becomes "how small can we make it while keeping the capability?"

Computers filled rooms before they fit in pockets. Digital cameras cost thousands before the best one most people would ever own was built into their phone. GPS was military satellite infrastructure before it became a chip in every device you carry.

The race to the most advanced version is never the end of the story. It's always the setup for the real transformation: making it run everywhere.

AI is entering that second phase. The frontier models have proven what's possible. Now the hardware manufacturers, chip designers, and open-source community are all working on the same problem: getting that capability onto the device in front of you.

It's Already Happening

The M5 generation is the clearest evidence yet. The M5 Max runs a quantised 70B model entirely on a laptop with room to spare. Early benchmarks show 65–88 tokens per second on a 120B model. That's not a demo — that's usable interactive inference.
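A back-of-the-envelope calculation shows why a quantised 70B model fits with room to spare. The sketch below counts weights only and assumes 1 GB = 10^9 bytes. It ignores the KV cache, activations, and the small overhead of quantisation scales, so real memory use will be somewhat higher. The function name is made up for illustration.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


UNIFIED_MEMORY_GB = 128  # the unified memory figure quoted for the M5 Max

for params in (70, 120):
    for bits in (16, 4):
        size = weight_footprint_gb(params, bits)
        verdict = "fits" if size < UNIFIED_MEMORY_GB else "does not fit"
        print(f"{params}B at {bits}-bit: ~{size:.0f} GB of weights "
              f"({verdict} in {UNIFIED_MEMORY_GB} GB)")
```

Under those assumptions a 70B model needs roughly 35 GB of weights at 4-bit versus 140 GB at 16-bit. That is the difference between fitting comfortably in 128 GB and not fitting at all.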

More importantly, Apple embedded Neural Accelerators into every GPU core, which means AI compute now scales with core count rather than hitting a fixed ceiling. Every future chip gets proportionally better at inference. That's a structural change, not an incremental one.

On the model side, quantisation shrinks weights to 4-bit precision with minimal quality loss. Mixture-of-Experts architectures activate only a fraction of their parameters per token, which cuts the compute and memory traffic each token needs. Distillation keeps raising the quality of small models. Apple's own MLX framework is reported to run 20–30% faster than llama.cpp on Apple Silicon.
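For a feel of what 4-bit quantisation does, here is a deliberately simplified sketch: one group of weights, one shared scale, signed integer codes in the range -7 to 7. Production schemes (the ones used by MLX or llama.cpp, for instance) are more elaborate, so treat this as an illustration of the idea rather than any library's actual method. The weight values and function names are invented.

```python
def quantise_4bit(weights):
    """Map floats to signed integer codes in [-7, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest code, then clamp to guard against float edge cases.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


group = [0.12, -0.53, 0.90, -0.07, 0.33, -0.88, 0.45, 0.02]  # made-up weights
codes, scale = quantise_4bit(group)
restored = dequantise(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(group, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, worst round-trip error: {worst_error:.4f}")
```

Each weight is now stored as a 4-bit code plus a share of one scale factor, and the round-trip error per weight is at most half the scale. Whether that error matters for model quality is an empirical question this toy example cannot answer.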

The M5 Ultra, expected later this year, will likely bring 256GB of unified memory with bandwidth exceeding 1 TB/s. Models that currently need multi-GPU datacentre setups would then run silently on a desktop. What runs on a Mac Studio today will run on a MacBook in two years, and on a phone in five.
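Bandwidth matters so much because token generation is usually limited by how fast the weights can be read from memory. A rough ceiling, assuming every active weight is read once per generated token and nothing else costs time, is bandwidth divided by the bytes of active weights. The sketch below applies that rule of thumb. The 5B-active Mixture-of-Experts model is a hypothetical example, 1000 GB/s stands in for the speculated "exceeding 1 TB/s", and real throughput will land below these ceilings.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                active_params_billion: float,
                                bits_per_weight: float) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


scenarios = [
    # (label, bandwidth in GB/s, active parameters in billions, bits per weight)
    ("dense 70B, 4-bit, 614 GB/s", 614, 70, 4),
    ("dense 70B, 4-bit, 1000 GB/s (speculative)", 1000, 70, 4),
    ("hypothetical MoE with 5B active, 4-bit, 614 GB/s", 614, 5, 4),
]

for label, bandwidth, active, bits in scenarios:
    ceiling = decode_ceiling_tokens_per_s(bandwidth, active, bits)
    print(f"{label}: at most ~{ceiling:.0f} tokens/s")
```

On this estimate a dense 70B model at 4-bit tops out at roughly 17 tokens per second on 614 GB/s. Throughput in the high tens of tokens per second on a 120B model is therefore only plausible if few parameters are active per token, as in a Mixture-of-Experts design.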

Why It Matters

Cloud AI has friction. Subscriptions, API keys, rate limits, latency, content policies, someone else's terms of service. None are deal-breakers individually, but collectively they create a gate. Local AI removes nearly all of it.

And then there's data sovereignty. Every prompt you send to a cloud model travels through infrastructure you don't control. Your ideas, your code, your business logic, your customer data: all of it is processed by companies whose incentives may not align with yours. Under GDPR and similar frameworks, the compliance burden of doing that only grows as AI embeds deeper into workflows.

Local models sidestep this entirely. Your data stays on your machine. For regulated industries — healthcare, finance, legal — this isn't a nice-to-have. It's a prerequisite. But even beyond compliance, the principle is simple: the tools you use to think and build shouldn't require you to hand over the raw material of your work.

Good Enough Changes Everything

Local models won't match the absolute frontier any time soon. But they'll be good enough for the vast majority of practical work — writing code, analysing data, brainstorming, prototyping, building. The iPhone camera wasn't as good as a DSLR. It didn't need to be. It was good enough, and it was always in your pocket.

When powerful AI runs locally with next to no friction and no per-use cost, the barrier to building drops to near zero. Multiply that by every person with a laptop and an idea, in every country, in every field, and the scale of what gets made is staggering.

The race to the biggest AI was always just the first half of the story. The second half — the one that changes the world — is the race to make it run everywhere. That race is already underway, and the M5 shows it's moving faster than anyone expected.