Can Gemini 3.5 Flash Run Locally on Mac? Ollama, MLX & Pricing
Can Gemini 3.5 Flash run in Ollama or MLX on a Mac? No. See the API setup, 1M context, privacy and current pricing.
What Google announced at I/O 2026
Google I/O 2026 took place on May 19, 2026. Google framed the event as a move into a more agentic Gemini era: models that do not just answer prompts, but can use tools, execute code, handle longer tasks and coordinate agents.
The central model story was Gemini 3.5. Gemini 3.5 Flash is the first stable model in that family. Google positions it for agentic coding, long-horizon tasks, tool use and real-world workflows.
According to Google, Gemini 3.5 Flash outperforms Gemini 3.1 Pro on several coding and agentic benchmarks, including Terminal-Bench 2.1 at 76.2%, GDPval-AA at 1656 Elo, MCP Atlas at 83.6%, and CharXiv Reasoning at 84.2%. Google also says its output speed in tokens per second is 4x that of other frontier models, citing Artificial Analysis.
What Gemini 3.5 Flash is
Gemini 3.5 Flash is a stable Gemini API model optimized for speed, agentic execution, coding and long-horizon workflows. It accepts text, image, video, audio and PDF inputs, and outputs text.
It is useful when you need a cloud model with a large context window, multimodal input, tool calling, code execution, File Search, Search/Maps grounding or Managed Agents.
It is not a local Mac model. Apple Silicon does not accelerate Gemini 3.5 Flash itself because inference runs on Google’s infrastructure.
Model ID and specs
| Item | Gemini 3.5 Flash |
|---|---|
| Official model ID | gemini-3.5-flash |
| Status | Stable / generally available |
| Input token limit | 1,048,576 tokens |
| Output token limit | 65,536 tokens |
| Knowledge cutoff | January 2025, according to the model page |
| Supported inputs | Text, image, video, audio, PDF |
| Supported output | Text |
| Supported capabilities | Batch API, caching, code execution, File Search, Flex, function calling, Maps/Search grounding, Priority, structured outputs, thinking, URL context |
| Not supported | Audio generation, computer use, image generation, Live API |
If a third-party router shows something like google/gemini-3.5-flash, treat that as provider-specific naming. In the official Gemini API, use gemini-3.5-flash.
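If your code receives model names from several providers, a small normalizer can keep the official ID in one place. This is a minimal sketch: the function name is hypothetical, and it assumes router aliases follow a `<provider>/<model-id>` pattern, which may not hold for every router.

```python
def to_official_model_id(name: str) -> str:
    """Return the part after the last '/' so provider prefixes are dropped."""
    # Assumption: aliases look like "<provider>/<model-id>".
    return name.rsplit("/", 1)[-1].strip()


print(to_official_model_id("google/gemini-3.5-flash"))  # gemini-3.5-flash
```

Names without a prefix pass through unchanged, so the helper is safe to call on every model string.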
Gemini 3.5 Flash is not a local Mac model
For Mac users, this distinction matters more than the benchmark headline.
| Workflow | Where inference runs | Best for | Main trade-off |
|---|---|---|---|
| Ollama / LM Studio / MLX | Your Mac | Private files, offline work, open-weight experiments, no token costs | Limited by RAM, model size and local speed |
| Gemini 3.5 Flash API | Google cloud | 1M context, multimodal input, tools, code execution, grounding, cloud agents | Data leaves device, API costs, network dependency |
Apple Silicon accelerates local inference. For Gemini API usage, the important Mac factors are internet reliability, API key handling, cost controls, data-flow decisions and developer tooling.
API setup on Mac
Use Google AI Studio to create a Gemini API key. Store it as GEMINI_API_KEY.
export GEMINI_API_KEY="your-api-key-here"
Do not expose API keys in browser code, public GitHub repositories or static Astro pages. Use a backend, serverless function, edge function or secure secret management. I store my Gemini API key in a .env file and load it via environment variables, which is simple and safe enough for local development.
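A script can fail early with a clear message when the key is missing, instead of failing later inside an API call. The sketch below uses only the standard library. The helper names are hypothetical, and the masking rule (show the last four characters) is just one possible convention for log output.

```python
import os


def load_gemini_api_key(env=None) -> str:
    """Read GEMINI_API_KEY from the environment and fail early if it is missing."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it or load it from a .env file."
        )
    return key


def mask_key(key: str) -> str:
    """Return a log-safe form that shows only the last four characters."""
    return "*" * max(len(key) - 4, 0) + key[-4:]
```

Passing a plain dictionary as `env` makes the loader easy to test without touching the real environment.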
Python
from google import genai
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Explain unified memory on Apple Silicon in three sentences.",
)
print(response.text)
Python with thinking level
from google import genai
from google.genai import types
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Analyze the trade-offs between local AI and cloud AI on Mac.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high")
    ),
)
print(response.text)
JavaScript
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({});
async function main() {
  const response = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: "Create a checklist for a hybrid local/cloud AI workflow on Mac."
  });
  console.log(response.text);
}
main();
REST
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.5-flash:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "contents": [{
      "parts": [{
        "text": "Summarize the differences between local AI on Mac and Gemini 3.5 Flash."
      }]
    }]
  }'
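The same REST call can be assembled in Python without the SDK, which helps when you want to inspect exactly what leaves the Mac. This sketch mirrors the curl request above using only the standard library. The function name is hypothetical, and it only builds the request so the URL, headers and body can be checked before anything is sent.

```python
import json
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_generate_content_request(
    api_key: str, prompt: str, model: str = "gemini-3.5-flash"
) -> urllib.request.Request:
    """Build (but do not send) a generateContent POST request."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url=f"{API_ROOT}/models/{model}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={"x-goog-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and parse the JSON response. Error handling and retries are left out here.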
GenerateContent or Interactions API?
The examples above use generateContent because it is the simplest way to show a first working request. Google now recommends the Interactions API beta for new projects, especially if you are building agentic workflows, long-running/background tasks or server-side multi-turn conversations.
Use this split:
- Use generateContent for simple prompts, existing integrations, quick tests and cases where you need features that are not yet available in Interactions API, such as Batch API or explicit caching.
- Use the Interactions API for new agentic apps, server-side history management, observable execution steps, tool orchestration, background tasks and future Gemini capabilities.
Important privacy/cost detail: Interactions API stores interaction objects by default (store=true) for server-side state, background execution and observability. Google lists a 55-day retention period for Paid Tier interactions and 1 day for Free Tier interactions. You can opt out with store=false, but that disables stored-state workflows such as previous_interaction_id and is incompatible with background execution.
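The store=false constraints are easy to violate by accident once options are set in different parts of an app. The sketch below is a local pre-flight check and does not call the Gemini API. The function name is hypothetical, `store` and `previous_interaction_id` come from the text above, and modelling background execution as a boolean `background` flag is an assumption.

```python
from typing import Optional


def check_interaction_options(
    store: bool = True,
    previous_interaction_id: Optional[str] = None,
    background: bool = False,
) -> list[str]:
    """Return a list of conflicts; an empty list means the options are consistent."""
    problems = []
    if not store and previous_interaction_id is not None:
        problems.append("store=false cannot be combined with previous_interaction_id")
    if not store and background:
        problems.append("store=false cannot be combined with background execution")
    return problems


print(check_interaction_options(store=False, background=True))
```

A stateless request with store=false and no other options passes the check, which matches the opt-out case described above.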
Thinking levels and thought preservation
Gemini 3.5 Flash supports thinking. For Gemini 3.5 Flash, use thinking_level, not the older thinkingBudget pattern used by some Gemini 2.5 examples.
Recommended levels:
| Thinking level | Use it for |
|---|---|
| minimal | Fast chat-like answers, simple facts, simple tool calls |
| low | Lower-latency coding, analysis and writing tasks |
| medium | Default for Gemini 3.5 Flash; best general setting for most tasks |
| high | Hard coding, debugging, math, multi-step agents and complex tool use |
Thought preservation is handled through thought signatures and conversation history. The GenAI SDK manages this automatically in normal usage. Preserved thoughts can increase input token usage across turns because signatures are sent back with the conversation. Do not expect raw chain-of-thought access. If the API or SDK exposes thought summaries, treat them as controlled summaries, not the full internal reasoning trace.
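One way to apply the table is to choose the level per task type instead of hard-coding "high" everywhere. The mapping below is a sketch: the task category names are hypothetical labels for the use cases in the table, and unknown tasks fall back to medium because the table lists it as the default.

```python
THINKING_LEVELS = ("minimal", "low", "medium", "high")

# Hypothetical task labels mapped to the levels recommended in the table.
TASK_TO_LEVEL = {
    "chat": "minimal",
    "simple_tool_call": "minimal",
    "writing": "low",
    "analysis": "low",
    "debugging": "high",
    "math": "high",
    "multi_step_agent": "high",
}


def pick_thinking_level(task: str) -> str:
    """Return the thinking level for a task label, defaulting to 'medium'."""
    return TASK_TO_LEVEL.get(task, "medium")


print(pick_thinking_level("debugging"))  # high
```

The returned string can be passed as `thinking_level` in the `ThinkingConfig` shown in the Python example earlier in this article.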
Pricing: Standard, Batch, Flex, Priority and grounding
Prices below are Google Gemini API paid-tier prices checked on May 27, 2026. Output prices include thinking tokens.
| Mode | Input | Output incl. thinking | Context caching | Storage | Grounding |
|---|---|---|---|---|---|
| Standard | $1.50 / 1M tokens | $9.00 / 1M tokens | $0.15 / 1M tokens | $1.00 / 1M tokens/hour | 5,000 prompts/month free, then $14 / 1,000 search queries |
| Batch | $0.75 / 1M tokens | $4.50 / 1M tokens | $0.075 / 1M tokens | $1.00 / 1M tokens/hour | 5,000 requests/month free, then $14 / 1,000 search queries |
| Flex | $0.75 / 1M tokens | $4.50 / 1M tokens | $0.08 / 1M tokens | $1.00 / 1M tokens/hour | 5,000 requests/month free, then $14 / 1,000 search queries |
| Priority | $2.70 / 1M tokens | $16.20 / 1M tokens | $0.27 / 1M tokens | $1.00 / 1M tokens/hour | 5,000 prompts/month free, then $14 / 1,000 search queries |
Batch is for asynchronous or batch workloads, not “offline” local processing. Flex is better described as a mode for cost-sensitive workloads that can tolerate flexible serving. Priority is for prioritized workloads; it is not the only way to get the 1M context window. The 1M input context is a model property of Gemini 3.5 Flash.
Search grounding can trigger more than one search query per Gemini request. Costs are charged per individual search query.
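A quick estimate makes the table easier to reason about. The sketch below hard-codes the per-million-token prices from the table above and the $14 per 1,000 search queries rate. It ignores caching, storage and the free grounding allowance, and the function names are hypothetical.

```python
# (input, output) prices in USD per 1M tokens, copied from the pricing table.
PRICES_PER_MILLION = {
    "standard": (1.50, 9.00),
    "batch": (0.75, 4.50),
    "flex": (0.75, 4.50),
    "priority": (2.70, 16.20),
}
PRICE_PER_1000_SEARCH_QUERIES = 14.00


def token_cost(mode: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate token cost in USD; output_tokens should include thinking tokens."""
    input_price, output_price = PRICES_PER_MILLION[mode]
    return (
        input_tokens / 1_000_000 * input_price
        + output_tokens / 1_000_000 * output_price
    )


def grounding_cost(billable_search_queries: int) -> float:
    """Estimate grounding cost in USD for queries beyond the free allowance."""
    return billable_search_queries / 1000 * PRICE_PER_1000_SEARCH_QUERIES


# 200K input tokens and 5K output tokens in Standard mode:
print(round(token_cost("standard", 200_000, 5_000), 4))  # 0.345
```

The same request in Batch or Flex mode comes to about half of that. A request that fills the full 1,048,576-token input window costs roughly $1.57 in input tokens alone at Standard prices.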
1M context: useful, but not free
Gemini 3.5 Flash supports 1,048,576 input tokens and 65,536 output tokens. That is useful for large PDFs, long codebases, multi-file analysis and agent traces. It also increases cost, latency and the chance that irrelevant material distracts the model.
Good long-context habits:
- Structure input with headings and file boundaries.
- Remove irrelevant sections before sending.
- Use File Search or RAG instead of pasting everything into every request.
- Check context caching for repeated large inputs.
- Summarize long conversations.
- Specify output formats strictly.
- Do not blindly copy entire repositories or PDFs into every request.
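Two of these habits, clear file boundaries and a size limit, can be combined in a small prompt builder. This is a sketch under stated assumptions: the boundary markers are an arbitrary convention, the default budget is a hypothetical value, and four characters per token is only a rough heuristic, so use the API's token counting when exact numbers matter.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption), not an exact tokenizer


def build_context(files: dict[str, str], budget_tokens: int = 200_000) -> str:
    """Join files with explicit boundaries, stopping before the budget is exceeded."""
    sections = []
    used = 0
    for path, text in files.items():
        section = f"=== FILE: {path} ===\n{text.strip()}\n=== END FILE ===\n"
        estimated = len(section) // CHARS_PER_TOKEN + 1
        if used + estimated > budget_tokens:
            break  # stop at the first file that does not fit
        sections.append(section)
        used += estimated
    return "\n".join(sections)


context = build_context({"a.py": "print(1)", "notes.md": "x" * 1000}, budget_tokens=50)
print("notes.md" in context)  # False
```

Ordering the files by relevance before calling the builder means the least important material is what gets cut first.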
Managed Agents in the Gemini API
Managed Agents are cloud-based Gemini API agents. A single API call can start an agent that can reason, use tools and execute code in an isolated, ephemeral Linux environment.
According to Google, Managed Agents are powered by the Antigravity agent and built on Gemini 3.5 Flash. They are available through the Interactions API and Google AI Studio. For Mac users, this is a cloud-agent workflow, not local Apple Silicon inference.
Availability note: Google describes Managed Agents as rolling out in preview in the Gemini API. Treat this as a Google cloud/preview feature whose availability may depend on account, region, API access and product changes.
Antigravity 2.0 and Mac developer workflows
Antigravity 2.0 is Google’s agent-first developer platform and desktop app. It is designed around orchestration: multiple agents, subagents, scheduled tasks and integrations.
For Mac developers, Antigravity can be interesting as a cloud-agent development environment. Do not confuse it with local inference. Your Mac is the development machine; Gemini 3.5 Flash and the managed agents run in Google’s cloud ecosystem.
Gemini Spark and the macOS app
Google is also working on updates to the Gemini app for macOS. Gemini Spark is planned for the Gemini desktop app and is positioned as an agent that can help with local files and desktop workflows.
The key distinction: Gemini Spark is cloud-based. It may interact with desktop context and local files through the macOS app, but the model inference still runs in Google’s cloud, not on Apple Silicon.
Use this split:
- Local inference: Ollama, LM Studio, MLX.
- Cloud agents: Gemini Spark, Antigravity, Managed Agents.
Gemini 3.5 Flash vs local AI on Mac
| Use case | Better default | Reason |
|---|---|---|
| Confidential documents | Local model | Files stay on the Mac if the workflow is configured locally |
| Offline work | Local model | No internet or API dependency |
| Open-weight experiments | Local model | Reproducible model files, quantization and runtime testing |
| Very large context | Gemini 3.5 Flash | 1M input tokens, but cost and latency need control |
| Multimodal input | Gemini 3.5 Flash | Text, image, video, audio and PDF inputs |
| Tool calling and code execution | Gemini 3.5 Flash | Built-in cloud tooling and agent infrastructure |
| No token costs | Local model | You pay with hardware, power and time instead |
Useful internal next reads: Apple Intelligence vs local AI, Ollama setup, LM Studio vs Ollama, Unified Memory, Best open-weight LLMs and Whisper setup.
Privacy, logs and abuse monitoring
Gemini 3.5 Flash is a cloud model. Data leaves your device.
The privacy picture depends on tier, API surface and project configuration:
- Free Tier: according to Google’s pricing page, content may be used to improve Google products.
- Paid Tier: according to the pricing page, content is not used to improve Google products.
- Billing-enabled logs: Google’s data logging policy says prompts and responses in logs are not used for product improvement by default, unless you actively share datasets or feedback.
- Abuse monitoring: Google says prompts, contextual information and outputs may be retained for 55 days for misuse detection and policy enforcement.
- Interactions API storage: interactions are stored by default for server-side state and observability; Google lists 55 days retention for Paid Tier and 1 day for Free Tier, with store=false available for stateless requests where compatible.
Do not reduce this to “Google trains on everything” or “paid is automatically fully private.” For sensitive files, use local models, an enterprise/Vertex AI setup, or a formal compliance review.
When Gemini 3.5 Flash makes sense
Use Gemini 3.5 Flash when you need:
- 1M context for long documents or codebases.
- Multimodal input.
- Code execution.
- Function calling and structured outputs.
- File Search.
- Search or Maps grounding.
- Managed Agents or Antigravity workflows.
- Long-horizon cloud tasks.
When local AI is better
Use local AI when you need:
- Private documents to stay on the Mac.
- Offline work.
- Open-weight model experiments.
- Reproducible local benchmarks.
- No per-token API bill.
- Tight control over model files, quantization and runtime.
Start with Ollama or LM Studio if the goal is local inference on Apple Silicon.
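The two lists above can be turned into a simple per-task router. The sketch below is one possible rule order, not a recommendation from Google. The `Task` fields, the `route` function and the 32,000-token local context limit are hypothetical, and the limit in particular depends on the local model and the Mac's RAM.

```python
from dataclasses import dataclass

LOCAL_CONTEXT_LIMIT = 32_000  # hypothetical; depends on local model and RAM


@dataclass
class Task:
    confidential: bool = False
    offline: bool = False
    estimated_tokens: int = 0
    needs_cloud_tools: bool = False  # code execution, grounding, File Search, agents
    multimodal: bool = False


def route(task: Task) -> str:
    """Return 'local' or 'gemini-3.5-flash'; privacy and offline needs win first."""
    if task.confidential or task.offline:
        return "local"
    if task.needs_cloud_tools or task.multimodal:
        return "gemini-3.5-flash"
    if task.estimated_tokens > LOCAL_CONTEXT_LIMIT:
        return "gemini-3.5-flash"
    return "local"


print(route(Task(confidential=True, estimated_tokens=500_000)))  # local
```

Checking confidentiality first means a private document is never sent to the cloud just because it is large. In that case the task has to be split or summarized locally.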
Common Gemini 3.5 Flash mistakes
- Confusing the official model ID with provider aliases: in the Gemini API, use gemini-3.5-flash, not necessarily google/gemini-3.5-flash.
- Copying old or unrelated API examples: use the official Google GenAI SDK or REST syntax for Gemini.
- Using outdated thinking parameters: for Gemini 3.x, use thinking_level instead of older budget-style examples from other model generations.
- Looking for Gemini 3.5 Flash in Ollama, LM Studio or MLX: it is not a local open-weight model.
- Filling the 1M context window blindly: long context increases cost, latency and failure surface.
- Mixing Free Tier and Paid Tier privacy assumptions: data use and product-improvement rules differ by tier.
- Treating Search Grounding as free by default: grounding can create billable search queries after the free allowance.
- Treating Gemini Spark as local inference: Spark may integrate with Mac workflows, but it is not local Apple Silicon inference.
- Expecting Computer Use, image generation or audio output: Gemini 3.5 Flash supports multimodal input and text output, but not those features right now.
Conclusion
Gemini 3.5 Flash does not run on Apple Silicon, and that is not why it is interesting for Mac users. It matters because it combines 1M context, multimodal input, thinking, tool use, code execution, file search, grounding and Managed Agents in a fast cloud model.
For private files, offline work and open-weight experimentation, Ollama, LM Studio and MLX remain the better choice. For large context, agentic coding workflows, tool chains and cloud agents, Gemini 3.5 Flash is a practical cloud complement.
The useful question is which data must stay on the Mac and which tasks benefit from Google’s cloud tools. Local models cover many routine tasks; Gemini is relevant when its large context, multimodal inputs or managed tools justify sending data to a cloud service.
Sources and status
Status: checked on May 27, 2026, after Google I/O 2026. Model names, prices, limits, product availability and supported features can change. The details for model ID, status, context window, output limit, features, thinking levels, prices and Google I/O announcements come from official Google/Gemini sources.
- Gemini 3.5 Flash Model Page
- What’s new in Gemini 3.5 Flash
- Interactions API
- Gemini API Pricing
- Gemini 3 Developer Guide
- Gemini Thinking
- Gemini API Data Logging and Sharing
- Gemini API Abuse Monitoring
- Gemini 3.5: frontier intelligence with action
- Google I/O 2026 collection
- Managed Agents in the Gemini API
- I/O 2026 developer highlights
- Gemini app / Gemini Spark / macOS
Frequently Asked Questions
What is Gemini 3.5 Flash?
Gemini 3.5 Flash is Google's mid-tier LLM in the Gemini 3.5 family, positioned between Gemini 3.5 Pro (quality) and Gemini 3.5 Flash-Lite (cost). It targets latency-sensitive use cases like chat, agentic tool use, and large-context retrieval, with a 1M-token context window and lower per-token pricing than the Pro tier.
How much does Gemini 3.5 Flash cost?
Google's published Standard list prices are $1.50 per 1M input tokens and $9.00 per 1M output tokens, with thinking tokens billed as output. Batch and Flex cost half as much, Priority costs more, and cached inputs are cheaper. Verify current pricing on ai.google.dev before production use.
Can Gemini 3.5 Flash run locally on a Mac in Ollama or MLX?
No. Gemini is a Google-hosted API model. There is no Ollama tag, no MLX checkpoint, and no LM Studio preset for Gemini 3.5 Flash. For local Mac workflows you need open-weight alternatives like Qwen3, Llama 3.3, Mistral, or Gemma. This article explains how to call Gemini via the API from a Mac and where the local/cloud line makes sense.
Does Google train on my Gemini 3.5 Flash API prompts?
By default Google does not use paid-tier API data for training. The Gemini API free tier and consumer products (Gemini app, AI Studio free) have different retention and training rules. Sensitive data should be redacted, prompts scoped, and ideally routed through a local model first. Review Google's current data usage policy on ai.google.dev before processing personal data.
What is the best Mac workflow with Gemini 3.5 Flash?
Use Gemini 3.5 Flash for long-context retrieval (>200K tokens), agent workflows with Gemini's managed tools, and multimodal inputs (image, video, audio) where local alternatives are weak. Use a local model (Ollama, LM Studio, MLX) for private documents, offline work, reproducible testing and cheap everyday chat. The recommended pattern is a hybrid router that picks the right model per task.