What is Gemini 3.5 Flash?

Gemini 3.5 Flash is Google's mid-tier LLM in the Gemini 3.5 family, positioned between Gemini 3.5 Pro (quality) and Gemini 3.5 Flash-Lite (cost). It targets latency-sensitive use cases like chat, agentic tool use, and large-context retrieval, with a 1M-token context window and lower per-token pricing than the Pro tier.

How much does Gemini 3.5 Flash cost?

Google's published list prices are $1.50 per 1M input tokens and $9 per 1M output tokens for prompts up to 200K tokens of context. Above 200K the price increases. Cached input tokens are cheaper ($0.15 per 1M in Standard mode, plus a storage fee). Verify current pricing on ai.google.dev before production use.

Can Gemini 3.5 Flash run locally on a Mac in Ollama or MLX?

No. Gemini is a Google-hosted API model. There is no Ollama tag, no MLX checkpoint, and no LM Studio preset for Gemini 3.5 Flash. For local Mac workflows you need open-weight alternatives like Qwen3, Llama 3.3, Mistral, or Gemma. This article explains how to call Gemini via the API from a Mac and where the local/cloud line makes sense.

Does Google train on my Gemini 3.5 Flash API prompts?

By default Google does not use paid-tier API data for training. The Gemini API free tier and consumer products (Gemini app, AI Studio free) have different retention and training rules. Sensitive data should be redacted, prompts scoped, and ideally routed through a local model first. Review Google's current data usage policy on ai.google.dev before processing personal data.

What is the best Mac workflow with Gemini 3.5 Flash?

Use Gemini 3.5 Flash for long-context retrieval (>200K tokens), agent workflows with Gemini's managed tools, and multimodal inputs (image, video, audio) where local alternatives are weak. Use a local model (Ollama, LM Studio, MLX) for private documents, offline work, reproducible testing, and cheap everyday chat. The recommended pattern is a hybrid router that picks the right model per task.

Gemini 3.5 Flash on Mac: API Pricing, Not Ollama

What Google announced at I/O 2026

Google I/O 2026 took place on May 19, 2026. Google framed the event as a move into a more agentic Gemini era: models that do not just answer prompts, but can use tools, execute code, handle longer tasks and coordinate agents.

The central model story was Gemini 3.5. Gemini 3.5 Flash is the first stable model in that family. Google positions it for agentic coding, long-horizon tasks, tool use and real-world workflows.

According to Google, Gemini 3.5 Flash outperforms Gemini 3.1 Pro on several coding and agentic benchmarks, including Terminal-Bench 2.1 at 76.2%, GDPval-AA at 1656 Elo, MCP Atlas at 83.6%, and CharXiv Reasoning at 84.2%. Google also says its output speed, measured in tokens per second, is 4x that of other frontier models, citing Artificial Analysis.

What Gemini 3.5 Flash is

Gemini 3.5 Flash is a stable Gemini API model optimized for speed, agentic execution, coding and long-horizon workflows. It accepts text, image, video, audio and PDF inputs, and outputs text.

It is useful when you need a cloud model with a large context window, multimodal input, tool calling, code execution, File Search, Search/Maps grounding or Managed Agents.

It is not a local Mac model. Apple Silicon does not accelerate Gemini 3.5 Flash itself because inference runs on Google’s infrastructure.

Model ID and specs

Item	Gemini 3.5 Flash
Official model ID	`gemini-3.5-flash`
Status	Stable / generally available
Input token limit	1,048,576 tokens
Output token limit	65,536 tokens
Knowledge cutoff	January 2025, according to the model page
Supported inputs	Text, image, video, audio, PDF
Supported output	Text
Supported capabilities	Batch API, caching, code execution, File Search, Flex, function calling, Maps/Search grounding, Priority, structured outputs, thinking, URL context
Not supported	Audio generation, computer use, image generation, Live API

If a third-party router shows something like google/gemini-3.5-flash, treat that as provider-specific naming. In the official Gemini API, use gemini-3.5-flash.
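As a small illustration of that naming difference, the sketch below strips a router-style prefix before a name is passed to the official API. It assumes router names follow a `provider/model` pattern; `to_official_model_id` is a hypothetical helper, not part of any SDK.

```python
def to_official_model_id(name: str) -> str:
    """Return the part after the last slash, e.g. 'google/x' -> 'x'.

    Assumption: third-party routers prefix the model with 'provider/'.
    Names without a slash are returned unchanged.
    """
    return name.strip().rsplit("/", 1)[-1]


print(to_official_model_id("google/gemini-3.5-flash"))  # gemini-3.5-flash
print(to_official_model_id("gemini-3.5-flash"))         # gemini-3.5-flash
```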

Gemini 3.5 Flash is not a local Mac model

For Mac users, this distinction matters more than the benchmark headline.

Workflow	Where inference runs	Best for	Main trade-off
Ollama / LM Studio / MLX	Your Mac	Private files, offline work, open-weight experiments, no token costs	Limited by RAM, model size and local speed
Gemini 3.5 Flash API	Google cloud	1M context, multimodal input, tools, code execution, grounding, cloud agents	Data leaves device, API costs, network dependency

Apple Silicon accelerates local inference. For Gemini API usage, the important Mac factors are internet reliability, API key handling, cost controls, data-flow decisions and developer tooling.

API setup on Mac

Use Google AI Studio to create a Gemini API key. Store it as GEMINI_API_KEY.

export GEMINI_API_KEY="your-api-key-here"

Do not expose API keys in browser code, public GitHub repositories or static Astro pages. Use a backend, serverless function, edge function or secure secret management. I store my Gemini API key in a .env file and load it via environment variables — simple and safe enough for local development.
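One way to wire up the .env approach for local development is sketched below using only the standard library. The helper names (`parse_env_line`, `load_api_key`) and the fallback order are assumptions for illustration; a dedicated dotenv package would do the same job with more edge cases covered.

```python
import os


def parse_env_line(line: str):
    """Parse one KEY=VALUE line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    if line.startswith("export "):
        line = line[len("export "):]
    key, _, value = line.partition("=")
    # Drop surrounding quotes such as "..." or '...'.
    return key.strip(), value.strip().strip('"').strip("'")


def load_api_key(env_path: str = ".env", name: str = "GEMINI_API_KEY") -> str:
    """Prefer the process environment, then fall back to a local .env file."""
    value = os.environ.get(name)
    if value:
        return value
    if os.path.exists(env_path):
        with open(env_path, encoding="utf-8") as handle:
            for raw in handle:
                parsed = parse_env_line(raw)
                if parsed and parsed[0] == name:
                    return parsed[1]
    raise RuntimeError(f"{name} is not set; export it or add it to {env_path}")
```

Keep the .env file out of version control, for example by listing it in .gitignore.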

Python

from google import genai

# The client reads the API key from the GEMINI_API_KEY environment variable.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Explain unified memory on Apple Silicon in three sentences.",
)

print(response.text)

Python with thinking level

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Analyze the trade-offs between local AI and cloud AI on Mac.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high")
    ),
)

print(response.text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({});

async function main() {
  const response = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: "Create a checklist for a hybrid local/cloud AI workflow on Mac."
  });

  console.log(response.text);
}

main();

REST

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.5-flash:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "contents": [{
      "parts": [{
        "text": "Summarize the differences between local AI on Mac and Gemini 3.5 Flash."
      }]
    }]
  }'

GenerateContent or Interactions API?

The examples above use generateContent because it is the simplest way to show a first working request. Google now recommends the Interactions API beta for new projects, especially if you are building agentic workflows, long-running/background tasks or server-side multi-turn conversations.

Use this split:

Use generateContent for simple prompts, existing integrations, quick tests and cases where you need features that are not yet available in Interactions API, such as Batch API or explicit caching.
Use the Interactions API for new agentic apps, server-side history management, observable execution steps, tool orchestration, background tasks and future Gemini capabilities.

Important privacy/cost detail: Interactions API stores interaction objects by default (store=true) for server-side state, background execution and observability. Google lists a 55-day retention period for Paid Tier interactions and 1 day for Free Tier interactions. You can opt out with store=false, but that disables stored-state workflows such as previous_interaction_id and is incompatible with background execution.
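The storage rules above can be captured as a small pre-flight check. This is a sketch under stated assumptions: it does not call the Interactions API, the `background` flag name is hypothetical, and the retention values are simply the ones quoted in this article.

```python
from datetime import datetime, timedelta, timezone

# Retention periods as quoted in this article; verify against Google's docs.
RETENTION_DAYS = {"paid": 55, "free": 1}


def check_interaction_options(store, previous_interaction_id=None, background=False):
    """Raise ValueError for combinations described as incompatible with store=False."""
    if not store and previous_interaction_id is not None:
        raise ValueError("previous_interaction_id needs stored interactions (store=True)")
    if not store and background:
        raise ValueError("background execution needs stored interactions (store=True)")


def retention_deadline(created_at, tier):
    """Return when a stored interaction would age out under the quoted retention."""
    return created_at + timedelta(days=RETENTION_DAYS[tier])


created = datetime(2026, 5, 27, tzinfo=timezone.utc)
print(retention_deadline(created, "paid").date())  # 2026-07-21
print(retention_deadline(created, "free").date())  # 2026-05-28
```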

Thinking levels and thought preservation

Gemini 3.5 Flash supports thinking. For Gemini 3.5 Flash, use thinking_level, not the older thinkingBudget pattern used by some Gemini 2.5 examples.

Recommended levels:

Thinking level	Use it for
`minimal`	Fast chat-like answers, simple facts, simple tool calls
`low`	Lower-latency coding, analysis and writing tasks
`medium`	Default for Gemini 3.5 Flash; best general setting for most tasks
`high`	Hard coding, debugging, math, multi-step agents and complex tool use
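A lookup like the following is one way to apply the table in code. The task labels are illustrative and not part of the Gemini API; only the four level names come from the table.

```python
# Task labels are hypothetical; level names follow the table above.
THINKING_LEVEL_BY_TASK = {
    "chat": "minimal",
    "simple_tool_call": "minimal",
    "quick_coding": "low",
    "writing": "low",
    "debugging": "high",
    "math": "high",
    "multi_step_agent": "high",
}

# The article lists "medium" as the default for Gemini 3.5 Flash.
DEFAULT_THINKING_LEVEL = "medium"


def pick_thinking_level(task_kind: str) -> str:
    """Map a task label to a thinking level, falling back to the default."""
    return THINKING_LEVEL_BY_TASK.get(task_kind, DEFAULT_THINKING_LEVEL)


print(pick_thinking_level("debugging"))  # high
print(pick_thinking_level("summarize"))  # medium
```

The returned string can be passed as `thinking_level` in the `types.ThinkingConfig(...)` call shown in the Python example earlier.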

Thought preservation is handled through thought signatures and conversation history. The GenAI SDK manages this automatically in normal usage. Preserved thoughts can increase input token usage across turns because signatures are sent back with the conversation. Do not expect raw chain-of-thought access. If the API or SDK exposes thought summaries, treat them as controlled summaries, not the full internal reasoning trace.

Pricing: Standard, Batch, Flex, Priority and grounding

Prices below are Google Gemini API paid-tier prices checked on May 27, 2026. Output prices include thinking tokens.

Mode	Input	Output incl. thinking	Context caching	Storage	Grounding
Standard	$1.50 / 1M tokens	$9.00 / 1M tokens	$0.15 / 1M tokens	$1.00 / 1M tokens/hour	5,000 prompts/month free, then $14 / 1,000 search queries
Batch	$0.75 / 1M tokens	$4.50 / 1M tokens	$0.075 / 1M tokens	$1.00 / 1M tokens/hour	5,000 requests/month free, then $14 / 1,000 search queries
Flex	$0.75 / 1M tokens	$4.50 / 1M tokens	$0.08 / 1M tokens	$1.00 / 1M tokens/hour	5,000 requests/month free, then $14 / 1,000 search queries
Priority	$2.70 / 1M tokens	$16.20 / 1M tokens	$0.27 / 1M tokens	$1.00 / 1M tokens/hour	5,000 prompts/month free, then $14 / 1,000 search queries

Batch is for asynchronous bulk workloads; it has nothing to do with "offline" local processing. Flex targets cost-sensitive workloads that can tolerate flexible serving. Priority is for workloads that need prioritized serving; it is not required for the 1M context window, which is a property of the Gemini 3.5 Flash model itself.
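To make the token prices concrete, here is a minimal cost sketch using the input and output rates from the table. The names `PRICES` and `estimate_token_cost` are illustrative, and the calculation deliberately ignores caching, storage, grounding and any long-context surcharge.

```python
from decimal import Decimal

# USD per 1M tokens, copied from the pricing table (as checked by the article).
PRICES = {
    "standard": {"input": Decimal("1.50"), "output": Decimal("9.00")},
    "batch": {"input": Decimal("0.75"), "output": Decimal("4.50")},
    "flex": {"input": Decimal("0.75"), "output": Decimal("4.50")},
    "priority": {"input": Decimal("2.70"), "output": Decimal("16.20")},
}
MILLION = Decimal(1_000_000)


def estimate_token_cost(mode: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Token cost in USD; output_tokens should include thinking tokens."""
    price = PRICES[mode]
    return (price["input"] * input_tokens + price["output"] * output_tokens) / MILLION


# 200,000 input tokens and 4,000 output tokens:
print(estimate_token_cost("standard", 200_000, 4_000))  # 0.336
print(estimate_token_cost("batch", 200_000, 4_000))     # 0.168
```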

Search grounding can trigger more than one search query per Gemini request. Costs are charged per individual search query.
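The effect of multiple queries per request can be estimated roughly as below. This sketch makes a loud assumption: the first 5,000 grounded prompts per month are treated as free regardless of how many queries they trigger, and every query from later prompts is billed at $14 per 1,000. How Google actually applies the free allowance may differ, so check the pricing page.

```python
FREE_GROUNDED_PROMPTS_PER_MONTH = 5_000  # from the pricing table
USD_PER_1000_SEARCH_QUERIES = 14.0       # from the pricing table


def estimate_grounding_cost(grounded_prompts: int, queries_per_prompt: float) -> float:
    """Rough monthly grounding cost in USD under the assumption stated above."""
    billable_prompts = max(0, grounded_prompts - FREE_GROUNDED_PROMPTS_PER_MONTH)
    billable_queries = billable_prompts * queries_per_prompt
    return billable_queries / 1000 * USD_PER_1000_SEARCH_QUERIES


# 8,000 grounded prompts, each triggering 2 search queries on average:
print(estimate_grounding_cost(8_000, 2))  # 84.0
```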

1M context: useful, but not free

Gemini 3.5 Flash supports 1,048,576 input tokens and 65,536 output tokens. That is useful for large PDFs, long codebases, multi-file analysis and agent traces. It also increases cost, latency and the chance that irrelevant material distracts the model.

Good long-context habits:

Structure input with headings and file boundaries.
Remove irrelevant sections before sending.
Use File Search or RAG instead of pasting everything into every request.
Check context caching for repeated large inputs.
Summarize long conversations.
Specify output formats strictly.
Do not blindly copy entire repositories or PDFs into every request.
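The first habit, explicit file boundaries, plus a quick size check might look like this. The boundary markers are an arbitrary convention chosen for the example, and the estimate of roughly 4 characters per token is a rule of thumb, not Gemini's tokenizer.

```python
import math

INPUT_TOKEN_LIMIT = 1_048_576  # input limit from the specs table


def pack_files(files: dict[str, str]) -> str:
    """Join files into one prompt with explicit start and end markers."""
    parts = []
    for path, text in files.items():
        parts.append(f"=== FILE: {path} ===\n{text.rstrip()}\n=== END FILE: {path} ===")
    return "\n\n".join(parts)


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude estimate only; use the API's token counting for real budgets."""
    return math.ceil(len(text) / chars_per_token)


def fits_input_limit(text: str) -> bool:
    """True if the rough estimate stays within the model's input limit."""
    return rough_token_estimate(text) <= INPUT_TOKEN_LIMIT


prompt = pack_files({"a.py": "x = 1\n", "b.py": "y = 2\n"})
print(prompt)
print(rough_token_estimate(prompt), fits_input_limit(prompt))
```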

Managed Agents in the Gemini API

Managed Agents are cloud-based Gemini API agents. A single API call can start an agent that can reason, use tools and execute code in an isolated, ephemeral Linux environment.

According to Google, Managed Agents are powered by the Antigravity agent and built on Gemini 3.5 Flash. They are available through the Interactions API and Google AI Studio. For Mac users, this is a cloud-agent workflow, not local Apple Silicon inference.

Availability note: Google describes Managed Agents as rolling out in preview in the Gemini API. Treat this as a Google cloud/preview feature whose availability may depend on account, region, API access and product changes.

Antigravity 2.0 and Mac developer workflows

Antigravity 2.0 is Google’s agent-first developer platform and desktop app. It is designed around orchestration: multiple agents, subagents, scheduled tasks and integrations.

For Mac developers, Antigravity can be interesting as a cloud-agent development environment. Do not confuse it with local inference. Your Mac is the development machine; Gemini 3.5 Flash and the managed agents run in Google’s cloud ecosystem.

Gemini Spark and the macOS app

Google is also working on updates to the Gemini app for macOS. Gemini Spark is planned for the Gemini desktop app and is positioned as an agent that can help with local files and desktop workflows.

The key distinction: Gemini Spark is cloud-based. It may interact with desktop context, but it is not a local LLM running on Apple Silicon. Even if Spark can work with local files through the macOS app, the model inference is not local Apple Silicon inference.

Use this split:

Local inference: Ollama, LM Studio, MLX.
Cloud agents: Gemini Spark, Antigravity, Managed Agents.

Gemini 3.5 Flash vs local AI on Mac

Use case	Better default	Reason
Confidential documents	Local model	Files stay on the Mac if the workflow is configured locally
Offline work	Local model	No internet or API dependency
Open-weight experiments	Local model	Reproducible model files, quantization and runtime testing
Very large context	Gemini 3.5 Flash	1M input tokens, but cost and latency need control
Multimodal input	Gemini 3.5 Flash	Text, image, video, audio and PDF inputs
Tool calling and code execution	Gemini 3.5 Flash	Built-in cloud tooling and agent infrastructure
No token costs	Local model	You pay with hardware, power and time instead
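The table translates fairly directly into a routing rule. The sketch below is one possible ordering, assuming privacy and offline constraints should win over capability needs; the `Task` fields and the 32,000-token local budget are hypothetical values to tune for your own hardware and models.

```python
from dataclasses import dataclass

# Assumption: what the local model handles comfortably on this Mac.
LOCAL_CONTEXT_BUDGET = 32_000


@dataclass
class Task:
    confidential: bool = False
    offline: bool = False
    needs_multimodal: bool = False
    needs_cloud_tools: bool = False
    estimated_input_tokens: int = 0


def route(task: Task) -> str:
    """Return 'local' or 'gemini-3.5-flash' for a task."""
    # Privacy and offline constraints are checked first and cannot be overridden.
    if task.confidential or task.offline:
        return "local"
    if task.needs_multimodal or task.needs_cloud_tools:
        return "gemini-3.5-flash"
    if task.estimated_input_tokens > LOCAL_CONTEXT_BUDGET:
        return "gemini-3.5-flash"
    return "local"


print(route(Task(confidential=True, needs_multimodal=True)))  # local
print(route(Task(estimated_input_tokens=400_000)))            # gemini-3.5-flash
print(route(Task()))                                          # local
```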

Useful internal next reads: Apple Intelligence vs local AI, Ollama setup, LM Studio vs Ollama, Unified Memory, Best open-weight LLMs and Whisper setup.

Privacy, logs and abuse monitoring

Gemini 3.5 Flash is a cloud model. Data leaves your device.

The privacy picture depends on tier, API surface and project configuration:

Free Tier: according to Google’s pricing page, content may be used to improve Google products.
Paid Tier: according to the pricing page, content is not used to improve Google products.
Billing-enabled logs: Google’s data logging policy says prompts and responses in logs are not used for product improvement by default, unless you actively share datasets or feedback.
Abuse monitoring: Google says prompts, contextual information and outputs may be retained for 55 days for misuse detection and policy enforcement.
Interactions API storage: interactions are stored by default for server-side state and observability; Google lists 55 days retention for Paid Tier and 1 day for Free Tier, with store=false available for stateless requests where compatible.

Do not reduce this to “Google trains on everything” or “paid is automatically fully private.” For sensitive files, use local models, an enterprise/Vertex AI setup, or a formal compliance review.
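When some text has to go to the cloud anyway, redacting obvious identifiers before it leaves the Mac is a common precaution. The patterns below are deliberately simple assumptions that will miss cases and produce false positives; they are a starting point, not a compliance control.

```python
import re

# Simple email pattern; not a full RFC 5322 matcher.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose pattern for phone-like digit runs of at least nine characters.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Mail jane.doe@example.com or call +1 415 555 0100."))
# Mail [EMAIL] or call [PHONE].
```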

When Gemini 3.5 Flash makes sense

Use Gemini 3.5 Flash when you need:

1M context for long documents or codebases.
Multimodal input.
Code execution.
Function calling and structured outputs.
File Search.
Search or Maps grounding.
Managed Agents or Antigravity workflows.
Long-horizon cloud tasks.

When local AI is better

Use local AI when you need:

Private documents to stay on the Mac.
Offline work.
Open-weight model experiments.
Reproducible local benchmarks.
No per-token API bill.
Tight control over model files, quantization and runtime.

Start with Ollama or LM Studio if the goal is local inference on Apple Silicon.

Common Gemini 3.5 Flash mistakes

Confusing the official model ID with provider aliases: in the Gemini API, use gemini-3.5-flash; names such as google/gemini-3.5-flash are router-specific aliases.
Copying old or unrelated API examples: use the official Google GenAI SDK or REST syntax for Gemini.
Using outdated thinking parameters: for Gemini 3.x, use thinking_level instead of older budget-style examples from other model generations.
Looking for Gemini 3.5 Flash in Ollama, LM Studio or MLX: it is not a local open-weight model.
Filling the 1M context window blindly: long context increases cost, latency and failure surface.
Mixing Free Tier and Paid Tier privacy assumptions: data use and product-improvement rules differ by tier.
Treating Search Grounding as free by default: grounding can create billable search queries after the free allowance.
Treating Gemini Spark as local inference: Spark may integrate with Mac workflows, but it is not local Apple Silicon inference.
Expecting Computer Use, image generation or audio output: Gemini 3.5 Flash supports multimodal input and text output, but not those features right now.

Conclusion

Gemini 3.5 Flash is interesting for Mac users not because it runs on Apple Silicon — it does not. It matters because it combines 1M context, multimodal input, thinking, tool use, code execution, file search, grounding and Managed Agents in a fast cloud model.

For private files, offline work and open-weight experimentation, Ollama, LM Studio and MLX remain the better choice. For large context, agentic coding workflows, tool chains and cloud agents, Gemini 3.5 Flash is a practical cloud complement.

The useful question is which data must stay on the Mac and which tasks benefit from Google’s cloud tools. Local models cover many routine tasks; Gemini is relevant when its large context, multimodal inputs or managed tools justify sending data to a cloud service.

Sources and status

Status: checked on May 27, 2026, after Google I/O 2026. Model names, prices, limits, product availability and supported features can change. The details for model ID, status, context window, output limit, features, thinking levels, prices and Google I/O announcements come from official Google/Gemini sources.
