StepFun Step 3.7 Flash: Mac RAM, Pricing & 256K

StepFun Step 3.7 Flash is the kind of model that is easy to misunderstand from the headline alone: 198B parameters, but only about 11B active parameters per token. Add a 256K context window, native image and video understanding, tool calling and three reasoning effort levels, and it sounds like a dream model for local Mac AI. It is not that simple.

Quick answer: StepFun Step 3.7 Flash is an open 198B MoE vision-language model for agents, coding, tool use and multimodal workflows, but it is not a practical local model for normal Macs. Configurations with 8, 16, 24 or 32 GB of unified memory are not realistic for local use. Local experiments only start to make sense on very high-memory systems: StepFun and the model cards mention devices such as Mac Studio or MacBook Pro configurations with at least 128 GB of unified memory. For most Mac users, Step 3.7 Flash is mainly an API, cloud or workstation topic.

What is StepFun Step 3.7 Flash?

Step 3.7 Flash is StepFun’s multimodal Flash model for production agent workflows. It combines a large MoE language backbone with a vision encoder and is designed for tasks where a model does not just answer, but plans across many steps, uses tools, analyzes files, edits code and processes visual context.

That makes it less of a simple chatbot model and more of a model for workflows such as:

coding agents
terminal and browser agents
multi-step tool chains
long document analysis
UI, screenshot and chart understanding
search and verification loops
structured extraction from large files
agents that need to stay coherent over long task chains

The important point: Step 3.7 Flash is openly available, but it is not small. It is not the kind of model you casually start on a normal MacBook with a simple `ollama run` command.

Graphic: Step 3.7 Flash at a glance

198B total parameters

~11B active parameters per token

256K context window

Text + image + video native multimodal input

Key facts

Property	StepFun Step 3.7 Flash
Model name	Step 3.7 Flash
API model name	`step-3.7-flash`
Architecture	Sparse Mixture-of-Experts
Total parameters	198B
Active parameters	about 11B per token
Context window	256K tokens
Input	text, image, video
Output	text
Reasoning levels	`low`, `medium`, `high`
Tool calling	yes
API format	OpenAI-compatible Chat Completions
License	Apache 2.0
Practical on normal Macs	no
Realistic local target class	128 GB unified memory or server/workstation

Why 198B MoE is not the same as a normal 11B model

MoE means Mixture of Experts. In simple terms, the model contains many expert blocks, but only a subset is activated for each token. That is why Step 3.7 Flash can have 198B total parameters while activating only around 11B parameters per token.

This makes it more efficient than a dense 198B model, but it does not make it equivalent to a small 11B model. All of the weights still need to be stored, loaded and managed, because any expert can be selected for a given token. You also need memory for the KV cache, which grows with the context you actually use, plus the vision components, the operating system, the runtime and quantization overhead.

For Mac users, this distinction matters:

11B active does not mean it runs like a normal 11B model.
198B total parameters means the memory footprint is still huge.
256K context means KV cache memory use can grow quickly.
Efficient MoE does not mean MacBook-friendly.
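
The arithmetic behind these points can be sketched in a few lines. The weight estimate uses only the 198B and 11B figures from this article. The bits-per-weight value for the 4-bit class is a rough assumption, and the KV cache inputs (layer count, KV heads, head size) are hypothetical placeholders, not the published architecture of Step 3.7 Flash.

```python
TOTAL_PARAMS = 198e9   # total parameters, from the key facts above
ACTIVE_PARAMS = 11e9   # approximate active parameters per token


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Generic KV cache size for standard attention.

    Each token stores one key and one value vector per layer, hence the
    factor 2. Models with sliding-window or compressed attention need less.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


if __name__ == "__main__":
    # Every expert must be resident, even though only ~11B weights run per token.
    print(f"198B at 16 bits:  {weight_gb(TOTAL_PARAMS, 16):.0f} GB")
    print(f"198B at 4.5 bits: {weight_gb(TOTAL_PARAMS, 4.5):.0f} GB  (assumed 4-bit-class average)")
    print(f"11B at 16 bits:   {weight_gb(ACTIVE_PARAMS, 16):.0f} GB  (a dense 11B model, for comparison)")

    # HYPOTHETICAL architecture values, chosen only to show how the cache scales.
    for tokens in (8_000, 64_000, 256_000):
        size = kv_cache_gb(tokens, layers=48, kv_heads=8, head_dim=128)
        print(f"KV cache at {tokens:>7} tokens: {size:.1f} GB")
```

The 16-bit result lands close to the size of the BF16 GGUF file listed in the next section, which is a useful sanity check. The KV cache numbers only show the linear growth with context length and should not be read as measurements of this model.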

Does Step 3.7 Flash run locally on Mac?

Theoretically: yes, with suitable quantizations and enough memory.

Practically: not on normal Macs.

The GGUF variants show why. Depending on quantization, the model files are roughly in this range:

Variant	Approximate size	Practical meaning
BF16 GGUF	about 394 GB	unquantized 16-bit reference, not normal local use
Q8_0	about 209 GB	still extremely large
Q4_K_S	about 112 GB	realistic only with very high unified memory
IQ4_XS	about 105 GB	smaller, but still high-memory
Q3_K_M	about 94 GB	aggressive, quality and setup matter
IQ3_XXS	about 76 GB	smallest option, only when memory is the main constraint
Vision projector	about 4 GB	additional file for image processing

That means a Mac with 8, 16, 24 or 32 GB of unified memory is not the target hardware. Even 64 GB is difficult for comfortable use, especially with long context, vision input or other apps running.
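
A minimal sketch of this fit check, using the approximate file sizes from the table above. The 16 GB reserve for macOS, the runtime and the KV cache is an assumption for illustration, not a measured value, and the function name is hypothetical.

```python
# Approximate GGUF file sizes in GB, copied from the table above.
GGUF_SIZES_GB = {
    "BF16": 394,
    "Q8_0": 209,
    "Q4_K_S": 112,
    "IQ4_XS": 105,
    "Q3_K_M": 94,
    "IQ3_XXS": 76,
}
VISION_PROJECTOR_GB = 4  # additional file for image processing


def variants_that_fit(unified_memory_gb: int, reserve_gb: int = 16,
                      with_vision: bool = False) -> list[str]:
    """Return the variants whose files fit after subtracting a reserve.

    reserve_gb is an ASSUMED allowance for the OS, runtime and KV cache.
    """
    budget = unified_memory_gb - reserve_gb
    extra = VISION_PROJECTOR_GB if with_vision else 0
    return [name for name, size in GGUF_SIZES_GB.items() if size + extra <= budget]


if __name__ == "__main__":
    for memory in (32, 64, 96, 128):
        print(f"{memory:>3} GB:", variants_that_fit(memory) or "nothing fits")
```

With this reserve, nothing fits at 32 or 64 GB, only the smallest variant fits at 96 GB, and the 4-bit variants fit at 128 GB. Long context or vision input shrinks the usable budget further.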

A fair Mac framing is:

8–32 GB unified memory: not practical locally.
64 GB unified memory: at most very constrained experiments with aggressive quantization.
96 GB unified memory: experimental, but not the comfortable target class.
128 GB unified memory: the first realistic high-memory class for local experiments.
Server/workstation: more realistic for production use.

Graphic: Mac reality check for Step 3.7 Flash

Normal Mac	8–32 GB unified memory	Not practical locally
High-memory Mac	96–128 GB unified memory	Experimental
Workstation / server	vLLM, SGLang, llama.cpp, NIM	More realistic deployment
API / cloud	StepFun Open Platform	Best path for most users

API access and pricing

For most Mac users, the API is the more realistic path. StepFun offers Step 3.7 Flash through the global Open Platform and the China platform. Important: API keys are regional. A key from the global platform works only with the global base URL, and a key from the China platform works only with the China base URL.

Platform	Base URL
Global	`https://api.stepfun.ai/v1`
China	`https://api.stepfun.com/v1`
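
One way to keep keys and base URLs paired is a small lookup that fails loudly on an unknown region. This is a sketch: the region labels and the function name are hypothetical, and only the two URLs come from the table above.

```python
# Base URLs from the table above; the region labels are illustrative.
BASE_URLS = {
    "global": "https://api.stepfun.ai/v1",
    "china": "https://api.stepfun.com/v1",
}


def resolve_base_url(region: str) -> str:
    """Return the base URL for a region, or raise if the region is unknown."""
    key = region.strip().lower()
    if key not in BASE_URLS:
        raise ValueError(f"Unknown region {region!r}; expected one of {sorted(BASE_URLS)}")
    return BASE_URLS[key]


if __name__ == "__main__":
    print(resolve_base_url("global"))
```

Storing the region next to the key, for example in the same secret entry, avoids pairing a global key with the China endpoint by accident.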

The official pricing is lower than many large frontier-model APIs, but it is not free:

Token type	Price
Input cache hit	$0.04 / 1M tokens
Input cache miss	$0.20 / 1M tokens
Output	$1.15 / 1M tokens

This is interesting for agent workflows because repeated context blocks can become cheaper with caching. Still, 256K context can become expensive if you blindly paste entire repositories, PDFs or log files into every request.
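
To see how caching changes the bill, the listed prices can be put into a short estimator. This is a sketch that assumes billing is strictly per token at the prices above; the function name and the example token counts are hypothetical.

```python
# Prices in USD per 1M tokens, from the table above.
PRICE_CACHE_HIT = 0.04
PRICE_CACHE_MISS = 0.20
PRICE_OUTPUT = 1.15


def request_cost_usd(input_tokens: int, output_tokens: int,
                     cached_input_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD."""
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_input_tokens
    total = (cached_input_tokens * PRICE_CACHE_HIT
             + uncached * PRICE_CACHE_MISS
             + output_tokens * PRICE_OUTPUT)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent step: 200K tokens of context, 2K tokens of output.
    cold = request_cost_usd(200_000, 2_000)
    warm = request_cost_usd(200_000, 2_000, cached_input_tokens=190_000)
    print(f"no cache hits:   ${cold:.4f} per step, ${50 * cold:.2f} for 50 steps")
    print(f"190K cache hits: ${warm:.4f} per step, ${50 * warm:.2f} for 50 steps")
```

In this example a single step costs about $0.042 without cache hits and about $0.012 when most of the context is cached. The gap adds up over a long agent run.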

API example on Mac

StepFun uses an OpenAI-compatible Chat Completions format. On Mac, you can therefore use the OpenAI Python client with StepFun’s base URL and model name.

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ.get("STEP_BASE_URL", "https://api.stepfun.ai/v1"),
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant for Mac AI workflows."
        },
        {
            "role": "user",
            "content": "Explain why Step 3.7 Flash is difficult to run locally on a normal Mac."
        }
    ],
    reasoning_effort="medium",
)

print(completion.choices[0].message.content)

Important: API keys do not belong in frontend code, public repositories or static Astro pages. Use environment variables, a backend, a serverless function or secure secret storage.

Reasoning levels: low, medium, high

Step 3.7 Flash supports three reasoning levels:

Level	Best for
`low`	simple questions, summaries, rewriting, extraction
`medium`	default for normal multi-step tasks
`high`	difficult coding, planning, math, deeper analysis

For everyday questions, `high` is usually unnecessary. For agent runs, complex code analysis or long document chains, it can make sense. Best practice: start with `medium`, switch to `high` only for hard tasks and use `low` for simple extraction or rewriting.
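
In application code, this rule of thumb can be captured in a small lookup. The task labels and the function name below are hypothetical. Only the three level names and their rough mapping come from the table above.

```python
# Illustrative task labels mapped to the levels from the table above.
EFFORT_BY_TASK = {
    "simple_question": "low",
    "summary": "low",
    "rewriting": "low",
    "extraction": "low",
    "difficult_coding": "high",
    "planning": "high",
    "math": "high",
    "deep_analysis": "high",
}


def pick_reasoning_effort(task_type: str) -> str:
    """Return a reasoning level, defaulting to "medium" for anything unlisted."""
    return EFFORT_BY_TASK.get(task_type, "medium")


if __name__ == "__main__":
    for task in ("summary", "multi_step_task", "math"):
        print(task, "->", pick_reasoning_effort(task))
```

The returned string can be passed as `reasoning_effort` in the API call shown earlier.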

Coding, agents and tool calling

Step 3.7 Flash is interesting because it is not just optimized for chat. StepFun positions it clearly for agent frameworks, tool use and production workflows. That includes:

terminal tasks
browser workflows
file operations
office-style workflows
search and verification loops
multi-file code edits
tool calling with `tools` and `tool_choice`

For Mac users, that means the model does not replace local tools like Ollama or LM Studio. It can be a strong cloud complement when local models hit limits in context length, tool stability or complex planning.
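
A minimal sketch of a tool-calling request, assuming StepFun accepts the standard OpenAI `tools` schema as its OpenAI-compatible format suggests. The `list_files` tool and the builder function are hypothetical, and the sketch only builds the payload without sending it.

```python
import json


def build_tool_request(user_prompt: str) -> dict:
    """Build a Chat Completions payload that offers one tool to the model."""
    # Hypothetical tool definition in the OpenAI function-calling schema.
    list_files_tool = {
        "type": "function",
        "function": {
            "name": "list_files",
            "description": "List the files in a directory of the project.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Directory to list."},
                },
                "required": ["path"],
            },
        },
    }
    return {
        "model": "step-3.7-flash",
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [list_files_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


if __name__ == "__main__":
    payload = build_tool_request("Which Python files are in src/?")
    print(json.dumps(payload, indent=2))
```

With the OpenAI Python client, such a payload can be sent as `client.chat.completions.create(**payload)`. Your own code then has to execute any requested tool call and return the result in a follow-up message.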

Multimodality: image and video

Step 3.7 Flash supports native image and video understanding. This matters for tasks such as:

analyzing screenshots
turning UI wireframes into code
describing charts
extracting structured data from images
identifying visual app issues
using video or frame context in agent workflows

Still, this should not be exaggerated. Multimodality does not mean every complex PDF page or tiny UI detail will be interpreted perfectly. For production workflows, cropping, clear screenshots, readable text and good prompts still matter.
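
For screenshots, a request message could look like the sketch below. It assumes that StepFun accepts OpenAI-style `image_url` content parts with base64 data URLs, which this article does not confirm. The helper name is hypothetical.

```python
import base64


def image_message(image_bytes: bytes, question: str,
                  mime_type: str = "image/png") -> dict:
    """Build a user message that combines a question with one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                # ASSUMPTION: data URLs are accepted, as in the OpenAI format.
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice read a cropped, readable screenshot from disk.
    message = image_message(b"not a real image", "What is wrong with this dialog?")
    print(message["content"][1]["image_url"]["url"][:40])
```

Cropping the screenshot to the relevant region before encoding keeps the request small and follows the advice above about clear, readable input.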

How to read the benchmarks

StepFun publishes strong benchmark signals for agents, coding and multimodal tasks. Examples include:

Area	Benchmark	Step 3.7 Flash
Agentic coding	SWE-Bench Pro	56.3
Terminal/agent	Terminal-Bench 2.1	59.5
Tool use	Toolathlon	49.5
Agent robustness	ClawEval-1.1	67.1
Multimodal	SimpleVQA with Tool	79.2
Multimodal	V* with Python	95.3

These values are useful, but they are not a guarantee for your own website, repository or Mac workflow. Benchmarks depend on harness, tooling, prompting, model version, reasoning level, context, temperature and evaluation method.

The right statement is not: “Step 3.7 Flash is better than everything else.” It is:

Step 3.7 Flash looks strong for agent, coding and multimodal tool workflows, but practical tests still matter.

Graphic: official benchmark signals, not own Mac tests

SWE-Bench Pro	56.3
Terminal-Bench 2.1	59.5
Toolathlon	49.5
ClawEval-1.1	67.1
SimpleVQA with Tool	79.2
V* with Python	95.3

Step 3.7 Flash vs local Mac models

Step 3.7 Flash and local Mac models solve different problems.

Criterion	Step 3.7 Flash	Local Mac models
Privacy	cloud processing when using API	can stay fully local
Context	256K tokens	depends on model, RAM and runtime
Model size	198B MoE	usually 3B to 32B on normal Macs
Cost	API cost or expensive hardware	no token cost, but hardware/time
Offline use	only with local high-memory setup	yes
Agent performance	key target area	depends on model
Setup effort	API easy, local hard	small models easy
Normal Mac fit	API yes, local no	yes

For confidential files, private notes and offline work, local models with Ollama, LM Studio, MLX or llama.cpp remain the better choice. For large agent runs, long contexts, complex tool chains and multimodal API workflows, Step 3.7 Flash can be a strong complement.

When Step 3.7 Flash makes sense

Step 3.7 Flash is most useful when you:

run coding agents over large repositories
need many tool calls
analyze long documents
combine image/video understanding with reasoning
can justify API cost against model quality
accept cloud processing
have high-memory hardware
use an agent framework with an OpenAI-compatible API

When it does not make sense

Step 3.7 Flash is probably not the right choice if you:

want local inference on a normal Mac
cannot send sensitive data to the cloud
only need simple chat tasks
want zero token cost
need a small offline model
have 8, 16, 24 or 32 GB of unified memory
only need short summaries or basic coding help

For those cases, Gemma, Qwen, Llama, Mistral or smaller coding models are usually more practical on Mac.

FAQ

Does Step 3.7 Flash run on a MacBook Air?

Not realistically as a local model. For normal MacBook Air configurations, the model is far too large. Use smaller local models or the API instead.

Is 32 GB of unified memory enough?

Not for practical local use. 32 GB is useful for many 7B, 8B, 14B or sometimes 27B models, but not for a 198B MoE model of this class.

Why are 11B active parameters not enough for normal Macs?

Because the total weights are still huge. MoE activates only part of the model per token, but model weights, quantization, KV cache and runtime still need a realistic memory budget.

Is Step 3.7 Flash open source?

More precisely: Step 3.7 Flash is an open-weight model under Apache 2.0. The weights are openly available, but the model is still not an easy local Mac model because of its size.

What does the API cost?

StepFun lists $0.04 per 1M input tokens for cache hits, $0.20 per 1M input tokens for cache misses and $1.15 per 1M output tokens.

Is Step 3.7 Flash better than local Qwen, Gemma or Llama models?

Not universally. It is larger and more focused on agents, tool use and multimodality. Local models are more private, cheaper to run once installed and more realistic on normal Macs.

Conclusion

StepFun Step 3.7 Flash is an exciting model, but not because it suddenly makes a huge open MoE model easy to run on every Mac. It is a large open MoE model for agents, coding, tool use and multimodal workflows.

Its strength is the combination of 198B total parameters, about 11B active parameters per token, 256K context, reasoning levels, tool calling and API availability. Its limit is just as clear: normal Macs are not the local target hardware.

The most useful framing for AI on Mac is therefore: local models for private offline work, Step 3.7 Flash for large agent and cloud workflows, and local experiments only with a lot of unified memory.

Sources and status

Status: June 18, 2026. Model specs, pricing, availability and benchmark values can change.