OPEN SOURCE · AGENTIC INTELLIGENCE · JULY 2025

Hello, Kimi K2

Kimi K2 is Moonshot AI's original trillion-parameter open-source frontier model — a Mixture-of-Experts architecture with 32B active parameters, 128K context, and an agentic-first design built explicitly for coding, reasoning, and autonomous tool-use workflows. Released July 2025. Modified MIT License.

1T Parameters

32B Active / Token

128K Context

15.5T Training Tokens

Modified MIT License

01 — WHAT IS KIMI K2

The Open-Source Model That Changed the Frontier

Kimi K2 is Moonshot AI's state-of-the-art Mixture-of-Experts language model, released in July 2025 under a Modified MIT License. With one trillion total parameters and 32 billion activated per token, it became the most capable open-source agentic model available at release — distinguished not by scale alone, but by Moonshot's deliberate focus on autonomous, tool-using workflows over static benchmark performance.

What made K2's release historically significant was a rare combination: frontier benchmark performance across coding, reasoning, and knowledge tasks; deep native tool-use capability built through post-training on synthetic agentic scenarios; and full commercially usable open weights. Prior to K2, models at this capability level existed exclusively behind closed vendor APIs. K2 made frontier-class agentic intelligence genuinely accessible.

K2 ships in two variants: K2-Base, the raw pre-trained model for researchers and builders who want fine-tuning control, and K2-Instruct, the post-trained model ready for conversational and agentic use out of the box. Both sets of weights are on HuggingFace in block-fp8 format.

1T total, 32B active per token - MoE efficiency at trillion-parameter scale
384 expert networks - 8 specialist + 1 shared expert activated per token
61 layers - 1 dense standardization layer + 60 MoE layers
MuonClip optimizer - zero training instability across 15.5T tokens
128K context window - ~96,000 words per session
Modified MIT License - commercial use permitted for most deployments

KIMI K2 · JULY 2025

Open Agentic Intelligence

The original trillion-parameter open-source frontier. Built for coding, reasoning, and autonomous tool-use workflows at 32B active parameter efficiency.

65.8% SWE-bench

80% τ²-Bench

5× Cheaper

02 — MOE ARCHITECTURE

Mixture-of-Experts: Why 32B Active Feels Like 1T

Kimi K2's Mixture-of-Experts architecture is the engineering foundation that makes the model simultaneously powerful and practical to deploy. Rather than activating all one trillion parameters for every token — which would make inference cost-prohibitive — K2 uses sparse activation: only 32 billion parameters engage for any given token.

The architecture contains 384 specialized expert networks organized across 61 layers: one dense standardization layer at the base followed by 60 MoE layers for hierarchical processing. For each token, K2's routing mechanism selects exactly 8 specialist experts plus one shared expert. The shared expert ensures universal knowledge access; the 8 selected experts bring task-specific capabilities. This fine-grained routing — 384 experts with 9 activated — provides more specialization depth than coarser designs (e.g., 8 total experts with 2 activated).
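The routing step described above can be sketched in a few lines. This is a minimal illustration, not Moonshot's implementation: the gate scores are random stand-ins, and normalizing the selected scores with a softmax is an assumption, since this page does not describe K2's actual gating function.

```python
import math
import random

NUM_EXPERTS = 384   # routed experts, from the text
TOP_K = 8           # routed experts selected per token, from the text

def route_token(gate_logits, top_k=TOP_K):
    """Return [(expert_id, weight)] for the top_k highest-scoring experts."""
    if len(gate_logits) < top_k:
        raise ValueError("need at least top_k gate logits")
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only (assumed normalization).
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in gate scores
routed = route_token(logits)
active_experts = len(routed) + 1  # +1 for the always-on shared expert

print(f"{active_experts} experts active, "
      f"{active_experts / (NUM_EXPERTS + 1):.1%} of expert networks")
```

The shared expert is not routed at all: it runs for every token, which is why it is added after selection rather than competing for one of the 8 slots.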

The practical result: K2 delivers the knowledge capacity and representational depth of a trillion-parameter system at the inference cost of a ~32B dense model. API token costs are dramatically lower than the headline parameter count suggests, and self-hosted deployments need far less GPU memory than a truly dense 1T model would require. K2's MoE architecture is comparable in principle to DeepSeek-V3 but distinct in routing granularity, expert dimensionality, and the MuonClip optimizer that stabilizes training at this scale.

7,168 hidden dimension — 64 attention heads via Multi-Head Latent Attention (MLA)
Expert intermediate dimension: 2,048 — per MoE expert network
SwiGLU activations — proven performance at large model scale
160K vocabulary — broad multilingual, code, and scientific coverage
Block-fp8 format — efficient storage and inference
MLA attention — reduces KV cache size at longer context lengths
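A rough back-of-the-envelope calculation shows why sparse activation lowers compute cost but not the memory needed to hold the weights. The bytes-per-parameter values below are assumptions (1 byte for fp8, 2 for bf16), and the estimate ignores quantization scales, activations, and KV cache.

```python
TOTAL_PARAMS = 1.0e12    # 1T total parameters, from the text
ACTIVE_PARAMS = 32.0e9   # 32B activated per token, from the text

def weight_gib(num_params, bytes_per_param):
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    return num_params * bytes_per_param / 2**30

fp8_gib = weight_gib(TOTAL_PARAMS, 1)    # assumption: 1 byte per fp8 weight
bf16_gib = weight_gib(TOTAL_PARAMS, 2)   # assumption: 2 bytes per bf16 weight
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"fp8 weights:  ~{fp8_gib:,.0f} GiB")
print(f"bf16 weights: ~{bf16_gib:,.0f} GiB")
print(f"active per token: {active_fraction:.1%} of parameters")
```

Every expert still has to be resident somewhere in memory, because the router may pick any of them for the next token. Only the per-token compute scales with the 32B active figure.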

⚙ MOE ARCHITECTURE

384 Experts. 32B Active.

8 specialist + 1 shared expert activated per token. The knowledge capacity of 1 trillion parameters at the inference cost of 32 billion.

384 Total experts

9 Active per token

61 Layers

03 — MUONCLIP OPTIMIZER

MuonClip: Stable Training at Unprecedented Scale

Training a 1T-parameter MoE model is a genuinely hard engineering challenge. MoE architectures at this scale are prone to attention logit explosion — where values in the attention layers grow unbounded during training, causing catastrophic loss spikes and requiring expensive checkpoint rollbacks. Historically, training runs at this scale have required multiple interventions to recover from instability events.

Moonshot AI developed MuonClip specifically to solve this problem. MuonClip builds on the Muon optimizer, which takes a Nesterov-style momentum update and approximately orthogonalizes it (via a Newton-Schulz iteration) before applying it to each weight matrix, and runs it at trillion-parameter scale. The key addition is the qk-clip technique: rather than relying on gradient clipping after the fact, MuonClip directly rescales the query and key projection matrices in the attention mechanism whenever attention logits grow past a threshold, so the logits stay bounded and explosion never gets started.
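The idea of adjusting the query and key projections directly can be illustrated with a toy example. This is a sketch under stated assumptions, not Moonshot's code: the threshold `TAU` is a made-up value, the rescale factor is split evenly between the two matrices, and logits are plain dot products without the usual scaling by head dimension.

```python
import math

def project(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def max_attention_logit(Wq, Wk, tokens):
    """Largest unscaled q.k logit over all token pairs."""
    queries = [project(Wq, x) for x in tokens]
    keys = [project(Wk, x) for x in tokens]
    return max(sum(a * b for a, b in zip(q, k)) for q in queries for k in keys)

def qk_clip(Wq, Wk, tokens, tau):
    """Rescale Wq and Wk so the largest logit does not exceed tau."""
    s_max = max_attention_logit(Wq, Wk, tokens)
    if s_max <= tau:
        return Wq, Wk  # already within bounds, weights untouched
    scale = math.sqrt(tau / s_max)  # assumed: split evenly between Wq and Wk
    shrink = lambda W: [[w * scale for w in row] for row in W]
    return shrink(Wq), shrink(Wk)

Wq = [[3.0, 0.0], [0.0, 3.0]]
Wk = [[3.0, 0.0], [0.0, 3.0]]
tokens = [[1.0, 2.0], [2.0, 1.0]]
TAU = 10.0  # hypothetical threshold for illustration

before = max_attention_logit(Wq, Wk, tokens)
Wq_c, Wk_c = qk_clip(Wq, Wk, tokens, TAU)
after = max_attention_logit(Wq_c, Wk_c, tokens)
print(before, round(after, 6))
```

Because each logit is bilinear in the two matrices, scaling both by the square root of `tau / s_max` scales every logit by exactly `tau / s_max`, which brings the largest one down to the threshold while preserving the relative ordering of attention scores.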

The outcome was a complete pre-training run of 1 trillion parameters on 15.5 trillion tokens with zero training instability. This is an unusual achievement — most large MoE training runs encounter instability events requiring restart from checkpoint. Moonshot's ability to complete a clean 15.5T token run at this scale is a significant engineering validation of MuonClip's design, and it carries forward to K2.5 and K2.6 with the same stability characteristics.

Muon optimizer base — Nesterov momentum in the spectral domain, superior gradient conditioning vs Adam
Novel qk-clip technique — proactively prevents attention logit explosion in MoE layers
Zero training instability — clean run across all 15.5T tokens, no loss spike interventions
First applied at 1T scale — unprecedented use of Muon-family optimizers
Carries through the K2 family — MuonClip is consistent from K2 to K2.6

🛠 MUONCLIP OPTIMIZER

Zero Instability. 15.5T Tokens.

A custom optimizer that solved attention logit explosion at trillion-parameter MoE scale, enabling a clean pre-training run with no interventions.

15.5T Tokens trained

0 Instabilities

QK-Clip Novel technique

04 — K2 THINKING · NOV 2025

K2 Thinking: Where Reasoning Meets Tool Use

Kimi K2 Thinking is the post-trained reasoning variant released in November 2025, four months after the original K2. It introduced a capability that had not been achieved reliably at production scale before: seamlessly interleaved chain-of-thought reasoning with native tool calls.

In practice, this means K2 Thinking can reason about a problem, determine that it needs external data or a code execution result, call the appropriate tool, receive the result, reason about the result, decide on a follow-up action, call another tool, and continue this reasoning-action chain for up to 200–300 sequential steps without losing track of the overall task. Earlier reasoning models would either complete their full thinking trace before acting, or execute tool calls without integrated reasoning between them — not both simultaneously.

K2 Thinking also introduced Quantization-Aware Training (QAT) at the post-training stage. By applying INT4 weight-only quantization to the MoE components during training, rather than as post-hoc compression, K2 Thinking achieves a ~2× generation speed improvement over FP16 inference without accuracy degradation. All K2 Thinking benchmark results are measured under INT4 precision, making them directly comparable to production deployment characteristics.
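The core mechanic of quantization-aware training is that the forward pass sees weights that have already been rounded to the low-precision grid. The sketch below shows that quantize-then-dequantize round trip for symmetric INT4. The group size and the symmetric scheme are assumptions for illustration; this page does not specify K2 Thinking's exact quantization recipe, and the gradient handling used during training is omitted.

```python
INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit integer range

def fake_quant_int4(weights, group_size=32):
    """Round weights to a per-group INT4 grid and map them back to floats."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / INT4_MAX if max_abs > 0 else 1.0  # one scale per group
        for w in group:
            q = max(INT4_MIN, min(INT4_MAX, round(w / scale)))  # integer code
            out.append(q * scale)                               # dequantized value
    return out

weights = [0.70, -0.35, 0.10, 0.00, 0.02, -0.01, 0.015, 0.005]
deq = fake_quant_int4(weights, group_size=4)
for w, d in zip(weights, deq):
    print(f"{w:+.4f} -> {d:+.4f}")
```

Using a separate scale per group keeps small-magnitude weights from being crushed to zero by a single large outlier elsewhere in the tensor, which is the usual motivation for group-wise schemes.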

Interleaved thinking + tool use — reason, act, observe, reason, act — seamlessly
Up to 300 sequential tool calls — without losing task coherence
Native INT4 via QAT — ~2× speed vs FP16, no accuracy degradation
Recommended temperature: 1.0 — for optimal reasoning chain quality
reasoning_content in API response — access full chain-of-thought trace for debugging
Partners: Tencent CodeBuddy, Genspark — production deployments built on K2 Thinking

API access

K2 Thinking API: platform.moonshot.ai, model: "kimi-k2-thinking". Recommended temperature=1.0. Access the reasoning trace via response.choices[0].message.reasoning_content. Tool-calling API follows OpenAI's function-calling schema exactly.
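Since the tool-calling API is described as following OpenAI's function-calling schema, a tool declaration and the handling of one tool call might look like the sketch below. The `get_weather` tool, its return value, and the `REGISTRY` dispatch table are hypothetical names introduced for illustration, and no network request is made.

```python
import json

# Hypothetical tool, declared in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    """Stand-in implementation; a real tool would query a weather service."""
    return {"city": city, "temp_c": 21}

REGISTRY = {"get_weather": get_weather}

def execute_tool_call(tool_call):
    """Run one tool call and build the `tool` message to send back."""
    fn = REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(fn(**args)),
    }

# Shape of a tool call as it appears in an assistant message (dict form).
example_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'},
}
tool_message = execute_tool_call(example_call)
print(tool_message)
```

In a live integration, `TOOLS` would be passed as the `tools` argument of the chat completion request, and `tool_message` would be appended to the conversation before the next request.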

🧠 K2 THINKING

Reason. Act. Observe.

Interleaved chain-of-thought and tool use — up to 300 sequential steps. Think, call a tool, see the result, think, call another tool. Seamlessly.

300 Tool calls

2× Speed (INT4)

INT4 Native QAT

05 — KEY FEATURES

Key Features of Kimi K2

Kimi K2 was engineered specifically for agentic intelligence — designed to perform reliably in the multi-step, tool-using, context-heavy workflows that real AI applications require, not just to score well on static benchmarks.

// 01

Native Agentic Intelligence

Post-trained extensively on synthetic agentic scenarios — multi-step tool use, autonomous problem-solving, error recovery, and long-horizon task execution. K2 was trained to understand when and how to invoke tools in complex workflows, not just to understand tool schemas abstractly.

Built for autonomous work

// 02

OpenAI + Anthropic Compatible API

K2's API at platform.moonshot.ai is compatible with both the OpenAI and Anthropic SDK standards. Any integration using OpenAI's messages format can switch to K2 by changing two lines: base_url and model name. The Anthropic-compatible endpoint scales temperature by 0.6 for existing Claude integrations.

Zero refactoring to migrate

// 03

Open Weights — Modified MIT

Both K2-Base and K2-Instruct model weights are available on HuggingFace under the Modified MIT License. Commercial use is permitted. The one added condition applies only to deployments serving more than 100M monthly active users or generating more than $20M/month in revenue, which must credit "Kimi K2" visibly in the product UI. That threshold is irrelevant for the vast majority of developers and companies.

Commercial use permitted

// 04

128K Context Window

K2 ships with a 128K token context — approximately 96,000 words or 192 pages of text in a single session. Sufficient for large codebases, extensive document analysis, and multi-hour agentic sessions. The K2.5 and K2.6 successors extended this to 256K and 262K respectively for longer-horizon workflows.

~96,000 words per session
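The word and page figures above follow from two ratios, which also give a quick way to check whether a document is likely to fit. The ratios are the ones implied by this page (0.75 words per token, 500 words per page); real token counts depend on the tokenizer and the language, so treat this as a rough pre-check only. The `reserve_for_output` default is an arbitrary illustrative value.

```python
CONTEXT_TOKENS = 128_000   # K2 context window, from the text
WORDS_PER_TOKEN = 0.75     # implied by 128K tokens ~ 96,000 words
WORDS_PER_PAGE = 500       # implied by 96,000 words ~ 192 pages

def estimate_tokens(text):
    """Rough token estimate from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_context(text, reserve_for_output=4_096):
    """True if the estimated prompt plus an output reserve fits the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_TOKENS

max_words = int(CONTEXT_TOKENS * WORDS_PER_TOKEN)
max_pages = max_words // WORDS_PER_PAGE
print(max_words, max_pages)
```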

// 05

Flexible Multi-Path Deployment

Access via kimi.com for zero-setup; platform.moonshot.ai API for production integration; HuggingFace weights for self-hosted deployment with vLLM, SGLang, KTransformers, or TensorRT-LLM; or Kimi Code CLI for terminal-native coding workflows. Every path uses the same underlying model.

API · Cloud · Self-hosted · CLI

// 06

Cost-Efficient at Frontier Performance

MoE sparse activation means 32B active parameters despite 1T total. Running K2's full benchmark evaluation suite costs approximately $0.27, versus $0.48 for GPT-5.2 and $1.14 for Claude Opus 4.5. On those figures K2 is about 1.8× cheaper than GPT-5.2 and about 4.2× cheaper than Claude Opus 4.5 at comparable performance levels.

About 4× cheaper than Claude Opus 4.5

06 — BENCHMARKS

Kimi K2 Benchmark Results

Released in July 2025, Kimi K2 immediately became one of the highest-performing open-source models on coding, agentic, and reasoning benchmarks. The results below reflect K2 Instruct performance at release, sourced from Moonshot AI's official technical report and independent evaluations.

Benchmark	Category	K2 Score
SWE-bench Verified	Coding	65.8%
MMLU-Pro	Knowledge	73.3%
GPQA Diamond	Science	71.0%
MATH-500	Maths	90.6%
HumanEval	Coding	88.2%
LiveCodeBench	Coding	57.6%
AIME 2025	Maths	66.2%
tau2-bench (Airline)	Agentic	80.0%

Scores from Moonshot AI K2 technical report (July 2025). K2 Thinking results reported under INT4 precision. For comparison: K2.5 (Jan 2026) reached 76.8% SWE-bench; K2.6 (Apr 2026) reached 80.2%. See K2.6 page for current frontier numbers.

K2 vs K2.5 vs K2.6 at a glance

K2 (Jul 2025): SWE-bench 65.8%, 128K context, text only, no Agent Swarm. K2.5 (Jan 2026): 76.8% SWE-bench, 256K context, native vision (MoonViT), Agent Swarm 100 agents. K2.6 (Apr 2026): 80.2% SWE-bench, 262K context, Agent Swarm 300 agents / 4,000 steps. K2 remains a strong cost-efficient production model when K2.5/K2.6 capability isn't needed.

07 — HOW TO USE KIMI K2

How to Use Kimi K2

Four deployment paths cover every use case from no-setup exploration to fully self-hosted infrastructure.

// PATH 01

kimi.com — No Setup Required

Access K2 through kimi.com (web) or the Kimi App (iOS/Android). Instant mode for fast responses; Thinking mode for deep reasoning chains. No API key or infrastructure needed. Free tier available with basic quotas; paid plans unlock higher usage.

Free · No setup

// PATH 02

Official API — Production

Get your key at platform.moonshot.ai. Use model="kimi-k2-instruct" or "kimi-k2-thinking" with any OpenAI SDK. Fully OpenAI and Anthropic-compatible. Token-based billing, separate from app membership.

OpenAI-compatible · Production

// PATH 03

Open Weights — Self-Host

Download K2-Base or K2-Instruct from HuggingFace (block-fp8 format). Deploy with vLLM, SGLang, KTransformers, or TensorRT-LLM. Run the Kimi Vendor Verifier before production traffic. Best for privacy-critical or regulated deployments.

Full infrastructure control

// PATH 04

Kimi Code CLI

Use K2 as the engine for the open-source Kimi Code CLI terminal agent. Integrates with VS Code, Cursor, JetBrains, and Zed. Supports autonomous coding sessions, codebase navigation, and multi-file editing workflows natively.

Terminal · VS Code · Cursor

Python API Quick Start

Python · OpenAI SDK · K2 Instruct & K2 Thinking

# pip install openai
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",  # get at platform.moonshot.ai
    base_url="https://api.moonshot.ai/v1",
)

# K2 Instruct: conversational & agentic (temperature 0.6)
response = client.chat.completions.create(
    model="kimi-k2-instruct",
    temperature=0.6,
    max_tokens=4096,
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant by Moonshot AI."},
        {"role": "user", "content": "Debug this function and explain the fix..."},
    ],
)
print(response.choices[0].message.content)

# K2 Thinking: extended chain-of-thought (temperature 1.0)
thinking = client.chat.completions.create(
    model="kimi-k2-thinking",
    temperature=1.0,
    messages=[{"role": "user", "content": "Prove this algorithm is O(n log n)..."}],
)

# Access the chain-of-thought reasoning trace
print(thinking.choices[0].message.reasoning_content)
print(thinking.choices[0].message.content)

08 — USE CASES

What Can You Build with Kimi K2

K2's combination of agentic capability, native tool use, open weights, and cost efficiency makes it a strong foundation for autonomous systems that require deep reasoning and reliable multi-step execution. These are the highest-impact application patterns.

💻 Autonomous Code Agents

K2's instruction-following and logical reasoning capabilities underpin autonomous programming agents. It handles debugging, refactoring, and multi-step development workflows, with up to 300 sequential tool calls in K2 Thinking mode.

🔬 Long-Horizon Research

K2 Thinking's 300-step tool-call support makes it viable for long research tasks: multi-source web search, data synthesis, competitive analysis, and structured report generation with source validation.

⚖️ Legal & Document Processing

Optimized for rigorous attention to detail — contract review, patent drafting, legal document analysis, and annotation workflows requiring strict adherence to terminology and logical structure.

📊 Financial Intelligence

AlphaEngine built a FinGPT Agent on K2 Thinking supporting 300+ tool calls — macroeconomic analysis, research report processing, supply chain breakdown, and automated financial report generation.

🏗️ Full-Stack Development

Generate complete applications from a description: backend logic, API design, database schema, and frontend — held simultaneously within K2's 128K context window for consistent cross-layer implementation.

🌐 Multilingual Applications

160K vocabulary with coverage of 50+ languages makes K2 effective for cross-language summarization, translation pipelines, and global content workflows. Particularly strong for CJK and English contexts.

🔧 Developer Tooling

Tencent CodeBuddy and Genspark run production coding agents on K2. The OpenAI-compatible API makes K2 a drop-in replacement in existing developer tool infrastructure without refactoring.

🧬 Scientific & Domain Research

Kimi K2's reasoning capabilities extend to specialized scientific domains. DP Technology and XtalPi deploy K2.5 (built on K2's base) for chemical literature understanding and drug discovery workflows.

09 — K2 MODEL FAMILY

The Kimi K2 Family: 9 Months of Evolution

Between July 2025 and April 2026, Moonshot AI shipped five major model updates in the K2 lineage. Each release targeted a specific capability frontier while preserving the same underlying trillion-parameter MoE architecture and MuonClip-stabilized training approach. This cadence — a major update roughly every two months — is faster than any closed-source frontier lab over the same period.

July 2025

Kimi K2 — Open-Source Frontier Foundation

1T MoE, 32B active, 128K context. MuonClip optimizer for zero-instability training on 15.5T tokens. K2-Base + K2-Instruct variants. Modified MIT License. First open-source model at this capability tier, setting the trillion-parameter open-source baseline for agentic AI.

K2-09

September 2025

K2-Instruct-0905 — Coding Capability Sharpened

An interim instruct update targeting coding performance improvements. SWE-bench accuracy improved versus the July checkpoint. Genspark deploys this specific version ("Kimi K2 (0905)") as the engine for their autonomous agent platform — a production endorsement of the 0905 checkpoint's reliability.

K2-T

November 2025

K2 Thinking — Interleaved Reasoning & Tool Use

Post-trained reasoning variant. Interleaves chain-of-thought with native tool calls, supporting 200–300 sequential steps. Native INT4 quantization via QAT for 2× speed improvement over FP16. Tencent CodeBuddy integrates K2 Thinking as its core model for developer workflows. Recommended temperature: 1.0.

K2.5

January 27, 2026

Kimi K2.5 — Visual Agentic Intelligence

Continual pretraining on 15T mixed visual + text tokens. MoonViT 400M encoder for native multimodal input (images + video). Context extended to 256K. Agent Swarm v1: 100 parallel sub-agents, 1,500 tool calls, 4.5× speedup on research tasks. SWE-bench 76.8%. BrowseComp (Swarm): 78.4%.

K2.6

April 20, 2026 — Current GA

Kimi K2.6 — Long-Horizon Agentic Coding General Availability

Context extended to 262K. Agent Swarm v2: 300 sub-agents and 4,000 coordinated steps, 3× the sub-agent count of K2.5. Claw Groups for cross-model collaboration. Document-to-Skill conversion. SWE-bench 80.2%. BrowseComp Swarm 86.3%. DeepSearchQA F1 92.5%. Beta label removed; GA release.

10 — COMPARE

Kimi K2 vs Other AI Models

At its July 2025 release, K2's most significant differentiator was combining frontier-quality benchmarks, native tool use, and open, commercially usable weights. No other model in the same performance tier offered all three simultaneously.

Model	Kimi K2 (Moonshot AI, Jul 2025)	GPT-4.1 (OpenAI)	Claude Opus 4 (Anthropic)	Llama 3.1 405B (Meta)	DeepSeek-V3 (DeepSeek)
Architecture & Access
Open weights	✓ Modified MIT	✗	✗	✓ Llama license	✓
Total parameters	1T (MoE)	~200B est.	~200B est.	405B dense	671B (MoE)
Active parameters	32B	Dense	Dense	405B	37B
Context window	128K	1M	200K	128K	128K
Key Benchmarks (contemporary comparisons)
SWE-bench Verified	65.8%	~50%	~72%	~30%	~49%
MMLU-Pro	73.3%	~72%	~74%	~73%	~75%
GPQA Diamond	71.0%	~60%	~73%	~51%	~60%
tau2-bench (agentic)	80.0%	~65%	~70%	~55%	~60%
Cost
API input $/1M tokens	~$0.60	$2.00	$15.00	~$0.80	$0.27
Commercial use	✓ Free (most)	API only	API only	License terms	✓
Self-hostable	✓ vLLM, SGLang	✗	✗	✓	✓

Benchmarks at K2 release (July 2025). Competitor comparisons are contemporary estimates from available evaluations. Verify current numbers on official model pages. For April 2026 frontier comparisons, see the K2.6 guide.

11 — TIPS & BEST PRACTICES

Getting the Most from Kimi K2

Use K2 Instruct for most tasks; K2 Thinking for hard problems

K2 Instruct (temperature=0.6) is tuned for fluid conversation and reliable agentic execution. K2 Thinking (temperature=1.0) activates extended reasoning — meaningfully more accurate on complex multi-constraint problems, but slower. Match the model variant to the task. Defaulting to Thinking for everything wastes time and cost.

Access reasoning traces for debugging complex failures

When using K2 Thinking, always log response.choices[0].message.reasoning_content alongside the main content. The full chain-of-thought trace reveals exactly where the model's logic diverged when outputs are incorrect — invaluable for debugging agentic workflow failures that aren't obvious from the final response alone.
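A small helper makes it easy to log the trace on every call without crashing when the field is missing. This is a sketch: the logger name and `request_id` parameter are illustrative, and the `SimpleNamespace` object is a stand-in that mimics the attribute layout of an SDK response so the example runs without an API key.

```python
import logging
from types import SimpleNamespace

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("k2.trace")  # hypothetical logger name

def extract_trace(response):
    """Return (reasoning, answer); reasoning is None if the field is absent."""
    message = response.choices[0].message
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content

def log_turn(response, request_id):
    """Log the reasoning trace and the final answer under one request id."""
    reasoning, answer = extract_trace(response)
    log.info("request=%s reasoning=%r", request_id, reasoning)
    log.info("request=%s answer=%r", request_id, answer)
    return reasoning, answer

# Stand-in object with the same attribute layout as an SDK response.
fake = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(
    reasoning_content="Compare the two sort calls...",
    content="Use a stable sort.",
))])
reasoning, answer = log_turn(fake, request_id="demo-1")
```

Reading the field with `getattr` and a default means the same helper works for responses from models that do not return a reasoning trace.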

Design for iterative tool-call sequences, not single mega-prompts

K2's greatest strength emerges in multi-step execution. Design your system to let K2 plan, execute a tool call, observe the result, and decide what to do next — rather than cramming all instructions into one prompt. K2 supports 200–300 sequential tool calls. Use this budget for genuinely complex, long-horizon tasks.
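The plan, execute, observe cycle can be written as a bounded loop. The sketch below assumes OpenAI-style message dictionaries; `call_model` is an injected callable so the loop can be exercised with a stub, and `stub_model` and the `add` tool are hypothetical stand-ins for a real K2 request and real tools.

```python
import json

def run_agent(call_model, tools, messages, max_steps=300):
    """Plan-act-observe loop: stop when the model answers without tool calls."""
    for step in range(max_steps):
        reply = call_model(messages)          # assistant message as a dict
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"], step + 1
        for call in calls:
            fn = tools[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(fn(**args)),
            })
    raise RuntimeError(f"no final answer within {max_steps} steps")

# Stub model for illustration: requests one tool call, then answers.
def stub_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_0", "type": "function",
            "function": {"name": "add", "arguments": '{"a": 2, "b": 3}'}}]}
    return {"role": "assistant", "content": "The sum is " + messages[-1]["content"]}

answer, steps = run_agent(
    stub_model,
    {"add": lambda a, b: a + b},
    [{"role": "user", "content": "What is 2 + 3?"}],
)
print(answer, steps)
```

The explicit `max_steps` cap turns the step budget into a hard limit, so a model that never stops calling tools fails loudly instead of running indefinitely.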

Use the Anthropic-compatible endpoint for Claude migrations

If migrating from Claude, use K2's Anthropic-compatible endpoint at platform.moonshot.ai. Note that this endpoint applies temperature scaling (real_temperature = request_temperature × 0.6) for compatibility with existing Claude integrations. Validate your specific task distribution before switching production traffic.
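The scaling rule is simple enough to encode directly, which helps when an existing integration has temperatures hard-coded. The function names below are illustrative; only the 0.6 factor comes from this page.

```python
ANTHROPIC_ENDPOINT_SCALE = 0.6  # from the text: real = request * 0.6

def effective_temperature(request_temperature):
    """Temperature the model actually samples at for a given request value."""
    return request_temperature * ANTHROPIC_ENDPOINT_SCALE

def request_temperature_for(target):
    """Request value that yields `target` as the effective temperature."""
    return target / ANTHROPIC_ENDPOINT_SCALE

print(effective_temperature(1.0))
print(round(request_temperature_for(0.6), 6))
```

Under this rule, a Claude integration that sends a temperature of 1.0 lands on the 0.6 recommended for K2 Instruct without any change to the request.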

Validate self-hosted deployments before serving production

For self-hosted K2 deployments, run the official Kimi Vendor Verifier before routing production traffic. Quantization-related output drift is subtle and may not surface during casual testing. The Vendor Verifier systematically checks output characteristics against Moonshot's reference implementation across representative tasks.

For new projects requiring vision or longer context, start with K2.5 or K2.6

K2 remains a solid cost-efficient model for text-only workflows. For new projects: if you need native image/video understanding, use K2.5 (256K context, MoonViT encoder). If you need Agent Swarm or the best possible long-horizon coding reliability, use K2.6 (262K, 300 agents, GA stability). Switching between K2 variants is a one-line model string change in your API calls.
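The selection guidance above amounts to a short decision rule. The sketch below returns family labels rather than API model strings, because this page does not list the identifiers for K2.5 and K2.6. The context limits are the nominal sizes quoted on this page, and the function name and parameters are illustrative.

```python
# Nominal context sizes quoted on this page, in tokens.
CONTEXT_LIMITS = {"K2": 128_000, "K2.5": 256_000, "K2.6": 262_000}

def pick_variant(needs_vision=False, needs_agent_swarm=False, context_tokens=0):
    """Return the K2 family member suggested by the guidance above."""
    if context_tokens > CONTEXT_LIMITS["K2.6"]:
        raise ValueError("request exceeds the largest context window listed")
    if needs_agent_swarm or context_tokens > CONTEXT_LIMITS["K2.5"]:
        return "K2.6"
    if needs_vision or context_tokens > CONTEXT_LIMITS["K2"]:
        return "K2.5"
    return "K2"

print(pick_variant())
print(pick_variant(needs_vision=True))
print(pick_variant(needs_agent_swarm=True))
```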

12 — LIMITATIONS

Limitations of Kimi K2

// 01

Text-Only — No Vision

K2 does not process images or video. Native multimodal capability was the headline addition in K2.5 (January 2026) via the MoonViT 400M encoder. For any workflow involving screenshots, diagrams, visual design, chart analysis, or video understanding, use K2.5 or K2.6 instead.

Text-only model

// 02

128K Context vs Later Models

K2's 128K context is sufficient for many workflows but is half the 256K of K2.5 and slightly under half the 262K of K2.6 (128 / 262 ≈ 0.49). For very large codebases, full 200-page document sets, or sustained multi-hour agent sessions, the longer context of K2.5/K2.6 provides a material workflow advantage.

128K vs 256K (K2.5) and 262K (K2.6)

// 03

Self-Hosting Requires Enterprise GPU

Despite MoE's efficiency advantage (32B active parameters), the full K2 model still requires enterprise-grade GPU infrastructure — A100-class at minimum for practical throughput. Block-fp8 format reduces storage requirements, but consumer GPU deployments cannot achieve usable inference speeds at production scale.

A100-class GPU minimum

// 04

Commercial License Threshold

The Modified MIT License restricts deployments serving more than 100M monthly active users or generating more than $20M/month in revenue — those must credit "Kimi K2" visibly in the UI. For the overwhelming majority of teams, this threshold is irrelevant. Enterprise legal review is recommended for large-scale commercial deployments.

Review license for large scale
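For teams that want an automated reminder, the thresholds as stated on this page can be encoded as a simple check. This is a sketch of the page's summary only; the license text itself is the authority, and the function and constant names are illustrative.

```python
MAU_THRESHOLD = 100_000_000                  # monthly active users, from the text
MONTHLY_REVENUE_THRESHOLD_USD = 20_000_000   # US dollars per month, from the text

def needs_ui_attribution(monthly_active_users, monthly_revenue_usd):
    """True if either threshold described on this page is exceeded."""
    return (monthly_active_users > MAU_THRESHOLD
            or monthly_revenue_usd > MONTHLY_REVENUE_THRESHOLD_USD)

print(needs_ui_attribution(2_000_000, 150_000))
print(needs_ui_attribution(120_000_000, 150_000))
```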

13 — FAQ

Frequently Asked Questions About Kimi K2

Have more questions? Visit our contact page or the official GitHub repository.

Start Building with Kimi K2

Free to try at kimi.com. Open weights on HuggingFace. API access at platform.moonshot.ai. The full K2 lineage, from K2 to K2.6, is available today.