Kimi API - Kimi K2.5 API
Build apps with Kimi using a fast, OpenAI-style API. Send chat requests, connect your own tools, and power everything from support bots to research assistants. Choose K2 for reliable text and coding, or K2.5 for visual workflows like screenshot-to-code.
Chat + Tool Calling
Use a familiar chat-completions workflow, then add tools (functions) so Kimi can search, fetch data, run actions, and return structured outputs.
K2 vs K2.5 Models
Pick the right engine for your job: K2 for text-first agents and coding, K2.5 for multimodal tasks like UI screenshots, visual coding, and richer automation.
Comparison table (Kimi API vs other APIs)
| What you care about | Kimi API | OpenAI API | Anthropic API | Gemini API | Mistral API | Cohere API | Groq API | OpenRouter |
|---|---|---|---|---|---|---|---|---|
| API style / compatibility | OpenAI-SDK compatible across many endpoints; migration guide exists | Native /v1/chat/completions and newer Responses API | Messages API (/v1/messages) | generateContent / streaming variants | Chat endpoints + tool calling controls (tool_choice) | Chat endpoint + tool use | OpenAI-compatible endpoints (/openai/v1/chat/completions) | OpenAI Chat API normalized across many providers/models |
| Tool / function calling | Tool Use supported | Function calling + built-in tools in Responses | Tool use supported (Messages) | Function calling supported | Tool calling / function calling supported | Tool use patterns (parallel, multi-step) | Depends on model; OpenAI-compatible interface | Depends on underlying model; router doesn’t remove need for tool logic |
| Long context | Platform highlights up to 256K context | Model-dependent; documented per model | Model-dependent | Model-dependent; long-context docs exist | Model-dependent | Model-dependent | Model-dependent | Model-dependent (varies per routed model) |
| Multimodal (images/audio/video) | Model-dependent (K2.5 quickstart exists) | Supports text+image in Responses; more modalities via platform features | Vision supported (per Messages docs) | Explicit multimodal support (text, images, audio, etc.) | Has multimodal models + “agents” docs mention multimodal availability | Mostly text-focused; tool+citations strengths | Depends on model offered via Groq | Depends on model/provider you choose |
| Best “why choose it” | OpenAI-compatible feel + tool calling + long context | Deep platform features + broad docs + built-in tools | Strong tooling patterns + Messages-first API | Strong multimodal ecosystem + Google infra | Clean tool calling controls + growing agent features | Tool use + citations workflows are first-class | Very fast OpenAI-compatible inference focus | One endpoint, many models + routing/fallback |
Kimi API (Complete Developer Guide)
If you’ve used modern LLM APIs before, the Kimi API will feel familiar because it’s designed to be compatible with popular chat-completions style workflows while still offering some very “Kimi-specific” strengths: long context, agentic tool calling, and (with newer models) multimodal + visual coding workflows. The goal of this guide is to help you go from “I heard Kimi is good” to “I shipped something stable in production.”
Kimi is developed by Moonshot AI and offered through their Open Platform. The Open Platform provides an OpenAI-compatible base URL for API calls (plus region-specific alternatives), a model list endpoint, file endpoints, and documentation for tool calling and rate limiting.
What is the Kimi API?
The Kimi API is the set of endpoints you call to use Kimi models inside your own products: apps, bots, websites, internal tools, automation pipelines, and developer workflows. In practice, the most common flow looks like this:
1. You generate an API key in the Moonshot Open Platform console.
2. Your server sends a request to the chat completions endpoint using that key.
3. You optionally attach tools/functions, files, or other structured inputs so Kimi can do agent-style work.
If you already have code for the OpenAI Chat Completions pattern, it’s often straightforward to migrate by changing the base URL + model names and adjusting a few small differences.
Why developers choose Kimi API
People typically pick Kimi for one (or more) of these reasons:
1) Long-context workflows that feel practical
Kimi is known for handling very large inputs, which matters when you’re building:
- Document summarizers
- "Chat with docs" tools
- Research agents that hold lots of context
- Codebase assistants that need multiple files at once
Kimi’s K2 / K2.5 family is also frequently described with very large context windows in official and ecosystem documentation.
2) Tool calling designed for agentic tasks
Tool calling (sometimes called function calling) is a big deal if you want Kimi to:
- Call your search API
- Query a database
- Fetch web pages
- Run calculations
- Create multi-step plans and execute them
Kimi K2 documentation explicitly highlights tool calling capability and shows patterns that match the “tools array” style many developers already use.
3) Modern coding and “visual coding” direction
Kimi’s newer positioning (especially around K2.5) puts a spotlight on design-to-code and agent swarm ideas, where the model can coordinate multi-step work and handle rich office outputs.
How Kimi API is structured
Think of the Kimi API as three layers:
Layer A - Transport + Auth
- HTTPS requests
- JSON request/response
- Authorization via API key (Bearer token in the `Authorization` header)
Layer B - Core endpoints
- `/v1/chat/completions` (the main one)
- `/v1/models` (discover model IDs)
- `/v1/files` (upload and reference documents)
Layer C - Product behaviors
- Tool calling / function calling
- Long context + document workflows
- Multimodal support (model-dependent)
- Rate limits and quotas that shape production behavior
Base URL options (global, China, and overseas routing)
One of the easiest ways to avoid headaches is to use the right base URL for where your servers run.
Commonly documented options include:
- Global base URL: `https://api.moonshot.ai/v1`
- China base URL: `https://api.moonshot.cn/v1`
- Overseas suggestion (help/FAQ): some guidance recommends an alternate base URL for overseas calling (example: a Singapore route).
Practical tip: pick one base URL, set it in config (env var), and keep it consistent across your services. If you deploy multi-region, treat base URL as part of per-region config.
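A minimal sketch of config-driven setup, assuming the OpenAI Python SDK and illustrative env var names:

```python
import os
from openai import OpenAI

# Read the base URL from per-environment config so every service in a
# region agrees on one value. Env var names here are illustrative.
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url=os.environ.get("MOONSHOT_BASE_URL", "https://api.moonshot.ai/v1"),
)
```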
Getting an API key (safe, production-friendly)
Step 1: Create an account + open the console
The Moonshot Open Platform console is where you manage billing, projects, and keys.
Step 2: Generate the key
Create a key, copy it once, and store it in a password manager or secrets vault. Many guides note keys often start with a prefix like `sk-`.
Step 3: Store it properly
Do:
- Store in server-side env vars (`MOONSHOT_API_KEY`)
- Use a secrets manager (AWS Secrets Manager, GCP Secret Manager, Vault, etc.)
- Rotate keys per environment (dev/staging/prod)
Don’t:
- Ship keys in frontend JavaScript
- Commit keys to Git
- Paste keys into public bug reports
Your first request (Chat Completions)
The Kimi platform documents a Chat Completions endpoint under:
- `POST https://api.moonshot.ai/v1/chat/completions`
cURL example
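A minimal request sketch (the body follows the OpenAI chat-completions shape; the model ID is illustrative, see the notes below):

```bash
curl https://api.moonshot.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MOONSHOT_API_KEY" \
  -d '{
    "model": "kimi-k2-0905-preview",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Say hello in one sentence."}
    ],
    "temperature": 0.6
  }'
```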
Notes:
- The model name above is an example. Use `/v1/models` to discover the exact IDs your account can call.
- Kimi K2's own documentation recommends a temperature like `0.6` for the instruct variant in many cases.
Discovering models
Why you should always call `/v1/models`
Model naming changes. New variants appear. Some models are region- or account-dependent. The clean solution is: fetch the model list at startup (or cache it daily) so your UI can display available models.
The Moonshot docs mention:
- `GET https://api.moonshot.ai/v1/models`
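A sketch of startup-time discovery, assuming the OpenAI Python SDK pointed at the Kimi base URL:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.moonshot.ai/v1")

# Fetch the list once at startup (or on a daily cache refresh) instead of
# hardcoding model IDs that may change or vary by account and region.
available_models = [m.id for m in client.models.list().data]
print(available_models)
```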
How to pick models in a real product
Instead of picking “the best” model globally, pick by job:
- Fast chat / customer support: prefer an "instant" or "turbo" style variant if offered (lower latency, cheaper).
- Deep reasoning / complex planning: use a "thinking" variant if available.
- Design → code / screenshots / UI analysis: use a multimodal model (for Kimi, K2.5 is positioned strongly here).
- Agentic tool calling: choose a model that reliably emits tool calls and handles long multi-step runs.
OpenAI-compatible behavior
A lot of developers don’t want to learn “another completely new API shape.” Kimi leans into that reality.
Kimi K2’s GitHub docs state that Moonshot provides an OpenAI/Anthropic-compatible API via the Open Platform.
Moonshot’s “migrating from OpenAI” guide also lists endpoints compatible with OpenAI (including chat completions and file endpoints).
What this means in practice
If you already have something like:
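```python
import os
from openai import OpenAI

# Illustrative sketch: a typical OpenAI SDK client.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```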
You can often do something like:
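```python
import os
from openai import OpenAI

# Illustrative sketch: same SDK, pointed at Kimi by swapping key + base URL.
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url="https://api.moonshot.ai/v1",
)
```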
And then keep using `chat.completions.create(...)` style calls, changing model IDs and handling any small behavioral differences.
Tool calling (function calling) with Kimi
Tool calling is how you turn an LLM from “text generator” into “doer.”
The basic pattern
1. You send a request with a list of tools (JSON schema).
2. The model chooses to call one (or many).
3. Your code executes the tool(s).
4. You append tool results to the conversation.
5. You call the model again until it returns a final answer.
Kimi K2 documentation shows an end-to-end tool calling loop with `tools` and `tool_choice="auto"`.
Example tool schema (weather tool)
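A sketch in the OpenAI-style tools format (the `get_weather` function, its parameters, and the field values are hypothetical):

```python
# Hypothetical weather tool described as a JSON-schema function definition.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Berlin'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city"],
            },
        },
    }
]
```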
Production tips for tool calling
- Validate tool arguments (never blindly trust LLM output).
- Add timeouts and retries for external calls.
- Return small tool outputs when possible (avoid flooding context).
- Keep a tool audit log (for debugging and safety review).
Files: uploading documents for “chat with docs”
Kimi’s platform docs include a files API.
This is useful when you want to:
- Attach PDFs
- Analyze reports
- Summarize contracts
- Extract tables
- Build doc-driven workflows
A reliable UX flow for files
1. Upload file → get a `file_id`
2. Store `file_id` in your DB
3. Reference it in later chat requests
4. Cache summaries to reduce repeated token spend
Design note: Don’t make the user re-upload the same file for every question. That’s slow and expensive.
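As a sketch of the upload half of that flow, assuming the OpenAI-compatible files interface (the `purpose` value follows patterns shown in Moonshot's file docs, and `save_file_id` is a hypothetical persistence helper; verify the exact calls against the current docs):

```python
from pathlib import Path

# Upload once, persist the id, and reuse it across questions.
file_obj = client.files.create(file=Path("report.pdf"), purpose="file-extract")
save_file_id(conversation_id, file_obj.id)  # hypothetical DB helper

# Pull the extracted text and include it as context in later chat requests.
doc_text = client.files.content(file_id=file_obj.id).text
```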
Rate limits and reliability planning
Rate limits aren’t a “billing detail.” They shape architecture.
Moonshot documentation describes multiple measurement styles such as:
- Concurrency
- RPM (requests per minute)
- TPM (tokens per minute)
- TPD (tokens per day)
There’s also documentation stating rate limits can depend on cumulative recharge / account level.
How to design around rate limits (without pain)
- Queue requests (don't burst blindly)
- Back off on HTTP 429 (exponential, with jitter; see the sketch after this list)
- Use streaming for better UX (if supported)
- Cache repeated prompts (especially system prompts and repeated instructions)
- Add a "fallback model" in your app for graceful degradation
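A sketch of the 429 handling, assuming the OpenAI Python SDK (which raises `RateLimitError` on HTTP 429):

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(api_key="sk-...", base_url="https://api.moonshot.ai/v1")

def chat_with_backoff(model, messages, max_attempts=5):
    """Retry on 429 with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up; let the caller trigger a fallback model
            # 1s, 2s, 4s, ... plus up to 1s of jitter to avoid thundering herds.
            time.sleep(2 ** attempt + random.random())
```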
Cost control
Even if you don’t care about pennies, you should care about predictability.
The most effective cost levers
- Prompt discipline: keep system prompts tight and reusable.
- Summarize long history: don't send 50-message threads forever.
- Use retrieval: fetch only relevant doc chunks instead of whole documents.
- Cache: repeated questions + repeated context = wasted tokens.
- Right-size the model: don't use a heavy "thinking" model for short tasks.
Moonshot provides pricing documentation for chat inference and limits (and those pages can change), so it’s best to link your product’s cost logic to live model pricing rather than hardcoding numbers forever.
Practical use cases
1) Customer support copilot
- Model drafts responses
- Tools fetch order status, refunds, delivery ETA
- Human agent approves / edits
Why Kimi fits: tool calling + long context for policy docs.
2) Research assistant (internal)
- Users ask: "Summarize this report + give risks + cite pages"
- Files API stores PDFs
- Tools call search or your internal wiki
Why Kimi fits: long context + file workflows.
3) Developer helper (codebase Q&A)
- Tools read repo files
- Model explains architecture, writes patches, generates tests
- Optional agentic loops for multi-step fixes
Why Kimi fits: agentic tool calling patterns are explicitly supported and discussed in K2 materials.
4) Design-to-code pipeline
- User uploads screenshot or mock
- Model outputs HTML/CSS/JS scaffolding
- Follow-up prompts refine spacing, responsiveness, accessibility
K2.5 is positioned around “visual coding” and agentic productivity workflows.
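A sketch of the screenshot step using OpenAI-style image content parts; whether K2.5 accepts exactly this shape, and the model ID shown, are assumptions to verify against the multimodal docs:

```python
import base64

# Encode the uploaded mockup as a data URL (OpenAI-style image_url part).
with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="kimi-k2.5-example",  # illustrative ID; call /v1/models to confirm
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Generate HTML/CSS scaffolding for this mockup."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```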
Best practices checklist
Security
- API key never in browser code
- Keys stored in a secrets manager
- Separate keys for dev/staging/prod
- Rotate keys every 30–90 days
- Log redaction (never log Authorization headers)
Reliability
- Retry + exponential backoff
- Handle 429 and 5xx separately
- Timeouts on tool calls
- Circuit breaker on external APIs
- Fallback model strategy
Quality
- Short, consistent system prompt
- Constrain outputs (JSON schema / formatting rules)
- Add verification steps for high-stakes tasks
- Keep tool outputs small + structured
- Evaluate with a test set (not vibes)
Common pitfalls
Pitfall 1: “It worked in cURL, but fails in production”
Usually caused by:
- Wrong base URL for your region
- Proxy blocking
- Missing headers
- Timeouts on long requests
Use official base URLs and follow region guidance when calling from overseas.
Pitfall 2: Tool calling loops that never end
Fix it by:
- Cap tool-call steps (e.g., 5–12)
- Detect repeating tool calls with the same args
- Force a "final answer" after N loops
- Add "if the tool fails, ask the user for clarification" logic (a bounded-loop sketch follows this list)
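A sketch of such a bounded loop, assuming the OpenAI Python SDK message shapes (`run_tool` is a hypothetical dispatcher to your own functions):

```python
import json

def run_agent(client, model, messages, tools, max_steps=8):
    """Tool-calling loop with a step cap and repeated-call detection."""
    seen_calls = set()
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools, tool_choice="auto"
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final answer: exit the loop
        messages.append(msg)
        for call in msg.tool_calls:
            key = (call.function.name, call.function.arguments)
            if key in seen_calls:
                # Same tool + same args again: nudge the model to answer.
                result = "Repeated call detected; answer with what you have."
            else:
                seen_calls.add(key)
                result = run_tool(call.function.name,  # hypothetical dispatcher
                                  json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "Step limit reached; please clarify or narrow the request."
```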
Pitfall 3: Sending entire history forever
Fix it by:
- Summarize the conversation after N turns
- Store structured state outside the model (DB)
- Only send the last few turns + a summary
Pitfall 4: Treating the model like a database
Models hallucinate. If it must be correct:
- Use tools to fetch truth
- Require citations
- Validate outputs
Kimi API Pricing Guide: K2 vs K2.5 Token Rates
Kimi API pricing is usage-based, meaning you pay for what you generate. Costs mainly come from tokens (the text you send in and the text the model outputs). Your bill is typically calculated as:
- Input tokens (prompt, system instructions, chat history, retrieved document chunks)
- Output tokens (the model's answer, code, or structured JSON)
Some Kimi models may also support prompt caching, where repeated/stable parts of your prompt (like your system prompt or tool definitions) can be billed at a lower “cache hit” input rate when reused. Tool features (such as built-in web search, if enabled) may have additional per-call fees on top of token usage.
| Model | Input (cache hit), per 1M tokens | Input (cache miss), per 1M tokens | Output, per 1M tokens |
|---|---|---|---|
| Kimi K2 | $0.15 | $0.60 | $2.50 |
| Kimi K2.5 | $0.10 | $0.60 | $3.00 |
Notes
- "Cache hit" means Kimi's automatic context caching applied to some repeated/stable prompt tokens; "cache miss" is normal input pricing.
- If you use built-in web search, there's an extra $0.005 per `$web_search` call (on top of tokens).
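As a quick sanity check on spend, here is a sketch using the rates from the table above (assumed to be USD per 1M tokens; in production, read rates from config tied to the live pricing pages rather than hardcoding them):

```python
# USD per 1M tokens, copied from the table above; keep in config, not code.
RATES = {
    "kimi-k2":   {"cache_hit": 0.15, "cache_miss": 0.60, "output": 2.50},
    "kimi-k2.5": {"cache_hit": 0.10, "cache_miss": 0.60, "output": 3.00},
}

def estimate_cost(model, hit_tokens, miss_tokens, output_tokens):
    r = RATES[model]
    return (hit_tokens * r["cache_hit"]
            + miss_tokens * r["cache_miss"]
            + output_tokens * r["output"]) / 1_000_000

# Example: 50k cached input + 20k fresh input + 5k output on K2
# = 0.0075 + 0.0120 + 0.0125 = $0.0320
print(f"${estimate_cost('kimi-k2', 50_000, 20_000, 5_000):.4f}")
```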
FAQ (Kimi API)
Is Kimi API OpenAI-compatible?
Kimi K2 documentation states that the Open Platform provides OpenAI/Anthropic-compatible API support, and Moonshot also provides a “migrating from OpenAI” guide listing compatible endpoints.
What is the main endpoint I call?
Chat completions is documented as:
`POST https://api.moonshot.ai/v1/chat/completions`
How do I list models?
The docs mention:
`GET https://api.moonshot.ai/v1/models`
Does Kimi support tool/function calling?
Yes. Kimi K2 documentation includes tool calling examples using a `tools` list and an "auto" tool choice flow.
Are there different base URLs for different regions?
Yes. Ecosystem docs mention a global base URL and a China base URL, and Moonshot FAQ guidance also references an overseas base URL option.
How are rate limits measured?
Moonshot documentation describes multiple dimensions like concurrency, RPM, TPM, and TPD.
Does the API support files?
Yes. Moonshot docs include a files API, and the migration docs list file endpoints as part of compatibility.
Which model should I use?
Use `/v1/models` to see what's available, then pick based on job: fast/cheap for short tasks, "thinking" for deeper reasoning, multimodal for visual coding and image-aware workflows (where K2.5 is positioned strongly).
Final thoughts: the “good” way to build on Kimi API
If you want the Kimi API to feel great in production, build it like a real platform integration, not a demo script:
- Config-driven base URL + model IDs
- Strong key hygiene
- Tool calling with guardrails
- File + long-context workflows that don't spam tokens
- Rate-limit-aware architecture
- A test set that measures quality