Here’s a summary of the technical details of Kimi K2, Moonshot AI’s open-weight model family.
📐 Architecture & Parameters
- Mixture-of-Experts (MoE) architecture with 1 trillion total parameters, 32 billion active per token.
- Routes each token to 8 of 384 experts, plus 1 shared expert that is always active.
- 61 layers and 64 attention heads.
- Uses SwiGLU activation functions.
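The routing scheme above can be sketched in a few lines. This is a toy illustration of top-k MoE routing with a shared expert and a SwiGLU feed-forward, not Moonshot's implementation; all shapes and names here are assumptions for clarity.

```python
import numpy as np

def moe_forward(x, experts, shared_expert, router_w, top_k=8):
    """Toy top-k MoE routing sketch (illustrative, not Kimi K2's actual code).

    x: (d,) token hidden state; experts: list of callables; router_w: (n_experts, d).
    """
    logits = router_w @ x                      # one router score per expert
    top = np.argsort(logits)[-top_k:]          # select the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over the selected experts only
    out = shared_expert(x)                     # shared ("general") expert always runs
    for g, i in zip(gates, top):
        out = out + g * experts[i](x)          # gate-weighted sum of expert outputs
    return out

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    def silu(z):
        return z / (1.0 + np.exp(-z))
    return w_down @ (silu(w_gate @ x) * (w_up @ x))
```

Only the selected experts' weights are touched per token, which is how a 1T-parameter model can run with just 32B active parameters.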
🎯 Muon Optimizer / MuonClip
- Trained with the MuonClip optimizer: Muon plus a “qk‑clip” mechanism that rescales the query/key projections to keep attention logits bounded at scale.
- Trained on 15.5 trillion tokens; Muon is reported to deliver roughly 2× the training efficiency of AdamW with about half the optimizer memory.
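The qk‑clip idea can be sketched as a post-step weight correction. This is a simplified illustration of the mechanism described above; the threshold value and the even split of the rescale between queries and keys are assumptions, not Moonshot's published constants.

```python
import numpy as np

def qk_clip(w_q, w_k, max_logit, tau=100.0):
    """Sketch of qk-clip: if a head's maximum attention logit exceeds a
    threshold tau, rescale the query and key projection weights so the
    logit is pulled back to tau. tau=100.0 is illustrative only.
    """
    if max_logit > tau:
        scale = np.sqrt(tau / max_logit)   # split the correction between q and k
        w_q = w_q * scale                  # so the q·k product shrinks by tau/max_logit
        w_k = w_k * scale
    return w_q, w_k
```

Because attention logits are bilinear in the q/k weights, scaling each side by `sqrt(tau / max_logit)` shrinks the offending logit to exactly `tau`, which is what prevents logit blow-ups during large-scale training.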
🧠 Context Window
- Supports a 128,000-token context window (~150–200 pages).
📊 Performance & Benchmarks
- SWE‑bench Verified (software engineering): 65.8% Pass@1 (agentic, single attempt).
- LiveCodeBench: 53.7% Pass@1.
- MATH‑500: 97.4% accuracy.
- MMLU (language understanding): 89.5%.
- Tau2 retail tasks: 70.6% Avg@4.
- Also performs strongly on ZebraLogic, GPQA, and other reasoning benchmarks.
🤖 Agentic Capabilities
- Designed for agentic AI: multi-step workflows, autonomous tool calling, and tool orchestration.
- Post-trained with large-scale agentic data synthesis (simulated tool-use trajectories), reinforcement learning, and self-critique.
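A multi-step tool-orchestration loop of the kind described above typically looks like the following. This is a minimal sketch against a generic OpenAI-compatible chat client; the client object, tool schema layout, and model name are all assumptions for illustration.

```python
import json

def run_agent(client, tools, user_msg, model="kimi-k2-instruct", max_steps=5):
    """Minimal agentic loop sketch: the model either answers or requests
    tool calls; requested tools are executed and their results fed back.
    (client, tools dict layout, and model id are illustrative assumptions.)
    """
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=[t["schema"] for t in tools.values()],
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:            # model answered directly: we're done
            return msg.content
        for call in msg.tool_calls:       # execute each requested tool
            result = tools[call.function.name]["fn"](**json.loads(call.function.arguments))
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return None                           # step budget exhausted
```

The loop body is model-agnostic; what makes a model "agentic" is how reliably it emits well-formed tool calls and recovers across many such steps.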
⚙️ Deployment & APIs
- Available as Kimi‑K2‑Base (pre-trained) and Kimi‑K2‑Instruct (fine-tuned for chat/general use).
- Context limit: 128K tokens.
- Estimated speed: ~200 tokens per second (Groq Cloud).
- Accessible via Groq Cloud, SiliconFlow, OpenRouter, Together AI.
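All of the providers above expose OpenAI-compatible chat endpoints, so a request can be built the same way everywhere. A hedged sketch using only the standard library; the base URL and model id follow OpenRouter's conventions and should be checked against the provider's docs before use.

```python
import json
import urllib.request

def build_request(api_key, prompt,
                  model="moonshotai/kimi-k2",               # provider-specific id (assumed)
                  base_url="https://openrouter.ai/api/v1"): # OpenRouter-style endpoint (assumed)
    """Build an OpenAI-compatible chat-completions request for Kimi K2.

    Other providers (Together AI, SiliconFlow, Groq) differ mainly in the
    base_url and model id.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

# Usage (requires a real key and network access):
# with urllib.request.urlopen(build_request("YOUR_API_KEY", "hello")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```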
💾 Licensing & Open Source
- Open weights and code under a modified MIT license: very large commercial deployments (roughly 100M+ monthly active users or US$20M+ monthly revenue) must display “Kimi K2” prominently in their product.
- Can run locally (~1 TB weights) or in the cloud.
🧩 Advanced Features
- Mooncake service: disaggregated KV‑Cache architecture for high throughput on long contexts.
- Compatible with VS Code, Cline, Roo Code.
- No native multimodal support (no Kimi‑VL or Kimi‑Audio yet).
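The core idea behind a disaggregated KV-cache service like Mooncake can be sketched as prefix reuse: prefill results are stored in a shared pool so later requests that share a long-context prefix skip the expensive prefill. This toy class is purely illustrative and bears no relation to Mooncake's actual API.

```python
import hashlib

class KVCachePool:
    """Toy sketch of KV-cache prefix reuse (illustrative, not Mooncake's design).

    Prefill output for a token-id prefix is stored by hash; a later request
    with the same prefix reuses the cached KV blocks instead of recomputing.
    """
    def __init__(self):
        self.store = {}

    def _key(self, token_ids):
        return hashlib.sha256(repr(list(token_ids)).encode()).hexdigest()

    def get_or_prefill(self, token_ids, prefill_fn):
        k = self._key(token_ids)
        if k not in self.store:                 # cache miss: run (expensive) prefill
            self.store[k] = prefill_fn(token_ids)
        return self.store[k]                    # cache hit: reuse stored KV blocks
```

In a real disaggregated setup the pool lives on separate cache nodes and prefill/decode run on different GPU groups; the payoff is highest for long-context workloads where the same document prefix is queried repeatedly.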