Here’s a summary of the technical details of Kimi K2 from Moonshot AI.

📐 Architecture & Parameters
- Mixture-of-Experts (MoE) architecture: 1 trillion total parameters, 32 billion active per token.
- 384 routed experts, 8 activated per token, plus 1 shared ("general") expert.
- 61 layers and 64 attention heads.
- SwiGLU activation functions in the feed-forward layers.

🎯 Muon Optimizer / MuonClip
- Trained with the Muon optimizer via MuonClip, whose "qk-clip" mechanism rescales query/key projections to stabilize attention logits at scale.
- Pre-trained on 15.5 trillion tokens; reported roughly 2× training efficiency and ~50% lower optimizer memory usage versus standard optimizers.

🧠 Context Window
- Supports a 128,000-token context window (roughly 150–200 pages of text).

📊 Performance & Benchmarks
- SWE-bench (software engineering): 65.8% Pass@1.
- LiveCodeBench: 53.7% Pass@1.
- MATH-500: 97.4% accuracy.
- MMLU (language understanding): 89.5%.
- Tau2-bench retail tasks: 70.6% Avg@4.
- Also strong results on ZebraLogic, GPQA, and other reasoning benchmarks.

🤖 Agentic Capabilities
- Designed for agentic AI: multi-step workflows and tool orchestration.
- Post-trained using large-scale agentic data synthesis, reinforcement learning, and self-critique.

⚙️ Deployment & APIs
- Released as Kimi-K2-Base (pre-trained) and Kimi-K2-Instruct (fine-tuned for chat and general use).
- Context limit: 128,000 tokens, as above.
- Estimated serving speed: ~200 tokens per second on Groq Cloud.
- Accessible via Groq Cloud, SiliconFlow, OpenRouter, and Together AI.

💾 Licensing & Open Source
- Open weights and code under a modified MIT license (commercial deployments above certain scale thresholds must credit "Kimi K2").
- Can run locally (~1 TB of weights) or in the cloud.

🧩 Advanced Features
- Mooncake serving stack: disaggregated KV-cache architecture for high throughput on long contexts.
- Works with coding tools such as VS Code, Cline, and Roo Code.
- No native multimodal support (no Kimi-VL or Kimi-Audio integration yet).
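The MoE figures above (384 routed experts, 8 active per token, plus a shared expert) imply a standard top-k routing step. A minimal sketch in plain Python with toy gate scores — the real router, its load-balancing terms, and per-layer details are not specified in this summary:

```python
import math

NUM_EXPERTS = 384  # routed experts per MoE layer (from the summary)
TOP_K = 8          # experts activated per token (from the summary)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(gate_logits):
    """Return {expert index: mixing weight} for the TOP_K highest-scoring
    experts, with the selected weights renormalized to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:TOP_K]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

# Toy gate logits for one token; in practice a learned router produces these.
weights = route([0.001 * i for i in range(NUM_EXPERTS)])
```

The shared expert's output is then added unconditionally, alongside the weighted sum of the 8 routed experts' outputs.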
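SwiGLU, mentioned above as the activation function, is the standard gated activation: a SiLU-gated branch multiplied elementwise by a linear branch. A scalar-level sketch (real FFN layers apply this per hidden dimension with weight matrices, followed by a down-projection):

```python
import math

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    """SwiGLU for one scalar feature path: the SiLU-gated branch
    multiplied by the linear 'up' branch. Weights here are toy scalars."""
    return silu(x * w_gate) * (x * w_up)
```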
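The exact MuonClip formulation lives in Moonshot's technical report; the threshold value and matrix shapes below are assumptions for illustration. The core idea of qk-clip — shrinking the query and key projection weights whenever the maximum pre-softmax attention logit grows too large — can be sketched as:

```python
import math

TAU = 100.0  # clipping threshold on the max attention logit (assumed value)

def qk_clip(w_q, w_k, max_logit):
    """If the largest observed attention logit exceeds TAU, rescale both
    projection matrices by sqrt(TAU / max_logit). Splitting the factor as a
    square root across W_q and W_k scales each q·k product by exactly
    TAU / max_logit, pulling logits back under the threshold."""
    if max_logit <= TAU:
        return w_q, w_k
    s = math.sqrt(TAU / max_logit)
    shrink = lambda w: [[x * s for x in row] for row in w]
    return shrink(w_q), shrink(w_k)

# Toy 1x1 "matrices": a logit of 400 is 4x over threshold, so each weight
# matrix shrinks by sqrt(1/4) = 0.5.
new_q, new_k = qk_clip([[2.0]], [[2.0]], max_logit=400.0)
```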
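The hosted providers listed above expose OpenAI-compatible chat endpoints, so a request can be sketched with just the standard library. The endpoint URL and model identifier below are illustrative assumptions — check each provider's documentation for the exact values:

```python
import json
import urllib.request

# Assumed values: OpenRouter-style endpoint and model id, for illustration only.
API_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "moonshotai/kimi-k2"

def build_request(prompt, api_key):
    """Build a chat-completion request for an OpenAI-compatible endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

if __name__ == "__main__":
    # Sending the request requires a real API key and network access.
    req = build_request("Summarize SwiGLU in one sentence.", "YOUR_API_KEY")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```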