Meet the Kubernetes Agent
Welcome to k8s-autopilot — a stateful, multi-agent AI system that orchestrates Kubernetes deployments, manages progressive GitOps delivery, and safely debugs your cluster through conversation.
We designed k8s-autopilot to feel less like a rigid script and more like a senior DevOps colleague. Whether you need to generate a complex Helm chart, execute a zero-downtime canary rollout, triage firing alerts at 3 AM, or debug a crashing pod — the agent handles the heavy lifting while keeping you firmly in control through mandatory Human-in-the-Loop approval gates.
Why we built this
Managing Kubernetes at scale is tough. Junior engineers hit a steep learning curve, while senior architects drown in repetitive work: following runbooks, troubleshooting YAML indentation errors, orchestrating canary rollouts, and context-switching between kubectl, Argo dashboards, Helm releases, and Prometheus metrics.
We wanted to fix this by combining the reasoning power of Large Language Models (LLMs) with the strict reliability of tools you already trust — delivered through a conversational interface that actually understands your cluster's context.
With k8s-autopilot, you get:
- 4 specialized domains covering Helm, ArgoCD/Rollouts/Traefik, Kubernetes ops, and Observability
- 13 sub-agents, each with deep expertise in its own tools
- 8 MCP server integrations providing standardized tool access
- Human-in-the-Loop safety at every state-modifying operation
- Self-healing: if generated output fails validation, the agent catches the failure, reads the error log, and corrects its own YAML before retrying
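The self-healing behavior in the last bullet can be pictured as a generate, validate, retry loop. The following is a minimal sketch under stated assumptions, not the project's actual implementation: `generate_with_self_healing`, `ValidationResult`, and the toy `fake_generate` / `fake_validate` functions are hypothetical names introduced only for illustration, and the real agent would call an LLM and a real validator instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValidationResult:
    """Outcome of one validation pass (hypothetical type for this sketch)."""
    ok: bool
    error_log: str = ""


def generate_with_self_healing(
    generate: Callable[[Optional[str]], str],
    validate: Callable[[str], ValidationResult],
    max_attempts: int = 3,
) -> str:
    """Generate an artifact, feeding each validation error into the next attempt."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        artifact = generate(feedback)    # first call gets no feedback
        result = validate(artifact)
        if result.ok:
            return artifact
        feedback = result.error_log      # the next attempt sees the error log
    raise RuntimeError(
        f"validation still failing after {max_attempts} attempts: {feedback}"
    )


# Toy stand-ins: the first attempt is invalid, the retry uses the feedback.
def fake_generate(feedback: Optional[str]) -> str:
    return "replicas: two" if feedback is None else "replicas: 2"


def fake_validate(text: str) -> ValidationResult:
    value = text.split(":", 1)[1].strip()
    if value.isdigit():
        return ValidationResult(ok=True)
    return ValidationResult(
        ok=False, error_log=f"replicas must be an integer, got {value!r}"
    )


fixed = generate_with_self_healing(fake_generate, fake_validate)
print(fixed)
```

Capping the number of attempts keeps a persistently failing generation from looping forever; the limit of three here is an arbitrary choice for the sketch.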
How it works under the hood
The architecture uses a Supervisor → Coordinator → Sub-agent hierarchy powered by LangGraph. The Supervisor acts as a pure router, delegating to four domain-specific coordinators that each manage their own team of specialized sub-agents.
When you ask the system to "Deploy the checkout API with zero downtime," the Supervisor routes the request to the App Operator, which reads your cluster state via MCP, generates a workloadRef migration plan, and waits for your explicit Human-in-the-Loop (HITL) approval before touching a single resource.
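To make the routing and approval flow concrete, here is a rough sketch in plain Python. It deliberately does not use LangGraph, and it replaces the Supervisor's LLM-based routing with simple keyword matching. Everything in it is an assumption for illustration: the `DOMAIN_KEYWORDS` table, the `route` and `execute_with_approval` functions, the `Plan` type, and the plan steps are hypothetical and do not describe the project's real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical keyword table standing in for the Supervisor's routing logic.
DOMAIN_KEYWORDS: Dict[str, List[str]] = {
    "helm_operator": ["helm", "chart"],
    "app_operator": ["deploy", "rollout", "canary", "argocd", "traefik"],
    "k8s_operator": ["pod", "scale", "rbac", "debug"],
    "observability": ["alert", "prometheus", "promql", "silence"],
}


def route(request: str) -> str:
    """Pick the first domain whose keywords appear in the request."""
    text = request.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return domain
    return "k8s_operator"  # arbitrary fallback chosen for this sketch


@dataclass
class Plan:
    domain: str
    steps: List[str]


def execute_with_approval(
    plan: Plan,
    approve: Callable[[Plan], bool],
    apply: Callable[[str], None],
) -> bool:
    """Apply the plan's steps only after the human approver says yes."""
    if not approve(plan):
        return False          # nothing is touched without approval
    for step in plan.steps:
        apply(step)
    return True


applied: List[str] = []
plan = Plan(
    domain=route("Deploy the checkout API with zero downtime"),
    steps=["create Rollout with workloadRef", "shift traffic gradually"],  # illustrative
)

rejected = execute_with_approval(plan, approve=lambda p: False, apply=applied.append)
applied_after_reject = list(applied)   # snapshot: should still be empty
approved = execute_with_approval(plan, approve=lambda p: True, apply=applied.append)
```

The point of the sketch is the ordering: the plan exists as data before anything runs, so a human can inspect it, and the state-modifying `apply` calls sit entirely behind the approval check.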
Key capabilities at a glance
| Domain | What it Does | Key Workflows |
|---|---|---|
| 📦 Helm Operator | Chart generation, validation, live operations, GitHub persistence | Create chart → Validate → Approve → Commit to GitHub |
| 🔄 App Operator | ArgoCD GitOps, progressive delivery, edge routing | Canary rollouts, blue-green, NGINX→Traefik migration |
| ☸️ K8s Operator | Cluster operations, pod debugging, scaling, RBAC | Root cause analysis, ephemeral debug pods, multi-cluster |
| 🔭 Observability | Prometheus monitoring, Alertmanager alerting | PromQL queries, exporter lifecycle, silence management |
Getting Started
The easiest way to take k8s-autopilot for a spin is via Docker Compose.
Quick Start
# Create docker-compose.yml and .env (see Configuration page for details)
docker compose up -d
# k8s-autopilot Agent running at http://localhost:10102
# TalkOps UI running at http://localhost:8080
Open http://localhost:8080 and start talking to the orchestrator.
From Source
git clone https://github.com/talkops-ai/k8s-autopilot.git
cd k8s-autopilot
# Create a Python 3.12 virtual environment (requires uv to be installed)
uv venv --python=3.12
source .venv/bin/activate
# Install dependencies
uv pip install -e .
# Create .env from the example file, then add your API keys
cp .env.example .env
# Start the A2A server
k8s-autopilot --host localhost --port 10102
The agent is model-agnostic — you can use OpenAI, Anthropic, or Google Gemini by setting LLM_PROVIDER in your .env file. You can even route different tiers to different models (e.g., a fast model for the Supervisor and a reasoning model for coordinators).
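One way to picture per-tier model routing is a small resolver that reads a default provider and lets each tier override it. This is only a sketch under assumptions: `LLM_PROVIDER` comes from the text above, but the per-tier variable names (`SUPERVISOR_LLM_PROVIDER`, `COORDINATOR_LLM_PROVIDER`), the accepted provider strings, and the `resolve_providers` function are hypothetical. Check the Configuration page for the variable names the project actually uses.

```python
import os
from typing import Dict, Mapping

# Hypothetical provider identifiers; the real accepted values may differ.
SUPPORTED_PROVIDERS = {"openai", "anthropic", "google"}


def resolve_providers(env: Mapping[str, str]) -> Dict[str, str]:
    """Map each tier to a provider, falling back to LLM_PROVIDER."""
    default = env.get("LLM_PROVIDER", "").lower()
    tiers: Dict[str, str] = {}
    for tier in ("supervisor", "coordinator"):
        # Hypothetical override variable, e.g. SUPERVISOR_LLM_PROVIDER.
        override = env.get(f"{tier.upper()}_LLM_PROVIDER", "").lower()
        provider = override or default
        if provider not in SUPPORTED_PROVIDERS:
            raise ValueError(f"unsupported or missing provider for {tier}: {provider!r}")
        tiers[tier] = provider
    return tiers


example = resolve_providers(
    {"LLM_PROVIDER": "openai", "COORDINATOR_LLM_PROVIDER": "anthropic"}
)
print(example)

# In a real process you would pass os.environ instead of a literal dict:
# resolve_providers(os.environ)
```

Taking the environment as a parameter, rather than reading `os.environ` directly inside the function, keeps the resolver easy to test with plain dictionaries.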
What's next?
Explore the rest of the documentation:
- Components — Deep dive into the Supervisor, coordinators, state management, and middleware architecture
- Capabilities — Per-domain breakdown of all 13 sub-agents and their workflows
- Configuration — Environment variables, Docker Compose, and LLM model configuration
- Examples — Real-world scenarios across all four domains
- Troubleshooting — Common issues and debugging guides