Meet the Kubernetes Agent

Welcome to k8s-autopilot — a stateful, multi-agent AI system that orchestrates Kubernetes deployments, manages progressive GitOps delivery, and safely debugs your cluster through conversation.

We designed k8s-autopilot to feel less like a rigid script and more like a senior DevOps colleague. Whether you need to generate a complex Helm chart, execute a zero-downtime canary rollout, triage firing alerts at 3 AM, or debug a crashing pod — the agent handles the heavy lifting while keeping you firmly in control through mandatory Human-in-the-Loop approval gates.


Why we built this

Managing Kubernetes at scale is tough. Junior engineers hit a steep learning curve, while senior architects drown in repetitive work: following runbooks, troubleshooting YAML indentation errors, orchestrating canary rollouts, and context-switching between kubectl, Argo dashboards, Helm releases, and Prometheus metrics.

We wanted to fix this by combining the reasoning power of Large Language Models (LLMs) with the strict reliability of tools you already trust — delivered through a conversational interface that actually understands your cluster's context.

With k8s-autopilot, you get:

  • 4 specialized domains covering Helm, ArgoCD/Rollouts/Traefik, Kubernetes ops, and Observability
  • 13 sub-agents, each with deep expertise in its respective tools
  • 8 MCP server integrations providing standardized tool access
  • Human-in-the-Loop safety at every state-modifying operation
  • Self-healing — if a generation fails validation, the agent catches it, reads the error log, and fixes its own YAML dynamically
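The self-healing behavior in the last bullet amounts to a generate, validate, repair loop. Here is a minimal sketch of that control flow, assuming hypothetical `generate` and `validate` callables; they are illustrative stand-ins, not k8s-autopilot's actual API, and the retry limit is an assumption too.

```python
from typing import Callable, Optional


def generate_with_self_healing(
    generate: Callable[[Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> str:
    """Regenerate a manifest until it validates or attempts run out.

    `generate` (hypothetical) receives the previous error log, or None on
    the first attempt, and returns YAML text.
    `validate` (hypothetical) returns an error log, or None on success.
    """
    error_log: Optional[str] = None
    for _ in range(max_attempts):
        manifest = generate(error_log)  # feed the last error back in
        error_log = validate(manifest)
        if error_log is None:
            return manifest
    raise RuntimeError(
        f"still invalid after {max_attempts} attempts: {error_log}"
    )


# Toy stand-ins: the first attempt uses a tab for indentation,
# the retry (which sees the error log) does not.
def toy_generate(previous_error: Optional[str]) -> str:
    return "replicas: 2\n" if previous_error else "\treplicas: 2\n"


def toy_validate(manifest: str) -> Optional[str]:
    return "tabs are not allowed for indentation" if "\t" in manifest else None


result = generate_with_self_healing(toy_generate, toy_validate)
```

Capping the number of attempts keeps a generation that never validates from looping forever; in that case the last error is surfaced to the caller instead.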

How it works under the hood

The architecture uses a Supervisor → Coordinator → Sub-agent hierarchy powered by LangGraph. The Supervisor acts as a pure router, delegating to four domain-specific coordinators that each manage their own team of specialized sub-agents.

When you ask the system to "Deploy the checkout API with zero downtime," the Supervisor routes to the App Operator, which reads your cluster state via MCP, generates a workloadRef migration plan, and waits for your explicit HITL approval before touching a single resource.
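The routing and approval flow described above can be sketched in plain Python. This is an illustration only: it does not use LangGraph, the keyword table is a hypothetical substitute for whatever routing logic the real Supervisor applies, and `Plan`, `route`, and `handle` are names invented for the example. Only the coordinator names come from this page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    coordinator: str
    steps: List[str]
    applied: bool = False


# Hypothetical keyword routing; assumed for illustration only.
ROUTES = {
    "helm": "Helm Operator",
    "canary": "App Operator",
    "zero downtime": "App Operator",
    "crash": "K8s Operator",
    "alert": "Observability",
}


def route(request: str) -> str:
    """Supervisor as a pure router: pick a coordinator, do nothing else."""
    text = request.lower()
    for keyword, coordinator in ROUTES.items():
        if keyword in text:
            return coordinator
    return "K8s Operator"  # arbitrary fallback for this sketch


def handle(request: str, approve: Callable[[Plan], bool]) -> Plan:
    """Build a plan, then apply it only if the human approves."""
    plan = Plan(
        coordinator=route(request),
        steps=["read cluster state", "generate workloadRef migration plan"],
    )
    if approve(plan):  # the Human-in-the-Loop gate
        plan.applied = True  # stand-in for actually modifying resources
    return plan


request = "Deploy the checkout API with zero downtime"
approved = handle(request, approve=lambda plan: True)
rejected = handle(request, approve=lambda plan: False)
```

The point of the shape is that nothing state-modifying happens before `approve` returns True; a rejected plan is returned unchanged so it can be shown to the user or revised.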

Key capabilities at a glance

| Domain | What it does | Key workflows |
|---|---|---|
| 📦 Helm Operator | Chart generation, validation, live operations, GitHub persistence | Create chart → Validate → Approve → Commit to GitHub |
| 🔄 App Operator | ArgoCD GitOps, progressive delivery, edge routing | Canary rollouts, blue-green, NGINX→Traefik migration |
| ☸️ K8s Operator | Cluster operations, pod debugging, scaling, RBAC | Root cause analysis, ephemeral debug pods, multi-cluster |
| 🔭 Observability | Prometheus monitoring, Alertmanager alerting | PromQL queries, exporter lifecycle, silence management |

Getting Started

The easiest way to take k8s-autopilot for a spin is via Docker Compose.

Quick Start

# Create docker-compose.yml and .env (see Configuration page for details)
docker compose up -d

# k8s-autopilot Agent running at http://localhost:10102
# TalkOps UI running at http://localhost:8080

Open http://localhost:8080 and start talking to the orchestrator.
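If the page does not load, it can help to confirm that both ports from the Quick Start are accepting connections. The following is a small optional check using only the Python standard library; it tests TCP reachability and nothing more, so a successful result does not prove the services are fully initialized.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the Quick Start above.
    for name, port in [("k8s-autopilot Agent", 10102), ("TalkOps UI", 8080)]:
        status = "up" if is_listening("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```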

From Source

git clone https://github.com/talkops-ai/k8s-autopilot.git
cd k8s-autopilot

# Create a Python 3.12 virtual environment (requires uv to be installed)
uv venv --python=3.12
source .venv/bin/activate

# Install dependencies
uv pip install -e .

# Create .env and configure API keys
cp .env.example .env

# Start the A2A server
k8s-autopilot --host localhost --port 10102

The agent is model-agnostic — you can use OpenAI, Anthropic, or Google Gemini by setting LLM_PROVIDER in your .env file. You can even route different tiers to different models (e.g., a fast model for the Supervisor and a reasoning model for coordinators).
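To make the tiering idea concrete, here is a sketch of how per-tier model selection could be resolved from environment variables. Only `LLM_PROVIDER` appears in this documentation; `LLM_MODEL`, `SUPERVISOR_MODEL`, and `COORDINATOR_MODEL` are hypothetical names chosen for the example, as are the model values, so check the Configuration page for the variables the agent actually reads.

```python
import os
from typing import Dict, Mapping


def resolve_models(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Pick a model per tier, falling back to one shared default.

    Every variable except LLM_PROVIDER is a hypothetical name.
    """
    provider = env.get("LLM_PROVIDER")
    if not provider:
        raise ValueError("LLM_PROVIDER must be set")
    default_model = env.get("LLM_MODEL", "")
    return {
        "provider": provider,
        "supervisor": env.get("SUPERVISOR_MODEL", default_model),
        "coordinator": env.get("COORDINATOR_MODEL", default_model),
    }


# Example: one shared fast model, overridden for the coordinator tier.
config = resolve_models(
    {
        "LLM_PROVIDER": "anthropic",
        "LLM_MODEL": "fast-model",
        "COORDINATOR_MODEL": "reasoning-model",
    }
)
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.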


What's next?

Explore the rest of the documentation:

  • Components — Deep dive into the Supervisor, coordinators, state management, and middleware architecture
  • Capabilities — Per-domain breakdown of all 13 sub-agents and their workflows
  • Configuration — Environment variables, Docker Compose, and LLM model configuration
  • Examples — Real-world scenarios across all four domains
  • Troubleshooting — Common issues and debugging guides