☸️ K8s Operator
Manages raw Kubernetes cluster operations — pod debugging, resource CRUD, scaling, ephemeral debugging, node diagnostics, and multi-cluster context switching.
The K8s Operator coordinates a single highly capable sub-agent (k8s-cluster-ops) that connects to the Kubernetes MCP server. It serves as the "Level-1 SRE" for your cluster, handling everything from simple "list pods" queries to complex root cause analysis of CrashLoopBackOff situations.
Architecture
Dual-Path Architecture
Every request is classified before any action:
Read-Only Operations (Fast-Path)
For read-only queries, the agent calls a single tool, formats the output, and returns immediately — no files are read, no planning is performed.
| Query Type | Tool | Example |
|---|---|---|
| List all pods (cluster-wide) | pods_list | "Show me all pods" |
| List pods in namespace | pods_list_in_namespace | "List pods in production" |
| Get pod details | pods_get | "Describe the checkout pod" |
| Pod logs | pods_log | "Show me logs for payment-service" |
| Pod resource usage | pods_top | "Which pods are using the most memory?" |
| List resources | resources_list | "List all deployments in staging" |
| Get resource details | resources_get | "Show me the nginx ingress" |
| List namespaces | namespaces_list | "What namespaces exist?" |
| List events | events_list | "Show recent cluster events" |
| Node resource usage | nodes_top | "How much CPU are the nodes using?" |
| Node stats | nodes_stats_summary | "Node health summary" |
| Node logs | nodes_log | "Show kubelet logs for node-1" |
| List kubeconfig contexts | configuration_contexts_list | "What clusters can I access?" |
| View kubeconfig | configuration_view | "Show current kubeconfig" |
| Check current replicas | resources_scale (without scale param) | "How many replicas does nginx have?" |
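The classification step can be illustrated with a small sketch. The tool names are taken from the tables in this document; the `classify` function, its return labels, and the `scale` argument key are assumptions made for illustration, not the agent's actual code.

```python
# Hypothetical sketch: tool names come from this document's tables,
# the classifier itself is illustrative only.
READ_ONLY_TOOLS = {
    "pods_list", "pods_list_in_namespace", "pods_get", "pods_log", "pods_top",
    "resources_list", "resources_get", "namespaces_list", "events_list",
    "nodes_top", "nodes_stats_summary", "nodes_log",
    "configuration_contexts_list", "configuration_view",
}
MUTATING_TOOLS = {
    "resources_create_or_update", "resources_delete", "pods_exec", "pods_run",
}


def classify(tool: str, args: dict) -> str:
    """Return "fast_path" for read-only calls and "phased" for mutations."""
    if tool == "resources_scale":
        # Without a scale parameter the tool only reads the current replicas.
        return "phased" if args.get("scale") is not None else "fast_path"
    if tool in READ_ONLY_TOOLS:
        return "fast_path"
    if tool in MUTATING_TOOLS:
        return "phased"
    raise ValueError(f"unknown tool: {tool}")
```

Note that resources_scale appears on both paths: the presence of the scale parameter, not the tool name, decides whether the call is a mutation.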
Cluster Health Check
For comprehensive cluster assessments, the agent uses the cluster-health-check MCP prompt, which automatically runs multiple read-only tools to provide a holistic health overview.
State-Modifying Operations (Phased Workflow)
For mutations (create, update, scale, delete, exec, run pod), the agent follows a mandatory 4-phase workflow:
Phase 1: Discovery
- If the task description provides resource kind, name, namespace, and action → skip to Planning
- Otherwise: check /memories/k8s-operator/operations-log.md for recent operations context
- Only as last resort: call request_human_input for missing parameters
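The discovery order above might be sketched as follows. The function name, the dictionary shapes, and the idea that the operations log can be reduced to a flat key/value context are all assumptions for illustration.

```python
# Hypothetical sketch of the Phase 1 decision; data shapes are assumed.
REQUIRED = ("kind", "name", "namespace", "action")


def discovery_step(task: dict, log_context: dict) -> tuple:
    """Return (next_step, payload) for Phase 1.

    next_step is "planning" when all parameters are known, otherwise
    "request_human_input" with the list of parameters still missing.
    """
    params = {key: task.get(key) for key in REQUIRED}
    # Fill gaps from recent operations context before asking the user.
    for key in REQUIRED:
        if not params[key] and log_context.get(key):
            params[key] = log_context[key]
    missing = [key for key in REQUIRED if not params[key]]
    if missing:
        return "request_human_input", {"missing": missing}
    return "planning", params
```

Asking the user comes last, so a fully specified task never triggers a prompt.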
Phase 2: Planning (Mandatory)
Always read before write. Call the read variant first to capture current state, then present a clear action plan:
| Operation | HITL Phase | Plan Content |
|---|---|---|
| Create/Update | create_update_plan_review | Kind, Name, Namespace, YAML preview, Impact |
| Delete | deletion_plan_review | Kind, Name, Namespace, what will be removed |
| Scale | scale_plan_review | Kind, Name, Current replicas → Target replicas |
| Exec | exec_approval | Pod name, Container, Command, Namespace — ⚠️ grants shell-level access |
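As a rough illustration of the table above, a plan could be assembled by looking up the HITL phase for the operation and refusing to proceed when required plan fields are missing. The HITL phase names come from the table; the `build_plan` helper and the field keys are hypothetical.

```python
# Hypothetical sketch: phase names are from the table, field keys are assumed.
HITL_PHASES = {
    "create_update": ("create_update_plan_review",
                      ["kind", "name", "namespace", "yaml_preview", "impact"]),
    "delete": ("deletion_plan_review",
               ["kind", "name", "namespace", "removed"]),
    "scale": ("scale_plan_review",
              ["kind", "name", "current_replicas", "target_replicas"]),
    "exec": ("exec_approval",
             ["pod", "container", "command", "namespace"]),
}


def build_plan(operation: str, **fields) -> dict:
    """Build a plan for human review, failing if any plan field is absent."""
    phase, required = HITL_PHASES[operation]
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"plan for {operation} is missing: {missing}")
    return {"hitl_phase": phase, "content": {name: fields[name] for name in required}}
```

For a scale operation, current_replicas would be the value captured by the read call that precedes the plan.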
Phase 3: Execution
Tools are gated by HumanInTheLoopMiddleware as a background safety net.
| Operation | Tool |
|---|---|
| Create or update resource | resources_create_or_update |
| Delete resource | resources_delete |
| Scale deployment | resources_scale (with scale param) |
| Execute in pod | pods_exec |
| Run debug pod | pods_run |
Phase 4: Verification
After every mutation, the agent re-reads the resource to confirm the change took effect:
| After... | Verify with... | Success Criteria |
|---|---|---|
| Scale | resources_scale (read) | readyReplicas matches target |
| Delete | resources_get | Resource returns 404 |
| Create/Update | resources_get | Resource exists with expected spec |
| Exec | Command output | Expected output received |
The agent never declares success based solely on tool stdout. It always runs a verification query.
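The success criteria in the table can be expressed as simple checks on the re-read resource. This is a minimal sketch: the function names are hypothetical, and it assumes the verification read returns the resource as a dictionary in the usual Kubernetes object layout (for example `status.readyReplicas` on a Deployment).

```python
from typing import Optional


def verify_scale(resource: dict, target: int) -> bool:
    # readyReplicas is omitted from status when no replica is ready.
    return resource.get("status", {}).get("readyReplicas", 0) == target


def verify_delete(status_code: int) -> bool:
    # A deleted resource should no longer be found.
    return status_code == 404


def verify_create_or_update(resource: Optional[dict], expected_spec: dict) -> bool:
    # The resource must exist and every expected spec field must match.
    if resource is None:
        return False
    spec = resource.get("spec", {})
    return all(spec.get(key) == value for key, value in expected_spec.items())
```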
Key Capabilities
Automated Root Cause Analysis
Ask the agent to "debug my failing pod" and it will:
- Pull exit codes and container status
- Scan previous container logs (--previous)
- Check cluster events for the pod/namespace
- Correlate OOMKilled with memory limits vs live pods_top stats
- Propose a fix (e.g., "Increase memory limit by 20%")
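The OOMKilled branch of this analysis might look like the sketch below. It assumes the container status follows the standard Kubernetes `lastState.terminated.reason` layout and that the memory limit uses a binary suffix (Ki, Mi, Gi) or plain bytes. The helper names and the 20% default are illustrative, and the comparison against live pods_top usage is left out for brevity.

```python
from typing import Optional

_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain bytes


def propose_fix(container_status: dict, limit: str, increase: float = 0.20) -> Optional[str]:
    """Suggest a higher memory limit if the last termination was OOMKilled."""
    terminated = container_status.get("lastState", {}).get("terminated", {})
    if terminated.get("reason") != "OOMKilled":
        return None
    new_limit_mi = round(parse_memory(limit) * (1 + increase) / 1024 ** 2)
    return f"Increase memory limit from {limit} to {new_limit_mi}Mi"
```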
Safe Exec & Ephemeral Debugging
- Pod exec: Run commands inside running containers with explicit approval
- Debug pods: Spin up temporary debug containers (like netshoot or busybox) to test DNS and connectivity
- All exec operations require HITL approval
Multi-Cluster Context Switching
- List available kubeconfig contexts across clusters
- Switch contexts for cross-cluster operations
- Maintains awareness of current context throughout the session
RBAC & Security Inspection
- Query Roles, RoleBindings, ClusterRoles, and ClusterRoleBindings
- Audit ServiceAccount permissions
- Inspect who has access to what resources
Resource Pressure Investigation
- Correlate pod memory limits with live nodes_top and pods_top stats
- Identify scheduling bottlenecks and quota limits
- Deep-dive into Pod/Node descriptions for resource issues
Idempotency Rules
The agent never creates a resource without first checking if it already exists:
| Before creating... | First check with... | If exists... |
|---|---|---|
| Any resource | resources_get(apiVersion, kind, name, namespace) | Use resources_create_or_update (upsert) — warn about overwrite |
| Pod (via pods_run) | pods_get(name, namespace) | Report existing pod — do NOT create duplicate |
| Scale target | resources_scale (without scale param) | Read current replicas, confirm new target with user |
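The check-before-create rule for generic resources could be sketched as below. The `get_resource` callable stands in for a resources_get call and is assumed to return None when the resource does not exist; the function name and return shape are hypothetical.

```python
from typing import Callable, Optional


def plan_create(
    get_resource: Callable[[str, str, str, str], Optional[dict]],
    api_version: str,
    kind: str,
    name: str,
    namespace: str,
) -> dict:
    """Decide between create and upsert, warning when an overwrite would occur."""
    existing = get_resource(api_version, kind, name, namespace)
    if existing is None:
        return {"action": "create", "warning": None}
    return {
        "action": "update",
        "warning": f"{kind}/{name} already exists in {namespace}; applying will overwrite it",
    }
```

Passing the lookup in as a callable keeps the sketch testable without a cluster.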
Skills Reference
| Skill Directory | Sub-Agent | Purpose |
|---|---|---|
| k8s-operator/kubernetes-cluster-ops | k8s-cluster-ops | Cluster operations workflow patterns, safety rules, debugging playbooks |
MCP Integration
| MCP Server | Package / Command | Transport | Used By |
|---|---|---|---|
| kubernetes_mcp_server | npx kubernetes-mcp-server@latest | stdio | k8s-cluster-ops |
The Kubernetes MCP server is the only third-party MCP server in k8s-autopilot (installed via npx). All other MCP servers are TalkOps-native PyPI packages.
Cross-Domain Integration
The K8s Operator frequently receives cross-domain handoffs:
| Scenario | Source Domain | Context Received |
|---|---|---|
| "Investigate pod crashes after ArgoCD sync failed" | App Operator | App name, namespace, sync status |
| "Check pod resource usage — 5 alerts firing" | Observability | Alert details, affected services |
| "Verify Helm release pods are healthy" | Helm Operator | Release name, namespace, expected pods |
When the K8s Operator encounters requests outside its scope (e.g., ArgoCD operations, Prometheus queries), it returns a structured handoff signal for the Supervisor to re-route.
Safety & Governance
Scope Boundaries
The K8s Operator handles raw Kubernetes objects only. It will immediately defer to the appropriate operator for:
- Helm chart operations → Helm Operator
- ArgoCD/Argo Rollouts → App Operator
- Prometheus/Alertmanager → Observability Operator
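The scope list above, combined with the handoff signal mentioned under Cross-Domain Integration, might be sketched as follows. The operator names come from the list; the `Handoff` fields and the keyword matching are assumptions. Naive keyword matching is only a placeholder here, since a request that merely mentions ArgoCD as context (such as investigating pod crashes after a failed sync) still belongs to the K8s Operator.

```python
from dataclasses import dataclass
from typing import Optional

# Routing targets are from the scope list; the keywords are hypothetical.
OUT_OF_SCOPE = {
    "helm": "Helm Operator",
    "argocd": "App Operator",
    "argo rollouts": "App Operator",
    "prometheus": "Observability Operator",
    "alertmanager": "Observability Operator",
}


@dataclass
class Handoff:
    target: str  # operator the Supervisor should re-route to
    reason: str  # short explanation for the re-route


def route(request: str) -> Optional[Handoff]:
    """Return a handoff signal for out-of-scope requests, else None."""
    text = request.lower()
    for keyword, operator in OUT_OF_SCOPE.items():
        if keyword in text:
            return Handoff(operator, f"'{keyword}' is outside raw Kubernetes scope")
    return None
```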
HITL Gates
Every state-modifying operation (create, update, delete, scale, exec) requires explicit user approval. The HumanInTheLoopMiddleware provides a mechanical safety backstop.
[PLAN-LOCKED] Execution
When the coordinator has already obtained approval, the sub-agent skips Phase 2 planning and executes the pre-approved parameters directly.