☸️ K8s Operator

Manages raw Kubernetes cluster operations — pod debugging, resource CRUD, scaling, ephemeral debugging, node diagnostics, and multi-cluster context switching.

The K8s Operator coordinates a single highly capable sub-agent (k8s-cluster-ops) that connects to the Kubernetes MCP server. It serves as the "Level-1 SRE" for your cluster, handling everything from simple "list pods" queries to complex root cause analysis of pods stuck in CrashLoopBackOff.


Architecture


Dual-Path Architecture

Every request is classified before any action:


Read-Only Operations (Fast-Path)

For read-only queries, the agent calls a single tool, formats the output, and returns immediately — no files are read, no planning is performed.

| Query Type | Tool | Example |
| --- | --- | --- |
| List all pods (cluster-wide) | pods_list | "Show me all pods" |
| List pods in namespace | pods_list_in_namespace | "List pods in production" |
| Get pod details | pods_get | "Describe the checkout pod" |
| Pod logs | pods_log | "Show me logs for payment-service" |
| Pod resource usage | pods_top | "Which pods are using the most memory?" |
| List resources | resources_list | "List all deployments in staging" |
| Get resource details | resources_get | "Show me the nginx ingress" |
| List namespaces | namespaces_list | "What namespaces exist?" |
| List events | events_list | "Show recent cluster events" |
| Node resource usage | nodes_top | "How much CPU are the nodes using?" |
| Node stats | nodes_stats_summary | "Node health summary" |
| Node logs | nodes_log | "Show kubelet logs for node-1" |
| List kubeconfig contexts | configuration_contexts_list | "What clusters can I access?" |
| View kubeconfig | configuration_view | "Show current kubeconfig" |
| Check current replicas | resources_scale (without scale param) | "How many replicas does nginx have?" |
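
The classification step can be pictured as a lookup on the tool a request resolves to. The sketch below is illustrative only: the tool names come from this page (the mutating ones are listed under Phase 3), while the `classify` function, the `args` dictionary, and the "fast-path"/"phased" labels are assumptions made for the example, not the agent's actual implementation.

```python
# Tool names are taken from this page; the classifier itself is hypothetical.
READ_ONLY_TOOLS = {
    "pods_list", "pods_list_in_namespace", "pods_get", "pods_log", "pods_top",
    "resources_list", "resources_get", "namespaces_list", "events_list",
    "nodes_top", "nodes_stats_summary", "nodes_log",
    "configuration_contexts_list", "configuration_view",
}
MUTATING_TOOLS = {
    "resources_create_or_update", "resources_delete", "pods_exec", "pods_run",
}


def classify(tool: str, args: dict) -> str:
    """Return "fast-path" for read-only calls and "phased" for mutations."""
    if tool == "resources_scale":
        # Without a scale parameter the tool only reads the current replica count.
        return "phased" if args.get("scale") is not None else "fast-path"
    if tool in READ_ONLY_TOOLS:
        return "fast-path"
    if tool in MUTATING_TOOLS:
        return "phased"
    raise ValueError(f"unknown tool: {tool}")
```

Note that `resources_scale` appears on both paths: the presence of the scale parameter, not the tool name, decides whether the call is a mutation.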

Cluster Health Check

For comprehensive cluster assessments, the agent uses the cluster-health-check MCP prompt, which automatically runs multiple read-only tools to provide a holistic health overview.


State-Modifying Operations (Phased Workflow)

For mutations (create, update, scale, delete, exec, run pod), the agent follows a mandatory 4-phase workflow:

Phase 1: Discovery

  1. If the task description provides resource kind, name, namespace, and action → skip to Planning
  2. Otherwise: check /memories/k8s-operator/operations-log.md for recent operations context
  3. Only as last resort: call request_human_input for missing parameters
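
A minimal sketch of this fallback order, under stated assumptions: `recent_ops` is a hypothetical, already-parsed form of the operations log (the log's file format is not described here), and the rule that the action is never inferred from the log is a choice made for the example rather than documented behavior.

```python
REQUIRED = ("kind", "name", "namespace", "action")
# Assumption: only the resource identity may be filled from the log, never the action.
FROM_LOG = ("kind", "name", "namespace")


def discover(task: dict, recent_ops: list) -> tuple:
    """Return (params, missing). A non-empty `missing` means asking the human."""
    # Step 1: take whatever the task description already provides.
    params = {k: task[k] for k in REQUIRED if task.get(k)}
    # Step 2: consult logged operations, newest last in `recent_ops`.
    for op in reversed(recent_ops):
        # Use only an entry that agrees with everything the task already states.
        if all(op[k] == params[k] for k in FROM_LOG if k in params and k in op):
            for k in FROM_LOG:
                if k not in params and op.get(k):
                    params[k] = op[k]
            break
    # Step 3: anything still missing would go to request_human_input.
    missing = [k for k in REQUIRED if k not in params]
    return params, missing
```

When `missing` comes back empty after step 1, the log is effectively irrelevant and the flow can move straight to Planning.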

Phase 2: Planning (Mandatory)

Always read before write. Call the read variant first to capture current state, then present a clear action plan:

| Operation | HITL Phase | Plan Content |
| --- | --- | --- |
| Create/Update | create_update_plan_review | Kind, Name, Namespace, YAML preview, Impact |
| Delete | deletion_plan_review | Kind, Name, Namespace, what will be removed |
| Scale | scale_plan_review | Kind, Name, Current replicas → Target replicas |
| Exec | exec_approval | Pod name, Container, Command, Namespace (⚠️ grants shell-level access) |
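
As an illustration of "read before write", the sketch below assembles a scale plan from a replica count that was read first. The HITL phase names are the ones in the table above; the operation keys, the `build_scale_plan` function, and the shape of the returned dictionary are hypothetical.

```python
# HITL phase names from the table above; the operation keys are illustrative labels.
HITL_PHASE = {
    "create_or_update": "create_update_plan_review",
    "delete": "deletion_plan_review",
    "scale": "scale_plan_review",
    "exec": "exec_approval",
}


def build_scale_plan(kind: str, name: str, namespace: str,
                     current: int, target: int) -> dict:
    """Build a reviewable plan; `current` must come from a prior read call."""
    return {
        "hitl_phase": HITL_PHASE["scale"],
        "kind": kind,
        "name": name,
        "namespace": namespace,
        "summary": f"{kind}/{name} in {namespace}: {current} -> {target} replicas",
    }
```

For example, `build_scale_plan("Deployment", "nginx", "prod", 2, 5)` yields a plan whose summary states both the current and the target replica count, which is what the reviewer needs to judge the change.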

Phase 3: Execution

Tools are gated by HumanInTheLoopMiddleware as a background safety net.

| Operation | Tool |
| --- | --- |
| Create or update resource | resources_create_or_update |
| Delete resource | resources_delete |
| Scale deployment | resources_scale (with scale param) |
| Execute in pod | pods_exec |
| Run debug pod | pods_run |

Phase 4: Verification

After every mutation, the agent re-reads the resource to confirm the change took effect:

| After... | Verify with... | Success Criteria |
| --- | --- | --- |
| Scale | resources_scale (read) | readyReplicas matches target |
| Delete | resources_get | Resource returns 404 |
| Create/Update | resources_get | Resource exists with expected spec |
| Exec | Command output | Expected output received |

Caution: The agent never declares success based solely on tool stdout. It always runs a verification query.
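
The success criteria can be expressed as a small predicate over the result of the follow-up read. This is a sketch under assumptions: the `observed` dictionary (with `readyReplicas`, `status_code`, and `spec` keys) is a hypothetical normalized view of the read result, not the MCP server's actual response format.

```python
def verify(operation: str, observed: dict, expected: dict) -> bool:
    """Check the re-read state against the success criteria for one operation."""
    if operation == "scale":
        return observed.get("readyReplicas") == expected["replicas"]
    if operation == "delete":
        # The resource is gone when the read reports "not found".
        return observed.get("status_code") == 404
    if operation == "create_or_update":
        if "spec" not in observed:
            return False  # the resource does not exist
        # Every expected field must be present with the expected value.
        return all(observed["spec"].get(k) == v for k, v in expected["spec"].items())
    raise ValueError(f"no verification rule for: {operation}")
```

A scale that was accepted but has not yet rolled out returns `False` here, which is the case the caution above is guarding against.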


Key Capabilities

Automated Root Cause Analysis

Ask the agent to "debug my failing pod" and it will:

  1. Pull exit codes and container status
  2. Scan previous container logs (--previous)
  3. Check cluster events for the pod/namespace
  4. Correlate OOMKilled with memory limits vs live pods_top stats
  5. Propose a fix (e.g., "Increase memory limit by 20%")
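
Steps 4 and 5 amount to comparing a memory limit against observed usage and deriving a proposal. The sketch below assumes the termination reason, limit, and usage have already been collected as strings; `diagnose_oom` and its return shape are hypothetical, only a subset of Kubernetes quantity suffixes is handled, and the 20% increase simply mirrors the example in step 5.

```python
# Binary suffixes used by Kubernetes resource quantities (subset).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
MI = 2**20


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as "512Mi" to bytes; bare integers are bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def diagnose_oom(reason: str, memory_limit: str, observed_usage: str):
    """Hypothetical helper: relate an OOMKilled exit to the limit and live usage."""
    if reason != "OOMKilled":
        return None
    limit = to_bytes(memory_limit)
    usage = to_bytes(observed_usage)
    # Raise the limit by 20% (the example figure above), rounded up to whole Mi.
    suggested_mi = (limit * 12 + 10 * MI - 1) // (10 * MI)
    return {
        "cause": "OOMKilled",
        "usage_ratio": usage / limit,
        "suggested_limit": f"{suggested_mi}Mi",
    }
```

A usage ratio close to 1.0 on surviving replicas supports the conclusion that the limit is too tight, while a low ratio suggests a spike or leak worth investigating before raising anything.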

Safe Exec & Ephemeral Debugging

  • Pod exec: Run commands inside running containers with explicit approval
  • Debug pods: Spin up temporary debug containers (like netshoot or busybox) to test DNS and connectivity
  • All exec operations require HITL approval

Multi-Cluster Context Switching

  • List available kubeconfig contexts across clusters
  • Switch contexts for cross-cluster operations
  • Maintains awareness of current context throughout the session

RBAC & Security Inspection

  • Query Roles, RoleBindings, ClusterRoles, and ClusterRoleBindings
  • Audit ServiceAccount permissions
  • Inspect who has access to what resources

Resource Pressure Investigation

  • Correlate pod memory limits with live nodes_top and pods_top stats
  • Identify scheduling bottlenecks and quota limits
  • Deep-dive into Pod/Node descriptions for resource issues

Idempotency Rules

The agent never creates a resource without first checking if it already exists:

| Before creating... | First check with... | If exists... |
| --- | --- | --- |
| Any resource | resources_get(apiVersion, kind, name, namespace) | Use resources_create_or_update (upsert) and warn about overwrite |
| Pod (via pods_run) | pods_get(name, namespace) | Report existing pod; do NOT create duplicate |
| Scale target | resources_scale (without scale param) | Read current replicas, confirm new target with user |
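
The pod rule can be sketched as a check-then-create helper. The `pods_get` and `pods_run` arguments below are injected callables standing in for the MCP tools of the same name; their keyword arguments and the convention that a missing pod is returned as `None` are assumptions for the example, not the tools' documented behavior.

```python
def ensure_pod(name: str, namespace: str, pods_get, pods_run) -> dict:
    """Create a debug pod only if one with the same name does not already exist."""
    existing = pods_get(name=name, namespace=namespace)
    if existing is not None:
        # Report the existing pod instead of creating a duplicate.
        return {"created": False, "pod": existing}
    return {"created": True, "pod": pods_run(name=name, namespace=namespace)}
```

Calling the helper twice with the same name creates the pod once and reports it the second time, so a retried request does not leave stray debug pods behind.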

Skills Reference

| Skill Directory | Sub-Agent | Purpose |
| --- | --- | --- |
| k8s-operator/kubernetes-cluster-ops | k8s-cluster-ops | Cluster operations workflow patterns, safety rules, debugging playbooks |

MCP Integration

| MCP Server | Package / Command | Transport | Used By |
| --- | --- | --- | --- |
| kubernetes_mcp_server | npx kubernetes-mcp-server@latest | stdio | k8s-cluster-ops |

Info: The Kubernetes MCP server is the only third-party MCP server in k8s-autopilot (installed via npx). All other MCP servers are TalkOps-native PyPI packages.


Cross-Domain Integration

The K8s Operator frequently receives cross-domain handoffs:

| Scenario | Source Domain | Context Received |
| --- | --- | --- |
| "Investigate pod crashes after ArgoCD sync failed" | App Operator | App name, namespace, sync status |
| "Check pod resource usage — 5 alerts firing" | Observability | Alert details, affected services |
| "Verify Helm release pods are healthy" | Helm Operator | Release name, namespace, expected pods |

When the K8s Operator encounters requests outside its scope (e.g., ArgoCD operations, Prometheus queries), it returns a structured handoff signal for the Supervisor to re-route.
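
The actual format of the handoff signal is not specified here, so the following is only a guess at what a structured signal might carry. The `Handoff` dataclass, its field names, the domain keys, and `handoff_for` are all hypothetical; the operator names are the ones listed under Scope Boundaries.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Operator names from the Scope Boundaries list; the domain keys are illustrative.
OWNERS = {
    "helm": "Helm Operator",
    "argocd": "App Operator",
    "argo-rollouts": "App Operator",
    "prometheus": "Observability Operator",
    "alertmanager": "Observability Operator",
}


@dataclass
class Handoff:
    target: str    # operator that should take over
    reason: str    # why the K8s Operator declined
    context: dict  # what is already known, passed along for the next operator


def handoff_for(domain: str, request: str, context: dict) -> Optional[Handoff]:
    """Return a handoff for out-of-scope domains, or None for raw Kubernetes work."""
    target = OWNERS.get(domain)
    if target is None:
        return None
    return Handoff(target=target, reason=f"out of scope: {request}", context=context)
```

Serializing with `asdict` gives the Supervisor a plain dictionary to route on, and forwarding the context spares the next operator from rediscovering the namespace or resource name.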


Safety & Governance

Scope Boundaries

The K8s Operator handles raw Kubernetes objects only. It will immediately defer to the appropriate operator for:

  • Helm chart operations → Helm Operator
  • ArgoCD/Argo Rollouts → App Operator
  • Prometheus/Alertmanager → Observability Operator

HITL Gates

Every state-modifying operation (create, update, delete, scale, exec) requires explicit user approval. The HumanInTheLoopMiddleware provides a mechanical safety backstop.

[PLAN-LOCKED] Execution

When the coordinator has already obtained approval, the sub-agent skips Phase 2 planning and executes the pre-approved parameters directly.
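
One way to picture the effect of a plan lock on the phased workflow is the control-flow sketch below. It is an assumption-laden illustration: `run_mutation`, the injected `approve`, `execute`, and `verify` callables, and the trace labels are invented for the example and do not describe the real middleware.

```python
def run_mutation(params: dict, *, plan_locked: bool, approve, execute, verify) -> list:
    """Run one mutation and return the ordered list of steps that were taken."""
    trace = []
    if not plan_locked:
        # Normal path: present the plan and wait for explicit approval.
        trace.append("planning")
        if not approve(params):
            trace.append("rejected")
            return trace
    # Plan-locked path starts here: approval was obtained by the coordinator.
    trace.append("execution")
    execute(params)
    trace.append("verification")
    trace.append("verified" if verify(params) else "unverified")
    return trace
```

In this sketch verification still runs on the plan-locked path: the lock removes the second approval prompt, not the post-change check.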