☸️ K8s Operator

Manages raw Kubernetes cluster operations — pod debugging, resource CRUD, scaling, ephemeral debugging, node diagnostics, and multi-cluster context switching.

The K8s Operator coordinates a single highly capable sub-agent (k8s-cluster-ops) that connects to the Kubernetes MCP server. It serves as the "Level-1 SRE" for your cluster, handling everything from simple "list pods" queries to complex root cause analysis of CrashLoopBackOff situations.

Architecture

Dual-Path Architecture

Every request is classified as either read-only or state-modifying before any action is taken:

Read-Only Operations (Fast-Path)

For read-only queries, the agent calls a single tool, formats the output, and returns immediately — no files are read, no planning is performed.

Query Type	Tool	Example
List all pods (cluster-wide)	`pods_list`	"Show me all pods"
List pods in namespace	`pods_list_in_namespace`	"List pods in production"
Get pod details	`pods_get`	"Describe the checkout pod"
Pod logs	`pods_log`	"Show me logs for payment-service"
Pod resource usage	`pods_top`	"Which pods are using the most memory?"
List resources	`resources_list`	"List all deployments in staging"
Get resource details	`resources_get`	"Show me the nginx ingress"
List namespaces	`namespaces_list`	"What namespaces exist?"
List events	`events_list`	"Show recent cluster events"
Node resource usage	`nodes_top`	"How much CPU are the nodes using?"
Node stats	`nodes_stats_summary`	"Node health summary"
Node logs	`nodes_log`	"Show kubelet logs for node-1"
List kubeconfig contexts	`configuration_contexts_list`	"What clusters can I access?"
View kubeconfig	`configuration_view`	"Show current kubeconfig"
Check current replicas	`resources_scale` (without `scale` param)	"How many replicas does nginx have?"
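The routing rule implied by the table can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function name `is_fast_path` and the `params` dictionary shape are assumptions, and the tool names are copied from the table above.

```python
from typing import Optional

# Read-only tools, copied from the fast-path table above.
READ_ONLY_TOOLS = frozenset({
    "pods_list", "pods_list_in_namespace", "pods_get", "pods_log", "pods_top",
    "resources_list", "resources_get", "namespaces_list", "events_list",
    "nodes_top", "nodes_stats_summary", "nodes_log",
    "configuration_contexts_list", "configuration_view",
})


def is_fast_path(tool: str, params: Optional[dict] = None) -> bool:
    """Return True when a tool call is read-only (hypothetical helper)."""
    params = params or {}
    if tool == "resources_scale":
        # Without a `scale` value the tool only reports the current replicas.
        return params.get("scale") is None
    # Anything not listed as read-only is conservatively treated as a mutation.
    return tool in READ_ONLY_TOOLS
```

Treating unknown tools as mutations is a design choice in this sketch: a misclassified read costs an extra approval step, while a misclassified write would bypass review.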

Cluster Health Check

For comprehensive cluster assessments, the agent uses the cluster-health-check MCP prompt, which automatically runs multiple read-only tools to provide a holistic health overview.

State-Modifying Operations (Phased Workflow)

For mutations (create, update, scale, delete, exec, run pod), the agent follows a mandatory 4-phase workflow:

Phase 1: Discovery

If the task description provides resource kind, name, namespace, and action → skip to Planning
Otherwise: check /memories/k8s-operator/operations-log.md for recent operations context
Only as a last resort: call request_human_input for missing parameters
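The discovery order above (task description, then the operations log, then the user) could look roughly like this. The function `discover`, the two callables, and the idea that the log yields parameters as a dictionary are all assumptions made for illustration; `read_log` stands in for reading /memories/k8s-operator/operations-log.md and `ask_human` for request_human_input.

```python
from typing import Callable

REQUIRED = ("kind", "name", "namespace", "action")


def discover(task: dict,
             read_log: Callable[[], dict],
             ask_human: Callable[[list], dict]) -> dict:
    """Fill the four required parameters, trying the cheapest source first."""
    params = {k: task[k] for k in REQUIRED if task.get(k)}
    missing = [k for k in REQUIRED if k not in params]
    if not missing:
        return params  # everything was provided: skip straight to Planning

    recent = read_log()  # second choice: context from recent operations
    for key in missing:
        if recent.get(key):
            params[key] = recent[key]

    missing = [k for k in REQUIRED if k not in params]
    if missing:
        params.update(ask_human(missing))  # last resort: ask the user
    return params
```

Note that neither callable is invoked when the task description is already complete, which mirrors the "skip to Planning" rule.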

Phase 2: Planning (Mandatory)

Always read before write. Call the read variant first to capture current state, then present a clear action plan:

Operation	HITL Phase	Plan Content
Create/Update	`create_update_plan_review`	Kind, Name, Namespace, YAML preview, Impact
Delete	`deletion_plan_review`	Kind, Name, Namespace, what will be removed
Scale	`scale_plan_review`	Kind, Name, Current replicas → Target replicas
Exec	`exec_approval`	Pod name, Container, Command, Namespace — ⚠️ grants shell-level access
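As a rough sketch of how a plan might be assembled for review, the mapping below pairs each operation with the HITL phase from the table. The `build_plan` helper and the shape of the returned dictionary are hypothetical; only the phase names come from the table.

```python
# Operation -> HITL phase, copied from the table above.
HITL_PHASES = {
    "create": "create_update_plan_review",
    "update": "create_update_plan_review",
    "delete": "deletion_plan_review",
    "scale": "scale_plan_review",
    "exec": "exec_approval",
}


def build_plan(action, kind, name, namespace, current_state, **details):
    """Bundle the captured state and the proposed change for human review."""
    if action not in HITL_PHASES:
        raise ValueError(f"no review phase defined for action {action!r}")
    return {
        "hitl_phase": HITL_PHASES[action],
        "kind": kind,
        "name": name,
        "namespace": namespace,
        "current_state": current_state,  # captured by the read call before any write
        "details": details,              # e.g. target replicas or the exec command
    }


plan = build_plan("scale", "Deployment", "nginx", "production",
                  current_state={"replicas": 2}, target_replicas=5)
```

Here `plan["hitl_phase"]` is `"scale_plan_review"`, and the reviewer sees both the current and the target replica count before anything changes.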

Phase 3: Execution

Once the plan is approved, the agent calls the matching tool. Each of these tools is additionally gated by HumanInTheLoopMiddleware as a background safety net.

Operation	Tool
Create or update resource	`resources_create_or_update`
Delete resource	`resources_delete`
Scale deployment	`resources_scale` (with `scale` param)
Execute in pod	`pods_exec`
Run debug pod	`pods_run`

Phase 4: Verification

After every mutation, the agent re-reads the resource to confirm the change took effect:

After...	Verify with...	Success Criteria
Scale	`resources_scale` (read)	`readyReplicas` matches target
Delete	`resources_get`	Resource returns 404
Create/Update	`resources_get`	Resource exists with expected spec
Exec	Command output	Expected output received

Caution: The agent never declares success based solely on tool stdout. It always runs a verification query.
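The success criteria in the verification table can be expressed as a small check. This is a sketch under assumptions: `verify` is a hypothetical helper, and `observed` is an invented dictionary shape (`status`, `readyReplicas`, `spec`, `output`) standing in for whatever the re-read tool call actually returns.

```python
def verify(action, observed, expected=None):
    """Compare the re-read state against the criteria in the table above."""
    if action == "scale":
        # Success only when the ready count has reached the target.
        return observed.get("readyReplicas") == expected
    if action == "delete":
        # The follow-up read should no longer find the resource.
        return observed.get("status") == 404
    if action in ("create", "update"):
        spec = observed.get("spec") or {}
        wanted = expected or {}
        return observed.get("status") == 200 and all(
            spec.get(key) == value for key, value in wanted.items())
    if action == "exec":
        return expected is not None and expected in observed.get("output", "")
    raise ValueError(f"no verification rule for action {action!r}")
```

For example, `verify("scale", {"readyReplicas": 3}, 5)` is false while the rollout is still in progress, so the agent would not report success yet.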

Key Capabilities

Automated Root Cause Analysis

Ask the agent to "debug my failing pod" and it will:

Pull exit codes and container status
Scan previous container logs (--previous)
Check cluster events for the pod/namespace
Correlate OOMKilled with memory limits vs live pods_top stats
Propose a fix (e.g., "Increase memory limit by 20%")
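The OOMKilled correlation step might be approximated as below. The function `diagnose_oom` and its inputs in MiB are assumptions for illustration; the `lastState.terminated.reason` and `exitCode` fields follow the standard Kubernetes container status layout, and the 20% headroom echoes the example fix above rather than a documented default.

```python
import math
from typing import Optional


def diagnose_oom(container_status: dict, limit_mib: float, usage_mib: float,
                 headroom: float = 0.20) -> Optional[str]:
    """Return a suggested fix if the last termination was an OOM kill."""
    terminated = (container_status.get("lastState") or {}).get("terminated") or {}
    if terminated.get("reason") != "OOMKilled":
        return None  # some other failure: this check does not apply
    proposed = math.ceil(limit_mib * (1 + headroom))
    return (f"OOMKilled (exit code {terminated.get('exitCode')}): "
            f"live usage {usage_mib} MiB against a {limit_mib} MiB limit; "
            f"consider raising the limit to about {proposed} MiB.")


status = {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
suggestion = diagnose_oom(status, limit_mib=512, usage_mib=498)
```

In practice the limit would come from the pod spec and the usage figure from pods_top; both are passed in directly here to keep the sketch self-contained.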

Safe Exec & Ephemeral Debugging

Pod exec: Run commands inside running containers with explicit approval
Debug pods: Spin up temporary debug containers (like netshoot or busybox) to test DNS and connectivity
All exec operations require HITL approval

Multi-Cluster Context Switching

List available kubeconfig contexts across clusters
Switch contexts for cross-cluster operations
Maintains awareness of current context throughout the session

RBAC & Security Inspection

Query Roles, RoleBindings, ClusterRoles, and ClusterRoleBindings
Audit ServiceAccount permissions
Inspect who has access to what resources

Resource Pressure Investigation

Correlate pod memory limits with live nodes_top and pods_top stats
Identify scheduling bottlenecks and quota limits
Deep-dive into Pod/Node descriptions for resource issues

Idempotency Rules

The agent never creates a resource without first checking if it already exists:

Before creating...	First check with...	If exists...
Any resource	`resources_get(apiVersion, kind, name, namespace)`	Use `resources_create_or_update` (upsert) — warn about overwrite
Pod (via `pods_run`)	`pods_get(name, namespace)`	Report existing pod — do NOT create duplicate
Scale target	`resources_scale` (without scale param)	Read current replicas, confirm new target with user
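The check-before-create rule for generic resources can be illustrated with an in-memory stand-in. `FakeClient` and `ensure_resource` are hypothetical; the fake only mimics the tool names and the `resources_get(apiVersion, kind, name, namespace)` argument order shown in the table, and it assumes a missing resource is reported as `None`.

```python
class FakeClient:
    """In-memory stand-in for the MCP tools; not the real server API."""

    def __init__(self):
        self.store = {}

    def resources_get(self, apiVersion, kind, name, namespace):
        return self.store.get((apiVersion, kind, namespace, name))

    def resources_create_or_update(self, manifest):
        meta = manifest["metadata"]
        key = (manifest["apiVersion"], manifest["kind"],
               meta.get("namespace"), meta["name"])
        self.store[key] = manifest
        return manifest


def ensure_resource(client, manifest):
    """Look for an existing object, then upsert; return (object, warning)."""
    meta = manifest["metadata"]
    existing = client.resources_get(manifest["apiVersion"], manifest["kind"],
                                    meta["name"], meta.get("namespace"))
    warning = None
    if existing is not None:
        # The upsert will replace the current object, so surface that first.
        warning = f"{manifest['kind']}/{meta['name']} already exists and will be overwritten"
    return client.resources_create_or_update(manifest), warning
```

Calling `ensure_resource` twice with the same manifest yields no warning the first time and an overwrite warning the second time, which is the point at which the agent would alert the user.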

Skills Reference

Skill Directory	Sub-Agent	Purpose
`k8s-operator/kubernetes-cluster-ops`	k8s-cluster-ops	Cluster operations workflow patterns, safety rules, debugging playbooks

MCP Integration

MCP Server	Package / Command	Transport	Used By
`kubernetes_mcp_server`	`npx kubernetes-mcp-server@latest`	stdio	`k8s-cluster-ops`

Info: The Kubernetes MCP server is the only third-party MCP server in k8s-autopilot (installed via npx). All other MCP servers are TalkOps-native PyPI packages.

Cross-Domain Integration

The K8s Operator frequently receives cross-domain handoffs:

Scenario	Source Domain	Context Received
"Investigate pod crashes after ArgoCD sync failed"	App Operator	App name, namespace, sync status
"Check pod resource usage — 5 alerts firing"	Observability	Alert details, affected services
"Verify Helm release pods are healthy"	Helm Operator	Release name, namespace, expected pods

When the K8s Operator encounters requests outside its scope (e.g., ArgoCD operations, Prometheus queries), it returns a structured handoff signal for the Supervisor to re-route.

Safety & Governance

Scope Boundaries

The K8s Operator handles raw Kubernetes objects only. It will immediately defer to the appropriate operator for:

Helm chart operations → Helm Operator
ArgoCD/Argo Rollouts → App Operator
Prometheus/Alertmanager → Observability Operator
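The deferral rules above, combined with the structured handoff signal mentioned under Cross-Domain Integration, might be modeled like this. The `Handoff` dataclass, its fields, the domain keys, and the `handoff_for` helper are all assumptions; the source does not specify the signal's actual schema, and the sketch presumes an upstream step has already identified the request's domain.

```python
from dataclasses import dataclass, field
from typing import Optional

# Out-of-scope domain -> owning operator, from the list above (keys are illustrative).
OWNERS = {
    "helm": "Helm Operator",
    "argocd": "App Operator",
    "argo-rollouts": "App Operator",
    "prometheus": "Observability Operator",
    "alertmanager": "Observability Operator",
}


@dataclass
class Handoff:
    """Hypothetical handoff signal returned to the Supervisor."""
    target: str
    reason: str
    context: dict = field(default_factory=dict)


def handoff_for(domain: str, reason: str, context: Optional[dict] = None) -> Optional[Handoff]:
    """Return a handoff for out-of-scope domains, or None to handle locally."""
    target = OWNERS.get(domain)
    if target is None:
        return None  # raw Kubernetes objects stay with the K8s Operator
    return Handoff(target=target, reason=reason, context=dict(context or {}))
```

A request tagged with the `"argocd"` domain would produce a `Handoff` targeting the App Operator, while a request about plain Kubernetes objects returns `None` and is handled in place.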

HITL Gates

Every state-modifying operation (create, update, delete, scale, exec) requires explicit user approval. The HumanInTheLoopMiddleware provides a mechanical safety backstop.

`[PLAN-LOCKED]` Execution

When the coordinator has already obtained approval, the sub-agent skips Phase 2 planning, since the plan was reviewed upstream, and executes the pre-approved parameters directly.
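A minimal sketch of this shortcut follows. It assumes the marker arrives as a literal `[PLAN-LOCKED]` prefix on the task text, which the source does not confirm; the `phases_for` helper and the phase labels are likewise illustrative.

```python
PLAN_LOCKED = "[PLAN-LOCKED]"


def phases_for(task: str) -> list:
    """Return the workflow phases to run for a state-modifying task."""
    if task.lstrip().startswith(PLAN_LOCKED):
        # Approval was obtained upstream, so the planning review is skipped.
        return ["discovery", "execution", "verification"]
    return ["discovery", "planning", "execution", "verification"]
```

Discovery is kept in both paths in this sketch because it returns immediately when the pre-approved parameters are complete, and verification still runs after every mutation.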
