
Architecture & Components

k8s-autopilot is built on the production-grade Deep Agent pattern — a multi-tier hierarchy of agents, MCP servers, and HITL gates. This page covers the architectural components that power the system. For domain-specific capabilities, see the Capabilities pages.


1. 🎯 Supervisor Agent

The Supervisor Agent is a pure router that delegates ALL Kubernetes infrastructure requests to the appropriate domain coordinator. It never performs operations directly.

Routing Table

| Request Type | Tool | Target Coordinator |
|---|---|---|
| Helm chart generation/update | transfer_to_helm_operator | Helm Operator Coordinator |
| K8s cluster ops (pods, scaling, exec) | transfer_to_k8s_operator | K8s Operator Coordinator |
| ArgoCD / Argo Rollouts / Traefik | transfer_to_app_operator | App Operator Coordinator |
| Prometheus / Alertmanager | transfer_to_observability_operator | Observability Coordinator |
| Clarification / out-of-scope | request_human_feedback | User |
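
As a rough illustration of the routing table above, the sketch below picks a transfer tool from keywords in the request. The tool names come from the table; the keyword lists and the `pick_tool` helper are hypothetical, and the Supervisor itself presumably relies on an LLM rather than substring matching.

```python
# Keyword groups are assumptions derived from the "Request Type" column;
# only the tool names are taken from the routing table.
ROUTES: list[tuple[tuple[str, ...], str]] = [
    (("helm", "chart"), "transfer_to_helm_operator"),
    (("pod", "scaling", "scale", "exec"), "transfer_to_k8s_operator"),
    (("argocd", "rollout", "traefik"), "transfer_to_app_operator"),
    (("prometheus", "alertmanager"), "transfer_to_observability_operator"),
]
FALLBACK_TOOL = "request_human_feedback"  # clarification / out-of-scope


def pick_tool(request: str) -> str:
    """Return the first tool whose keyword group matches the request."""
    text = request.lower()
    for keywords, tool in ROUTES:
        if any(keyword in text for keyword in keywords):
            return tool
    # Nothing matched: ask the user instead of guessing a coordinator.
    return FALLBACK_TOOL
```

Note that the function only ever returns a tool name. It never touches the cluster, which mirrors the "pure router" role described above.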

Natural Language Mapping

The Supervisor translates non-technical language into domain-aware routing:

| User Says | Maps To | Coordinator |
|---|---|---|
| "deploy", "ship", "release" | ArgoCD sync | App Operator |
| "zero downtime", "gradual" | Argo Rollouts canary/blue-green | App Operator |
| "split traffic", "A/B test" | Traefik weighted routing | App Operator |
| "scale up", "more capacity" | K8s scaling | K8s Operator |
| "what's firing", "on-call" | Alert triage | Observability |
| "silence", "mute" | Create silence | Observability |
| "metrics", "PromQL" | Prometheus query | Observability |

Context Engineering

The Supervisor uses a 3-layer middleware stack to maintain routing accuracy across long sessions:

| Middleware | Purpose |
|---|---|
| SupervisorContextMiddleware | Re-injects accumulated domain summaries as a SystemMessage before every model call, so cross-domain awareness survives summarization |
| SummarizationMiddleware | Auto-compresses conversation history when it exceeds ~75% of the context budget (default: 4000 tokens), keeping only the last 6 messages |
| ModelCallLimitMiddleware | Caps model calls at 15 per turn to prevent runaway routing loops |
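
The behaviour of the summarization and call-limit layers can be sketched in plain Python. This is not the project's middleware code: the function and class names are hypothetical, the token estimate is a crude placeholder, and the sketch assumes that 4000 tokens is the budget and 75% of it is the trigger.

```python
from collections.abc import Callable

CONTEXT_BUDGET_TOKENS = 4000  # assumed to be the budget, per the table above
SUMMARIZE_AT = 0.75           # compress once ~75% of the budget is used
KEEP_LAST = 6                 # messages kept verbatim after compression
MAX_MODEL_CALLS = 15          # per-turn cap


def estimate_tokens(message: str) -> int:
    # Placeholder heuristic (about 4 characters per token), not a tokenizer.
    return max(1, len(message) // 4)


def compress(history: list[str], summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace older messages with one summary once the threshold is crossed."""
    used = sum(estimate_tokens(m) for m in history)
    if used <= CONTEXT_BUDGET_TOKENS * SUMMARIZE_AT or len(history) <= KEEP_LAST:
        return history
    older, recent = history[:-KEEP_LAST], history[-KEEP_LAST:]
    return [summarize(older)] + recent


class ModelCallLimiter:
    """Raises once more than MAX_MODEL_CALLS calls are attempted in a turn."""

    def __init__(self, limit: int = MAX_MODEL_CALLS) -> None:
        self.limit = limit
        self.calls = 0

    def before_model_call(self) -> None:
        if self.calls >= self.limit:
            raise RuntimeError(f"model call limit of {self.limit} reached")
        self.calls += 1
```

In this sketch `summarize` is injected by the caller, so the compression rule can be exercised without an LLM.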

2. 🧩 Domain Coordinators

Each coordinator is a Deep Agent — a LangGraph-based orchestrator that manages its own team of sub-agents. Coordinators handle:

  • Intent extraction: Translating user requests into DevOps-aware parameters
  • Sub-agent delegation: Routing to the correct sub-agent with [PLAN-LOCKED] context
  • HITL orchestration: Presenting plans and collecting approval before delegating execution
  • Operations journaling: Logging every operation for context persistence

| Coordinator | Sub-Agents | Capabilities Page |
|---|---|---|
| helm-operator-coordinator | 7 (planner, skill-builder, generator, validator, updater, operation, github) | Helm Operator |
| app-operator-coordinator | 3 (argocd-onboarder, argo-rollouts-onboarder, traefik-edge-router) | App Operator |
| k8s-operator-coordinator | 1 (k8s-cluster-ops) | K8s Operator |
| observability-coordinator | 2 (prometheus-operator, alertmanager-operator) | Observability |

3. 🔌 JIT MCP Connections

Sub-agents that interact with external systems use a Just-In-Time (JIT) connection pattern. Instead of holding open connections to all 8 MCP servers for the entire session, each sub-agent is wrapped in a CompiledSubAgent that only opens its MCP connection when that specific node is executed. The connection is closed immediately after the sub-agent completes.
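
The open-on-entry, close-on-exit behaviour described above maps naturally onto a context manager. The sketch below is a minimal stand-in, not the CompiledSubAgent implementation: `FakeMCPConnection`, `jit_connection`, `run_sub_agent`, and the `list_releases` tool name in the usage note are all hypothetical.

```python
from collections.abc import Iterator
from contextlib import contextmanager

EVENTS: list[str] = []  # records open/close order for demonstration


class FakeMCPConnection:
    """Stand-in for a real MCP client session (hypothetical)."""

    def __init__(self, server: str) -> None:
        self.server = server
        self.is_open = True
        EVENTS.append(f"open {server}")

    def call_tool(self, tool: str) -> str:
        if not self.is_open:
            raise RuntimeError("connection is closed")
        return f"{self.server}:{tool}"

    def close(self) -> None:
        self.is_open = False
        EVENTS.append(f"close {self.server}")


@contextmanager
def jit_connection(server: str) -> Iterator[FakeMCPConnection]:
    # The connection is created only when the sub-agent node actually runs...
    conn = FakeMCPConnection(server)
    try:
        yield conn
    finally:
        # ...and is closed as soon as the node finishes, even on error.
        conn.close()


def run_sub_agent(server: str, tool: str) -> str:
    with jit_connection(server) as conn:
        return conn.call_tool(tool)
```

Calling `run_sub_agent("helm_mcp_server", "list_releases")` opens and closes exactly one connection, so idle sub-agents hold no connection at all.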

Connection Types

| Type | Description | Used By |
|---|---|---|
| JIT MCP | Opens MCP connection lazily, closes after execution | All MCP-connected sub-agents |
| Static Dict | Simple dict spec, no MCP; uses filesystem and in-memory tools | helm-skill-builder, helm-generator, helm-updater, helm-validator |
| Compiled Subgraph | LangGraph subgraph with its own internal supervisor | helm-planner |

4. 🛡️ Human-in-the-Loop Governance

AI shouldn't arbitrarily execute state-modifying operations on your cluster. k8s-autopilot enforces strict HITL governance at multiple layers.

The [PLAN-LOCKED] Delegation Protocol

Every state-modifying operation follows a mandatory Intent → Plan → Approve → Execute lifecycle:

  1. Intent Extraction: The coordinator translates the user's request into DevOps-aware parameters
  2. Plan Presentation: A structured plan is presented via request_user_input with action details, resource names, namespaces, and impact assessment
  3. User Approval: The LangGraph execution pauses (interrupt()) and the UI renders an approval card
  4. [PLAN-LOCKED] Execution: After approval, the coordinator delegates to the sub-agent with the [PLAN-LOCKED] prefix — the sub-agent skips its own planning phase and executes pre-approved parameters directly
  5. Verification: The sub-agent confirms the operation's success independently
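
The five steps above can be sketched as a single function. This is an illustration only: the `Plan` fields follow the plan contents listed in step 2, while the callbacks stand in for request_user_input / interrupt() (`approve`), sub-agent delegation (`execute`), and the verification step (`verify`). All of these names are hypothetical.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    action: str
    resource: str
    namespace: str
    impact: str


def run_state_modifying_operation(
    plan: Plan,
    approve: Callable[[Plan], bool],
    execute: Callable[[str], None],
    verify: Callable[[Plan], bool],
) -> str:
    # Steps 2-3: present the plan and wait for the user's decision.
    if not approve(plan):
        return "rejected"
    # Step 4: delegate with the [PLAN-LOCKED] prefix so the sub-agent
    # skips its own planning and uses the approved parameters as-is.
    task = f"[PLAN-LOCKED] {plan.action} {plan.resource} in namespace {plan.namespace}"
    execute(task)
    # Step 5: confirm the outcome independently of the execution call.
    return "verified" if verify(plan) else "verification_failed"
```

Nothing reaches `execute` unless `approve` returns True, which is the property the protocol is meant to guarantee.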

Operation Classification

| Operation Type | Approval Required | Example |
|---|---|---|
| Read-Only | No (instant execution) | "List pods", "check sync status", "query metrics" |
| State-Modifying | Yes (full HITL pipeline) | "Deploy app", "scale deployment", "create silence" |
| Commit Gates | Yes (explicit confirmation) | "Push chart to GitHub", "sync ArgoCD app" |

Rejection Protocol

If a user rejects a plan:

  • The agent does not retry autonomously with modified parameters
  • It asks the user what to adjust
  • Maximum of 2 plan presentations per request before asking the user to rephrase

5. 🔄 Cross-Domain Handoff Protocol

When a coordinator determines that a user's request belongs to a different domain, it emits a structured signal:

"This is outside my scope. Please use the appropriate operator.
User Request: [The user's specific request]
Context: [What was previously discovered]"

The Supervisor detects this via pattern matching, extracts the structured context, and immediately re-routes to the correct coordinator with a [CROSS-DOMAIN] prefix — injecting the prior coordinator's findings:

[CROSS-DOMAIN] Source: observability. 
Prior findings: 5 critical alerts for checkout service.
User Request: Check pod status for checkout service

This "blackboard pattern" enables seamless multi-domain investigations without the user repeating themselves.


6. 🧠 Skills, Memory & Context Engineering

k8s-autopilot maintains persistence and context awareness using a multi-layered virtual filesystem backed by a CompositeBackend:

| Virtual Path | Backend | Purpose |
|---|---|---|
| /skills/ | StateBackend (LangGraph state) | Operational workflow instructions loaded by sub-agents |
| /memories/ | StoreBackend (InMemoryStore, org-scoped) | Governance files and operations journals |
| /workspace/ | FilesystemBackend (real disk) | Generated chart files, synced via sync_workspace_to_disk |
| /shared/ | StoreBackend (shared namespace) | Cross-domain shared context |
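
Conceptually, a composite backend of this kind dispatches on the path prefix. The sketch below shows that idea with a plain dictionary; it does not use the real CompositeBackend API, and the `resolve` helper and example file paths are hypothetical.

```python
# Prefixes and backend names are taken from the table above.
PREFIX_TO_BACKEND: dict[str, str] = {
    "/skills/": "StateBackend",
    "/memories/": "StoreBackend (org-scoped)",
    "/workspace/": "FilesystemBackend",
    "/shared/": "StoreBackend (shared namespace)",
}


def resolve(path: str) -> tuple[str, str]:
    """Return (backend name, path relative to that backend's root)."""
    # Longest prefix first, so nested prefixes would still resolve correctly.
    for prefix in sorted(PREFIX_TO_BACKEND, key=len, reverse=True):
        if path.startswith(prefix):
            return PREFIX_TO_BACKEND[prefix], path[len(prefix):]
    raise ValueError(f"no backend mounted for {path!r}")
```

For example, `resolve("/workspace/charts/Chart.yaml")` yields the filesystem backend and the relative path `charts/Chart.yaml`.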

Skills

Skills are strict operational playbooks that dictate exactly how each sub-agent must interact with MCP servers. Each skill directory contains:

  • SKILL.md — YAML frontmatter + step-by-step workflow
  • references/ — Domain-specific patterns and templates

| Domain | Skill Directories | Sub-Agents |
|---|---|---|
| Helm | 6 directories (generator, skill-builder, operation, validator, updater, github-agent) | 6 sub-agents |
| App | 3 directories (argocd-gitops, argo-rollouts-gitops, traefik-edge-routing) | 3 sub-agents |
| K8s | 1 directory (kubernetes-cluster-ops) | 1 sub-agent |
| Observability | 2 directories (prometheus, alertmanager) | 2 sub-agents |

Memory

Each domain maintains two static governance files pre-seeded at startup:

  • AGENTS.md: Agent interaction patterns, HITL gate schemas, parameter completeness lookup tables
  • hitl-policies.md: Governance rules defining when HITL approval is mandatory vs. optional

Operations Journal

The operations journal (operations-log.md) is the primary mechanism for context persistence across conversation turns:

  1. Layer 1 — Tool writes: After every state-modifying operation, the coordinator logs a structured entry
  2. Layer 2 — Middleware re-injects: The OperationContextMiddleware reads the journal and re-injects it as a SystemMessage before every coordinator model call — surviving summarization
  3. Layer 3 — Prompt engineering: All prompts instruct the LLM to reference the journal for follow-up operations

Journals are capped at 20 entries with automatic trimming.
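
A bounded journal with automatic trimming can be sketched with `collections.deque`. The entry format and helper names below are hypothetical, and the real journal is a markdown file (operations-log.md) rather than an in-memory queue.

```python
from collections import deque

MAX_JOURNAL_ENTRIES = 20


def new_journal() -> deque[str]:
    # maxlen makes the deque drop the oldest entry automatically.
    return deque(maxlen=MAX_JOURNAL_ENTRIES)


def log_operation(journal: deque[str], action: str, resource: str, outcome: str) -> None:
    """Layer 1: append a structured entry after a state-modifying operation."""
    journal.append(f"- {action} {resource}: {outcome}")


def as_system_message(journal: deque[str]) -> str:
    """Layer 2: render the journal for re-injection before a model call."""
    return "Recent operations:\n" + "\n".join(journal)
```

Because the rendered journal is rebuilt from the stored entries on every call, it is unaffected by whatever the summarizer does to the message history.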


7. ⚙️ State Management

Each agent operates on its own dedicated state schema, optimized for its specific task:

| State Schema | Used By | Key Fields |
|---|---|---|
| MainSupervisorState | Supervisor | user_query, workflow_state, active_phase, domain_summaries, cross_domain_context |
| Coordinator States | Each coordinator | messages, user_query, domain-specific fields |
| Sub-Agent States | Each sub-agent | Inherited from coordinator via input_transform |

State Transformers

Data is not shared blindly. A StateTransformer middleware explicitly converts data when moving between the Supervisor and coordinators. This enforces context isolation: sub-agents see only what they need, which reduces the risk of hallucinations triggered by irrelevant history.
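
A minimal sketch of such a transformation, assuming the field names from the table above. The field types, the `CoordinatorState` shape, and the `to_coordinator_state` helper are hypothetical.

```python
from typing import TypedDict


class MainSupervisorState(TypedDict):
    user_query: str
    workflow_state: str
    active_phase: str
    domain_summaries: dict[str, str]
    cross_domain_context: str


class CoordinatorState(TypedDict):
    messages: list[str]
    user_query: str


def to_coordinator_state(state: MainSupervisorState) -> CoordinatorState:
    """Copy only what a coordinator needs; everything else stays behind."""
    messages: list[str] = []
    if state["cross_domain_context"]:
        messages.append(state["cross_domain_context"])
    # Supervisor-only fields (workflow_state, active_phase, domain_summaries)
    # are deliberately not forwarded.
    return {"messages": messages, "user_query": state["user_query"]}
```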

Persistence & Handoffs

  • Checkpointer: All states are persisted to PostgreSQL (or in-memory fallback). This enables long-running workflows where the user might take hours to approve a plan
  • Interrupts: When a HITL gate is triggered, the state is saved, execution stops, and the system waits. Upon approval, it resumes exactly where it left off
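
The save, stop, and resume cycle can be illustrated with the standard library alone. In the sketch below an in-memory SQLite database stands in for the PostgreSQL checkpointer, and the table layout, `pause_at_gate`, and `resume` are hypothetical; LangGraph's own checkpointer and interrupt mechanism handle this in the real system.

```python
import json
import sqlite3

# In-memory SQLite as a stand-in for the PostgreSQL checkpointer.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkpoints (thread_id TEXT PRIMARY KEY, state TEXT)")


def pause_at_gate(thread_id: str, state: dict) -> None:
    """Persist the state when a HITL gate fires, then stop executing."""
    db.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
        (thread_id, json.dumps(state)),
    )
    db.commit()


def resume(thread_id: str, approved: bool) -> dict:
    """Reload the saved state, possibly hours later, and continue from it."""
    row = db.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    if row is None:
        raise KeyError(thread_id)
    state = json.loads(row[0])
    state["status"] = "executing" if approved else "rejected"
    return state
```

Since the pending plan lives in the database rather than in process memory, the wait for approval can outlast the process that created it, provided the store is durable.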

8. 🛠 Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| Agent Framework | deepagents / LangGraph | State machine, orchestration, sub-graph routing |
| LLM Interface | LangChain Core | Tool execution, message schemas |
| Tools/Integrations | Model Context Protocol (MCP) | Standardized protocol for 8 external systems |
| User Interface | A2UI / TalkOps A2A | Real-time streaming, HITL approval cards, markdown rendering |
| Runtime | Python 3.12+ | Core agent backend |

MCP Servers

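k8s-autopilot connects to 8 MCP servers. As the table below shows, 6 are TalkOps-native servers distributed as PyPI packages over stdio, 1 is the GitHub server reached over HTTP, and 1 is a third-party server launched with npx: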

| MCP Server | Package / Command | Transport | Domain |
|---|---|---|---|
| helm_mcp_server | helm-mcp-server | stdio | Helm chart operations |
| argocd_mcp_server | argocd-mcp-server | stdio | ArgoCD GitOps |
| traefik_mcp_server | traefik-mcp-server | stdio | Traefik edge routing |
| argo_rollout_mcp_server | argo-rollout-mcp-server | stdio | Argo Rollouts progressive delivery |
| prometheus-mcp-server | prometheus-mcp-server | stdio | Prometheus monitoring & PromQL |
| alertmanager-mcp-server | alertmanager-mcp-server | stdio | Alertmanager alerting & silences |
| github_mcp | GitHub Copilot API | HTTP | GitHub file operations |
| kubernetes_mcp_server | npx kubernetes-mcp-server@latest | stdio | Raw Kubernetes cluster ops |

LLM Configuration

The agent uses a three-tier LLM configuration — different models for different jobs:

| Tier | Config Key | Default | Used By |
|---|---|---|---|
| Standard | LLM_MODEL | gpt-4o-mini | Fast parsing, repetitive tasks |
| Higher | LLM_HIGHER_MODEL | gpt-5-mini | Supervisor routing, HITL context |
| Deep Agent | LLM_DEEPAGENT_MODEL | o4-mini | Coordinators, complex code generation |
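
A small sketch of how the three tiers might be resolved, assuming the config keys are read from environment variables. That assumption, the tier labels, and the `model_for` helper are all hypothetical; only the keys and defaults come from the table.

```python
import os
from collections.abc import Mapping

# tier label -> (config key, default model)
TIERS: dict[str, tuple[str, str]] = {
    "standard": ("LLM_MODEL", "gpt-4o-mini"),
    "higher": ("LLM_HIGHER_MODEL", "gpt-5-mini"),
    "deep_agent": ("LLM_DEEPAGENT_MODEL", "o4-mini"),
}


def model_for(tier: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the configured model for a tier, falling back to its default."""
    key, default = TIERS[tier]
    return env.get(key, default)
```

Passing `env` explicitly keeps the lookup testable. For instance, `model_for("higher", env={})` falls back to `gpt-5-mini`.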