Architecture & Components
k8s-autopilot is built on the production-grade Deep Agent pattern — a multi-tier hierarchy of agents, MCP servers, and HITL gates. This page covers the architectural components that power the system. For domain-specific capabilities, see the Capabilities pages.
1. 🎯 Supervisor Agent
The Supervisor Agent is a pure router that delegates ALL Kubernetes infrastructure requests to the appropriate domain coordinator. It never performs operations directly.
Routing Table
| Request Type | Tool | Target Coordinator |
|---|---|---|
| Helm chart generation/update | transfer_to_helm_operator | Helm Operator Coordinator |
| K8s cluster ops (pods, scaling, exec) | transfer_to_k8s_operator | K8s Operator Coordinator |
| ArgoCD / Argo Rollouts / Traefik | transfer_to_app_operator | App Operator Coordinator |
| Prometheus / Alertmanager | transfer_to_observability_operator | Observability Coordinator |
| Clarification / out-of-scope | request_human_feedback | User |
Natural Language Mapping
The Supervisor translates non-technical language into domain-aware routing:
| User Says | Maps To | Coordinator |
|---|---|---|
| "deploy", "ship", "release" | ArgoCD sync | App Operator |
| "zero downtime", "gradual" | Argo Rollouts canary/blue-green | App Operator |
| "split traffic", "A/B test" | Traefik weighted routing | App Operator |
| "scale up", "more capacity" | K8s scaling | K8s Operator |
| "what's firing", "on-call" | Alert triage | Observability |
| "silence", "mute" | Create silence | Observability |
| "metrics", "PromQL" | Prometheus query | Observability |
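The mapping above can be approximated as a keyword lookup. This is only a rough sketch: the Supervisor presumably relies on the LLM rather than substring matching, and `KEYWORD_ROUTES` and `route_request` are hypothetical names introduced for illustration. The tool names are taken from the Routing Table.

```python
# Hypothetical keyword table built from the mapping above; an LLM-based
# router would handle paraphrases that plain substring matching misses.
KEYWORD_ROUTES: list[tuple[tuple[str, ...], str]] = [
    (("deploy", "ship", "release"), "transfer_to_app_operator"),
    (("zero downtime", "gradual"), "transfer_to_app_operator"),
    (("split traffic", "a/b test"), "transfer_to_app_operator"),
    (("scale up", "more capacity"), "transfer_to_k8s_operator"),
    (("what's firing", "on-call"), "transfer_to_observability_operator"),
    (("silence", "mute"), "transfer_to_observability_operator"),
    (("metrics", "promql"), "transfer_to_observability_operator"),
]


def route_request(text: str) -> str:
    """Return the routing tool for a request, or ask the user when unsure."""
    lowered = text.lower()
    for keywords, tool in KEYWORD_ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return tool
    # Nothing matched: fall back to the clarification route.
    return "request_human_feedback"
```

Substring matching is deliberately naive here (for example, "relationship" contains "ship"), which is one reason a model-driven router is the more plausible design.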
Context Engineering
The Supervisor uses a 3-layer middleware stack to maintain routing accuracy across long sessions:
| Middleware | Purpose |
|---|---|
| SupervisorContextMiddleware | Re-injects accumulated domain summaries as a SystemMessage before every model call, so that cross-domain awareness survives summarization |
| SummarizationMiddleware | Auto-compresses conversation history when it exceeds ~75% of context budget (default: 4000 tokens), keeping only the last 6 messages |
| ModelCallLimitMiddleware | Caps model calls at 15 per turn to prevent runaway routing loops |
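The summarization trigger and the call cap can be sketched in plain Python. This is an illustration under stated assumptions, not the middleware's actual code: `estimate_tokens`, `maybe_summarize`, and `CallLimiter` are hypothetical names, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the summary is a placeholder string rather than an LLM-generated one.

```python
from dataclasses import dataclass

CONTEXT_BUDGET_TOKENS = 4000   # default budget stated above
SUMMARIZE_AT_FRACTION = 0.75   # approximate trigger point stated above
KEEP_LAST_MESSAGES = 6         # messages kept verbatim after compression
MAX_MODEL_CALLS_PER_TURN = 15  # cap stated above


def estimate_tokens(messages: list[str]) -> int:
    """Crude stand-in tokenizer: roughly four characters per token."""
    return sum(len(message) for message in messages) // 4


def maybe_summarize(messages: list[str]) -> list[str]:
    """Collapse older messages into one placeholder once over the threshold."""
    threshold = CONTEXT_BUDGET_TOKENS * SUMMARIZE_AT_FRACTION
    if estimate_tokens(messages) <= threshold or len(messages) <= KEEP_LAST_MESSAGES:
        return messages
    older = messages[:-KEEP_LAST_MESSAGES]
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary, *messages[-KEEP_LAST_MESSAGES:]]


@dataclass
class CallLimiter:
    """Counts model calls within one turn and refuses once the cap is hit."""

    limit: int = MAX_MODEL_CALLS_PER_TURN
    calls: int = 0

    def register_call(self) -> None:
        if self.calls >= self.limit:
            raise RuntimeError("model call limit reached for this turn")
        self.calls += 1
```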
2. 🧩 Domain Coordinators
Each coordinator is a Deep Agent — a LangGraph-based orchestrator that manages its own team of sub-agents. Coordinators handle:
- Intent extraction: Translating user requests into DevOps-aware parameters
- Sub-agent delegation: Routing to the correct sub-agent with [PLAN-LOCKED] context
- HITL orchestration: Presenting plans and collecting approval before delegating execution
- Operations journaling: Logging every operation for context persistence
| Coordinator | Sub-Agents | Capabilities Page |
|---|---|---|
| helm-operator-coordinator | 7 (planner, skill-builder, generator, validator, updater, operation, github) | Helm Operator |
| app-operator-coordinator | 3 (argocd-onboarder, argo-rollouts-onboarder, traefik-edge-router) | App Operator |
| k8s-operator-coordinator | 1 (k8s-cluster-ops) | K8s Operator |
| observability-coordinator | 2 (prometheus-operator, alertmanager-operator) | Observability |
3. 🔌 JIT MCP Connections
Sub-agents that interact with external systems use a Just-In-Time (JIT) connection pattern. Instead of holding open connections to all 8 MCP servers for the entire session, each sub-agent is wrapped in a CompiledSubAgent that only opens its MCP connection when that specific node is executed. The connection is closed immediately after the sub-agent completes.
Connection Types
| Type | Description | Used By |
|---|---|---|
| JIT MCP | Opens MCP connection lazily, closes after execution | All MCP-connected sub-agents |
| Static Dict | Simple dict spec, no MCP — uses filesystem and in-memory tools | helm-skill-builder, helm-generator, helm-updater, helm-validator |
| Compiled Subgraph | LangGraph subgraph with its own internal supervisor | helm-planner |
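The JIT pattern maps naturally onto an async context manager: connect on entry, disconnect on exit, even if the sub-agent fails. The sketch below uses only the standard library. `FakeMCPConnection`, `jit_connection`, and `run_sub_agent` are hypothetical stand-ins and do not reflect the real CompiledSubAgent wrapper or any MCP client API.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeMCPConnection:
    """Hypothetical stand-in for a real MCP client session."""

    def __init__(self, server: str) -> None:
        self.server = server
        self.is_open = False

    async def open(self) -> None:
        self.is_open = True

    async def close(self) -> None:
        self.is_open = False


@asynccontextmanager
async def jit_connection(server: str):
    """Open the connection on entry and always close it on exit."""
    conn = FakeMCPConnection(server)
    await conn.open()
    try:
        yield conn
    finally:
        await conn.close()


async def run_sub_agent(server: str, task: str) -> tuple[str, FakeMCPConnection]:
    """Hold the connection only while this sub-agent is executing."""
    async with jit_connection(server) as conn:
        result = f"{task} via {conn.server}"
    # The connection is already closed by the time the result is returned.
    return result, conn
```

Calling `asyncio.run(run_sub_agent("helm_mcp_server", "list releases"))` returns the result together with a connection object whose `is_open` flag is back to `False`.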
4. 🛡️ Human-in-the-Loop Governance
AI shouldn't arbitrarily execute state-modifying operations on your cluster. k8s-autopilot enforces strict HITL governance at multiple layers.
The [PLAN-LOCKED] Delegation Protocol
Every state-modifying operation follows a mandatory Intent → Plan → Approve → Execute lifecycle:
- Intent Extraction: The coordinator translates the user's request into DevOps-aware parameters
- Plan Presentation: A structured plan is presented via request_user_input with action details, resource names, namespaces, and impact assessment
- User Approval: The LangGraph execution pauses (interrupt()) and the UI renders an approval card
- [PLAN-LOCKED] Execution: After approval, the coordinator delegates to the sub-agent with the [PLAN-LOCKED] prefix; the sub-agent skips its own planning phase and executes the pre-approved parameters directly
- Verification: The sub-agent confirms the operation's success independently
Operation Classification
| Operation Type | Approval Required | Example |
|---|---|---|
| Read-Only | No — instant execution | "List pods", "check sync status", "query metrics" |
| State-Modifying | Yes — full HITL pipeline | "Deploy app", "scale deployment", "create silence" |
| Commit Gates | Yes — explicit confirmation | "Push chart to GitHub", "sync ArgoCD app" |
Rejection Protocol
If a user rejects a plan:
- The agent does not retry autonomously with modified parameters
- It asks the user what to adjust
- It presents at most 2 plans per request before asking the user to rephrase
5. 🔄 Cross-Domain Handoff Protocol
When a coordinator determines that a user's request belongs to a different domain, it emits a structured signal:
"This is outside my scope. Please use the appropriate operator.
User Request: [The user's specific request]
Context: [What was previously discovered]"
The Supervisor detects this via pattern matching, extracts the structured context, and immediately re-routes to the correct coordinator with a [CROSS-DOMAIN] prefix — injecting the prior coordinator's findings:
[CROSS-DOMAIN] Source: observability.
Prior findings: 5 critical alerts for checkout service.
User Request: Check pod status for checkout service
This "blackboard pattern" enables seamless multi-domain investigations without the user repeating themselves.
6. 🧠 Skills, Memory & Context Engineering
k8s-autopilot maintains persistence and context awareness using a multi-layered virtual filesystem backed by a CompositeBackend:
| Virtual Path | Backend | Purpose |
|---|---|---|
| /skills/ | StateBackend (LangGraph state) | Operational workflow instructions loaded by sub-agents |
| /memories/ | StoreBackend (InMemoryStore, org-scoped) | Governance files and operations journals |
| /workspace/ | FilesystemBackend (real disk) | Generated chart files, synced via sync_workspace_to_disk |
| /shared/ | StoreBackend (shared namespace) | Cross-domain shared context |
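Conceptually, a composite backend dispatches on the path prefix. The sketch below shows only that dispatch idea, with backends represented by their names. `ROUTES` and `resolve_backend` are hypothetical and do not reflect the real CompositeBackend API.

```python
# Prefix-to-backend table copied from the virtual filesystem layout above.
ROUTES: dict[str, str] = {
    "/skills/": "StateBackend",
    "/memories/": "StoreBackend",
    "/workspace/": "FilesystemBackend",
    "/shared/": "StoreBackend",
}


def resolve_backend(path: str) -> str:
    """Pick the backend whose prefix matches the virtual path."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend
    raise ValueError(f"no backend registered for {path!r}")
```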
Skills
Skills are strict operational playbooks that dictate exactly how each sub-agent must interact with MCP servers. Each skill directory contains:
- SKILL.md: YAML frontmatter + step-by-step workflow
- references/: Domain-specific patterns and templates
| Domain | Skill Directories | Sub-Agents |
|---|---|---|
| Helm | 6 directories (generator, skill-builder, operation, validator, updater, github-agent) | 6 sub-agents |
| App | 3 directories (argocd-gitops, argo-rollouts-gitops, traefik-edge-routing) | 3 sub-agents |
| K8s | 1 directory (kubernetes-cluster-ops) | 1 sub-agent |
| Observability | 2 directories (prometheus, alertmanager) | 2 sub-agents |
Memory
Each domain maintains two static governance files pre-seeded at startup:
- AGENTS.md: Agent interaction patterns, HITL gate schemas, parameter completeness lookup tables
- hitl-policies.md: Governance rules defining when HITL approval is mandatory vs. optional
Operations Journal
The operations journal (operations-log.md) is the primary mechanism for context persistence across conversation turns:
- Layer 1 (Tool writes): After every state-modifying operation, the coordinator logs a structured entry
- Layer 2 (Middleware re-injects): The OperationContextMiddleware reads the journal and re-injects it as a SystemMessage before every coordinator model call, so it survives summarization
- Layer 3 (Prompt engineering): All prompts instruct the LLM to reference the journal for follow-up operations
Journals are capped at 20 entries with automatic trimming.
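A bounded deque is one simple way to get the 20-entry cap with automatic trimming. This is a sketch only: `OperationsJournal` is a hypothetical class, and the real journal is a markdown file (operations-log.md) rather than an in-memory structure.

```python
from collections import deque

MAX_JOURNAL_ENTRIES = 20  # cap stated above


class OperationsJournal:
    """Bounded log: appending beyond the cap drops the oldest entry."""

    def __init__(self) -> None:
        self._entries: deque[str] = deque(maxlen=MAX_JOURNAL_ENTRIES)

    def log(self, entry: str) -> None:
        self._entries.append(entry)

    def render(self) -> str:
        """Text that a middleware could re-inject as a system message."""
        return "\n".join(f"- {entry}" for entry in self._entries)
```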
7. ⚙️ State Management
Each agent operates on its own dedicated state schema, optimized for its specific task:
| State Schema | Used By | Key Fields |
|---|---|---|
| MainSupervisorState | Supervisor | user_query, workflow_state, active_phase, domain_summaries, cross_domain_context |
| Coordinator States | Each coordinator | messages, user_query, domain-specific fields |
| Sub-Agent States | Each sub-agent | Inherited from coordinator via input_transform |
State Transformers
Data is not shared blindly. A StateTransformer middleware explicitly converts data when moving between the Supervisor and coordinators. This enforces context isolation: sub-agents see only what they need, which reduces the risk of hallucination caused by irrelevant history.
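A transformer of this kind can be sketched as a function that copies only selected fields. The supervisor field names come from the table above. `CoordinatorInput` and `to_coordinator_input` are hypothetical, and which fields are actually forwarded is an assumption made for illustration.

```python
from typing import TypedDict


class SupervisorState(TypedDict):
    user_query: str
    workflow_state: str
    active_phase: str
    domain_summaries: dict[str, str]
    cross_domain_context: str


class CoordinatorInput(TypedDict):
    user_query: str
    messages: list[str]


def to_coordinator_input(state: SupervisorState, domain: str) -> CoordinatorInput:
    """Pass along only the query plus context relevant to the target domain."""
    messages: list[str] = []
    if state["cross_domain_context"]:
        messages.append(state["cross_domain_context"])
    if domain in state["domain_summaries"]:
        messages.append(state["domain_summaries"][domain])
    # Summaries of other domains and routing bookkeeping are left behind.
    return {"user_query": state["user_query"], "messages": messages}
```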
Persistence & Handoffs
- Checkpointer: All states are persisted to PostgreSQL (or in-memory fallback). This enables long-running workflows where the user might take hours to approve a plan
- Interrupts: When a HITL gate is triggered, the state is saved, execution stops, and the system waits. Upon approval, it resumes exactly where it left off
8. 🛠 Tech Stack
| Component | Technology | Purpose |
|---|---|---|
| Agent Framework | deepagents / LangGraph | State machine, orchestration, sub-graph routing |
| LLM Interface | LangChain Core | Tool execution, message schemas |
| Tools/Integrations | Model Context Protocol (MCP) | Standardized protocol for 8 external systems |
| User Interface | A2UI / TalkOps A2A | Real-time streaming, HITL approval cards, markdown rendering |
| Runtime | Python 3.12+ | Core agent backend |
MCP Servers
k8s-autopilot connects to 8 MCP servers — 7 TalkOps-native servers (PyPI packages, stdio transport) and 1 third-party (npx):
| MCP Server | Package / Command | Transport | Domain |
|---|---|---|---|
| helm_mcp_server | helm-mcp-server | stdio | Helm chart operations |
| argocd_mcp_server | argocd-mcp-server | stdio | ArgoCD GitOps |
| traefik_mcp_server | traefik-mcp-server | stdio | Traefik edge routing |
| argo_rollout_mcp_server | argo-rollout-mcp-server | stdio | Argo Rollouts progressive delivery |
| prometheus-mcp-server | prometheus-mcp-server | stdio | Prometheus monitoring & PromQL |
| alertmanager-mcp-server | alertmanager-mcp-server | stdio | Alertmanager alerting & silences |
| github_mcp | GitHub Copilot API | HTTP | GitHub file operations |
| kubernetes_mcp_server | npx kubernetes-mcp-server@latest | stdio | Raw Kubernetes cluster ops |
LLM Configuration
The agent uses a three-tier LLM configuration — different models for different jobs:
| Tier | Config Key | Default | Used By |
|---|---|---|---|
| Standard | LLM_MODEL | gpt-4o-mini | Fast parsing, repetitive tasks |
| Higher | LLM_HIGHER_MODEL | gpt-5-mini | Supervisor routing, HITL context |
| Deep Agent | LLM_DEEPAGENT_MODEL | o4-mini | Coordinators, complex code generation |
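The three tiers can be loaded with documented defaults as fallbacks. The sketch assumes the config keys are read from environment variables, which this page does not state. `LLMConfig` and `load_llm_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    standard: str
    higher: str
    deep_agent: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three tier keys, falling back to the documented defaults."""
    return LLMConfig(
        standard=env.get("LLM_MODEL", "gpt-4o-mini"),
        higher=env.get("LLM_HIGHER_MODEL", "gpt-5-mini"),
        deep_agent=env.get("LLM_DEEPAGENT_MODEL", "o4-mini"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.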