Architecture & Components

k8s-autopilot is built on the production-grade Deep Agent pattern — a multi-tier hierarchy of agents, MCP servers, and HITL gates. This page covers the architectural components that power the system. For domain-specific capabilities, see the Capabilities pages.

1. 🎯 Supervisor Agent

The Supervisor Agent is a pure router that delegates ALL Kubernetes infrastructure requests to the appropriate domain coordinator. It never performs operations directly.

Routing Table

Request Type	Tool	Target Coordinator
Helm chart generation/update	`transfer_to_helm_operator`	Helm Operator Coordinator
K8s cluster ops (pods, scaling, exec)	`transfer_to_k8s_operator`	K8s Operator Coordinator
ArgoCD / Argo Rollouts / Traefik	`transfer_to_app_operator`	App Operator Coordinator
Prometheus / Alertmanager	`transfer_to_observability_operator`	Observability Coordinator
Clarification / out-of-scope	`request_human_feedback`	User

Natural Language Mapping

The Supervisor translates non-technical language into domain-aware routing:

User Says	Maps To	Coordinator
"deploy", "ship", "release"	ArgoCD sync	App Operator
"zero downtime", "gradual"	Argo Rollouts canary/blue-green	App Operator
"split traffic", "A/B test"	Traefik weighted routing	App Operator
"scale up", "more capacity"	K8s scaling	K8s Operator
"what's firing", "on-call"	Alert triage	Observability
"silence", "mute"	Create silence	Observability
"metrics", "PromQL"	Prometheus query	Observability

Context Engineering

The Supervisor uses a 3-layer middleware stack to maintain routing accuracy across long sessions:

Middleware	Purpose
`SupervisorContextMiddleware`	Re-injects accumulated domain summaries as a `SystemMessage` before every model call — ensures cross-domain awareness survives summarization
`SummarizationMiddleware`	Auto-compresses conversation history when it exceeds ~75% of context budget (default: 4000 tokens), keeping only the last 6 messages
`ModelCallLimitMiddleware`	Caps model calls at 15 per turn to prevent runaway routing loops

2. 🧩 Domain Coordinators

Each coordinator is a Deep Agent — a LangGraph-based orchestrator that manages its own team of sub-agents. Coordinators handle:

Intent extraction: Translating user requests into DevOps-aware parameters
Sub-agent delegation: Routing to the correct sub-agent with [PLAN-LOCKED] context
HITL orchestration: Presenting plans and collecting approval before delegating execution
Operations journaling: Logging every operation for context persistence

Coordinator	Sub-Agents	Capabilities Page
`helm-operator-coordinator`	7 (planner, skill-builder, generator, validator, updater, operation, github)	Helm Operator
`app-operator-coordinator`	3 (argocd-onboarder, argo-rollouts-onboarder, traefik-edge-router)	App Operator
`k8s-operator-coordinator`	1 (k8s-cluster-ops)	K8s Operator
`observability-coordinator`	2 (prometheus-operator, alertmanager-operator)	Observability

3. 🔌 JIT MCP Connections

Sub-agents that interact with external systems use a Just-In-Time (JIT) connection pattern. Instead of holding open connections to all 8 MCP servers for the entire session, each sub-agent is wrapped in a CompiledSubAgent that only opens its MCP connection when that specific node is executed. The connection is closed immediately after the sub-agent completes.

Connection Types

Type	Description	Used By
JIT MCP	Opens MCP connection lazily, closes after execution	All MCP-connected sub-agents
Static Dict	Simple dict spec, no MCP — uses filesystem and in-memory tools	helm-skill-builder, helm-generator, helm-updater, helm-validator
Compiled Subgraph	LangGraph subgraph with its own internal supervisor	helm-planner

4. 🛡️ Human-in-the-Loop Governance

AI shouldn't arbitrarily execute state-modifying operations on your cluster. k8s-autopilot enforces strict HITL governance at multiple layers.

The `[PLAN-LOCKED]` Delegation Protocol

Every state-modifying operation follows a mandatory Intent → Plan → Approve → Execute lifecycle:

Intent Extraction: The coordinator translates the user's request into DevOps-aware parameters
Plan Presentation: A structured plan is presented via request_user_input with action details, resource names, namespaces, and impact assessment
User Approval: The LangGraph execution pauses (interrupt()) and the UI renders an approval card
[PLAN-LOCKED] Execution: After approval, the coordinator delegates to the sub-agent with the [PLAN-LOCKED] prefix — the sub-agent skips its own planning phase and executes pre-approved parameters directly
Verification: The sub-agent confirms the operation's success independently

Operation Classification

Operation Type	Approval Required	Example
Read-Only	No — instant execution	"List pods", "check sync status", "query metrics"
State-Modifying	Yes — full HITL pipeline	"Deploy app", "scale deployment", "create silence"
Commit Gates	Yes — explicit confirmation	"Push chart to GitHub", "sync ArgoCD app"

Rejection Protocol

If a user rejects a plan:

The agent does not retry autonomously with modified parameters
It asks the user what to adjust
Maximum of 2 plan presentations per request before asking the user to rephrase

5. 🔄 Cross-Domain Handoff Protocol

When a coordinator determines that a user's request belongs to a different domain, it emits a structured signal:

"This is outside my scope. Please use the appropriate operator.
User Request: [The user's specific request]
Context: [What was previously discovered]"

The Supervisor detects this via pattern matching, extracts the structured context, and immediately re-routes to the correct coordinator with a [CROSS-DOMAIN] prefix — injecting the prior coordinator's findings:

[CROSS-DOMAIN] Source: observability. 
Prior findings: 5 critical alerts for checkout service.
User Request: Check pod status for checkout service

This "blackboard pattern" enables seamless multi-domain investigations without the user repeating themselves.

6. 🧠 Skills, Memory & Context Engineering

k8s-autopilot maintains persistence and context awareness using a multi-layered virtual filesystem backed by a CompositeBackend:

Virtual Path	Backend	Purpose
`/skills/`	`StateBackend` (LangGraph state)	Operational workflow instructions loaded by sub-agents
`/memories/`	`StoreBackend` (InMemoryStore, org-scoped)	Governance files and operations journals
`/workspace/`	`FilesystemBackend` (real disk)	Generated chart files, synced via `sync_workspace_to_disk`
`/shared/`	`StoreBackend` (shared namespace)	Cross-domain shared context

Skills

Skills are strict operational playbooks that dictate exactly how each sub-agent must interact with MCP servers. Each skill directory contains:

SKILL.md — YAML frontmatter + step-by-step workflow
references/ — Domain-specific patterns and templates

Domain	Skill Directories	Sub-Agents
Helm	6 directories (generator, skill-builder, operation, validator, updater, github-agent)	6 sub-agents
App	3 directories (argocd-gitops, argo-rollouts-gitops, traefik-edge-routing)	3 sub-agents
K8s	1 directory (kubernetes-cluster-ops)	1 sub-agent
Observability	2 directories (prometheus, alertmanager)	2 sub-agents

Memory

Each domain maintains two static governance files pre-seeded at startup:

AGENTS.md: Agent interaction patterns, HITL gate schemas, parameter completeness lookup tables
hitl-policies.md: Governance rules defining when HITL approval is mandatory vs. optional

Operations Journal

The operations journal (operations-log.md) is the primary mechanism for context persistence across conversation turns:

Layer 1 — Tool writes: After every state-modifying operation, the coordinator logs a structured entry
Layer 2 — Middleware re-injects: The OperationContextMiddleware reads the journal and re-injects it as a SystemMessage before every coordinator model call — surviving summarization
Layer 3 — Prompt engineering: All prompts instruct the LLM to reference the journal for follow-up operations

Journals are capped at 20 entries with automatic trimming.

7. ⚙️ State Management

Each agent operates on its own dedicated state schema, optimized for its specific task:

State Schema	Used By	Key Fields
`MainSupervisorState`	Supervisor	`user_query`, `workflow_state`, `active_phase`, `domain_summaries`, `cross_domain_context`
Coordinator States	Each coordinator	`messages`, `user_query`, domain-specific fields
Sub-Agent States	Each sub-agent	Inherited from coordinator via `input_transform`

State Transformers

Data is not shared blindly. A StateTransformer middleware explicitly converts data when moving between the Supervisor and coordinators. This ensures context isolation — sub-agents see only what they need, preventing hallucination from irrelevant history.

Persistence & Handoffs

Checkpointer: All states are persisted to PostgreSQL (or in-memory fallback). This enables long-running workflows where the user might take hours to approve a plan
Interrupts: When a HITL gate is triggered, the state is saved, execution stops, and the system waits. Upon approval, it resumes exactly where it left off

8. 🛠 Tech Stack

Component	Technology	Purpose
Agent Framework	`deepagents` / LangGraph	State machine, orchestration, sub-graph routing
LLM Interface	LangChain Core	Tool execution, message schemas
Tools/Integrations	Model Context Protocol (MCP)	Standardized protocol for 8 external systems
User Interface	A2UI / TalkOps A2A	Real-time streaming, HITL approval cards, markdown rendering
Runtime	Python 3.12+	Core agent backend

MCP Servers

k8s-autopilot connects to 8 MCP servers — 7 TalkOps-native servers (PyPI packages, stdio transport) and 1 third-party (npx):

MCP Server	Package / Command	Transport	Domain
`helm_mcp_server`	`helm-mcp-server`	stdio	Helm chart operations
`argocd_mcp_server`	`argocd-mcp-server`	stdio	ArgoCD GitOps
`traefik_mcp_server`	`traefik-mcp-server`	stdio	Traefik edge routing
`argo_rollout_mcp_server`	`argo-rollout-mcp-server`	stdio	Argo Rollouts progressive delivery
`prometheus-mcp-server`	`prometheus-mcp-server`	stdio	Prometheus monitoring & PromQL
`alertmanager-mcp-server`	`alertmanager-mcp-server`	stdio	Alertmanager alerting & silences
`github_mcp`	GitHub Copilot API	HTTP	GitHub file operations
`kubernetes_mcp_server`	`npx kubernetes-mcp-server@latest`	stdio	Raw Kubernetes cluster ops

LLM Configuration

The agent uses a three-tier LLM configuration — different models for different jobs:

Tier	Config Key	Default	Used By
Standard	`LLM_MODEL`	`gpt-4o-mini`	Fast parsing, repetitive tasks
Higher	`LLM_HIGHER_MODEL`	`gpt-5-mini`	Supervisor routing, HITL context
Deep Agent	`LLM_DEEPAGENT_MODEL`	`o4-mini`	Coordinators, complex code generation

1. 🎯 Supervisor Agent​

Routing Table​

Natural Language Mapping​

Context Engineering​

2. 🧩 Domain Coordinators​

3. 🔌 JIT MCP Connections​

Connection Types​

4. 🛡️ Human-in-the-Loop Governance​

The [PLAN-LOCKED] Delegation Protocol​

Operation Classification​

Rejection Protocol​

5. 🔄 Cross-Domain Handoff Protocol​

6. 🧠 Skills, Memory & Context Engineering​

Skills​

Memory​

Operations Journal​

7. ⚙️ State Management​

State Transformers​

Persistence & Handoffs​

8. 🛠 Tech Stack​

MCP Servers​

LLM Configuration​