Examples & Use Cases
Real-world scenarios demonstrating the Kubernetes Agent's capabilities across all four operational domains — Helm Operator, App Operator, K8s Operator, and Observability.
Each scenario shows the actual conversation flow, including gap detection, HITL approval gates, and verification steps.
📦 Helm Operator
Scenario 1: Intelligent Chart Generation
Objective: Create a production-ready Helm chart using the multi-phase pipeline.
The agent uses a full generation pipeline: helm-planner → helm-skill-builder (if needed) → helm-generator → helm-validator → HITL approval → github-agent.
The Request
User: "Create a helm chart for my payment-service. It's a Python FastAPI app using the payment:v2 image. It needs a Redis connection and should be exposed on port 8000."
1. Planning (Gap Detection)
The Planner Agent reads the request and notices a missing detail (Redis connection info), triggering a clarifying question.
Agent: "I can certainly help with that. To ensure the Redis connection is configured correctly, how should the application connect to Redis?
- Should I include a Redis sub-chart (deploy a new Redis)?
- Or will it connect to an external Redis URL provided via environment variables?"
User: "It connects to an external Redis at redis.infrastructure.svc."
2. Architecture Design
The planner generates a plan:
- Workload: Deployment (stateless generic app)
- Config: ConfigMap for the REDIS_HOST environment variable
- Networking: Service on port 8000 + Ingress rules
- Health: Standard HTTP liveness probes on /health
3. Generation & Self-Healing Validation
The Generator Agent writes the files following Bitnami conventions, then the Validator runs triple-check validation:
Issue Detected:
Event: helm_lint_validator
Status: FAILED
Error: templates/deployment.yaml:23:21: invalid indentation in 'env' block.
Self-Healing Action: The agent automatically reads line 23, fixes the whitespace, saves the file, and re-runs validation. Passed.
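The validate, fix, re-validate cycle described above can be sketched as a bounded retry loop. This is a minimal illustration and not the agent's actual implementation: `validate` and `fix` are hypothetical callables standing in for the Validator and the repair step, and the retry cap of 3 is an assumption.

```python
from typing import Callable, Optional, Tuple


def self_heal(
    validate: Callable[[str], Optional[str]],
    fix: Callable[[str, str], str],
    content: str,
    max_attempts: int = 3,  # assumed cap; the source does not state one
) -> Tuple[str, bool]:
    """Run validate -> fix -> re-validate until the content passes or attempts run out.

    validate returns None on success or an error message on failure.
    fix receives the content and the error message and returns repaired content.
    """
    for _ in range(max_attempts):
        error = validate(content)
        if error is None:
            return content, True
        content = fix(content, error)
    # One last check after the final fix attempt.
    return content, validate(content) is None


# Toy stand-ins: the "validator" rejects tab indentation, the "fixer" replaces tabs.
def toy_validate(text: str) -> Optional[str]:
    return "invalid indentation" if "\t" in text else None


def toy_fix(text: str, error: str) -> str:
    return text.replace("\t", "  ")


fixed, passed = self_heal(toy_validate, toy_fix, "env:\n\t- name: REDIS_HOST")
print(passed)  # True
```

Capping the attempts keeps a fix that never converges from looping forever; a failed result would then be surfaced to the user instead of being committed.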
4. GitHub Persistence
After HITL approval, the github-agent commits all chart files to your repository using GitHub MCP tools — no manual copy-pasting required.
Scenario 2: Live Helm Operations (Install / Upgrade / Rollback)
Objective: Install a complex third-party application with the full phased safety pipeline.
The helm-operation sub-agent follows: Discovery → Planning → Dry-Run → Execution → Verification.
The Request
User: "Install ArgoCD from the official repo into the argocd namespace."
1. Discovery & Context
The Helm Operations Agent checks the cluster:
- Namespace exists? No → Plans creation.
- Repo added? No → Adds argo https://argoproj.github.io/argo-helm.
- Release exists? No → Mode: INSTALL.
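The three discovery checks map naturally onto an ordered list of planned steps. The sketch below is illustrative only: `ClusterState`, `plan_operation`, and the step names are hypothetical, and the upgrade branch is an assumption based on the scenario title, since the walkthrough itself only shows the install case.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    namespace_exists: bool
    repo_added: bool
    release_exists: bool


def plan_operation(state: ClusterState) -> List[str]:
    """Translate discovery results into an ordered list of planned steps."""
    steps: List[str] = []
    if not state.namespace_exists:
        steps.append("create-namespace")
    if not state.repo_added:
        steps.append("add-repo")
    # Assumption: an existing release means upgrade; otherwise a fresh install.
    steps.append("upgrade" if state.release_exists else "install")
    return steps


print(plan_operation(ClusterState(False, False, False)))
# ['create-namespace', 'add-repo', 'install']
```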
2. Values Confirmation (HITL Gate)
Agent: "I found the chart argo/argo-cd version 5.51.0.
Required Configuration: No mandatory values found, but I recommend enabling server.ingress.enabled if you want external access.
Do you want to proceed with default values or customize anything?"
User: "Enable ingress with host argocd.example.com."
3. Plan Approval
The agent generates the installation plan and presents it for final sign-off.
Agent: "🚀 INSTALLATION PLAN REVIEW
Action: Installation
Chart: argo/argo-cd
Version: 5.51.0
Release Name: argo-cd
Namespace: argocd
Configuration Values: server.ingress.hosts[0]=argocd.example.com
Safety Check: Cluster connectivity is green.
[APPROVE / REJECT]"
User: "APPROVE"
4. Execution
The agent runs helm_dry_run_install first (mandatory for new installs), then helm_install_chart, and finally verifies via helm_get_release_status.
Scenario 3: Rollback
Objective: Revert a failed upgrade quickly.
User: "The last upgrade broke the site! Roll it back immediately."
- The agent lists history via helm_get_release_history:
  - Rev 5: DEPLOYED (Current, Broken)
  - Rev 4: SUPERSEDED (Last Good)
- Presents rollback plan for HITL approval.
- Executes: helm_rollback_release to revision 4.
- Verifies: Confirms pods from Rev 4 are running and healthy via helm_get_release_status.
Agent: "Successfully rolled back my-app to revision 4. Pods are stable."
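Choosing the rollback target from the release history can be sketched as follows. This is a simplification under stated assumptions: the history is modelled as hypothetical `(revision, status)` pairs, the status strings mirror the listing above, and "last good" is taken to mean the most recent SUPERSEDED revision, which the agent may refine with further checks.

```python
from typing import List, Optional, Tuple


def last_good_revision(history: List[Tuple[int, str]]) -> Optional[int]:
    """Return the highest revision marked SUPERSEDED, or None if there is none."""
    candidates = [rev for rev, status in history if status == "SUPERSEDED"]
    return max(candidates) if candidates else None


history = [(3, "SUPERSEDED"), (4, "SUPERSEDED"), (5, "DEPLOYED")]
print(last_good_revision(history))  # 4
```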
🔄 App Operator — ArgoCD
Scenario 4: ArgoCD App Onboarding
Objective: Declaratively onboard an application via ArgoCD with plan preview and approvals.
User: "Onboard checkout-api from repo github.com/acme/platform at path /apps/checkout into project payments."
1. Prerequisite Checks
The ArgoCD Sub-Agent verifies:
- Project payments exists → get_project
- Repo is registered and allowed → list_repositories
- Application name is available → list_applications
2. Plan Preview (HITL Gate)
Agent: "Here is the plan:
- Confirm repo access for github.com/acme/platform.
- Create ArgoCD app checkout-api in project payments.
- Set destination namespace payments and sync policy manual.
[APPROVE / REJECT]"
User: "APPROVE"
3. Execute + Tool-Level Approval
The agent runs the MCP calls and requests a final confirmation before the create action.
Scenario 5: Automated Sync Debugging
Objective: Diagnose and fix an OutOfSync ArgoCD application.
User: "My checkout app is stuck in OutOfSync. What's wrong?"
The agent:
- Checks sync status: get_sync_status → finds OutOfSync
- Runs diff: get_application_diff → shows discrepancy in spec.replicas
- Checks events: get_application_events → finds a failed sync attempt
- Reports actionable remediation:
Agent: "The checkout app is OutOfSync because someone manually scaled the deployment to 5 replicas on the cluster, but the Git manifest still says 3.
Options:
- Sync: Overwrite the cluster state with Git (reset to 3 replicas)
- Commit: Update the Git manifest to 5 replicas
Which do you prefer?"
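The two remediation options follow directly from comparing the desired (Git) and live replica counts. A minimal sketch, with a hypothetical `replica_drift_options` helper that is not part of the agent's tool set:

```python
from typing import Dict, List


def replica_drift_options(git_replicas: int, live_replicas: int) -> List[Dict[str, str]]:
    """Return remediation options for a replica-count drift, or [] when in sync."""
    if git_replicas == live_replicas:
        return []
    return [
        {"action": "sync", "effect": f"reset cluster to {git_replicas} replicas from Git"},
        {"action": "commit", "effect": f"update Git manifest to {live_replicas} replicas"},
    ]


for option in replica_drift_options(git_replicas=3, live_replicas=5):
    print(option["action"], "->", option["effect"])
```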
🔄 App Operator — Argo Rollouts
Scenario 6: Zero-Downtime Deployment Migration
Objective: Migrate a standard Deployment to an Argo Rollout using workloadRef without duplicating pods or causing ArgoCD drift.
User: "Convert the checkout deployment to a canary rollout."
1. Pre-Migration Validation
The Argo Rollouts Sub-Agent runs validate_deployment_ready:
- Deployment exists ✅
- All replicas healthy ✅
- No active HPA conflicts ✅
2. Migration Plan (HITL Gate)
Agent: "🔄 MIGRATION PLAN REVIEW
Action: Convert Deployment → Rollout (workloadRef)
Deployment: checkout (namespace: production)
Strategy: Canary (20% → 50% → 100%)
Analysis: Prometheus AnalysisTemplate (error rate < 1%, p99 < 500ms)
⚠️ Post-migration steps required:
- Add ignoreDifferences to ArgoCD Application (generated for you)
- Commit scale-down manifest to Git (generated for you)
[APPROVE / REJECT]"
3. Execution & Follow-Up
After migration, the agent generates two mandatory artifacts:
- generate_argocd_ignore_differences → Prevents false OutOfSync
- generate_scale_down_manifest → Prevents duplicate pods
Scenario 7: Canary Deployment with Autonomous Promotion
Objective: Execute a canary rollout with automatic step progression.
User: "Update the checkout service to v2.3.0 with a canary rollout."
Execution & Autonomous Promotion
- Step 1: Traffic shifts to 20%. AnalysisRun passes → Agent promotes automatically (below the 50% approval threshold).
- Step 2: Traffic shifts to 50%. AnalysisRun passes → Agent pauses and asks for explicit approval (at or above the 50% threshold).
Agent: "Canary at 50%. Metrics healthy:
- Error rate: 0.2% (threshold: < 1%) ✅
- p99 latency: 320ms (threshold: < 500ms) ✅
Promote to 100%? [APPROVE / REJECT]"
User: "APPROVE"
- Step 3: Full promotion. Verification confirms all pods healthy.
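The promotion gate in this scenario can be sketched as a small decision function. This is a hedged illustration, not the agent's logic: it assumes approval is required at or above 50% canary weight (as Step 2 behaves), reuses the thresholds quoted in the agent's message, and treats a failed analysis as "abort", which the source does not specify. `CanaryMetrics` and `next_action` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate_pct: float
    p99_latency_ms: float


def next_action(
    weight_pct: int,
    metrics: CanaryMetrics,
    approval_threshold_pct: int = 50,
    max_error_rate_pct: float = 1.0,
    max_p99_ms: float = 500.0,
) -> str:
    """Decide what to do after an analysis run at the given canary weight."""
    healthy = (
        metrics.error_rate_pct < max_error_rate_pct
        and metrics.p99_latency_ms < max_p99_ms
    )
    if not healthy:
        return "abort"  # assumption: unhealthy metrics stop the rollout
    if weight_pct >= approval_threshold_pct:
        return "ask-approval"
    return "auto-promote"


healthy = CanaryMetrics(error_rate_pct=0.2, p99_latency_ms=320)
print(next_action(20, healthy))  # auto-promote
print(next_action(50, healthy))  # ask-approval
```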
🔄 App Operator — Traefik
Scenario 8: Weighted Canary Routing
Objective: Gradually shift traffic between service versions at the edge.
User: "Set 80/20 traffic split between checkout-v1 and checkout-v2."
The Traefik Sub-Agent reads current distribution, then presents:
Agent: "🔀 TRAFFIC SHIFT REVIEW
TraefikService: checkout-weighted (namespace: production)
Current: checkout-v1: 100% / checkout-v2: 0%
Proposed: checkout-v1: 80% / checkout-v2: 20%
[APPROVE / REJECT]"
After approval, applies the weight shift and verifies the new distribution.
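A review like the one above implies two checks before applying: the proposed split must be valid, and the change from the current split should be visible. The sketch below assumes weights are expressed as percentages that sum to 100, as in the review; `validate_split` and `weight_deltas` are hypothetical helpers.

```python
from typing import Dict


def validate_split(weights: Dict[str, int]) -> None:
    """Raise ValueError unless weights are non-negative percentages summing to 100."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected 100")


def weight_deltas(current: Dict[str, int], proposed: Dict[str, int]) -> Dict[str, int]:
    """Per-service change in percentage points between two valid splits."""
    validate_split(current)
    validate_split(proposed)
    services = sorted(set(current) | set(proposed))
    return {s: proposed.get(s, 0) - current.get(s, 0) for s in services}


current = {"checkout-v1": 100, "checkout-v2": 0}
proposed = {"checkout-v1": 80, "checkout-v2": 20}
print(weight_deltas(current, proposed))  # {'checkout-v1': -20, 'checkout-v2': 20}
```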
Scenario 9: NGINX to Traefik Migration
Objective: Automatically translate legacy NGINX Ingress annotations into native Traefik middleware and IngressRoute configurations.
User: "Migrate my NGINX ingresses to Traefik."
1. NGINX Discovery
The agent scans all NGINX Ingress resources: traefik://migration/nginx-ingress-scan
2. Annotation Analysis
Analyzes each Ingress for breaking annotations: traefik://migration/nginx-ingress-analyze
Agent: "Found 12 NGINX Ingress resources across 4 namespaces. 2 have breaking annotations:
- checkout-ingress: Uses nginx.ingress.kubernetes.io/auth-url (requires Traefik ForwardAuth middleware)
- api-ingress: Uses nginx.ingress.kubernetes.io/canary-weight (requires TraefikService weighted routing)
I'll generate equivalent Traefik CRDs for all 12. Review before applying?"
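The annotation analysis amounts to looking up each Ingress's annotations in a table of known breaking ones. The sketch below covers only the two annotations named in the agent's reply; the full set the agent checks is not documented here, and the inventory, the annotation values, and the `find_breaking` helper are hypothetical.

```python
from typing import Dict, List

# Only the two annotations mentioned in the agent's reply; not an exhaustive list.
BREAKING_ANNOTATIONS: Dict[str, str] = {
    "nginx.ingress.kubernetes.io/auth-url": "Traefik ForwardAuth middleware",
    "nginx.ingress.kubernetes.io/canary-weight": "TraefikService weighted routing",
}


def find_breaking(ingresses: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each Ingress name to the Traefik constructs its annotations require."""
    report: Dict[str, List[str]] = {}
    for name, annotations in ingresses.items():
        needs = [target for key, target in BREAKING_ANNOTATIONS.items() if key in annotations]
        if needs:
            report[name] = needs
    return report


# Hypothetical inventory: annotation values are placeholders.
inventory = {
    "checkout-ingress": {"nginx.ingress.kubernetes.io/auth-url": "https://auth.example.com/verify"},
    "api-ingress": {"nginx.ingress.kubernetes.io/canary-weight": "20"},
    "docs-ingress": {},
}
for name, needs in find_breaking(inventory).items():
    print(name, "->", ", ".join(needs))
```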
3. Generate & Apply (HITL Gate)
The agent generates Traefik IngressRoute + Middleware YAML (action=generate), presents for review, then applies after approval (action=apply).
☸️ K8s Operator
Scenario 10: Automated Root Cause Analysis
Objective: Diagnose a crashing pod with automated correlation.
User: "Why is the payment-service crashing?"
The agent runs an automated investigation:
- pods_list_in_namespace → Finds payment-service-5f8c... in CrashLoopBackOff
- pods_log --previous → Reads crash log
- events_list → Checks cluster events
- pods_top → Correlates memory usage with limits
Log Finding:
panic: FATAL: env var 'STRIPE_API_KEY' not set
Agent: "Root Cause: The STRIPE_API_KEY environment variable is missing from the Deployment spec.
Recommended Fix: Add the environment variable referencing an existing Secret:
- name: STRIPE_API_KEY
  valueFrom:
    secretKeyRef:
      name: payment-secrets
      key: stripe-api-key
Shall I update the Deployment with this fix?"
Scenario 11: Resource Pressure Investigation (OOMKilled)
Objective: Investigate OOMKilled containers by correlating memory limits with live usage.
User: "My checkout pods keep getting OOMKilled."
The agent:
- pods_list_in_namespace → Finds 3 pods with OOMKilled exit code
- pods_top → Live memory: 490Mi per pod
- resources_get (Deployment) → Memory limit: 512Mi
- nodes_top → Node memory pressure: 78% utilization
Agent: "Root Cause: Checkout pods are hitting the 512Mi memory limit. Live usage peaks at 490Mi (96% of limit).
Recommendation: Increase memory limit to 768Mi (490Mi peak + 50% headroom ≈ 735Mi, rounded up to 768Mi).
The node has sufficient capacity (78% utilization). Shall I patch the Deployment?"
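The sizing arithmetic behind the recommendation can be reproduced in a few lines. A 50% headroom on a 490Mi peak gives 735Mi; rounding up to the next multiple of 256Mi yields 768Mi. The 256Mi rounding step is an assumption made here to reconcile the numbers, and `recommend_memory_limit_mi` is a hypothetical helper.

```python
import math


def recommend_memory_limit_mi(peak_mi: int, headroom: float = 0.5, step_mi: int = 256) -> int:
    """Peak usage plus headroom, rounded up to the next multiple of step_mi."""
    target = peak_mi * (1 + headroom)
    return math.ceil(target / step_mi) * step_mi


peak, limit = 490, 512
print(round(100 * peak / limit))        # 96 (percent of the current limit)
print(recommend_memory_limit_mi(peak))  # 768
```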
Scenario 12: Safe Exec & Ephemeral Debugging
Objective: Debug DNS resolution issues using a temporary debug pod.
User: "Run a debug pod with busybox in the checkout namespace."
Agent: "🐛 EXEC APPROVAL
Action: Create ephemeral debug pod
Image: busybox:latest
Namespace: checkout
Command: sh (interactive shell)
⚠️ This grants shell-level access to the namespace.
[APPROVE / REJECT]"
After approval, the agent creates the pod and relays command output.
🔭 Observability — Prometheus
Scenario 13: Exporter Lifecycle Management
Objective: Deploy, verify, and monitor a PostgreSQL exporter.
The agent follows the strict sequence: prom_install_exporter → prom_verify_exporter → prom_query_instant for validation.
User: "Install a postgres exporter in the monitoring namespace."
1. Recommendation
Agent: "I recommend the postgres-exporter for PostgreSQL monitoring. It exposes 50+ metrics including connection pools, query performance, and replication lag."
2. Plan Presentation (HITL Gate)
Agent: "📦 EXPORTER INSTALL REVIEW
Exporter: postgres-exporter
Namespace: monitoring
Resources: Deployment + Service + ServiceMonitor
Connection: Requires DATA_SOURCE_NAME env var pointing to PostgreSQL
[APPROVE / REJECT]"
3. Install & Mandatory Verification
prom_install_exporter → prom_verify_exporter → prom_query_instant(query="up{job='postgres-exporter'}")
Agent: "✅ Verified: postgres-exporter is scraping successfully. up{job='postgres-exporter'} = 1. 52 metrics discovered."
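The final validation step boils down to checking that the `up` series for the exporter's job equals 1. The sketch below assumes prom_query_instant returns the JSON body of Prometheus's instant-query HTTP API unchanged, which is an assumption about the tool; `exporter_is_up` and the sample payload are hypothetical.

```python
from typing import Any, Dict


def exporter_is_up(response: Dict[str, Any]) -> bool:
    """True when the query succeeded and every returned `up` sample equals 1."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # Each sample's value is a [timestamp, "string value"] pair.
    return bool(results) and all(float(s["value"][1]) == 1.0 for s in results)


# Hypothetical response in the shape of Prometheus's instant-query payload.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "postgres-exporter"}, "value": [1700000000.0, "1"]}
        ],
    },
}
print(exporter_is_up(sample))  # True
```

An empty result vector is treated as "not up", since a target that was never discovered produces no `up` series at all.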
Scenario 14: Synthetic Monitoring with Probes
Objective: Set up endpoint monitoring using Blackbox exporter and Probe CRDs.
User: "Monitor the checkout API health endpoint at https://checkout.example.com/health."
The agent follows: prom_install_exporter (blackbox) → prom_apply_probe → prom_query_instant (validation).
Agent: "✅ Probe configured. Blackbox exporter is monitoring https://checkout.example.com/health every 30s. Current status: UP (200 OK, latency: 45ms)."
Scenario 15: Alerting Rule Authoring
Objective: Author and deploy a PrometheusRule CRD for high error rate detection.
User: "Create an alerting rule that fires when the checkout error rate exceeds 1% for 5 minutes."
The agent:
- Checks existing rules: prom://rules/groups
- Discovers CRD details: prom://kubernetes/prometheusrules
- Drafts the rule: prom_draft_alert_rule
- Presents for HITL approval, then deploys via prom_upsert_rule_group in k8s_crd mode
Agent: "✅ Alert rule CheckoutHighErrorRate (P1 critical) deployed to namespace monitoring. Group: checkout-alerts."
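For orientation, a PrometheusRule manifest for this request might look like the dictionary built below. It is a sketch, not the output of prom_draft_alert_rule: the metric name `http_requests_total` and its `service` and `status` labels are hypothetical and depend on what the checkout service actually exports, and `build_error_rate_rule` is an illustrative helper.

```python
from typing import Any, Dict


def build_error_rate_rule(
    alert_name: str,
    service: str,
    namespace: str,
    group: str,
    threshold: float = 0.01,  # 1% error ratio
    hold: str = "5m",         # how long the condition must hold before firing
) -> Dict[str, Any]:
    """Build a PrometheusRule manifest for a 5xx error-ratio alert."""
    # Hypothetical metric and label names; substitute what the service exports.
    selector = f'service="{service}"'
    expr = (
        f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))'
        f" / sum(rate(http_requests_total{{{selector}}}[5m])) > {threshold}"
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": group, "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": group,
                    "rules": [
                        {
                            "alert": alert_name,
                            "expr": expr,
                            "for": hold,
                            "labels": {"severity": "critical"},
                        }
                    ],
                }
            ]
        },
    }


rule = build_error_rate_rule("CheckoutHighErrorRate", "checkout", "monitoring", "checkout-alerts")
print(rule["spec"]["groups"][0]["rules"][0]["expr"])
```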
🔭 Observability — Alertmanager
Scenario 16: On-Call Alert Triage
Objective: Get a human-readable summary of all firing alerts.
User: "What alerts are firing?"
Agent: "📊 ON-CALL SUMMARY (via am_summarize_oncall)
Critical (2):
- CheckoutHighErrorRate: Error rate 8.5% (threshold: 1%), checkout service, firing for 15m
- PaymentTimeoutP99: p99 latency 3.2s (threshold: 500ms), payment service, firing for 8m
Warning (3):
- CheckoutPodRestarting: 12 restarts in last hour
- CheckoutReplicasMismatch: 2/5 replicas ready
- DiskSpaceWarning: Node disk at 82%
Would you like me to investigate the critical alerts or silence any of them?"
Scenario 17: Silence Lifecycle (Preview → Validate → Create → Extend → Expire)
Objective: Silence noisy alerts during a maintenance window with mandatory blast radius checks.
User: "Silence alerts for the checkout service for 2 hours — we're doing planned maintenance."
1. Preview (Mandatory)
am_preview_silence → am_validate_silence_policy → am_create_silence
Agent: "🔇 SILENCE PREVIEW
Matchers: service=checkout
Duration: 2 hours
Blast Radius: 3 active alerts would be silenced:
- CheckoutHighErrorRate (critical)
- CheckoutLatencyP99 (warning)
- CheckoutPodRestarting (warning)
Policy Check: ✅ Duration under 24h cap. Matchers are service-scoped (recommended).
[APPROVE / REJECT]"
2. Creation & Verification
Agent: "✅ Silence created (ID: abc123). 3 alerts silenced for 2 hours. Expires at 15:30 UTC.
To extend or expire early, just ask."
3. Extension (Later)
User: "Extend the checkout silence by 1 hour."
Agent: "✅ Silence abc123 extended by 1 hour. New expiry: 16:30 UTC."
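The policy check and the expiry arithmetic in this scenario can be sketched as below. The 24h cap and the service-scoping recommendation come from the preview; how the real policy treats a duration of exactly 24h, the 13:30 UTC creation time, and the `check_silence` helper are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_DURATION = timedelta(hours=24)  # cap quoted in the silence preview


def check_silence(matchers: Dict[str, str], duration: timedelta) -> List[str]:
    """Return policy findings; an empty list means the silence passes cleanly."""
    findings: List[str] = []
    if duration > MAX_DURATION:
        findings.append("error: duration exceeds the 24h cap")
    if "service" not in matchers:
        # Service scoping is described as recommended, so this is only a warning.
        findings.append("warning: matchers are not service-scoped")
    return findings


created = datetime(2024, 1, 1, 13, 30, tzinfo=timezone.utc)  # hypothetical creation time
expires = created + timedelta(hours=2)
extended = expires + timedelta(hours=1)

print(check_silence({"service": "checkout"}, timedelta(hours=2)))  # []
print(expires.strftime("%H:%M"), extended.strftime("%H:%M"))       # 15:30 16:30
```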
Scenario 18: Routing Audit & Integration Testing
Objective: Verify that alert routing is correctly configured.
User: "Who gets paged for critical alerts?"
The agent uses am_explain_routing to simulate routing:
Agent: "Routing Analysis for severity=critical:
Alert | Route | Receiver | Channel
CheckoutHighErrorRate | service=checkout → checkout-critical | PagerDuty | #checkout-oncall
PaymentTimeoutP99 | service=payment → payment-critical | PagerDuty | #payment-oncall
DiskSpaceCritical | Default route ⚠️ | default-receiver | #general-alerts
⚠️ DiskSpaceCritical is hitting the default route; it may be misconfigured. Want me to audit the default route for other misrouted alerts?"
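Why DiskSpaceCritical lands on the default route is easiest to see with a toy routing simulation. This is a deliberately flat, first-match model; real Alertmanager routing is a tree with regex matchers and a `continue` option, and am_explain_routing presumably accounts for that. `resolve_target` and the route list are hypothetical.

```python
from typing import Dict, List, Tuple

# (label matchers, route target): a flat simplification of a routing tree.
Route = Tuple[Dict[str, str], str]


def resolve_target(labels: Dict[str, str], routes: List[Route], default: str = "default-receiver") -> str:
    """Return the target of the first route whose matchers all match, else the default."""
    for matchers, target in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return target
    return default


routes: List[Route] = [
    ({"service": "checkout"}, "checkout-critical"),
    ({"service": "payment"}, "payment-critical"),
]

print(resolve_target({"alertname": "CheckoutHighErrorRate", "service": "checkout"}, routes))
# checkout-critical
print(resolve_target({"alertname": "DiskSpaceCritical", "severity": "critical"}, routes))
# default-receiver
```

In this model the disk alert falls through because it carries no `service` label that any route matches, which is one plausible cause of the misrouting flagged above.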
🌐 Cross-Domain Investigation
Scenario 19: Observability → K8s Handoff
Objective: Investigate a production incident starting from alerting and drilling into cluster health.
User: "What alerts are firing for checkout?"
1. Observability Phase
Agent: "Found 5 critical alerts for checkout:
- CheckoutHighErrorRate: Error rate at 8.5% (threshold: 1%)
- CheckoutOOMKilled: 3 pods OOMKilled in last 30m
- CheckoutLatencyP99: p99 at 2.3s (threshold: 500ms)
- CheckoutPodRestarting: 12 restarts in last hour
- CheckoutReplicasMismatch: 2/5 replicas ready
This looks like a pod crash storm. Want me to investigate the pods?"
User: "Yes"
2. Cross-Domain Handoff
The Observability coordinator returns: "This is outside my scope."
The Supervisor auto-routes to the K8s Operator with full context:
[CROSS-DOMAIN] Source: observability.
Prior findings: 5 critical alerts — error rate 8.5%, OOMKilled x3, p99 2.3s, 12 restarts, 2/5 replicas.
User Request: Check pod status for checkout service
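The handoff context can be thought of as a small structured record rendered into the three-line block shown above. The sketch below is an assumption about shape only: the `Handoff` class and its fields are hypothetical, and the real Supervisor may carry additional metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Handoff:
    source: str
    findings: List[str]
    user_request: str

    def render(self) -> str:
        """Format the context block passed to the receiving operator."""
        return (
            f"[CROSS-DOMAIN] Source: {self.source}.\n"
            f"Prior findings: {', '.join(self.findings)}.\n"
            f"User Request: {self.user_request}"
        )


handoff = Handoff(
    source="observability",
    findings=["error rate 8.5%", "OOMKilled x3", "p99 2.3s", "12 restarts", "2/5 replicas"],
    user_request="Check pod status for checkout service",
)
print(handoff.render())
```

Carrying the prior findings in the message is what lets the receiving operator start investigating without asking the user to repeat themselves.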
3. K8s Phase
The K8s Operator investigates with full context — no re-asking needed:
- Lists pods → 3 in CrashLoopBackOff
- Reads logs → FATAL: out of memory
- Checks pods_top → pods at 490Mi (limit: 512Mi)
Agent: "Root Cause: Checkout pods are hitting the 512Mi memory limit. 3 pods have been OOMKilled.
Recommendation: Increase memory limit to 768Mi. This covers the observed peak usage of 490Mi plus 50% headroom (≈ 735Mi), rounded up to 768Mi.
Shall I patch the Deployment?"