Examples & Use Cases

Real-world scenarios demonstrating the Kubernetes Agent's capabilities across all four operational domains — Helm Operator, App Operator, K8s Operator, and Observability.

Each scenario shows the actual conversation flow, including gap detection, HITL approval gates, and verification steps.

📦 Helm Operator

Scenario 1: Intelligent Chart Generation

Objective: Create a production-ready Helm chart using the multi-phase pipeline.

The agent uses a full generation pipeline: helm-planner → helm-skill-builder (if needed) → helm-generator → helm-validator → HITL approval → github-agent.

The Request

User: "Create a helm chart for my payment-service. It's a Python FastAPI app using payment:v2 image. It needs a Redis connection and should be exposed on port 8000."

1. Planning (Gap Detection)

The Planner Agent reads the request and notices a missing detail (Redis connection info), triggering a clarifying question.

Agent: "I can certainly help with that. To ensure the Redis connection is configured correctly, how should the application connect to Redis?

Should I include a Redis sub-chart (deploy a new Redis)?

Or will it connect to an external Redis URL provided via environment variables?"

User: "It connects to an external Redis at redis.infrastructure.svc."

2. Architecture Design

The planner generates a plan:

Workload: Deployment (Stateless generic app)
Config: ConfigMap for REDIS_HOST environment variable
Networking: Service on port 8000 + Ingress rules
Health: Standard HTTP liveness probes on /health

3. Generation & Self-Healing Validation

The Generator Agent writes the files following Bitnami conventions, then the Validator runs triple-check validation:

Issue Detected:

Event: helm_lint_validator
Status: FAILED
Error: templates/deployment.yaml:23:21: invalid indentation in 'env' block.

Self-Healing Action: The agent automatically reads line 23, fixes the whitespace, saves the file, and re-runs validation, which then passes.
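The validate, fix, and re-validate cycle described here can be pictured as a bounded retry loop. The sketch below is illustrative only: `validate` and `fix` are hypothetical callables standing in for the Validator and Generator agents, and the cap of three attempts is an assumption, not a documented limit.

```python
from typing import Callable, Optional, Tuple

def self_heal(
    validate: Callable[[str], Optional[str]],  # returns an error message, or None on success
    fix: Callable[[str, str], str],            # returns corrected content for a given error
    content: str,
    max_attempts: int = 3,                     # assumed cap; not a documented limit
) -> Tuple[str, bool]:
    """Validate, repair on failure, and re-validate until success or the cap is hit."""
    for _ in range(max_attempts):
        error = validate(content)
        if error is None:
            return content, True
        content = fix(content, error)
    # One last check after the final repair attempt.
    return content, validate(content) is None
```

Bounding the loop matters: a fix that never converges should surface to the human reviewer rather than retry forever.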

4. GitHub Persistence

After HITL approval, the github-agent commits all chart files to your repository using GitHub MCP tools — no manual copy-pasting required.

Scenario 2: Live Helm Operations (Install / Upgrade / Rollback)

Objective: Install a complex third-party application with the full phased safety pipeline.

The helm-operation sub-agent follows: Discovery → Planning → Dry-Run → Execution → Verification.

The Request

User: "Install ArgoCD from the official repo into the argocd namespace."

1. Discovery & Context

The Helm Operations Agent checks the cluster:

Namespace exists? No → Plans creation.
Repo added? No → Adds argo https://argoproj.github.io/argo-helm.
Release exists? No → Mode: INSTALL.
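The three discovery checks map naturally onto an ordered list of planned actions. The following is a minimal sketch under stated assumptions: `ClusterState` and the step names are hypothetical, and the upgrade branch is inferred from the scenario title rather than shown in the transcript.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClusterState:
    namespace_exists: bool
    repo_added: bool
    release_exists: bool

def plan_steps(state: ClusterState) -> List[str]:
    """Turn discovery results into an ordered list of planned actions."""
    steps = []
    if not state.namespace_exists:
        steps.append("create-namespace")
    if not state.repo_added:
        steps.append("add-repo")
    # Assumption: an existing release implies an upgrade; otherwise a fresh install.
    steps.append("upgrade" if state.release_exists else "install")
    return steps
```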

2. Values Confirmation (HITL Gate)

Agent: "I found the chart argo/argo-cd version 5.51.0.

Required Configuration: No mandatory values found, but I recommend enabling server.ingress.enabled if you want external access.

Do you want to proceed with default values or customize anything?"

User: "Enable ingress with host argocd.example.com."

3. Plan Approval

The agent generates the installation plan and presents it for final sign-off.

Agent: "🚀 INSTALLATION PLAN REVIEW

Action: Installation
Chart: argo/argo-cd
Version: 5.51.0
Release Name: argo-cd
Namespace: argocd

Configuration Values

server.ingress.enabled=true
server.ingress.hosts[0]=argocd.example.com

Safety Check: Cluster connectivity is green.

[APPROVE / REJECT]"

User: "APPROVE"
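Dotted keys such as `server.ingress.hosts[0]` are shorthand for nested chart values. As a rough illustration of that mapping, the sketch below expands a key path into a nested structure and merges two of them. It is a deliberately simplified stand-in, not Helm's actual parser: it only understands dots and a `[0]` index, and `expand` and `merge` are hypothetical helper names.

```python
import re

def expand(path: str, value):
    """Expand 'a.b[0]' into {'a': {'b': [value]}} (index 0 only, for brevity)."""
    result = value
    for part in reversed(path.split(".")):
        indexed = re.fullmatch(r"(\w+)\[0\]", part)
        result = {indexed.group(1): [result]} if indexed else {part: result}
    return result

def merge(a: dict, b: dict) -> dict:
    """Recursively merge b into a copy of a; values from b win on conflict."""
    out = dict(a)
    for key, val in b.items():
        if isinstance(out.get(key), dict) and isinstance(val, dict):
            out[key] = merge(out[key], val)
        else:
            out[key] = val
    return out

# The ingress settings discussed in this scenario, combined into one values tree.
values = merge(
    expand("server.ingress.enabled", True),
    expand("server.ingress.hosts[0]", "argocd.example.com"),
)
```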

4. Execution

The agent runs helm_dry_run_install first (mandatory for new installs), then helm_install_chart, and finally verifies via helm_get_release_status.

Scenario 3: Rollback

Objective: Revert a failed upgrade quickly.

User: "The last upgrade broke the site! Roll it back immediately."

The agent:

Lists history via helm_get_release_history:
- Rev 5: DEPLOYED (Current, Broken)
- Rev 4: SUPERSEDED (Last Good)
Presents the rollback plan for HITL approval.
Executes helm_rollback_release to revision 4.
Verifies via helm_get_release_status that the pods from Rev 4 are running and healthy.

Agent: "Successfully rolled back my-app to revision 4. Pods are stable."
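Choosing the rollback target from the release history can be sketched as follows. This assumes the history is available as a list of dictionaries with `revision` and `status` keys (a hypothetical shape), and that the newest superseded revision is the last good one. Release status alone cannot prove a revision was healthy, which is one reason the plan still goes through HITL approval.

```python
from typing import Dict, List

def pick_rollback_target(history: List[Dict]) -> int:
    """Return the newest superseded revision that is older than the deployed one."""
    current = max(h["revision"] for h in history if h["status"] == "deployed")
    candidates = [
        h["revision"]
        for h in history
        if h["status"] == "superseded" and h["revision"] < current
    ]
    if not candidates:
        raise ValueError("no earlier revision to roll back to")
    return max(candidates)
```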

🔄 App Operator — ArgoCD

Scenario 4: ArgoCD App Onboarding

Objective: Declaratively onboard an application via ArgoCD with plan preview and approvals.

User: "Onboard checkout-api from repo github.com/acme/platform at path /apps/checkout into project payments."

1. Prerequisite Checks

The ArgoCD Sub-Agent verifies:

Project payments exists → get_project
Repo is registered and allowed → list_repositories
Application name is available → list_applications

2. Plan Preview (HITL Gate)

Agent: "Here is the plan:

Confirm repo access for github.com/acme/platform.

Create ArgoCD app checkout-api in project payments.

Set destination namespace payments and sync policy manual.

[APPROVE / REJECT]"

User: "APPROVE"

3. Execute + Tool-Level Approval

The agent runs the MCP calls and requests a final confirmation before the create action.

Scenario 5: Automated Sync Debugging

Objective: Diagnose and fix an OutOfSync ArgoCD application.

User: "My checkout app is stuck in OutOfSync. What's wrong?"

The agent:

Checks sync status: get_sync_status → finds OutOfSync
Runs diff: get_application_diff → shows discrepancy in spec.replicas
Checks events: get_application_events → finds a failed sync attempt
Reports actionable remediation:

Agent: "The checkout app is OutOfSync because someone manually scaled the deployment to 5 replicas on the cluster, but the Git manifest still says 3.

Options:

Sync — Overwrite the cluster state with Git (reset to 3 replicas)

Commit — Update the Git manifest to 5 replicas

Which do you prefer?"
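The drift the agent reports boils down to a field-by-field comparison of the Git manifest against the live object. The sketch below shows that idea on plain dictionaries. It is an assumption-laden illustration, not how get_application_diff is implemented, and `diff_fields` is a hypothetical helper.

```python
def diff_fields(desired: dict, live: dict, prefix: str = "") -> dict:
    """Return {dotted.path: (desired_value, live_value)} for every leaf that differs."""
    changes = {}
    for key in sorted(set(desired) | set(live)):
        path = f"{prefix}.{key}" if prefix else key
        want, have = desired.get(key), live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            changes.update(diff_fields(want, have, path))
        elif want != have:
            changes[path] = (want, have)
    return changes
```

For the checkout example, comparing a manifest with 3 replicas against a live object with 5 yields a single entry for `spec.replicas`, which is what lets the agent offer the two remediation options.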

🔄 App Operator — Argo Rollouts

Scenario 6: Zero-Downtime Deployment Migration

Objective: Migrate a standard Deployment to an Argo Rollout using workloadRef without duplicating pods or causing ArgoCD drift.

User: "Convert the checkout deployment to a canary rollout."

1. Pre-Migration Validation

The Argo Rollouts Sub-Agent runs validate_deployment_ready:

Deployment exists ✅
All replicas healthy ✅
No active HPA conflicts ✅

2. Migration Plan (HITL Gate)

Agent: "🔄 MIGRATION PLAN REVIEW

Action: Convert Deployment → Rollout (workloadRef)
Deployment: checkout (namespace: production)
Strategy: Canary — 20% → 50% → 100%
Analysis: Prometheus AnalysisTemplate (error rate < 1%, p99 < 500ms)

⚠️ Post-migration steps required:

Add ignoreDifferences to ArgoCD Application (generated for you)

Commit scale-down manifest to Git (generated for you)

[APPROVE / REJECT]"

3. Execution & Follow-Up

After migration, the agent generates two mandatory artifacts:

generate_argocd_ignore_differences → Prevents false OutOfSync
generate_scale_down_manifest → Prevents duplicate pods
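To make the first artifact concrete, here is a minimal sketch of an Argo CD `ignoreDifferences` entry that tells Argo CD to ignore the Deployment's replica count. The field names (`group`, `kind`, `name`, `namespace`, `jsonPointers`) are standard Argo CD Application fields. Whether generate_argocd_ignore_differences emits exactly this shape is an assumption, and `ignore_replicas_entry` is a hypothetical helper.

```python
def ignore_replicas_entry(deployment: str, namespace: str) -> dict:
    """Build one ignoreDifferences entry for spec.ignoreDifferences on an Application."""
    return {
        "group": "apps",
        "kind": "Deployment",
        "name": deployment,
        "namespace": namespace,
        # JSON pointer to the field whose drift should not mark the app OutOfSync.
        "jsonPointers": ["/spec/replicas"],
    }

entry = ignore_replicas_entry("checkout", "production")
```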

Scenario 7: Canary Deployment with Autonomous Promotion

Objective: Execute a canary rollout with automatic step progression.

User: "Update the checkout service to v2.3.0 with a canary rollout."

Execution & Autonomous Promotion

Step 1: Traffic shifts to 20%. AnalysisRun passes → Agent promotes automatically (below the 50% threshold).
Step 2: Traffic shifts to 50%. AnalysisRun passes → Agent pauses and asks for explicit approval (at or above the 50% threshold).

Agent: "Canary at 50%. Metrics healthy:

Error rate: 0.2% (threshold: < 1%) ✅

p99 latency: 320ms (threshold: < 500ms) ✅

Promote to 100%? [APPROVE / REJECT]"

User: "APPROVE"

Step 3: Full promotion. Verification confirms all pods healthy.
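The promotion gate in this scenario reduces to a small decision rule: a canary step below 50% traffic is promoted automatically, and a step at 50% or more waits for approval. A minimal sketch follows. The `abort` outcome for a failed analysis is an assumption, since the transcript only shows passing runs.

```python
APPROVAL_THRESHOLD = 50  # percent of traffic at which a human must approve

def next_action(canary_weight: int, analysis_passed: bool) -> str:
    """Decide what happens after an AnalysisRun completes at the given canary weight."""
    if not analysis_passed:
        return "abort"  # assumed behaviour; not shown in the scenario
    if canary_weight < APPROVAL_THRESHOLD:
        return "auto-promote"
    return "ask-approval"
```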

🔄 App Operator — Traefik

Scenario 8: Weighted Canary Routing

Objective: Gradually shift traffic between service versions at the edge.

User: "Set 80/20 traffic split between checkout-v1 and checkout-v2."

The Traefik Sub-Agent reads current distribution, then presents:

Agent: "🔀 TRAFFIC SHIFT REVIEW

TraefikService: checkout-weighted (namespace: production)
Current: checkout-v1: 100% / checkout-v2: 0%
Proposed: checkout-v1: 80% / checkout-v2: 20%

[APPROVE / REJECT]"

After approval, the agent applies the weight shift and verifies the new distribution.
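Weighted routing is typically expressed as relative weights rather than percentages, so verifying a split means converting weights back into traffic shares. The sketch below shows that conversion. `traffic_shares` is a hypothetical helper, and how the Traefik Sub-Agent performs its own verification is not specified here.

```python
from typing import Dict

def traffic_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Convert relative service weights into percentage shares of traffic."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: 100 * weight / total for name, weight in weights.items()}
```

Because only the ratio matters, weights of 4 and 1 describe the same 80/20 split as weights of 80 and 20.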

Scenario 9: NGINX to Traefik Migration

Objective: Automatically translate legacy NGINX Ingress annotations into native Traefik middleware and IngressRoute configurations.

User: "Migrate my NGINX ingresses to Traefik."

1. NGINX Discovery

The agent scans all NGINX Ingress resources: traefik://migration/nginx-ingress-scan

2. Annotation Analysis

Analyzes each Ingress for breaking annotations: traefik://migration/nginx-ingress-analyze

Agent: "Found 12 NGINX Ingress resources across 4 namespaces. 2 have breaking annotations:

checkout-ingress: Uses nginx.ingress.kubernetes.io/auth-url (requires Traefik ForwardAuth middleware)

api-ingress: Uses nginx.ingress.kubernetes.io/canary-weight (requires TraefikService weighted routing)

I'll generate equivalent Traefik CRDs for all 12. Review before applying?"

3. Generate & Apply (HITL Gate)

The agent generates Traefik IngressRoute + Middleware YAML (action=generate), presents for review, then applies after approval (action=apply).
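The annotation analysis step can be thought of as a lookup from NGINX annotations to the Traefik construct that replaces them. The sketch below covers only the two annotations named in the agent's report. The input shape (Ingress name mapped to its annotations) and `find_breaking` are assumptions made for illustration.

```python
from typing import Dict, List

# Mapping taken from the two breaking annotations reported in this scenario.
TRAEFIK_EQUIVALENT = {
    "nginx.ingress.kubernetes.io/auth-url": "ForwardAuth middleware",
    "nginx.ingress.kubernetes.io/canary-weight": "TraefikService weighted routing",
}

def find_breaking(ingresses: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return, per Ingress, the Traefik constructs needed for its breaking annotations."""
    report = {}
    for name, annotations in ingresses.items():
        needed = [TRAEFIK_EQUIVALENT[a] for a in annotations if a in TRAEFIK_EQUIVALENT]
        if needed:
            report[name] = needed
    return report
```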

☸️ K8s Operator

Scenario 10: Automated Root Cause Analysis

Objective: Diagnose a crashing pod with automated correlation.

User: "Why is the payment-service crashing?"

The agent runs an automated investigation:

pods_list_in_namespace → Finds payment-service-5f8c... in CrashLoopBackOff
pods_log --previous → Reads crash log
events_list → Checks cluster events
pods_top → Correlates memory usage with limits

Log Finding:

panic: FATAL: env var 'STRIPE_API_KEY' not set

Agent: "Root Cause: The STRIPE_API_KEY environment variable is missing from the Deployment spec.

Recommended Fix: Add the environment variable referencing an existing Secret:
- name: STRIPE_API_KEY
  valueFrom:
    secretKeyRef:
      name: payment-secrets
      key: stripe-api-key
Shall I update the Deployment with this fix?"

Scenario 11: Resource Pressure Investigation (OOMKilled)

Objective: Investigate OOMKilled containers by correlating memory limits with live usage.

User: "My checkout pods keep getting OOMKilled."

The agent:

pods_list_in_namespace → Finds 3 pods whose last termination reason is OOMKilled (exit code 137)
pods_top → Live memory: 490Mi per pod
resources_get (Deployment) → Memory limit: 512Mi
nodes_top → Node memory pressure: 78% utilization

Agent: "Root Cause: Checkout pods are hitting the 512Mi memory limit. Live usage peaks at 490Mi (96% of limit).

Recommendation: Increase memory limit to 768Mi (490Mi peak + 50% headroom ≈ 735Mi, rounded up to 768Mi).

The node has sufficient capacity (78% utilization). Shall I patch the Deployment?"
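The sizing arithmetic behind the recommendation can be written out as a short calculation: take the observed peak, add headroom, and round up to a tidy increment. The 64Mi rounding step is an assumption chosen because it reproduces the 768Mi figure, and `recommend_limit_mi` is a hypothetical helper.

```python
import math

def recommend_limit_mi(peak_mi: int, headroom: float = 0.5, step_mi: int = 64) -> int:
    """Peak usage plus fractional headroom, rounded up to the next multiple of step_mi."""
    raw = peak_mi * (1 + headroom)           # 490Mi peak with 50% headroom gives 735Mi
    return math.ceil(raw / step_mi) * step_mi

def utilisation_pct(usage_mi: int, limit_mi: int) -> int:
    """Usage as a whole-number percentage of the limit."""
    return round(100 * usage_mi / limit_mi)
```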

Scenario 12: Safe Exec & Ephemeral Debugging

Objective: Debug DNS resolution issues using a temporary debug pod.

User: "Run a debug pod with busybox in the checkout namespace."

Agent: "🐛 EXEC APPROVAL

Action: Create ephemeral debug pod
Image: busybox:latest
Namespace: checkout
Command: sh (interactive shell)

⚠️ This grants shell-level access to the namespace. [APPROVE / REJECT]"

After approval, the agent creates the pod and relays command output.

🔭 Observability — Prometheus

Scenario 13: Exporter Lifecycle Management

Objective: Deploy, verify, and monitor a PostgreSQL exporter.

The agent follows the strict sequence: prom_install_exporter → prom_verify_exporter → prom_query_instant for validation.

User: "Install a postgres exporter in the monitoring namespace."

1. Recommendation

Agent: "I recommend the postgres-exporter for PostgreSQL monitoring. It exposes 50+ metrics including connection pools, query performance, and replication lag."

2. Plan Presentation (HITL Gate)

Agent: "📦 EXPORTER INSTALL REVIEW

Exporter: postgres-exporter
Namespace: monitoring
Resources: Deployment + Service + ServiceMonitor
Connection: Requires DATA_SOURCE_NAME env var pointing to PostgreSQL

[APPROVE / REJECT]"

3. Install & Mandatory Verification

prom_install_exporter → prom_verify_exporter → prom_query_instant(query="up{job='postgres-exporter'}")

Agent: "✅ Verified: postgres-exporter is scraping successfully. up{job='postgres-exporter'} = 1. 52 metrics discovered."
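The final check amounts to confirming that the `up` series for the exporter equals 1. The sketch below parses a response in the shape returned by the Prometheus HTTP API for instant queries, where each vector sample carries a `[timestamp, "value"]` pair. That prom_query_instant hands back this raw payload is an assumption, and `exporter_is_up` is a hypothetical helper.

```python
import json

def exporter_is_up(response_body: str) -> bool:
    """True when the instant query succeeded and every returned sample equals 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return False
    samples = payload["data"]["result"]
    # An empty result means no target matched, which is treated as "not up".
    return bool(samples) and all(float(s["value"][1]) == 1.0 for s in samples)
```

Treating an empty result as a failure is deliberate: a missing series usually means the ServiceMonitor never picked up the target.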

Scenario 14: Synthetic Monitoring with Probes

Objective: Set up endpoint monitoring using Blackbox exporter and Probe CRDs.

User: "Monitor the checkout API health endpoint at https://checkout.example.com/health."

The agent follows: prom_install_exporter (blackbox) → prom_apply_probe → prom_query_instant (validation).

Agent: "✅ Probe configured. Blackbox exporter is monitoring https://checkout.example.com/health every 30s. Current status: UP (200 OK, latency: 45ms)."

Scenario 15: Alerting Rule Authoring

Objective: Author and deploy a PrometheusRule CRD for high error rate detection.

User: "Create an alerting rule that fires when the checkout error rate exceeds 1% for 5 minutes."

The agent:

Checks existing rules: prom://rules/groups
Discovers CRD details: prom://kubernetes/prometheusrules
Drafts the rule: prom_draft_alert_rule
Presents for HITL approval, then deploys via prom_upsert_rule_group in k8s_crd mode

Agent: "✅ Alert rule CheckoutHighErrorRate (P1 critical) deployed to namespace monitoring. Group: checkout-alerts."
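For orientation, a PrometheusRule for this request might look like the structure built below. The resource layout (`monitoring.coreos.com/v1`, `spec.groups[].rules[]`) follows the Prometheus Operator CRD. The metric name `http_requests_total` and its `service` and `status` labels are assumptions about how checkout is instrumented, and the output of prom_draft_alert_rule may differ.

```python
def checkout_error_rate_rule(threshold: float = 0.01, window: str = "5m") -> dict:
    """Build a PrometheusRule manifest as a plain dict (metric names are hypothetical)."""
    selector = 'service="checkout"'
    expr = (
        f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
        f' / sum(rate(http_requests_total{{{selector}}}[{window}])) > {threshold}'
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "checkout-alerts", "namespace": "monitoring"},
        "spec": {
            "groups": [{
                "name": "checkout-alerts",
                "rules": [{
                    "alert": "CheckoutHighErrorRate",
                    "expr": expr,
                    "for": "5m",  # condition must hold this long before firing
                    "labels": {"severity": "critical"},
                }],
            }],
        },
    }
```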

🔭 Observability — Alertmanager

Scenario 16: On-Call Alert Triage

Objective: Get a human-readable summary of all firing alerts.

User: "What alerts are firing?"

Agent: "📊 ON-CALL SUMMARY (via am_summarize_oncall)

Critical (2):

CheckoutHighErrorRate — Error rate 8.5% (threshold: 1%) — checkout service — firing for 15m

PaymentTimeoutP99 — p99 latency 3.2s (threshold: 500ms) — payment service — firing for 8m

Warning (3):

CheckoutPodRestarting — 12 restarts in last hour

CheckoutReplicasMismatch — 2/5 replicas ready

DiskSpaceWarning — Node disk at 82%

Would you like me to investigate the critical alerts or silence any of them?"

Scenario 17: Silence Lifecycle (Preview → Validate → Create → Extend → Expire)

Objective: Silence noisy alerts during a maintenance window with mandatory blast radius checks.

User: "Silence alerts for the checkout service for 2 hours — we're doing planned maintenance."

1. Preview (Mandatory)

am_preview_silence → am_validate_silence_policy → am_create_silence

Agent: "🔇 SILENCE PREVIEW

Matchers: service=checkout
Duration: 2 hours
Blast Radius: 3 active alerts would be silenced:

CheckoutHighErrorRate (critical)

CheckoutLatencyP99 (warning)

CheckoutPodRestarting (warning)

Policy Check: ✅ Duration under 24h cap. Matchers are service-scoped (recommended).

[APPROVE / REJECT]"

2. Creation & Verification

Agent: "✅ Silence created (ID: abc123). 3 alerts silenced for 2 hours. Expires at 15:30 UTC.

To extend or expire early, just ask."

3. Extension (Later)

User: "Extend the checkout silence by 1 hour."

Agent: "✅ Silence abc123 extended by 1 hour. New expiry: 16:30 UTC."
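The preview and policy steps can be sketched as two small functions: one computes the blast radius from equality matchers, and one builds a silence body while enforcing the 24h cap. The payload fields (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`) mirror the Alertmanager v2 silence object. The alert list shape, the author name, and both function names are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

MAX_DURATION = timedelta(hours=24)  # the 24h cap mentioned in the policy check

def blast_radius(matchers: Dict[str, str], alerts: List[Dict[str, str]]) -> List[str]:
    """Names of active alerts whose labels satisfy every equality matcher."""
    return [
        a["alertname"] for a in alerts
        if all(a.get(k) == v for k, v in matchers.items())
    ]

def silence_payload(matchers: Dict[str, str], duration: timedelta,
                    start: Optional[datetime] = None) -> dict:
    """Build a silence body, refusing durations over the cap."""
    if duration > MAX_DURATION:
        raise ValueError("duration exceeds the 24h cap")
    start = start or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": k, "value": v, "isRegex": False, "isEqual": True}
            for k, v in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": (start + duration).isoformat(),
        "createdBy": "k8s-agent",          # hypothetical author
        "comment": "planned maintenance",
    }
```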

Scenario 18: Routing Audit & Integration Testing

Objective: Verify that alert routing is correctly configured.

User: "Who gets paged for critical alerts?"

The agent uses am_explain_routing to simulate routing:

Agent: "Routing Analysis for severity=critical:

Alert | Route | Receiver | Channel
CheckoutHighErrorRate | service=checkout → checkout-critical | PagerDuty | #checkout-oncall
PaymentTimeoutP99 | service=payment → payment-critical | PagerDuty | #payment-oncall
DiskSpaceCritical | Default route ⚠️ | default-receiver | #general-alerts

⚠️ DiskSpaceCritical is hitting the default route — it may be misconfigured. Want me to audit the default route for other misrouted alerts?"
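The routing simulation can be approximated as a first-match walk over label matchers with a fallback to the default receiver. The sketch below is a flattened simplification: real Alertmanager routing is a nested tree with options such as `continue`, and the internals of am_explain_routing are not documented here. `route` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

def route(alert: Dict[str, str],
          routes: List[Tuple[Dict[str, str], str]],
          default: str) -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(alert.get(k) == v for k, v in matchers.items()):
            return receiver
    return default  # nothing matched, so the alert lands on the default route

ROUTES = [
    ({"service": "checkout"}, "checkout-critical"),
    ({"service": "payment"}, "payment-critical"),
]
```

An alert that carries no `service` label falls through to the default receiver, which is the situation flagged for DiskSpaceCritical.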

🌐 Cross-Domain Investigation

Scenario 19: Observability → K8s Handoff

Objective: Investigate a production incident starting from alerting and drilling into cluster health.

User: "What alerts are firing for checkout?"

1. Observability Phase

Agent: "Found 5 critical alerts for checkout:

CheckoutHighErrorRate: Error rate at 8.5% (threshold: 1%)

CheckoutOOMKilled: 3 pods OOMKilled in last 30m

CheckoutLatencyP99: p99 at 2.3s (threshold: 500ms)

CheckoutPodRestarting: 12 restarts in last hour

CheckoutReplicasMismatch: 2/5 replicas ready

This looks like a pod crash storm. Want me to investigate the pods?"

User: "Yes"

2. Cross-Domain Handoff

The Observability coordinator returns: "This is outside my scope."

The Supervisor auto-routes to the K8s Operator with full context:

[CROSS-DOMAIN] Source: observability.
Prior findings: 5 critical alerts — error rate 8.5%, OOMKilled x3, p99 2.3s, 12 restarts, 2/5 replicas.
User Request: Check pod status for checkout service
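The handoff message is essentially a small structured envelope: source domain, prior findings, and the forwarded request. A minimal sketch of assembling it follows. The `Handoff` type and its field names are hypothetical, and only the rendered text format is taken from the example above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Handoff:
    source: str
    findings: List[str]
    request: str

    def render(self) -> str:
        """Format the envelope in the three-line layout shown in this scenario."""
        summary = ", ".join(self.findings)
        return (
            f"[CROSS-DOMAIN] Source: {self.source}.\n"
            f"Prior findings: {summary}.\n"
            f"User Request: {self.request}"
        )
```

Carrying the findings along is what lets the receiving operator skip re-asking the user for context.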

3. K8s Phase

The K8s Operator investigates with full context — no re-asking needed:

Lists pods → 3 in CrashLoopBackOff
Reads logs → FATAL: out of memory
Checks pods_top → pods at 490Mi (limit: 512Mi)

Agent: "Root Cause: Checkout pods are hitting the 512Mi memory limit. 3 pods have been OOMKilled.

Recommendation: Increase memory limit to 768Mi. The observed peak usage of 490Mi plus 50% headroom comes to about 735Mi, rounded up to 768Mi.

Shall I patch the Deployment?"
