Examples & Use Cases

Real-world scenarios demonstrating the Kubernetes Agent's capabilities across all four operational domains — Helm Operator, App Operator, K8s Operator, and Observability.

Each scenario shows the actual conversation flow, including gap detection, HITL approval gates, and verification steps.


📦 Helm Operator

Scenario 1: Intelligent Chart Generation

Objective: Create a production-ready Helm chart using the multi-phase pipeline.

The agent uses a full generation pipeline: helm-planner → helm-skill-builder (if needed) → helm-generator → helm-validator → HITL approval → github-agent.

The Request

User: "Create a helm chart for my payment-service. It's a Python FastAPI app using payment:v2 image. It needs a Redis connection and should be exposed on port 8000."

1. Planning (Gap Detection)

The Planner Agent reads the request and notices a missing detail (Redis connection info), triggering a clarifying question.

Agent: "I can certainly help with that. To ensure the Redis connection is configured correctly, how should the application connect to Redis?

  1. Should I include a Redis sub-chart (deploy a new Redis)?
  2. Or will it connect to an external Redis URL provided via environment variables?"

User: "It connects to an external Redis at redis.infrastructure.svc."

2. Architecture Design

The planner generates a plan:

  • Workload: Deployment (Stateless generic app)
  • Config: ConfigMap for REDIS_HOST environment variable
  • Networking: Service on port 8000 + Ingress rules
  • Health: Standard HTTP liveness probes on /health

3. Generation & Self-Healing Validation

The Generator Agent writes the files following Bitnami conventions, then the Validator runs triple-check validation:

Issue Detected:

Event: helm_lint_validator
Status: FAILED
Error: templates/deployment.yaml:23:21: invalid indentation in 'env' block.

Self-Healing Action: The agent automatically reads line 23, fixes the whitespace, saves the file, and re-runs validation, which then passes.
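The validate, fix, re-validate cycle described above can be sketched as a bounded retry loop. This is a minimal illustration rather than the agent's actual implementation: `ValidationResult`, `run_validator`, `apply_fix`, and the `max_attempts` cap are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValidationResult:
    passed: bool
    error: Optional[str] = None  # e.g. "templates/deployment.yaml:23:21: ..."


def validate_with_self_healing(
    run_validator: Callable[[], ValidationResult],
    apply_fix: Callable[[str], None],
    max_attempts: int = 3,  # hypothetical cap so a bad fix cannot loop forever
) -> ValidationResult:
    """Run the validator; on failure apply a fix and re-run, up to max_attempts runs."""
    result = run_validator()
    attempts = 1
    while not result.passed and attempts < max_attempts:
        apply_fix(result.error or "")
        result = run_validator()
        attempts += 1
    return result
```

Capping the number of attempts means a chart that cannot be repaired automatically is surfaced to the user instead of being retried indefinitely.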

4. GitHub Persistence

After HITL approval, the github-agent commits all chart files to your repository using GitHub MCP tools — no manual copy-pasting required.


Scenario 2: Live Helm Operations (Install / Upgrade / Rollback)

Objective: Install a complex third-party application with the full phased safety pipeline.

The helm-operation sub-agent follows: Discovery → Planning → Dry-Run → Execution → Verification.

The Request

User: "Install ArgoCD from the official repo into the argocd namespace."

1. Discovery & Context

The Helm Operations Agent checks the cluster:

  • Namespace exists? No → Plans creation.
  • Repo added? No → Adds argo https://argoproj.github.io/argo-helm.
  • Release exists? No → Mode: INSTALL.
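The three discovery checks above map naturally onto a small decision function. The sketch below is an assumption about how such a step could be modeled: `ClusterState`, `plan_discovery`, the step names, and the `UPGRADE` mode for an existing release are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ClusterState:
    namespace_exists: bool
    repo_added: bool
    release_exists: bool


def plan_discovery(state: ClusterState) -> Dict[str, Union[str, List[str]]]:
    """Translate discovery findings into prerequisite steps and an operation mode."""
    pre_steps: List[str] = []
    if not state.namespace_exists:
        pre_steps.append("create_namespace")
    if not state.repo_added:
        pre_steps.append("add_repo")
    # Assumption: an existing release switches the operation to an upgrade.
    mode = "UPGRADE" if state.release_exists else "INSTALL"
    return {"mode": mode, "pre_steps": pre_steps}
```

For the ArgoCD request in this scenario, all three checks are negative, so the result would be `INSTALL` with both prerequisite steps queued.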

2. Values Confirmation (HITL Gate)

Agent: "I found the chart argo/argo-cd version 5.51.0.

Required Configuration: No mandatory values found, but I recommend enabling server.ingress.enabled if you want external access.

Do you want to proceed with default values or customize anything?"

User: "Enable ingress with host argocd.example.com."

3. Plan Approval

The agent generates the installation plan and presents it for final sign-off.

Agent: "🚀 INSTALLATION PLAN REVIEW

Action: Installation
Chart: argo/argo-cd
Version: 5.51.0
Release Name: argo-cd
Namespace: argocd

Configuration Values

server.ingress.enabled=true
server.ingress.hosts[0]=argocd.example.com

Safety Check: Cluster connectivity is green.

[APPROVE / REJECT]"

User: "APPROVE"

4. Execution

The agent runs helm_dry_run_install first (mandatory for new installs), then helm_install_chart, and finally verifies via helm_get_release_status.
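The ordering guarantee in this step (no install unless the dry run succeeds, no verification unless the install succeeds) can be expressed as a short-circuiting pipeline. The tool names come from the text above; modeling each tool as a zero-argument callable that returns `True` on success is an assumption made purely for illustration.

```python
from typing import Callable, Dict, List

# Order taken from the scenario: dry run, then install, then status check.
INSTALL_ORDER = ["helm_dry_run_install", "helm_install_chart", "helm_get_release_status"]


def run_install_pipeline(tools: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each phase in order and stop at the first failure.

    Returns the names of the phases that completed successfully.
    """
    completed: List[str] = []
    for name in INSTALL_ORDER:
        if not tools[name]():
            break
        completed.append(name)
    return completed
```

If the returned list is shorter than `INSTALL_ORDER`, the first missing name identifies the phase that failed.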


Scenario 3: Rollback

Objective: Revert a failed upgrade quickly.

User: "The last upgrade broke the site! Roll it back immediately."

  1. The agent lists history via helm_get_release_history:
    • Rev 5: DEPLOYED (Current, Broken)
    • Rev 4: SUPERSEDED (Last Good)
  2. Presents rollback plan for HITL approval.
  3. Executes: helm_rollback_release to revision 4.
  4. Verifies: Confirms pods from Rev 4 are running and healthy via helm_get_release_status.

Agent: "Successfully rolled back my-app to revision 4. Pods are stable."
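Choosing the rollback target from the release history can be sketched as follows. This assumes the history is available as `(revision, status)` pairs and that the most recent `SUPERSEDED` revision below the current one is the rollback candidate; `pick_rollback_revision` is a hypothetical helper, and a `SUPERSEDED` status alone does not prove the revision was healthy, which is why the plan still goes through HITL approval.

```python
from typing import List, Optional, Tuple


def pick_rollback_revision(history: List[Tuple[int, str]]) -> Optional[int]:
    """Return the newest SUPERSEDED revision older than the current one, if any."""
    if not history:
        return None
    current = max(rev for rev, _ in history)
    candidates = [
        rev for rev, status in history
        if rev < current and status == "SUPERSEDED"
    ]
    return max(candidates) if candidates else None
```

With the history shown in this scenario (revision 5 deployed, revision 4 superseded), the helper selects revision 4.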


🔄 App Operator — ArgoCD

Scenario 4: ArgoCD App Onboarding

Objective: Declaratively onboard an application via ArgoCD with plan preview and approvals.

User: "Onboard checkout-api from repo github.com/acme/platform at path /apps/checkout into project payments."

1. Prerequisite Checks

The ArgoCD Sub-Agent verifies:

  • Project payments exists → get_project
  • Repo is registered and allowed → list_repositories
  • Application name is available → list_applications

2. Plan Preview (HITL Gate)

Agent: "Here is the plan:

  1. Confirm repo access for github.com/acme/platform.
  2. Create ArgoCD app checkout-api in project payments.
  3. Set destination namespace payments and sync policy manual.

[APPROVE / REJECT]"

User: "APPROVE"

3. Execute + Tool-Level Approval

The agent runs the MCP calls and requests a final confirmation before the create action.


Scenario 5: Automated Sync Debugging

Objective: Diagnose and fix an OutOfSync ArgoCD application.

User: "My checkout app is stuck in OutOfSync. What's wrong?"

The agent:

  1. Checks sync status: get_sync_status → finds OutOfSync
  2. Runs diff: get_application_diff → shows discrepancy in spec.replicas
  3. Checks events: get_application_events → finds a failed sync attempt
  4. Reports actionable remediation:

Agent: "The checkout app is OutOfSync because someone manually scaled the deployment to 5 replicas on the cluster, but the Git manifest still says 3.

Options:

  1. Sync — Overwrite the cluster state with Git (reset to 3 replicas)
  2. Commit — Update the Git manifest to 5 replicas

Which do you prefer?"


🔄 App Operator — Argo Rollouts

Scenario 6: Zero-Downtime Deployment Migration

Objective: Migrate a standard Deployment to an Argo Rollout using workloadRef without duplicating pods or causing ArgoCD drift.

User: "Convert the checkout deployment to a canary rollout."

1. Pre-Migration Validation

The Argo Rollouts Sub-Agent runs validate_deployment_ready:

  • Deployment exists ✅
  • All replicas healthy ✅
  • No active HPA conflicts ✅

2. Migration Plan (HITL Gate)

Agent: "🔄 MIGRATION PLAN REVIEW

Action: Convert Deployment → Rollout (workloadRef)
Deployment: checkout (namespace: production)
Strategy: Canary, 20% → 50% → 100%
Analysis: Prometheus AnalysisTemplate (error rate < 1%, p99 < 500ms)

⚠️ Post-migration steps required:

  1. Add ignoreDifferences to ArgoCD Application (generated for you)
  2. Commit scale-down manifest to Git (generated for you)

[APPROVE / REJECT]"

3. Execution & Follow-Up

After migration, the agent generates two mandatory artifacts:

  • generate_argocd_ignore_differences → Prevents false OutOfSync
  • generate_scale_down_manifest → Prevents duplicate pods
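To make the migration artifacts more concrete, the sketch below builds a Rollout that references the existing Deployment through `workloadRef`, plus an ArgoCD `ignoreDifferences` entry for the Deployment's replica count. The field names follow the Argo Rollouts and ArgoCD resource schemas, but the helper functions are hypothetical, and the output of the agent's own generators may differ.

```python
from typing import Any, Dict, List


def canary_steps(weights: List[int]) -> List[Dict[str, Any]]:
    """Build setWeight/pause pairs for every weight below 100.

    Finishing the last step promotes the Rollout fully, so no explicit
    setWeight: 100 step is emitted.
    """
    steps: List[Dict[str, Any]] = []
    for weight in weights:
        if weight >= 100:
            continue
        steps.append({"setWeight": weight})
        steps.append({"pause": {}})  # pause until promoted
    return steps


def rollout_manifest(name: str, namespace: str, replicas: int, weights: List[int]) -> Dict[str, Any]:
    """Rollout that reuses the pod template of an existing Deployment."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Rollout",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": replicas,
            "workloadRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "strategy": {"canary": {"steps": canary_steps(weights)}},
        },
    }


def ignore_differences(name: str, namespace: str) -> List[Dict[str, Any]]:
    """ArgoCD ignoreDifferences entry so a scaled-down Deployment is not flagged as drift."""
    return [{
        "group": "apps",
        "kind": "Deployment",
        "name": name,
        "namespace": namespace,
        "jsonPointers": ["/spec/replicas"],
    }]
```

The replica count passed to `rollout_manifest` is left to the caller because the scenario does not state it.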

Scenario 7: Canary Deployment with Autonomous Promotion

Objective: Execute a canary rollout with automatic step progression.

User: "Update the checkout service to v2.3.0 with a canary rollout."

Execution & Autonomous Promotion

  • Step 1: Traffic shifts to 20%. AnalysisRun passes → Agent promotes automatically (below the 50% threshold).
  • Step 2: Traffic shifts to 50%. AnalysisRun passes → Agent pauses and asks for explicit approval (at or above the 50% threshold).

Agent: "Canary at 50%. Metrics healthy:

  • Error rate: 0.2% (threshold: < 1%) ✅
  • p99 latency: 320ms (threshold: < 500ms) ✅

Promote to 100%? [APPROVE / REJECT]"

User: "APPROVE"

  • Step 3: Full promotion. Verification confirms all pods healthy.
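The promotion rule in this scenario (automatic at 20%, explicit approval at 50%) can be written as a small decision function. This is a sketch under stated assumptions: `promotion_decision` and its return values are hypothetical, and treating a failed AnalysisRun as an abort is an assumption, since the scenario only shows passing runs.

```python
def promotion_decision(
    canary_weight: int,
    analysis_passed: bool,
    approval_threshold: int = 50,
) -> str:
    """Decide what happens after a canary step's analysis completes."""
    if not analysis_passed:
        return "abort"  # assumption: a failed analysis stops the rollout
    if canary_weight < approval_threshold:
        return "auto_promote"
    return "require_approval"
```

Keeping the threshold as a parameter makes the boundary explicit: a weight exactly equal to the threshold requires approval.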

🔄 App Operator — Traefik

Scenario 8: Weighted Canary Routing

Objective: Gradually shift traffic between service versions at the edge.

User: "Set 80/20 traffic split between checkout-v1 and checkout-v2."

The Traefik Sub-Agent reads current distribution, then presents:

Agent: "🔀 TRAFFIC SHIFT REVIEW

TraefikService: checkout-weighted (namespace: production)
Current: checkout-v1: 100% / checkout-v2: 0%
Proposed: checkout-v1: 80% / checkout-v2: 20%

[APPROVE / REJECT]"

After approval, applies the weight shift and verifies the new distribution.
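The proposed split corresponds to a TraefikService with weighted backends. The sketch below builds such a manifest as a Python dictionary. The `weighted.services` layout follows the Traefik CRD, but `weighted_service` is a hypothetical helper, the service port of 80 is an assumption, and older Traefik releases use the `traefik.containo.us/v1alpha1` API group instead. The sum-to-100 check is a sanity check on percentage input, not a Traefik requirement.

```python
from typing import Any, Dict


def weighted_service(
    name: str,
    namespace: str,
    weights: Dict[str, int],
    port: int = 80,  # assumption: the scenario does not state the service port
) -> Dict[str, Any]:
    """Build a TraefikService manifest with percentage weights per backend."""
    if sum(weights.values()) != 100:
        raise ValueError("percentage weights must sum to 100")
    return {
        "apiVersion": "traefik.io/v1alpha1",
        "kind": "TraefikService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "weighted": {
                "services": [
                    {"name": svc, "port": port, "weight": weight}
                    for svc, weight in weights.items()
                ]
            }
        },
    }
```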


Scenario 9: NGINX to Traefik Migration

Objective: Automatically translate legacy NGINX Ingress annotations into native Traefik middleware and IngressRoute configurations.

User: "Migrate my NGINX ingresses to Traefik."

1. NGINX Discovery

The agent scans all NGINX Ingress resources: traefik://migration/nginx-ingress-scan

2. Annotation Analysis

Analyzes each Ingress for breaking annotations: traefik://migration/nginx-ingress-analyze

Agent: "Found 12 NGINX Ingress resources across 4 namespaces. 2 have breaking annotations:

  • checkout-ingress: Uses nginx.ingress.kubernetes.io/auth-url (requires Traefik ForwardAuth middleware)
  • api-ingress: Uses nginx.ingress.kubernetes.io/canary-weight (requires TraefikService weighted routing)

I'll generate equivalent Traefik CRDs for all 12. Review before applying?"

3. Generate & Apply (HITL Gate)

The agent generates Traefik IngressRoute + Middleware YAML (action=generate), presents for review, then applies after approval (action=apply).
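The annotation analysis can be pictured as a lookup from NGINX annotations to the Traefik construct each one requires. The two mappings below are the ones named in the agent's report; the `find_breaking` helper and the input shape (Ingress name mapped to its annotations) are hypothetical.

```python
from typing import Dict, List

# Mappings taken from the report above; a real analysis would cover more annotations.
BREAKING_ANNOTATIONS: Dict[str, str] = {
    "nginx.ingress.kubernetes.io/auth-url": "Traefik ForwardAuth middleware",
    "nginx.ingress.kubernetes.io/canary-weight": "TraefikService weighted routing",
}


def find_breaking(ingresses: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return, per Ingress, the Traefik constructs its breaking annotations require."""
    report: Dict[str, List[str]] = {}
    for name, annotations in ingresses.items():
        needed = [
            BREAKING_ANNOTATIONS[key]
            for key in annotations
            if key in BREAKING_ANNOTATIONS
        ]
        if needed:
            report[name] = needed
    return report
```

Ingresses without any listed annotation are omitted from the report, which matches the "2 of 12" style of summary shown above.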


☸️ K8s Operator

Scenario 10: Automated Root Cause Analysis

Objective: Diagnose a crashing pod with automated correlation.

User: "Why is the payment-service crashing?"

The agent runs an automated investigation:

  1. pods_list_in_namespace → Finds payment-service-5f8c... in CrashLoopBackOff
  2. pods_log --previous → Reads crash log
  3. events_list → Checks cluster events
  4. pods_top → Correlates memory usage with limits

Log Finding:

panic: FATAL: env var 'STRIPE_API_KEY' not set

Agent: "Root Cause: The STRIPE_API_KEY environment variable is missing from the Deployment spec.

Recommended Fix: Add the environment variable referencing an existing Secret:

- name: STRIPE_API_KEY
  valueFrom:
    secretKeyRef:
      name: payment-secrets
      key: stripe-api-key

Shall I update the Deployment with this fix?"


Scenario 11: Resource Pressure Investigation (OOMKilled)

Objective: Investigate OOMKilled containers by correlating memory limits with live usage.

User: "My checkout pods keep getting OOMKilled."

The agent:

  1. pods_list_in_namespace → Finds 3 pods with OOMKilled exit code
  2. pods_top → Live memory: 490Mi per pod
  3. resources_get (Deployment) → Memory limit: 512Mi
  4. nodes_top → Node memory pressure: 78% utilization

Agent: "Root Cause: Checkout pods are hitting the 512Mi memory limit. Live usage peaks at 490Mi (96% of limit).

Recommendation: Increase memory limit to 768Mi (490Mi peak + 50% headroom ≈ 735Mi, rounded up to 768Mi).

The node has sufficient capacity (78% utilization). Shall I patch the Deployment?"
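The arithmetic behind the recommendation can be reproduced in a few lines. This is one plausible reading of how 768Mi is reached: take the observed peak, add 50% headroom, and round up to a convenient step. The 128Mi rounding step and the function name are assumptions, not something the agent is documented to use.

```python
import math


def recommend_memory_limit_mi(peak_mi: int, headroom: float = 0.5, step_mi: int = 128) -> int:
    """Peak usage plus headroom, rounded up to the next multiple of step_mi."""
    target = peak_mi * (1 + headroom)          # 490 * 1.5 = 735.0
    return math.ceil(target / step_mi) * step_mi


def usage_percent(usage_mi: int, limit_mi: int) -> int:
    """Current usage as a rounded percentage of the limit."""
    return round(100 * usage_mi / limit_mi)
```

With the figures from this scenario, `recommend_memory_limit_mi(490)` yields 768 and `usage_percent(490, 512)` yields 96.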


Scenario 12: Safe Exec & Ephemeral Debugging

Objective: Debug DNS resolution issues using a temporary debug pod.

User: "Run a debug pod with busybox in the checkout namespace."

Agent: "🐛 EXEC APPROVAL

Action: Create ephemeral debug pod
Image: busybox:latest
Namespace: checkout
Command: sh (interactive shell)

⚠️ This grants shell-level access to the namespace. [APPROVE / REJECT]"

After approval, the agent creates the pod and relays command output.


🔭 Observability — Prometheus

Scenario 13: Exporter Lifecycle Management

Objective: Deploy, verify, and monitor a PostgreSQL exporter.

The agent follows the strict sequence: prom_install_exporter → prom_verify_exporter → prom_query_instant for validation.

User: "Install a postgres exporter in the monitoring namespace."

1. Recommendation

Agent: "I recommend the postgres-exporter for PostgreSQL monitoring. It exposes 50+ metrics including connection pools, query performance, and replication lag."

2. Plan Presentation (HITL Gate)

Agent: "📦 EXPORTER INSTALL REVIEW

Exporter: postgres-exporter
Namespace: monitoring
Resources: Deployment + Service + ServiceMonitor
Connection: Requires DATA_SOURCE_NAME env var pointing to PostgreSQL

[APPROVE / REJECT]"

3. Install & Mandatory Verification

prom_install_exporter → prom_verify_exporter → prom_query_instant(query="up{job='postgres-exporter'}")

Agent: "✅ Verified: postgres-exporter is scraping successfully. up{job='postgres-exporter'} = 1. 52 metrics discovered."
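The final `up` check boils down to inspecting an instant-query result. The sketch below evaluates a response shaped like the Prometheus HTTP API's instant-query output (a `vector` result whose samples carry `[timestamp, "value"]` pairs). The `all_targets_up` helper is hypothetical, and how `prom_query_instant` actually returns its data is not specified in this document.

```python
from typing import Any, Dict


def all_targets_up(response: Dict[str, Any]) -> bool:
    """True only if the query succeeded and every returned series has value 1."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # An empty result means the job is not being scraped at all.
    return bool(results) and all(float(r["value"][1]) == 1.0 for r in results)
```

Treating an empty result as a failure matters here: a missing ServiceMonitor produces no `up` series rather than `up == 0`.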


Scenario 14: Synthetic Monitoring with Probes

Objective: Set up endpoint monitoring using Blackbox exporter and Probe CRDs.

User: "Monitor the checkout API health endpoint at https://checkout.example.com/health."

The agent follows: prom_install_exporter (blackbox) → prom_apply_probe → prom_query_instant (validation).

Agent: "✅ Probe configured. Blackbox exporter is monitoring https://checkout.example.com/health every 30s. Current status: UP (200 OK, latency: 45ms)."
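A Probe resource for this endpoint might look like the manifest built below. The structure follows the Prometheus Operator `Probe` CRD; the resource name, the `http_2xx` module, and the Blackbox exporter address are assumptions for illustration, and the manifest that `prom_apply_probe` actually produces may differ.

```python
from typing import Any, Dict


def probe_manifest(
    name: str,
    namespace: str,
    target_url: str,
    interval: str = "30s",
    prober_url: str = "blackbox-exporter.monitoring.svc:9115",  # hypothetical address
    module: str = "http_2xx",  # assumption: a module with this name is configured
) -> Dict[str, Any]:
    """Build a Probe that asks the Blackbox exporter to check one static URL."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Probe",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "interval": interval,
            "module": module,
            "prober": {"url": prober_url},
            "targets": {"staticConfig": {"static": [target_url]}},
        },
    }
```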


Scenario 15: Alerting Rule Authoring

Objective: Author and deploy a PrometheusRule CRD for high error rate detection.

User: "Create an alerting rule that fires when the checkout error rate exceeds 1% for 5 minutes."

The agent:

  1. Checks existing rules: prom://rules/groups
  2. Discovers CRD details: prom://kubernetes/prometheusrules
  3. Drafts the rule: prom_draft_alert_rule
  4. Presents for HITL approval, then deploys via prom_upsert_rule_group in k8s_crd mode

Agent: "✅ Alert rule CheckoutHighErrorRate (P1 critical) deployed to namespace monitoring. Group: checkout-alerts."
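For orientation, a PrometheusRule matching this request could be assembled as shown below. The resource layout follows the Prometheus Operator `PrometheusRule` CRD, and the alert, group, and namespace names come from the agent's reply. The metric name `http_requests_total` and its `service` and `status` labels are assumptions, since the document does not say which metrics the checkout service exposes.

```python
from typing import Any, Dict


def error_rate_rule(service: str, threshold: float = 0.01, window: str = "5m") -> Dict[str, Any]:
    """Build a PrometheusRule that fires when the 5xx ratio stays above the threshold."""
    # Hypothetical metric and labels; adjust to the service's real instrumentation.
    expr = (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[{window}]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
        f" > {threshold}"
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{service}-alerts", "namespace": "monitoring"},
        "spec": {
            "groups": [{
                "name": f"{service}-alerts",
                "rules": [{
                    "alert": "CheckoutHighErrorRate",
                    "expr": expr,
                    "for": "5m",  # must hold for 5 minutes before firing
                    "labels": {"severity": "critical"},
                }],
            }]
        },
    }
```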


🔭 Observability — Alertmanager

Scenario 16: On-Call Alert Triage

Objective: Get a human-readable summary of all firing alerts.

User: "What alerts are firing?"

Agent: "📊 ON-CALL SUMMARY (via am_summarize_oncall)

Critical (2):

  • CheckoutHighErrorRate — Error rate 8.5% (threshold: 1%) — checkout service — firing for 15m
  • PaymentTimeoutP99 — p99 latency 3.2s (threshold: 500ms) — payment service — firing for 8m

Warning (3):

  • CheckoutPodRestarting — 12 restarts in last hour
  • CheckoutReplicasMismatch — 2/5 replicas ready
  • DiskSpaceWarning — Node disk at 82%

Would you like me to investigate the critical alerts or silence any of them?"


Scenario 17: Silence Lifecycle (Preview → Validate → Create → Extend → Expire)

Objective: Silence noisy alerts during a maintenance window with mandatory blast radius checks.

User: "Silence alerts for the checkout service for 2 hours — we're doing planned maintenance."

1. Preview (Mandatory)

am_preview_silence → am_validate_silence_policy → am_create_silence

Agent: "🔇 SILENCE PREVIEW

Matchers: service=checkout
Duration: 2 hours
Blast Radius: 3 active alerts would be silenced:

  • CheckoutHighErrorRate (critical)
  • CheckoutLatencyP99 (warning)
  • CheckoutPodRestarting (warning)

Policy Check: ✅ Duration under 24h cap. Matchers are service-scoped (recommended).

[APPROVE / REJECT]"

2. Creation & Verification

Agent: "✅ Silence created (ID: abc123). 3 alerts silenced for 2 hours. Expires at 15:30 UTC.

To extend or expire early, just ask."

3. Extension (Later)

User: "Extend the checkout silence by 1 hour."

Agent: "✅ Silence abc123 extended by 1 hour. New expiry: 16:30 UTC."
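The policy check and the expiry arithmetic in this scenario can be sketched as follows. The payload fields mirror the silence object of the Alertmanager v2 API (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`); the helper names are hypothetical, and the 24-hour cap is taken from the policy check shown in the preview.

```python
from datetime import datetime, timedelta
from typing import Any, Dict

MAX_SILENCE = timedelta(hours=24)  # cap mentioned in the policy check above


def build_silence(
    matchers: Dict[str, str],
    starts_at: datetime,
    duration: timedelta,
    created_by: str,
    comment: str,
) -> Dict[str, Any]:
    """Build a silence payload, rejecting durations over the policy cap."""
    if duration > MAX_SILENCE:
        raise ValueError("silence duration exceeds the 24h policy cap")
    return {
        "matchers": [
            {"name": k, "value": v, "isRegex": False, "isEqual": True}
            for k, v in matchers.items()
        ],
        "startsAt": starts_at.isoformat(),
        "endsAt": (starts_at + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def extend_silence(silence: Dict[str, Any], extra: timedelta) -> Dict[str, Any]:
    """Return a copy of the silence with its end time pushed back by extra."""
    new_end = datetime.fromisoformat(silence["endsAt"]) + extra
    return {**silence, "endsAt": new_end.isoformat()}
```

A silence starting at 13:30 UTC with a 2-hour duration ends at 15:30 UTC, and extending it by 1 hour moves the end to 16:30 UTC, matching the times in the dialogue.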


Scenario 18: Routing Audit & Integration Testing

Objective: Verify that alert routing is correctly configured.

User: "Who gets paged for critical alerts?"

The agent uses am_explain_routing to simulate routing:

Agent: "Routing Analysis for severity=critical:

Alert                 | Route            | Receiver          | Channel
CheckoutHighErrorRate | service=checkout | checkout-critical | PagerDuty #checkout-oncall
PaymentTimeoutP99     | service=payment  | payment-critical  | PagerDuty #payment-oncall
DiskSpaceCritical     | Default route ⚠️ | default-receiver  | #general-alerts

⚠️ DiskSpaceCritical is hitting the default route — it may be misconfigured. Want me to audit the default route for other misrouted alerts?"
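The idea behind the routing simulation can be shown with a deliberately simplified model: a flat list of routes checked in order, with a fallback to the default receiver. Real Alertmanager routing is a tree with nested routes and `continue` semantics, so this is only an illustration, and `explain_routing` here is a hypothetical helper rather than the `am_explain_routing` tool itself.

```python
from typing import Dict, List, Tuple

Route = Tuple[Dict[str, str], str]  # (label matchers, receiver name)


def explain_routing(
    labels: Dict[str, str],
    routes: List[Route],
    default_receiver: str,
) -> Tuple[str, bool]:
    """Return (receiver, fell_through_to_default) for an alert's labels."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver, False
    return default_receiver, True
```

An alert that carries no `service` label, like DiskSpaceCritical in the analysis above, matches none of the service-scoped routes and falls through to the default receiver.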


🌐 Cross-Domain Investigation

Scenario 19: Observability → K8s Handoff

Objective: Investigate a production incident starting from alerting and drilling into cluster health.

User: "What alerts are firing for checkout?"

1. Observability Phase

Agent: "Found 5 critical alerts for checkout:

  • CheckoutHighErrorRate: Error rate at 8.5% (threshold: 1%)
  • CheckoutOOMKilled: 3 pods OOMKilled in last 30m
  • CheckoutLatencyP99: p99 at 2.3s (threshold: 500ms)
  • CheckoutPodRestarting: 12 restarts in last hour
  • CheckoutReplicasMismatch: 2/5 replicas ready

This looks like a pod crash storm. Want me to investigate the pods?"

User: "Yes"

2. Cross-Domain Handoff

The Observability coordinator returns: "This is outside my scope."

The Supervisor auto-routes to the K8s Operator with full context:

[CROSS-DOMAIN] Source: observability.
Prior findings: 5 critical alerts — error rate 8.5%, OOMKilled x3, p99 2.3s, 12 restarts, 2/5 replicas.
User Request: Check pod status for checkout service
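The handoff message above has a regular shape, which suggests it is rendered from structured context. A minimal sketch of such a structure follows; the `Handoff` class and its field names are hypothetical and only reproduce the three-line format shown in the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Handoff:
    source: str           # domain that produced the findings
    summary: str          # one-line headline, e.g. "5 critical alerts"
    findings: List[str]   # short facts carried over to the next domain
    user_request: str     # what the receiving operator should do

    def render(self) -> str:
        """Render the context block passed to the next operator."""
        details = ", ".join(self.findings)
        return (
            f"[CROSS-DOMAIN] Source: {self.source}.\n"
            f"Prior findings: {self.summary} — {details}.\n"
            f"User Request: {self.user_request}"
        )
```

Passing the findings along in this form is what lets the receiving operator start its investigation without asking the user to repeat earlier context.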

3. K8s Phase

The K8s Operator investigates with full context — no re-asking needed:

  1. Lists pods → 3 in CrashLoopBackOff
  2. Reads logs → FATAL: out of memory
  3. Checks pods_top → pods at 490Mi (limit: 512Mi)

Agent: "Root Cause: Checkout pods are hitting the 512Mi memory limit. 3 pods have been OOMKilled.

Recommendation: Increase memory limit to 768Mi. This covers the observed peak usage of 490Mi plus 50% headroom (about 735Mi), rounded up to 768Mi.

Shall I patch the Deployment?"