Troubleshooting Guide

Diagnose and resolve issues across all four domains — Helm Operator, App Operator, K8s Operator, and Observability.

🔌 1. Connectivity & Startup

"Config not found"

Error: Kubernetes configuration file not found at /root/.kube/config
Cause: The agent container cannot see your kubeconfig file.
Fix: Ensure your Docker volume mount is correct.

# Correct mount syntax
-v ~/.kube/config:/root/.kube/config
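If the mount looks correct but the error persists, a small preflight check run inside the container can narrow down what the agent actually sees. The sketch below is illustrative only: the function name check_kubeconfig and its status strings are assumptions, not part of the agent. It covers one common pitfall: when the host side of a -v file mount does not exist, Docker creates an empty directory at that path instead of a file.

```python
import os
from pathlib import Path


def check_kubeconfig(path: str = "/root/.kube/config") -> str:
    """Return a short status string for the kubeconfig at `path`.

    Hypothetical helper for illustration; the agent does not ship it.
    """
    p = Path(path)
    if p.is_dir():
        # Docker creates a directory when the host file of a bind mount is missing.
        return "directory"
    if not p.exists():
        return "missing"
    if not os.access(p, os.R_OK):
        return "unreadable"
    if p.stat().st_size == 0:
        return "empty"
    return "ok"


if __name__ == "__main__":
    print(check_kubeconfig())
```

A result of "directory" usually means the host path in the -v flag was wrong when the container was first started.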

"Connection Refused (MCP)"

Error: Failed to connect to MCP server at http://localhost:9000 or Connection refused
Cause: The main agent container cannot reach the MCP server container.
Fix:

Docker Compose: Ensure both services are in the same network (default behavior).
DNS: If you run the containers manually, use the service name (e.g., helm-mcp-server) instead of localhost in the configuration. Inside the agent container, localhost refers to the agent container itself.
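To confirm whether the MCP endpoint is reachable from inside the agent container, a plain TCP probe is often enough. This is a minimal sketch: mcp_reachable is a hypothetical helper, and the URLs in the usage lines are taken from the examples above. A successful connect only shows that the port is open, not that the MCP server is healthy.

```python
import socket
from urllib.parse import urlparse


def mcp_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        return False
    # Fall back to the scheme's default port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    for candidate in ("http://localhost:9000", "http://helm-mcp-server:9000"):
        print(candidate, mcp_reachable(candidate))
```

If the service-name URL succeeds while the localhost URL fails, the configuration is still pointing at localhost.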

"Connection Refused (Kubernetes)"

Error: dial tcp 127.0.0.1:6443: connect: connection refused
Cause: Your kubeconfig points to localhost. Inside Docker, localhost is the container, not your laptop.
Fix:

Docker Desktop: Ensure context is set to docker-desktop.
Linux: Use --network host mode.
Generic: Replace localhost in kubeconfig with host.docker.internal (Docker Desktop) or your machine's LAN IP.
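The generic fix can be scripted against a copy of the kubeconfig. The sketch below is an assumption-laden illustration: rewrite_local_server is a hypothetical helper that edits server: lines with a regular expression rather than parsing YAML, so it only handles the common single-line form. The API server's TLS certificate may not list the new hostname, in which case the connection error is replaced by a certificate error.

```python
import re

# Matches "server: https://localhost..." or "server: https://127.0.0.1..." lines.
_LOCAL_SERVER = re.compile(
    r"^([ \t]*server:[ \t]*https?://)(localhost|127\.0\.0\.1)(?=[:/]|[ \t]*$)",
    re.MULTILINE,
)


def rewrite_local_server(kubeconfig_text: str,
                         new_host: str = "host.docker.internal") -> str:
    """Replace a loopback host in kubeconfig server URLs with `new_host`."""
    return _LOCAL_SERVER.sub(lambda m: m.group(1) + new_host, kubeconfig_text)


if __name__ == "__main__":
    sample = "clusters:\n- cluster:\n    server: https://127.0.0.1:6443\n"
    print(rewrite_local_server(sample))
```

Write the result to a separate file and mount that copy, so the kubeconfig on the host keeps working outside Docker.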

"MCP Server Initialization Failed"

Error: JIT MCP connection failed for {server_name} or TaskGroup auth error
Cause: The MCP server is unreachable or authentication has expired.
Fix:

Check if the MCP server container is running: docker ps | grep mcp
For GitHub MCP: Regenerate your Personal Access Token and update GITHUB_PERSONAL_ACCESS_TOKEN
For ArgoCD MCP: Verify ARGOCD_AUTH_TOKEN is valid and the ArgoCD server is reachable

🛑 2. Workflow & Logic Issues

Agent is Stuck "Thinking" or "Validating"

Symptom: The agent enters a loop of "Validating..." → "Fixing..." → "Validating...".
Cause: The Generator Agent is caught in a self-healing loop where its "fix" causes a new error.
Fix:

Wait: The loop has a hard limit (default: 2 retries per tool). When the limit is reached, the agent stops and calls ask_human to request help.
Interrupt: You can manually stop the agent and provide the correct file content via the prompt.
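The bounded loop described above can be pictured with a short sketch. This is not the agent's implementation: run_with_retry_limit, its callback parameters, and the MAX_RETRIES constant are assumptions chosen to mirror the documented default of 2 retries before escalating to ask_human.

```python
from typing import Callable, Optional

MAX_RETRIES = 2  # mirrors the documented default; the real setting name is not given here


def run_with_retry_limit(
    validate: Callable[[], Optional[str]],
    fix: Callable[[str], None],
    ask_human: Callable[[str], None],
    max_retries: int = MAX_RETRIES,
) -> str:
    """Validate, apply a fix on failure, and escalate once retries run out."""
    error = validate()
    retries = 0
    while error is not None:
        if retries >= max_retries:
            # Hard limit reached: stop fixing and hand the last error to a human.
            ask_human(error)
            return "escalated"
        fix(error)
        retries += 1
        error = validate()
    return "ok"
```

With a validator that never passes, the sketch applies two fixes and then escalates, which is why waiting is a valid option when the agent appears stuck.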

"Approval Required" but no Buttons

Symptom: The agent says it needs approval, but the UI doesn't show buttons.
Cause: The UI might not have rendered the A2UI card correctly, or you are using a CLI that doesn't support interactive elements.
Fix: Type explicit text approval: "APPROVE", "YES", or "Proceed". The agent's text parser often acts as a fallback for UI buttons.
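As a rough picture of why short, unambiguous replies work best, the sketch below shows one way a text fallback could recognize approval. It is hypothetical: is_text_approval and APPROVAL_WORDS are illustrative names, and the agent's actual parser may accept more or fewer phrasings.

```python
# Approval phrases taken from this guide; the real parser's list is an assumption.
APPROVAL_WORDS = {"approve", "yes", "proceed"}


def is_text_approval(message: str) -> bool:
    """Return True when the whole message is one of the approval words."""
    normalized = message.strip().strip("\"'.!").lower()
    return normalized in APPROVAL_WORDS


if __name__ == "__main__":
    for reply in ("APPROVE", "Proceed.", "yes but change the namespace"):
        print(repr(reply), is_text_approval(reply))
```

In this sketch a reply that mixes approval with new instructions is not treated as approval, so send the approval on its own.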

Agent Asks for Information it Should Already Have

Symptom: The agent asks for parameters (chart name, namespace) that were already provided.
Cause: Conversation history may have been summarized, losing context, or the sub-agent didn't receive context via [PLAN-LOCKED].
Fix:

Repeat the critical parameters in your response
If the problem persists, set LOG_LEVEL=DEBUG and check the logs to confirm that the operations journal is being read correctly

Cross-Domain Handoff Fails

Symptom: The agent says "outside my scope" but doesn't route to the correct domain and instead presents the raw handoff message to the user.
Cause: The Supervisor's pattern matching didn't detect the handoff signal.
Fix: This is a known edge case. Tell the Supervisor which domain to use with an explicit request, e.g., "Check pods for checkout" (triggers K8s Operator) or "What alerts are firing?" (triggers Observability).

🔒 3. Permissions (RBAC)

"Forbidden" on Deployment

Error: deployments.apps is forbidden: User "system:serviceaccount:..." cannot create
Cause: The agent's Service Account lacks the create verb for that resource.
Fix: Update the ClusterRole bound to the agent.

# Grant all verbs on Deployments and StatefulSets in the apps API group
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["*"]
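If granting every verb is broader than you want, the Forbidden message itself names the missing verb and resource. The sketch below parses the standard Kubernetes wording of that message and builds the narrowest matching rule. The helper names are hypothetical, and the sample message uses a made-up service account (default:agent) and namespace (prod).

```python
import re
from typing import Optional

# Standard API server wording: User "..." cannot <verb> resource "..." in API group "..."
_FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\S+) resource "(?P<resource>[^"]+)"'
    r' in API group "(?P<group>[^"]*)"'
    r'(?: in the namespace "(?P<namespace>[^"]+)")?'
)


def parse_forbidden(message: str) -> Optional[dict]:
    """Extract user, verb, resource, group, and namespace from a Forbidden error."""
    match = _FORBIDDEN.search(message)
    return match.groupdict() if match else None


def minimal_rule(parsed: dict) -> dict:
    """Build one RBAC rule that grants only the verb that was denied."""
    return {
        "apiGroups": [parsed["group"]],
        "resources": [parsed["resource"]],
        "verbs": [parsed["verb"]],
    }


if __name__ == "__main__":
    msg = ('deployments.apps is forbidden: User "system:serviceaccount:default:agent" '
           'cannot create resource "deployments" in API group "apps" '
           'in the namespace "prod"')
    print(minimal_rule(parse_forbidden(msg)))
```

The resulting rule can be added to the role in place of the wildcard verbs when least privilege matters more than convenience.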

Helm "Release Not Found" (but it exists)

Error: Error: release: "my-app" not found
Cause: Helm uses Secrets/ConfigMaps to store state. If the agent lacks permission to read Secrets in the target namespace, it cannot see the release.
Fix: Ensure the RBAC role includes secrets access in the namespace where the chart is installed.

ArgoCD "Permission Denied"

Error: permission denied: applications, get, ...
Cause: The ArgoCD auth token doesn't have sufficient RBAC permissions for the target project/application.
Fix: Update the ArgoCD RBAC policy or regenerate the auth token with broader permissions. Use argocd admin settings rbac (for example, its can subcommand) to check the current policies.

🔄 4. App Operator Issues

ArgoCD Sync Fails with "OutOfSync" After Rollout Migration

Symptom: After migrating a Deployment to an Argo Rollout, ArgoCD shows OutOfSync indefinitely.
Cause: ArgoCD detects differences between the Git manifest and the live state because Argo Rollouts modifies the spec.template.metadata and status fields.
Fix: Apply the ignoreDifferences configuration that the Argo Rollouts sub-agent generates:

spec:
  ignoreDifferences:
    - group: argoproj.io
      kind: Rollout
      jsonPointers:
        - /spec/template/metadata
        - /status

Argo Rollouts "AnalysisRun Failed"

Symptom: Canary rollout aborts because the AnalysisRun reports failure.
Cause: The Prometheus query in the AnalysisTemplate returned a value exceeding the threshold.
Fix:

Check the AnalysisRun details: argorollout://rollouts/{ns}/{name}/detail
Run the Prometheus query manually: ask the agent "Query the error rate for checkout in the last 30 minutes"
If the metric is genuinely healthy, adjust the AnalysisTemplate thresholds

Traefik "No Matching Route"

Symptom: Traffic shift applied but requests return 404.
Cause: The Traefik IngressRoute or TraefikService references a Kubernetes Service that doesn't exist or has mismatched port names.
Fix: Verify the target services exist and have matching ports. Ask the agent: "Show me the traefik route distribution for checkout."

🔭 5. Observability Issues

Prometheus Exporter Shows up=0

Symptom: Exporter deployed but up{job='...'} == 0.
Cause: Prometheus can't scrape the exporter. Common reasons:

Port mismatch: ServiceMonitor port doesn't match Service port
Namespace isolation: ServiceMonitor is in a different namespace than the serviceMonitorSelector allows
Network policy: The monitoring namespace can't reach the exporter pod

Fix: Ask the agent to diagnose:

"The postgres exporter shows up=0. Can you diagnose?"

The agent will run:

prom://topology/failed_targets — check if the target appears as failed
prom_test_endpoint — test if the metrics endpoint is reachable
prom://kubernetes/prometheusrules — verify ServiceMonitor configuration
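The port-mismatch cause listed above can also be checked by hand from the two manifests. In a ServiceMonitor, each endpoint's port field refers to a Service port by name, so a name that the Service does not define will never be scraped. The sketch below assumes both manifests are already loaded as dictionaries; unmatched_endpoint_ports is a hypothetical helper, and the port names in the usage lines are made up.

```python
def unmatched_endpoint_ports(service_monitor: dict, service: dict) -> list:
    """Return ServiceMonitor endpoint port names that the Service does not define."""
    service_ports = {
        port.get("name") for port in service.get("spec", {}).get("ports", [])
    }
    endpoints = service_monitor.get("spec", {}).get("endpoints", [])
    return [
        endpoint["port"]
        for endpoint in endpoints
        if "port" in endpoint and endpoint["port"] not in service_ports
    ]


if __name__ == "__main__":
    monitor = {"spec": {"endpoints": [{"port": "metrics"}]}}
    svc = {"spec": {"ports": [{"name": "http-metrics", "port": 9187}]}}
    print(unmatched_endpoint_ports(monitor, svc))
```

An empty list rules out the port-name mismatch and points toward namespace selection or network policy instead.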

Alertmanager Silence Created but Alerts Still Firing

Symptom: Silence created successfully but alerts remain visible.
Cause: Silence matchers don't match the alert labels exactly. Common issue: using service=checkout when the alert has service_name=checkout.
Fix:

Check the exact alert labels: am_list_alerts
Compare with silence matchers: am://silences/active
Create a new silence with corrected matchers
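The comparison in the steps above can be sketched as a small function. It assumes matchers in the shape used by the Alertmanager v2 API (name, value, isRegex, isEqual) and handles only non-regex matchers. failing_matchers is a hypothetical helper, and the labels in the usage lines come from the example in the cause. A label that is missing from the alert is treated as an empty string, as Alertmanager does.

```python
def failing_matchers(matchers: list, alert_labels: dict) -> list:
    """Return (name, expected, actual) for each non-regex matcher the alert fails."""
    failures = []
    for matcher in matchers:
        if matcher.get("isRegex"):
            continue  # regex matchers are out of scope for this sketch
        actual = alert_labels.get(matcher["name"], "")
        wants_equal = matcher.get("isEqual", True)
        if (actual == matcher["value"]) != wants_equal:
            failures.append((matcher["name"], matcher["value"], actual))
    return failures


if __name__ == "__main__":
    silence = [{"name": "service", "value": "checkout", "isRegex": False, "isEqual": True}]
    alert = {"alertname": "HighErrorRate", "service_name": "checkout"}
    print(failing_matchers(silence, alert))
```

A silence applies only when every matcher passes, so a single entry in the returned list is enough to keep the alert firing.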

"Duplicate PrometheusRule CRD Created"

Symptom: After creating an alerting rule, two PrometheusRule resources (instances of the PrometheusRule CRD) exist with similar content.
Cause: prom_upsert_rule_group in k8s_crd mode was called with the wrong namespace, creating a new resource instead of patching the existing one.
Fix:

Delete the duplicate: kubectl delete prometheusrule <name> -n <wrong-namespace>
Verify the correct one exists: prom://kubernetes/prometheusrules

🐛 6. Low-Level Debugging

Enable Debug Logs

Increase verbosity to see the raw tool calls and MCP payloads.

Environment Variable:

LOG_LEVEL=DEBUG

Output:

[DEBUG] Calling tool: helm_mcp_server.list_releases
[DEBUG] MCP Payload: {"namespace": "default"}
[DEBUG] Tool Output: [...]
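When the debug output gets long, it can help to reduce it to the sequence of tool calls. The sketch below assumes the exact line format shown in the sample output above, and extract_tool_calls is a hypothetical helper. Real logs may carry timestamps or other prefixes that would require adjusting the patterns.

```python
import json
import re

_TOOL = re.compile(r"^\[DEBUG\] Calling tool: (?P<server>[\w-]+)\.(?P<tool>[\w-]+)\s*$")
_PAYLOAD = re.compile(r"^\[DEBUG\] MCP Payload: (?P<payload>.*)$")


def extract_tool_calls(log_text: str) -> list:
    """Pair each 'Calling tool' line with the MCP payload line that follows it."""
    calls = []
    for line in log_text.splitlines():
        tool_match = _TOOL.match(line)
        if tool_match:
            calls.append({"server": tool_match["server"],
                          "tool": tool_match["tool"],
                          "payload": None})
            continue
        payload_match = _PAYLOAD.match(line)
        if payload_match and calls and calls[-1]["payload"] is None:
            raw = payload_match["payload"]
            try:
                calls[-1]["payload"] = json.loads(raw)
            except json.JSONDecodeError:
                calls[-1]["payload"] = raw  # keep the raw text if it is not valid JSON
    return calls


if __name__ == "__main__":
    sample = ('[DEBUG] Calling tool: helm_mcp_server.list_releases\n'
              '[DEBUG] MCP Payload: {"namespace": "default"}\n'
              '[DEBUG] Tool Output: [...]\n')
    print(extract_tool_calls(sample))
```

Scanning the resulting list for a wrong namespace or a repeated tool name is usually faster than reading the raw log.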

Inspect Generated Artifacts

If the Generator fails, inspecting the raw files helps.

Check the WORKSPACE_DIR (default: /tmp/helm-charts inside the container).
If you mounted a volume, check your local mapped folder.

ls -F my-local-charts/my-app/templates/
cat my-local-charts/my-app/values.yaml

Check Operations Journal

The operations journal records all previous operations. If context seems lost, inspect it:

# Inside the container or via the agent
read_file /memories/helm-operator/operations-log.md

Inspect MCP Server Logs

Each MCP server runs in its own container. Check their logs for connectivity or tool execution issues:

docker logs helm-mcp-server
docker logs argocd-mcp-server
docker logs prometheus-mcp-server
docker logs alertmanager-mcp-server
