Troubleshooting Guide
Diagnose and resolve issues across all four domains — Helm Operator, App Operator, K8s Operator, and Observability.
🔌 1. Connectivity & Startup
"Config not found"
Error: Kubernetes configuration file not found at /root/.kube/config
Cause: The agent container cannot see your kubeconfig file.
Fix: Ensure your Docker volume mount is correct.
# Correct mount syntax
-v ~/.kube/config:/root/.kube/config
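A quick pre-flight check on the host can rule out the most common cause: when the source path of a `-v` bind mount does not exist, Docker creates an empty directory there instead of failing. The sketch below is a hypothetical helper (the name `check_kubeconfig` is an assumption, not part of the project) that verifies the path is a readable regular file.

```shell
# Hypothetical pre-flight helper; not part of the project.
# Succeeds only if the given path is a readable regular file.
check_kubeconfig() {
  if [ -f "$1" ] && [ -r "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or unreadable: $1" >&2
    return 1
  fi
}

# Example (image name is a placeholder):
# check_kubeconfig "$HOME/.kube/config" && \
#   docker run -v "$HOME/.kube/config:/root/.kube/config" <image>
```

If the check fails because the path is a directory, a previous run with a wrong mount probably created it; remove the directory before retrying.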
"Connection Refused (MCP)"
Error: Failed to connect to MCP server at http://localhost:9000 or Connection refused
Cause: The main agent container cannot talk to the MCP server container.
Fix:
- Docker Compose: Ensure both services are in the same network (default behavior).
- DNS: Use the service name (e.g., helm-mcp-server) instead of localhost in configuration if running manually.
"Connection Refused (Kubernetes)"
Error: dial tcp 127.0.0.1:6443: connect: connection refused
Cause: Your kubeconfig points to localhost. Inside Docker, localhost is the container, not your laptop.
Fix:
- Docker Desktop: Ensure context is set to docker-desktop.
- Linux: Use --network host mode.
- Generic: Replace localhost in kubeconfig with host.docker.internal (Docker Desktop) or your machine's LAN IP.
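The generic fix can be sketched as a small text rewrite. The helper below is hypothetical (`rewrite_kubeconfig_host` is not part of the project) and assumes the kubeconfig contains plain `server: https://localhost:...` or `server: https://127.0.0.1:...` lines. It prints the modified config to stdout and leaves the original file untouched, so the result can be saved as a copy and mounted instead.

```shell
# Hypothetical helper: print a kubeconfig with loopback API server
# addresses replaced by another host. The input file is not modified.
# $1 = kubeconfig path, $2 = replacement host (default: host.docker.internal)
rewrite_kubeconfig_host() {
  sed -E "s#(server: https?://)(localhost|127\.0\.0\.1)#\1${2:-host.docker.internal}#" "$1"
}

# Example (the output path is a placeholder):
# rewrite_kubeconfig_host "$HOME/.kube/config" > /tmp/kubeconfig-docker
```

TLS verification can still fail after the rewrite if the API server certificate does not list the new host name.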
"MCP Server Initialization Failed"
Error: JIT MCP connection failed for {server_name} or TaskGroup auth error
Cause: The MCP server is unreachable or authentication has expired.
Fix:
- Check if the MCP server container is running: docker ps | grep mcp
- For GitHub MCP: Regenerate your Personal Access Token and update GITHUB_PERSONAL_ACCESS_TOKEN
- For ArgoCD MCP: Verify ARGOCD_AUTH_TOKEN is valid and the ArgoCD server is reachable
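Before regenerating anything, it may be worth confirming that the token variables are set at all in the shell that launches the containers. The following is a minimal sketch; `check_tokens` is a hypothetical helper, and it only detects unset or empty variables, not expired or under-privileged tokens.

```shell
# Hypothetical helper: report which of the named environment variables
# are unset or empty. Returns non-zero if any are missing.
check_tokens() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "not set: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# check_tokens GITHUB_PERSONAL_ACCESS_TOKEN ARGOCD_AUTH_TOKEN
```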
🛑 2. Workflow & Logic Issues
Agent is Stuck "Thinking" or "Validating"
Symptom: The agent enters a loop of "Validating..." → "Fixing..." → "Validating...".
Cause: The Generator Agent is caught in a self-healing loop where its "fix" causes a new error.
Fix:
- Wait: The loop has a hard limit (default: 2 retries per tool). It will eventually stop and ask_human for help.
- Interrupt: You can manually stop the agent and provide the correct file content via the prompt.
"Approval Required" but no Buttons
Symptom: The agent says it needs approval, but the UI doesn't show buttons.
Cause: The UI might not have rendered the A2UI card correctly, or you are using a CLI that doesn't support interactive elements.
Fix:
Type explicit text approval: "APPROVE", "YES", or "Proceed". The agent's text parser often acts as a fallback for UI buttons.
Agent Asks for Information it Should Already Have
Symptom: The agent asks for parameters (chart name, namespace) that were already provided.
Cause: Conversation history may have been summarized, losing context, or the sub-agent didn't receive context via [PLAN-LOCKED].
Fix:
- Repeat the critical parameters in your response
- If the problem persists, set LOG_LEVEL=DEBUG and check whether the operations journal is being read correctly
Cross-Domain Handoff Fails
Symptom: The agent says "outside my scope" but doesn't route to the correct domain, and instead presents the raw handoff message to the user.
Cause: The Supervisor's pattern matching didn't detect the handoff signal.
Fix: This is a known edge case. Manually tell the Supervisor which domain to use, e.g., "Check pods for checkout" (triggers K8s Operator) or "What alerts are firing?" (triggers Observability).
🔒 3. Permissions (RBAC)
"Forbidden" on Deployment
Error: deployments.apps is forbidden: User "system:serviceaccount:..." cannot create
Cause: The agent's Service Account lacks the create verb for that resource.
Fix: Update the ClusterRole bound to the agent.
# Grant all verbs on Deployments and StatefulSets in the apps API group
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["*"]
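To confirm the diagnosis before and after changing the ClusterRole, `kubectl auth can-i` with impersonation reports what a service account is allowed to do. The wrapper below is a hypothetical sketch: the function name and its argument order are assumptions, it assumes the service account lives in the same namespace as the target resources, and the caller needs permission to impersonate service accounts.

```shell
# Hypothetical helper: ask the API server whether a service account may
# perform a verb on a resource. kubectl prints "yes" or "no".
# $1 = namespace, $2 = service account name, $3 = verb, $4 = resource
agent_can() {
  kubectl auth can-i "$3" "$4" \
    --namespace "$1" \
    --as "system:serviceaccount:$1:$2"
}

# Example (namespace and account name are placeholders):
# agent_can my-namespace my-agent-sa create deployments.apps
```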
Helm "Release Not Found" (but it exists)
Error: Error: release: "my-app" not found
Cause: Helm uses Secrets/ConfigMaps to store state. If the agent lacks permission to read Secrets in the target namespace, it cannot see the release.
Fix: Ensure the RBAC role includes secrets access in the namespace where the chart is installed.
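One way to check this is to list the release Secrets directly with the agent's credentials. The sketch below assumes Helm 3 with its default Secret storage driver, which labels each release Secret with `owner=helm` and `name=<release>`; the function name `helm_release_secrets` is hypothetical.

```shell
# Hypothetical helper: list the Secrets Helm 3 uses to store a release.
# Assumes the default Secret storage driver.
# $1 = namespace, $2 = release name
helm_release_secrets() {
  kubectl get secrets --namespace "$1" \
    --selector "owner=helm,name=$2"
}

# Example:
# helm_release_secrets default my-app
```

A Forbidden error here points to missing RBAC, while an empty list suggests the release lives in a different namespace.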
ArgoCD "Permission Denied"
Error: permission denied: applications, get, ...
Cause: The ArgoCD auth token doesn't have sufficient RBAC permissions for the target project/application.
Fix: Update the ArgoCD RBAC policy or regenerate the auth token with broader permissions. Check argocd admin settings rbac for current policies.
🔄 4. App Operator Issues
ArgoCD Sync Fails with "OutOfSync" After Rollout Migration
Symptom: After migrating a Deployment to an Argo Rollout, ArgoCD shows OutOfSync indefinitely.
Cause: ArgoCD detects differences between the Git manifest and the live state because Argo Rollouts modifies the spec.template.metadata and status fields.
Fix: Apply the ignoreDifferences configuration that the Argo Rollouts sub-agent generates:
spec:
  ignoreDifferences:
    - group: argoproj.io
      kind: Rollout
      jsonPointers:
        - /spec/template/metadata
        - /status
Argo Rollouts "AnalysisRun Failed"
Symptom: Canary rollout aborts because the AnalysisRun reports failure.
Cause: The Prometheus query in the AnalysisTemplate returned a value exceeding the threshold.
Fix:
- Check the AnalysisRun details: argorollout://rollouts/{ns}/{name}/detail
- Run the Prometheus query manually: ask the agent "Query the error rate for checkout in the last 30 minutes"
- If the metric is genuinely healthy, adjust the AnalysisTemplate thresholds
Traefik "No Matching Route"
Symptom: Traffic shift applied but requests return 404.
Cause: The Traefik IngressRoute or TraefikService references a Kubernetes Service that doesn't exist or has mismatched port names.
Fix: Verify the target services exist and have matching ports. Ask the agent: "Show me the traefik route distribution for checkout."
🔭 5. Observability Issues
Prometheus Exporter Shows up=0
Symptom: Exporter deployed but up{job='...'} == 0.
Cause: Prometheus can't scrape the exporter. Common reasons:
- Port mismatch: ServiceMonitor port doesn't match Service port
- Namespace isolation: ServiceMonitor is in a different namespace than the serviceMonitorSelector allows
- Network policy: The monitoring namespace can't reach the exporter pod
Fix: Ask the agent to diagnose:
"The postgres exporter shows up=0. Can you diagnose?"
The agent will run:
- prom://topology/failed_targets: check if the target appears as failed
- prom_test_endpoint: test if the metrics endpoint is reachable
- prom://kubernetes/prometheusrules: verify ServiceMonitor configuration
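The port-mismatch case can also be checked by hand with kubectl. The helper below is a hypothetical sketch, not what the agent runs: it assumes the ServiceMonitor and the Service are in the same namespace and that the ServiceMonitor endpoints reference ports by name (the `port` field) rather than by `targetPort`.

```shell
# Hypothetical helper: check that every port name referenced by a
# ServiceMonitor exists on the target Service.
# $1 = namespace, $2 = ServiceMonitor name, $3 = Service name
check_monitor_ports() {
  monitor_ports=$(kubectl get servicemonitor "$2" -n "$1" \
    -o jsonpath='{.spec.endpoints[*].port}')
  service_ports=$(kubectl get service "$3" -n "$1" \
    -o jsonpath='{.spec.ports[*].name}')
  status=0
  for port in $monitor_ports; do
    # Pad with spaces so only whole port names match.
    case " $service_ports " in
      *" $port "*) echo "ok: $port" ;;
      *) echo "mismatch: $port not found on Service $3" >&2; status=1 ;;
    esac
  done
  return "$status"
}
```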
Alertmanager Silence Created but Alerts Still Firing
Symptom: Silence created successfully but alerts remain visible.
Cause: Silence matchers don't match the alert labels exactly. Common issue: using service=checkout when the alert has service_name=checkout.
Fix:
- Check the exact alert labels: am_list_alerts
- Compare with silence matchers: am://silences/active
- Create a new silence with corrected matchers
"Duplicate PrometheusRule CRD Created"
Symptom: After creating an alerting rule, two PrometheusRule resources exist with similar content.
Cause: prom_upsert_rule_group in k8s_crd mode was called with the wrong namespace, creating a new PrometheusRule resource instead of patching the existing one.
Fix:
- Delete the duplicate: kubectl delete prometheusrule <name> -n <wrong-namespace>
- Verify the correct one exists: prom://kubernetes/prometheusrules
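To find out which namespace holds the duplicate, it can help to list PrometheusRule names across all namespaces and look for repeats. This sketch assumes the duplicate was created with the same name as the original; `duplicate_rule_names` is a hypothetical helper.

```shell
# Hypothetical helper: print PrometheusRule names that exist in more
# than one namespace. Assumes the duplicate shares the original's name.
duplicate_rule_names() {
  kubectl get prometheusrules --all-namespaces --no-headers \
    -o custom-columns=NAME:.metadata.name |
    sort | uniq -d
}
```

For each name it prints, `kubectl get prometheusrules --all-namespaces` shows the namespaces involved.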
🐛 6. Low-Level Debugging
Enable Debug Logs
Increase verbosity to see the raw tool calls and MCP payloads.
Environment Variable:
LOG_LEVEL=DEBUG
Output:
[DEBUG] Calling tool: helm_mcp_server.list_releases
[DEBUG] MCP Payload: {"namespace": "default"}
[DEBUG] Tool Output: [...]
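Debug output gets long quickly, so a small filter that keeps only the tool names can make the call sequence easier to follow. The sketch below assumes log lines contain the `[DEBUG] Calling tool: ` prefix exactly as in the sample above; `tool_calls` is a hypothetical helper that reads the log from stdin.

```shell
# Hypothetical helper: read agent log lines on stdin and print only the
# names of the tools that were called, in order.
tool_calls() {
  sed -n 's/.*\[DEBUG\] Calling tool: //p'
}

# Example (container name is a placeholder):
# docker logs <agent-container> 2>&1 | tool_calls
```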
Inspect Generated Artifacts
If the Generator fails, inspecting the raw files helps.
- Check the WORKSPACE_DIR (default /tmp/helm-charts inside container).
- If you mounted a volume, check your local mapped folder.
ls -F my-local-charts/my-app/templates/
cat my-local-charts/my-app/values.yaml
Check Operations Journal
The operations journal records all previous operations. If context seems lost, inspect it:
# Inside the container or via the agent
read_file /memories/helm-operator/operations-log.md
Inspect MCP Server Logs
Each MCP server runs in its own container. Check their logs for connectivity or tool execution issues:
docker logs helm-mcp-server
docker logs argocd-mcp-server
docker logs prometheus-mcp-server
docker logs alertmanager-mcp-server
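When several servers are involved, scanning all of them for errors in one pass can save time. The following is a minimal sketch: `mcp_errors` is a hypothetical helper, and both the 200-line tail and the case-insensitive "error" filter are arbitrary choices that may need adjusting to each server's log format.

```shell
# Hypothetical helper: show recent log lines mentioning "error" for each
# named container. Container names are passed as arguments.
mcp_errors() {
  for container in "$@"; do
    echo "== $container =="
    docker logs --tail 200 "$container" 2>&1 | grep -i "error" || echo "(no matches)"
  done
}

# Example:
# mcp_errors helm-mcp-server argocd-mcp-server prometheus-mcp-server alertmanager-mcp-server
```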