🔭 Observability
Manages the full monitoring and alerting stack — Prometheus PromQL queries, exporter lifecycle, alerting rules, TSDB analysis, Alertmanager triage, silence management, and routing audit.
The Observability domain coordinates two sub-agents (prometheus-operator and alertmanager-operator), each connecting to its own MCP server. Together they provide deep monitoring and alerting capabilities that integrate with the rest of the k8s-autopilot ecosystem.
Architecture
Prometheus Sub-Agent (prometheus-operator)
Orchestrates Prometheus monitoring operations — from simple PromQL queries to complex exporter lifecycle management, rule authoring, and TSDB cardinality analysis.
Read-Only Operations (Fast-Path)
Resource URIs (via read_mcp_resource)
| Query Type | Resource URI |
|---|---|
| All backends health | prom://system/backends |
| Backend detail | prom://system/backends/{backend_id} |
| Service catalog | prom://topology/services |
| Service metrics | prom://topology/services/{job}/metrics |
| Failed targets | prom://topology/failed_targets |
| TSDB cardinality | prom://tsdb/cardinality |
| Runtime config | prom://config/runtime |
| Rule groups | prom://rules/groups |
| K8s PrometheusRules CRDs | prom://kubernetes/prometheusrules |
| Metric catalog | prom://metadata/catalog |
| Exporter catalog | prom://exporters/catalog |
| Best practices | prom://best-practices |
| Onboarding guide | prom://onboarding-guide |
Tool-Based Queries
| Query Type | Tool |
|---|---|
| Run instant query | prom_query_instant |
| Run range query | prom_query_range |
| Validate PromQL | prom_validate_promql |
| Explore metric labels | prom_explore_labels |
| Test endpoint health | prom_test_endpoint |
| Recommend instrumentation | prom_recommend_instrumentation |
| Recommend exporter | prom_recommend_exporter |
| Describe alert rule | prom_describe_alert_rule |
| Analyze firing history | prom_analyze_firing_history |
| Draft alert rule | prom_draft_alert_rule |
| Tune alert thresholds | prom_tune_alert_rule |
State-Modifying Workflows
| Operation | HITL Phase | Plan Content |
|---|---|---|
| Exporter Install | exporter_install_review | 📦 Exporter type, namespace, K8s resources to create |
| Rule Create/Update | rule_group_review | 📋 Group name, backend, rule count, storage mode |
| ServiceMonitor | servicemonitor_review | 📡 Service, namespace, scrape interval |
| File SD Add/Remove | file_sd_review | 📁 Targets, file path, action |
Verification & Failure Diagnosis (Mandatory)
The agent never declares success based on tool stdout. After every mutation:
| After... | Verify with... | If Failed |
|---|---|---|
| Exporter install | prom_verify_exporter → confirm up{} series | 1. Check prom://topology/failed_targets 2. Run prom_test_endpoint 3. Escalate |
| ServiceMonitor | prom_query_instant(query="up{job='...'}") | Same as exporter install |
| Rule upsert | prom://rules/groups → confirm group appears | Check namespace and ruleSelector in config |
| File SD add | prom_query_instant(query="up{job='...'}") | Same as exporter install |
Verification returns one of:
- ✅ Verified — operation confirmed successful
- ⚠️ Deployed but Unhealthy — resource exists but not scraping
- ❌ Failed — operation did not take effect
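The three outcomes can be read as a small decision function. The sketch below is illustrative only: the inputs (`resource_exists`, `up_values`) and the rule "all `up` samples equal 1" are assumptions made for the example, not the agent's actual implementation.

```python
from enum import Enum
from typing import List


class VerificationStatus(Enum):
    VERIFIED = "verified"
    DEPLOYED_UNHEALTHY = "deployed_but_unhealthy"
    FAILED = "failed"


def classify_verification(resource_exists: bool, up_values: List[float]) -> VerificationStatus:
    """Map post-mutation evidence onto the three documented outcomes.

    resource_exists: whether the resource created by the mutation was found
                     (hypothetical input, e.g. derived from a resource read).
    up_values:       samples of the up{} series for the job; empty when no
                     series exists yet (hypothetical input).
    """
    if not resource_exists:
        # Nothing was created, so the operation did not take effect.
        return VerificationStatus.FAILED
    if up_values and all(v == 1.0 for v in up_values):
        # The resource exists and every target is being scraped.
        return VerificationStatus.VERIFIED
    # The resource exists but targets are missing or down.
    return VerificationStatus.DEPLOYED_UNHEALTHY
```

Under these assumptions, an empty `up_values` list with an existing resource maps to "Deployed but Unhealthy", which matches the "resource exists but not scraping" description.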
Key Capabilities
PromQL Queries & Metric Exploration
Ask natural language questions like "how much CPU is my app using?" and the agent translates them into precise PromQL queries. Supports both instant and range queries with automatic downsampling.
Exporter Lifecycle Management
Deploy, verify, and uninstall Prometheus exporters (Redis, PostgreSQL, etc.) with automatic health validation:
prom_install_exporter → prom_verify_exporter → prom_query_instant(up{})
Synthetic Monitoring & Probes
Set up endpoint monitoring using native Probe CRDs with Blackbox exporter:
prom_install_exporter → prom_apply_probe → prom_query_instant (validation)
Alerting & Recording Rules
Author and deploy PrometheusRule CRDs (P1/P2 severity) directly to Kubernetes namespaces using k8s_crd storage mode. The agent cross-references:
- prom://kubernetes/prometheusrules to discover the CRD name, namespace, and labels
- prom://rules/groups to list existing group names
- prom_upsert_rule_group to create or update the group
Using an incorrect namespace with prom_upsert_rule_group in k8s_crd mode silently creates a duplicate CRD instead of patching the existing one.
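One way to avoid the duplicate-CRD pitfall is to resolve the namespace from the discovered CRDs before any upsert. The following is a minimal sketch: the record shape (dicts with `name` and `namespace` keys) and the helper name `resolve_crd_namespace` are hypothetical, since the actual payload returned by prom://kubernetes/prometheusrules is not specified here.

```python
from typing import Dict, List, Optional


def resolve_crd_namespace(crds: List[Dict[str, str]], crd_name: str) -> Optional[str]:
    """Return the namespace of an existing PrometheusRule CRD, if any.

    crds is assumed to be a list of {"name": ..., "namespace": ...} records
    (hypothetical shape). Returns None when the CRD does not exist yet, and
    raises if the same name appears in more than one namespace, because
    guessing would risk creating a duplicate.
    """
    namespaces = sorted({c["namespace"] for c in crds if c.get("name") == crd_name})
    if not namespaces:
        return None
    if len(namespaces) > 1:
        raise ValueError(
            f"CRD {crd_name!r} exists in multiple namespaces: {namespaces}"
        )
    return namespaces[0]
```

A caller would pass the resolved namespace to the upsert step and treat `None` as "this is a new CRD, ask the user which namespace to use".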
TSDB Cardinality & FinOps
Analyze label cardinality, identify high-cardinality metrics, and optimize storage costs via prom://tsdb/cardinality.
PromQL Safety Guardrails
| Guardrail | Rule |
|---|---|
| Counter Enforcement | Counters MUST use rate() or increase() unless allow_raw_counters=true |
| Auto-Downsampling | Range queries capped at ~200 points/series |
| Validate First | Complex queries should be checked via prom_validate_promql before execution |
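To make the first two guardrails concrete, here is a rough sketch of how they could be approximated. It is not the server's implementation: the `_total` suffix check relies only on the Prometheus naming convention for counters, the regex is a naive heuristic rather than a PromQL parser, and the function names are hypothetical.

```python
import math
import re

MAX_POINTS_PER_SERIES = 200  # from the "~200 points/series" guardrail


def downsampled_step(start: float, end: float, requested_step: float,
                     max_points: int = MAX_POINTS_PER_SERIES) -> float:
    """Return a step (seconds) that keeps a range query near max_points samples."""
    if end <= start:
        raise ValueError("end must be after start")
    if requested_step <= 0:
        raise ValueError("requested_step must be positive")
    # Smallest step that fits the whole range into roughly max_points samples.
    min_step = math.ceil((end - start) / max_points)
    return max(requested_step, min_step)


def violates_counter_rule(query: str, allow_raw_counters: bool = False) -> bool:
    """Naive check: a *_total metric used without rate() or increase()."""
    if allow_raw_counters:
        return False
    uses_counter = re.search(r"\b\w+_total\b", query) is not None
    wrapped = re.search(r"\b(rate|increase)\s*\(", query) is not None
    return uses_counter and not wrapped
```

For example, a 24-hour range (86400 seconds) requested at a 15-second step would be widened to a 432-second step, while a 1-hour range at a 60-second step is left unchanged.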
Alertmanager Sub-Agent (alertmanager-operator)
Manages alert lifecycle, silence management, routing audit, and governance — from on-call triage through silence creation to integration testing.
Read-Only Operations (Fast-Path)
Resource URIs (via read_mcp_resource)
| Query Type | Resource URI |
|---|---|
| All backends health | am://system/backends |
| Backend detail | am://system/backends/{backend_id} |
| System status/version | am://system/status |
| Configured receivers | am://system/receivers |
| Routing tree + config | am://system/config |
| MCP audit log | am://system/audit-log |
| Active alerts snapshot | am://alerts/active |
| Alert groups snapshot | am://alerts/groups |
| Active silences | am://silences/active |
| Best practices | am://best-practices |
| Onboarding guide | am://onboarding-guide |
Tool-Based Queries
| Query Type | Tool |
|---|---|
| List alerts (filtered) | am_list_alerts |
| Alert groups (filtered) | am_list_alert_groups |
| On-call summary | am_summarize_oncall |
| Explain routing | am_explain_routing |
| Audit default route | am_audit_default_route |
| Recent silence changes | am_list_recent_changes |
| Preview silence blast radius | am_preview_silence |
| Validate silence policy | am_validate_silence_policy |
State-Modifying Workflows
| Operation | HITL Phase | Plan Content |
|---|---|---|
| Create Silence | silence_creation_review | 🔇 Matchers, duration, blast radius, creator |
| Expire Silence | silence_expire_review | 🔔 Silence ID, affected alerts |
| Push Test Alert | test_alert_review | 🧪 Alert labels, target receiver |
| Update/Extend Silence | silence_update_review | 🔄 Silence ID, extension duration |
Silence Lifecycle — Mandatory Sequence
For any silence creation, the agent follows this exact sequence:
1. am_preview_silence: check blast radius (mandatory, never skipped)
2. am_validate_silence_policy: check policy compliance
3. am_create_silence: only after both checks pass
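The ordering above can be sketched as a thin wrapper that refuses to create a silence unless both checks pass. The three callables are hypothetical stand-ins for the am_preview_silence, am_validate_silence_policy, and am_create_silence tools; their real signatures and return values are not documented here, so the boolean results are an assumption.

```python
from typing import Callable, Dict

Matchers = Dict[str, str]


def create_silence_checked(
    matchers: Matchers,
    duration_minutes: int,
    preview: Callable[[Matchers], bool],
    validate: Callable[[Matchers, int], bool],
    create: Callable[[Matchers, int], str],
) -> str:
    """Run preview, then policy validation, then creation, in that order.

    preview/validate return True when the check passes (assumed contract);
    create returns the new silence ID (assumed contract).
    """
    # Step 1: the blast radius preview always runs first.
    if not preview(matchers):
        raise RuntimeError("blast radius preview rejected the silence")
    # Step 2: policy validation runs only after the preview passes.
    if not validate(matchers, duration_minutes):
        raise RuntimeError("silence policy validation failed")
    # Step 3: creation happens only after both checks pass.
    return create(matchers, duration_minutes)
```

Passing the tools in as callables keeps the ordering logic testable without a live Alertmanager.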
Verification (Mandatory)
| After... | Verify with... | If Failed |
|---|---|---|
| Silence create | am_list_silences(state="active") | Check am_list_recent_changes for immediate expiry |
| Silence expire | am_list_silences (check expired) | Check if another active silence matches |
| Test alert push | am_list_alerts | Check am_explain_routing for routing destination |
| Silence update | am_list_silences(state="active") | Check if max duration was exceeded |
Key Capabilities
On-Call Alert Triage
Get a human-readable on-call summary of all firing alerts, grouped by severity and service, with actionable remediation steps via am_summarize_oncall.
Silence Lifecycle Management
Full lifecycle: preview blast radius → validate policy → create → extend → expire. All with mandatory dry-run previews.
Routing Audit & Governance
- Routing tree inspection: am://system/config for full config export
- "Who gets paged?" simulation: am_explain_routing for routing path analysis
- Default route audit: am_audit_default_route to find misrouted alerts hitting the fallback receiver
- Compliance review: am_validate_silence_policy for policy compliance of existing silences
- Change audit: am_list_recent_changes for silence create/expire activity
Integration Testing
Push synthetic test alerts to validate that downstream notification channels (Slack, PagerDuty, email) are correctly configured.
Silence Safety Guardrails
| Guardrail | Rule |
|---|---|
| Duration Cap | Max silence duration is 24 hours (default). Override via AM_MAX_SILENCE_MINUTES |
| Blast Radius Warning | Warns if a silence affects at least AM_SILENCE_WARNING_THRESHOLD alerts (default 50). Preview is mandatory |
| Duplicate Detection | Built-in — blocks creating equivalent active silences |
| Scope Control | instance (narrowest) → service (recommended) → env (broadest) |
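As a rough illustration of the duration cap and blast radius rules, the sketch below evaluates a silence request against the documented defaults (1440 minutes and 50 alerts). The function name, its return shape, and the decision to warn rather than block on blast radius are assumptions for the example.

```python
from typing import List, Tuple


def check_silence_request(
    duration_minutes: int,
    affected_alerts: int,
    max_minutes: int = 1440,      # default of AM_MAX_SILENCE_MINUTES
    warning_threshold: int = 50,  # default of AM_SILENCE_WARNING_THRESHOLD
) -> Tuple[bool, List[str]]:
    """Return (allowed, messages) for a proposed silence."""
    messages: List[str] = []
    # Duration cap: anything longer than the maximum is rejected.
    allowed = duration_minutes <= max_minutes
    if not allowed:
        messages.append(
            f"duration {duration_minutes}m exceeds cap of {max_minutes}m"
        )
    # Blast radius: warn when the silence touches at least the threshold.
    if affected_alerts >= warning_threshold:
        messages.append(
            f"silence affects {affected_alerts} alerts (threshold {warning_threshold})"
        )
    return allowed, messages
```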
Skills Reference
| Skill Directory | Sub-Agent | Purpose |
|---|---|---|
| observability/prometheus | prometheus-operator | Prometheus workflow patterns, exporter lifecycle, rule authoring |
| observability/alertmanager | alertmanager-operator | Silence lifecycle, routing audit, governance patterns |
MCP Integration
| MCP Server | Package | Transport | Used By |
|---|---|---|---|
| prometheus-mcp-server | prometheus-mcp-server | stdio | prometheus-operator |
| alertmanager-mcp-server | alertmanager-mcp-server | stdio | alertmanager-operator |
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| PROMETHEUS_BASE_URL | http://prometheus-operated.monitoring.svc:9090 | Prometheus server URL |
| PROMETHEUS_VERIFY_SSL | false | SSL verification |
| PROMETHEUS_BACKEND_ID | default | Multi-backend identifier |
| ALERTMANAGER_BASE_URL | http://alertmanager-operated.monitoring.svc:9093 | Alertmanager server URL |
| ALERTMANAGER_VERIFY_SSL | false | SSL verification |
| AM_MAX_SILENCE_MINUTES | 1440 | Maximum silence duration (24 hours) |
| AM_SILENCE_WARNING_THRESHOLD | 50 | Blast radius warning threshold |
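For reference, the variables and defaults in the table could be loaded as shown below. This is a sketch under assumptions: the `ObservabilityConfig` class and `load_config` helper are hypothetical, and the accepted boolean spellings ("1", "true", "yes") may differ from what the MCP servers actually parse.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilityConfig:
    prometheus_base_url: str
    prometheus_verify_ssl: bool
    prometheus_backend_id: str
    alertmanager_base_url: str
    alertmanager_verify_ssl: bool
    am_max_silence_minutes: int
    am_silence_warning_threshold: int


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; the real servers may accept others.
    return value.strip().lower() in {"1", "true", "yes"}


def load_config(env: Optional[Mapping[str, str]] = None) -> ObservabilityConfig:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ObservabilityConfig(
        prometheus_base_url=env.get(
            "PROMETHEUS_BASE_URL", "http://prometheus-operated.monitoring.svc:9090"
        ),
        prometheus_verify_ssl=_as_bool(env.get("PROMETHEUS_VERIFY_SSL", "false")),
        prometheus_backend_id=env.get("PROMETHEUS_BACKEND_ID", "default"),
        alertmanager_base_url=env.get(
            "ALERTMANAGER_BASE_URL", "http://alertmanager-operated.monitoring.svc:9093"
        ),
        alertmanager_verify_ssl=_as_bool(env.get("ALERTMANAGER_VERIFY_SSL", "false")),
        am_max_silence_minutes=int(env.get("AM_MAX_SILENCE_MINUTES", "1440")),
        am_silence_warning_threshold=int(env.get("AM_SILENCE_WARNING_THRESHOLD", "50")),
    )
```

Calling `load_config({})` returns the defaults from the table; passing a mapping instead of reading `os.environ` directly makes overrides easy to test.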
Cross-Domain Integration
The Observability domain frequently collaborates with other domains:
| Scenario | Source | Target | Context Passed |
|---|---|---|---|
| "5 critical alerts for checkout — check pods" | Observability | K8s Operator | Alert details, service name, namespace |
| "High error rate after rollout — abort?" | Observability | App Operator | Error metrics, rollout name, timeline |
| "Prometheus AnalysisTemplate for canary" | App Operator | Observability | Rollout config, metric queries |
Safety & Governance
Out-of-Scope Escalation
The Prometheus operator exhausts all MCP diagnostic tools before escalating. If the root cause remains hidden (e.g., up=0 and prom_test_endpoint cannot reach the target), it explicitly returns:
"I have exhausted my MCP diagnostic tools. Further diagnosis requires cluster access. Please run kubectl logs <pod-name> -n <namespace> and kubectl describe pod <pod-name> -n <namespace> and share the output."
It never uses filesystem tools (ls, grep) to read pod logs directly.
HITL Gates
Every state-modifying operation requires explicit user approval. The silence lifecycle is particularly strict with mandatory preview and policy validation before creation.
[PLAN-LOCKED] Execution
When the coordinator has already obtained approval, sub-agents skip Phase 2 and execute pre-approved parameters directly. HumanInTheLoopMiddleware still provides the mechanical safety backstop.