🔭 Observability

Manages the full monitoring and alerting stack — Prometheus PromQL queries, exporter lifecycle, alerting rules, TSDB analysis, Alertmanager triage, silence management, and routing audit.

The Observability domain coordinates two sub-agents (`prometheus-operator` and `alertmanager-operator`), each connected to its own MCP server. Together they provide monitoring and alerting capabilities that integrate with the rest of the k8s-autopilot ecosystem.

Architecture

Prometheus Sub-Agent (`prometheus-operator`)

Orchestrates Prometheus monitoring operations — from simple PromQL queries to complex exporter lifecycle management, rule authoring, and TSDB cardinality analysis.

Read-Only Operations (Fast-Path)

Resource URIs (via `read_mcp_resource`)

| Query Type | Resource URI |
| --- | --- |
| All backends health | `prom://system/backends` |
| Backend detail | `prom://system/backends/{backend_id}` |
| Service catalog | `prom://topology/services` |
| Service metrics | `prom://topology/services/{job}/metrics` |
| Failed targets | `prom://topology/failed_targets` |
| TSDB cardinality | `prom://tsdb/cardinality` |
| Runtime config | `prom://config/runtime` |
| Rule groups | `prom://rules/groups` |
| K8s PrometheusRules CRDs | `prom://kubernetes/prometheusrules` |
| Metric catalog | `prom://metadata/catalog` |
| Exporter catalog | `prom://exporters/catalog` |
| Best practices | `prom://best-practices` |
| Onboarding guide | `prom://onboarding-guide` |
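
Two of the URIs above are templates (`{backend_id}`, `{job}`) and need concrete values before they can be read. The following is a minimal sketch of filling them in. `fill_uri` is a hypothetical helper, and percent-encoding the values is an assumption about what the MCP server expects; the call to `read_mcp_resource` itself is omitted because its signature is not documented here.

```python
from urllib.parse import quote

# Templates copied from the resource URI table.
BACKEND_DETAIL = "prom://system/backends/{backend_id}"
SERVICE_METRICS = "prom://topology/services/{job}/metrics"


def fill_uri(template: str, **params: str) -> str:
    """Substitute placeholders, percent-encoding each value (hypothetical helper)."""
    encoded = {key: quote(value, safe="") for key, value in params.items()}
    # str.format raises KeyError if a placeholder has no matching parameter.
    return template.format(**encoded)


print(fill_uri(BACKEND_DETAIL, backend_id="default"))
print(fill_uri(SERVICE_METRICS, job="kube-state-metrics"))
```

Encoding each value keeps a job name containing `/` from being read as an extra path segment.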

Tool-Based Queries

| Query Type | Tool |
| --- | --- |
| Run instant query | `prom_query_instant` |
| Run range query | `prom_query_range` |
| Validate PromQL | `prom_validate_promql` |
| Explore metric labels | `prom_explore_labels` |
| Test endpoint health | `prom_test_endpoint` |
| Recommend instrumentation | `prom_recommend_instrumentation` |
| Recommend exporter | `prom_recommend_exporter` |
| Describe alert rule | `prom_describe_alert_rule` |
| Analyze firing history | `prom_analyze_firing_history` |
| Draft alert rule | `prom_draft_alert_rule` |
| Tune alert thresholds | `prom_tune_alert_rule` |

State-Modifying Workflows

| Operation | HITL Phase | Plan Content |
| --- | --- | --- |
| Exporter Install | `exporter_install_review` | 📦 Exporter type, namespace, K8s resources to create |
| Rule Create/Update | `rule_group_review` | 📋 Group name, backend, rule count, storage mode |
| ServiceMonitor | `servicemonitor_review` | 📡 Service, namespace, scrape interval |
| File SD Add/Remove | `file_sd_review` | 📁 Targets, file path, action |

Verification & Failure Diagnosis (Mandatory)

The agent never declares success based on tool stdout alone. After every mutation it runs the matching verification step:

| After... | Verify with... | If Failed |
| --- | --- | --- |
| Exporter install | `prom_verify_exporter` → confirm `up{}` series | 1. Check `prom://topology/failed_targets` 2. Run `prom_test_endpoint` 3. Escalate |
| ServiceMonitor | `prom_query_instant(query="up{job='...'}")` | Same as exporter install |
| Rule upsert | `prom://rules/groups` → confirm group appears | Check namespace and `ruleSelector` in config |
| File SD add | `prom_query_instant(query="up{job='...'}")` | Same as exporter install |

Verification returns one of:

✅ Verified — operation confirmed successful
⚠️ Deployed but Unhealthy — resource exists but not scraping
❌ Failed — operation did not take effect
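
The three outcomes can be read as a small decision over two pieces of evidence: whether the resource exists, and what the target's `up` series reports. The sketch below shows one plausible mapping. The source does not spell out the agent's actual decision logic, so the `classify` function and its inputs are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    VERIFIED = "Verified"
    DEPLOYED_UNHEALTHY = "Deployed but Unhealthy"
    FAILED = "Failed"


def classify(resource_exists: bool, up_value: Optional[float]) -> Outcome:
    """Map post-mutation evidence to one of the three outcomes.

    up_value is the sample returned for the target's `up` series,
    or None when no such series exists yet.
    """
    if not resource_exists:
        return Outcome.FAILED  # the operation did not take effect
    if up_value == 1:
        return Outcome.VERIFIED  # resource exists and is being scraped
    return Outcome.DEPLOYED_UNHEALTHY  # resource exists but is not scraping


print(classify(True, 0.0).value)
```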

Key Capabilities

PromQL Queries & Metric Exploration

Ask natural-language questions such as "how much CPU is my app using?" and the agent translates them into PromQL. Both instant and range queries are supported, with automatic downsampling applied to range queries.
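
As an illustration of the kind of translation involved, the sketch below builds a PromQL expression for the CPU question. It assumes the cluster scrapes the cAdvisor metric `container_cpu_usage_seconds_total` and that the app's pods share a name prefix; `cpu_usage_query` and the namespace and pod names are hypothetical. The query the agent actually generates may differ.

```python
def cpu_usage_query(namespace: str, pod_regex: str, window: str = "5m") -> str:
    """Build a PromQL expression for CPU cores used by matching pods."""
    selector = f'namespace="{namespace}", pod=~"{pod_regex}"'
    # The metric is a counter, so it is wrapped in rate() before summing.
    return f"sum(rate(container_cpu_usage_seconds_total{{{selector}}}[{window}]))"


print(cpu_usage_query("shop", "checkout-.*"))
```

Wrapping the counter in `rate()` is consistent with the counter-enforcement guardrail described later in this page.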

Exporter Lifecycle Management

Deploy, verify, and uninstall Prometheus exporters (Redis, PostgreSQL, etc.) with automatic health validation:

prom_install_exporter → prom_verify_exporter → prom_query_instant(up{})

Synthetic Monitoring & Probes

Set up endpoint monitoring using native Probe CRDs with Blackbox exporter:

prom_install_exporter → prom_apply_probe → prom_query_instant (validation)

Alerting & Recording Rules

Author and deploy PrometheusRule CRDs (P1/P2 severity) directly to Kubernetes namespaces using `k8s_crd` storage mode. The agent cross-references:

`prom://kubernetes/prometheusrules`: discover CRD name, namespace, labels
`prom://rules/groups`: existing group names
`prom_upsert_rule_group`: create or update

Warning: passing the wrong namespace to `prom_upsert_rule_group` in `k8s_crd` mode silently creates a duplicate CRD instead of patching the existing one.
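
One way to guard against the duplicate-CRD problem is to resolve the owning namespace from the discovered CRDs before upserting. The sketch below assumes the data from `prom://kubernetes/prometheusrules` has already been parsed into simple records; the `RuleCRD` shape and both helper functions are hypothetical, and the real response format may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RuleCRD:
    """Hypothetical, simplified view of one PrometheusRule CRD."""
    name: str
    namespace: str
    groups: Tuple[str, ...]


def find_owner(crds: List[RuleCRD], group: str) -> Optional[RuleCRD]:
    """Return the CRD that already contains the rule group, if any."""
    for crd in crds:
        if group in crd.groups:
            return crd
    return None


def resolve_namespace(crds: List[RuleCRD], group: str, requested: str) -> str:
    """Refuse an upsert that would duplicate an existing group elsewhere."""
    owner = find_owner(crds, group)
    if owner is None:
        return requested  # new group: the requested namespace is fine
    if owner.namespace != requested:
        raise ValueError(
            f"group {group!r} already lives in {owner.namespace}/{owner.name}; "
            f"upserting into {requested!r} would create a duplicate CRD"
        )
    return owner.namespace


crds = [RuleCRD("checkout-rules", "monitoring", ("checkout.alerts",))]
print(resolve_namespace(crds, "checkout.alerts", "monitoring"))
```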

TSDB Cardinality & FinOps

Analyze label cardinality, identify high-cardinality metrics, and optimize storage costs via `prom://tsdb/cardinality`.

PromQL Safety Guardrails

| Guardrail | Rule |
| --- | --- |
| Counter Enforcement | Counters MUST use `rate()` or `increase()` unless `allow_raw_counters=true` |
| Auto-Downsampling | Range queries capped at ~200 points/series |
| Validate First | Complex queries should be checked via `prom_validate_promql` before execution |
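
The first two guardrails can be approximated in a few lines. This is a sketch under stated assumptions, not the server's implementation: the exact rounding used for downsampling is not documented, and detecting counters by the `_total` suffix is a naive heuristic (a real check would more likely consult metric metadata). Both function names are hypothetical.

```python
import math
import re

MAX_POINTS = 200  # approximate per-series cap from the guardrail table


def downsampled_step(start: float, end: float, requested_step: float) -> float:
    """Return a step in seconds that keeps a range query near MAX_POINTS."""
    span = end - start
    if span <= 0:
        raise ValueError("end must be after start")
    min_step = math.ceil(span / MAX_POINTS)
    # Never go finer than the cap allows; coarser requests are left alone.
    return max(requested_step, min_step)


def violates_counter_rule(query: str, allow_raw_counters: bool = False) -> bool:
    """Flag queries that read a *_total metric without rate() or increase()."""
    if allow_raw_counters:
        return False
    uses_counter = re.search(r"\b\w+_total\b", query) is not None
    wrapped = re.search(r"\b(rate|increase)\s*\(", query) is not None
    return uses_counter and not wrapped


print(downsampled_step(0, 86400, 15))
print(violates_counter_rule("http_requests_total{job='api'}"))
```

For a 24-hour window with a requested 15-second step, the step is widened to 432 seconds (86400 / 200), while a 1-hour window at 60 seconds is left unchanged.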

Alertmanager Sub-Agent (`alertmanager-operator`)

Manages alert lifecycle, silence management, routing audit, and governance — from on-call triage through silence creation to integration testing.

Read-Only Operations (Fast-Path)

Resource URIs (via `read_mcp_resource`)

| Query Type | Resource URI |
| --- | --- |
| All backends health | `am://system/backends` |
| Backend detail | `am://system/backends/{backend_id}` |
| System status/version | `am://system/status` |
| Configured receivers | `am://system/receivers` |
| Routing tree + config | `am://system/config` |
| MCP audit log | `am://system/audit-log` |
| Active alerts snapshot | `am://alerts/active` |
| Alert groups snapshot | `am://alerts/groups` |
| Active silences | `am://silences/active` |
| Best practices | `am://best-practices` |
| Onboarding guide | `am://onboarding-guide` |

Tool-Based Queries

| Query Type | Tool |
| --- | --- |
| List alerts (filtered) | `am_list_alerts` |
| Alert groups (filtered) | `am_list_alert_groups` |
| On-call summary | `am_summarize_oncall` |
| Explain routing | `am_explain_routing` |
| Audit default route | `am_audit_default_route` |
| Recent silence changes | `am_list_recent_changes` |
| Preview silence blast radius | `am_preview_silence` |
| Validate silence policy | `am_validate_silence_policy` |

State-Modifying Workflows

| Operation | HITL Phase | Plan Content |
| --- | --- | --- |
| Create Silence | `silence_creation_review` | 🔇 Matchers, duration, blast radius, creator |
| Expire Silence | `silence_expire_review` | 🔔 Silence ID, affected alerts |
| Push Test Alert | `test_alert_review` | 🧪 Alert labels, target receiver |
| Update/Extend Silence | `silence_update_review` | 🔄 Silence ID, extension duration |

Silence Lifecycle — Mandatory Sequence

For any silence creation, the agent follows this exact sequence:

1. `am_preview_silence`: check blast radius (mandatory, never skipped)
2. `am_validate_silence_policy`: check policy compliance
3. `am_create_silence`: only after both checks pass
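
The ordering constraint in this sequence can be sketched as a wrapper that refuses to create a silence until the two checks have run. The three callables below are stand-ins for the MCP tools, whose real signatures and return shapes are not documented here; `create_silence_guarded` and `SilenceRejected` are hypothetical names, and treating "preview passes" as "preview completed without raising" is an assumption.

```python
from typing import Callable, Dict, List

Matchers = Dict[str, str]


class SilenceRejected(Exception):
    """Raised when policy validation reports violations."""


def create_silence_guarded(
    matchers: Matchers,
    duration_minutes: int,
    preview: Callable[[Matchers], int],
    validate: Callable[[Matchers, int], List[str]],
    create: Callable[[Matchers, int], str],
) -> Dict[str, object]:
    """Run preview, then policy validation, then creation, in that order."""
    affected = preview(matchers)  # step 1: blast radius, never skipped

    violations = validate(matchers, duration_minutes)  # step 2: policy
    if violations:
        raise SilenceRejected("; ".join(violations))

    silence_id = create(matchers, duration_minutes)  # step 3: both checks done
    return {"silence_id": silence_id, "affected_alerts": affected}


# Usage with stub tools that simply record what was called.
calls: List[str] = []
result = create_silence_guarded(
    {"service": "checkout"},
    60,
    preview=lambda m: calls.append("preview") or 3,
    validate=lambda m, d: calls.append("validate") or [],
    create=lambda m, d: calls.append("create") or "stub-id",
)
print(calls, result)
```

Passing the tools in as callables keeps the ordering logic testable without a running Alertmanager.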

Verification (Mandatory)

| After... | Verify with... | If Failed |
| --- | --- | --- |
| Silence create | `am_list_silences(state="active")` | Check `am_list_recent_changes` for immediate expiry |
| Silence expire | `am_list_silences` (check expired) | Check if another active silence matches |
| Test alert push | `am_list_alerts` | Check `am_explain_routing` for routing destination |
| Silence update | `am_list_silences(state="active")` | Check if max duration was exceeded |

Key Capabilities

On-Call Alert Triage

Get a human-readable on-call summary of all firing alerts, grouped by severity and service, with actionable remediation steps, via `am_summarize_oncall`.

Silence Lifecycle Management

The full lifecycle is covered: preview blast radius → validate policy → create → extend → expire. Dry-run previews are mandatory throughout.

Routing Audit & Governance

Routing tree inspection: `am://system/config` for full config export
"Who gets paged?" simulation: `am_explain_routing` for routing path analysis
Default route audit: `am_audit_default_route` to find misrouted alerts hitting the fallback receiver
Compliance review: `am_validate_silence_policy` for policy compliance of existing silences
Change audit: `am_list_recent_changes` for silence create/expire activity
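
To make the "who gets paged?" question concrete, the sketch below walks a simplified routing tree the way Alertmanager does in its basic case: children are tried in order, the first matching child takes the alert, and an alert that matches no child stays with the current node. It deliberately ignores `continue`, regex matchers, and receiver inheritance. The `Route` class, `resolve_receiver`, and the receiver names are hypothetical, and this is not how `am_explain_routing` is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    """Simplified routing node with equality matchers only."""
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)


def resolve_receiver(route: Route, labels: Dict[str, str]) -> str:
    """Depth-first walk: the first matching child wins, else this node."""
    for child in route.routes:
        if all(labels.get(key) == value for key, value in child.match.items()):
            return resolve_receiver(child, labels)
    return route.receiver


tree = Route(
    receiver="fallback",
    routes=[
        Route(
            receiver="pagerduty",
            match={"severity": "critical"},
            routes=[Route(receiver="payments-pager", match={"team": "payments"})],
        ),
        Route(receiver="slack", match={"severity": "warning"}),
    ],
)

print(resolve_receiver(tree, {"severity": "critical", "team": "payments"}))
print(resolve_receiver(tree, {"severity": "info"}))
```

An alert that lands on `fallback` in a sketch like this is the kind of case the default route audit is meant to surface.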

Integration Testing

Push synthetic test alerts to validate that downstream notification channels (Slack, PagerDuty, email) are correctly configured.

Silence Safety Guardrails

| Guardrail | Rule |
| --- | --- |
| Duration Cap | Max silence duration is 24 hours (default). Override via `AM_MAX_SILENCE_MINUTES` |
| Blast Radius Warning | Warns if a silence affects ≥ `AM_SILENCE_WARNING_THRESHOLD` alerts (default 50). Preview is mandatory |
| Duplicate Detection | Built-in; blocks creating equivalent active silences |
| Scope Control | `instance` (narrowest) → `service` (recommended) → `env` (broadest) |
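
The duration cap and blast radius warning amount to two threshold comparisons. The sketch below uses the documented defaults (1440 minutes and 50 alerts) as parameter defaults; the `check_silence` function and its return shape are hypothetical, and the real policy check may apply further rules.

```python
from typing import List, Tuple

Finding = Tuple[str, str]  # (level, message)


def check_silence(
    duration_minutes: int,
    affected_alerts: int,
    max_minutes: int = 1440,
    warning_threshold: int = 50,
) -> List[Finding]:
    """Return guardrail findings for a proposed silence."""
    findings: List[Finding] = []
    if duration_minutes > max_minutes:
        findings.append(
            ("error", f"duration {duration_minutes}m exceeds cap of {max_minutes}m")
        )
    if affected_alerts >= warning_threshold:
        findings.append(
            ("warning", f"silence affects {affected_alerts} alerts")
        )
    return findings


print(check_silence(duration_minutes=2880, affected_alerts=75))
```

Returning findings rather than raising lets a caller show the blast radius warning in the approval plan while still blocking on the duration error.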

Skills Reference

| Skill Directory | Sub-Agent | Purpose |
| --- | --- | --- |
| `observability/prometheus` | prometheus-operator | Prometheus workflow patterns, exporter lifecycle, rule authoring |
| `observability/alertmanager` | alertmanager-operator | Silence lifecycle, routing audit, governance patterns |

MCP Integration

| MCP Server | Package | Transport | Used By |
| --- | --- | --- | --- |
| `prometheus-mcp-server` | `prometheus-mcp-server` | stdio | `prometheus-operator` |
| `alertmanager-mcp-server` | `alertmanager-mcp-server` | stdio | `alertmanager-operator` |

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| `PROMETHEUS_BASE_URL` | `http://prometheus-operated.monitoring.svc:9090` | Prometheus server URL |
| `PROMETHEUS_VERIFY_SSL` | `false` | SSL verification |
| `PROMETHEUS_BACKEND_ID` | `default` | Multi-backend identifier |
| `ALERTMANAGER_BASE_URL` | `http://alertmanager-operated.monitoring.svc:9093` | Alertmanager server URL |
| `ALERTMANAGER_VERIFY_SSL` | `false` | SSL verification |
| `AM_MAX_SILENCE_MINUTES` | `1440` | Maximum silence duration (24 hours) |
| `AM_SILENCE_WARNING_THRESHOLD` | `50` | Blast radius warning threshold |
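
For reference, the variables and defaults above could be loaded into a typed configuration object roughly as follows. This is a sketch: the `ObservabilityConfig` class and `load_config` function are hypothetical, and the accepted spellings for boolean values are an assumption, since the MCP servers' own parsing is not documented here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ObservabilityConfig:
    prometheus_base_url: str
    prometheus_verify_ssl: bool
    prometheus_backend_id: str
    alertmanager_base_url: str
    alertmanager_verify_ssl: bool
    am_max_silence_minutes: int
    am_silence_warning_threshold: int


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; the servers may accept others.
    return value.strip().lower() in {"1", "true", "yes"}


def load_config(env: Mapping[str, str] = os.environ) -> ObservabilityConfig:
    """Read settings from the environment, falling back to documented defaults."""
    return ObservabilityConfig(
        prometheus_base_url=env.get(
            "PROMETHEUS_BASE_URL", "http://prometheus-operated.monitoring.svc:9090"
        ),
        prometheus_verify_ssl=_as_bool(env.get("PROMETHEUS_VERIFY_SSL", "false")),
        prometheus_backend_id=env.get("PROMETHEUS_BACKEND_ID", "default"),
        alertmanager_base_url=env.get(
            "ALERTMANAGER_BASE_URL", "http://alertmanager-operated.monitoring.svc:9093"
        ),
        alertmanager_verify_ssl=_as_bool(env.get("ALERTMANAGER_VERIFY_SSL", "false")),
        am_max_silence_minutes=int(env.get("AM_MAX_SILENCE_MINUTES", "1440")),
        am_silence_warning_threshold=int(env.get("AM_SILENCE_WARNING_THRESHOLD", "50")),
    )


config = load_config({"AM_MAX_SILENCE_MINUTES": "120"})
print(config.am_max_silence_minutes, config.prometheus_backend_id)
```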

Cross-Domain Integration

The Observability domain frequently collaborates with other domains:

| Scenario | Source | Target | Context Passed |
| --- | --- | --- | --- |
| "5 critical alerts for checkout — check pods" | Observability | K8s Operator | Alert details, service name, namespace |
| "High error rate after rollout — abort?" | Observability | App Operator | Error metrics, rollout name, timeline |
| "Prometheus AnalysisTemplate for canary" | App Operator | Observability | Rollout config, metric queries |

Safety & Governance

Out-of-Scope Escalation

The Prometheus operator exhausts all MCP diagnostic tools before escalating. If the root cause remains hidden (e.g., `up` is 0 and `prom_test_endpoint` cannot reach the endpoint), it explicitly returns:

"I have exhausted my MCP diagnostic tools. Further diagnosis requires cluster access. Please run `kubectl logs <pod-name> -n <namespace>` and `kubectl describe pod <pod-name> -n <namespace>` and share the output."

It never uses filesystem tools (`ls`, `grep`) to read pod logs directly.

HITL Gates

Every state-modifying operation requires explicit user approval. The silence lifecycle is particularly strict with mandatory preview and policy validation before creation.

`[PLAN-LOCKED]` Execution

When the coordinator has already obtained approval, sub-agents skip Phase 2 and execute pre-approved parameters directly. HumanInTheLoopMiddleware still provides the mechanical safety backstop.
