
🔭 Observability

Manages the full monitoring and alerting stack — Prometheus PromQL queries, exporter lifecycle, alerting rules, TSDB analysis, Alertmanager triage, silence management, and routing audit.

The Observability domain coordinates two sub-agents (prometheus-operator and alertmanager-operator), each connecting to its own MCP server. Together they provide deep monitoring and alerting capabilities that integrate with the rest of the k8s-autopilot ecosystem.


Architecture


Prometheus Sub-Agent (prometheus-operator)

Orchestrates Prometheus monitoring operations — from simple PromQL queries to complex exporter lifecycle management, rule authoring, and TSDB cardinality analysis.

Read-Only Operations (Fast-Path)

Resource URIs (via read_mcp_resource)

| Query Type | Resource URI |
| --- | --- |
| All backends health | prom://system/backends |
| Backend detail | prom://system/backends/{backend_id} |
| Service catalog | prom://topology/services |
| Service metrics | prom://topology/services/{job}/metrics |
| Failed targets | prom://topology/failed_targets |
| TSDB cardinality | prom://tsdb/cardinality |
| Runtime config | prom://config/runtime |
| Rule groups | prom://rules/groups |
| K8s PrometheusRules CRDs | prom://kubernetes/prometheusrules |
| Metric catalog | prom://metadata/catalog |
| Exporter catalog | prom://exporters/catalog |
| Best practices | prom://best-practices |
| Onboarding guide | prom://onboarding-guide |
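
Two of the URIs above contain placeholders ({backend_id} and {job}). As a minimal sketch of how a caller might fill them in before passing the result to read_mcp_resource, the helper below expands a template and rejects missing parameters. The dictionary keys and the function name build_resource_uri are hypothetical and chosen only for illustration; they are not part of the MCP server.

```python
import re

# Templates copied from the resource table; {name} marks a placeholder.
# The dictionary keys are illustrative labels, not server-side identifiers.
RESOURCE_TEMPLATES = {
    "backends": "prom://system/backends",
    "backend_detail": "prom://system/backends/{backend_id}",
    "service_metrics": "prom://topology/services/{job}/metrics",
}


def build_resource_uri(kind: str, **params: str) -> str:
    """Fill in a URI template, failing loudly if a placeholder is missing."""
    template = RESOURCE_TEMPLATES[kind]
    needed = set(re.findall(r"\{(\w+)\}", template))
    missing = needed - set(params)
    if missing:
        raise ValueError(f"{kind} needs parameters: {sorted(missing)}")
    return template.format(**params)


print(build_resource_uri("service_metrics", job="checkout"))
# prom://topology/services/checkout/metrics
```

Failing on a missing placeholder keeps a literal "{job}" from ever reaching the server as part of a URI.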

Tool-Based Queries

| Query Type | Tool |
| --- | --- |
| Run instant query | prom_query_instant |
| Run range query | prom_query_range |
| Validate PromQL | prom_validate_promql |
| Explore metric labels | prom_explore_labels |
| Test endpoint health | prom_test_endpoint |
| Recommend instrumentation | prom_recommend_instrumentation |
| Recommend exporter | prom_recommend_exporter |
| Describe alert rule | prom_describe_alert_rule |
| Analyze firing history | prom_analyze_firing_history |
| Draft alert rule | prom_draft_alert_rule |
| Tune alert thresholds | prom_tune_alert_rule |

State-Modifying Workflows

| Operation | HITL Phase | Plan Content |
| --- | --- | --- |
| Exporter Install | exporter_install_review | 📦 Exporter type, namespace, K8s resources to create |
| Rule Create/Update | rule_group_review | 📋 Group name, backend, rule count, storage mode |
| ServiceMonitor | servicemonitor_review | 📡 Service, namespace, scrape interval |
| File SD Add/Remove | file_sd_review | 📁 Targets, file path, action |
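
To make the review step concrete, here is a small sketch of how a plan could be modeled and rendered for the human reviewer. The phase names come from the table above; the operation keys, the ReviewPlan class, and the rendered layout are assumptions made for illustration and do not describe the agent's actual plan format.

```python
from dataclasses import dataclass, field

# Phase names are taken from the table; the operation keys are hypothetical.
HITL_PHASES = {
    "exporter_install": "exporter_install_review",
    "rule_upsert": "rule_group_review",
    "servicemonitor": "servicemonitor_review",
    "file_sd": "file_sd_review",
}


@dataclass
class ReviewPlan:
    """A plan shown to the user before a state-modifying operation runs."""

    operation: str
    details: dict = field(default_factory=dict)

    @property
    def phase(self) -> str:
        return HITL_PHASES[self.operation]

    def render(self) -> str:
        # One header line with the phase, then one indented line per detail.
        lines = [f"[{self.phase}]"]
        lines += [f"  {key}: {value}" for key, value in self.details.items()]
        return "\n".join(lines)


plan = ReviewPlan("file_sd", {"targets": "10.0.0.5:9100", "action": "add"})
print(plan.render())
```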

Verification & Failure Diagnosis (Mandatory)

The agent never declares success based on tool stdout alone. After every mutation, it runs the matching verification step:

| After... | Verify with... | If Failed |
| --- | --- | --- |
| Exporter install | prom_verify_exporter → confirm up{} series | 1. Check prom://topology/failed_targets 2. Run prom_test_endpoint 3. Escalate |
| ServiceMonitor | prom_query_instant(query="up{job='...'}") | Same as exporter install |
| Rule upsert | prom://rules/groups → confirm group appears | Check namespace and ruleSelector in config |
| File SD add | prom_query_instant(query="up{job='...'}") | Same as exporter install |

Verification returns one of:

  • Verified — operation confirmed successful
  • ⚠️ Deployed but Unhealthy — resource exists but not scraping
  • Failed — operation did not take effect
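
The three outcomes can be read as a small decision rule. The sketch below is one plausible mapping, assuming the verifier knows whether the resource exists and has the values of the matching up series; the exact criteria the agent applies are not documented here, and the names VerificationStatus and classify are hypothetical.

```python
from enum import Enum


class VerificationStatus(Enum):
    VERIFIED = "Verified"
    DEPLOYED_UNHEALTHY = "Deployed but Unhealthy"
    FAILED = "Failed"


def classify(resource_exists: bool, up_values: list) -> VerificationStatus:
    """Map observed state to an outcome (assumed rule, not the agent's own)."""
    if not resource_exists:
        # The mutation did not take effect at all.
        return VerificationStatus.FAILED
    if up_values and all(value == 1 for value in up_values):
        # Every matching target is being scraped successfully.
        return VerificationStatus.VERIFIED
    # Resource is present, but there is no up series or some target is down.
    return VerificationStatus.DEPLOYED_UNHEALTHY


print(classify(True, [1, 1]).value)  # Verified
print(classify(True, []).value)      # Deployed but Unhealthy
```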

Key Capabilities

PromQL Queries & Metric Exploration

Ask natural language questions like "how much CPU is my app using?" and the agent translates them into precise PromQL queries. Supports both instant and range queries with automatic downsampling.
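
As an illustration of what such a translation might produce, the sketch below builds a PromQL expression for the CPU question. It assumes the cluster exposes the cAdvisor metric container_cpu_usage_seconds_total with namespace and pod labels; the function name and its parameters are hypothetical, and the query the agent actually generates may differ.

```python
def cpu_usage_query(namespace: str, pod_regex: str, window: str = "5m") -> str:
    """Build a PromQL expression for CPU cores used by matching pods.

    The metric is a counter, so it is wrapped in rate() rather than read raw.
    """
    selector = f'namespace="{namespace}", pod=~"{pod_regex}"'
    return f"sum(rate(container_cpu_usage_seconds_total{{{selector}}}[{window}]))"


print(cpu_usage_query("shop", "checkout-.*"))
# sum(rate(container_cpu_usage_seconds_total{namespace="shop", pod=~"checkout-.*"}[5m]))
```

The resulting string is the kind of expression that would be passed to prom_query_instant or prom_query_range.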

Exporter Lifecycle Management

Deploy, verify, and uninstall Prometheus exporters (Redis, PostgreSQL, etc.) with automatic health validation:

prom_install_exporter → prom_verify_exporter → prom_query_instant(up{})

Synthetic Monitoring & Probes

Set up endpoint monitoring using native Probe CRDs with Blackbox exporter:

prom_install_exporter → prom_apply_probe → prom_query_instant (validation)

Alerting & Recording Rules

Author and deploy PrometheusRule CRDs (P1/P2 severity) directly to Kubernetes namespaces using k8s_crd storage mode. The agent cross-references:

  1. prom://kubernetes/prometheusrules — discover CRD name, namespace, labels
  2. prom://rules/groups — existing group names
  3. prom_upsert_rule_group — create or update
Warning: Using an incorrect namespace with prom_upsert_rule_group in k8s_crd mode silently creates a duplicate CRD instead of patching the existing one.
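
One way to guard against the duplicate-CRD problem is to look up which existing CRD already holds the rule group and reuse its name and namespace. The sketch below assumes the data read from prom://kubernetes/prometheusrules has been parsed into a list of dictionaries with name, namespace, and groups keys; that shape is an assumption for illustration, as is the function name resolve_crd_target.

```python
from typing import Optional


def resolve_crd_target(existing_crds: list, group_name: str) -> Optional[dict]:
    """Return the name/namespace of the CRD that already contains group_name.

    Returns None when no CRD holds the group, meaning a new CRD is expected.
    """
    for crd in existing_crds:
        if group_name in crd.get("groups", []):
            return {"name": crd["name"], "namespace": crd["namespace"]}
    return None


# Hypothetical discovery result, for illustration only.
crds = [
    {"name": "platform-rules", "namespace": "monitoring", "groups": ["node.rules"]},
    {"name": "checkout-rules", "namespace": "shop", "groups": ["checkout.slo"]},
]
print(resolve_crd_target(crds, "checkout.slo"))
# {'name': 'checkout-rules', 'namespace': 'shop'}
```

Passing the resolved namespace to the upsert call, rather than a guessed one, is what keeps the update a patch instead of a second CRD.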

TSDB Cardinality & FinOps

Analyze label cardinality, identify high-cardinality metrics, and optimize storage costs via prom://tsdb/cardinality.

PromQL Safety Guardrails

| Guardrail | Rule |
| --- | --- |
| Counter Enforcement | Counters MUST use rate() or increase() unless allow_raw_counters=true |
| Auto-Downsampling | Range queries capped at ~200 points/series |
| Validate First | Complex queries should be checked via prom_validate_promql before execution |
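
The first two guardrails can be approximated in a few lines. The sketch below is not the server's implementation: it treats any metric ending in _total as a counter (a naming-convention heuristic) and picks a step wide enough to keep a range query at or under 200 points per series. Both function names are hypothetical.

```python
import math
import re

MAX_POINTS_PER_SERIES = 200  # the "~200 points/series" cap from the table


def violates_counter_rule(query: str, allow_raw_counters: bool = False) -> bool:
    """Heuristic: flag *_total metrics used without rate() or increase()."""
    if allow_raw_counters:
        return False
    uses_counter = re.search(r"\w+_total\b", query) is not None
    wrapped = re.search(r"\b(rate|increase)\s*\(", query) is not None
    return uses_counter and not wrapped


def downsampled_step(start_s: float, end_s: float, requested_step_s: float) -> float:
    """Widen the step so the range yields at most MAX_POINTS_PER_SERIES points.

    A range query returns floor(span / step) + 1 points, hence the "- 1".
    """
    span = end_s - start_s
    if span <= 0:
        raise ValueError("end must be after start")
    min_step = math.ceil(span / (MAX_POINTS_PER_SERIES - 1))
    return max(requested_step_s, min_step)


print(violates_counter_rule("http_requests_total"))            # True
print(violates_counter_rule("rate(http_requests_total[5m])"))  # False
print(downsampled_step(0, 86400, 15))                          # 435
```

For a 24-hour window, a requested 15-second step would produce 5,761 points, so the step is widened to 435 seconds, which yields 199 points.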

Alertmanager Sub-Agent (alertmanager-operator)

Manages alert lifecycle, silence management, routing audit, and governance — from on-call triage through silence creation to integration testing.

Read-Only Operations (Fast-Path)

Resource URIs (via read_mcp_resource)

| Query Type | Resource URI |
| --- | --- |
| All backends health | am://system/backends |
| Backend detail | am://system/backends/{backend_id} |
| System status/version | am://system/status |
| Configured receivers | am://system/receivers |
| Routing tree + config | am://system/config |
| MCP audit log | am://system/audit-log |
| Active alerts snapshot | am://alerts/active |
| Alert groups snapshot | am://alerts/groups |
| Active silences | am://silences/active |
| Best practices | am://best-practices |
| Onboarding guide | am://onboarding-guide |

Tool-Based Queries

| Query Type | Tool |
| --- | --- |
| List alerts (filtered) | am_list_alerts |
| Alert groups (filtered) | am_list_alert_groups |
| On-call summary | am_summarize_oncall |
| Explain routing | am_explain_routing |
| Audit default route | am_audit_default_route |
| Recent silence changes | am_list_recent_changes |
| Preview silence blast radius | am_preview_silence |
| Validate silence policy | am_validate_silence_policy |
State-Modifying Workflows

| Operation | HITL Phase | Plan Content |
| --- | --- | --- |
| Create Silence | silence_creation_review | 🔇 Matchers, duration, blast radius, creator |
| Expire Silence | silence_expire_review | 🔔 Silence ID, affected alerts |
| Push Test Alert | test_alert_review | 🧪 Alert labels, target receiver |
| Update/Extend Silence | silence_update_review | 🔄 Silence ID, extension duration |

Silence Lifecycle — Mandatory Sequence

For any silence creation, the agent follows this exact sequence:

  1. am_preview_silence — Check blast radius (mandatory, never skipped)
  2. am_validate_silence_policy — Check policy compliance
  3. am_create_silence — Only after both checks pass
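
The sequence above can be sketched as a small gate function. The three callables stand in for am_preview_silence, am_validate_silence_policy, and am_create_silence; their real inputs and outputs are not documented here, so the request dictionary and the "ok" field in each result are assumptions made purely for illustration.

```python
from typing import Callable


def create_silence_safely(
    request: dict,
    preview: Callable[[dict], dict],
    validate: Callable[[dict], dict],
    create: Callable[[dict], str],
) -> str:
    """Run preview, then policy validation, then creation; stop at first failure."""
    blast_radius = preview(request)  # step 1: never skipped
    if not blast_radius.get("ok", False):
        raise RuntimeError(f"blast radius check failed: {blast_radius}")

    verdict = validate(request)  # step 2: policy compliance
    if not verdict.get("ok", False):
        raise RuntimeError(f"policy check failed: {verdict}")

    return create(request)  # step 3: only after both checks pass


# Stub callables standing in for the MCP tools.
silence_id = create_silence_safely(
    {"matchers": {"service": "checkout"}, "duration_minutes": 60},
    preview=lambda req: {"ok": True, "affected_alerts": 3},
    validate=lambda req: {"ok": True},
    create=lambda req: "silence-123",
)
print(silence_id)  # silence-123
```

Injecting the three steps as callables keeps the ordering logic testable without a running Alertmanager.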

Verification (Mandatory)

| After... | Verify with... | If Failed |
| --- | --- | --- |
| Silence create | am_list_silences(state="active") | Check am_list_recent_changes for immediate expiry |
| Silence expire | am_list_silences (check expired) | Check if another active silence matches |
| Test alert push | am_list_alerts | Check am_explain_routing for routing destination |
| Silence update | am_list_silences(state="active") | Check if max duration was exceeded |

Key Capabilities

On-Call Alert Triage

Get a human-readable on-call summary of all firing alerts, grouped by severity and service, with actionable remediation steps via am_summarize_oncall.

Silence Lifecycle Management

Covers the full lifecycle: preview blast radius → validate policy → create → extend → expire, all with mandatory dry-run previews.

Routing Audit & Governance

  • Routing tree inspection: am://system/config for full config export
  • "Who gets paged?" simulation: am_explain_routing for routing path analysis
  • Default route audit: am_audit_default_route to find misrouted alerts hitting the fallback receiver
  • Compliance review: am_validate_silence_policy for policy compliance of existing silences
  • Change audit: am_list_recent_changes for silence create/expire activity

Integration Testing

Push synthetic test alerts to validate that downstream notification channels (Slack, PagerDuty, email) are correctly configured.
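
For orientation, the sketch below builds a short-lived alert in the shape the Alertmanager v2 HTTP API accepts (a JSON array of alerts with labels, annotations, startsAt, and endsAt). Whether the MCP server sends exactly this payload is an assumption; the alert name SyntheticTestAlert, the helper name, and the default lifetime are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_test_alert(
    labels: dict,
    ttl_minutes: int = 5,
    now: Optional[datetime] = None,
) -> list:
    """Build a one-element alert list that resolves itself after ttl_minutes."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            # Caller-supplied labels decide which route and receiver match.
            "labels": {"alertname": "SyntheticTestAlert", **labels},
            "annotations": {"summary": "Synthetic test alert, safe to ignore"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
        }
    ]


payload = build_test_alert({"severity": "warning", "service": "checkout"})
print(payload[0]["labels"])
```

Setting endsAt explicitly means the test alert clears without a follow-up call, which limits noise in the target channel.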

Silence Safety Guardrails

| Guardrail | Rule |
| --- | --- |
| Duration Cap | Max silence duration is 24 hours (default). Override via AM_MAX_SILENCE_MINUTES |
| Blast Radius Warning | Warns if a silence affects ≥ N alerts, where N is AM_SILENCE_WARNING_THRESHOLD (default 50). Preview is mandatory |
| Duplicate Detection | Built-in: blocks creating equivalent active silences |
| Scope Control | instance (narrowest) → service (recommended) → env (broadest) |
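
The duration cap and blast radius warning reduce to two comparisons. The sketch below uses the documented defaults (1440 minutes and 50 alerts); the function name, the finding strings, and the choice to treat an over-long duration as an error rather than clamping it are assumptions for illustration.

```python
def check_silence(
    duration_minutes: int,
    affected_alerts: int,
    max_minutes: int = 1440,   # default of AM_MAX_SILENCE_MINUTES
    warn_threshold: int = 50,  # default of AM_SILENCE_WARNING_THRESHOLD
) -> list:
    """Return a list of guardrail findings; an empty list means no objections."""
    findings = []
    if duration_minutes > max_minutes:
        findings.append(
            f"error: duration {duration_minutes}m exceeds cap of {max_minutes}m"
        )
    if affected_alerts >= warn_threshold:
        findings.append(
            f"warning: silence affects {affected_alerts} alerts "
            f"(threshold {warn_threshold})"
        )
    return findings


print(check_silence(duration_minutes=60, affected_alerts=3))     # []
print(check_silence(duration_minutes=2880, affected_alerts=50))  # two findings
```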

Skills Reference

| Skill Directory | Sub-Agent | Purpose |
| --- | --- | --- |
| observability/prometheus | prometheus-operator | Prometheus workflow patterns, exporter lifecycle, rule authoring |
| observability/alertmanager | alertmanager-operator | Silence lifecycle, routing audit, governance patterns |

MCP Integration

| MCP Server | Package | Transport | Used By |
| --- | --- | --- | --- |
| prometheus-mcp-server | prometheus-mcp-server | stdio | prometheus-operator |
| alertmanager-mcp-server | alertmanager-mcp-server | stdio | alertmanager-operator |

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| PROMETHEUS_BASE_URL | http://prometheus-operated.monitoring.svc:9090 | Prometheus server URL |
| PROMETHEUS_VERIFY_SSL | false | SSL verification |
| PROMETHEUS_BACKEND_ID | default | Multi-backend identifier |
| ALERTMANAGER_BASE_URL | http://alertmanager-operated.monitoring.svc:9093 | Alertmanager server URL |
| ALERTMANAGER_VERIFY_SSL | false | SSL verification |
| AM_MAX_SILENCE_MINUTES | 1440 | Maximum silence duration (24 hours) |
| AM_SILENCE_WARNING_THRESHOLD | 50 | Blast radius warning threshold |
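
As a sketch of how these variables and defaults fit together, the loader below reads them from the environment into a typed object. The variable names and defaults are taken from the table; the ObservabilityConfig class, the load_config function, and the rule for parsing booleans are assumptions, since the servers' own parsing is not described here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilityConfig:
    prometheus_base_url: str
    prometheus_verify_ssl: bool
    prometheus_backend_id: str
    alertmanager_base_url: str
    alertmanager_verify_ssl: bool
    am_max_silence_minutes: int
    am_silence_warning_threshold: int


def _as_bool(value: str) -> bool:
    # Assumed convention: "1", "true", and "yes" (any case) mean True.
    return value.strip().lower() in {"1", "true", "yes"}


def load_config(env: Optional[Mapping[str, str]] = None) -> ObservabilityConfig:
    """Read settings from env, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ObservabilityConfig(
        prometheus_base_url=env.get(
            "PROMETHEUS_BASE_URL", "http://prometheus-operated.monitoring.svc:9090"
        ),
        prometheus_verify_ssl=_as_bool(env.get("PROMETHEUS_VERIFY_SSL", "false")),
        prometheus_backend_id=env.get("PROMETHEUS_BACKEND_ID", "default"),
        alertmanager_base_url=env.get(
            "ALERTMANAGER_BASE_URL", "http://alertmanager-operated.monitoring.svc:9093"
        ),
        alertmanager_verify_ssl=_as_bool(env.get("ALERTMANAGER_VERIFY_SSL", "false")),
        am_max_silence_minutes=int(env.get("AM_MAX_SILENCE_MINUTES", "1440")),
        am_silence_warning_threshold=int(env.get("AM_SILENCE_WARNING_THRESHOLD", "50")),
    )


config = load_config({"AM_MAX_SILENCE_MINUTES": "120"})
print(config.am_max_silence_minutes)  # 120
```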

Cross-Domain Integration

The Observability domain frequently collaborates with other domains:

| Scenario | Source | Target | Context Passed |
| --- | --- | --- | --- |
| "5 critical alerts for checkout — check pods" | Observability | K8s Operator | Alert details, service name, namespace |
| "High error rate after rollout — abort?" | Observability | App Operator | Error metrics, rollout name, timeline |
| "Prometheus AnalysisTemplate for canary" | App Operator | Observability | Rollout config, metric queries |

Safety & Governance

Out-of-Scope Escalation

The Prometheus operator exhausts all MCP diagnostic tools before escalating. If the root cause remains hidden (e.g., up=0 and prom_test_endpoint reports the endpoint as unreachable), it explicitly returns:

"I have exhausted my MCP diagnostic tools. Further diagnosis requires cluster access. Please run kubectl logs <pod-name> -n <namespace> and kubectl describe pod <pod-name> -n <namespace> and share the output."

It never uses filesystem tools (ls, grep) to read pod logs directly.

HITL Gates

Every state-modifying operation requires explicit user approval. The silence lifecycle is particularly strict: both the blast radius preview and policy validation are mandatory before creation.

[PLAN-LOCKED] Execution

When the coordinator has already obtained approval, sub-agents skip Phase 2 and execute pre-approved parameters directly. HumanInTheLoopMiddleware still provides the mechanical safety backstop.