Monitoring Agent

Status: In Development

The Monitoring Agent is currently in active development. This page outlines the planned architecture and capabilities. We'll update it as features ship.


What we're building

Setting up observability is one of those tasks that sounds simple until you're three hours deep in YAML — configuring scrapers, chasing metric names across provider docs, building dashboards that nobody looks at, and writing alerting rules that either fire too much or not at all.

The Monitoring Agent is being built to handle this through conversation. Describe what you want to monitor, and the agent pipeline will instrument it, build dashboards, configure alerts, and validate that everything is working. It is designed to cover cloud-native monitoring platforms, APM tools, and log management systems.

How it relates to K8s Autopilot

If you're monitoring Kubernetes workloads with Prometheus and Alertmanager, that's already covered: the Observability domain of the Kubernetes Agent (K8s Autopilot) ships with deep Prometheus + Alertmanager support (PromQL queries, exporter lifecycle, alerting rules, silence management, routing audit).

The Monitoring Agent covers everything else:

| Scope | Agent | What It Covers |
| --- | --- | --- |
| Kubernetes (Prometheus + Alertmanager) | K8s Autopilot (Observability) | ✅ Shipped: PromQL, exporters, alerting rules, silence lifecycle, TSDB analysis |
| Cloud monitoring (CloudWatch, Azure Monitor, GCP Monitoring) | Monitoring Agent | 🔨 In development |
| APM platforms (Datadog, New Relic, Dynatrace) | Monitoring Agent | 🔨 In development |
| Log management (ELK, Loki, CloudWatch Logs) | Monitoring Agent | 📋 Planned |
| Visualization (Grafana, custom dashboards) | Monitoring Agent | 📋 Planned |

Planned architecture

The Monitoring Agent follows the same Deep Agent pattern used across TalkOps: a coordinator delegates to per-platform sub-agents, and MCP servers provide standardized access to each monitoring platform.

Design principles

  1. Platform-agnostic — each monitoring platform gets its own sub-agent and MCP server. No vendor lock-in.
  2. Same HITL governance — creating alerts, dashboards, and silences all require explicit user approval.
  3. Cross-platform correlation — the coordinator can query metrics from one platform and create alerts on another (e.g., correlate CloudWatch metrics with Datadog alerts).
  4. Knowledge-driven — the agent understands monitoring best practices (RED method, USE method, SLO-based alerting) and applies them automatically.
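To make the second principle concrete, here is a minimal sketch of a human-in-the-loop approval gate. Every name in it (`ProposedChange`, `Coordinator`, `approve_fn`) is hypothetical and chosen for illustration; none of it is a published TalkOps API, and a real agent would route approved changes to the relevant MCP server rather than append them to a list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProposedChange:
    """A mutating action the agent wants to perform (hypothetical type)."""
    platform: str  # e.g. "datadog", "cloudwatch"
    action: str    # e.g. "create_alert", "create_silence"
    summary: str   # human-readable description shown to the user


@dataclass
class Coordinator:
    """Applies changes only after an explicit yes from approve_fn (hypothetical type)."""
    approve_fn: Callable[[ProposedChange], bool]
    applied: List[ProposedChange] = field(default_factory=list)
    rejected: List[ProposedChange] = field(default_factory=list)

    def submit(self, change: ProposedChange) -> bool:
        # Nothing is applied unless the approval callback returns True.
        if self.approve_fn(change):
            # Placeholder: a real agent would call the platform here.
            self.applied.append(change)
            return True
        self.rejected.append(change)
        return False


# Stand-in for a human reviewer: approves alerts, rejects silences.
coordinator = Coordinator(approve_fn=lambda c: c.action != "create_silence")
ok = coordinator.submit(
    ProposedChange("datadog", "create_alert", "p99 > 500ms on checkout")
)
denied = coordinator.submit(
    ProposedChange("cloudwatch", "create_silence", "mute RDS alarms for 1h")
)
```

Keeping the approval callback separate from the coordinator means the same gate could sit in front of every platform sub-agent, which is one way to read "same HITL governance".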

Planned capabilities

🔷 Datadog Integration

| Capability | Description |
| --- | --- |
| Metric exploration | Query and explore Datadog metrics using natural language |
| Monitor management | Create, update, and manage Datadog monitors with HITL approval |
| Dashboard generation | Build contextual dashboards from service descriptions |
| APM tracing | Query distributed traces, identify latency bottlenecks |
| Log analytics | Search and analyze logs with natural language queries |
| SLO management | Create and track Service Level Objectives |
| Downtime scheduling | Schedule maintenance windows with blast radius preview |

🟠 AWS CloudWatch Integration

| Capability | Description |
| --- | --- |
| Metric queries | Query CloudWatch metrics across AWS services |
| Alarm management | Create and manage CloudWatch Alarms with threshold tuning |
| Log Insights | Run CloudWatch Logs Insights queries in natural language |
| Dashboard creation | Generate CloudWatch dashboards with cross-service widgets |
| Anomaly detection | Configure anomaly detection bands on key metrics |
| Composite alarms | Build composite alarms from multiple metrics |

🟧 Grafana Integration

| Capability | Description |
| --- | --- |
| Dashboard generation | Create rich Grafana dashboards with panels, variables, and annotations |
| Data source management | Configure and validate data source connections |
| Alert rule authoring | Create Grafana-managed alerting rules with notification policies |
| Folder organization | Organize dashboards by team, service, or environment |
| Templating | Generate reusable dashboard templates with variables |

🟢 Log Management (Loki / ELK)

| Capability | Description |
| --- | --- |
| Log queries | Translate natural language questions into LogQL or Elasticsearch queries |
| Log-based alerting | Create alert rules triggered by log patterns |
| Log exploration | Correlate logs with metrics and traces |
| Retention policies | Configure log retention and archival rules |

Example requests (what you'll be able to do)

Cloud monitoring

"Set up CloudWatch alarms for all RDS instances — alert on CPU > 80%,
free storage < 20%, and connection count > 100."
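As a rough illustration of what a request like this might expand to, the sketch below builds the keyword arguments for three CloudWatch alarms on one RDS instance. The function name, alarm names, period, and evaluation count are assumptions made for the example, and nothing here reflects the agent's actual output. The metric names (`CPUUtilization`, `FreeStorageSpace`, `DatabaseConnections`) are standard `AWS/RDS` metrics. `FreeStorageSpace` is reported in bytes, so a "20% free" rule has to be converted using the instance's allocated storage.

```python
from typing import Dict, List

GIB = 1024 ** 3


def rds_alarm_specs(db_instance_id: str, allocated_storage_gib: int) -> List[Dict]:
    """Build put_metric_alarm keyword arguments for one RDS instance (illustrative)."""
    common = {
        "Namespace": "AWS/RDS",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,           # assumption: 5-minute datapoints
        "EvaluationPeriods": 3,  # assumption: breach must persist for 15 minutes
    }
    # Convert "less than 20% free" into a byte threshold.
    free_storage_floor = allocated_storage_gib * GIB * 20 / 100
    return [
        {**common,
         "AlarmName": f"{db_instance_id}-cpu-high",
         "MetricName": "CPUUtilization",
         "Threshold": 80.0,
         "ComparisonOperator": "GreaterThanThreshold"},
        {**common,
         "AlarmName": f"{db_instance_id}-free-storage-low",
         "MetricName": "FreeStorageSpace",
         "Threshold": free_storage_floor,
         "ComparisonOperator": "LessThanThreshold"},
        {**common,
         "AlarmName": f"{db_instance_id}-connections-high",
         "MetricName": "DatabaseConnections",
         "Threshold": 100.0,
         "ComparisonOperator": "GreaterThanThreshold"},
    ]


specs = rds_alarm_specs("orders-db", allocated_storage_gib=100)
```

Each dictionary could be passed to boto3 as `cloudwatch.put_metric_alarm(**spec)`. In the planned design that call would happen only after the user approves the proposed alarms.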

APM & tracing

"Show me the slowest API endpoints in Datadog for the checkout service
over the last 24 hours. Create a monitor for any endpoint with p99 > 500ms."

Dashboard generation

"Build a Grafana dashboard for the payments service. Include panels for
request rate, error rate, latency percentiles, and pod resource usage.
Use the RED method layout."
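A request like this would ultimately produce a Grafana dashboard JSON document. The sketch below builds a minimal one with three RED panels side by side on Grafana's 24-column grid. The Prometheus metric names (`http_requests_total`, `http_request_duration_seconds_bucket`) and the `service` and `status` labels are assumptions about how the service is instrumented. Pod resource panels are left out to keep the example short.

```python
import json
from typing import Dict


def red_dashboard(service: str) -> Dict:
    """Return a minimal Grafana dashboard model with Rate, Errors, Duration panels."""
    total = f'sum(rate(http_requests_total{{service="{service}"}}[5m]))'
    errors = f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[5m]))'
    p99 = (
        "histogram_quantile(0.99, sum by (le) "
        f'(rate(http_request_duration_seconds_bucket{{service="{service}"}}[5m])))'
    )
    queries = [
        ("Request rate", total),
        ("Error rate", f"{errors} / {total}"),
        ("Latency p99", p99),
    ]
    panels = []
    for i, (title, expr) in enumerate(queries):
        panels.append({
            "id": i + 1,
            "type": "timeseries",
            "title": title,
            # Three panels of width 8 fill the 24-column grid in one row.
            "gridPos": {"h": 8, "w": 8, "x": 8 * i, "y": 0},
            "targets": [{"refId": "A", "expr": expr}],
        })
    return {"title": f"{service} (RED)", "panels": panels}


dashboard = red_dashboard("payments")
dashboard_json = json.dumps(dashboard, indent=2)
```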

Log analysis

"Search Loki for all ERROR logs from the auth-service in the last hour.
Group by error type and create an alert if any error type exceeds 50 occurrences
in a 5-minute window."
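For a sense of the translation step, the sketch below turns that request into a LogQL alert expression. It assumes the service is identified by an `app` stream label and that log lines are JSON with an `error_type` field, which the `| json` stage extracts as a label. Both are assumptions about the log schema, and the function name is hypothetical.

```python
def logql_error_alert_expr(
    app: str,
    window: str = "5m",
    threshold: int = 50,
    group_label: str = "error_type",
) -> str:
    """Build a LogQL expression that is non-empty when any group exceeds the threshold."""
    # Select the stream, keep lines containing ERROR, then parse JSON fields into labels.
    selector = f'{{app="{app}"}} |= "ERROR" | json'
    # Count matching lines per group over the window and compare to the threshold.
    return (
        f"sum by ({group_label}) "
        f"(count_over_time({selector} [{window}])) > {threshold}"
    )


expr = logql_error_alert_expr("auth-service")
```

The resulting string is the kind of expression that could be used in a Loki ruler alerting rule or a Grafana-managed alert backed by a Loki data source.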

Cross-platform

"Our checkout service is slow. Check Datadog APM for the latency breakdown,
then check CloudWatch for any RDS or ElastiCache performance issues that
might be causing it."

Planned monitoring frameworks

The Monitoring Agent will understand and apply industry-standard observability frameworks:

RED Method (Request-oriented)

For microservices:

  • Rate — requests per second
  • Errors — error rate
  • Duration — latency distribution
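The three RED signals can be computed from a batch of request records. This is a minimal sketch using a nearest-rank percentile. The `Request` type and `red_summary` function are illustrative names, and production systems usually derive these values from histograms rather than raw samples.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    is_error: bool


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration percentiles for one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p50_ms": 0.0, "p99_ms": 0.0}

    durations = sorted(r.duration_ms for r in requests)

    def percentile(p: float) -> float:
        # Nearest-rank method: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    errors = sum(1 for r in requests if r.is_error)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }


# Ten requests with durations 10, 20, ..., 100 ms; only the slowest one failed.
sample = [Request(float(d), d == 100) for d in range(10, 101, 10)]
summary = red_summary(sample, window_seconds=5.0)
```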

USE Method (Resource-oriented)

For infrastructure:

  • Utilization — percent of resource capacity used
  • Saturation — degree of queued work
  • Errors — error events
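A comparable sketch for USE, applied to a single resource over one sampling interval. The field names are hypothetical. Saturation is represented here by queue length, which is one common proxy, though the right measure depends on the resource.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResourceSample:
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work waiting for the resource
    error_count: int         # error events seen during the interval


def use_summary(sample: ResourceSample) -> Dict[str, float]:
    """Compute Utilization, Saturation, and Errors for one resource."""
    return {
        "utilization_pct": 100 * sample.busy_seconds / sample.interval_seconds,
        "saturation": sample.queue_length,
        "errors": sample.error_count,
    }


disk = use_summary(ResourceSample(busy_seconds=45.0, interval_seconds=60.0,
                                  queue_length=3, error_count=0))
```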

SLO-Based Alerting

Instead of threshold-based alerts, configure Service Level Objectives:

  • Define SLO targets (e.g., 99.9% availability, p99 < 200ms)
  • Agent calculates error budgets and burn rates
  • Alerts fire when the error budget is burning faster than the SLO allows (a high burn rate), rather than when a raw metric crosses a fixed threshold
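The arithmetic behind error budgets and burn rates is short enough to sketch. The error budget is the allowed failure fraction (1 minus the SLO target), and the burn rate is the observed error ratio divided by that budget, so a burn rate of 1 spends the budget exactly over the SLO period. The paging rule below uses a two-window check with a threshold of 14.4, a value commonly cited for a 1-hour window against a 30-day SLO. The function names and the choice of threshold are assumptions for illustration, not the agent's confirmed behavior.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction, e.g. 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / error_budget(slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast, so a recovered incident stops alerting."""
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )


# 2% errors against a 99.9% SLO burns the budget roughly 20 times too fast.
current = burn_rate(0.02, 0.999)
page_now = should_page(0.02, 0.03, 0.999)
recovered = should_page(0.02, 0.001, 0.999)
```

Requiring the short window to agree with the long one is what keeps the alert from firing long after the problem has stopped.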

How it differs from K8s Observability

| Aspect | K8s Autopilot (Observability) | Monitoring Agent |
| --- | --- | --- |
| Scope | Kubernetes-native (in-cluster Prometheus + Alertmanager) | Cloud-native platforms, APM, logs |
| Platforms | Prometheus, Alertmanager | Datadog, CloudWatch, Grafana, Loki, New Relic |
| Metric source | PromQL against in-cluster Prometheus | Platform-specific APIs (Datadog, CloudWatch, etc.) |
| Alert target | Alertmanager (Kubernetes CRDs) | Platform-native (Datadog monitors, CloudWatch alarms, Grafana alerts) |
| Dashboards | N/A (Prometheus-focused) | Grafana dashboards, Datadog dashboards, CloudWatch dashboards |
| Logging | N/A | Loki, ELK, CloudWatch Logs |
| APM / Traces | N/A | Datadog APM, distributed tracing |
| Collaboration | Hands off to K8s Operator for pod-level issues | Hands off to Infrastructure Agents for provisioning changes |

When to use which?

  • Running Prometheus + Alertmanager on Kubernetes? → Use the K8s Autopilot Observability domain
  • Using Datadog, CloudWatch, Grafana, or other managed platforms? → The Monitoring Agent is what you need
  • Both? → They'll work together. The Monitoring Agent can correlate cloud-level metrics with K8s Autopilot's cluster-level insights

Roadmap

| Phase | Target | Status |
| --- | --- | --- |
| Phase 1 | Datadog integration (metrics, monitors, dashboards) | 🔨 In development |
| Phase 2 | AWS CloudWatch integration (metrics, alarms, logs) | 📋 Planned |
| Phase 3 | Grafana dashboard generation and management | 📋 Planned |
| Phase 4 | Log management (Loki, ELK, CloudWatch Logs) | 📋 Planned |
| Phase 5 | Cross-platform correlation and unified alerting | 📋 Planned |

Stay updated