Monitoring Agent

Status: In Development

The Monitoring Agent is currently in active development. This page outlines the planned architecture and capabilities. We'll update it as features ship.

What we're building

Setting up observability is one of those tasks that sounds simple until you're three hours deep in YAML — configuring scrapers, chasing metric names across provider docs, building dashboards that nobody looks at, and writing alerting rules that either fire too much or not at all.

The Monitoring Agent is being built to handle this through conversation. Describe what you want to monitor, and the agent pipeline will instrument it, build dashboards, configure alerts, and validate that everything is working. It will do this across cloud-native monitoring platforms, APM tools, and log management systems.

How it relates to K8s Autopilot

If you're monitoring Kubernetes workloads with Prometheus and Alertmanager, that's already covered — the Kubernetes Agent's Observability domain ships with deep Prometheus + Alertmanager support (PromQL queries, exporter lifecycle, alerting rules, silence management, routing audit).

The Monitoring Agent covers everything else:

Scope	Agent	What It Covers
Kubernetes (Prometheus + Alertmanager)	K8s Autopilot — Observability	✅ Shipped — PromQL, exporters, alerting rules, silence lifecycle, TSDB analysis
Cloud monitoring (CloudWatch, Azure Monitor, GCP Monitoring)	Monitoring Agent	🔨 In development
APM platforms (Datadog, New Relic, Dynatrace)	Monitoring Agent	🔨 In development
Log management (ELK, Loki, CloudWatch Logs)	Monitoring Agent	📋 Planned
Visualization (Grafana, custom dashboards)	Monitoring Agent	📋 Planned

Planned architecture

The Monitoring Agent follows the same Deep Agent pattern used across TalkOps, with MCP servers providing standardized access to each monitoring platform.

Design principles

Platform-agnostic — each monitoring platform gets its own sub-agent and MCP server. No vendor lock-in.
Same HITL governance — creating alerts, dashboards, and silences all require explicit user approval.
Cross-platform correlation — the coordinator can query metrics from one platform and create alerts on another (e.g., correlate CloudWatch metrics with Datadog alerts).
Knowledge-driven — the agent understands monitoring best practices (RED method, USE method, SLO-based alerting) and applies them automatically.
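
To make the HITL principle concrete, here is a minimal sketch of an approval gate. It is not the agent's actual implementation: `ProposedAction`, `MUTATING_ACTIONS`, and `run_with_approval` are hypothetical names chosen for illustration. It assumes read-only actions run directly, while anything that creates an alert, dashboard, or silence waits for an explicit yes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds that change state on a monitoring platform.
MUTATING_ACTIONS = {"create_alert", "create_dashboard", "create_silence"}


@dataclass(frozen=True)
class ProposedAction:
    platform: str  # e.g. "datadog" or "cloudwatch"
    kind: str      # e.g. "create_alert" or "query_metrics"
    summary: str   # human-readable description shown to the approver


def run_with_approval(
    action: ProposedAction,
    execute: Callable[[ProposedAction], str],
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Run read-only actions directly; ask a human before any mutating action."""
    if action.kind in MUTATING_ACTIONS and not approve(action):
        return f"rejected: {action.summary}"
    return execute(action)
```

In this sketch `approve` is just a callback, so the same gate could sit in front of a chat prompt, a ticket, or a CLI confirmation.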

Planned capabilities

🔷 Datadog Integration

Capability	Description
Metric exploration	Query and explore Datadog metrics using natural language
Monitor management	Create, update, and manage Datadog monitors with HITL approval
Dashboard generation	Build contextual dashboards from service descriptions
APM tracing	Query distributed traces, identify latency bottlenecks
Log analytics	Search and analyze logs with natural language queries
SLO management	Create and track Service Level Objectives
Downtime scheduling	Schedule maintenance windows with blast radius preview

🟠 AWS CloudWatch Integration

Capability	Description
Metric queries	Query CloudWatch metrics across AWS services
Alarm management	Create and manage CloudWatch Alarms with threshold tuning
Log Insights	Run CloudWatch Logs Insights queries in natural language
Dashboard creation	Generate CloudWatch dashboards with cross-service widgets
Anomaly detection	Configure anomaly detection bands on key metrics
Composite alarms	Build composite alarms from multiple metrics

🟧 Grafana Integration

Capability	Description
Dashboard generation	Create rich Grafana dashboards with panels, variables, and annotations
Data source management	Configure and validate data source connections
Alert rule authoring	Create Grafana-managed alerting rules with notification policies
Folder organization	Organize dashboards by team, service, or environment
Templating	Generate reusable dashboard templates with variables

🟢 Log Management (Loki / ELK)

Capability	Description
Log queries	Translate natural language questions into LogQL or Elasticsearch queries
Log-based alerting	Create alert rules triggered by log patterns
Log exploration	Correlate logs with metrics and traces
Retention policies	Configure log retention and archival rules

Example requests (what you'll be able to do)

Cloud monitoring

"Set up CloudWatch alarms for all RDS instances — alert on CPU > 80%,
free storage < 20%, and connection count > 100."
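
As a rough illustration of what a request like this could translate into, the sketch below builds keyword arguments for CloudWatch's `put_metric_alarm` call, one dictionary per alarm. The function name, the alarm naming scheme, the 5-minute period, and the three evaluation periods are assumptions made for the example, not the agent's actual output. Because RDS reports `FreeStorageSpace` in bytes, the "20%" threshold has to be derived from the instance's allocated storage, which is passed in here as a parameter.

```python
GIB = 1024 ** 3


def rds_alarm_specs(db_instance_id: str, allocated_storage_gib: int) -> list[dict]:
    """Build put_metric_alarm keyword arguments for one RDS instance."""
    common = {
        "Namespace": "AWS/RDS",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,           # assumed: 5-minute datapoints
        "EvaluationPeriods": 3,  # assumed: 3 consecutive breaches before alarming
    }
    # FreeStorageSpace is in bytes, so "< 20%" needs the allocated size.
    free_storage_floor = 0.20 * allocated_storage_gib * GIB
    rules = [
        ("cpu-high", "CPUUtilization", "GreaterThanThreshold", 80.0),
        ("free-storage-low", "FreeStorageSpace", "LessThanThreshold", free_storage_floor),
        ("connections-high", "DatabaseConnections", "GreaterThanThreshold", 100.0),
    ]
    return [
        {
            **common,
            "AlarmName": f"{db_instance_id}-{suffix}",  # hypothetical naming scheme
            "MetricName": metric,
            "ComparisonOperator": operator,
            "Threshold": threshold,
        }
        for suffix, metric, operator, threshold in rules
    ]
```

Once a human has approved the plan, each dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**spec)`.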

APM & tracing

"Show me the slowest API endpoints in Datadog for the checkout service
over the last 24 hours. Create a monitor for any endpoint with p99 > 500ms."

Dashboard generation

"Build a Grafana dashboard for the payments service. Include panels for
request rate, error rate, latency percentiles, and pod resource usage.
Use the RED method layout."

Log analysis

"Search Loki for all ERROR logs from the auth-service in the last hour.
Group by error type and create an alert if any error type exceeds 50 occurrences
in a 5-minute window."
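
For a sense of the translation involved, here is a small sketch that turns the alerting half of this request into a LogQL expression. It assumes the service is selected by an `app` label and that log lines are in logfmt with an `error_type` field. Both are hypothetical; real label names depend on how your Loki deployment is set up.

```python
def error_burst_rule(app: str, threshold: int, window: str = "5m") -> str:
    """Return a LogQL expression that matches any error_type above the threshold."""
    # Select the service's streams, keep ERROR lines, and parse logfmt fields
    # so that error_type becomes a label we can group by.
    selector = f'{{app="{app}"}} |= "ERROR" | logfmt'
    return f"sum by (error_type) (count_over_time({selector} [{window}])) > {threshold}"


print(error_burst_rule("auth-service", 50))
```

The resulting expression counts matching lines per `error_type` over the window and keeps only the series above 50, which is the shape an alert rule condition needs.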

Cross-platform

"Our checkout service is slow. Check Datadog APM for the latency breakdown,
then check CloudWatch for any RDS or ElastiCache performance issues that
might be causing it."

Planned monitoring frameworks

The Monitoring Agent will understand and apply industry-standard observability frameworks:

RED Method (Request-oriented)

For microservices:

Rate — requests per second
Errors — error rate
Duration — latency distribution
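
The three RED signals can be sketched from raw request records as follows. This is an illustration only: the `Request` type is hypothetical, and treating HTTP status 500 and above as an error is an assumption you may want to change for your service.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for one observation window."""
    if len(requests) < 2:
        raise ValueError("need at least two requests to compute percentiles")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error definition
    cuts = quantiles(durations, n=100, method="inclusive")  # 99 percentile cut points
    return {
        "rate_rps": len(requests) / window_seconds,  # Rate
        "error_ratio": errors / len(requests),       # Errors
        "p50_ms": cuts[49],                          # Duration (median)
        "p99_ms": cuts[98],                          # Duration (tail)
    }
```

Reporting duration as percentiles rather than an average keeps slow outliers visible, which is why the example requests above ask for p99.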

USE Method (Resource-oriented)

For infrastructure:

Utilization — percent of resource capacity used
Saturation — degree of queued work
Errors — error events

SLO-Based Alerting

Instead of threshold-based alerts, configure Service Level Objectives:

Define SLO targets (e.g., 99.9% availability, p99 < 200ms)
Agent calculates error budgets and burn rates
Alerts fire when error budget consumption is abnormal, for example when the budget is burning fast enough to run out before the SLO window ends
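
The arithmetic behind error budgets and burn rates can be sketched in a few lines. This is an illustration under stated assumptions, not the agent's actual logic: the function names are hypothetical, and the default threshold of 14.4 with a long-window and short-window check is borrowed from commonly cited multi-window burn-rate practice rather than from TalkOps.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    observed_error_ratio = bad_events / total_events
    return observed_error_ratio / error_budget(slo_target)


def should_page(long_window_burn: float, short_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast, so stale spikes do not alert."""
    return long_window_burn >= threshold and short_window_burn >= threshold
```

For example, 20 failed requests out of 1,000 against a 99.9% target is a 2% error ratio on a 0.1% budget, a burn rate of about 20. Sustained, that would spend a 30-day budget in roughly a day and a half.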

How it differs from K8s Observability

Aspect	K8s Autopilot — Observability	Monitoring Agent
Scope	Kubernetes-native (in-cluster Prometheus + Alertmanager)	Cloud-native platforms, APM, logs
Platforms	Prometheus, Alertmanager	Datadog, CloudWatch, Grafana, Loki, New Relic
Metric source	PromQL against in-cluster Prometheus	Platform-specific APIs (Datadog, CloudWatch, etc.)
Alert target	Alertmanager (Kubernetes CRDs)	Platform-native (Datadog monitors, CloudWatch alarms, Grafana alerts)
Dashboard	N/A (Prometheus-focused)	Grafana dashboards, Datadog dashboards, CloudWatch dashboards
Logging	N/A	Loki, ELK, CloudWatch Logs
APM / Traces	N/A	Datadog APM, distributed tracing
Collaboration	Hands off to K8s Operator for pod-level issues	Hands off to Infrastructure Agents for provisioning changes

When to use which?

Running Prometheus + Alertmanager on Kubernetes? → Use the K8s Autopilot Observability domain
Using Datadog, CloudWatch, Grafana, or other managed platforms? → The Monitoring Agent is what you need
Both? → They'll work together. The Monitoring Agent can correlate cloud-level metrics with K8s Autopilot's cluster-level insights

Roadmap

Phase	Target	Status
Phase 1	Datadog integration (metrics, monitors, dashboards)	🔨 In development
Phase 2	AWS CloudWatch integration (metrics, alarms, logs)	📋 Planned
Phase 3	Grafana dashboard generation and management	📋 Planned
Phase 4	Log management (Loki, ELK, CloudWatch Logs)	📋 Planned
Phase 5	Cross-platform correlation and unified alerting	📋 Planned

Stay updated

⭐ Star the TalkOps repo to get notified when the Monitoring Agent ships
💬 Join our Discord to follow development updates
📝 Request a feature — tell us which monitoring platforms to prioritize
