Monitoring Agent
The Monitoring Agent is currently in active development. This page outlines the planned architecture and capabilities. We'll update it as features ship.
What we're building
Setting up observability is one of those tasks that sounds simple until you're three hours deep in YAML — configuring scrapers, chasing metric names across provider docs, building dashboards that nobody looks at, and writing alerting rules that either fire too much or not at all.
The Monitoring Agent is being built to handle this through conversation. Describe what you want to monitor, and the agent pipeline will instrument it, build dashboards, configure alerts, and validate that everything works across cloud-native monitoring platforms, APM tools, and log management systems.
How it relates to K8s Autopilot
If you're monitoring Kubernetes workloads with Prometheus and Alertmanager, that's already covered: the K8s Autopilot Observability domain ships with deep Prometheus + Alertmanager support (PromQL queries, exporter lifecycle, alerting rules, silence management, routing audit).
The Monitoring Agent covers everything else:
| Scope | Agent | What It Covers |
|---|---|---|
| Kubernetes (Prometheus + Alertmanager) | K8s Autopilot — Observability | ✅ Shipped — PromQL, exporters, alerting rules, silence lifecycle, TSDB analysis |
| Cloud monitoring (CloudWatch, Azure Monitor, GCP Monitoring) | Monitoring Agent | 🔨 In development |
| APM platforms (Datadog, New Relic, Dynatrace) | Monitoring Agent | 🔨 In development |
| Log management (ELK, Loki, CloudWatch Logs) | Monitoring Agent | 📋 Planned |
| Visualization (Grafana, custom dashboards) | Monitoring Agent | 📋 Planned |
Planned architecture
The Monitoring Agent follows the same Deep Agent pattern used across TalkOps, with MCP servers providing standardized access to each monitoring platform.
Design principles
- Platform-agnostic — each monitoring platform gets its own sub-agent and MCP server. No vendor lock-in.
- Same HITL governance — creating alerts, dashboards, and silences all require explicit user approval.
- Cross-platform correlation — the coordinator can query metrics from one platform and create alerts on another (e.g., correlate CloudWatch metrics with Datadog alerts).
- Knowledge-driven — the agent understands monitoring best practices (RED method, USE method, SLO-based alerting) and applies them automatically.
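To make the HITL principle concrete, here is a minimal sketch of an approval gate in Python. It is not the Monitoring Agent's implementation, which is still in development. `ProposedAction`, `ApprovalGate`, `MUTATING_ACTIONS`, and the action names are all hypothetical, and the `approve` callback stands in for a real user prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical action names: anything that changes a platform needs approval.
MUTATING_ACTIONS = {"create_alert", "create_dashboard", "create_silence"}


@dataclass
class ProposedAction:
    platform: str
    action: str
    summary: str


@dataclass
class ApprovalGate:
    approve: Callable[[ProposedAction], bool]  # stand-in for asking the user
    audit_log: list = field(default_factory=list)

    def run(self, proposal: ProposedAction,
            execute: Callable[[ProposedAction], str]) -> Optional[str]:
        # Mutating actions run only after explicit approval; reads pass through.
        if proposal.action in MUTATING_ACTIONS and not self.approve(proposal):
            self.audit_log.append(f"rejected: {proposal.summary}")
            return None
        result = execute(proposal)
        self.audit_log.append(f"executed: {proposal.summary}")
        return result


def fake_execute(proposal: ProposedAction) -> str:
    # Placeholder for a call into a platform-specific sub-agent.
    return f"{proposal.platform}:{proposal.action} done"


gate = ApprovalGate(approve=lambda proposal: False)  # user declines everything
alert = ProposedAction("datadog", "create_alert", "p99 latency monitor for checkout")
query = ProposedAction("cloudwatch", "query_metrics", "RDS CPU over the last hour")

alert_result = gate.run(alert, fake_execute)
query_result = gate.run(query, fake_execute)
print(alert_result, query_result, gate.audit_log)
```

In this sketch the read-only query runs without a prompt, while the alert creation is blocked because the stand-in user declined it.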
Planned capabilities
🔷 Datadog Integration
| Capability | Description |
|---|---|
| Metric exploration | Query and explore Datadog metrics using natural language |
| Monitor management | Create, update, and manage Datadog monitors with HITL approval |
| Dashboard generation | Build contextual dashboards from service descriptions |
| APM tracing | Query distributed traces, identify latency bottlenecks |
| Log analytics | Search and analyze logs with natural language queries |
| SLO management | Create and track Service Level Objectives |
| Downtime scheduling | Schedule maintenance windows with blast radius preview |
🟠 AWS CloudWatch Integration
| Capability | Description |
|---|---|
| Metric queries | Query CloudWatch metrics across AWS services |
| Alarm management | Create and manage CloudWatch Alarms with threshold tuning |
| Log Insights | Run CloudWatch Logs Insights queries in natural language |
| Dashboard creation | Generate CloudWatch dashboards with cross-service widgets |
| Anomaly detection | Configure anomaly detection bands on key metrics |
| Composite alarms | Build composite alarms from multiple metrics |
🟧 Grafana Integration
| Capability | Description |
|---|---|
| Dashboard generation | Create rich Grafana dashboards with panels, variables, and annotations |
| Data source management | Configure and validate data source connections |
| Alert rule authoring | Create Grafana-managed alerting rules with notification policies |
| Folder organization | Organize dashboards by team, service, or environment |
| Templating | Generate reusable dashboard templates with variables |
🟢 Log Management (Loki / ELK)
| Capability | Description |
|---|---|
| Log queries | Translate natural language questions into LogQL or Elasticsearch queries |
| Log-based alerting | Create alert rules triggered by log patterns |
| Log exploration | Correlate logs with metrics and traces |
| Retention policies | Configure log retention and archival rules |
Example requests (what you'll be able to do)
Cloud monitoring
"Set up CloudWatch alarms for all RDS instances — alert on CPU > 80%,
free storage < 20%, and connection count > 100."
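As a rough illustration of what this request could translate into, the Python sketch below builds the parameters for three CloudWatch alarms on one RDS instance. It only constructs dictionaries and does not call AWS. The instance name, allocated storage, alarm naming scheme, and the period and evaluation settings are assumptions made for the example, and `rds_alarm_definitions` is a hypothetical helper. One detail worth noting: the RDS `FreeStorageSpace` metric is reported in bytes, so a percentage threshold has to be converted using the allocated storage size.

```python
from typing import Any

GIB = 1024 ** 3


def rds_alarm_definitions(instance_id: str, allocated_storage_gib: float) -> list[dict[str, Any]]:
    """Build one parameter dict per alarm, shaped for CloudWatch PutMetricAlarm."""
    # (metric name, comparison operator, threshold, alarm name suffix)
    rules = [
        ("CPUUtilization", "GreaterThanThreshold", 80.0, "cpu-high"),
        # FreeStorageSpace is in bytes: 20% of the allocated size.
        ("FreeStorageSpace", "LessThanThreshold",
         0.20 * allocated_storage_gib * GIB, "storage-low"),
        ("DatabaseConnections", "GreaterThanThreshold", 100.0, "connections-high"),
    ]
    return [
        {
            "AlarmName": f"rds-{instance_id}-{suffix}",  # assumed naming scheme
            "Namespace": "AWS/RDS",
            "MetricName": metric,
            "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
            "Statistic": "Average",
            "Period": 300,            # assumed: 5-minute datapoints
            "EvaluationPeriods": 3,   # assumed: 3 consecutive breaches
            "ComparisonOperator": operator,
            "Threshold": threshold,
        }
        for metric, operator, threshold, suffix in rules
    ]


alarms = rds_alarm_definitions("orders-db", allocated_storage_gib=100)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["MetricName"], alarm["ComparisonOperator"])
```

With boto3 installed and credentials configured, each dictionary could be passed to `put_metric_alarm(**alarm)` on a CloudWatch client. Covering "all RDS instances" would also need an instance listing step, which is omitted here.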
APM & tracing
"Show me the slowest API endpoints in Datadog for the checkout service
over the last 24 hours. Create a monitor for any endpoint with p99 > 500ms."
Dashboard generation
"Build a Grafana dashboard for the payments service. Include panels for
request rate, error rate, latency percentiles, and pod resource usage.
Use the RED method layout."
Log analysis
"Search Loki for all ERROR logs from the auth-service in the last hour.
Group by error type and create an alert if any error type exceeds 50 occurrences
in a 5-minute window."
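The alert condition in this request (more than 50 occurrences of one error type within 5 minutes) can be sketched in plain Python as a sliding-window count. This only illustrates the logic and is not how the agent or Loki would evaluate it. In Loki itself the rule would more likely be expressed as a LogQL `count_over_time` query. The event tuples and error type names below are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def noisy_error_types(events, window=timedelta(minutes=5), limit=50):
    """Map each error type whose peak count in any sliding window exceeds `limit` to that peak."""
    by_type = defaultdict(list)
    for timestamp, error_type in events:
        by_type[error_type].append(timestamp)

    peaks = {}
    for error_type, stamps in by_type.items():
        stamps.sort()
        start = 0
        peak = 0
        for end, stamp in enumerate(stamps):
            # Drop events that fall outside the window ending at `stamp`.
            while stamp - stamps[start] >= window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak > limit:
            peaks[error_type] = peak
    return peaks


base = datetime(2024, 1, 1, 12, 0, 0)
events = (
    # 60 errors in under 4 minutes: should trip the alert.
    [(base + timedelta(seconds=4 * i), "TokenExpired") for i in range(60)]
    # 60 errors spread over an hour: never more than 5 per window.
    + [(base + timedelta(minutes=i), "DbTimeout") for i in range(60)]
)
alerts = noisy_error_types(events)
print(alerts)
```

Grouping before counting matters here: the total of 120 errors is spread across two types, and only one of them exceeds the per-type limit.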
Cross-platform
"Our checkout service is slow. Check Datadog APM for the latency breakdown,
then check CloudWatch for any RDS or ElastiCache performance issues that
might be causing it."
Planned monitoring frameworks
The Monitoring Agent will understand and apply industry-standard observability frameworks:
RED Method (Request-oriented)
For microservices:
- Rate — requests per second
- Errors — error rate
- Duration — latency distribution
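The three RED signals can be computed from raw request records as in the sketch below. The `Request` record, the choice to treat HTTP 5xx as errors, and the nearest-rank percentile are assumptions made for illustration. A real platform would compute these from its own metrics or traces.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    if not requests:
        raise ValueError("need at least one request to summarize")
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)  # assumed: 5xx counts as an error
    return {
        "rate_rps": len(requests) / window_seconds,   # Rate
        "error_ratio": errors / len(requests),        # Errors
        "p50_ms": percentile(durations, 50),          # Duration
        "p99_ms": percentile(durations, 99),
    }


# 100 synthetic requests over one minute; every tenth one fails.
sample = [Request(duration_ms=20.0 * i, status=500 if i % 10 == 0 else 200)
          for i in range(1, 101)]
summary = red_summary(sample, window_seconds=60)
print(summary)
```

Duration is reported as percentiles rather than a mean because a small number of slow requests can hide behind a healthy average.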
USE Method (Resource-oriented)
For infrastructure:
- Utilization — percent of resource capacity used
- Saturation — degree of queued work
- Errors — error events
SLO-Based Alerting
Instead of threshold-based alerts, configure Service Level Objectives:
- Define SLO targets (e.g., 99.9% availability, p99 < 200ms)
- The agent calculates error budgets and burn rates
- Alerts fire when the error budget is being consumed faster than the SLO allows
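The arithmetic behind these bullets is short enough to sketch. The error budget is the fraction of events allowed to fail (1 minus the SLO target), and the burn rate is the observed error ratio divided by that budget, so a burn rate of 1.0 spends the budget exactly over the SLO period. The two-window check and the 14.4 threshold below follow a common SRE convention (14.4 corresponds to spending 2% of a 30-day budget in one hour). They are assumptions for this example, not confirmed agent behavior, and the event counts are made up.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    budget = 1.0 - slo_target               # allowed fraction of bad events
    observed = bad_events / total_events    # actual fraction of bad events
    return observed / budget


def should_alert(long_window: float, short_window: float, threshold: float) -> bool:
    # Both windows must exceed the threshold, so a brief spike (short only)
    # or an incident that has already recovered (long only) does not alert.
    return long_window >= threshold and short_window >= threshold


SLO_TARGET = 0.999   # 99.9% availability
THRESHOLD = 14.4     # assumed: 2% of a 30-day budget spent in one hour

hour_rate = burn_rate(bad_events=3_000, total_events=100_000, slo_target=SLO_TARGET)
five_min_rate = burn_rate(bad_events=200, total_events=8_000, slo_target=SLO_TARGET)
print(round(hour_rate, 1), round(five_min_rate, 1),
      should_alert(hour_rate, five_min_rate, THRESHOLD))
```

Here a 3% error ratio against a 0.1% budget gives a burn rate of roughly 30, well above the threshold in both windows, so the check returns true.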
How it differs from K8s Observability
| Aspect | K8s Autopilot — Observability | Monitoring Agent |
|---|---|---|
| Scope | Kubernetes-native (in-cluster Prometheus + Alertmanager) | Cloud-native platforms, APM, logs |
| Platforms | Prometheus, Alertmanager | Datadog, CloudWatch, Grafana, Loki, New Relic |
| Metric source | PromQL against in-cluster Prometheus | Platform-specific APIs (Datadog, CloudWatch, etc.) |
| Alert target | Alertmanager (Kubernetes CRDs) | Platform-native (Datadog monitors, CloudWatch alarms, Grafana alerts) |
| Dashboard | N/A (Prometheus-focused) | Grafana dashboards, Datadog dashboards, CloudWatch dashboards |
| Logging | N/A | Loki, ELK, CloudWatch Logs |
| APM / Traces | N/A | Datadog APM, distributed tracing |
| Collaboration | Hands off to K8s Operator for pod-level issues | Hands off to Infrastructure Agents for provisioning changes |
- Running Prometheus + Alertmanager on Kubernetes? → Use the K8s Autopilot Observability domain
- Using Datadog, CloudWatch, Grafana, or other managed platforms? → That's what the Monitoring Agent is being built for
- Both? → They'll work together. The Monitoring Agent can correlate cloud-level metrics with K8s Autopilot's cluster-level insights
Roadmap
| Phase | Target | Status |
|---|---|---|
| Phase 1 | Datadog integration (metrics, monitors, dashboards) | 🔨 In development |
| Phase 2 | AWS CloudWatch integration (metrics, alarms, logs) | 📋 Planned |
| Phase 3 | Grafana dashboard generation and management | 📋 Planned |
| Phase 4 | Log management (Loki, ELK, CloudWatch Logs) | 📋 Planned |
| Phase 5 | Cross-platform correlation and unified alerting | 📋 Planned |
Stay updated
- ⭐ Star the TalkOps repo to get notified when the Monitoring Agent ships
- 💬 Join our Discord to follow development updates
- 📝 Request a feature — tell us which monitoring platforms to prioritize