SRE Agent
The SRE Agent is currently in active development. This page outlines the planned architecture and capabilities. We'll update it as features ship.
What we're building
Your infrastructure already has agents that can debug pods, query metrics, roll back deployments, and manage alerts. What's missing is someone to coordinate all of them during an incident: the 3 AM engineer who acknowledges the page, triages the blast radius, pulls the right data from the right systems, decides on a remediation action, and writes the postmortem.
The SRE Agent fills that gap. It's not another monitoring agent or another Kubernetes operator. It's the incident commander — an orchestration layer that sits above the other TalkOps agents and drives the full incident lifecycle from page to postmortem.
What it does vs what already exists
| Capability | Covered By | SRE Agent's Role |
|---|---|---|
| Pod debugging, scaling, restarts | K8s Autopilot — K8s Operator | Delegates pod-level remediation |
| Helm rollback, Argo Rollouts abort | K8s Autopilot — Helm / App Operator | Delegates deployment rollbacks |
| Prometheus queries, alert triage | K8s Autopilot — Observability | Delegates metric collection |
| Datadog, CloudWatch, Grafana | Monitoring Agent | Delegates cloud-level metrics |
| PagerDuty / OpsGenie integration | Nobody today | ✅ Owns this |
| Incident lifecycle orchestration | Nobody today | ✅ Owns this |
| Cross-agent coordination during incidents | Nobody today | ✅ Owns this |
| SLO enforcement & error budget tracking | Nobody today | ✅ Owns this |
| Runbook automation | Nobody today | ✅ Owns this |
| Automated postmortem generation | Nobody today | ✅ Owns this |
Planned architecture
The SRE Agent acts as an orchestration layer that delegates work to other TalkOps agents rather than duplicating their capabilities.
Design principles
- Orchestrate, don't duplicate — the SRE Agent never queries Prometheus directly or restarts pods. It delegates to the agents that already do this well.
- Incident-first — everything is scoped to incident lifecycle. This is not a general-purpose agent.
- Escalation, not automation — auto-remediation only runs pre-approved runbooks with HITL gates. The agent never takes irreversible action without explicit sign-off.
- Error budget awareness — every remediation decision is informed by the SLO error budget. If the budget is exhausted, the agent switches from "move fast" to "be cautious."
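The first and last principles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TaskKind`, `ROUTES`, `delegate`, and `choose_mode` are hypothetical names, and the agent labels are copied from the capability table earlier on this page. The real agent may route and decide quite differently.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical task categories, one per delegated capability."""
    POD_REMEDIATION = "pod_remediation"
    DEPLOYMENT_ROLLBACK = "deployment_rollback"
    CLUSTER_METRICS = "cluster_metrics"
    CLOUD_METRICS = "cloud_metrics"


# Assumed routing table: the SRE Agent decides who does the work,
# but never performs the work itself.
ROUTES = {
    TaskKind.POD_REMEDIATION: "K8s Autopilot (K8s Operator)",
    TaskKind.DEPLOYMENT_ROLLBACK: "K8s Autopilot (Helm / App Operator)",
    TaskKind.CLUSTER_METRICS: "K8s Autopilot (Observability)",
    TaskKind.CLOUD_METRICS: "Monitoring Agent",
}


def delegate(kind: TaskKind) -> str:
    """Return the name of the agent that should handle this task."""
    return ROUTES[kind]


def choose_mode(budget_remaining: float) -> str:
    """Pick a remediation posture from the remaining error budget (0.0 to 1.0)."""
    return "be cautious" if budget_remaining <= 0.0 else "move fast"
```

For example, `delegate(TaskKind.CLOUD_METRICS)` returns `"Monitoring Agent"`, and `choose_mode(0.0)` returns `"be cautious"`.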
Planned capabilities
🚨 Incident Lifecycle Management
The core capability — driving an incident from trigger to resolution:
| Phase | What the SRE Agent does |
|---|---|
| Trigger | Receives an incident from PagerDuty/OpsGenie, or picks up an issue reported by a user |
| Acknowledge | Acknowledges the page, notifies the incident channel, sets severity |
| Triage | Assesses blast radius — which services, how many users, what's the SLO impact? |
| Diagnose | Delegates to Monitoring Agent (metrics) + K8s Autopilot (pods) to gather evidence |
| Remediate | Matches symptoms to runbooks and presents remediation options for human-in-the-loop (HITL) approval |
| Verify | Confirms fix worked — re-checks metrics, verifies pod health |
| Resolve | Closes the incident on PagerDuty/OpsGenie, updates status page |
| Postmortem | Generates a structured postmortem from incident timeline + actions taken |
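One way to picture the lifecycle above is as a small state machine. The sketch below is an assumption about how the phases could be modeled, not the agent's actual design: the `Phase` enum mirrors the table, while `TRANSITIONS`, the `Incident` class, and the loop from Verify back to Diagnose (for a fix that did not work) are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    TRIGGER = "Trigger"
    ACKNOWLEDGE = "Acknowledge"
    TRIAGE = "Triage"
    DIAGNOSE = "Diagnose"
    REMEDIATE = "Remediate"
    VERIFY = "Verify"
    RESOLVE = "Resolve"
    POSTMORTEM = "Postmortem"


# Assumed transitions: linear, except that a failed verification
# sends the incident back to diagnosis.
TRANSITIONS = {
    Phase.TRIGGER: {Phase.ACKNOWLEDGE},
    Phase.ACKNOWLEDGE: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.DIAGNOSE},
    Phase.DIAGNOSE: {Phase.REMEDIATE},
    Phase.REMEDIATE: {Phase.VERIFY},
    Phase.VERIFY: {Phase.RESOLVE, Phase.DIAGNOSE},
    Phase.RESOLVE: {Phase.POSTMORTEM},
    Phase.POSTMORTEM: set(),
}


class Incident:
    """Tracks the current phase and the phases visited so far."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.TRIGGER
        self.history = [Phase.TRIGGER]

    def advance(self, target: Phase) -> None:
        """Move to the target phase, rejecting transitions not in the table."""
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"cannot go from {self.phase.name} to {target.name}"
            )
        self.phase = target
        self.history.append(target)
```

Keeping the history on the incident is a deliberate choice in this sketch: the same list could later feed the postmortem timeline.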
📟 PagerDuty / OpsGenie Integration
| Capability | Description |
|---|---|
| Incident ingestion | Receive and parse incident triggers with full alert context |
| Acknowledge | Auto-acknowledge on triage start (configurable) |
| Escalation | Escalate to the next on-call responder if remediation fails |
| Status updates | Post triage findings and remediation progress as incident notes |
| Resolution | Close incidents with structured resolution summary |
| On-call lookup | Query current on-call schedule for escalation routing |
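The acknowledge, status-update, and escalation rows could behave roughly as follows. This sketch deliberately does not call the PagerDuty or OpsGenie APIs; `PagedIncident`, `start_triage`, and `escalate` are hypothetical names, and the escalation chain is plain in-memory data standing in for an on-call schedule lookup.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PagedIncident:
    """Hypothetical normalized incident, independent of the paging vendor."""
    incident_id: str
    summary: str
    escalation_chain: List[str]  # ordered on-call responders
    level: int = 0               # index into escalation_chain
    acknowledged: bool = False
    notes: List[str] = field(default_factory=list)

    @property
    def current_responder(self) -> str:
        return self.escalation_chain[self.level]


def start_triage(incident: PagedIncident, auto_ack: bool = True) -> None:
    """Optionally acknowledge the page when triage starts, and leave a note."""
    if auto_ack:
        incident.acknowledged = True
    incident.notes.append("Triage started")


def escalate(incident: PagedIncident, reason: str) -> Optional[str]:
    """Move to the next responder; return None if the chain is exhausted."""
    if incident.level + 1 >= len(incident.escalation_chain):
        incident.notes.append(f"Escalation chain exhausted: {reason}")
        return None
    incident.level += 1
    incident.notes.append(
        f"Escalated to {incident.current_responder}: {reason}"
    )
    return incident.current_responder
```

With a chain of `["alice", "bob"]`, the first `escalate` call returns `"bob"` and the second returns `None`, so the caller knows there is nobody left to page.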
📊 SLO Enforcement & Error Budgets
| Capability | Description |
|---|---|
| SLO tracking | Track Service Level Objectives (availability, latency, error rate) |
| Error budget calculation | Calculate remaining error budget for each service |
| Budget-aware decisions | Switch remediation strategy based on budget state: fast rollback vs. careful investigation |
| Budget alerts | Warn when error budget drops below threshold (e.g., 20% remaining) |
| Burn rate analysis | Detect abnormal error budget consumption and trigger proactive triage |
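The arithmetic behind these rows is simple enough to sketch. The functions below assume a request-based SLO and the common convention that a burn rate of 1 consumes the whole budget in exactly one window; this page does not say which definitions the agent will use, so treat every name and default here (including the 30-day window and the 20% threshold taken from the table's example) as an assumption.

```python
def _check_target(slo_target: float) -> None:
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (0.0 to 1.0)."""
    _check_target(slo_target)
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / allowed_failures)


def burn_rate(slo_target: float, observed_error_rate: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    _check_target(slo_target)
    return observed_error_rate / (1.0 - slo_target)


def days_to_exhaustion(
    budget_remaining: float, rate: float, window_days: float = 30.0
) -> float:
    """Days until the budget hits zero if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_days / rate


def needs_alert(budget_remaining: float, threshold: float = 0.20) -> bool:
    """True when the remaining budget has dropped below the threshold."""
    return budget_remaining < threshold
```

For a 99.9% SLO with 1,000,000 requests and 800 failures, about 20% of the budget remains. With half the budget left and a burn rate of 2, `days_to_exhaustion(0.5, 2.0)` gives 7.5 days on a 30-day window.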
📋 Runbook Automation
Pre-approved remediation playbooks that the agent can execute with HITL approval:
| Runbook | Trigger | Action | Delegates To |
|---|---|---|---|
| High error rate | Error rate > SLO threshold for 5m | Rollback last deployment | K8s Autopilot (Helm/Argo) |
| Pod crash storm | > 3 pods in CrashLoopBackOff | Investigate logs + scale healthy pods | K8s Autopilot (K8s Operator) |
| Memory pressure | OOMKilled events detected | Increase memory limits | K8s Autopilot (K8s Operator) |
| Database saturation | Connection count > 80% of max | Scale read replicas | AWS Orchestrator |
| Upstream timeout | p99 latency > 5s for 3m | Check dependency health | Monitoring Agent |
Runbooks always require HITL approval before execution. The SRE Agent proposes the remediation — it never executes destructive actions autonomously.
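A runbook registry with an approval gate might look something like this. It is a sketch under assumptions: the `Runbook` dataclass, the signal keys (`crashloop_pods`, `oomkilled_events`, `db_connection_ratio`), and the `propose`/`execute` functions are all hypothetical, and the time-windowed triggers from the table (5m, 3m) are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Runbook:
    name: str
    matches: Callable[[Dict[str, float]], bool]  # trigger predicate
    action: str
    delegates_to: str


# Three of the runbooks from the table, with assumed signal names.
RUNBOOKS = [
    Runbook(
        "Pod crash storm",
        lambda s: s.get("crashloop_pods", 0) > 3,
        "Investigate logs + scale healthy pods",
        "K8s Autopilot (K8s Operator)",
    ),
    Runbook(
        "Memory pressure",
        lambda s: s.get("oomkilled_events", 0) > 0,
        "Increase memory limits",
        "K8s Autopilot (K8s Operator)",
    ),
    Runbook(
        "Database saturation",
        lambda s: s.get("db_connection_ratio", 0.0) > 0.80,
        "Scale read replicas",
        "AWS Orchestrator",
    ),
]


def propose(signals: Dict[str, float]) -> List[Runbook]:
    """Return every runbook whose trigger matches the observed signals."""
    return [rb for rb in RUNBOOKS if rb.matches(signals)]


def execute(runbook: Runbook, approved: bool) -> str:
    """Delegate the action only after explicit human approval."""
    if not approved:
        return f"REJECTED: {runbook.name} not executed"
    return f"DELEGATED to {runbook.delegates_to}: {runbook.action}"
```

Note that `execute` has no default for `approved`: in this sketch the caller must always pass the human decision explicitly, so there is no code path that runs a runbook without one.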
📝 Automated Postmortem Generation
After every incident, the agent generates a structured postmortem:
| Section | Source |
|---|---|
| Timeline | Reconstructed from incident events, agent actions, and timestamps |
| Impact | Blast radius, affected services, user impact, SLO burn |
| Root cause | Diagnosis findings from K8s Autopilot + Monitoring Agent |
| Remediation | Actions taken (automated + manual) with outcomes |
| Action items | Recommended follow-ups to prevent recurrence |
| Metrics | Time-to-detect, time-to-acknowledge, time-to-resolve |
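The Metrics row and the overall document shape can be sketched as below. The definitions are assumptions, since teams measure these intervals differently: here time-to-detect and time-to-resolve are measured from the start of impact, and time-to-acknowledge from detection. `incident_metrics` and `render_postmortem` are hypothetical helpers, and the output is plain text rather than any particular postmortem template.

```python
from datetime import datetime, timedelta
from typing import Dict

# Section order taken from the table above.
SECTIONS = [
    "Timeline",
    "Impact",
    "Root cause",
    "Remediation",
    "Action items",
    "Metrics",
]


def incident_metrics(
    started: datetime,
    detected: datetime,
    acknowledged: datetime,
    resolved: datetime,
) -> Dict[str, timedelta]:
    """Derive the three headline intervals from four timestamps."""
    return {
        "time_to_detect": detected - started,
        "time_to_acknowledge": acknowledged - detected,
        "time_to_resolve": resolved - started,
    }


def render_postmortem(title: str, content: Dict[str, str]) -> str:
    """Emit every section in order, marking missing ones as TODO."""
    lines = [f"Postmortem: {title}"]
    for section in SECTIONS:
        lines.append("")
        lines.append(section)
        lines.append(content.get(section, "TODO"))
    return "\n".join(lines)
```

Rendering missing sections as `TODO` instead of dropping them keeps the structure stable, which makes gaps obvious to whoever reviews the draft.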
Example scenarios (what you'll be able to do)
Incident triggered by PagerDuty
[PagerDuty → SRE Agent]: CRITICAL — checkout-service error rate at 8.5%
(threshold: 1%). Firing for 5 minutes.
SRE Agent:
1. Acknowledges on PagerDuty
2. Queries Monitoring Agent → confirms 8.5% error rate
3. Queries K8s Autopilot → finds 3 pods in CrashLoopBackOff, OOMKilled
4. Matches runbook: "Memory pressure"
5. Proposes: "Increase memory limit from 512Mi to 768Mi" → [APPROVE / REJECT]
6. On approval → delegates to K8s Autopilot
7. Verifies fix → error rate drops to 0.3%
8. Resolves incident on PagerDuty
9. Generates postmortem
Proactive error budget alert
User: "How's the checkout error budget looking?"
SRE Agent: "⚠️ checkout-service error budget is at 15% remaining (30-day window).
Current burn rate: 3.2x normal.
At this rate, the budget exhausts in ~4.7 days.
Root cause: p99 latency increased from 180ms to 420ms after
yesterday's deploy (v2.3.1).
Recommended action:
1. Roll back to v2.3.0 (reduces burn rate by ~2.1x)
2. Investigate memory leak in v2.3.1
Execute rollback? [APPROVE / REJECT]"
Cross-agent incident investigation
User: "The payments API is timing out. Figure out what's wrong."
SRE Agent:
1. Opens incident → severity P2
2. Delegates to Monitoring Agent → Datadog shows p99 at 3.2s (normal: 200ms)
3. Delegates to K8s Autopilot → pods healthy, no crashes
4. Delegates to Monitoring Agent → CloudWatch shows RDS CPU at 95%
5. Root cause: slow database queries from a missing index
"Root cause: RDS instance db-payments-prod is at 95% CPU.
Top query: SELECT * FROM transactions WHERE user_id = ? (full table scan).
Recommended actions:
1. Add index on transactions.user_id (manual — requires DBA)
2. Scale RDS to db.r6g.xlarge (temporary relief) → [APPROVE / REJECT]
I've posted findings to the incident channel."
How it fits in the TalkOps ecosystem
The SRE Agent is the coordination layer: it doesn't replace any existing agent but brings them together during incidents.
Roadmap
| Phase | Target | Status |
|---|---|---|
| Phase 1 | PagerDuty/OpsGenie integration + incident lifecycle | 🔨 In development |
| Phase 2 | Cross-agent delegation (K8s Autopilot + Monitoring Agent) | 📋 Planned |
| Phase 3 | Runbook automation with HITL approval | 📋 Planned |
| Phase 4 | SLO tracking & error budget enforcement | 📋 Planned |
| Phase 5 | Automated postmortem generation | 📋 Planned |
Stay updated
- ⭐ Star the TalkOps repo to get notified when the SRE Agent ships
- 💬 Join our Discord to follow development updates
- 📝 Request a feature — tell us which incident management integrations to prioritize