SRE Agent

Status: In Development

The SRE Agent is currently in active development. This page outlines the planned architecture and capabilities. We'll update it as features ship.

What we're building

Your infrastructure already has agents that can debug pods, query metrics, roll back deployments, and manage alerts. What's missing is someone to coordinate all of them during an incident: the 3 AM engineer who acknowledges the page, triages the blast radius, pulls the right data from the right systems, decides on a remediation action, and writes the postmortem.

The SRE Agent fills that gap. It's not another monitoring agent or another Kubernetes operator. It's the incident commander — an orchestration layer that sits above the other TalkOps agents and drives the full incident lifecycle from page to postmortem.

What it does vs what already exists

Capability	Covered By	SRE Agent's Role
Pod debugging, scaling, restarts	K8s Autopilot — K8s Operator	Delegates pod-level remediation
Helm rollback, Argo Rollouts abort	K8s Autopilot — Helm / App Operator	Delegates deployment rollbacks
Prometheus queries, alert triage	K8s Autopilot — Observability	Delegates metric collection
Datadog, CloudWatch, Grafana	Monitoring Agent	Delegates cloud-level metrics
PagerDuty / OpsGenie integration	Nobody today	✅ Owns this
Incident lifecycle orchestration	Nobody today	✅ Owns this
Cross-agent coordination during incidents	Nobody today	✅ Owns this
SLO enforcement & error budget tracking	Nobody today	✅ Owns this
Runbook automation	Nobody today	✅ Owns this
Automated postmortem generation	Nobody today	✅ Owns this

Planned architecture

The SRE Agent acts as an orchestration layer that delegates work to other TalkOps agents rather than duplicating their capabilities.

Design principles

Orchestrate, don't duplicate: the SRE Agent never queries Prometheus directly or restarts pods itself. It delegates to the agents that already do this well.
Incident-first: everything is scoped to the incident lifecycle. This is not a general-purpose agent.
Escalation, not automation: auto-remediation only runs pre-approved runbooks behind human-in-the-loop (HITL) approval gates. The agent never takes irreversible action without explicit sign-off.
Error budget awareness: every remediation decision is informed by the SLO error budget. If the budget is exhausted, the agent switches from "move fast" to "be cautious."
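
To make the "orchestrate, don't duplicate" principle concrete, here is a minimal Python sketch of how delegation could be modeled. The task names, the `DELEGATION_TABLE` mapping, and the `delegate` function are all hypothetical and only illustrate the idea; the SRE Agent's actual interfaces are not yet published.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical task categories mapped to the agents named on this page.
DELEGATION_TABLE: Dict[str, str] = {
    "pod_remediation": "K8s Autopilot (K8s Operator)",
    "deployment_rollback": "K8s Autopilot (Helm / App Operator)",
    "metric_collection": "K8s Autopilot (Observability)",
    "cloud_metrics": "Monitoring Agent",
}


@dataclass
class Delegation:
    """A record of work handed to another agent (nothing is executed here)."""
    agent: str
    task: str
    payload: dict


def delegate(task: str, payload: dict) -> Delegation:
    """Pick the owning agent for a task instead of doing the work locally."""
    if task not in DELEGATION_TABLE:
        raise ValueError(f"no agent registered for task {task!r}")
    return Delegation(agent=DELEGATION_TABLE[task], task=task, payload=payload)


if __name__ == "__main__":
    d = delegate("deployment_rollback", {"service": "checkout-service"})
    print(f"{d.task} -> {d.agent}")
```

In this sketch the orchestrator only produces a delegation record. Anything that touches a cluster or a metrics backend stays inside the agent that owns it.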

Planned capabilities

🚨 Incident Lifecycle Management

The core capability — driving an incident from trigger to resolution:

Phase	What the SRE Agent does
Trigger	Receives an incident from PagerDuty/OpsGenie, or a user reports an issue
Acknowledge	Acknowledges the page, notifies the incident channel, sets severity
Triage	Assesses blast radius — which services, how many users, what's the SLO impact?
Diagnose	Delegates to Monitoring Agent (metrics) + K8s Autopilot (pods) to gather evidence
Remediate	Matches symptoms to runbooks, presents remediation options with HITL approval
Verify	Confirms fix worked — re-checks metrics, verifies pod health
Resolve	Closes the incident on PagerDuty/OpsGenie, updates status page
Postmortem	Generates a structured postmortem from incident timeline + actions taken
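
The phases above behave like a small state machine. The following Python sketch shows one way to encode them. The `Phase` enum and `advance` function are illustrative names, and the `VERIFY` to `DIAGNOSE` retry edge is an assumption about what happens when a fix does not hold.

```python
from enum import Enum
from typing import Dict, Set


class Phase(Enum):
    TRIGGER = "trigger"
    ACKNOWLEDGE = "acknowledge"
    TRIAGE = "triage"
    DIAGNOSE = "diagnose"
    REMEDIATE = "remediate"
    VERIFY = "verify"
    RESOLVE = "resolve"
    POSTMORTEM = "postmortem"


# Allowed transitions between phases.
# Assumption: a failed verification sends the incident back to DIAGNOSE.
TRANSITIONS: Dict[Phase, Set[Phase]] = {
    Phase.TRIGGER: {Phase.ACKNOWLEDGE},
    Phase.ACKNOWLEDGE: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.DIAGNOSE},
    Phase.DIAGNOSE: {Phase.REMEDIATE},
    Phase.REMEDIATE: {Phase.VERIFY},
    Phase.VERIFY: {Phase.RESOLVE, Phase.DIAGNOSE},
    Phase.RESOLVE: {Phase.POSTMORTEM},
    Phase.POSTMORTEM: set(),
}


def advance(current: Phase, nxt: Phase) -> Phase:
    """Move to the next phase, rejecting transitions that skip steps."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt


if __name__ == "__main__":
    phase = Phase.TRIGGER
    for step in (Phase.ACKNOWLEDGE, Phase.TRIAGE, Phase.DIAGNOSE):
        phase = advance(phase, step)
    print(phase.name)
```

An explicit transition table like this makes it impossible to resolve an incident that was never verified, which matches the lifecycle order described in the table.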

📟 PagerDuty / OpsGenie Integration

Capability	Description
Incident ingestion	Receive and parse incident triggers with full alert context
Acknowledge	Auto-acknowledge on triage start (configurable)
Escalation	Escalate to the next on-call responder if remediation fails
Status updates	Post triage findings and remediation progress as incident notes
Resolution	Close incidents with structured resolution summary
On-call lookup	Query current on-call schedule for escalation routing
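
As an illustration of the escalation row, the sketch below picks the next responder from an on-call chain. It deliberately avoids any PagerDuty or OpsGenie API calls: the `on_call_chain` list stands in for whatever the on-call lookup would return, and `next_responder` is a hypothetical helper.

```python
from typing import List, Optional


def next_responder(on_call_chain: List[str], current: Optional[str]) -> Optional[str]:
    """Return who to page next, or None when the chain is exhausted.

    on_call_chain is ordered from first responder to last escalation level.
    current is the responder already paged, or None if nobody has been paged.
    """
    if not on_call_chain:
        return None
    if current is None:
        return on_call_chain[0]
    idx = on_call_chain.index(current)  # raises ValueError for an unknown responder
    if idx + 1 < len(on_call_chain):
        return on_call_chain[idx + 1]
    return None


if __name__ == "__main__":
    chain = ["primary", "secondary", "engineering-manager"]
    print(next_responder(chain, "primary"))
```

Returning `None` at the end of the chain leaves the decision about what to do next (for example, notifying the incident channel) to the caller.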

📊 SLO Enforcement & Error Budgets

Capability	Description
SLO tracking	Track Service Level Objectives (availability, latency, error rate)
Error budget calculation	Calculate remaining error budget for each service
Budget-aware decisions	Switch remediation strategy based on budget state: fast rollback vs. careful investigation
Budget alerts	Warn when error budget drops below threshold (e.g., 20% remaining)
Burn rate analysis	Detect abnormal error budget consumption and trigger proactive triage
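
The arithmetic behind these rows can be sketched in a few lines of Python. This uses the common request-based definitions, where a burn rate of 1 means the budget is consumed exactly over the SLO window. The function names and the 30-day default window are assumptions for illustration, not the agent's actual implementation.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, <= 0 = exhausted)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def burn_rate(slo_target: float, observed_error_rate: float) -> float:
    """How fast the budget is being spent; 1.0 spends it exactly over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return observed_error_rate / (1.0 - slo_target)


def days_until_exhausted(remaining_fraction: float, burn: float, window_days: float = 30.0) -> float:
    """Days until the remaining budget is gone at the current burn rate."""
    if burn <= 0:
        return float("inf")
    return remaining_fraction * window_days / burn


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, 1_000_000, 800)
    rate = burn_rate(0.999, 0.002)
    print(round(remaining, 3), round(rate, 2), round(days_until_exhausted(remaining, rate), 1))
```

A budget alert at "20% remaining" would then be a simple comparison against the value returned by `error_budget_remaining`.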

📋 Runbook Automation

Pre-approved remediation playbooks that the agent can execute with HITL approval:

Runbook	Trigger	Action	Delegates To
High error rate	Error rate > SLO threshold for 5m	Rollback last deployment	K8s Autopilot (Helm/Argo)
Pod crash storm	> 3 pods in CrashLoopBackOff	Investigate logs + scale healthy pods	K8s Autopilot (K8s Operator)
Memory pressure	OOMKilled events detected	Increase memory limits	K8s Autopilot (K8s Operator)
Database saturation	Connection count > 80% of max	Scale read replicas	AWS Orchestrator
Upstream timeout	p99 latency > 5s for 3m	Check dependency health	Monitoring Agent

Caution: Runbooks always require HITL approval before execution. The SRE Agent proposes the remediation; it never executes destructive actions autonomously.
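
A minimal sketch of runbook matching with an approval gate might look like the following. The symptom keys (`crashloop_pods`, `oomkilled_events`, `db_connection_ratio`) and the `Runbook`, `match_runbook`, and `execute` names are hypothetical. Only three of the runbooks from the table are encoded, and nothing is actually executed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Symptoms = Dict[str, float]


@dataclass(frozen=True)
class Runbook:
    name: str
    matches: Callable[[Symptoms], bool]
    action: str
    delegates_to: str


# Triggers follow the runbook table; the symptom keys are assumptions.
RUNBOOKS: List[Runbook] = [
    Runbook("Pod crash storm", lambda s: s.get("crashloop_pods", 0) > 3,
            "Investigate logs + scale healthy pods", "K8s Autopilot (K8s Operator)"),
    Runbook("Memory pressure", lambda s: s.get("oomkilled_events", 0) > 0,
            "Increase memory limits", "K8s Autopilot (K8s Operator)"),
    Runbook("Database saturation", lambda s: s.get("db_connection_ratio", 0.0) > 0.80,
            "Scale read replicas", "AWS Orchestrator"),
]


def match_runbook(symptoms: Symptoms) -> Optional[Runbook]:
    """Return the first runbook whose trigger matches, or None."""
    for runbook in RUNBOOKS:
        if runbook.matches(symptoms):
            return runbook
    return None


def execute(runbook: Runbook, approved: bool) -> str:
    """HITL gate: without explicit approval, nothing is delegated."""
    if not approved:
        return "rejected: no action taken"
    return f"delegated '{runbook.action}' to {runbook.delegates_to}"


if __name__ == "__main__":
    proposal = match_runbook({"crashloop_pods": 3, "oomkilled_events": 3})
    if proposal is not None:
        print(proposal.name)
        print(execute(proposal, approved=False))
```

Keeping the approval flag as a required argument of `execute` means there is no code path that delegates a remediation without a human decision.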

📝 Automated Postmortem Generation

After every incident, the agent generates a structured postmortem:

Section	Source
Timeline	Reconstructed from incident events, agent actions, and timestamps
Impact	Blast radius, affected services, user impact, SLO burn
Root cause	Diagnosis findings from K8s Autopilot + Monitoring Agent
Remediation	Actions taken (automated + manual) with outcomes
Action items	Recommended follow-ups to prevent recurrence
Metrics	Time-to-detect, time-to-acknowledge, time-to-resolve
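
The metrics row can be derived directly from the incident timeline. The sketch below assumes a hypothetical `Event` record with four event kinds (`started`, `detected`, `acknowledged`, `resolved`). Definitions of these metrics vary between teams; here time-to-acknowledge is measured from detection and time-to-resolve from the start of the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Event:
    at: datetime
    kind: str  # hypothetical kinds: "started", "detected", "acknowledged", "resolved"
    note: str = ""


def incident_metrics(events: List[Event]) -> Dict[str, timedelta]:
    """Compute postmortem timing metrics from the first event of each kind."""
    first: Dict[str, datetime] = {}
    for event in sorted(events, key=lambda e: e.at):
        first.setdefault(event.kind, event.at)
    return {
        "time_to_detect": first["detected"] - first["started"],
        "time_to_acknowledge": first["acknowledged"] - first["detected"],
        "time_to_resolve": first["resolved"] - first["started"],
    }


if __name__ == "__main__":
    timeline = [
        Event(datetime(2025, 1, 1, 3, 0), "started"),
        Event(datetime(2025, 1, 1, 3, 5), "detected", "alert fired"),
        Event(datetime(2025, 1, 1, 3, 7), "acknowledged"),
        Event(datetime(2025, 1, 1, 3, 40), "resolved", "fix verified"),
    ]
    for name, value in incident_metrics(timeline).items():
        print(name, value)
```

The same sorted event list could also feed the Timeline section, since it already carries timestamps and notes.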

Example scenarios (what you'll be able to do)

Incident triggered by PagerDuty

[PagerDuty → SRE Agent]: CRITICAL — checkout-service error rate at 8.5%
                         (threshold: 1%). Firing for 5 minutes.

SRE Agent:
  1. Acknowledges on PagerDuty
  2. Queries Monitoring Agent → confirms 8.5% error rate
  3. Queries K8s Autopilot → finds 3 pods in CrashLoopBackOff, OOMKilled
  4. Matches runbook: "Memory pressure"
  5. Proposes: "Increase memory limit from 512Mi to 768Mi" → [APPROVE / REJECT]
  6. On approval → delegates to K8s Autopilot
  7. Verifies fix → error rate drops to 0.3%
  8. Resolves incident on PagerDuty
  9. Generates postmortem

Proactive error budget alert

User: "How's the checkout error budget looking?"

SRE Agent: "⚠️ checkout-service error budget is at 15% remaining (30-day window).
           Current burn rate: 3.2x the sustainable rate.
           At this rate, the budget exhausts in ~1.4 days.

           Root cause: p99 latency increased from 180ms to 420ms after
           yesterday's deploy (v2.3.1).

           Recommended action:
           1. Roll back to v2.3.0 (saves ~2.1x burn rate)
           2. Investigate memory leak in v2.3.1

           Execute rollback? [APPROVE / REJECT]"

Cross-agent incident investigation

User: "The payments API is timing out. Figure out what's wrong."

SRE Agent:
  1. Opens incident → severity P2
  2. Delegates to Monitoring Agent → Datadog shows p99 at 3.2s (normal: 200ms)
  3. Delegates to K8s Autopilot → pods healthy, no crashes
  4. Delegates to Monitoring Agent → CloudWatch shows RDS CPU at 95%
  5. Root cause: slow database queries from a missing index

  "Root cause: RDS instance db-payments-prod is at 95% CPU.
   Top query: SELECT * FROM transactions WHERE user_id = ? (full table scan).

   Recommended actions:
   1. Add index on transactions.user_id (manual — requires DBA)
   2. Scale RDS to db.r6g.xlarge (temporary relief) → [APPROVE / REJECT]

   I've posted findings to the incident channel."

How it fits in the TalkOps ecosystem

The SRE Agent is the coordination layer — it doesn't replace any existing agent, it brings them together during incidents.

Roadmap

Phase	Target	Status
Phase 1	PagerDuty/OpsGenie integration + incident lifecycle	🔨 In development
Phase 2	Cross-agent delegation (K8s Autopilot + Monitoring Agent)	📋 Planned
Phase 3	Runbook automation with HITL approval	📋 Planned
Phase 4	SLO tracking & error budget enforcement	📋 Planned
Phase 5	Automated postmortem generation	📋 Planned

Stay updated

⭐ Star the TalkOps repo to get notified when the SRE Agent ships
💬 Join our Discord to follow development updates
📝 Request a feature — tell us which incident management integrations to prioritize
