SRE Agent

Status: In Development

The SRE Agent is currently in active development. This page outlines the planned architecture and capabilities. We'll update it as features ship.


What we're building

Your infrastructure already has agents that can debug pods, query metrics, roll back deployments, and manage alerts. What's missing is someone to coordinate all of them during an incident: the 3 AM engineer who acknowledges the page, triages the blast radius, pulls the right data from the right systems, decides on a remediation action, and writes the postmortem.

The SRE Agent fills that gap. It's not another monitoring agent or another Kubernetes operator. It's the incident commander — an orchestration layer that sits above the other TalkOps agents and drives the full incident lifecycle from page to postmortem.

What it does vs what already exists

| Capability | Covered By | SRE Agent's Role |
| --- | --- | --- |
| Pod debugging, scaling, restarts | K8s Autopilot (K8s Operator) | Delegates pod-level remediation |
| Helm rollback, Argo Rollouts abort | K8s Autopilot (Helm / App Operator) | Delegates deployment rollbacks |
| Prometheus queries, alert triage | K8s Autopilot (Observability) | Delegates metric collection |
| Datadog, CloudWatch, Grafana | Monitoring Agent | Delegates cloud-level metrics |
| PagerDuty / OpsGenie integration | Nobody today | Owns this |
| Incident lifecycle orchestration | Nobody today | Owns this |
| Cross-agent coordination during incidents | Nobody today | Owns this |
| SLO enforcement & error budget tracking | Nobody today | Owns this |
| Runbook automation | Nobody today | Owns this |
| Automated postmortem generation | Nobody today | Owns this |

Planned architecture

The SRE Agent acts as an orchestration layer that delegates work to other TalkOps agents rather than duplicating their capabilities.

Design principles

  1. Orchestrate, don't duplicate: the SRE Agent never queries Prometheus directly or restarts pods. It delegates to the agents that already do this well.
  2. Incident-first: everything is scoped to the incident lifecycle. This is not a general-purpose agent.
  3. Escalation, not automation: auto-remediation only runs pre-approved runbooks behind human-in-the-loop (HITL) approval gates. The agent never takes irreversible action without explicit sign-off.
  4. Error budget awareness: every remediation decision is informed by the SLO error budget. If the budget is exhausted, the agent switches from "move fast" to "be cautious."
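
As a rough illustration of the first principle, delegation can be pictured as a lookup from a task category to the agent that already owns that work. The sketch below is hypothetical: `TaskKind`, `DELEGATES`, and `route` are names invented for this example and are not the SRE Agent's actual interface. Only the agent names come from the capability table on this page.

```python
from enum import Enum


class TaskKind(Enum):
    # Hypothetical task categories, loosely named after the capability table.
    POD_REMEDIATION = "pod_remediation"
    DEPLOYMENT_ROLLBACK = "deployment_rollback"
    METRIC_COLLECTION = "metric_collection"
    CLOUD_METRICS = "cloud_metrics"


# Which existing agent owns each kind of work (agent names from this page).
DELEGATES = {
    TaskKind.POD_REMEDIATION: "K8s Autopilot (K8s Operator)",
    TaskKind.DEPLOYMENT_ROLLBACK: "K8s Autopilot (Helm / App Operator)",
    TaskKind.METRIC_COLLECTION: "K8s Autopilot (Observability)",
    TaskKind.CLOUD_METRICS: "Monitoring Agent",
}


def route(task: TaskKind) -> str:
    """Return the name of the agent that should perform the task.

    The orchestrator never performs these tasks itself. A task with no
    registered owner is an error rather than something to improvise.
    """
    try:
        return DELEGATES[task]
    except KeyError:
        raise ValueError(f"no agent registered for {task!r}") from None
```

Keeping the mapping as plain data makes it easy to see at a glance that the orchestrator owns no remediation logic of its own.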

Planned capabilities

🚨 Incident Lifecycle Management

The core capability — driving an incident from trigger to resolution:

| Phase | What the SRE Agent does |
| --- | --- |
| Trigger | Receives incident from PagerDuty/OpsGenie, or user reports an issue |
| Acknowledge | Acknowledges the page, notifies the incident channel, sets severity |
| Triage | Assesses blast radius: which services, how many users, what's the SLO impact? |
| Diagnose | Delegates to Monitoring Agent (metrics) + K8s Autopilot (pods) to gather evidence |
| Remediate | Matches symptoms to runbooks, presents remediation options with HITL approval |
| Verify | Confirms the fix worked: re-checks metrics, verifies pod health |
| Resolve | Closes the incident on PagerDuty/OpsGenie, updates status page |
| Postmortem | Generates a structured postmortem from incident timeline + actions taken |
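
The phases above read naturally as a small state machine. The following is a minimal sketch under stated assumptions, not the planned implementation: the `Phase` enum mirrors the table, while `next_phase`, `happy_path`, and the rule that a failed verification returns to diagnosis are hypothetical choices made for illustration (the page does not say what happens when a fix fails to verify).

```python
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    TRIGGER = 1
    ACKNOWLEDGE = 2
    TRIAGE = 3
    DIAGNOSE = 4
    REMEDIATE = 5
    VERIFY = 6
    RESOLVE = 7
    POSTMORTEM = 8


def next_phase(current: Phase, fix_verified: bool = True) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the postmortem.

    Assumption: a failed verification sends the incident back to DIAGNOSE.
    """
    if current is Phase.VERIFY and not fix_verified:
        return Phase.DIAGNOSE
    if current is Phase.POSTMORTEM:
        return None
    return Phase(current.value + 1)


def happy_path() -> List[Phase]:
    """Walk the lifecycle from trigger to postmortem with no failed fixes."""
    path = [Phase.TRIGGER]
    while (nxt := next_phase(path[-1])) is not None:
        path.append(nxt)
    return path
```

Calling `happy_path()` returns the eight phases in table order. Calling `next_phase(Phase.VERIFY, fix_verified=False)` returns `Phase.DIAGNOSE` under the assumption stated above.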

📟 PagerDuty / OpsGenie Integration

| Capability | Description |
| --- | --- |
| Incident ingestion | Receive and parse incident triggers with full alert context |
| Acknowledge | Auto-acknowledge on triage start (configurable) |
| Escalation | Escalate to the next on-call responder if remediation fails |
| Status updates | Post triage findings and remediation progress as incident notes |
| Resolution | Close incidents with structured resolution summary |
| On-call lookup | Query current on-call schedule for escalation routing |

📊 SLO Enforcement & Error Budgets

| Capability | Description |
| --- | --- |
| SLO tracking | Track Service Level Objectives (availability, latency, error rate) |
| Error budget calculation | Calculate remaining error budget for each service |
| Budget-aware decisions | Switch remediation strategy based on budget state: fast rollback vs. careful investigation |
| Budget alerts | Warn when error budget drops below threshold (e.g., 20% remaining) |
| Burn rate analysis | Detect abnormal error budget consumption and trigger proactive triage |
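
To make the error budget terms concrete, here is a small sketch of the arithmetic for a request-based availability SLO. It assumes one common convention, which this page does not itself define: the budget is the fraction of requests allowed to fail, and a burn rate of 1.0 means the budget would be spent exactly at the end of the SLO window. The function names are hypothetical, and the 20% alert threshold is the example value from the table.

```python
def _check_target(slo_target: float) -> None:
    # A target of exactly 1.0 leaves no budget and would divide by zero.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    _check_target(slo_target)
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def burn_rate(slo_target: float, good: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    _check_target(slo_target)
    if total == 0:
        return 0.0
    observed_error_rate = (total - good) / total
    return observed_error_rate / (1.0 - slo_target)


def budget_alert(remaining: float, threshold: float = 0.20) -> bool:
    """True when the remaining budget has dropped below the threshold."""
    return remaining < threshold
```

For example, with a 99.9% target, 1,000,000 requests, and 600 failures, 1,000 failures were allowed, so about 40% of the budget remains and no alert fires at the 20% threshold.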

📋 Runbook Automation

Pre-approved remediation playbooks that the agent can execute with HITL approval:

| Runbook | Trigger | Action | Delegates To |
| --- | --- | --- | --- |
| High error rate | Error rate > SLO threshold for 5m | Roll back last deployment | K8s Autopilot (Helm/Argo) |
| Pod crash storm | > 3 pods in CrashLoopBackOff | Investigate logs + scale healthy pods | K8s Autopilot (K8s Operator) |
| Memory pressure | OOMKilled events detected | Increase memory limits | K8s Autopilot (K8s Operator) |
| Database saturation | Connection count > 80% of max | Scale read replicas | AWS Orchestrator |
| Upstream timeout | p99 latency > 5s for 3m | Check dependency health | Monitoring Agent |
Caution: Runbooks always require HITL approval before execution. The SRE Agent proposes the remediation; it never executes destructive actions autonomously.
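
One way to picture runbook matching with an approval gate is sketched below. Everything here is illustrative: the `Runbook` type, the symptom keys (`crashloop_pods`, `oomkilled_events`, `db_connection_ratio`), and the `propose` and `execute` functions are hypothetical. Only the runbook names, triggers, actions, and delegates come from the runbook table. The time-windowed triggers (5m, 3m) are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Runbook:
    name: str
    matches: Callable[[Dict[str, float]], bool]  # trigger predicate over symptoms
    action: str
    delegate: str


RUNBOOKS = [
    Runbook("Pod crash storm",
            lambda s: s.get("crashloop_pods", 0) > 3,
            "Investigate logs + scale healthy pods",
            "K8s Autopilot (K8s Operator)"),
    Runbook("Memory pressure",
            lambda s: s.get("oomkilled_events", 0) > 0,
            "Increase memory limits",
            "K8s Autopilot (K8s Operator)"),
    Runbook("Database saturation",
            lambda s: s.get("db_connection_ratio", 0.0) > 0.80,
            "Scale read replicas",
            "AWS Orchestrator"),
]


def propose(symptoms: Dict[str, float]) -> List[Runbook]:
    """Return every runbook whose trigger matches. Nothing is executed here."""
    return [rb for rb in RUNBOOKS if rb.matches(symptoms)]


def execute(runbook: Runbook, approved: bool) -> str:
    """HITL gate: without explicit approval the action is never delegated."""
    if not approved:
        return f"REJECTED: {runbook.name} not executed"
    return f"DELEGATED to {runbook.delegate}: {runbook.action}"
```

Separating `propose` from `execute` keeps the approval decision outside the matching logic, so a match alone can never trigger an action.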

📝 Automated Postmortem Generation

After every incident, the agent generates a structured postmortem:

| Section | Source |
| --- | --- |
| Timeline | Reconstructed from incident events, agent actions, and timestamps |
| Impact | Blast radius, affected services, user impact, SLO burn |
| Root cause | Diagnosis findings from K8s Autopilot + Monitoring Agent |
| Remediation | Actions taken (automated + manual) with outcomes |
| Action items | Recommended follow-ups to prevent recurrence |
| Metrics | Time-to-detect, time-to-acknowledge, time-to-resolve |
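
The three metrics in the last row can be derived from timestamps on the incident timeline. The sketch below is one possible reading, with assumptions stated up front: the timeline keys (`started`, `detected`, `acknowledged`, `resolved`) are hypothetical, and both time-to-acknowledge and time-to-resolve are measured from detection. Teams define these intervals differently, and this page does not fix a definition.

```python
from datetime import datetime, timedelta
from typing import Dict


def incident_metrics(timeline: Dict[str, datetime]) -> Dict[str, timedelta]:
    """Compute postmortem timing metrics from four timeline timestamps.

    Assumed keys: started, detected, acknowledged, resolved.
    """
    return {
        "time_to_detect": timeline["detected"] - timeline["started"],
        "time_to_acknowledge": timeline["acknowledged"] - timeline["detected"],
        "time_to_resolve": timeline["resolved"] - timeline["detected"],
    }
```

For a timeline with the fault starting at 03:00, detection at 03:05, acknowledgement at 03:07, and resolution at 03:45, this yields 5 minutes, 2 minutes, and 40 minutes respectively.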

Example scenarios (what you'll be able to do)

Incident triggered by PagerDuty

[PagerDuty → SRE Agent]: CRITICAL — checkout-service error rate at 8.5%
(threshold: 1%). Firing for 5 minutes.

SRE Agent:
1. Acknowledges on PagerDuty
2. Queries Monitoring Agent → confirms 8.5% error rate
3. Queries K8s Autopilot → finds 3 pods in CrashLoopBackOff, OOMKilled
4. Matches runbook: "Memory pressure" (OOMKilled events detected)
5. Proposes: "Increase memory limit from 512Mi to 768Mi" → [APPROVE / REJECT]
6. On approval → delegates to K8s Autopilot
7. Verifies fix → error rate drops to 0.3%
8. Resolves incident on PagerDuty
9. Generates postmortem

Proactive error budget alert

User: "How's the checkout error budget looking?"

SRE Agent: "⚠️ checkout-service error budget is at 15% remaining (30-day window).
Current burn rate: 3.2x normal.
At this rate, the budget exhausts in ~4.7 days.

Root cause: p99 latency increased from 180ms to 420ms after
yesterday's deploy (v2.3.1).

Recommended action:
1. Roll back to v2.3.0 (saves ~2.1x burn rate)
2. Investigate memory leak in v2.3.1

Execute rollback? [APPROVE / REJECT]"

Cross-agent incident investigation

User: "The payments API is timing out. Figure out what's wrong."

SRE Agent:
1. Opens incident → severity P2
2. Delegates to Monitoring Agent → Datadog shows p99 at 3.2s (normal: 200ms)
3. Delegates to K8s Autopilot → pods healthy, no crashes
4. Delegates to Monitoring Agent → CloudWatch shows RDS CPU at 95%
5. Root cause: slow database queries from a missing index

"Root cause: RDS instance db-payments-prod is at 95% CPU.
Top query: SELECT * FROM transactions WHERE user_id = ? (full table scan).

Recommended actions:
1. Add index on transactions.user_id (manual — requires DBA)
2. Scale RDS to db.r6g.xlarge (temporary relief) → [APPROVE / REJECT]

I've posted findings to the incident channel."

How it fits in the TalkOps ecosystem

The SRE Agent is the coordination layer. It doesn't replace any existing agent; it brings them together during incidents.


Roadmap

| Phase | Target | Status |
| --- | --- | --- |
| Phase 1 | PagerDuty/OpsGenie integration + incident lifecycle | 🔨 In development |
| Phase 2 | Cross-agent delegation (K8s Autopilot + Monitoring Agent) | 📋 Planned |
| Phase 3 | Runbook automation with HITL approval | 📋 Planned |
| Phase 4 | SLO tracking & error budget enforcement | 📋 Planned |
| Phase 5 | Automated postmortem generation | 📋 Planned |
