sre-agent: AI-powered incident diagnosis with integrated evaluation for production reliability

Diagnosing production incidents is hard, and in practice it often comes down to manual log sifting and guesswork. The sre-agent project tackles this by combining an AI model with cloud monitoring and source code inspection to produce root-cause analyses and fix suggestions automatically. It integrates with AWS CloudWatch, GitHub, and Slack, and it aims not just to automate diagnosis but also to measure and improve the quality of its output.

what sre-agent does and how it works

sre-agent is an open-source Python AI agent designed for site reliability engineering (SRE) workflows. It reads error logs from AWS CloudWatch, inspects source code through a GitHub MCP integration, generates a root-cause analysis with suggested fixes, and posts the results to a Slack channel for the relevant team.

The architecture is intentionally narrow in scope. It assumes logs are already ingested into CloudWatch and that an external system, such as a CloudWatch metric filter or alarm, triggers the agent. Once triggered, the agent handles the diagnosis pipeline: querying logs, analyzing code, generating a human-readable report, and delivering it via Slack.
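
The four pipeline stages can be pictured as a chain of pluggable steps. The sketch below is illustrative only: `Diagnosis`, `run_diagnosis`, and the step names are hypothetical and are not taken from the sre-agent codebase, and the stand-in lambdas replace the real CloudWatch, GitHub, and Slack calls so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Diagnosis:
    """Hypothetical container for one diagnosis run (not an sre-agent type)."""
    log_group: str
    errors: List[str]
    code_context: str
    report: str


def run_diagnosis(
    log_group: str,
    fetch_errors: Callable[[str], List[str]],
    inspect_code: Callable[[List[str]], str],
    write_report: Callable[[List[str], str], str],
    deliver: Callable[[str], None],
) -> Diagnosis:
    """Run the four stages in order: logs, code, report, delivery."""
    errors = fetch_errors(log_group)            # stage 1: query logs
    code_context = inspect_code(errors)         # stage 2: analyze code
    report = write_report(errors, code_context) # stage 3: build the report
    deliver(report)                             # stage 4: send it out
    return Diagnosis(log_group, errors, code_context, report)


# Stand-in steps so the sketch runs without AWS, GitHub, or Slack access.
sent: List[str] = []
result = run_diagnosis(
    "/aws/ecs/example-service",
    fetch_errors=lambda group: ["KeyError: 'user_id' in handler.py"],
    inspect_code=lambda errs: "handler.py reads event['user_id'] without a default",
    write_report=lambda errs, ctx: f"Root cause: {ctx}. Seen in {len(errs)} log line(s).",
    deliver=sent.append,
)
print(result.report)
```

Passing each stage in as a callable keeps the orchestration testable without credentials, which is one plausible way to structure an agent of this kind.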

Under the hood, the core is a Python CLI tool that can run locally or be deployed remotely on AWS ECS. The default AI model backing the agent is Claude Sonnet 4.5, but this can be overridden via environment variables.

The GitHub integration uses the Model Context Protocol (MCP) to inspect source code from a specified repo and branch. This lets the agent correlate log errors with the actual code context, which improves the relevance of its diagnosis and fix suggestions.
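
To make the idea of correlating a log error with code concrete, here is a minimal sketch that pulls file paths and line numbers out of a Python traceback found in a log message. It is an assumption-laden illustration, not the agent's method: sre-agent delegates code lookup to the model and the GitHub MCP integration, and `Frame` and `frames_from_log` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Frame(NamedTuple):
    """One traceback frame: where in the code the error passed through."""
    path: str
    line: int
    function: str


# Matches the standard CPython traceback frame line:
#   File "<path>", line <n>, in <function>
_FRAME = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def frames_from_log(message: str) -> List[Frame]:
    """Extract every traceback frame mentioned in a log message."""
    return [
        Frame(m["path"], int(m["line"]), m["func"])
        for m in _FRAME.finditer(message)
    ]


log_message = (
    "Traceback (most recent call last):\n"
    '  File "/app/service/handler.py", line 42, in handle\n'
    '    user = event["user_id"]\n'
    "KeyError: 'user_id'"
)
frames = frames_from_log(log_message)
print(frames)
```

Each extracted path and line number is the kind of pointer that could then be looked up in the configured repo and branch.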

why the evaluation suite matters and what sets sre-agent apart

Most AI agent projects focus on building capabilities but skip rigorous evaluation of their outputs — especially in complex workflows like incident diagnosis. sre-agent is different in that it includes a dedicated evaluation suite baked into the project from day one.

This suite includes commands to evaluate the agent’s tool-use behavior (how it calls APIs and interacts with external systems) as well as the quality of the produced diagnoses. Having evaluation as a first-class feature allows developers to track improvements, detect regressions, and reason about tradeoffs like cost, safety, and observability.
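
As a rough illustration of what a tool-use check might measure, the sketch below compares the tools an agent called against the tools a test case expects. The function `score_tool_use`, the metric names, and the tool names are all hypothetical; the actual evaluation suite may score runs differently.

```python
from typing import Dict, List


def score_tool_use(expected: List[str], actual: List[str]) -> Dict[str, object]:
    """Compare an agent's tool-call trace against the expected tool calls."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)      # expected but never called
    unexpected = sorted(actual_set - expected_set)   # called but not expected

    # Subsequence check: expected tools must appear in the trace in the same
    # relative order, though other calls may be interleaved between them.
    trace = iter(actual)
    in_order = all(tool in trace for tool in expected)

    # Fraction of expected tools that were called at least once.
    recall = 1.0 if not expected_set else len(expected_set & actual_set) / len(expected_set)
    return {
        "missing": missing,
        "unexpected": unexpected,
        "in_order": in_order,
        "recall": recall,
    }


# Hypothetical tool names for a single test case.
expected_calls = ["query_logs", "read_source_file", "post_to_slack"]
actual_calls = ["query_logs", "query_logs", "post_to_slack"]
score = score_tool_use(expected_calls, actual_calls)
print(score)
```

Tracking numbers like these across runs is what makes a regression visible, for example when a prompt change causes the agent to skip the code inspection step.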

The codebase reflects this focus with clear separation of concerns, modular tool integrations, and explicit support for evaluation modes. The diagnostic logic is not just a black box LLM call — it uses structured prompts, context window management, and multi-step reasoning to handle real-world log data and code.

The tradeoff here is intentional scope limitation. sre-agent does not try to be a full autonomous incident management system. It relies on CloudWatch for log ingestion and triggering, and AWS ECS for remote deployment. This keeps the agent focused and practical but means it’s tied to AWS-centric environments.

The code is clean for an AI agent project, with well-documented setup steps, configuration management, and a CLI that guides users through initial setup, including API keys and tokens. The project's emphasis on observability and cost control is worth understanding for anyone building production AI agents.

quick start

prerequisites

Python 3.13+
Docker (required for local mode)

install the sre agent

pip install sre-agent

start the CLI

sre-agent

On first run, the setup wizard will guide you through configuration. It will ask for:

ANTHROPIC_API_KEY
GITHUB_PERSONAL_ACCESS_TOKEN
GITHUB_OWNER, GITHUB_REPO, GITHUB_REF
SLACK_BOT_TOKEN, SLACK_CHANNEL_ID
AWS credentials (AWS_PROFILE or access keys) and AWS_REGION
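
If you prefer to check your environment before launching the wizard, a small script can report which settings are absent. This is a sketch under assumptions: it supposes the values are supplied as environment variables, and it uses the standard AWS SDK names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` for the access keys, which the list above does not spell out. `missing_settings` is a hypothetical helper, not part of sre-agent.

```python
import os
from typing import List, Mapping

# Names taken from the setup wizard's prompts.
REQUIRED = [
    "ANTHROPIC_API_KEY",
    "GITHUB_PERSONAL_ACCESS_TOKEN",
    "GITHUB_OWNER",
    "GITHUB_REPO",
    "GITHUB_REF",
    "SLACK_BOT_TOKEN",
    "SLACK_CHANNEL_ID",
    "AWS_REGION",
]


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the settings that are unset or empty in the given environment."""
    missing = [name for name in REQUIRED if not env.get(name)]

    # AWS credentials: either a named profile or an access key pair.
    # The key variable names are an assumption (standard AWS SDK names).
    has_profile = bool(env.get("AWS_PROFILE"))
    has_keys = bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))
    if not (has_profile or has_keys):
        missing.append("AWS_PROFILE or AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY")
    return missing


if __name__ == "__main__":
    for name in missing_settings(os.environ):
        print(f"missing: {name}")
```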

By default, the agent uses the claude-sonnet-4-5-20250929 model, but you can override this by setting the MODEL environment variable.
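
The default-with-override behaviour can be sketched in a few lines. Only the `MODEL` variable name and the default model ID come from the text above; `resolve_model` is a hypothetical helper, and treating a blank value as unset is an assumption about how the agent behaves.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "claude-sonnet-4-5-20250929"


def resolve_model(env: Mapping[str, str] = os.environ) -> str:
    """Return the MODEL override if one is set, else the default model ID."""
    override = env.get("MODEL", "").strip()
    # Assumption: an empty or whitespace-only value falls back to the default.
    return override or DEFAULT_MODEL


print(resolve_model({}))
```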

After setup, you can choose between two running modes:

Local: run diagnoses from your machine against a CloudWatch log group.
Remote Deployment: deploy and run the agent on AWS ECS.

Remote mode currently supports only AWS ECS as a deployment target.

verdict

sre-agent targets SRE teams and developers working within AWS environments who want to automate incident diagnosis using AI. Its tight integration with CloudWatch, GitHub MCP, and Slack makes it a practical tool for those already invested in these platforms.

The standout feature is its baked-in evaluation suite, which addresses a real gap in AI agent projects by enabling measurement of diagnosis quality and tool-use behavior. This makes it a valuable learning vehicle for anyone building or operating AI-powered SRE tools.

That said, the project’s scope is deliberately narrow — it assumes CloudWatch logs and external triggers, and it only supports AWS ECS for deployment. It’s not a plug-and-play solution for incident management across diverse cloud environments.

If you want to experiment with AI-assisted incident diagnosis in a real AWS context and appreciate the importance of evaluation and observability, sre-agent is worth a closer look. The CLI-driven setup and clear documentation make it accessible despite the complexity of the domain.

→ GitHub Repo: fuzzylabs/sre-agent ⭐ 220 · Python

Noureddine RAMDI / sre-agent: AI-powered incident diagnosis with integrated evaluation for production reliability