How to Build an "On-Call Agent" using PagerDuty & GPT-4o
It is 3:14 AM. Your phone buzzes with a PagerDuty alert: "CRITICAL: Checkout API Latency > 2000ms". You open your laptop, squinting at the screen, and spend the next 20 minutes grepping through logs to find that a single Redis pod is stuck.
This tutorial changes that workflow. We are going to build a Python-based "On-Call Agent" that wakes up with you (or instead of you). It listens to PagerDuty, reads the error logs, sends them to GPT-4o for analysis, and posts a Root Cause Analysis (RCA) directly to your Slack incident channel.
1. The Architecture
We will build a simple middleware service using Python (Flask) that connects three APIs. Here is the data flow:
- Trigger: PagerDuty sends a webhook when an incident is triggered.
- Context: Our Python script uses the PagerDuty API to fetch the latest `log_entries` for that incident.
- Brain: We send those logs to OpenAI's `gpt-4o` API with a specialized system prompt.
- Action: The script posts the AI's diagnosis to Slack using the `chat.postMessage` method.
2. Prerequisites
Before writing code, ensure you have the following API keys:
- PagerDuty: An API key (read-only is fine for fetching logs) and a Generic Webhook V3 URL (we will generate this in step 4).
- OpenAI: An API key with access to the `gpt-4o` model.
- Slack: A Bot User OAuth Token with `chat:write` permissions.
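With the keys in hand, export them as environment variables under the names the script below expects (the values shown are placeholders):

```shell
export PD_API_KEY="your-pagerduty-read-only-key"
export OPENAI_API_KEY="sk-your-openai-key"
export SLACK_BOT_TOKEN="xoxb-your-slack-bot-token"
```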
3. The Code (Python)
Create a file named `app.py`. We will use Flask to accept the webhook and the OpenAI client to process the data.
```python
import os

from flask import Flask, request, jsonify
from openai import OpenAI
from slack_sdk import WebClient
import pdpyras

app = Flask(__name__)

# Configuration (read from environment variables)
PD_API_KEY = os.getenv("PD_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SLACK_BOT_TOKEN = os.getenv("SLACK_BOT_TOKEN")
SLACK_CHANNEL = "#incidents-ai"

# Initialize clients
pd_session = pdpyras.APISession(PD_API_KEY)
ai_client = OpenAI(api_key=OPENAI_API_KEY)
slack_client = WebClient(token=SLACK_BOT_TOKEN)


@app.route("/webhook", methods=["POST"])
def pagerduty_webhook():
    # Webhook V3 delivers a single "event" object (not the V2 "messages" array)
    data = request.json or {}
    event = data.get("event", {})
    if event.get("event_type") == "incident.triggered":
        incident_id = event["data"]["id"]
        handle_trigger(incident_id)
    return jsonify({"status": "received"}), 200


def handle_trigger(incident_id):
    # 1. Fetch logs from PagerDuty via /incidents/{id}/log_entries
    logs = pd_session.rget(f"/incidents/{incident_id}/log_entries")
    error_snippet = extract_error_from_logs(logs)

    # 2. Ask GPT-4o for an analysis
    response = ai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a Senior SRE. Analyze the following error logs and "
                    "provide 3 probable root causes and a recommended fix."
                ),
            },
            {"role": "user", "content": f"Incident {incident_id} Logs: {error_snippet}"},
        ],
    )
    analysis = response.choices[0].message.content

    # 3. Post the diagnosis to Slack
    slack_client.chat_postMessage(
        channel=SLACK_CHANNEL,
        text=f"*Auto-Analysis for Incident {incident_id}*\n\n{analysis}",
    )


def extract_error_from_logs(logs):
    # Simple helper: grab the first 1000 chars of the first log entry's details.
    # In production you would filter for the "trigger" log entry explicitly.
    return str(logs[0]["channel"]["details"])[:1000]


if __name__ == "__main__":
    app.run(port=5000)
```
This script sets up a listener on `/webhook`. When PagerDuty hits it, we extract the incident ID, query PagerDuty for the full log context, and pass that to GPT-4o.
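Before wiring up PagerDuty, you can sanity-check the parsing logic locally. The sample below follows the Webhook V3 shape (a single `event` object with an `event_type` and a `data` payload); the exact fields are an assumption you should verify against a real delivery in your PagerDuty account:

```python
# Hypothetical sample resembling a PagerDuty V3 "incident.triggered" delivery
sample_payload = {
    "event": {
        "event_type": "incident.triggered",
        "data": {
            "id": "PABC123",
            "title": "CRITICAL: Checkout API Latency > 2000ms",
        },
    }
}

def parse_incident_id(payload: dict):
    """Return the incident ID for triggered incidents, else None."""
    event = payload.get("event", {})
    if event.get("event_type") == "incident.triggered":
        return event.get("data", {}).get("id")
    return None

print(parse_incident_id(sample_payload))  # PABC123
```

You can replay the same JSON against your running Flask app with `curl -X POST http://localhost:5000/webhook -H 'Content-Type: application/json' -d @sample.json` to confirm the endpoint returns `{"status": "received"}`.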
4. The GPT-4o System Prompt
The magic is in the prompt. You don't want a generic answer; you want an SRE's opinion. Use this prompt structure:
```text
You are a Senior SRE. Analyze the following error logs.
1. Identify the failing subsystem (Database, API, Load Balancer).
2. Look for keywords: 'Connection Refused', 'Timeout', 'OOM'.
3. Provide a bulleted list of immediate remediation steps
   (e.g., 'Check AWS RDS CPU usage').
Do NOT be vague. Be technical and precise.
```
5. Next Steps: From "Read Only" to "Action"
Once you trust the agent, you can upgrade it. Instead of just posting to Slack, give the agent tools (using OpenAI Function Calling) to execute safe commands:
- Restart Pod: `kubectl delete pod {pod_name}`
- Clear Cache: `redis-cli flushall`
- Scale ASG: Increase desired capacity by 1.
This moves you from "Observability" into the true "Self-Healing Enterprise" we discussed in our pillar guide.
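As a sketch of that upgrade, the three actions above could be declared as function-calling tools. The tool names and parameters here are illustrative, not a fixed schema; GPT-4o only selects a tool, and your own handler code remains responsible for actually running `kubectl`, `redis-cli`, or the ASG API:

```python
# Hypothetical tool definitions for the "action" phase.
# Pass tools=TOOLS to ai_client.chat.completions.create(...) to enable them.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "restart_pod",
            "description": "Delete a Kubernetes pod so its controller recreates it.",
            "parameters": {
                "type": "object",
                "properties": {"pod_name": {"type": "string"}},
                "required": ["pod_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "clear_cache",
            "description": "Flush the Redis cache (redis-cli flushall).",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "scale_asg",
            "description": "Increase the Auto Scaling Group desired capacity by 1.",
            "parameters": {
                "type": "object",
                "properties": {"asg_name": {"type": "string"}},
                "required": ["asg_name"],
            },
        },
    },
]
```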
Frequently Asked Questions (FAQ)
Q: Is it safe to send production logs to OpenAI?
A: You must sanitize PII (Personally Identifiable Information) before sending logs. We recommend using a regex filter in your Python script to redact emails, IP addresses, and customer names before the API call. For enterprise use, consider Azure OpenAI Service for stronger compliance.
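A minimal sketch of such a redaction filter, covering emails and IPv4 addresses (the patterns are illustrative; customer names generally need a dictionary or NER-based approach rather than a regex):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(text: str) -> str:
    """Replace emails and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = IPV4_RE.sub("[IP]", text)
    return text

log_line = "User jane.doe@example.com from 10.0.3.17: connection refused"
print(redact_pii(log_line))  # User [EMAIL] from [IP]: connection refused
```

Call `redact_pii(error_snippet)` in `handle_trigger` before the logs ever leave your network.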
Q: How much does this cost to run?
A: GPT-4o is cost-effective. A typical 50-line log analysis consumes about 1,000 input tokens ($0.005). If you process 50 incidents a day, your AI cost is less than $10/month—significantly cheaper than one hour of an engineer's time.
Q: Can the agent fix incidents on its own?
A: Yes, but we recommend "Human-in-the-Loop" for the first phase. You can upgrade this script to trigger an Ansible playbook via a "Click to Fix" button in Slack, but only allow the AI to execute low-risk commands (like clearing a cache) initially.