
How to Build an "On-Call Agent" using PagerDuty & GPT-4o


It is 3:14 AM. Your phone buzzes with a PagerDuty alert: "CRITICAL: Checkout API Latency > 2000ms". You open your laptop, squinting at the screen, and spend the next 20 minutes grepping through logs to find that a single Redis pod is stuck.

This tutorial changes that workflow. We are going to build a Python-based "On-Call Agent" that wakes up with you (or instead of you). It listens to PagerDuty, reads the error logs, sends them to GPT-4o for analysis, and posts a Root Cause Analysis (RCA) directly to your Slack incident channel.

"We aren't replacing the SRE. We are giving the SRE a bionic arm. This agent handles the 'Discovery' phase so you can focus on the 'Remediation' phase."

1. The Architecture

We will build a simple middleware service using Python (Flask) that connects three APIs. Here is the data flow:

  • Trigger: PagerDuty sends a webhook when an incident is triggered.
  • Context: Our Python script uses the PagerDuty API to fetch the latest log_entries for that incident.
  • Brain: We send those logs to OpenAI's GPT-4o API with a specialized system prompt.
  • Action: The script posts the AI's diagnosis to Slack using the chat.postMessage method.
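
To make the trigger step concrete, here is roughly what the V3 webhook body our service will receive looks like, abridged to the fields we care about (field names follow PagerDuty's V3 event schema; the incident ID is hypothetical), along with a small helper to pull out the incident ID:

```python
# Abridged sketch of a PagerDuty V3 webhook payload (hypothetical values;
# see PagerDuty's webhook documentation for the full schema).
sample_payload = {
    "event": {
        "event_type": "incident.triggered",
        "resource_type": "incident",
        "data": {
            "id": "PABC123",  # hypothetical incident ID
            "title": "Checkout API Latency > 2000ms",
            "status": "triggered",
        },
    }
}

def extract_incident_id(payload):
    """Return the incident ID if this is a trigger event, else None."""
    event = payload.get("event", {})
    if event.get("event_type") == "incident.triggered":
        return event.get("data", {}).get("id")
    return None

print(extract_incident_id(sample_payload))  # PABC123
```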

2. Prerequisites

Before writing code, ensure you have the following API keys:

  • PagerDuty: An API Key (read-only is fine for fetching logs) and a V3 Webhook Subscription pointed at the /webhook endpoint we build in step 3.
  • OpenAI: An API Key with access to the gpt-4o model.
  • Slack: A Bot User OAuth Token with chat:write permissions.
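
Since every call below depends on these credentials, it is worth failing fast if one is missing. A minimal startup check (variable names match the script below; add it to the top of app.py if you like):

```python
import os
import sys

REQUIRED_VARS = ["PD_API_KEY", "OPENAI_API_KEY", "SLACK_BOT_TOKEN"]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        sys.exit(f"Missing environment variables: {', '.join(missing)}")
```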

3. The Code (Python)

Create a file named app.py. We will use Flask to accept the webhook and the OpenAI client to process the data.

import os
from flask import Flask, request, jsonify
from openai import OpenAI
from slack_sdk import WebClient
import pdpyras

app = Flask(__name__)

# Configuration
PD_API_KEY = os.getenv("PD_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SLACK_BOT_TOKEN = os.getenv("SLACK_BOT_TOKEN")
SLACK_CHANNEL = "#incidents-ai"

# Initialize Clients
pd_session = pdpyras.APISession(PD_API_KEY)
ai_client = OpenAI(api_key=OPENAI_API_KEY)
slack_client = WebClient(token=SLACK_BOT_TOKEN)

@app.route('/webhook', methods=['POST'])
def pagerduty_webhook():
    data = request.json
    event = data.get('event', {})

    # V3 webhook subscriptions deliver a single event per request,
    # with the type in event_type (e.g. 'incident.triggered')
    if event.get('event_type') == 'incident.triggered':
        incident_id = event['data']['id']
        handle_trigger(incident_id)

    return jsonify({"status": "received"}), 200

def handle_trigger(incident_id):
    # 1. Fetch Logs from PagerDuty
    # rget follows pagination and returns the list of log entries,
    # including the trigger entry that carries the alert details.
    logs = pd_session.rget(f"/incidents/{incident_id}/log_entries")
    error_snippet = extract_error_from_logs(logs)

    # 2. Ask GPT-4o for Analysis
    response = ai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a Senior SRE. Analyze the following error logs and provide 3 probable root causes and a recommended fix."},
            {"role": "user", "content": f"Incident {incident_id} Logs: {error_snippet}"}
        ]
    )
    analysis = response.choices[0].message.content

    # 3. Post to Slack
    slack_client.chat_postMessage(
        channel=SLACK_CHANNEL,
        text=f"*Auto-Analysis for Incident {incident_id}*\n\n{analysis}"
    )

def extract_error_from_logs(logs):
    # Grab the first 1000 chars of the trigger log entry's details,
    # guarding against empty results and missing keys
    if not logs:
        return "No log entries found."
    return str(logs[0].get('channel', {}).get('details', logs[0]))[:1000]

if __name__ == '__main__':
    app.run(port=5000)

This script sets up a listener on /webhook. When PagerDuty hits it, we extract the incident ID, query PagerDuty for the full log context, and pass that to GPT-4o.

4. The GPT-4o System Prompt

The magic is in the prompt. You don't want a generic answer; you want an SRE's opinion. Use this prompt structure:

"You are an expert Site Reliability Engineer. You are analyzing a production outage.
1. Identify the subsystem failing (Database, API, Load Balancer).
2. Look for keywords: 'Connection Refused', 'Timeout', 'OOM'.
3. Provide a bulleted list of immediate remediation steps (e.g., 'Check AWS RDS CPU usage').
Do NOT be vague. Be technical and precise."
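
To wire this into app.py, keep the prompt as a module-level constant and build the chat payload from it, so tweaking the prompt never touches the request logic (a small refactor of the call in handle_trigger):

```python
SRE_SYSTEM_PROMPT = (
    "You are an expert Site Reliability Engineer. You are analyzing a production outage.\n"
    "1. Identify the subsystem failing (Database, API, Load Balancer).\n"
    "2. Look for keywords: 'Connection Refused', 'Timeout', 'OOM'.\n"
    "3. Provide a bulleted list of immediate remediation steps (e.g., 'Check AWS RDS CPU usage').\n"
    "Do NOT be vague. Be technical and precise."
)

def build_messages(incident_id, error_snippet):
    """Assemble the messages list for the chat.completions call."""
    return [
        {"role": "system", "content": SRE_SYSTEM_PROMPT},
        {"role": "user", "content": f"Incident {incident_id} Logs: {error_snippet}"},
    ]
```

Then the API call becomes `ai_client.chat.completions.create(model="gpt-4o", messages=build_messages(incident_id, error_snippet))`.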

5. Next Steps: From "Read Only" to "Action"

Once you trust the agent, you can upgrade it. Instead of just posting to Slack, give the agent tools (using OpenAI Function Calling) to execute safe commands:

  • Restart Pod: kubectl delete pod {pod_name}
  • Clear Cache: redis-cli flushall
  • Scale ASG: Increase desired capacity by 1.

This moves you from "Observability" into the true "Self-Healing Enterprise" we discussed in our pillar guide.
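
As a sketch of that upgrade, the three actions above can be declared as tool schemas for OpenAI function calling. The names and parameters here are illustrative, not a fixed API, and each executor must enforce its own allow-list before touching production:

```python
# Hypothetical tool schemas for OpenAI function calling. Pass tools=TOOLS to
# ai_client.chat.completions.create(...); if the model chooses a tool,
# response.choices[0].message.tool_calls carries the function name and JSON
# arguments for YOUR code to validate and execute.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "restart_pod",
            "description": "Delete a Kubernetes pod so its controller recreates it.",
            "parameters": {
                "type": "object",
                "properties": {
                    "pod_name": {"type": "string", "description": "Exact pod name to delete"},
                },
                "required": ["pod_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "clear_cache",
            "description": "Flush the Redis cache for a named service.",
            "parameters": {
                "type": "object",
                "properties": {
                    "service": {"type": "string", "description": "Service whose cache to flush"},
                },
                "required": ["service"],
            },
        },
    },
]
```

The model never runs anything itself; it only proposes a call, which your dispatcher is free to refuse.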

Frequently Asked Questions (FAQ)

Q: Is it safe to send server logs to OpenAI?

A: You must sanitize PII (Personally Identifiable Information) before sending logs. We recommend using a regex filter in your Python script to redact emails, IP addresses, and customer names before the API call. For enterprise use, consider Azure OpenAI Service for stronger compliance.
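
A minimal sketch of that redaction filter, run on the log snippet before the OpenAI call. Emails and IPv4 addresses are regex-friendly; customer names are not, and need a deny-list or an NER pass instead:

```python
import re

# Redaction patterns: (compiled regex, replacement token)
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),       # email addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),      # IPv4 addresses
]

def sanitize(text):
    """Replace PII-looking substrings with placeholder tokens."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(sanitize("user jane@example.com from 10.0.0.42 hit a timeout"))
# user <email> from <ip> hit a timeout
```

In handle_trigger, wrap the snippet as `sanitize(error_snippet)` before building the prompt.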

Q: How much does this automation cost?

A: GPT-4o is cost-effective. A typical 50-line log analysis consumes about 1,000 input tokens ($0.005). If you process 50 incidents a day, your AI cost is less than $10/month—significantly cheaper than one hour of an engineer's time.
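
The arithmetic behind that estimate, assuming $5.00 per 1M input tokens (the rate implied by the $0.005-per-incident figure; check current OpenAI pricing, which changes over time):

```python
# Back-of-envelope monthly cost for AI incident analysis
price_per_token = 5.00 / 1_000_000   # assumed $5.00 per 1M input tokens
tokens_per_incident = 1_000          # ~50-line log snippet plus prompt
incidents_per_day = 50

per_incident = tokens_per_incident * price_per_token
monthly = per_incident * incidents_per_day * 30

print(f"${per_incident:.3f} per incident, ${monthly:.2f} per month")
# $0.005 per incident, $7.50 per month
```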

Q: Can the AI Agent execute the fix automatically?

A: Yes, but we recommend "Human-in-the-Loop" for the first phase. You can upgrade this script to trigger an Ansible playbook via a "Click to Fix" button in Slack, but only allow the AI to execute read-only commands (like clearing cache) initially.
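
One way to sketch that "Click to Fix" button is a Slack Block Kit payload: the message proposes the command, and nothing runs until a human clicks. The action_id here is hypothetical and must match whatever your separate Slack interactivity endpoint (not shown) listens for:

```python
def build_fix_blocks(incident_id, command):
    """Block Kit blocks proposing a fix with a human-approval button."""
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Proposed fix for {incident_id}:* `{command}`"},
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Click to Fix"},
                    "style": "primary",
                    "action_id": "approve_fix",  # hypothetical; match your interactivity handler
                    "value": command,
                }
            ],
        },
    ]

# Usage with the client from app.py:
# slack_client.chat_postMessage(channel=SLACK_CHANNEL, text="Fix proposal",
#                               blocks=build_fix_blocks("PABC123", "redis-cli flushall"))
```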
