Files
pixelheros/.claude/docs/templates/incident-response.md
2026-05-15 14:52:29 +08:00

3.9 KiB

Incident Response: [Incident Title]

Severity: [S1-Critical / S2-Major / S3-Moderate / S4-Minor] Status: [Active / Mitigated / Resolved / Post-Mortem Complete] Detected: [Date Time UTC] Resolved: [Date Time UTC or ONGOING] Duration: [Total time from detection to resolution] Incident Commander: [Name/Role]


Impact Summary

[2-3 sentences describing what players experienced. Write from the player perspective, not the technical perspective.]

  • Players affected: [estimated count or percentage]
  • Platforms affected: [PC / Console / Mobile / All]
  • Regions affected: [All / specific regions]
  • Revenue impact: [estimated if applicable]

Timeline

Time (UTC) Event Action Taken
[HH:MM] Incident detected via [monitoring/player report/etc.] Incident commander assigned
[HH:MM] Root cause identified [Brief description of cause]
[HH:MM] Mitigation deployed [What was done]
[HH:MM] Service restored / Fix confirmed Monitoring for recurrence
[HH:MM] All-clear declared Post-mortem scheduled

Root Cause

What Happened

[Technical description of the root cause. Be specific about the chain of events that led to the incident.]

Why It Happened

[Systemic cause — why did existing processes, tests, or safeguards fail to prevent this? This is more important than the technical cause.]

Contributing Factors

  • [Factor 1 — e.g., "Insufficient load testing for the new matchmaking system"]
  • [Factor 2 — e.g., "Monitoring alert threshold was set too high"]
  • [Factor 3]

Mitigation and Resolution

Immediate Actions (during incident)

  1. [Action taken to stop the bleeding]
  2. [Action taken to restore service]
  3. [Action taken to verify resolution]

Follow-Up Actions (after resolution)

  1. [Permanent fix if immediate action was a workaround]
  2. [Additional testing or monitoring added]
  3. [Process changes to prevent recurrence]

Player Communication

Initial Acknowledgment

Sent: [Time] via [channel]

[Exact text of the first public message acknowledging the issue]

Status Updates

Sent: [Time] via [channel]

[Text of each subsequent update]

Resolution Notice

Sent: [Time] via [channel]

[Text announcing the fix and any compensation]

Compensation (if applicable)

  • What: [description of compensation — e.g., "500 premium currency + 24-hour XP boost"]
  • Who: [all players / affected players only / players who logged in during incident]
  • When: [delivery date and method]
  • Rationale: [why this compensation is appropriate for the impact]

Prevention

What We Are Changing

Action Item Owner Deadline Status
[Specific preventive measure] [Role] [Date] [TODO/Done]
[Add monitoring for X] [Role] [Date] [TODO/Done]
[Add test coverage for Y] [Role] [Date] [TODO/Done]
[Update runbook for Z] [Role] [Date] [TODO/Done]

Process Improvements

  • [Process change to prevent similar incidents]
  • [Monitoring/alerting improvement]
  • [Testing improvement]

Lessons Learned

What Went Well

  • [Positive aspect of incident response — e.g., "Detection was fast due to monitoring alerts"]
  • [Positive aspect]

What Went Poorly

  • [Problem with response — e.g., "Took 20 minutes to identify the correct on-call person"]
  • [Problem]

Where We Got Lucky

  • [Factor that reduced impact by chance rather than design — these are hidden risks to address]

Sign-Offs

  • Technical Director — Root cause accurate, prevention plan sufficient
  • QA Lead — Test coverage gaps addressed
  • Producer — Timeline and communication reviewed
  • Community Manager — Player communication reviewed

This document is filed in production/hotfixes/ and linked from the release notes for the fix version.