3.9 KiB
Incident Response: [Incident Title]
Severity: [S1-Critical / S2-Major / S3-Moderate / S4-Minor] Status: [Active / Mitigated / Resolved / Post-Mortem Complete] Detected: [Date Time UTC] Resolved: [Date Time UTC or ONGOING] Duration: [Total time from detection to resolution] Incident Commander: [Name/Role]
Impact Summary
[2-3 sentences describing what players experienced. Write from the player perspective, not the technical perspective.]
- Players affected: [estimated count or percentage]
- Platforms affected: [PC / Console / Mobile / All]
- Regions affected: [All / specific regions]
- Revenue impact: [estimated if applicable]
Timeline
| Time (UTC) | Event | Action Taken |
|---|---|---|
| [HH:MM] | Incident detected via [monitoring/player report/etc.] | Incident commander assigned |
| [HH:MM] | Root cause identified | [Brief description of cause] |
| [HH:MM] | Mitigation deployed | [What was done] |
| [HH:MM] | Service restored / Fix confirmed | Monitoring for recurrence |
| [HH:MM] | All-clear declared | Post-mortem scheduled |
Root Cause
What Happened
[Technical description of the root cause. Be specific about the chain of events that led to the incident.]
Why It Happened
[Systemic cause — why did existing processes, tests, or safeguards fail to prevent this? This is more important than the technical cause.]
Contributing Factors
- [Factor 1 — e.g., "Insufficient load testing for the new matchmaking system"]
- [Factor 2 — e.g., "Monitoring alert threshold was set too high"]
- [Factor 3]
Mitigation and Resolution
Immediate Actions (during incident)
- [Action taken to stop the bleeding]
- [Action taken to restore service]
- [Action taken to verify resolution]
Follow-Up Actions (after resolution)
- [Permanent fix if immediate action was a workaround]
- [Additional testing or monitoring added]
- [Process changes to prevent recurrence]
Player Communication
Initial Acknowledgment
Sent: [Time] via [channel]
[Exact text of the first public message acknowledging the issue]
Status Updates
Sent: [Time] via [channel]
[Text of each subsequent update]
Resolution Notice
Sent: [Time] via [channel]
[Text announcing the fix and any compensation]
Compensation (if applicable)
- What: [description of compensation — e.g., "500 premium currency + 24-hour XP boost"]
- Who: [all players / affected players only / players who logged in during incident]
- When: [delivery date and method]
- Rationale: [why this compensation is appropriate for the impact]
Prevention
What We Are Changing
| Action Item | Owner | Deadline | Status |
|---|---|---|---|
| [Specific preventive measure] | [Role] | [Date] | [TODO/Done] |
| [Add monitoring for X] | [Role] | [Date] | [TODO/Done] |
| [Add test coverage for Y] | [Role] | [Date] | [TODO/Done] |
| [Update runbook for Z] | [Role] | [Date] | [TODO/Done] |
Process Improvements
- [Process change to prevent similar incidents]
- [Monitoring/alerting improvement]
- [Testing improvement]
Lessons Learned
What Went Well
- [Positive aspect of incident response — e.g., "Detection was fast due to monitoring alerts"]
- [Positive aspect]
What Went Poorly
- [Problem with response — e.g., "Took 20 minutes to identify the correct on-call person"]
- [Problem]
Where We Got Lucky
- [Factor that reduced impact by chance rather than design — these are hidden risks to address]
Sign-Offs
- Technical Director — Root cause accurate, prevention plan sufficient
- QA Lead — Test coverage gaps addressed
- Producer — Timeline and communication reviewed
- Community Manager — Player communication reviewed
This document is filed in production/hotfixes/ and linked from the
release notes for the fix version.