136 lines
3.9 KiB
Markdown
136 lines
3.9 KiB
Markdown
# Incident Response: [Incident Title]
|
|
|
|
**Severity**: [S1-Critical / S2-Major / S3-Moderate / S4-Minor]
|
|
**Status**: [Active / Mitigated / Resolved / Post-Mortem Complete]
|
|
**Detected**: [Date Time UTC]
|
|
**Resolved**: [Date Time UTC or ONGOING]
|
|
**Duration**: [Total time from detection to resolution]
|
|
**Incident Commander**: [Name/Role]
|
|
|
|
---
|
|
|
|
## Impact Summary
|
|
|
|
[2-3 sentences describing what players experienced. Write from the player
|
|
perspective, not the technical perspective.]
|
|
|
|
- **Players affected**: [estimated count or percentage]
|
|
- **Platforms affected**: [PC / Console / Mobile / All]
|
|
- **Regions affected**: [All / specific regions]
|
|
- **Revenue impact**: [estimated if applicable]
|
|
|
|
---
|
|
|
|
## Timeline
|
|
|
|
| Time (UTC) | Event | Action Taken |
|
|
| ---- | ---- | ---- |
|
|
| [HH:MM] | Incident detected via [monitoring/player report/etc.] | Incident commander assigned |
|
|
| [HH:MM] | Root cause identified | [Brief description of cause] |
|
|
| [HH:MM] | Mitigation deployed | [What was done] |
|
|
| [HH:MM] | Service restored / Fix confirmed | Monitoring for recurrence |
|
|
| [HH:MM] | All-clear declared | Post-mortem scheduled |
|
|
|
|
---
|
|
|
|
## Root Cause
|
|
|
|
### What Happened
|
|
[Technical description of the root cause. Be specific about the chain of events
|
|
that led to the incident.]
|
|
|
|
### Why It Happened
|
|
[Systemic cause — why did existing processes, tests, or safeguards fail to
|
|
prevent this? This is more important than the technical cause.]
|
|
|
|
### Contributing Factors
|
|
- [Factor 1 — e.g., "Insufficient load testing for the new matchmaking system"]
|
|
- [Factor 2 — e.g., "Monitoring alert threshold was set too high"]
|
|
- [Factor 3]
|
|
|
|
---
|
|
|
|
## Mitigation and Resolution
|
|
|
|
### Immediate Actions (during incident)
|
|
1. [Action taken to stop the bleeding]
|
|
2. [Action taken to restore service]
|
|
3. [Action taken to verify resolution]
|
|
|
|
### Follow-Up Actions (after resolution)
|
|
1. [Permanent fix if immediate action was a workaround]
|
|
2. [Additional testing or monitoring added]
|
|
3. [Process changes to prevent recurrence]
|
|
|
|
---
|
|
|
|
## Player Communication
|
|
|
|
### Initial Acknowledgment
|
|
*Sent: [Time] via [channel]*
|
|
> [Exact text of the first public message acknowledging the issue]
|
|
|
|
### Status Updates
|
|
*Sent: [Time] via [channel]*
|
|
> [Text of each subsequent update]
|
|
|
|
### Resolution Notice
|
|
*Sent: [Time] via [channel]*
|
|
> [Text announcing the fix and any compensation]
|
|
|
|
### Compensation (if applicable)
|
|
- **What**: [description of compensation — e.g., "500 premium currency + 24-hour XP boost"]
|
|
- **Who**: [all players / affected players only / players who logged in during incident]
|
|
- **When**: [delivery date and method]
|
|
- **Rationale**: [why this compensation is appropriate for the impact]
|
|
|
|
---
|
|
|
|
## Prevention
|
|
|
|
### What We Are Changing
|
|
|
|
| Action Item | Owner | Deadline | Status |
|
|
| ---- | ---- | ---- | ---- |
|
|
| [Specific preventive measure] | [Role] | [Date] | [TODO/Done] |
|
|
| [Add monitoring for X] | [Role] | [Date] | [TODO/Done] |
|
|
| [Add test coverage for Y] | [Role] | [Date] | [TODO/Done] |
|
|
| [Update runbook for Z] | [Role] | [Date] | [TODO/Done] |
|
|
|
|
### Process Improvements
|
|
- [Process change to prevent similar incidents]
|
|
- [Monitoring/alerting improvement]
|
|
- [Testing improvement]
|
|
|
|
---
|
|
|
|
## Lessons Learned
|
|
|
|
### What Went Well
|
|
- [Positive aspect of incident response — e.g., "Detection was fast due to
|
|
monitoring alerts"]
|
|
- [Positive aspect]
|
|
|
|
### What Went Poorly
|
|
- [Problem with response — e.g., "Took 20 minutes to identify the correct
|
|
on-call person"]
|
|
- [Problem]
|
|
|
|
### Where We Got Lucky
|
|
- [Factor that reduced impact by chance rather than design — these are hidden
|
|
risks to address]
|
|
|
|
---
|
|
|
|
## Sign-Offs
|
|
|
|
- [ ] Technical Director — Root cause accurate, prevention plan sufficient
|
|
- [ ] QA Lead — Test coverage gaps addressed
|
|
- [ ] Producer — Timeline and communication reviewed
|
|
- [ ] Community Manager — Player communication reviewed
|
|
|
|
---
|
|
|
|
*This document is filed in `production/hotfixes/` and linked from the
|
|
release notes for the fix version.*
|