# Incident Response: [Incident Title] **Severity**: [S1-Critical / S2-Major / S3-Moderate / S4-Minor] **Status**: [Active / Mitigated / Resolved / Post-Mortem Complete] **Detected**: [Date Time UTC] **Resolved**: [Date Time UTC or ONGOING] **Duration**: [Total time from detection to resolution] **Incident Commander**: [Name/Role] --- ## Impact Summary [2-3 sentences describing what players experienced. Write from the player perspective, not the technical perspective.] - **Players affected**: [estimated count or percentage] - **Platforms affected**: [PC / Console / Mobile / All] - **Regions affected**: [All / specific regions] - **Revenue impact**: [estimated if applicable] --- ## Timeline | Time (UTC) | Event | Action Taken | | ---- | ---- | ---- | | [HH:MM] | Incident detected via [monitoring/player report/etc.] | Incident commander assigned | | [HH:MM] | Root cause identified | [Brief description of cause] | | [HH:MM] | Mitigation deployed | [What was done] | | [HH:MM] | Service restored / Fix confirmed | Monitoring for recurrence | | [HH:MM] | All-clear declared | Post-mortem scheduled | --- ## Root Cause ### What Happened [Technical description of the root cause. Be specific about the chain of events that led to the incident.] ### Why It Happened [Systemic cause — why did existing processes, tests, or safeguards fail to prevent this? This is more important than the technical cause.] ### Contributing Factors - [Factor 1 — e.g., "Insufficient load testing for the new matchmaking system"] - [Factor 2 — e.g., "Monitoring alert threshold was set too high"] - [Factor 3] --- ## Mitigation and Resolution ### Immediate Actions (during incident) 1. [Action taken to stop the bleeding] 2. [Action taken to restore service] 3. [Action taken to verify resolution] ### Follow-Up Actions (after resolution) 1. [Permanent fix if immediate action was a workaround] 2. [Additional testing or monitoring added] 3. [Process changes to prevent recurrence] --- ## Player Communication ### Initial Acknowledgment *Sent: [Time] via [channel]* > [Exact text of the first public message acknowledging the issue] ### Status Updates *Sent: [Time] via [channel]* > [Text of each subsequent update] ### Resolution Notice *Sent: [Time] via [channel]* > [Text announcing the fix and any compensation] ### Compensation (if applicable) - **What**: [description of compensation — e.g., "500 premium currency + 24-hour XP boost"] - **Who**: [all players / affected players only / players who logged in during incident] - **When**: [delivery date and method] - **Rationale**: [why this compensation is appropriate for the impact] --- ## Prevention ### What We Are Changing | Action Item | Owner | Deadline | Status | | ---- | ---- | ---- | ---- | | [Specific preventive measure] | [Role] | [Date] | [TODO/Done] | | [Add monitoring for X] | [Role] | [Date] | [TODO/Done] | | [Add test coverage for Y] | [Role] | [Date] | [TODO/Done] | | [Update runbook for Z] | [Role] | [Date] | [TODO/Done] | ### Process Improvements - [Process change to prevent similar incidents] - [Monitoring/alerting improvement] - [Testing improvement] --- ## Lessons Learned ### What Went Well - [Positive aspect of incident response — e.g., "Detection was fast due to monitoring alerts"] - [Positive aspect] ### What Went Poorly - [Problem with response — e.g., "Took 20 minutes to identify the correct on-call person"] - [Problem] ### Where We Got Lucky - [Factor that reduced impact by chance rather than design — these are hidden risks to address] --- ## Sign-Offs - [ ] Technical Director — Root cause accurate, prevention plan sufficient - [ ] QA Lead — Test coverage gaps addressed - [ ] Producer — Timeline and communication reviewed - [ ] Community Manager — Player communication reviewed --- *This document is filed in `production/hotfixes/` and linked from the release notes for the fix version.*