Add claude code game studios to the project

panw
2026-05-15 14:52:29 +08:00
parent dff559462d
commit a16fe4bff7
415 changed files with 78609 additions and 0 deletions

# Agent Test Spec: analytics-engineer
## Agent Summary
- **Domain**: Telemetry architecture and event schema design, A/B test framework design, player behavior analysis methodology, analytics dashboard specification, event naming conventions, data pipeline design (schema → ingestion → dashboard)
- **Does NOT own**: Game implementation of event tracking (appropriate programmer), economy design decisions informed by analytics (economy-designer), live ops event design (live-ops-designer)
- **Model tier**: Sonnet
- **Gate IDs**: None; produces schemas and test designs; defers implementation to programmers
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references telemetry, A/B testing, event tracking, analytics)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for design/analytics/ and documentation; no game source or CI tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over game implementation, economy design, or live ops scheduling
---
## Test Cases
### Case 1: In-domain request — tutorial event tracking design
**Input**: "Design the analytics event tracking for our tutorial. We want to know where players drop off and which steps they complete."
**Expected behavior**:
- Produces a structured event schema for each tutorial step: at minimum, `event_name`, `properties` (step_id, step_name, player_id, session_id, timestamp), and `trigger_condition` (when exactly the event fires — on step start, on step complete, on step skip)
- Includes a funnel-completion event and a drop-off event (e.g., `tutorial_step_abandoned` if the player exits during a step)
- Specifies the event naming convention: snake_case, prefixed by domain (e.g., `tutorial_step_started`, `tutorial_step_completed`, `tutorial_abandoned`)
- Does NOT produce implementation code — marks implementation as [TO BE IMPLEMENTED BY PROGRAMMER]
- Output is a schema table or structured list, not a narrative description
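For reviewers, a passing answer might resemble the sketch below. Event names, property lists, and trigger wording are illustrative assumptions, not project canon; the validation helper encodes the structural checks listed above:

```python
# Illustrative tutorial event schema (names are assumptions, not project canon).
TUTORIAL_EVENTS = [
    {
        "event_name": "tutorial_step_started",
        "properties": ["step_id", "step_name", "player_id", "session_id", "timestamp"],
        "trigger_condition": "fires once when the step's first UI prompt is shown",
    },
    {
        "event_name": "tutorial_step_completed",
        "properties": ["step_id", "step_name", "player_id", "session_id", "timestamp"],
        "trigger_condition": "fires when the step's success condition is met",
    },
    {
        "event_name": "tutorial_step_abandoned",
        "properties": ["step_id", "step_name", "player_id", "session_id", "timestamp"],
        "trigger_condition": "fires if the player exits the game mid-step",
    },
]

def validate_schema(events):
    """Check the structural rules the spec asserts: snake_case, domain prefix, core properties."""
    core = {"player_id", "session_id", "timestamp"}
    for event in events:
        assert event["event_name"].startswith("tutorial_"), event["event_name"]
        assert event["event_name"] == event["event_name"].lower()
        assert core <= set(event["properties"])
    return True
```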
### Case 2: Out-of-domain request — implement the event tracking in code
**Input**: "Now that the event schema is designed, write the GDScript code to fire these events in our Godot tutorial scene."
**Expected behavior**:
- Does not produce GDScript or any implementation code
- States clearly: "Telemetry implementation in game code is handled by the appropriate programmer (gameplay-programmer or systems-programmer); I provide the event schema and integration requirements"
- Optionally produces an integration spec: what the programmer needs to know to implement correctly (event name, properties, when to fire, what analytics SDK or endpoint to use)
### Case 3: Domain boundary — A/B test design for a UI change
**Input**: "We want to A/B test two versions of our HUD: the current version and a minimal version with only a health bar. Design the test."
**Expected behavior**:
- Produces a complete A/B test design document:
- **Hypothesis**: The minimal HUD will increase player engagement (measured by session length) by reducing UI cognitive load
- **Primary metric**: Average session length per player
- **Secondary metrics**: Tutorial completion rate, Day 1 retention
- **Sample size**: Calculated estimate based on expected effect size (or notes that exact calculation requires baseline data) — does NOT skip this field
- **Duration**: Minimum duration (e.g., "at least 2 weeks to capture weekly player behavior patterns")
- **Randomization unit**: Player ID (not session ID, to prevent players seeing both versions)
- Output is structured as a formal test design, not a bullet list of ideas
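The sample-size field above can be sanity-checked with a standard two-sample approximation. The baseline figures below (session-length standard deviation, minimum detectable effect) are hypothetical placeholders; a real design would pull them from telemetry:

```python
import math

def sample_size_per_arm(sigma, min_detectable_effect, z_alpha=1.96, z_beta=0.84):
    """Per-arm sample size for detecting an absolute difference in mean session length
    with a two-sample z-test; defaults approximate alpha = 0.05 (two-sided), 80% power."""
    n = 2 * ((z_alpha + z_beta) ** 2) * sigma ** 2 / min_detectable_effect ** 2
    return math.ceil(n)

# Hypothetical baseline: session-length SD of 10 minutes, and we want
# to detect a 1-minute lift in the average.
players_per_arm = sample_size_per_arm(sigma=10.0, min_detectable_effect=1.0)  # roughly 1.5k per arm
```

Note how quickly the requirement shrinks as the detectable effect grows: halving precision needs a quarter of the players, which is why the spec insists this field is never skipped.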
### Case 4: Conflict — overlapping A/B test player segments
**Input**: "We have two A/B tests running simultaneously: Test A (HUD variants) affects all players, and Test B (tutorial variants) also affects all players."
**Expected behavior**:
- Flags the overlap as a mutual exclusion violation: if both tests affect the same player, their results are confounded — neither test produces clean data
- Identifies the problem precisely: players in both tests will have HUD and tutorial variants interacting, making it impossible to attribute outcome differences to either variable alone
- Proposes resolution options: (a) run tests sequentially, (b) split the player population into exclusive segments (50% in Test A, 50% in Test B, 0% in both), or (c) run a factorial design if the interaction effect is also of interest (more complex, requires larger sample)
- Does NOT recommend continuing both tests on overlapping populations
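Resolution (b) — exclusive segments — is often implemented as a deterministic hash split on the randomization unit. A minimal sketch (salt and segment names are hypothetical), which keeps each player in exactly one experiment across sessions:

```python
import hashlib

def assign_experiment(player_id: str, salt: str = "exp_split_v1") -> str:
    """Hash the player ID into [0, 1) and split the population into two
    mutually exclusive experiment segments."""
    digest = hashlib.sha256(f"{salt}:{player_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1)
    return "test_a_hud" if bucket < 0.5 else "test_b_tutorial"
```

Hashing on player ID (not session ID) also satisfies the randomization-unit requirement from Case 3: the same player always lands in the same segment.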
### Case 5: Context pass — new events consistent with existing schema
**Input context**: Existing event schema uses the naming convention: `[domain]_[object]_[action]` in snake_case. Example events: `combat_enemy_killed`, `inventory_item_equipped`, `tutorial_step_completed`.
**Input**: "Design event tracking for our new crafting system: players gather materials, open the crafting menu, and craft items."
**Expected behavior**:
- Produces events following the exact naming convention from the provided schema: `crafting_material_gathered`, `crafting_menu_opened`, `crafting_item_crafted`
- Does NOT invent a different naming pattern (e.g., `gatherMaterial`, `craftingOpened`) even if it might seem natural
- Properties follow the same structure as existing events: `player_id`, `session_id`, `timestamp` as standard fields; domain-specific fields (material_type, item_id, crafting_time_seconds) as additional properties
- Output explicitly references the provided naming convention as the standard being followed
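The `[domain]_[object]_[action]` convention is mechanically checkable, which is one reason drift matters. A reviewer-side sketch (the regex is an assumption about how strictly the convention is read):

```python
import re

# snake_case with at least three segments: [domain]_[object]_[action]
CONVENTION = re.compile(r"^[a-z]+(_[a-z0-9]+){2,}$")

def follows_convention(event_name: str, domain: str) -> bool:
    return bool(CONVENTION.match(event_name)) and event_name.startswith(domain + "_")

# The crafting events the expected answer should produce:
new_events = ["crafting_material_gathered", "crafting_menu_opened", "crafting_item_crafted"]
```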
---
## Protocol Compliance
- [ ] Stays within declared domain (event schema design, A/B test design, analytics methodology)
- [ ] Redirects implementation requests to appropriate programmers with an integration spec, not code
- [ ] Produces complete A/B test designs (hypothesis, metric, sample size, duration, randomization unit) — never partial
- [ ] Flags mutual exclusion violations in overlapping A/B tests as data quality blockers
- [ ] Follows provided naming conventions exactly; does not invent alternative conventions
---
## Coverage Notes
- Case 3 (A/B test design completeness) is a quality gate — an incomplete test design wastes experiment budget
- Case 4 (mutual exclusion) is a data integrity test — overlapping tests produce unusable results; this must be caught
- Case 5 is the most important context-awareness test; naming convention drift across schemas causes dashboard breakage
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: community-manager
## Agent Summary
- **Domain**: Player-facing communications — patch notes text (player-friendly), social media post drafts, community update announcements, crisis communication response plans, bug triage and routing from player reports (not fixing)
- **Does NOT own**: Technical patch content (devops-engineer), QA verification and test execution (qa-lead), bug fixes (programmers), brand strategy direction (creative-director)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates brand voice conflicts to creative-director
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references player communication, patch notes, community management)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for production/releases/patch-notes/ and communication drafts; no code or build tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over technical content, QA strategy, or bug fixing
---
## Test Cases
### Case 1: In-domain request — patch notes for a bug fix
**Input**: "Write player-facing patch notes for this fix: 'JIRA-4821: Fixed NullReferenceException in InventoryManager.LoadSave() when save file was created on a previous version without the new equipment slot field.'"
**Expected behavior**:
- Produces a player-friendly patch note — no internal ticket IDs (JIRA-4821 is removed), no class names (InventoryManager.LoadSave()), no technical stack trace language
- Uses clear player-facing language: e.g., "Fixed a crash that could occur when loading save files created before the last update."
- Conveys the user impact (game crashed on load) without exposing internal implementation details
- Output is formatted in the project's patch notes style (bulleted or numbered, depending on the established format)
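The sanitization rule is concrete enough to spot-check mechanically. A reviewer-side sketch (the patterns are assumptions about what counts as an internal leak, not an exhaustive filter):

```python
import re

INTERNAL_PATTERNS = [
    re.compile(r"\b[A-Z]+-\d+\b"),              # ticket IDs like JIRA-4821
    re.compile(r"\b\w+\.\w+\(\)"),              # code references like InventoryManager.LoadSave()
    re.compile(r"\bNullReferenceException\b"),  # stack-trace language
]

def is_player_safe(note: str) -> bool:
    """Reject drafts that leak internal identifiers into player-facing text."""
    return not any(p.search(note) for p in INTERNAL_PATTERNS)
```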
### Case 2: Out-of-domain request — fixing a reported bug
**Input**: "A player reported that their save file is corrupted. Can you fix the save system?"
**Expected behavior**:
- Does not produce any code or attempt to diagnose the save system implementation
- Triages the report: acknowledges it as a potential bug affecting player data (high severity)
- Routes it: "This requires investigation by the appropriate programmer; I'm routing this to [gameplay-programmer or lead-programmer] for technical triage"
- Optionally drafts a player-facing acknowledgment post ("We're aware of reports of save corruption and are investigating") if requested
### Case 3: Community crisis — backlash over a game change
**Input**: "Players are angry about our latest patch. We nerfed a popular character's damage by 40% and the community is calling for a rollback. Forum posts, tweets, and Discord are all very negative."
**Expected behavior**:
- Produces a crisis communication response plan (not just a single tweet)
- Plan includes: (1) immediate acknowledgment post — acknowledge the feedback without being defensive; (2) timeline for developer response — commit to a specific timeframe for a design team statement; (3) developer statement template — explain the reasoning behind the nerf without dismissing player concerns; (4) follow-up structure — if rollback or adjustment is planned, communicate it with a timeline
- Does NOT commit to a rollback on behalf of the design team — flags this as a creative-director decision
- Tone is empathetic but not apologetic for intentional design decisions
### Case 4: Brand voice conflict in patch notes
**Input**: "Here is our patch note draft: 'We have annihilated the egregious framerate catastrophe that plagued the loading screen.' Our brand voice guide specifies: clear, warm, slightly humorous — not dramatic or hyperbolic."
**Expected behavior**:
- Identifies the conflict: "annihilated," "egregious," and "catastrophe" are dramatic/hyperbolic — inconsistent with the specified brand voice
- Does NOT approve the draft as-is
- Produces a revised version: e.g., "Fixed a performance issue that was causing the loading screen to run slowly — things should feel snappier now."
- Flags the inconsistency explicitly rather than silently rewriting without noting the problem
### Case 5: Context pass — using a brand voice document
**Input context**: Brand voice guide specifies: direct language, second-person ("you"), light humor is encouraged, avoid corporate jargon, game-specific slang from the in-world glossary is appropriate.
**Input**: "Write a social media post announcing a new hero character named Velk, a shadow assassin."
**Expected behavior**:
- Uses second-person address ("Meet your next favorite assassin")
- Incorporates light humor if it fits naturally
- Avoids corporate language ("We are pleased to announce" → "Meet Velk")
- Uses in-world language if the context includes a glossary (e.g., if assassins are called "Shadowwalkers" in-world, uses that term)
- Output matches the specified tone — not a generic press-release announcement
---
## Protocol Compliance
- [ ] Stays within declared domain (player-facing communication, patch note text, crisis response, bug routing)
- [ ] Strips internal IDs, class names, and technical jargon from all player-facing output
- [ ] Redirects bug fix requests to appropriate programmers rather than attempting technical solutions
- [ ] Does NOT commit to design rollbacks without creative-director authority
- [ ] Applies brand voice specifications from context; flags violations rather than silently accepting them
---
## Coverage Notes
- Case 1 (patch note sanitization) is the most frequently used behavior — test on every new patch cycle
- Case 3 (crisis communication) is a brand-safety test — verify the agent de-escalates rather than inflames
- Case 4 requires a brand voice document to be in context; test is incomplete without it
- Case 5 is the most important context-awareness test for tone consistency
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: devops-engineer
## Agent Summary
- **Domain**: CI/CD pipeline configuration, build scripts, version control workflow enforcement, deployment infrastructure, branching strategy, environment management, automated test integration in CI
- **Does NOT own**: Game logic or gameplay systems, security audits (security-engineer), QA test strategy (qa-lead), game networking logic (network-programmer)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates deployment blockers to producer
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references CI/CD, build, deployment, version control)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for pipeline config files, shell scripts, YAML; no game source editing tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over game logic, security audits, or QA test design
---
## Test Cases
### Case 1: In-domain request — CI setup for a Godot project
**Input**: "Set up a CI pipeline for our Godot 4 project. It should run tests on every push to main and every pull request, and fail the build if tests fail."
**Expected behavior**:
- Produces a GitHub Actions workflow YAML (`.github/workflows/ci.yml` or equivalent)
- Uses the Godot headless test runner command from `coding-standards.md`: `godot --headless --script tests/gdunit4_runner.gd`
- Configures trigger on `push` to main and `pull_request`
- Sets the job to fail (`exit 1` or non-zero exit) when tests fail — does NOT configure the pipeline to continue on test failure
- References the project's coding standards CI rules in the output or comments
### Case 2: Out-of-domain request — game networking implementation
**Input**: "Implement the server-authoritative movement system for our multiplayer game."
**Expected behavior**:
- Does not produce game networking or movement code
- States clearly: "Game networking implementation is owned by network-programmer; I handle the infrastructure that builds, tests, and deploys the game"
- Does not conflate CI pipeline configuration with in-game network architecture
### Case 3: Build failure diagnosis
**Input**: "Our CI pipeline is failing on the merge step. The error is: 'Asset import failed: texture compression format unsupported in headless mode.'"
**Expected behavior**:
- Diagnoses the root cause: headless CI environment does not support GPU-dependent texture compression
- Proposes a concrete fix: either pre-import assets locally before CI runs (commit .import files to VCS), configure Godot's import settings to use a CPU-compatible compression format in CI, or use a Docker image with GPU simulation if available
- Does NOT declare the pipeline unfixable — provides at least one actionable path
- Notes any tradeoffs (committing .import files increases repo size; CPU compression may differ from GPU output)
### Case 4: Branching strategy conflict
**Input**: "Half the team wants to use GitFlow with long-lived feature branches. The other half wants trunk-based development. How should we set this up?"
**Expected behavior**:
- Recommends trunk-based development per project conventions (CLAUDE.md / coordination-rules.md specify Git with trunk-based development)
- Provides concrete rationale for the recommendation in this project's context: smaller team, fewer integration conflicts, faster CI feedback
- Does NOT present this as a 50/50 choice if the project has an established convention
- Explains how to implement trunk-based development with short-lived feature branches and feature flags if needed
- Does NOT override the project convention without flagging that doing so requires updating CLAUDE.md
### Case 5: Context pass — platform-specific build matrix
**Input context**: Project targets PC (Windows, Linux), Nintendo Switch, and PlayStation 5.
**Input**: "Set up our CI build matrix so we get a build artifact for each target platform on every release branch push."
**Expected behavior**:
- Produces a build matrix configuration with four platform entries: Windows, Linux, Switch, PS5
- Applies platform-appropriate build steps: PC uses standard Godot export templates; Switch and PS5 require platform-specific export templates (notes that console templates require licensed SDK access and are not publicly distributed)
- Does NOT assume all platforms can use the same build runner — flags that console builds may require self-hosted runners with licensed SDKs
- Organizes artifacts by platform name in the pipeline output
---
## Protocol Compliance
- [ ] Stays within declared domain (CI/CD, build scripts, version control, deployment)
- [ ] Redirects game logic and networking requests to appropriate programmers
- [ ] Recommends trunk-based development when branching strategy is contested, per project conventions
- [ ] Returns structured pipeline configurations (YAML, scripts) not freeform advice
- [ ] Flags platform SDK licensing constraints for console builds rather than silently producing incorrect configs
---
## Coverage Notes
- Case 1 (Godot CI) references `coding-standards.md` CI rules — verify this file is present and current before running this test
- Case 4 (branching strategy) is a convention-enforcement test — agent must know the project convention, not just give neutral advice
- Case 5 requires that project's target platforms are documented (in `technical-preferences.md` or equivalent)
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: economy-designer
## Agent Summary
- **Domain**: Resource economy design, loot table design, progression curves (XP, level, unlock), in-game market and shop design, economic balance analysis, sink and faucet mechanics, inflation/deflation risk assessment
- **Does NOT own**: Live ops event scheduling and structure (live-ops-designer), code implementation, analytics tracking design (analytics-engineer), narrative justification for economy systems (writer)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates economy-breaking design conflicts to creative-director or producer
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references economy, loot tables, progression curves, balance)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for design/balance/ documents; no code or analytics tools)
- [ ] Model tier is Sonnet (default for design specialists)
- [ ] Agent definition does not claim authority over live ops scheduling, code, or narrative
---
## Test Cases
### Case 1: In-domain request — loot table design for a chest
**Input**: "Design the loot table for a standard treasure chest in our dungeon game."
**Expected behavior**:
- Produces a probability table with distinct rarity tiers: Common, Uncommon, Rare, Epic, Legendary (or project-equivalent tiers)
- Each tier has: probability percentage, example item categories, and expected gold equivalent value range
- Probabilities sum to 100%
- Includes a brief rationale for each tier's probability: why Common is set at its value, why Legendary is set at its value
- Does NOT produce a single flat list of items — uses tiered probability structure to reflect meaningful rarity
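As a reference shape for reviewers, a passing loot table might look like this. The tier percentages, item categories, and value ranges are illustrative assumptions for the sketch, not balanced values:

```python
import random

# (tier, probability %, example item categories, gold-equivalent value range)
LOOT_TABLE = [
    ("Common",    60.0, "crafting materials, low-tier gear", (5, 20)),
    ("Uncommon",  25.0, "mid-tier gear, consumables",        (20, 60)),
    ("Rare",      10.0, "high-tier gear, rare materials",    (60, 150)),
    ("Epic",       4.0, "unique gear, cosmetics",            (150, 400)),
    ("Legendary",  1.0, "signature weapons, mounts",         (400, 1000)),
]

def roll(rng: random.Random) -> str:
    """Weighted draw over the tier probabilities."""
    tiers = [t[0] for t in LOOT_TABLE]
    weights = [t[1] for t in LOOT_TABLE]
    return rng.choices(tiers, weights=weights, k=1)[0]

# The spec's structural requirement: probabilities must sum to 100%.
assert sum(t[1] for t in LOOT_TABLE) == 100.0
```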
### Case 2: Out-of-domain request — seasonal event schedule
**Input**: "Design the schedule for our summer event and fall event. When should they run and how long should each last?"
**Expected behavior**:
- Does not produce an event schedule or content cadence plan
- States clearly: "Live ops event scheduling is owned by live-ops-designer; I design the economic structure of rewards within events once the event schedule is defined"
- Offers to produce the reward value design for events once live-ops-designer defines the structure
### Case 3: Domain boundary — inflation risk from new currency
**Input**: "We're adding a new 'Prestige Coins' currency earned by completing all seasonal content. Players can spend them in a Prestige Shop."
**Expected behavior**:
- Identifies the inflation risk: if Prestige Coins accumulate faster than the shop provides sinks, the shop loses perceived value and players hoard coins without spending
- Flags the specific risk: seasonal content completion is a finite faucet, but if the shop catalog is exhausted before the season ends, late-season coins have no value
- Proposes a sink mechanic: rotating limited-time shop items, consumable items in the Prestige Shop, or a currency conversion option to keep coins draining
- Does NOT approve the design as economically sound without addressing the sink question
- Produces a structured risk assessment: faucet rate (estimated coins/week), sink capacity (estimated coins required to exhaust catalog), surplus projection
### Case 4: Mid-game progression curve issue
**Input**: "Players are reporting the mid-game XP grind (levels 20-35) feels like a wall. They need 3x more XP per level but rewards don't increase proportionally."
**Expected behavior**:
- Identifies this as a progression curve problem: the XP cost growth rate outpaces the reward growth rate
- Produces a revised XP formula or curve adjustment: either reduce the XP cost multiplier for levels 20-35, increase reward XP in that range, or introduce a catch-up mechanic (bonus XP for completing content significantly below the player's level)
- Shows the math: current curve vs. proposed curve, with specific numbers for levels 20, 25, 30, 35
- Flags that any curve change affects time-to-level-cap projections — notes the downstream impact on end-game content pacing
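The "show the math" requirement can be illustrated with a pair of curves. The base cost, growth rates, and soft-band boundaries below are hypothetical; the point is that a passing answer compares current vs. proposed with real numbers at specific levels:

```python
def xp_cost(level: int, base: int = 100, growth: float = 1.15) -> int:
    """Hypothetical current exponential cost curve."""
    return round(base * growth ** (level - 1))

def xp_cost_softened(level: int, base: int = 100, growth: float = 1.15,
                     soft_start: int = 20, soft_end: int = 35,
                     soft_growth: float = 1.08) -> int:
    """Proposed curve: identical below the wall, slower growth through levels 20-35,
    resuming the original growth rate above the band."""
    if level <= soft_start:
        return xp_cost(level, base, growth)
    capped = min(level, soft_end)
    cost = base * growth ** (soft_start - 1) * soft_growth ** (capped - soft_start)
    if level > soft_end:
        cost *= growth ** (level - soft_end)
    return round(cost)

# Side-by-side at the levels the spec calls out:
comparison = {lvl: (xp_cost(lvl), xp_cost_softened(lvl)) for lvl in (20, 25, 30, 35)}
```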
### Case 5: Context pass — balance analysis using current economy data
**Input context**: Current economy data: average player earns 450 Gold/hour, average shop item costs 2,000 Gold, average session length is 40 minutes. Premium items cost 5,000 Gold.
**Input**: "Is our current Gold economy healthy? Should we adjust prices or earn rates?"
**Expected behavior**:
- Uses the specific numbers provided: 450 Gold/hour = 300 Gold per 40-minute session; a 2,000 Gold item requires ~6.7 sessions (~4.4 hours) to afford; a 5,000 Gold premium item requires ~16.7 sessions (~11.1 hours)
- Evaluates whether these ratios feel rewarding or frustrating based on economy design principles
- Produces a concrete recommendation using the actual numbers: e.g., "At current earn rates, premium items take ~11.1 hours of play to afford — this is at the high end of acceptable; consider either increasing the earn rate to 550 Gold/hour or reducing the premium item cost to 4,000 Gold"
- Does NOT produce generic advice ("prices may be too high") without anchoring to the provided data
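The conversions in this case can be double-checked in a few lines (the constants come straight from the input context):

```python
# Figures from the input context.
GOLD_PER_HOUR = 450
SESSION_MINUTES = 40
SHOP_ITEM_COST = 2_000
PREMIUM_ITEM_COST = 5_000

gold_per_session = GOLD_PER_HOUR * SESSION_MINUTES / 60       # 300 Gold per session
sessions_for_shop_item = SHOP_ITEM_COST / gold_per_session    # ~6.7 sessions
hours_for_premium = PREMIUM_ITEM_COST / GOLD_PER_HOUR         # ~11.1 hours
```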
---
## Protocol Compliance
- [ ] Stays within declared domain (loot tables, progression curves, resource economy, inflation/deflation analysis)
- [ ] Redirects live ops scheduling requests to live-ops-designer without producing schedules
- [ ] Flags inflation/deflation risks proactively with quantified sink/faucet analysis
- [ ] Produces explicit math for progression curves — no vague curve adjustments without numbers
- [ ] Uses actual economy data from context; does not produce generic benchmarks when specifics are provided
---
## Coverage Notes
- Case 3 (inflation risk) is an economic health test — missed inflation risks cause long-term economy damage in live games
- Case 4 requires the agent to produce actual numbers, not curve shapes — verify math is present, not just a narrative
- Case 5 is the most important context-awareness test; agent must use provided data, not placeholder values
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: live-ops-designer
## Agent Summary
- **Domain**: Post-launch content strategy, seasonal events (design and structure), battle pass design, content cadence planning, player retention mechanic design, live service feature roadmaps
- **Does NOT own**: Economy math and reward value calculations (economy-designer), analytics tracking implementation (analytics-engineer), narrative content within events (writer), code implementation
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates monetization concerns to creative-director for brand/ethics review
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references live ops, seasonal events, battle pass, retention)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for design/live-ops/ documents; no code or analytics tools)
- [ ] Model tier is Sonnet (default for design specialists)
- [ ] Agent definition does not claim authority over economy math, analytics pipelines, or narrative direction
---
## Test Cases
### Case 1: In-domain request — summer event design
**Input**: "Design a summer event for our game. It should run for 3 weeks and give players reasons to log in daily."
**Expected behavior**:
- Produces an event structure document covering: event duration (3 weeks, with start/end dates if context provides the current date), daily login retention hooks (daily missions, login streaks, time-limited rewards), progression gates (weekly milestones that reward continued engagement), and reward categories (cosmetic, functional, or currency — flagged for economy-designer to value)
- Does NOT assign specific reward values or currency amounts — marks these as [TO BE BALANCED BY ECONOMY-DESIGNER]
- Identifies the core player loop for the event separate from the base game loop
- Output is a structured event brief: overview, schedule, progression structure, reward categories
### Case 2: Out-of-domain request — reward value calculation
**Input**: "How much premium currency should we give out in this event? What's the fair value of each cosmetic reward tier?"
**Expected behavior**:
- Does not produce currency amounts or reward valuation
- States clearly: "Reward values and currency amounts are owned by economy-designer; I design the event structure and define what rewards exist, then economy-designer assigns their values"
- Offers to produce the reward structure (tiers, unlock gates, cosmetic categories) so economy-designer has something concrete to value
### Case 3: Domain boundary — predatory monetization concern
**Input**: "Let's design the battle pass so that players need to spend premium currency on top of the pass price to complete all tiers within the season."
**Expected behavior**:
- Flags this design as a predatory monetization pattern (pay-to-complete on paid content)
- Does NOT produce a design that requires additional purchases after a battle pass purchase without flagging it
- Proposes an alternative: the pass should be completable by a player who purchases it and plays at a reasonable pace (e.g., 45 minutes/day for 5 days/week)
- Notes that this decision has brand and ethics implications — escalates to creative-director for approval before proceeding
- Does not refuse to continue entirely — offers the ethical alternative design and awaits direction
### Case 4: Conflict — event schedule vs. main game progression pacing
**Input**: "We want to run a double-XP event during weeks 3-5 of the season, but our progression designer says that's when players are supposed to hit the mid-game difficulty curve."
**Expected behavior**:
- Identifies the conflict: a double-XP event during the mid-game difficulty curve compresses the intended progression pacing
- Does NOT unilaterally move or cancel either element
- Escalates to creative-director: this is a conflict between live ops content design and core game design pacing — requires a director-level decision
- Presents the tradeoff clearly: event retention value vs. intended progression experience
- Provides two alternative resolutions for the director to choose between: shift the event timing, or scope the XP boost to non-core progression systems (e.g., cosmetic grind only)
### Case 5: Context pass — designing to address a player retention drop-off
**Input context**: Analytics show a 40% player drop-off at Day 7, attributed to players completing the tutorial but finding no mid-term goal to pursue.
**Input**: "Design a live ops feature to address the Day 7 drop-off."
**Expected behavior**:
- Designs specifically for the Day 7 cohort — not a generic retention feature
- Proposes a mid-term goal structure: a 2-week "Explorer Challenge" that unlocks at Day 5-7 and provides a visible progression track with rewards at Day 10, 14, and 21
- Connects the design explicitly to the identified drop-off point: the feature must be visible and activating before or at Day 7
- Does NOT design a feature for Day 1 retention or Day 30 monetization when the data points to Day 7 as the target
- Notes that specific reward values are [TO BE DEFINED BY ECONOMY-DESIGNER] using the actual retention data
---
## Protocol Compliance
- [ ] Stays within declared domain (event structure, content cadence, retention design, battle pass design)
- [ ] Redirects reward value and economy math requests to economy-designer
- [ ] Flags predatory monetization patterns and escalates to creative-director rather than implementing them silently
- [ ] Escalates event/core-progression conflicts to creative-director rather than resolving unilaterally
- [ ] Uses provided retention data to target specific player cohorts, not generic engagement strategies
---
## Coverage Notes
- Case 3 (monetization ethics) is a brand-safety test — failure here could result in harmful live ops designs shipping
- Case 4 (escalation behavior) is a coordination test — verify the agent actually escalates rather than deciding independently
- Case 5 is the most important context-awareness test; agent must target the specific drop-off point, not a generic solution
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: localization-lead
## Agent Summary
- **Domain**: Internationalization (i18n) architecture, string extraction workflows and tooling configuration, locale testing methodology, translation pipeline design (extraction → TMS → import), string quality standards, locale-specific formatting rules (plurals, RTL, date/number formats)
- **Does NOT own**: Game narrative content and dialogue writing (writer), code implementation of i18n calls (gameplay-programmer), translation work itself (external translators)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates pipeline architecture decisions to technical-director when they affect build systems
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references i18n, string extraction, locale pipeline, localization)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for localization config, pipeline docs, string tables; no game source editing or deployment tools)
- [ ] Model tier is Sonnet (default for specialists)
- [ ] Agent definition does not claim authority over narrative content, game code implementation, or translation quality
---
## Test Cases
### Case 1: In-domain request — string extraction pipeline for a Unity project
**Input**: "Set up a string extraction pipeline for our Unity game. We need to get all localizable strings into a format translators can work with."
**Expected behavior**:
- Produces a concrete extraction configuration covering: which string types to extract (UI labels, dialogue, item descriptions — not debug strings), the tool to use (e.g., Unity Localization package string tables, or a custom extraction script targeting specific component types), and the output format (CSV, XLIFF, or TMX — notes which formats are compatible with common TMS tools like Crowdin or Lokalise)
- Specifies the folder structure: e.g., `assets/localization/en/` as the source locale, `assets/localization/{locale}/` for translated files
- Notes that string keys must be stable (do not use index-based keys) — key changes break all existing translations
- Does NOT produce Unity C# code for the i18n implementation — marks as [TO BE IMPLEMENTED BY PROGRAMMER]
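A reviewer checking this case may find it useful to see what a passing extraction output could look like. The sketch below is illustrative only — the keys, file format, and helper name are assumptions, not the agent's required output; it shows the shape of a CSV string table with stable, human-assigned keys:

```python
import csv
import io

# Hypothetical extracted strings: stable, human-assigned keys (never index-based),
# so renaming or reordering source text does not invalidate existing translations.
extracted = [
    {"key": "ui.button.confirm", "source_en": "Confirm", "context": "Generic confirm button"},
    {"key": "item.health_potion.desc", "source_en": "Restores 50 HP.", "context": "Item tooltip"},
]

def write_string_table(rows):
    """Serialize extracted strings to a CSV string table for TMS import."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["key", "source_en", "context"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(write_string_table(extracted))
```

XLIFF or TMX output would follow the same principle: the key column is the contract with the TMS, so it must never change once translations exist.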
### Case 2: Out-of-domain request — translate game dialogue
**Input**: "Translate the following English dialogue into French: 'Well met, traveler. The road ahead is treacherous.'"
**Expected behavior**:
- Does not produce a French translation
- States clearly: "localization-lead owns the pipeline, quality standards, and workflow; actual translation work is performed by human translators or approved translation vendors — I am not a translator"
- Optionally notes what information a translator would need: context (who is speaking, to whom, game genre/tone), character limit constraints if any, glossary terms (e.g., if "traveler" has a game-specific translation)
### Case 3: Domain boundary — missing plural forms in Russian locale
**Input**: "Our Russian locale files only have a singular form for item quantity strings. Russian requires multiple plural forms (1 item, 2-4 items, 5+ items use different forms)."
**Expected behavior**:
- Identifies this as a locale-specific plural form gap: Russian has 3 cardinal plural categories for integers (one, few, many) per CLDR/Unicode plural rules — a single string is insufficient
- Flags it as a localization quality bug, not a minor style issue — incorrect plural forms are grammatically wrong and visible to players
- Recommends the fix: update the string extraction format to support CLDR plural categories (one/few/many/other), and flag to the translation vendor that Russian strings need all plural forms
- Notes which other languages in the pipeline also require plural form support (e.g., Polish, Czech, Arabic)
- Does NOT suggest using a numeric threshold workaround as a substitute for proper CLDR plural support
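To make the expected fix concrete, the CLDR cardinal rules for Russian integers can be sketched directly; the Russian example strings below are illustrative, not vendor-approved translations:

```python
def russian_plural_category(n: int) -> str:
    """CLDR cardinal plural category for Russian integers: one / few / many."""
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"          # 1, 21, 31, ... but not 11
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"          # 2-4, 22-24, ... but not 12-14
    return "many"             # 0, 5-20, 25-30, ...

# The string table must carry one form per category the locale uses.
ru_item_count = {
    "one": "{n} предмет",
    "few": "{n} предмета",
    "many": "{n} предметов",
}

for n in (1, 2, 5, 11, 21):
    print(n, russian_plural_category(n))
```

This is why a numeric threshold workaround fails: 21 takes the "one" form while 11 takes "many", so no single cutoff reproduces the grammar.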
### Case 4: String key naming conflict between two systems
**Input**: "Our UI system uses keys like 'button_confirm' and 'button_cancel'. Our dialogue system uses 'confirm' and 'cancel' for the same concepts. Translators are confused about which to use."
**Expected behavior**:
- Identifies the conflict: two systems use different key naming conventions for semantically identical strings, creating duplicate translation work and translator confusion
- Produces a naming convention resolution: domain-prefixed keys with a consistent separator (e.g., `ui.button.confirm`, `ui.button.cancel`) — all systems use the same key for shared concepts
- Recommends that shared UI primitives (Confirm, Cancel, Back, OK) use a single canonical key in a shared namespace, referenced by both systems
- Provides a migration path: map old keys to new keys, update all string references in both systems, deprecate old keys after one release cycle
- Does NOT recommend maintaining two separate keys for the same concept
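The migration path the agent should describe can be pictured as a simple old-to-new key map; the specific key names below are hypothetical examples of the conflict in this case:

```python
# Hypothetical legacy→canonical key map for the migration step. Both systems
# resolve to the same canonical key for semantically identical strings.
KEY_MIGRATION = {
    "button_confirm": "ui.button.confirm",  # old UI-system key
    "button_cancel": "ui.button.cancel",
    "confirm": "ui.button.confirm",         # old dialogue-system key, same concept
    "cancel": "ui.button.cancel",
}

def migrate_key(key: str) -> str:
    """Resolve a legacy key to its canonical replacement (identity if already canonical)."""
    return KEY_MIGRATION.get(key, key)
```

Running all string references in both systems through such a map, then deprecating the legacy keys after one release cycle, is the pattern the expected answer describes.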
### Case 5: Context pass — pipeline accommodates RTL languages
**Input context**: Target locales include English (en), French (fr), German (de), Arabic (ar), and Hebrew (he).
**Input**: "Design the localization pipeline for this project."
**Expected behavior**:
- Identifies Arabic and Hebrew as RTL languages — explicitly calls this out as a pipeline requirement
- Designs the pipeline to include: RTL text rendering support (flag for programmer: UI must support RTL layout mirroring), bidirectional (bidi) text handling in string tables, locale-specific testing checklist entry for RTL layout
- Does NOT design a pipeline that only accounts for LTR languages when RTL locales are specified
- Notes that Arabic also requires a different plural form structure (6 plural categories in CLDR) — flags for translation vendor
- Output includes all five locales in the pipeline architecture, not just the default (en)
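The RTL check the agent is expected to make is mechanical once the locale list is in context; a minimal sketch (the function and flag names are illustrative, and the RTL set is a common subset, not exhaustive):

```python
# ISO 639-1 codes of common RTL languages; illustrative subset, not exhaustive.
RTL_LOCALES = {"ar", "he", "fa", "ur"}

def pipeline_flags(locales):
    """Derive pipeline-wide requirements from the target locale list."""
    return {
        "locales": list(locales),
        "needs_rtl_support": any(code in RTL_LOCALES for code in locales),
        "rtl_locales": [code for code in locales if code in RTL_LOCALES],
    }

flags = pipeline_flags(["en", "fr", "de", "ar", "he"])
```

A pipeline design that omits the `needs_rtl_support` equivalent for this locale list fails the case.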
---
## Protocol Compliance
- [ ] Stays within declared domain (pipeline, extraction, string quality, locale formats, i18n architecture)
- [ ] Does not produce translations — redirects translation work to human translators/vendors
- [ ] Flags locale-specific gaps (plural forms, RTL) as quality bugs requiring pipeline changes
- [ ] Produces a unified key naming convention when conflicts arise — does not accept dual conventions
- [ ] Incorporates all provided target locales, including RTL languages, into pipeline design
---
## Coverage Notes
- Case 3 (plural forms) and Case 5 (RTL) are locale-correctness tests — these affect shipping quality in non-English markets
- Case 4 (key naming conflict) is a pipeline hygiene test — duplicate keys cause ongoing translator confusion and cost
- Case 5 requires the target locale list to be in context; if not provided, agent should ask before designing the pipeline
- No automated runner; review manually or via `/skill-test`


@@ -0,0 +1,80 @@
# Agent Test Spec: release-manager
## Agent Summary
- **Domain**: Release pipeline management, platform certification checklists (Nintendo, Sony, Microsoft, Apple, Google), store submission workflows, platform technical requirements compliance, semantic version numbering, release branch management
- **Does NOT own**: Game design decisions, QA test strategy or test case design (qa-lead), QA test execution (qa-tester), build infrastructure (devops-engineer)
- **Model tier**: Sonnet
- **Gate IDs**: May be invoked by `/gate-check` during Release phase; LAUNCH BLOCKED verdict is release-manager's primary escalation output
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references release pipeline, certification, store submission)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for production/releases/ directory; no game source or test tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over QA strategy, game design, or build infrastructure
---
## Test Cases
### Case 1: In-domain request — platform certification checklist for Nintendo Switch
**Input**: "Generate the certification checklist for our Nintendo Switch submission."
**Expected behavior**:
- Produces a structured checklist covering Nintendo Lotcheck requirements relevant to the game type
- Includes categories: content rating (CERO/PEGI/ESRB as applicable), save data handling, offline mode compliance, error handling (lost connectivity, storage full), controller requirement (Joy-Con, Pro Controller support), sleep/wake behavior, screenshot/video capture compliance
- Formats output as a numbered checklist with pass/fail columns
- Notes that Nintendo's full Lotcheck guidelines require a licensed developer account to access and flags any items that require manual verification against the current guidelines document
- Does NOT produce fabricated requirement IDs — uses known public requirements or clearly marks uncertainty
### Case 2: Out-of-domain request — design test cases
**Input**: "Write test cases for our save system to make sure it passes certification."
**Expected behavior**:
- Does not produce test case specifications
- States clearly: "Test case design is owned by qa-lead (strategy) and qa-tester (execution); I can provide the certification requirements that the save system must meet, which qa-lead can then use to design tests"
- Optionally offers to list the save-system-relevant certification requirements
### Case 3: Domain boundary — certification failure (rating issue)
**Input**: "Our build was rejected by the ESRB. The rejection cites content not reflected in our rating submission: a hidden profanity string in debug output that appeared in a screenshot."
**Expected behavior**:
- Issues a LAUNCH BLOCKED verdict with the specific platform requirement referenced (ESRB submission accuracy requirement)
- Identifies the immediate action required: locate and remove all debug output containing inappropriate content before resubmission
- Notes the resubmission process: corrected build must be resubmitted with updated content descriptor if needed
- Does NOT minimize the issue — a certification rejection is a blocking event, not an advisory
- Escalates to producer: documents the delay impact on release timeline
### Case 4: Version numbering conflict — hotfix vs. release branch
**Input**: "Our release branch is at v1.2.0. A hotfix was applied directly on main and tagged v1.2.1. Now the release branch also has changes that need to ship as v1.2.1 but they're different changes."
**Expected behavior**:
- Identifies the conflict: two different changesets have been assigned the same version tag
- Applies semantic versioning resolution: one must be re-tagged — the release branch changes should become v1.2.2 if v1.2.1 is already published; if v1.2.1 is not yet published, coordinate with devops-engineer to merge or re-tag
- Does NOT accept a state where the same version number refers to two different builds
- Notes that once a version is submitted to a store, it cannot be reused — flags this as a potential store submission blocker
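The semantic versioning arithmetic the agent should apply is small enough to state exactly; a sketch, assuming plain `major.minor.patch` strings with no pre-release or build suffixes:

```python
def bump(version: str, part: str) -> str:
    """Bump a plain major.minor.patch version string; part is 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

In the Case 4 scenario, if v1.2.1 is already published, the release-branch changes become `bump("1.2.1", "patch")` — v1.2.2 — because a published version number can never be reassigned.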
### Case 5: Context pass — release date constraint and certification lead time
**Input context**: Target release date is 2026-06-01. Current date is 2026-04-06. Nintendo Lotcheck typically takes 4-6 weeks.
**Input**: "What should we prioritize on the certification checklist given our timeline?"
**Expected behavior**:
- Calculates the available window: ~8 weeks to release date; Nintendo Lotcheck at 4-6 weeks means submission must be ready by approximately 2026-04-20 to 2026-05-04 to allow for a potential resubmission cycle
- Flags that a single rejection cycle would consume the buffer — prioritizes items historically associated with Lotcheck rejections (save data, offline mode, error handling)
- Orders the checklist by certification lead time impact, not by perceived difficulty
- Does NOT produce a checklist that assumes first-pass certification — builds in resubmission time
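The date arithmetic in the expected answer can be checked mechanically; a sketch using the dates given in the input context (the function name is illustrative):

```python
from datetime import date, timedelta

def latest_submission_window(release: date, cert_weeks_min: int, cert_weeks_max: int):
    """Latest submission dates under worst-case and best-case certification lead time."""
    return (release - timedelta(weeks=cert_weeks_max),   # worst case: submit earliest
            release - timedelta(weeks=cert_weeks_min))   # best case: latest viable date

worst, best = latest_submission_window(date(2026, 6, 1), 4, 6)
# worst-case lead time → submit by 2026-04-20; best case → 2026-05-04
```

An agent answer that quotes dates outside this window, or that leaves no slack for a resubmission cycle on top of it, fails the case.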
---
## Protocol Compliance
- [ ] Stays within declared domain (release pipeline, certification checklists, version numbering, store submission)
- [ ] Redirects test case design requests to qa-lead/qa-tester without producing test specs
- [ ] Issues LAUNCH BLOCKED verdicts for certification failures — does not downgrade to advisory
- [ ] Applies semantic versioning correctly and flags version conflicts as store-blocking issues
- [ ] Uses provided timeline data to prioritize checklist items by certification lead time
---
## Coverage Notes
- Case 3 (LAUNCH BLOCKED verdict) is the most critical test — this agent's primary safety output is blocking bad launches
- Case 5 requires current date and release date context; verify the agent uses actual dates, not placeholder estimates
- Certification requirements change over time — flag if the agent produces specific requirement IDs that may be outdated
- No automated runner; review manually or via `/skill-test`