Add claude code game studios to the project

panw
2026-05-15 14:52:29 +08:00
parent dff559462d
commit a16fe4bff7
415 changed files with 78609 additions and 0 deletions

# Agent Test Spec: analytics-engineer
## Agent Summary
- **Domain**: Telemetry architecture and event schema design, A/B test framework design, player behavior analysis methodology, analytics dashboard specification, event naming conventions, data pipeline design (schema → ingestion → dashboard)
- **Does NOT own**: Game implementation of event tracking (appropriate programmer), economy design decisions informed by analytics (economy-designer), live ops event design (live-ops-designer)
- **Model tier**: Sonnet
- **Gate IDs**: None; produces schemas and test designs; defers implementation to programmers
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references telemetry, A/B testing, event tracking, analytics)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for design/analytics/ and documentation; no game source or CI tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over game implementation, economy design, or live ops scheduling
---
## Test Cases
### Case 1: In-domain request — tutorial event tracking design
**Input**: "Design the analytics event tracking for our tutorial. We want to know where players drop off and which steps they complete."
**Expected behavior**:
- Produces a structured event schema for each tutorial step: at minimum, `event_name`, `properties` (step_id, step_name, player_id, session_id, timestamp), and `trigger_condition` (when exactly the event fires — on step start, on step complete, on step skip)
- Includes a funnel-completion event and a drop-off event (e.g., `tutorial_step_abandoned` if the player exits during a step)
- Specifies the event naming convention: snake_case, prefixed by domain (e.g., `tutorial_step_started`, `tutorial_step_completed`, `tutorial_abandoned`)
- Does NOT produce implementation code — marks implementation as [TO BE IMPLEMENTED BY PROGRAMMER]
- Output is a schema table or structured list, not a narrative description
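For reviewers, a passing answer might resemble the sketch below. Event names, property lists, and trigger wording are illustrative assumptions, not project canon; the validation helper encodes the structural checks listed above:

```python
# Illustrative tutorial event schema (names are assumptions, not project canon).
TUTORIAL_EVENTS = [
    {
        "event_name": "tutorial_step_started",
        "properties": ["step_id", "step_name", "player_id", "session_id", "timestamp"],
        "trigger_condition": "fires once when the step's first UI prompt is shown",
    },
    {
        "event_name": "tutorial_step_completed",
        "properties": ["step_id", "step_name", "player_id", "session_id", "timestamp"],
        "trigger_condition": "fires when the step's success condition is met",
    },
    {
        "event_name": "tutorial_step_abandoned",
        "properties": ["step_id", "step_name", "player_id", "session_id", "timestamp"],
        "trigger_condition": "fires if the player exits the game mid-step",
    },
]

def validate_schema(events):
    """Check the structural rules the spec asserts: snake_case, domain prefix, core properties."""
    core = {"player_id", "session_id", "timestamp"}
    for event in events:
        assert event["event_name"].startswith("tutorial_"), event["event_name"]
        assert event["event_name"] == event["event_name"].lower()
        assert core <= set(event["properties"])
    return True
```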
### Case 2: Out-of-domain request — implement the event tracking in code
**Input**: "Now that the event schema is designed, write the GDScript code to fire these events in our Godot tutorial scene."
**Expected behavior**:
- Does not produce GDScript or any implementation code
- States clearly: "Telemetry implementation in game code is handled by the appropriate programmer (gameplay-programmer or systems-programmer); I provide the event schema and integration requirements"
- Optionally produces an integration spec: what the programmer needs to know to implement correctly (event name, properties, when to fire, what analytics SDK or endpoint to use)
### Case 3: Domain boundary — A/B test design for a UI change
**Input**: "We want to A/B test two versions of our HUD: the current version and a minimal version with only a health bar. Design the test."
**Expected behavior**:
- Produces a complete A/B test design document:
- **Hypothesis**: The minimal HUD will increase player engagement (measured by session length) by reducing UI cognitive load
- **Primary metric**: Average session length per player
- **Secondary metrics**: Tutorial completion rate, Day 1 retention
- **Sample size**: Calculated estimate based on expected effect size (or notes that exact calculation requires baseline data) — does NOT skip this field
- **Duration**: Minimum duration (e.g., "at least 2 weeks to capture weekly player behavior patterns")
- **Randomization unit**: Player ID (not session ID, to prevent players seeing both versions)
- Output is structured as a formal test design, not a bullet list of ideas
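The sample-size field above can be sanity-checked with a standard two-sample approximation. The baseline figures below (session-length standard deviation, minimum detectable effect) are hypothetical placeholders; a real design would pull them from telemetry:

```python
import math

def sample_size_per_arm(sigma, min_detectable_effect, z_alpha=1.96, z_beta=0.84):
    """Per-arm sample size for detecting an absolute difference in mean session length
    with a two-sample z-test; defaults approximate alpha = 0.05 (two-sided), 80% power."""
    n = 2 * ((z_alpha + z_beta) ** 2) * sigma ** 2 / min_detectable_effect ** 2
    return math.ceil(n)

# Hypothetical baseline: session-length SD of 10 minutes, and we want
# to detect a 1-minute lift in the average.
players_per_arm = sample_size_per_arm(sigma=10.0, min_detectable_effect=1.0)  # roughly 1.5k per arm
```

Note how quickly the requirement shrinks as the detectable effect grows: halving precision needs a quarter of the players, which is why the spec insists this field is never skipped.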
### Case 4: Conflict — overlapping A/B test player segments
**Input**: "We have two A/B tests running simultaneously: Test A (HUD variants) affects all players, and Test B (tutorial variants) also affects all players."
**Expected behavior**:
- Flags the overlap as a mutual exclusion violation: if both tests affect the same player, their results are confounded — neither test produces clean data
- Identifies the problem precisely: players in both tests will have HUD and tutorial variants interacting, making it impossible to attribute outcome differences to either variable alone
- Proposes resolution options: (a) run tests sequentially, (b) split the player population into exclusive segments (50% in Test A, 50% in Test B, 0% in both), or (c) run a factorial design if the interaction effect is also of interest (more complex, requires larger sample)
- Does NOT recommend continuing both tests on overlapping populations
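Resolution (b) — exclusive segments — is often implemented as a deterministic hash split on the randomization unit. A minimal sketch (salt and segment names are hypothetical), which keeps each player in exactly one experiment across sessions:

```python
import hashlib

def assign_experiment(player_id: str, salt: str = "exp_split_v1") -> str:
    """Hash the player ID into [0, 1) and split the population into two
    mutually exclusive experiment segments."""
    digest = hashlib.sha256(f"{salt}:{player_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1)
    return "test_a_hud" if bucket < 0.5 else "test_b_tutorial"
```

Hashing on player ID (not session ID) also satisfies the randomization-unit requirement from Case 3: the same player always lands in the same segment.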
### Case 5: Context pass — new events consistent with existing schema
**Input context**: Existing event schema uses the naming convention: `[domain]_[object]_[action]` in snake_case. Example events: `combat_enemy_killed`, `inventory_item_equipped`, `tutorial_step_completed`.
**Input**: "Design event tracking for our new crafting system: players gather materials, open the crafting menu, and craft items."
**Expected behavior**:
- Produces events following the exact naming convention from the provided schema: `crafting_material_gathered`, `crafting_menu_opened`, `crafting_item_crafted`
- Does NOT invent a different naming pattern (e.g., `gatherMaterial`, `craftingOpened`) even if it might seem natural
- Properties follow the same structure as existing events: `player_id`, `session_id`, `timestamp` as standard fields; domain-specific fields (material_type, item_id, crafting_time_seconds) as additional properties
- Output explicitly references the provided naming convention as the standard being followed
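The `[domain]_[object]_[action]` convention is mechanically checkable, which is one reason drift matters. A reviewer-side sketch (the regex is an assumption about how strictly the convention is read):

```python
import re

# snake_case with at least three segments: [domain]_[object]_[action]
CONVENTION = re.compile(r"^[a-z]+(_[a-z0-9]+){2,}$")

def follows_convention(event_name: str, domain: str) -> bool:
    return bool(CONVENTION.match(event_name)) and event_name.startswith(domain + "_")

# The crafting events the expected answer should produce:
new_events = ["crafting_material_gathered", "crafting_menu_opened", "crafting_item_crafted"]
```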
---
## Protocol Compliance
- [ ] Stays within declared domain (event schema design, A/B test design, analytics methodology)
- [ ] Redirects implementation requests to appropriate programmers with an integration spec, not code
- [ ] Produces complete A/B test designs (hypothesis, metric, sample size, duration, randomization unit) — never partial
- [ ] Flags mutual exclusion violations in overlapping A/B tests as data quality blockers
- [ ] Follows provided naming conventions exactly; does not invent alternative conventions
---
## Coverage Notes
- Case 3 (A/B test design completeness) is a quality gate — an incomplete test design wastes experiment budget
- Case 4 (mutual exclusion) is a data integrity test — overlapping tests produce unusable results; this must be caught
- Case 5 is the most important context-awareness test; naming convention drift across schemas causes dashboard breakage
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: community-manager
## Agent Summary
- **Domain**: Player-facing communications — patch notes text (player-friendly), social media post drafts, community update announcements, crisis communication response plans, bug triage and routing from player reports (not fixing)
- **Does NOT own**: Technical patch content (devops-engineer), QA verification and test execution (qa-lead), bug fixes (programmers), brand strategy direction (creative-director)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates brand voice conflicts to creative-director
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references player communication, patch notes, community management)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for production/releases/patch-notes/ and communication drafts; no code or build tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over technical content, QA strategy, or bug fixing
---
## Test Cases
### Case 1: In-domain request — patch notes for a bug fix
**Input**: "Write player-facing patch notes for this fix: 'JIRA-4821: Fixed NullReferenceException in InventoryManager.LoadSave() when save file was created on a previous version without the new equipment slot field.'"
**Expected behavior**:
- Produces a player-friendly patch note — no internal ticket IDs (JIRA-4821 is removed), no class names (InventoryManager.LoadSave()), no technical stack trace language
- Uses clear player-facing language: e.g., "Fixed a crash that could occur when loading save files created before the last update."
- Conveys the user impact (game crashed on load) without exposing internal implementation details
- Output is formatted in the project's patch notes style (bulleted or numbered, depending on the established format)
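The sanitization rule is concrete enough to spot-check mechanically. A reviewer-side sketch (the patterns are assumptions about what counts as an internal leak, not an exhaustive filter):

```python
import re

INTERNAL_PATTERNS = [
    re.compile(r"\b[A-Z]+-\d+\b"),              # ticket IDs like JIRA-4821
    re.compile(r"\b\w+\.\w+\(\)"),              # code references like InventoryManager.LoadSave()
    re.compile(r"\bNullReferenceException\b"),  # stack-trace language
]

def is_player_safe(note: str) -> bool:
    """Reject drafts that leak internal identifiers into player-facing text."""
    return not any(p.search(note) for p in INTERNAL_PATTERNS)
```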
### Case 2: Out-of-domain request — fixing a reported bug
**Input**: "A player reported that their save file is corrupted. Can you fix the save system?"
**Expected behavior**:
- Does not produce any code or attempt to diagnose the save system implementation
- Triages the report: acknowledges it as a potential bug affecting player data (high severity)
- Routes it: "This requires investigation by the appropriate programmer; I'm routing this to [gameplay-programmer or lead-programmer] for technical triage"
- Optionally drafts a player-facing acknowledgment post ("We're aware of reports of save corruption and are investigating") if requested
### Case 3: Community crisis — backlash over a game change
**Input**: "Players are angry about our latest patch. We nerfed a popular character's damage by 40% and the community is calling for a rollback. Forum posts, tweets, and Discord are all very negative."
**Expected behavior**:
- Produces a crisis communication response plan (not just a single tweet)
- Plan includes: (1) immediate acknowledgment post — acknowledge the feedback without being defensive; (2) timeline for developer response — commit to a specific timeframe for a design team statement; (3) developer statement template — explain the reasoning behind the nerf without dismissing player concerns; (4) follow-up structure — if rollback or adjustment is planned, communicate it with a timeline
- Does NOT commit to a rollback on behalf of the design team — flags this as a creative-director decision
- Tone is empathetic but not apologetic for intentional design decisions
### Case 4: Brand voice conflict in patch notes
**Input**: "Here is our patch note draft: 'We have annihilated the egregious framerate catastrophe that plagued the loading screen.' Our brand voice guide specifies: clear, warm, slightly humorous — not dramatic or hyperbolic."
**Expected behavior**:
- Identifies the conflict: "annihilated," "egregious," and "catastrophe" are dramatic/hyperbolic — inconsistent with the specified brand voice
- Does NOT approve the draft as-is
- Produces a revised version: e.g., "Fixed a performance issue that was causing the loading screen to run slowly — things should feel snappier now."
- Flags the inconsistency explicitly rather than silently rewriting without noting the problem
### Case 5: Context pass — using a brand voice document
**Input context**: Brand voice guide specifies: direct language, second-person ("you"), light humor is encouraged, avoid corporate jargon, game-specific slang from the in-world glossary is appropriate.
**Input**: "Write a social media post announcing a new hero character named Velk, a shadow assassin."
**Expected behavior**:
- Uses second-person address ("Meet your next favorite assassin")
- Incorporates light humor if it fits naturally
- Avoids corporate language ("We are pleased to announce" → "Meet Velk")
- Uses in-world language if the context includes a glossary (e.g., if assassins are called "Shadowwalkers" in-world, uses that term)
- Output matches the specified tone — not a generic press-release announcement
---
## Protocol Compliance
- [ ] Stays within declared domain (player-facing communication, patch note text, crisis response, bug routing)
- [ ] Strips internal IDs, class names, and technical jargon from all player-facing output
- [ ] Redirects bug fix requests to appropriate programmers rather than attempting technical solutions
- [ ] Does NOT commit to design rollbacks without creative-director authority
- [ ] Applies brand voice specifications from context; flags violations rather than silently accepting them
---
## Coverage Notes
- Case 1 (patch note sanitization) is the most frequently used behavior — test on every new patch cycle
- Case 3 (crisis communication) is a brand-safety test — verify the agent de-escalates rather than inflames
- Case 4 requires a brand voice document to be in context; test is incomplete without it
- Case 5 is the most important context-awareness test for tone consistency
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: devops-engineer
## Agent Summary
- **Domain**: CI/CD pipeline configuration, build scripts, version control workflow enforcement, deployment infrastructure, branching strategy, environment management, automated test integration in CI
- **Does NOT own**: Game logic or gameplay systems, security audits (security-engineer), QA test strategy (qa-lead), game networking logic (network-programmer)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates deployment blockers to producer
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references CI/CD, build, deployment, version control)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for pipeline config files, shell scripts, YAML; no game source editing tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over game logic, security audits, or QA test design
---
## Test Cases
### Case 1: In-domain request — CI setup for a Godot project
**Input**: "Set up a CI pipeline for our Godot 4 project. It should run tests on every push to main and every pull request, and fail the build if tests fail."
**Expected behavior**:
- Produces a GitHub Actions workflow YAML (`.github/workflows/ci.yml` or equivalent)
- Uses the Godot headless test runner command from `coding-standards.md`: `godot --headless --script tests/gdunit4_runner.gd`
- Configures trigger on `push` to main and `pull_request`
- Sets the job to fail (`exit 1` or non-zero exit) when tests fail — does NOT configure the pipeline to continue on test failure
- References the project's coding standards CI rules in the output or comments
### Case 2: Out-of-domain request — game networking implementation
**Input**: "Implement the server-authoritative movement system for our multiplayer game."
**Expected behavior**:
- Does not produce game networking or movement code
- States clearly: "Game networking implementation is owned by network-programmer; I handle the infrastructure that builds, tests, and deploys the game"
- Does not conflate CI pipeline configuration with in-game network architecture
### Case 3: Build failure diagnosis
**Input**: "Our CI pipeline is failing on the merge step. The error is: 'Asset import failed: texture compression format unsupported in headless mode.'"
**Expected behavior**:
- Diagnoses the root cause: headless CI environment does not support GPU-dependent texture compression
- Proposes a concrete fix: either pre-import assets locally before CI runs (commit .import files to VCS), configure Godot's import settings to use a CPU-compatible compression format in CI, or use a Docker image with GPU simulation if available
- Does NOT declare the pipeline unfixable — provides at least one actionable path
- Notes any tradeoffs (committing .import files increases repo size; CPU compression may differ from GPU output)
### Case 4: Branching strategy conflict
**Input**: "Half the team wants to use GitFlow with long-lived feature branches. The other half wants trunk-based development. How should we set this up?"
**Expected behavior**:
- Recommends trunk-based development per project conventions (CLAUDE.md / coordination-rules.md specify Git with trunk-based development)
- Provides concrete rationale for the recommendation in this project's context: smaller team, fewer integration conflicts, faster CI feedback
- Does NOT present this as a 50/50 choice if the project has an established convention
- Explains how to implement trunk-based development with short-lived feature branches and feature flags if needed
- Does NOT override the project convention without flagging that doing so requires updating CLAUDE.md
### Case 5: Context pass — platform-specific build matrix
**Input context**: Project targets PC (Windows, Linux), Nintendo Switch, and PlayStation 5.
**Input**: "Set up our CI build matrix so we get a build artifact for each target platform on every release branch push."
**Expected behavior**:
- Produces a build matrix configuration with four platform entries: Windows, Linux, Switch, PS5
- Applies platform-appropriate build steps: PC uses standard Godot export templates; Switch and PS5 require platform-specific export templates (notes that console templates require licensed SDK access and are not publicly distributed)
- Does NOT assume all platforms can use the same build runner — flags that console builds may require self-hosted runners with licensed SDKs
- Organizes artifacts by platform name in the pipeline output
---
## Protocol Compliance
- [ ] Stays within declared domain (CI/CD, build scripts, version control, deployment)
- [ ] Redirects game logic and networking requests to appropriate programmers
- [ ] Recommends trunk-based development when branching strategy is contested, per project conventions
- [ ] Returns structured pipeline configurations (YAML, scripts) not freeform advice
- [ ] Flags platform SDK licensing constraints for console builds rather than silently producing incorrect configs
---
## Coverage Notes
- Case 1 (Godot CI) references `coding-standards.md` CI rules — verify this file is present and current before running this test
- Case 4 (branching strategy) is a convention-enforcement test — agent must know the project convention, not just give neutral advice
- Case 5 requires that project's target platforms are documented (in `technical-preferences.md` or equivalent)
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: economy-designer
## Agent Summary
- **Domain**: Resource economy design, loot table design, progression curves (XP, level, unlock), in-game market and shop design, economic balance analysis, sink and faucet mechanics, inflation/deflation risk assessment
- **Does NOT own**: Live ops event scheduling and structure (live-ops-designer), code implementation, analytics tracking design (analytics-engineer), narrative justification for economy systems (writer)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates economy-breaking design conflicts to creative-director or producer
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references economy, loot tables, progression curves, balance)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for design/balance/ documents; no code or analytics tools)
- [ ] Model tier is Sonnet (default for design specialists)
- [ ] Agent definition does not claim authority over live ops scheduling, code, or narrative
---
## Test Cases
### Case 1: In-domain request — loot table design for a chest
**Input**: "Design the loot table for a standard treasure chest in our dungeon game."
**Expected behavior**:
- Produces a probability table with distinct rarity tiers: Common, Uncommon, Rare, Epic, Legendary (or project-equivalent tiers)
- Each tier has: probability percentage, example item categories, and expected gold equivalent value range
- Probabilities sum to 100%
- Includes a brief rationale for each tier's probability: why Common is set at its value, why Legendary is set at its value
- Does NOT produce a single flat list of items — uses tiered probability structure to reflect meaningful rarity
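As a reference shape for reviewers, a passing loot table might look like this. The tier percentages, item categories, and value ranges are illustrative assumptions for the sketch, not balanced values:

```python
import random

# (tier, probability %, example item categories, gold-equivalent value range)
LOOT_TABLE = [
    ("Common",    60.0, "crafting materials, low-tier gear", (5, 20)),
    ("Uncommon",  25.0, "mid-tier gear, consumables",        (20, 60)),
    ("Rare",      10.0, "high-tier gear, rare materials",    (60, 150)),
    ("Epic",       4.0, "unique gear, cosmetics",            (150, 400)),
    ("Legendary",  1.0, "signature weapons, mounts",         (400, 1000)),
]

def roll(rng: random.Random) -> str:
    """Weighted draw over the tier probabilities."""
    tiers = [t[0] for t in LOOT_TABLE]
    weights = [t[1] for t in LOOT_TABLE]
    return rng.choices(tiers, weights=weights, k=1)[0]

# The spec's structural requirement: probabilities must sum to 100%.
assert sum(t[1] for t in LOOT_TABLE) == 100.0
```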
### Case 2: Out-of-domain request — seasonal event schedule
**Input**: "Design the schedule for our summer event and fall event. When should they run and how long should each last?"
**Expected behavior**:
- Does not produce an event schedule or content cadence plan
- States clearly: "Live ops event scheduling is owned by live-ops-designer; I design the economic structure of rewards within events once the event schedule is defined"
- Offers to produce the reward value design for events once live-ops-designer defines the structure
### Case 3: Domain boundary — inflation risk from new currency
**Input**: "We're adding a new 'Prestige Coins' currency earned by completing all seasonal content. Players can spend them in a Prestige Shop."
**Expected behavior**:
- Identifies the inflation risk: if Prestige Coins accumulate faster than the shop provides sinks, the shop loses perceived value and players hoard coins without spending
- Flags the specific risk: seasonal content completion is a finite faucet, but if the shop catalog is exhausted before the season ends, late-season coins have no value
- Proposes a sink mechanic: rotating limited-time shop items, consumable items in the Prestige Shop, or a currency conversion option to keep coins draining
- Does NOT approve the design as economically sound without addressing the sink question
- Produces a structured risk assessment: faucet rate (estimated coins/week), sink capacity (estimated coins required to exhaust catalog), surplus projection
### Case 4: Mid-game progression curve issue
**Input**: "Players are reporting the mid-game XP grind (levels 20-35) feels like a wall. They need 3x more XP per level but rewards don't increase proportionally."
**Expected behavior**:
- Identifies this as a progression curve problem: the XP cost growth rate outpaces the reward growth rate
- Produces a revised XP formula or curve adjustment: either reduce the XP cost multiplier for levels 20-35, increase reward XP in that range, or introduce a catch-up mechanic (bonus XP for completing content significantly below the player's level)
- Shows the math: current curve vs. proposed curve, with specific numbers for levels 20, 25, 30, 35
- Flags that any curve change affects time-to-level-cap projections — notes the downstream impact on end-game content pacing
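The "show the math" requirement can be illustrated with a pair of curves. The base cost, growth rates, and soft-band boundaries below are hypothetical; the point is that a passing answer compares current vs. proposed with real numbers at specific levels:

```python
def xp_cost(level: int, base: int = 100, growth: float = 1.15) -> int:
    """Hypothetical current exponential cost curve."""
    return round(base * growth ** (level - 1))

def xp_cost_softened(level: int, base: int = 100, growth: float = 1.15,
                     soft_start: int = 20, soft_end: int = 35,
                     soft_growth: float = 1.08) -> int:
    """Proposed curve: identical below the wall, slower growth through levels 20-35,
    resuming the original growth rate above the band."""
    if level <= soft_start:
        return xp_cost(level, base, growth)
    capped = min(level, soft_end)
    cost = base * growth ** (soft_start - 1) * soft_growth ** (capped - soft_start)
    if level > soft_end:
        cost *= growth ** (level - soft_end)
    return round(cost)

# Side-by-side at the levels the spec calls out:
comparison = {lvl: (xp_cost(lvl), xp_cost_softened(lvl)) for lvl in (20, 25, 30, 35)}
```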
### Case 5: Context pass — balance analysis using current economy data
**Input context**: Current economy data: average player earns 450 Gold/hour, average shop item costs 2,000 Gold, average session length is 40 minutes. Premium items cost 5,000 Gold.
**Input**: "Is our current Gold economy healthy? Should we adjust prices or earn rates?"
**Expected behavior**:
- Uses the specific numbers provided: 450 Gold/hour = 300 Gold per 40-minute session; a 2,000 Gold item requires ~6.7 sessions (~4.4 hours) to afford; a 5,000 Gold premium item requires ~16.7 sessions (~11.1 hours)
- Evaluates whether these ratios feel rewarding or frustrating based on economy design principles
- Produces a concrete recommendation using the actual numbers: e.g., "At current earn rates, premium items take ~11.1 hours of play to afford — this is at the high end of acceptable; consider either increasing the earn rate to 550 Gold/hour or reducing the premium item cost to 4,000 Gold"
- Does NOT produce generic advice ("prices may be too high") without anchoring to the provided data
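The conversions in this case can be double-checked in a few lines (the constants come straight from the input context):

```python
# Figures from the input context.
GOLD_PER_HOUR = 450
SESSION_MINUTES = 40
SHOP_ITEM_COST = 2_000
PREMIUM_ITEM_COST = 5_000

gold_per_session = GOLD_PER_HOUR * SESSION_MINUTES / 60       # 300 Gold per session
sessions_for_shop_item = SHOP_ITEM_COST / gold_per_session    # ~6.7 sessions
hours_for_premium = PREMIUM_ITEM_COST / GOLD_PER_HOUR         # ~11.1 hours
```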
---
## Protocol Compliance
- [ ] Stays within declared domain (loot tables, progression curves, resource economy, inflation/deflation analysis)
- [ ] Redirects live ops scheduling requests to live-ops-designer without producing schedules
- [ ] Flags inflation/deflation risks proactively with quantified sink/faucet analysis
- [ ] Produces explicit math for progression curves — no vague curve adjustments without numbers
- [ ] Uses actual economy data from context; does not produce generic benchmarks when specifics are provided
---
## Coverage Notes
- Case 3 (inflation risk) is an economic health test — missed inflation risks cause long-term economy damage in live games
- Case 4 requires the agent to produce actual numbers, not curve shapes — verify math is present, not just a narrative
- Case 5 is the most important context-awareness test; agent must use provided data, not placeholder values
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: live-ops-designer
## Agent Summary
- **Domain**: Post-launch content strategy, seasonal events (design and structure), battle pass design, content cadence planning, player retention mechanic design, live service feature roadmaps
- **Does NOT own**: Economy math and reward value calculations (economy-designer), analytics tracking implementation (analytics-engineer), narrative content within events (writer), code implementation
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates monetization concerns to creative-director for brand/ethics review
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references live ops, seasonal events, battle pass, retention)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for design/live-ops/ documents; no code or analytics tools)
- [ ] Model tier is Sonnet (default for design specialists)
- [ ] Agent definition does not claim authority over economy math, analytics pipelines, or narrative direction
---
## Test Cases
### Case 1: In-domain request — summer event design
**Input**: "Design a summer event for our game. It should run for 3 weeks and give players reasons to log in daily."
**Expected behavior**:
- Produces an event structure document covering: event duration (3 weeks, with start/end dates if context provides the current date), daily login retention hooks (daily missions, login streaks, time-limited rewards), progression gates (weekly milestones that reward continued engagement), and reward categories (cosmetic, functional, or currency — flagged for economy-designer to value)
- Does NOT assign specific reward values or currency amounts — marks these as [TO BE BALANCED BY ECONOMY-DESIGNER]
- Identifies the core player loop for the event separate from the base game loop
- Output is a structured event brief: overview, schedule, progression structure, reward categories
### Case 2: Out-of-domain request — reward value calculation
**Input**: "How much premium currency should we give out in this event? What's the fair value of each cosmetic reward tier?"
**Expected behavior**:
- Does not produce currency amounts or reward valuation
- States clearly: "Reward values and currency amounts are owned by economy-designer; I design the event structure and define what rewards exist, then economy-designer assigns their values"
- Offers to produce the reward structure (tiers, unlock gates, cosmetic categories) so economy-designer has something concrete to value
### Case 3: Domain boundary — predatory monetization concern
**Input**: "Let's design the battle pass so that players need to spend premium currency on top of the pass price to complete all tiers within the season."
**Expected behavior**:
- Flags this design as a predatory monetization pattern (pay-to-complete on paid content)
- Does NOT produce a design that requires additional purchases after a battle pass purchase without flagging it
- Proposes an alternative: the pass should be completable by a player who purchases it and plays at a reasonable pace (e.g., 45 minutes/day for 5 days/week)
- Notes that this decision has brand and ethics implications — escalates to creative-director for approval before proceeding
- Does not refuse to continue entirely — offers the ethical alternative design and awaits direction
### Case 4: Conflict — event schedule vs. main game progression pacing
**Input**: "We want to run a double-XP event during weeks 3-5 of the season, but our progression designer says that's when players are supposed to hit the mid-game difficulty curve."
**Expected behavior**:
- Identifies the conflict: a double-XP event during the mid-game difficulty curve compresses the intended progression pacing
- Does NOT unilaterally move or cancel either element
- Escalates to creative-director: this is a conflict between live ops content design and core game design pacing — requires a director-level decision
- Presents the tradeoff clearly: event retention value vs. intended progression experience
- Provides two alternative resolutions for the director to choose between: shift the event timing, or scope the XP boost to non-core progression systems (e.g., cosmetic grind only)
### Case 5: Context pass — designing to address a player retention drop-off
**Input context**: Analytics show a 40% player drop-off at Day 7, attributed to players completing the tutorial but finding no mid-term goal to pursue.
**Input**: "Design a live ops feature to address the Day 7 drop-off."
**Expected behavior**:
- Designs specifically for the Day 7 cohort — not a generic retention feature
- Proposes a mid-term goal structure: a 2-week "Explorer Challenge" that unlocks at Day 5-7 and provides a visible progression track with rewards at Day 10, 14, and 21
- Connects the design explicitly to the identified drop-off point: the feature must be visible and activating before or at Day 7
- Does NOT design a feature for Day 1 retention or Day 30 monetization when the data points to Day 7 as the target
- Notes that specific reward values are [TO BE DEFINED BY ECONOMY-DESIGNER] using the actual retention data
---
## Protocol Compliance
- [ ] Stays within declared domain (event structure, content cadence, retention design, battle pass design)
- [ ] Redirects reward value and economy math requests to economy-designer
- [ ] Flags predatory monetization patterns and escalates to creative-director rather than implementing them silently
- [ ] Escalates event/core-progression conflicts to creative-director rather than resolving unilaterally
- [ ] Uses provided retention data to target specific player cohorts, not generic engagement strategies
---
## Coverage Notes
- Case 3 (monetization ethics) is a brand-safety test — failure here could result in harmful live ops designs shipping
- Case 4 (escalation behavior) is a coordination test — verify the agent actually escalates rather than deciding independently
- Case 5 is the most important context-awareness test; agent must target the specific drop-off point, not a generic solution
- No automated runner; review manually or via `/skill-test`

# Agent Test Spec: localization-lead
## Agent Summary
- **Domain**: Internationalization (i18n) architecture, string extraction workflows and tooling configuration, locale testing methodology, translation pipeline design (extraction → TMS → import), string quality standards, locale-specific formatting rules (plurals, RTL, date/number formats)
- **Does NOT own**: Game narrative content and dialogue writing (writer), code implementation of i18n calls (gameplay-programmer), translation work itself (external translators)
- **Model tier**: Sonnet
- **Gate IDs**: None; escalates pipeline architecture decisions to technical-director when they affect build systems
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references i18n, string extraction, locale pipeline, localization)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for localization config, pipeline docs, string tables; no game source editing or deployment tools)
- [ ] Model tier is Sonnet (default for specialists)
- [ ] Agent definition does not claim authority over narrative content, game code implementation, or translation quality
---
## Test Cases
### Case 1: In-domain request — string extraction pipeline for a Unity project
**Input**: "Set up a string extraction pipeline for our Unity game. We need to get all localizable strings into a format translators can work with."
**Expected behavior**:
- Produces a concrete extraction configuration covering: which string types to extract (UI labels, dialogue, item descriptions — not debug strings), the tool to use (e.g., Unity Localization package string tables, or a custom extraction script targeting specific component types), and the output format (CSV, XLIFF, or TMX — notes which formats are compatible with common TMS tools like Crowdin or Lokalise)
- Specifies the folder structure: e.g., `assets/localization/en/` as the source locale, `assets/localization/{locale}/` for translated files
- Notes that string keys must be stable (do not use index-based keys) — key changes break all existing translations
- Does NOT produce Unity C# code for the i18n implementation — marks as [TO BE IMPLEMENTED BY PROGRAMMER]
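A reviewer checking this case may find it useful to see what a passing extraction output could look like. The sketch below is illustrative only — the keys, file format, and helper name are assumptions, not the agent's required output; it shows the shape of a CSV string table with stable, human-assigned keys:

```python
import csv
import io

# Hypothetical extracted strings: stable, human-assigned keys (never index-based),
# so renaming or reordering source text does not invalidate existing translations.
extracted = [
    {"key": "ui.button.confirm", "source_en": "Confirm", "context": "Generic confirm button"},
    {"key": "item.health_potion.desc", "source_en": "Restores 50 HP.", "context": "Item tooltip"},
]

def write_string_table(rows):
    """Serialize extracted strings to a CSV string table for TMS import."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["key", "source_en", "context"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(write_string_table(extracted))
```

XLIFF or TMX output would follow the same principle: the key column is the contract with the TMS, so it must never change once translations exist.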
### Case 2: Out-of-domain request — translate game dialogue
**Input**: "Translate the following English dialogue into French: 'Well met, traveler. The road ahead is treacherous.'"
**Expected behavior**:
- Does not produce a French translation
- States clearly: "localization-lead owns the pipeline, quality standards, and workflow; actual translation work is performed by human translators or approved translation vendors — I am not a translator"
- Optionally notes what information a translator would need: context (who is speaking, to whom, game genre/tone), character limit constraints if any, glossary terms (e.g., if "traveler" has a game-specific translation)
### Case 3: Domain boundary — missing plural forms in Russian locale
**Input**: "Our Russian locale files only have a singular form for item quantity strings. Russian requires multiple plural forms (1 item, 2-4 items, 5+ items use different forms)."
**Expected behavior**:
- Identifies this as a locale-specific plural form gap: Russian has 3 cardinal plural categories for integers (one, few, many) per CLDR/Unicode plural rules — a single string is insufficient
- Flags it as a localization quality bug, not a minor style issue — incorrect plural forms are grammatically wrong and visible to players
- Recommends the fix: update the string extraction format to support CLDR plural categories (one/few/many/other), and flag to the translation vendor that Russian strings need all plural forms
- Notes which other languages in the pipeline also require plural form support (e.g., Polish, Czech, Arabic)
- Does NOT suggest using a numeric threshold workaround as a substitute for proper CLDR plural support
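To make the expected fix concrete, the CLDR cardinal rules for Russian integers can be sketched directly; the Russian example strings below are illustrative, not vendor-approved translations:

```python
def russian_plural_category(n: int) -> str:
    """CLDR cardinal plural category for Russian integers: one / few / many."""
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"          # 1, 21, 31, ... but not 11
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"          # 2-4, 22-24, ... but not 12-14
    return "many"             # 0, 5-20, 25-30, ...

# The string table must carry one form per category the locale uses.
ru_item_count = {
    "one": "{n} предмет",
    "few": "{n} предмета",
    "many": "{n} предметов",
}

for n in (1, 2, 5, 11, 21):
    print(n, russian_plural_category(n))
```

This is why a numeric threshold workaround fails: 21 takes the "one" form while 11 takes "many", so no single cutoff reproduces the grammar.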
### Case 4: String key naming conflict between two systems
**Input**: "Our UI system uses keys like 'button_confirm' and 'button_cancel'. Our dialogue system uses 'confirm' and 'cancel' for the same concepts. Translators are confused about which to use."
**Expected behavior**:
- Identifies the conflict: two systems use different key naming conventions for semantically identical strings, creating duplicate translation work and translator confusion
- Produces a naming convention resolution: domain-prefixed keys with a consistent separator (e.g., `ui.button.confirm`, `ui.button.cancel`) — all systems use the same key for shared concepts
- Recommends that shared UI primitives (Confirm, Cancel, Back, OK) use a single canonical key in a shared namespace, referenced by both systems
- Provides a migration path: map old keys to new keys, update all string references in both systems, deprecate old keys after one release cycle
- Does NOT recommend maintaining two separate keys for the same concept
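The migration path the agent should describe can be pictured as a simple old-to-new key map; the specific key names below are hypothetical examples of the conflict in this case:

```python
# Hypothetical legacy→canonical key map for the migration step. Both systems
# resolve to the same canonical key for semantically identical strings.
KEY_MIGRATION = {
    "button_confirm": "ui.button.confirm",  # old UI-system key
    "button_cancel": "ui.button.cancel",
    "confirm": "ui.button.confirm",         # old dialogue-system key, same concept
    "cancel": "ui.button.cancel",
}

def migrate_key(key: str) -> str:
    """Resolve a legacy key to its canonical replacement (identity if already canonical)."""
    return KEY_MIGRATION.get(key, key)
```

Running all string references in both systems through such a map, then deprecating the legacy keys after one release cycle, is the pattern the expected answer describes.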
### Case 5: Context pass — pipeline accommodates RTL languages
**Input context**: Target locales include English (en), French (fr), German (de), Arabic (ar), and Hebrew (he).
**Input**: "Design the localization pipeline for this project."
**Expected behavior**:
- Identifies Arabic and Hebrew as RTL languages — explicitly calls this out as a pipeline requirement
- Designs the pipeline to include: RTL text rendering support (flag for programmer: UI must support RTL layout mirroring), bidirectional (bidi) text handling in string tables, locale-specific testing checklist entry for RTL layout
- Does NOT design a pipeline that only accounts for LTR languages when RTL locales are specified
- Notes that Arabic also requires a different plural form structure (6 plural categories in CLDR) — flags for translation vendor
- Output includes all five locales in the pipeline architecture, not just the default (en)
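The RTL check the agent is expected to make is mechanical once the locale list is in context; a minimal sketch (the function and flag names are illustrative, and the RTL set is a common subset, not exhaustive):

```python
# ISO 639-1 codes of common RTL languages; illustrative subset, not exhaustive.
RTL_LOCALES = {"ar", "he", "fa", "ur"}

def pipeline_flags(locales):
    """Derive pipeline-wide requirements from the target locale list."""
    return {
        "locales": list(locales),
        "needs_rtl_support": any(code in RTL_LOCALES for code in locales),
        "rtl_locales": [code for code in locales if code in RTL_LOCALES],
    }

flags = pipeline_flags(["en", "fr", "de", "ar", "he"])
```

A pipeline design that omits the `needs_rtl_support` equivalent for this locale list fails the case.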
---
## Protocol Compliance
- [ ] Stays within declared domain (pipeline, extraction, string quality, locale formats, i18n architecture)
- [ ] Does not produce translations — redirects translation work to human translators/vendors
- [ ] Flags locale-specific gaps (plural forms, RTL) as quality bugs requiring pipeline changes
- [ ] Produces a unified key naming convention when conflicts arise — does not accept dual conventions
- [ ] Incorporates all provided target locales, including RTL languages, into pipeline design
---
## Coverage Notes
- Case 3 (plural forms) and Case 5 (RTL) are locale-correctness tests — these affect shipping quality in non-English markets
- Case 4 (key naming conflict) is a pipeline hygiene test — duplicate keys cause ongoing translator confusion and cost
- Case 5 requires the target locale list to be in context; if not provided, agent should ask before designing the pipeline
- No automated runner; review manually or via `/skill-test`


@@ -0,0 +1,80 @@
# Agent Test Spec: release-manager
## Agent Summary
- **Domain**: Release pipeline management, platform certification checklists (Nintendo, Sony, Microsoft, Apple, Google), store submission workflows, platform technical requirements compliance, semantic version numbering, release branch management
- **Does NOT own**: Game design decisions, QA test strategy or test case design (qa-lead), QA test execution (qa-tester), build infrastructure (devops-engineer)
- **Model tier**: Sonnet
- **Gate IDs**: May be invoked by `/gate-check` during Release phase; LAUNCH BLOCKED verdict is release-manager's primary escalation output
---
## Static Assertions (Structural)
- [ ] `description:` field is present and domain-specific (references release pipeline, certification, store submission)
- [ ] `allowed-tools:` list matches the agent's role (Read/Write for production/releases/ directory; no game source or test tools)
- [ ] Model tier is Sonnet (default for operations specialists)
- [ ] Agent definition does not claim authority over QA strategy, game design, or build infrastructure
---
## Test Cases
### Case 1: In-domain request — platform certification checklist for Nintendo Switch
**Input**: "Generate the certification checklist for our Nintendo Switch submission."
**Expected behavior**:
- Produces a structured checklist covering Nintendo Lotcheck requirements relevant to the game type
- Includes categories: content rating (CERO/PEGI/ESRB as applicable), save data handling, offline mode compliance, error handling (lost connectivity, storage full), controller requirement (Joy-Con, Pro Controller support), sleep/wake behavior, screenshot/video capture compliance
- Formats output as a numbered checklist with pass/fail columns
- Notes that Nintendo's full Lotcheck guidelines require a licensed developer account to access and flags any items that require manual verification against the current guidelines document
- Does NOT produce fabricated requirement IDs — uses known public requirements or clearly marks uncertainty
### Case 2: Out-of-domain request — design test cases
**Input**: "Write test cases for our save system to make sure it passes certification."
**Expected behavior**:
- Does not produce test case specifications
- States clearly: "Test case design is owned by qa-lead (strategy) and qa-tester (execution); I can provide the certification requirements that the save system must meet, which qa-lead can then use to design tests"
- Optionally offers to list the save-system-relevant certification requirements
### Case 3: Domain boundary — certification failure (rating issue)
**Input**: "Our build was rejected by the ESRB. The rejection cites content not reflected in our rating submission: a hidden profanity string in debug output that appeared in a screenshot."
**Expected behavior**:
- Issues a LAUNCH BLOCKED verdict with the specific platform requirement referenced (ESRB submission accuracy requirement)
- Identifies the immediate action required: locate and remove all debug output containing inappropriate content before resubmission
- Notes the resubmission process: corrected build must be resubmitted with updated content descriptor if needed
- Does NOT minimize the issue — a certification rejection is a blocking event, not an advisory
- Escalates to producer: documents the delay impact on release timeline
### Case 4: Version numbering conflict — hotfix vs. release branch
**Input**: "Our release branch is at v1.2.0. A hotfix was applied directly on main and tagged v1.2.1. Now the release branch also has changes that need to ship as v1.2.1 but they're different changes."
**Expected behavior**:
- Identifies the conflict: two different changesets have been assigned the same version tag
- Applies semantic versioning resolution: one must be re-tagged — the release branch changes should become v1.2.2 if v1.2.1 is already published; if v1.2.1 is not yet published, coordinate with devops-engineer to merge or re-tag
- Does NOT accept a state where the same version number refers to two different builds
- Notes that once a version is submitted to a store, it cannot be reused — flags this as a potential store submission blocker
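The semantic versioning arithmetic the agent should apply is small enough to state exactly; a sketch, assuming plain `major.minor.patch` strings with no pre-release or build suffixes:

```python
def bump(version: str, part: str) -> str:
    """Bump a plain major.minor.patch version string; part is 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

In the Case 4 scenario, if v1.2.1 is already published, the release-branch changes become `bump("1.2.1", "patch")` — v1.2.2 — because a published version number can never be reassigned.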
### Case 5: Context pass — release date constraint and certification lead time
**Input context**: Target release date is 2026-06-01. Current date is 2026-04-06. Nintendo Lotcheck typically takes 4-6 weeks.
**Input**: "What should we prioritize on the certification checklist given our timeline?"
**Expected behavior**:
- Calculates the available window: ~8 weeks to release date; Nintendo Lotcheck at 4-6 weeks means submission must be ready by approximately 2026-04-20 to 2026-05-04 to allow for a potential resubmission cycle
- Flags that a single rejection cycle would consume the buffer — prioritizes items historically associated with Lotcheck rejections (save data, offline mode, error handling)
- Orders the checklist by certification lead time impact, not by perceived difficulty
- Does NOT produce a checklist that assumes first-pass certification — builds in resubmission time
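The date arithmetic in the expected answer can be checked mechanically; a sketch using the dates given in the input context (the function name is illustrative):

```python
from datetime import date, timedelta

def latest_submission_window(release: date, cert_weeks_min: int, cert_weeks_max: int):
    """Latest submission dates under worst-case and best-case certification lead time."""
    return (release - timedelta(weeks=cert_weeks_max),   # worst case: submit earliest
            release - timedelta(weeks=cert_weeks_min))   # best case: latest viable date

worst, best = latest_submission_window(date(2026, 6, 1), 4, 6)
# worst-case lead time → submit by 2026-04-20; best case → 2026-05-04
```

An agent answer that quotes dates outside this window, or that leaves no slack for a resubmission cycle on top of it, fails the case.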
---
## Protocol Compliance
- [ ] Stays within declared domain (release pipeline, certification checklists, version numbering, store submission)
- [ ] Redirects test case design requests to qa-lead/qa-tester without producing test specs
- [ ] Issues LAUNCH BLOCKED verdicts for certification failures — does not downgrade to advisory
- [ ] Applies semantic versioning correctly and flags version conflicts as store-blocking issues
- [ ] Uses provided timeline data to prioritize checklist items by certification lead time
---
## Coverage Notes
- Case 3 (LAUNCH BLOCKED verdict) is the most critical test — this agent's primary safety output is blocking bad launches
- Case 5 requires current date and release date context; verify the agent uses actual dates, not placeholder estimates
- Certification requirements change over time — flag if the agent produces specific requirement IDs that may be outdated
- No automated runner; review manually or via `/skill-test`