Files
pixelheros/CCGS Skill Testing Framework/skills/utility/skill-improve.md
2026-05-15 14:52:29 +08:00

186 lines
6.6 KiB
Markdown

# Skill Test Spec: /skill-improve
## Skill Summary
`/skill-improve` runs an automated test-fix-retest improvement loop on a skill
file. It invokes `/skill-test static` (and optionally `/skill-test category`) to
establish a baseline score, diagnoses the failing checks, proposes targeted fixes
to the SKILL.md file, asks "May I write the improvements to [skill path]?", applies
the fixes, and re-runs the tests to confirm improvement.
If the proposed fix makes the skill worse (regression), the fix is reverted (with
user confirmation) rather than applied. If the skill is already perfect (0 failures),
the skill exits immediately without making changes. No director gates apply. Verdicts:
IMPROVED (score went up), NO CHANGE (no improvements possible or user declined), or
REVERTED (fix was applied but caused regression and was reverted).
---
## Static Assertions (Structural)
Verified automatically by `/skill-test static` — no fixture needed.
- [ ] Has required frontmatter fields: `name`, `description`, `argument-hint`, `user-invocable`, `allowed-tools`
- [ ] Has ≥2 phase headings
- [ ] Contains verdict keywords: IMPROVED, NO CHANGE, REVERTED
- [ ] Contains "May I write" collaborative protocol language before applying fixes
- [ ] Has a next-step handoff (e.g., run `/skill-test spec` to validate behavioral compliance)
---
## Director Gate Checks
None. `/skill-improve` is a meta-utility skill. No director gates apply.
---
## Test Cases
### Case 1: Happy Path — Skill With 2 Static Failures, Both Fixed, IMPROVED
**Fixture:**
- `.claude/skills/some-skill/SKILL.md` has 2 static failures:
- Check 4: no "May I write" language despite having Write in allowed-tools
- Check 5: no next-step handoff at the end
**Input:** `/skill-improve some-skill`
**Expected behavior:**
1. Skill runs `/skill-test static some-skill` — baseline: 5/7 checks pass
2. Skill diagnoses the 2 failing checks (4 and 5)
3. Skill proposes fixes:
- Add "May I write" language to the appropriate phase
- Add a next-step handoff section at the end
4. Skill asks "May I write improvements to `.claude/skills/some-skill/SKILL.md`?"
5. Fixes applied; `/skill-test static some-skill` re-run — now 7/7 checks pass
6. Verdict is IMPROVED (5→7)
**Assertions:**
- [ ] Baseline score is established before any changes (5/7)
- [ ] Both failing checks are diagnosed and addressed in the proposed fix
- [ ] "May I write" is asked before applying the fix
- [ ] Re-test confirms improvement (7/7)
- [ ] Verdict is IMPROVED with before/after score shown
---
### Case 2: Fix Causes Regression — Score Comparison Shows Regression, REVERTED
**Fixture:**
- `.claude/skills/some-skill/SKILL.md` has 1 static failure (missing handoff)
- Proposed fix inadvertently removes the verdict keywords section
(introducing a new failure)
**Input:** `/skill-improve some-skill`
**Expected behavior:**
1. Baseline: 6/7 checks pass (1 failure: missing handoff)
2. Skill proposes fix and asks "May I write improvements?"
3. Fix is applied; re-test runs
4. Re-test result: 5/7 (fixed the handoff but broke verdict keywords)
5. Skill detects regression: score went DOWN
6. Skill asks user: "Fix caused a regression (6→5). May I revert the changes?"
7. User confirms; changes are reverted; verdict is REVERTED
**Assertions:**
- [ ] Re-test score is compared to baseline before finalizing
- [ ] Regression is detected when score decreases
- [ ] User is asked to confirm revert (not automatic)
- [ ] File is reverted on user confirmation
- [ ] Verdict is REVERTED
---
### Case 3: Skill With Category Assignment — Baseline Captures Both Scores
**Fixture:**
- `.claude/skills/gate-check/SKILL.md` is a gate skill with 1 static failure
and 2 category (G-criteria) failures
- `tests/skills/quality-rubric.md` has Gate Skills section
**Input:** `/skill-improve gate-check`
**Expected behavior:**
1. Skill runs both static and category tests for the baseline:
- Static: 6/7 checks pass
- Category: 3/5 G-criteria pass
2. Combined baseline: 9/12
3. Skill diagnoses all 3 failures and proposes fixes
4. "May I write improvements to `.claude/skills/gate-check/SKILL.md`?"
5. Fixes applied; both test types re-run
6. Re-test: static 7/7, category 5/5 = 12/12
7. Verdict is IMPROVED (9→12)
**Assertions:**
- [ ] Both static and category scores are captured in the baseline
- [ ] Combined score is used for comparison (not just one type)
- [ ] All 3 failures are addressed in the proposed fix
- [ ] Re-test confirms improvement in both score types
- [ ] Verdict is IMPROVED with combined before/after
---
### Case 4: Skill Already Perfect — No Improvements Needed
**Fixture:**
- `.claude/skills/brainstorm/SKILL.md` has no static failures
- Category score is also 5/5 (if applicable)
**Input:** `/skill-improve brainstorm`
**Expected behavior:**
1. Skill runs `/skill-test static brainstorm` — 7/7 checks pass
2. If category applies: 5/5 criteria pass
3. Skill outputs: "No improvements needed — brainstorm is fully compliant"
4. Skill exits without proposing any changes
5. No "May I write" is asked; no files are modified
6. Verdict is NO CHANGE
**Assertions:**
- [ ] Skill exits immediately after confirming 0 failures
- [ ] "No improvements needed" message is shown
- [ ] No changes are proposed
- [ ] No "May I write" is asked
- [ ] Verdict is NO CHANGE
---
### Case 5: Director Gate Check — No gate; skill-improve is a meta utility
**Fixture:**
- Skill with at least 1 static failure
**Input:** `/skill-improve some-skill`
**Expected behavior:**
1. Skill runs the test-fix-retest loop
2. No director agents are spawned
3. No gate IDs appear in output
**Assertions:**
- [ ] No director gate is invoked
- [ ] No gate skip messages appear
- [ ] Verdict is IMPROVED, NO CHANGE, or REVERTED — no gate verdict
---
## Protocol Compliance
- [ ] Always establishes a baseline score before proposing any changes
- [ ] Shows before/after score comparison in the output
- [ ] Asks "May I write" before applying any fix
- [ ] Detects regressions by comparing re-test score to baseline
- [ ] Asks for user confirmation before reverting (not automatic)
- [ ] Ends with IMPROVED, NO CHANGE, or REVERTED verdict
---
## Coverage Notes
- The improvement loop is designed to run only one fix-retest cycle per
invocation; running multiple iterations requires re-invoking `/skill-improve`.
- Behavioral compliance (spec-mode test results) is not included in the
improvement loop — only structural (static) and category scores are automated.
- The case where the skill file cannot be read (permissions error or missing file)
is not tested; this would result in an error before the baseline is established.