Skill Test Spec: /skill-improve
Skill Summary
/skill-improve runs an automated test-fix-retest improvement loop on a skill
file. It invokes /skill-test static (and optionally /skill-test category) to
establish a baseline score, diagnoses the failing checks, proposes targeted fixes
to the SKILL.md file, asks "May I write the improvements to [skill path]?", applies
the fixes, and re-runs the tests to confirm improvement.
If the applied fix makes the skill worse (a regression), the fix is reverted after user confirmation rather than kept. If the target skill already passes every check (0 failures), /skill-improve exits immediately without making changes. No director gates apply. Verdicts: IMPROVED (score went up), NO CHANGE (no improvements possible or user declined), or REVERTED (fix was applied but caused a regression and was reverted).
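The verdict rules above can be sketched as a small decision function. This is a minimal illustration, not the skill's actual implementation: `Verdict` and `decide_verdict` are hypothetical names, and the two outcomes the spec does not define (an unchanged score after a fix, or a declined revert) are mapped to NO CHANGE purely as an assumption.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    IMPROVED = "IMPROVED"
    NO_CHANGE = "NO CHANGE"
    REVERTED = "REVERTED"


def decide_verdict(
    baseline_passed: int,
    total_checks: int,
    user_approved_write: bool,
    retest_passed: Optional[int] = None,
    user_approved_revert: bool = False,
) -> Verdict:
    """Map one test-fix-retest cycle onto a verdict (hypothetical helper)."""
    # Already perfect: exit without proposing anything.
    if baseline_passed == total_checks:
        return Verdict.NO_CHANGE
    # User declined the "May I write" prompt: nothing was applied.
    if not user_approved_write:
        return Verdict.NO_CHANGE
    if retest_passed is None:
        raise ValueError("a re-test score is required once a fix was applied")
    if retest_passed > baseline_passed:
        return Verdict.IMPROVED
    if retest_passed < baseline_passed and user_approved_revert:
        return Verdict.REVERTED
    # Assumption: the spec does not cover an equal score or a declined
    # revert, so this sketch falls back to NO CHANGE.
    return Verdict.NO_CHANGE
```

With the numbers from the test cases below, `decide_verdict(5, 7, True, 7)` gives IMPROVED and `decide_verdict(6, 7, True, 5, True)` gives REVERTED.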
Static Assertions (Structural)
Verified automatically by /skill-test static — no fixture needed.
- Has required frontmatter fields: name, description, argument-hint, user-invocable, allowed-tools
- Has ≥2 phase headings
- Contains verdict keywords: IMPROVED, NO CHANGE, REVERTED
- Contains "May I write" collaborative protocol language before applying fixes
- Has a next-step handoff (e.g., run /skill-test spec to validate behavioral compliance)
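To make the structural assertions concrete, here is a rough sketch of how four of them could be checked against the text of a SKILL.md file. It is not how /skill-test static is actually implemented: `check_structure` is a hypothetical helper, and it assumes the frontmatter sits between two `---` lines and that a phase heading is a markdown heading containing the word "Phase". The next-step handoff check is left out because this spec does not say how it is detected.

```python
import re

REQUIRED_FIELDS = (
    "name", "description", "argument-hint", "user-invocable", "allowed-tools",
)
VERDICT_KEYWORDS = ("IMPROVED", "NO CHANGE", "REVERTED")


def check_structure(skill_text: str) -> dict[str, bool]:
    """Return a pass/fail flag per structural check (hypothetical helper)."""
    # Assumption: frontmatter is the block between the first two '---' markers.
    parts = skill_text.split("---", 2)
    has_frontmatter = len(parts) == 3 and parts[0].strip() == ""
    frontmatter = parts[1] if has_frontmatter else ""
    keys = {
        line.split(":", 1)[0].strip()
        for line in frontmatter.splitlines()
        if ":" in line
    }
    # Assumption: a phase heading is any markdown heading mentioning "Phase".
    phase_headings = re.findall(
        r"^#+\s.*\bPhase\b", skill_text, flags=re.MULTILINE
    )
    return {
        "frontmatter_fields": all(field in keys for field in REQUIRED_FIELDS),
        "phase_headings": len(phase_headings) >= 2,
        "verdict_keywords": all(kw in skill_text for kw in VERDICT_KEYWORDS),
        "may_i_write": "May I write" in skill_text,
    }
```

Each failing flag would correspond to one check for /skill-improve to diagnose and fix.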
Director Gate Checks
None. /skill-improve is a meta-utility skill. No director gates apply.
Test Cases
Case 1: Happy Path — Skill With 2 Static Failures, Both Fixed, IMPROVED
Fixture:
- .claude/skills/some-skill/SKILL.md has 2 static failures:
  - Check 4: no "May I write" language despite having Write in allowed-tools
  - Check 5: no next-step handoff at the end
Input: /skill-improve some-skill
Expected behavior:
- Skill runs /skill-test static some-skill — baseline: 5/7 checks pass
- Skill diagnoses the 2 failing checks (4 and 5)
- Skill proposes fixes:
  - Add "May I write" language to the appropriate phase
  - Add a next-step handoff section at the end
- Skill asks "May I write improvements to .claude/skills/some-skill/SKILL.md?"
- Fixes applied; /skill-test static some-skill re-run — now 7/7 checks pass
- Verdict is IMPROVED (5→7)
Assertions:
- Baseline score is established before any changes (5/7)
- Both failing checks are diagnosed and addressed in the proposed fix
- "May I write" is asked before applying the fix
- Re-test confirms improvement (7/7)
- Verdict is IMPROVED with before/after score shown
Case 2: Fix Causes Regression — Score Comparison Shows Regression, REVERTED
Fixture:
- .claude/skills/some-skill/SKILL.md has 1 static failure (missing handoff)
- Proposed fix inadvertently removes the verdict keywords section (introducing a new failure)
Input: /skill-improve some-skill
Expected behavior:
- Baseline: 6/7 checks pass (1 failure: missing handoff)
- Skill proposes fix and asks "May I write improvements?"
- Fix is applied; re-test runs
- Re-test result: 5/7 (fixed the handoff but broke verdict keywords)
- Skill detects regression: score went DOWN
- Skill asks user: "Fix caused a regression (6→5). May I revert the changes?"
- User confirms; changes are reverted; verdict is REVERTED
Assertions:
- Re-test score is compared to baseline before finalizing
- Regression is detected when score decreases
- User is asked to confirm revert (not automatic)
- File is reverted on user confirmation
- Verdict is REVERTED
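The revert step in this case depends on keeping a copy of the file as it was before the fix. A minimal sketch, assuming the original text is snapshotted in memory before writing: `revert_if_regressed` and its `confirm` callback are hypothetical names, and the real skill may restore the file differently.

```python
from pathlib import Path
from typing import Callable


def revert_if_regressed(
    path: Path,
    original_text: str,
    baseline: int,
    retest: int,
    confirm: Callable[[str], bool],
) -> bool:
    """Restore original_text if the score dropped and the user agrees.

    Returns True only when the file was actually reverted.
    """
    if retest >= baseline:
        return False  # no regression, nothing to revert
    prompt = (
        f"Fix caused a regression ({baseline}→{retest}). "
        "May I revert the changes?"
    )
    if not confirm(prompt):
        return False  # user declined; the file is left as written
    path.write_text(original_text, encoding="utf-8")
    return True
```

Passing the confirmation in as a callback keeps the revert from ever happening automatically, which is what the assertions above require.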
Case 3: Skill With Category Assignment — Baseline Captures Both Scores
Fixture:
- .claude/skills/gate-check/SKILL.md is a gate skill with 1 static failure and 2 category (G-criteria) failures
- tests/skills/quality-rubric.md has Gate Skills section
Input: /skill-improve gate-check
Expected behavior:
- Skill runs both static and category tests for the baseline:
  - Static: 6/7 checks pass
  - Category: 3/5 G-criteria pass
  - Combined baseline: 9/12
- Skill diagnoses all 3 failures and proposes fixes
- "May I write improvements to .claude/skills/gate-check/SKILL.md?"
- Fixes applied; both test types re-run
- Re-test: static 7/7, category 5/5 = 12/12
- Verdict is IMPROVED (9→12)
Assertions:
- Both static and category scores are captured in the baseline
- Combined score is used for comparison (not just one type)
- All 3 failures are addressed in the proposed fix
- Re-test confirms improvement in both score types
- Verdict is IMPROVED with combined before/after
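The combined score in this case is the sum of the static and category results. A small sketch of that arithmetic, using a hypothetical `Score` type that is not part of the skill itself:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """Passed and total check counts for one test type (hypothetical type)."""

    passed: int
    total: int

    def __add__(self, other: "Score") -> "Score":
        # Combine two test types by summing passed and total counts.
        return Score(self.passed + other.passed, self.total + other.total)

    def __str__(self) -> str:
        return f"{self.passed}/{self.total}"


# Numbers taken from Case 3: static 6/7 plus category 3/5.
baseline = Score(6, 7) + Score(3, 5)
retest = Score(7, 7) + Score(5, 5)
print(f"{baseline} -> {retest}")  # prints "9/12 -> 12/12"
```

Comparing `retest.passed` with `baseline.passed` on the combined value, rather than on either test type alone, is what the second assertion above asks for.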
Case 4: Skill Already Perfect — No Improvements Needed
Fixture:
- .claude/skills/brainstorm/SKILL.md has no static failures
- Category score is also 5/5 (if applicable)
Input: /skill-improve brainstorm
Expected behavior:
- Skill runs /skill-test static brainstorm — 7/7 checks pass
- If category applies: 5/5 criteria pass
- Skill outputs: "No improvements needed — brainstorm is fully compliant"
- Skill exits without proposing any changes
- No "May I write" is asked; no files are modified
- Verdict is NO CHANGE
Assertions:
- Skill exits immediately after confirming 0 failures
- "No improvements needed" message is shown
- No changes are proposed
- No "May I write" is asked
- Verdict is NO CHANGE
Case 5: Director Gate Check — No gate; skill-improve is a meta utility
Fixture:
- Skill with at least 1 static failure
Input: /skill-improve some-skill
Expected behavior:
- Skill runs the test-fix-retest loop
- No director agents are spawned
- No gate IDs appear in output
Assertions:
- No director gate is invoked
- No gate skip messages appear
- Verdict is IMPROVED, NO CHANGE, or REVERTED — no gate verdict
Protocol Compliance
- Always establishes a baseline score before proposing any changes
- Shows before/after score comparison in the output
- Asks "May I write" before applying any fix
- Detects regressions by comparing re-test score to baseline
- Asks for user confirmation before reverting (not automatic)
- Ends with IMPROVED, NO CHANGE, or REVERTED verdict
Coverage Notes
- The improvement loop is designed to run only one fix-retest cycle per invocation; running multiple iterations requires re-invoking /skill-improve.
- Behavioral compliance (spec-mode test results) is not included in the improvement loop — only structural (static) and category scores are automated.
- The case where the skill file cannot be read (permissions error or missing file) is not tested; this would result in an error before the baseline is established.