Skill Test Spec: /test-flakiness
Skill Summary
/test-flakiness detects non-deterministic tests by analyzing test history logs
(if available) or scanning test source code for common flakiness patterns (random
numbers without seeds, real-time waits, external I/O). No director gates are
invoked. The skill does not write without user approval. Verdicts: NO FLAKINESS,
SUSPECT TESTS FOUND, or CONFIRMED FLAKY.
Static Assertions (Structural)
Verified automatically by /skill-test static — no fixture needed.
- Has required frontmatter fields: name, description, argument-hint, user-invocable, allowed-tools
- Has ≥2 phase headings
- Contains verdict keywords: NO FLAKINESS, SUSPECT TESTS FOUND, CONFIRMED FLAKY
- Does NOT require "May I write" language (read-only; optional report requires approval)
- Has a next-step handoff (what to do after flakiness findings)
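The first three structural checks above could be sketched roughly as follows. This is a minimal illustration, not the actual /skill-test static implementation: the function name `static_check`, the `---`-delimited frontmatter layout, and the "Phase" heading pattern are all assumptions.

```python
import re

REQUIRED_FIELDS = ["name", "description", "argument-hint", "user-invocable", "allowed-tools"]
VERDICTS = ["NO FLAKINESS", "SUSPECT TESTS FOUND", "CONFIRMED FLAKY"]


def static_check(skill_text: str) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []

    # Assumption: frontmatter is the block between the first two '---' lines.
    match = re.match(r"^---\n(.*?)\n---\n", skill_text, flags=re.DOTALL)
    frontmatter = match.group(1) if match else ""
    keys = {line.split(":", 1)[0].strip() for line in frontmatter.splitlines() if ":" in line}
    for field in REQUIRED_FIELDS:
        if field not in keys:
            failures.append(f"missing frontmatter field: {field}")

    # Assumption: phase headings look like '## Phase 1: ...'.
    phases = re.findall(r"^#+\s*Phase\b", skill_text, flags=re.MULTILINE)
    if len(phases) < 2:
        failures.append("fewer than 2 phase headings")

    for verdict in VERDICTS:
        if verdict not in skill_text:
            failures.append(f"missing verdict keyword: {verdict}")
    return failures
```

Returning a list of failures rather than a single boolean lets a test runner report every missing element in one pass.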
Director Gate Checks
None. Flakiness detection is an advisory quality skill for the QA lead; no gates are invoked.
Test Cases
Case 1: Happy Path — Clean test history, no flakiness
Fixture:
- production/qa/test-history/ contains logs for 10 test runs
- All tests pass consistently across all 10 runs (100% pass rate per test)
- No test has a failure pattern
Input: /test-flakiness
Expected behavior:
- Skill reads test history logs from production/qa/test-history/
- Skill computes per-test pass rate across 10 runs
- All tests pass all 10 runs — no inconsistency detected
- Verdict is NO FLAKINESS
Assertions:
- Skill reads test history logs when available
- Per-test pass rate is computed across all available runs
- Verdict is NO FLAKINESS when all tests pass consistently
- No files are written
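As an illustration of the per-test pass-rate computation this case exercises, here is a minimal sketch. The spec does not define the log format, so the input shape (one dict per run, mapping test name to pass/fail) and the name `pass_rates` are assumptions.

```python
from collections import defaultdict


def pass_rates(runs):
    """Aggregate per-test results across runs.

    runs: list of dicts, one per test run, mapping test name -> True (pass)
          or False (fail). This shape is hypothetical.
    Returns {test name: (passes, total runs in which the test appeared)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for run in runs:
        for name, passed in run.items():
            counts[name][1] += 1
            if passed:
                counts[name][0] += 1
    return {name: (p, t) for name, (p, t) in counts.items()}
```

For the fixture above, every test would map to (10, 10), so no inconsistency is detected and the verdict is NO FLAKINESS.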
Case 2: Suspect Tests Found — Test fails intermittently in history
Fixture:
- production/qa/test-history/ contains logs for 10 test runs
- test_combat_damage_applies_crit_multiplier passes 7 times, fails 3 times
- Failure messages differ (sometimes timeout, sometimes wrong value)
Input: /test-flakiness
Expected behavior:
- Skill reads test history logs — computes pass rates
- test_combat_damage_applies_crit_multiplier has 70% pass rate (threshold: 95%)
- Skill flags it as SUSPECT with pass rate (7/10) and failure pattern noted
- Verdict is SUSPECT TESTS FOUND
- Skill recommends investigating the test for timing or state dependencies
Assertions:
- Tests below the pass-rate threshold are flagged by name
- Pass rate (fraction and percentage) is shown for each suspect test
- Failure pattern (e.g., inconsistent error messages) is noted if detectable
- Verdict is SUSPECT TESTS FOUND
- Skill recommends investigation steps
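The flagging behavior asserted above might look like the following sketch. The input shape (test name mapped to a list of outcomes, with None for a pass and a failure message string otherwise) and the name `find_suspects` are assumptions; the 0.95 default mirrors the threshold suggested in this spec.

```python
def find_suspects(history, threshold=0.95):
    """Flag tests whose pass rate falls below the threshold.

    history: {test name: [outcome, ...]} where an outcome is None for a pass
             or a failure message string for a failure (hypothetical shape).
    """
    suspects = []
    for name, outcomes in sorted(history.items()):
        total = len(outcomes)
        if total == 0:
            continue  # no runs recorded, nothing to judge
        failures = [o for o in outcomes if o is not None]
        passes = total - len(failures)
        rate = passes / total
        if rate < threshold:
            suspects.append({
                "test": name,
                # Fraction and percentage, as the assertions require.
                "pass_rate": f"{passes}/{total} ({rate:.0%})",
                # More than one distinct message suggests an inconsistent failure pattern.
                "inconsistent_failures": len(set(failures)) > 1,
            })
    return suspects
```

With the Case 2 fixture, the crit-multiplier test would be reported with a pass rate of 7/10 (70%) and an inconsistent failure pattern.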
Case 3: Source Pattern — Random number used without seed
Fixture:
- No test history logs exist
- tests/unit/loot/loot_drop_test.gd contains:
    var roll = randf()  # unseeded random, non-deterministic
    assert_gt(roll, 0.5, "Loot should drop above 50%")
Input: /test-flakiness
Expected behavior:
- Skill finds no test history logs
- Skill falls back to source code analysis
- Skill detects randf() call without a preceding seed() call
- Skill flags the test as FLAKINESS RISK (source pattern, not confirmed)
- Verdict is SUSPECT TESTS FOUND (pattern detected, not confirmed by history)
- Skill recommends seeding random before the call or mocking the random function
Assertions:
- Source code analysis is used as fallback when no history logs exist
- Unseeded random number usage is detected as a flakiness risk
- Verdict is SUSPECT TESTS FOUND (not CONFIRMED FLAKY — no history to confirm)
- Remediation recommends seeding or mocking
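One way the unseeded-random check could work is a line-by-line scan of the GDScript source, sketched below. The regexes, the name `unseeded_random_lines`, and the naive comment stripping are assumptions made for illustration; a real detector would likely need scope awareness (a seed() call in another function should not count).

```python
import re

# Matches randf(), randi(), randf_range(), randi_range(), and similar calls.
RANDOM_CALL = re.compile(r"\brand[fi]\w*\s*\(")
SEED_CALL = re.compile(r"\bseed\s*\(")


def unseeded_random_lines(source: str) -> list[int]:
    """Return 1-based line numbers of random calls with no earlier seed() call."""
    seeded = False
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Naive comment strip: ignores the possibility of '#' inside a string.
        code = line.split("#", 1)[0]
        if SEED_CALL.search(code):
            seeded = True
        elif RANDOM_CALL.search(code) and not seeded:
            hits.append(lineno)
    return hits
```

Applied to the fixture in this case, the scan would report line 1 of the snippet. Because the evidence is a source pattern only, the result supports SUSPECT TESTS FOUND rather than CONFIRMED FLAKY.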
Case 4: No Test History — Source-only analysis with common patterns
Fixture:
- production/qa/test-history/ does not exist
- tests/ contains 15 test files
- Scan finds 2 tests using OS.get_ticks_msec() for timing assertions
- No other flakiness patterns found
Input: /test-flakiness
Expected behavior:
- Skill checks for test history — not found
- Skill notes: "No test history available — analyzing source code for flakiness patterns only"
- Skill scans all test files for known patterns: unseeded random, real-time waits, system clock usage
- Finds 2 tests using OS.get_ticks_msec(); flags them as FLAKINESS RISK
- Verdict is SUSPECT TESTS FOUND
Assertions:
- Skill notes clearly that source-only analysis is being performed (no history)
- Common flakiness patterns are scanned: random, time-based assertions, external I/O
- OS.get_ticks_msec() usage for assertions is flagged as a flakiness risk
- Verdict is SUSPECT TESTS FOUND when source patterns are found
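A source-only scan over several files might be organized around a small pattern table, as in this sketch. The table contents, the labels, and the name `scan_sources` are assumptions rather than the skill's actual rules; the random pattern here omits the seed check for brevity.

```python
import re

# Illustrative pattern table (hypothetical; not the skill's real rule set).
PATTERNS = {
    "system clock": re.compile(r"\bOS\.get_ticks_msec\s*\("),
    "real-time wait": re.compile(r"\bcreate_timer\s*\("),
    "random call": re.compile(r"\brand[fi]\w*\s*\("),
}


def scan_sources(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Scan {path: source text} and return (path, line number, pattern label) findings."""
    findings = []
    for path in sorted(files):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            code = line.split("#", 1)[0]  # naive comment strip
            for label, pattern in PATTERNS.items():
                if pattern.search(code):
                    findings.append((path, lineno, label))
    return findings
```

Any non-empty result in source-only mode would map to SUSPECT TESTS FOUND, accompanied by the notice that no test history was available.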
Case 5: Gate Compliance — No gate; flakiness report is advisory
Fixture:
- Test history shows 1 CONFIRMED FLAKY test (fails 6 out of 10 runs)
- review-mode.txt contains the value full
Input: /test-flakiness
Expected behavior:
- Skill analyzes test history; identifies 1 confirmed flaky test
- No director gate is invoked regardless of review mode
- Verdict is CONFIRMED FLAKY
- Skill presents findings and offers optional written report
- If user opts in: "May I write to production/qa/flakiness-report-[date].md?"
Assertions:
- No director gate is invoked in any review mode
- CONFIRMED FLAKY verdict requires history-based evidence (not just source patterns)
- Optional report requires "May I write" before writing
- Flakiness report is advisory for qa-lead; skill does not auto-disable tests
Protocol Compliance
- Reads test history logs when available; falls back to source analysis when not
- Notes clearly which analysis mode is being used (history vs. source-only)
- Flakiness threshold (e.g., 95% pass rate) is used for SUSPECT classification
- CONFIRMED FLAKY requires history evidence; source patterns alone yield at most SUSPECT TESTS FOUND
- Does not disable or modify any test files
- No director gates are invoked
- Verdict is one of: NO FLAKINESS, SUSPECT TESTS FOUND, CONFIRMED FLAKY
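The verdict rules in this list can be condensed into a small aggregation function, sketched here under stated assumptions. The finding shape and the name `overall_verdict` are hypothetical, and the precedence of CONFIRMED over SUSPECT when both are present is an assumption, since this spec does not cover mixed results.

```python
def overall_verdict(findings):
    """Map classified findings to one of the three verdicts.

    findings: list of dicts with
      'classification': 'SUSPECT' or 'CONFIRMED'
      'evidence': 'history' or 'source'
    (hypothetical shape).
    """
    for finding in findings:
        # CONFIRMED must be backed by history, never by source patterns alone.
        if finding["classification"] == "CONFIRMED" and finding["evidence"] != "history":
            raise ValueError("CONFIRMED requires history evidence")
    if any(f["classification"] == "CONFIRMED" for f in findings):
        return "CONFIRMED FLAKY"
    if findings:
        return "SUSPECT TESTS FOUND"
    return "NO FLAKINESS"
```

Rejecting a source-backed CONFIRMED finding outright, rather than silently downgrading it, makes a misclassification upstream visible instead of hiding it.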
Coverage Notes
- The pass-rate threshold for SUSPECT classification (95% suggested above) is an implementation detail; the tests verify that intermittent failures are flagged, not the exact threshold value.
- Tests that fail due to environment issues (missing assets, wrong platform) are not flakiness — the skill distinguishes environment failures from non-determinism in the test itself; this distinction is not explicitly tested here.