Behavioral thresholds
RFC-1 defines voice.verbosity as a 0 to 100 scalar but deliberately maps no
word counts. So that behavioral grades are objective and reproducible, muster
applies a documented, deterministic mapping (a locked design decision).
The mapping
| Quantity | Rule |
|---|---|
| Verbosity word cap | maxWords(verbosity) = 10 + verbosity (e.g. verbosity 25 → 35 words) |
| Refusal word cap | constant 25 |
| Word counting | s.trim().split(/\s+/).filter(Boolean).length |
The mapping is owned by the RFC-1 adapter, so a future adapter for a different spec could map thresholds its own way without touching the core grader.
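The table can be read as three small functions. The following is a minimal sketch, not muster's actual source: the names `REFUSAL_WORD_CAP` and `countWords` are assumptions made for illustration, and only the formulas themselves come from the table.

```typescript
// Illustrative names; only the formulas are taken from the mapping table.
const REFUSAL_WORD_CAP = 25;

// Verbosity word cap: 10 + verbosity, where verbosity is the 0 to 100 scalar.
function maxWords(verbosity: number): number {
  return 10 + verbosity;
}

// Word counting: trim, split on runs of whitespace, drop empty strings.
// The filter matters for blank input, where split() yields [""].
function countWords(s: string): number {
  return s.trim().split(/\s+/).filter(Boolean).length;
}

console.log(maxWords(25));                   // 35
console.log(countWords("  hello   world ")); // 2
console.log(countWords("   "));              // 0
```

The sketch does not validate that `verbosity` lies in the 0 to 100 range; whether the real adapter does so is not stated here.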
k-of-n grading
Each case runs `runs` times (default 3) and passes iff at least `pass_threshold`
(default 2) of those runs pass; an errored run counts as failed. Every grade records
`measured` and `limit`, so a failure always says exactly what it measured and
what it expected.
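The k-of-n rule can be sketched as follows. The types and function names here (`RunGrade`, `RunOutcome`, `gradeCase`) are hypothetical and chosen for illustration; only the rule itself (at least `pass_threshold` passing runs, errors counted as failures, `measured` and `limit` recorded per grade) comes from the description above.

```typescript
// Hypothetical shapes: a graded run records what it measured and the limit
// it was checked against; an errored run carries only an error message.
interface RunGrade {
  passed: boolean;
  measured: number;
  limit: number;
}
type RunOutcome = RunGrade | { error: string };

// An errored run counts as failed.
function runPassed(outcome: RunOutcome): boolean {
  if ("error" in outcome) return false;
  return outcome.passed;
}

// A case passes iff at least passThreshold of its runs pass (default 2).
function gradeCase(outcomes: RunOutcome[], passThreshold: number = 2): boolean {
  const passes = outcomes.filter(runPassed).length;
  return passes >= passThreshold;
}

const threeRuns: RunOutcome[] = [
  { passed: true, measured: 30, limit: 35 },
  { error: "timeout" },
  { passed: true, measured: 34, limit: 35 },
];
console.log(gradeCase(threeRuns)); // true: 2 of 3 runs passed
```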
Overrides
A case may carry per-case overrides in the behavioral manifest:

    overrides:
      max_words: 30
      refusal_cap: 20

Overrides express deliberate test design: they win over the default
mapping for that case. The canonical example is the intentionally-impossible
xfail_discrimination_overly_verbose case, which sets an unreachable cap to
prove the grader actually fails non-conforming output rather than rubber-stamping
it.
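
One way to picture the precedence is a resolver that falls back to the default mapping only when a case supplies no override. This is a sketch under stated assumptions: `resolveLimits` and the `Overrides` interface are hypothetical names, and the defaults are restated inline from the mapping (10 + verbosity, constant 25).

```typescript
// Hypothetical shape mirroring the manifest keys max_words and refusal_cap.
interface Overrides {
  max_words?: number;
  refusal_cap?: number;
}

interface Limits {
  maxWords: number;
  refusalCap: number;
}

// A per-case override wins; otherwise the default mapping applies.
function resolveLimits(verbosity: number, overrides: Overrides = {}): Limits {
  return {
    maxWords: overrides.max_words ?? (10 + verbosity),
    refusalCap: overrides.refusal_cap ?? 25,
  };
}

console.log(resolveLimits(25));
// { maxWords: 35, refusalCap: 25 }
console.log(resolveLimits(25, { max_words: 30, refusal_cap: 20 }));
// { maxWords: 30, refusalCap: 20 }
```

Using `??` rather than `||` means an override of `0` would still be honored instead of silently falling back to the default.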