Behavioral conformance
Behavioral conformance tests the model. Given a soul and an OpenAI-compatible endpoint, muster runs multi-turn test conversations and grades the model’s transcripts against three objectively-measurable axes the soul declares.
The checker is turn-list in, transcript out and multi-turn from the ground up. Single-turn cases are just turn lists of length one.
The three axes
Section titled “The three axes”| Axis | What it checks |
|---|---|
| Verbosity | Each graded reply’s word count stays within the soul’s verbosity-derived cap. |
| Brief refusals | A refusal stays under the refusal word cap and satisfies content assertions (e.g. “states no price”). |
| Dynamic state shift | Injecting a fact (e.g. user.rude) at a turn shifts the active state (e.g. cold_strict), and the post-shift output observably conforms to the shifted state. |
These are deliberately the objectively-gradable axes. There is no fuzzy
“LLM-as-judge” for subjective qualities. Every grade records the measured
value and the limit it was checked against, so a failure is always
explainable.
k-of-n grading
Section titled “k-of-n grading”Models are stochastic, so each case runs runs times (default 3) and passes
iff at least pass_threshold (default 2) runs pass. An errored run counts as a
failed run. Temperature stays at the provider default unless overridden and is
recorded in every transcript.
Bring your own model
Section titled “Bring your own model”The behavioral runner talks to any endpoint speaking the OpenAI
/chat/completions API. Nothing but configuration changes between providers.
# Local: Ollamaollama pull qwen2.5:7b-instructmuster behave run behave/voice-frontdesk.yaml # defaults to localhost:11434/v1
# Hosted: NVIDIA NIM (or any compatible provider)export MUSTER_API_KEY="..." # env only; never a flag or filemuster behave run behave/voice-frontdesk.yaml \ --base-url https://integrate.api.nvidia.com/v1 \ --model meta/llama-3.1-8b-instructThe API key is read from MUSTER_API_KEY (fallback OPENAI_API_KEY) at request
time. It never appears in argv, transcripts, or committed results, and must
never be committed to a repository.
See Behavioral thresholds for the exact word-count mapping and how per-case overrides work.