Skip to content

Behavioral conformance

Behavioral conformance tests the model. Given a soul and an OpenAI-compatible endpoint, muster runs multi-turn test conversations and grades the model’s transcripts against three objectively-measurable axes the soul declares.

The checker is turn-list in, transcript out and multi-turn from the ground up. Single-turn cases are just turn lists of length one.

AxisWhat it checks
VerbosityEach graded reply’s word count stays within the soul’s verbosity-derived cap.
Brief refusalsA refusal stays under the refusal word cap and satisfies content assertions (e.g. “states no price”).
Dynamic state shiftInjecting a fact (e.g. user.rude) at a turn shifts the active state (e.g. cold_strict), and the post-shift output observably conforms to the shifted state.

These are deliberately the objectively-gradable axes. There is no fuzzy “LLM-as-judge” for subjective qualities. Every grade records the measured value and the limit it was checked against, so a failure is always explainable.

Models are stochastic, so each case runs runs times (default 3) and passes iff at least pass_threshold (default 2) runs pass. An errored run counts as a failed run. Temperature stays at the provider default unless overridden and is recorded in every transcript.

The behavioral runner talks to any endpoint speaking the OpenAI /chat/completions API. Nothing but configuration changes between providers.

Terminal window
# Local: Ollama
ollama pull qwen2.5:7b-instruct
muster behave run behave/voice-frontdesk.yaml # defaults to localhost:11434/v1
# Hosted: NVIDIA NIM (or any compatible provider)
export MUSTER_API_KEY="..." # env only; never a flag or file
muster behave run behave/voice-frontdesk.yaml \
--base-url https://integrate.api.nvidia.com/v1 \
--model meta/llama-3.1-8b-instruct

The API key is read from MUSTER_API_KEY (fallback OPENAI_API_KEY) at request time. It never appears in argv, transcripts, or committed results, and must never be committed to a repository.

See Behavioral thresholds for the exact word-count mapping and how per-case overrides work.