<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>muster | Blog</title><description>Conformance harness for the agent-file stack: static validation and behavioral grading across Soul.md personas, Agent Skills, SOPs, tools, memory, heartbeat, A2A cards, and cross-layer composition.</description><link>https://garrison-hq.github.io/</link><language>en</language><item><title>Your AI agent is a stack of files. muster 1.0.0 tests all of them.</title><link>https://garrison-hq.github.io/muster/blog/muster-1-0-0/</link><guid isPermaLink="true">https://garrison-hq.github.io/muster/blog/muster-1-0-0/</guid><description>muster is a conformance harness for the agent-file stack. It checks each file an agent is built from against its spec, and checks whether the model actually behaves the way those files say. Here is what shipped in 1.0.0.

</description><pubDate>Tue, 16 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Look at what defines an AI agent now. It is not one file anymore.&lt;/p&gt;
&lt;p&gt;There is a persona file that sets the voice and the safety posture. A skills
directory that says what the agent can do and when to reach for it. An
&lt;code dir=&quot;auto&quot;&gt;AGENTS.md&lt;/code&gt; that spells out the standard operating procedure. A tools manifest
listing the functions it may call. A memory file holding what it should remember
about you. A heartbeat checklist for its scheduled work. An agent card that
advertises it to other agents. Each of these has its own emerging spec, and each
one is a place the agent can quietly go wrong.&lt;/p&gt;
&lt;p&gt;Here is the part that kept bothering me: a file that parses is not the same as a
file the model follows. You can have a perfectly valid persona spec and a model
that ignores half of it under pressure. You can write a rule in your SOP and
watch a crafted message talk the model out of it. Validation tells you the file
is well-formed. It says nothing about behavior.&lt;/p&gt;
&lt;p&gt;muster is my attempt to test both. Version 1.0.0 is out on npm today.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;what-it-does&quot;&gt;What it does&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;muster checks seven layers of that file stack, plus how the layers compose. For
each layer it does two things.&lt;/p&gt;
&lt;p&gt;The static check parses the file and validates it against its spec. This runs
offline and is byte-for-byte reproducible, so you can drop it into CI as a hard
gate and trust the result. No network, no flakiness, same bytes every time
(RFC 8785 canonical JSON under the hood).&lt;/p&gt;
&lt;p&gt;The behavioral check grades a live model against what the file declares. It runs
real multi-turn conversations against any OpenAI-compatible endpoint and scores
the transcripts. For a persona that means verbosity, refusals, and state shifts.
For an SOP it means compliance probes and adversarial ones. For memory it means
recall and privacy leaks. Behavioral grading is probabilistic, so muster runs
each case several times and takes a k-of-n majority rather than trusting a
single roll.&lt;/p&gt;
&lt;p&gt;The layers, with the command for each:&lt;/p&gt;


















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;File&lt;/th&gt;&lt;th&gt;Command&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Persona&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;Soul.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;check&lt;/code&gt;, &lt;code dir=&quot;auto&quot;&gt;resolve&lt;/code&gt;, &lt;code dir=&quot;auto&quot;&gt;cts run&lt;/code&gt;, &lt;code dir=&quot;auto&quot;&gt;behave run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Skills&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;SKILL.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;skills run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SOP&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;AGENTS.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;sop run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tools&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;TOOLS.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;tools run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Memory&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;MEMORY.md&lt;/code&gt; / &lt;code dir=&quot;auto&quot;&gt;USER.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;memory run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Heartbeat&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;HEARTBEAT.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;heartbeat run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;A2A&lt;/td&gt;&lt;td&gt;Agent Card&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;a2a run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cross-layer&lt;/td&gt;&lt;td&gt;all of the above&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;crosslayer run&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;You bring your own model. Local Ollama, NVIDIA NIM, OpenAI, anything that speaks
the OpenAI chat API. There is no provider baked in, and the API key is read from
an environment variable at request time. It never goes in a flag, a manifest, or
a file on disk. A test in the repo fails the build if a secret-shaped string is
ever committed, which is the kind of guard rail I wish more projects had.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;try-it&quot;&gt;Try it&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;npm&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;install&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;-g&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;@garrison-hq/muster&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# every command ships with a runnable example&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;muster&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;check&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;examples/soul/Soul.md&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;--json&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;muster&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;skills&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;run&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;examples/skills/manifest.yaml&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;muster&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;a2a&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;run&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;examples/a2a/manifest.json&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;The static commands need nothing but Node 22. To grade a model, point a layer at
an endpoint and set &lt;code dir=&quot;auto&quot;&gt;MUSTER_API_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;the-part-i-did-not-expect-to-write-about&quot;&gt;The part I did not expect to write about&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;muster started as one thing: the reference conformance harness for Soul.md
RFC-1, a persona format. The interesting accident was that the engine underneath
did not care about personas at all. Parse, validate, resolve, grade, report. The
spec was a plugin. Once that was clear, six more layers followed on the same
core, and a 1.0.0 that was supposed to be a single-format tool turned into a
test suite for the whole stack.&lt;/p&gt;
&lt;p&gt;The other thing worth admitting: most of this was built by AI agents working
through a spec-driven process, and the entire trail is in the repository. Every
layer has a specification, a plan, work-package tasks, and a post-merge review,
all under &lt;code dir=&quot;auto&quot;&gt;kitty-specs/&lt;/code&gt;. I left it in on purpose. If you want to see how the
thing was actually made, it is right there next to the code.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;what-100-is-not&quot;&gt;What 1.0.0 is not&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;It is a CLI. There is no stable library API yet, so if you want to write a new
adapter you do it inside the repo for now. Behavioral grading is only as good as
your endpoint and your thresholds, and it will never be deterministic the way
the static checks are. And the seven layers track specs that are themselves
young, so expect them to move.&lt;/p&gt;
&lt;p&gt;That is the honest shape of it. If you are building agents from files and you
have no way to test those files, muster is for you. The code is Apache-2.0 on
&lt;a href=&quot;https://github.com/garrison-hq/muster&quot;&gt;GitHub&lt;/a&gt;, the docs are at
&lt;a href=&quot;https://garrison-hq.github.io/muster&quot;&gt;garrison-hq.github.io/muster&lt;/a&gt;, and I would
genuinely like to know which layer you reach for first.&lt;/p&gt;</content:encoded><category>announcement</category></item></channel></rss>