Why SkillSpec Exists

Abstract

The problem is not that skills are weak.

The problem is that the most important skills become load-bearing prose.

They start as clean instructions, then grow into state machines, dependency notes, tool policies, code snippets, references, and release checks.

That growth runs along two axes.
One makes each skill longer.
The other makes the skill library wider.
Both compete for model attention.

SkillSpec is the answer we arrived at after measuring that pressure: keep prose for judgment, but move the parts that must be followed into a small, portable behavior contract.

The contract says when the skill applies, which route to choose, what steps to run, what tools are allowed, what checks must pass, and what proof should exist at the end.

1. Skills

Skills are becoming the packaging layer for tacit domain knowledge.

A prompt is usually a one-off instruction. A skill is a reusable package: frontmatter for discovery, a SKILL.md body for instructions, and optional resources such as examples, scripts, reference files, and checklists. That packaging matters because it lets an agent carry specialized ways of working across tasks and harnesses.

In practice, skills are becoming the place where teams write down tacit operating knowledge: how to use a product, how to review a legal clause, how to call a CLI safely, how to sequence a release, or how to keep an agent aligned with a desired outcome. The format emerged from prompts, but the intent is larger than prompting. It is a way to ship capability.

This became more important as agent harnesses evolved. Early models already knew common internet and programming patterns from training. Modern harnesses then gave agents access to files, shells, browsers, APIs, MCP servers, and local tools. Some of those tools are natural to a model. Others are private, new, domain-specific, or company-specific. Skills became the teaching layer that tells the agent how to use those tools in the right order for a real outcome.

Prompt: A transient instruction inside one conversation.

Skill: A portable package of instructions, resources, scripts, and examples.

SkillSpec: A checkable contract beside the skill for route, steps, limits, checks, and proof.

The original skill architecture was a real advance because it used progressive disclosure: load a small description first, load the full body when relevant, and load references only when needed. That solves the catalog problem better than dumping every prompt into context. But after activation, the skill body is still prose. It can be skipped, reordered, partially followed, or interpreted differently by another model or harness. (Source: Reliability gap paper.)
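The three loading tiers can be sketched in a few lines. This is a minimal illustration, not the loader any particular harness ships: the `LazySkill` class, its method names, and the sample release skill are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LazySkill:
    """Hypothetical skill handle that exposes its content in three tiers."""
    name: str
    description: str                       # tier 1: always visible for discovery
    load_body: Callable[[], str]           # tier 2: read only on activation
    load_reference: Callable[[str], str]   # tier 3: read only when requested
    loaded: List[str] = field(default_factory=list)

    def catalog_entry(self) -> str:
        # The only text that competes for attention before activation.
        return f"{self.name}: {self.description}"

    def activate(self) -> str:
        self.loaded.append("body")
        return self.load_body()

    def reference(self, key: str) -> str:
        self.loaded.append(f"reference:{key}")
        return self.load_reference(key)


# Illustrative content only.
REFERENCES = {"rollback": "Revert the tag, then republish."}

release = LazySkill(
    name="release",
    description="Sequence a release safely.",
    load_body=lambda: "1. Run tests. 2. Tag. 3. Publish.",
    load_reference=lambda key: REFERENCES[key],
)

context = [release.catalog_entry()]            # discovery
context.append(release.activate())             # activation
context.append(release.reference("rollback"))  # on-demand reference
```

Note what the sketch does not do: once `activate()` returns, nothing checks that the agent follows the body in order. That is the gap the rest of this article is about.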

2. Attention

Skill growth creates two kinds of pressure.

Longitudinal pressure is what happens inside one skill over time. The skill starts short. Then it gains a decision tree, safety rules, setup steps, examples, dependency notes, code snippets, references, exceptions, and final reporting requirements. The file is more complete, but the agent now has more instructions to hold and sequence.

Lateral pressure is what happens across the skill library. As agents reach deeper into a user's environment, every useful CLI, API, MCP server, browser flow, and domain workflow wants a skill. More skills means more discovery metadata, more routing ambiguity, and more pressure on the harness budget reserved for skills.

Longitudinal: One skill gets deeper. Instructions become plans, state, rules, snippets, references, and proof demands.

Lateral: The library gets wider. More tools and domains create more skill descriptions competing for selection.

Result: The agent must select the right skill, then follow the right slice of it, without losing steps.

Doctor was built to make this visible. It classifies the shape of a skill or workspace, then measures activation-loaded text, dependency and proof surfaces, frontmatter discovery risk, and agent drift risk. The score is not a grade of domain quality. It is a static estimate of follow-through pressure: how likely the current shape is to make an agent skip, distort, over-trigger, under-trigger, or finish without proof. (Source: Doctor risk model.)
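To make "static estimate" concrete, here is a toy sketch of the idea. It is not Doctor's model: the signal names, the keyword lists, and every threshold below are assumptions chosen only to show how a directional label can be derived from text shape without running the skill.

```python
import re
from dataclasses import dataclass


@dataclass
class PressureSignals:
    activation_words: int   # how much text loads when the skill activates
    hard_rules: int         # count of must/never/always style constraints
    mentions_proof: bool    # does the body ask for evidence at all?


def measure(skill_body: str) -> PressureSignals:
    """Collect crude, purely static signals from a skill body."""
    hard_rules = re.findall(r"\b(?:must|never|always)\b", skill_body, flags=re.IGNORECASE)
    proof = re.search(r"\b(?:evidence|proof|verify)\b", skill_body, flags=re.IGNORECASE)
    return PressureSignals(
        activation_words=len(skill_body.split()),
        hard_rules=len(hard_rules),
        mentions_proof=proof is not None,
    )


def pressure_label(signals: PressureSignals) -> str:
    """Map signals to a directional label. Thresholds are illustrative."""
    score = 0
    if signals.activation_words > 1500:
        score += 1
    if signals.hard_rules > 10:
        score += 1
    # Hard rules with no proof demand are easy to finish without evidence.
    if signals.hard_rules > 0 and not signals.mentions_proof:
        score += 1
    return ["low", "moderate", "high", "high"][score]


body = "You must run tests. Never publish before approval."
signals = measure(body)
label = pressure_label(signals)
```

The label is a risk direction, not a probability, which matches how the report is meant to be read.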

3. Model limits

Better models help, but they do not remove the follow-through problem.

A stronger model can read more, reason better, and recover from more ambiguity. That still does not turn prose into a guarantee. If a skill says "never use tool X before approval," that sentence is only as strong as the model's attention to it at the moment of action. If a later phase says "run tests before release," the run still needs evidence that tests actually ran.
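The difference between a sentence and a check can be sketched as follows. This is a minimal illustration under assumed names (`ToolGate`, `ToolPolicyError`, and the `deploy` tool are hypothetical), not SkillSpec's enforcement mechanism.

```python
from dataclasses import dataclass, field
from typing import List, Set


class ToolPolicyError(Exception):
    """Raised when a tool call would violate the stated policy."""


@dataclass
class ToolGate:
    """Hypothetical gate that turns 'never use X before approval' into a check."""
    requires_approval: Set[str]
    approved: Set[str] = field(default_factory=set)
    log: List[str] = field(default_factory=list)

    def approve(self, tool: str) -> None:
        self.approved.add(tool)

    def call(self, tool: str, action: str) -> None:
        if tool in self.requires_approval and tool not in self.approved:
            raise ToolPolicyError(f"{tool!r} requires approval before use")
        # The log doubles as evidence that the action actually happened.
        self.log.append(f"{tool}:{action}")


gate = ToolGate(requires_approval={"deploy"})

try:
    gate.call("deploy", "publish")
    blocked = False
except ToolPolicyError:
    blocked = True  # the rule held regardless of what the model attended to

gate.approve("deploy")
gate.call("deploy", "publish")
```

The rule no longer depends on the model remembering it at the moment of action, and the log gives a later phase something to inspect.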

The design docs make a narrow claim: long context and compaction are not free. Token windows are large, but usable attention is not uniform, and summarized context can drop governance constraints. Doctor treats these as directional risk signals, not exact probabilities. The report cites its basis so the user can see what is measured, what is policy, and what is still a hypothesis. (Source: Token economy design.)

4. The move

SkillSpec keeps prose, but stops making prose carry everything.

The goal is not to replace skills. A good skill still needs human-readable explanation, examples, references, and judgment. The move is to place a small contract next to that prose for the parts that matter most.

Assess: Doctor measures the current skill before changing it.

Port: Source maps and import scaffolds turn prose into a reviewed contract.

Guide: Run-loop serves the current route and phase instead of the whole spec.

Prove: Trace alignment compares what was planned with what has evidence.

This is why SkillSpec is a CLI plus a structured skill plan. The CLI does the mechanical work: map, validate, test, compile, plan, record progress, and align traces. The skill plan carries the intended behavior: routes, phases, tool boundaries, dependencies, checks, tests, and proof obligations. The installed loader stays small so the agent can ask the CLI for the next slice instead of loading the whole operating manual. (Source: Guided trampoline design.)
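As a rough sketch of what "ask for the next slice" means, the structure below models a plan as routes made of phases and returns only the first unfinished phase. The field names and the `next_slice` helper are assumptions for illustration; they are not the SkillSpec schema or CLI.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Phase:
    name: str
    steps: List[str]
    allowed_tools: List[str]
    checks: List[str]


@dataclass
class SkillPlan:
    skill: str
    routes: Dict[str, List[Phase]]  # route name -> ordered phases


def next_slice(plan: SkillPlan, route: str, completed: List[str]) -> Optional[Phase]:
    """Return only the first unfinished phase of the chosen route."""
    for phase in plan.routes[route]:
        if phase.name not in completed:
            return phase
    return None  # nothing left on this route


plan = SkillPlan(
    skill="release",
    routes={
        "hotfix": [
            Phase("verify", ["run unit tests"], ["shell"], ["tests pass"]),
            Phase("publish", ["tag", "push"], ["shell", "git"], ["tag exists"]),
        ]
    },
)

first = next_slice(plan, "hotfix", completed=[])
second = next_slice(plan, "hotfix", completed=["verify"])
done = next_slice(plan, "hotfix", completed=["verify", "publish"])
```

The agent sees one phase with its steps, tool boundary, and checks, rather than every route in the plan.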

That boundary matters. If follow-through optimizations live only inside one harness, the skill becomes less portable the moment it moves. SkillSpec keeps the optimization beside the skill itself: the prose, the contract, the tests, and the proof model travel together.

5. Operating model

What SkillSpec gives an agent operator.

Measure the existing skill

Run Doctor before porting. It tells you whether the current shape is a simple skill, a multi-skill workspace, a plugin-shaped workspace, or a non-skill repository. It also explains where follow-through risk comes from.

Derive a contract from prose

Import starts with source mapping, not guesswork. The agent reviews handles, headings, code blocks, dependencies, and review-required spans before promoting prose into structured routes, rules, checks, and proof.

Run with progressive guidance

The run loop selects the route, prints the current gate, records progress, and resumes from a persisted trace after interruption or compaction.
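Resuming from a persisted trace can be sketched with a small JSON file. The file layout, gate names, and helper functions here are hypothetical and stand in for whatever the run loop actually stores.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def load_trace(path: Path) -> dict:
    """Load the persisted trace, or start an empty one."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"route": None, "completed": []}


def record(path: Path, trace: dict, gate: str, evidence: str) -> None:
    """Append a completed gate with its evidence reference and persist it."""
    trace["completed"].append({"gate": gate, "evidence": evidence})
    path.write_text(json.dumps(trace, indent=2), encoding="utf-8")


def current_gate(gates: List[str], trace: dict) -> Optional[str]:
    """Return the first gate that has no recorded completion."""
    done = {entry["gate"] for entry in trace["completed"]}
    for gate in gates:
        if gate not in done:
            return gate
    return None


GATES = ["tests-pass", "changelog-updated", "tag-created"]  # illustrative gates

with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "trace.json"
    trace = load_trace(trace_path)
    trace["route"] = "release"
    record(trace_path, trace, "tests-pass", "example-test-log.txt")

    # Simulate an interruption: discard in-memory state and reload from disk.
    resumed = load_trace(trace_path)
    resumed_gate = current_gate(GATES, resumed)
```

Because the state lives on disk rather than in the conversation, compaction cannot drop it.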

Route large skill libraries

Router mode moves broad discovery into a local index. The prompt sees the selected candidates, not every installed skill description.
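A local index can be as simple as a lookup that scores descriptions against the task and returns a short list. The term-overlap scoring below is a stand-in technique chosen for brevity, and the index contents are invented; the real router may rank quite differently.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_candidates(index: Dict[str, str], task: str, limit: int = 2) -> List[str]:
    """Rank skills by term overlap with the task and keep the top few."""
    task_terms = tokenize(task)
    scored = []
    for name, description in index.items():
        overlap = len(task_terms & tokenize(description))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties break alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical local index: skill name -> discovery description.
INDEX = {
    "release": "sequence a release tag and publish a package",
    "legal-review": "review a legal clause in a contract",
    "browser-login": "log in to a site with a browser",
}

picked = select_candidates(INDEX, "Publish the package release")
```

Only the names in `picked` would reach the prompt; the other descriptions stay in the index.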

Create skills from observed work

Durable execution, powered by Modiqo Rote when installed, captures real interaction evidence so tacit workflow knowledge can become an explicit SkillSpec-backed skill.

The implementation details are intentionally split by concern: the import and release flow for porting, router mode for routing across broad catalogs, and durable execution for observed tool-backed work.

6. Proof

The final answer should not be the only artifact you trust.

SkillSpec separates decision evidence from execution evidence. A decision trace can prove which route was selected. A progress ledger can prove which requirements were satisfied. Alignment compares those records against the current contract and reports pass, partial, or fail.

Decision trace: What route and rules were selected for the task.

Progress ledger: What the agent recorded as completed, with evidence references.

Alignment report: What is proven, missing, forbidden, or contradicted.

Token metrics: What was measured, estimated, or explicitly not recorded.

This is the scientific posture of the project: do not infer success from a polished answer. Ask what route was selected, what checks passed, what proof exists, and what remains unproven. (Source: Traces and alignment.)
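A minimal sketch of alignment, assuming a ledger that maps each requirement to an evidence reference. The status rules below (any forbidden tool use fails, some missing proof is partial) are an assumption made for illustration, as are all names in the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class AlignmentReport:
    status: str            # "pass", "partial", or "fail"
    proven: List[str]
    missing: List[str]
    forbidden: List[str]


def align(
    required: List[str],
    forbidden_tools: Set[str],
    ledger: Dict[str, str],
    tools_used: List[str],
) -> AlignmentReport:
    """Compare recorded evidence against what the contract requires."""
    proven = [req for req in required if ledger.get(req)]
    missing = [req for req in required if not ledger.get(req)]
    violations = [tool for tool in tools_used if tool in forbidden_tools]

    if violations or not proven:
        status = "fail"
    elif missing:
        status = "partial"
    else:
        status = "pass"
    return AlignmentReport(status, proven, missing, violations)


required = ["tests-pass", "tag-created"]
ledger = {"tests-pass": "example-test-log.txt"}  # requirement -> evidence reference

report = align(required, {"force-push"}, ledger, tools_used=["shell"])
violation = align(required, {"force-push"}, ledger, tools_used=["shell", "force-push"])
```

Here `report` comes back partial because the tag has no evidence, however confident the final answer sounded.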

Sources

Design material behind this article.

These links open the implementation and design notes used to write this page. They are intentionally public so the claims can be checked against the project source.

Reliability gap in agent skills
Doctor agent drift risk model
Import to release explainer
Guided trampoline design
Traces and alignment
Completion alignment and token reporting
Router mode
Durable executor
Workspace authoring graph
Performance and token economy
