Teaching AI to speak
in tokens.
Even with a three-tier variable library, DTCG JSON exports, Code Connect, style documentation, and Figma Make pointing directly at our design system kit, AI-assisted prototyping still required up to five correction loops to produce token-accurate output. The design system wasn't broken. The interface between it and AI was. I built a new one.
Every prototype session started
with a correction loop.
Across the design team, prototyping with AI followed the same frustrating pattern: output a component, spot a token error, correct it, output again, spot another, correct again. By the time a component matched our actual design system, we'd spent more time correcting the AI than it had saved us. And it wasn't isolated to one tool. Figma Make, Claude, Cursor: same pattern every time.
The average prototype session required 4–5 rounds of correction. Designers lost the productivity benefit. PMs got slower turnaround. And each session started from scratch — no learning, no memory, no cumulative improvement.
The design system was complete.
And it still wasn't enough.
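Before building anything new, I needed to understand why such comprehensive infrastructure was producing such inconsistent output. We had everything the design community recommends: a three-tier variable library, DTCG JSON exports, Code Connect, style documentation, and Figma Make pointing directly at our design system kit.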
Despite all of this, every session still started with corrections. The problem wasn't documentation depth or tooling maturity. The problem was the format. Design systems are written to be understood by humans. AI doesn't read documentation — it processes instructions.
A format designed for humans
reads poorly to machines.
I ran a structured audit comparing what AI needed to produce token-accurate output against what our existing documentation provided. The gap wasn't quantity — it was structure.
What our existing documentation provided:
- Visual hierarchy conveyed through layout, not rules
- Token names without semantic disambiguation
- Confusable pairs: title vs subheadline, text.primary vs text.action
- No explicit hierarchy rule: when to use Alias vs Mapped
- No "stop and ask" instruction when a token was missing
- Responsive values buried in long documentation prose

What AI needed:
- Ordered, numbered rules: a decision procedure, not a reference
- Semantic intent tables: role → token name, not just token name
- Explicit disambiguation for every confusable pair
- Tier resolution rule: Mapped first, Alias second, Global never
- A hard stop: flag missing tokens, never invent values
- Breakpoint-specific values co-located with token definitions
AI models are excellent at following explicit, ranked instructions. They're poor at inferring implicit conventions from visual documentation. The fix wasn't more documentation — it was a different type of documentation, written for how AI actually processes context.
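As a rough illustration of what "breakpoint-specific values co-located with token definitions" can look like, here is a minimal TypeScript sketch. The token name `spacing.section`, the breakpoint set, and the pixel values are all hypothetical placeholders, not values from the actual library.

```typescript
// Hypothetical breakpoint names; only "sm" is mentioned in this case study.
type Breakpoint = "sm" | "md" | "lg";

// One record holds the token name, its semantic intent, and every
// breakpoint value, so nothing has to be inferred from prose.
interface ResponsiveToken {
  token: string;
  intent: string;
  values: Record<Breakpoint, number>;
}

// Placeholder token and values, for illustration only.
const sectionSpacing: ResponsiveToken = {
  token: "spacing.section",
  intent: "vertical gap between page sections",
  values: { sm: 24, md: 32, lg: 48 },
};

// Look up the value for an explicit breakpoint instead of assuming desktop.
function valueAt(entry: ResponsiveToken, breakpoint: Breakpoint): string {
  return `${entry.values[breakpoint]}px`;
}
```

Because the breakpoint is a required argument, a caller cannot silently fall back to the desktop value on mobile.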
Modular by design.
Precision through layering.
A skill that tries to cover everything in one file becomes too large for efficient AI context use and too rigid to update. I designed the architecture as a two-level system: a compact main orchestrator that defines the decision process, pointing to focused reference modules for each token category. AI loads only what it needs.
The 3-tier token architecture maps directly into the skill's resolution rule: AI always looks for the most specific token available — Mapped for known components, Alias for semantic intent, Global only as reference — never hardcoded.
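The resolution rule can be sketched as a small lookup function. This is an illustration under assumptions, not the skill itself: the lookup keys (`button-secondary.label`, `link.text`, and so on) and the pairing of `body.text` with `text.primary` are hypothetical; only the token names come from this case study.

```typescript
type Tier = "mapped" | "alias";

interface Resolution {
  token: string;
  tier: Tier;
}

interface MissingToken {
  missing: true;
  message: string;
}

// Mapped tier: component-specific tokens (hypothetical lookup keys).
const mappedTokens = new Map<string, string>([
  ["button-secondary.label", "text.secondary-button"],
  ["button-secondary.border", "border.secondary-button"],
]);

// Alias tier: semantic-intent tokens (hypothetical lookup keys).
const aliasTokens = new Map<string, string>([
  ["link.text", "text.action"],
  ["body.text", "text.primary"],
]);

// Mapped first, Alias second. Global values are never returned,
// and a missing token is flagged rather than invented.
function resolveToken(componentKey: string, intentKey: string): Resolution | MissingToken {
  const mapped = mappedTokens.get(componentKey);
  if (mapped !== undefined) {
    return { token: mapped, tier: "mapped" };
  }
  const alias = aliasTokens.get(intentKey);
  if (alias !== undefined) {
    return { token: alias, tier: "alias" };
  }
  return {
    missing: true,
    message: `No Mapped token for "${componentKey}" and no Alias token for "${intentKey}". Stop and ask.`,
  };
}
```

The function has no code path that returns a Global token or a raw value, which is the point of the rule: the only fallback is an explicit "missing" result.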
Each reference file is scoped to a single token category and sized to stay within efficient AI context limits. The main SKILL.md acts as the entry point — it defines the workflow, architecture rule, and common mistakes, then delegates deep token lookups to the relevant sub-file.
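To make the "load only what it needs" idea concrete, here is a minimal sketch of the selection step. In practice the AI tool follows the pointers in SKILL.md itself; this code only models the behavior. The category names, reference file paths, and keyword lists are assumptions made for the example.

```typescript
// Hypothetical token categories and reference file paths.
type TokenCategory = "color" | "typography" | "spacing" | "radius";

const referenceModules: Record<TokenCategory, string> = {
  color: "references/color.md",
  typography: "references/typography.md",
  spacing: "references/spacing.md",
  radius: "references/radius.md",
};

// Hypothetical keywords that signal a category is relevant to a task.
const categoryKeywords: Record<TokenCategory, string[]> = {
  color: ["color", "background", "fill"],
  typography: ["title", "headline", "font", "text"],
  spacing: ["spacing", "padding", "margin", "gap"],
  radius: ["radius", "rounded", "corner"],
};

// The main file is always loaded; reference modules only when relevant.
function selectContextFiles(task: string): string[] {
  const lower = task.toLowerCase();
  const files = ["SKILL.md"];
  for (const category of Object.keys(referenceModules) as TokenCategory[]) {
    if (categoryKeywords[category].some((kw) => lower.includes(kw))) {
      files.push(referenceModules[category]);
    }
  }
  return files;
}
```

For a task such as "Card with a title and rounded corners", this sketch would select SKILL.md plus the typography and radius references, and leave color and spacing out of context.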
Rules first. Examples second.
Ambiguity never.
The core design principle was to compress everything an experienced designer holds implicitly into an explicit, ordered decision process. The main SKILL.md has three functional sections (the workflow, the architecture rule, and the common mistakes), each solving a different failure mode from the original audit.
The two disambiguation tables were the highest-leverage additions. Before these existed, AI regularly confused token pairs that look similar but serve different roles. After they were added, errors in these categories dropped to zero.
| What AI used to do wrong | What the skill enforces |
|---|---|
| text.secondary for button labels | text.secondary-button |
| border.secondary for button outlines | border.secondary-button |
| background.action for secondary button | background.action-secondary |
| subheadline (Book) for card headings | title (Medium) for card headings |
| display for page headings | headline 1 = pages; display = marketing heroes only |
| Desktop spacing applied on mobile | Always check sm breakpoint values explicitly |
| text.primary for links | text.action |
| Hardcoded 16px for radius | base radius token |
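The same pairs can also be expressed as a mechanical check on generated output. The sketch below is one possible way to do that, not part of the skill described here: the role labels (`secondary-button.label`, `card.heading`, and so on) are hypothetical, while the expected token names are taken from the table.

```typescript
interface TokenUsage {
  role: string;
  token: string;
}

interface Correction {
  role: string;
  found: string;
  expected: string;
}

// Hypothetical role labels mapped to the token the table says to enforce.
const expectedTokenByRole = new Map<string, string>([
  ["secondary-button.label", "text.secondary-button"],
  ["secondary-button.border", "border.secondary-button"],
  ["secondary-button.background", "background.action-secondary"],
  ["card.heading", "title"],
  ["page.heading", "headline 1"],
  ["link.text", "text.action"],
]);

// Report every usage whose token differs from the expected one for its role.
// Roles that are not in the map are left alone rather than guessed at.
function lintTokenUsage(usages: TokenUsage[]): Correction[] {
  const corrections: Correction[] = [];
  for (const usage of usages) {
    const expected = expectedTokenByRole.get(usage.role);
    if (expected !== undefined && expected !== usage.token) {
      corrections.push({ role: usage.role, found: usage.token, expected });
    }
  }
  return corrections;
}
```

A check like this could run over the tokens an AI tool emits, turning the disambiguation table into a quick regression test for each skill revision.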
Not just functional.
Measurably excellent.
Once the skill was built, I needed a rigorous way to evaluate it — not just "does it work" but "how well does it work, and where can it improve." I developed an 8-dimension evaluation framework spanning structural quality and real-world performance, applying it before and after each iteration cycle.
The framework weights dimensions by their impact on output quality: real-world performance carries 25% of the total score, and instruction specificity and workflow clarity each carry 15%, reflecting how directly these dimensions determine whether AI produces correct output on first pass.
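A weighted score of this kind can be computed as in the sketch below. Only the three weights named above come from the framework. The remaining five dimensions are not named in this write-up, so they appear as placeholders (`dimension-4` through `dimension-8`) with the remaining 45% split evenly, which is an assumption. The 0 to 10 scoring scale is also assumed.

```typescript
interface Dimension {
  id: string;
  weightPct: number;
}

const dimensions: Dimension[] = [
  { id: "real-world-performance", weightPct: 25 },
  { id: "instruction-specificity", weightPct: 15 },
  { id: "workflow-clarity", weightPct: 15 },
  // Placeholders: the other five dimensions and their weights are assumed.
  { id: "dimension-4", weightPct: 9 },
  { id: "dimension-5", weightPct: 9 },
  { id: "dimension-6", weightPct: 9 },
  { id: "dimension-7", weightPct: 9 },
  { id: "dimension-8", weightPct: 9 },
];

// Weighted average of per-dimension scores (assumed to be on a 0-10 scale).
function weightedScore(scores: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const dimension of dimensions) {
    const score = scores[dimension.id];
    if (score === undefined) {
      throw new Error(`Missing score for ${dimension.id}`);
    }
    total += dimension.weightPct * score;
    weightSum += dimension.weightPct;
  }
  return total / weightSum;
}
```

Running the same function before and after an iteration cycle gives a single comparable number, while the per-dimension scores still show where the skill needs work.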
One source of truth.
Two surfaces.
The skill solved AI's format problem. But designers, PMs, and engineers still needed a human-readable reference. Rather than maintain two separate sources, I built a parallel surface: an HTML design guide using Claude Design that maps every token directly to its intended use, with visual previews, copy-pasteable CSS variable names, and developer-ready Tailwind equivalents.
The result: when a developer deploys an AI-generated component, they can cross-reference the same token names in the HTML guide to verify intent. AI codes it. Humans verify it. The same token vocabulary runs through both surfaces — no translation required.
This dual-surface approach closes a gap that most teams leave open: design systems that are either too abstract for AI or too visual for programmatic use. The token vocabulary stays identical across both — what AI calls text.action, the HTML guide labels and previews with the same name. Designers, AI, and developers share one language.
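One way to keep both surfaces on a single vocabulary is to derive them from the same token records. The following is a minimal sketch of that idea, not the actual build. The naming convention (dots become hyphens in the CSS variable) and the output formats for the skill line and the guide row are assumptions.

```typescript
interface TokenEntry {
  token: string;  // shared name, for example "text.action"
  intent: string; // what the token is for
}

// Assumed convention: "text.action" becomes "--text-action".
function toCssVariable(token: string): string {
  return "--" + token.replace(/\./g, "-");
}

// Surface 1: a line for the AI-facing skill (intent -> token).
function toSkillRule(entry: TokenEntry): string {
  return `${entry.intent} -> ${entry.token}`;
}

// Surface 2: a row for the human-facing HTML guide, using the same name.
function toGuideRow(entry: TokenEntry): string {
  const cssVar = toCssVariable(entry.token);
  return `<tr><td>${entry.token}</td><td><code>var(${cssVar})</code></td><td>${entry.intent}</td></tr>`;
}

const linkText: TokenEntry = { token: "text.action", intent: "links" };
const skillLine = toSkillRule(linkText);
const guideRow = toGuideRow(linkText);
```

Because both outputs are generated from one `TokenEntry`, renaming a token changes the skill and the guide together, so the two surfaces cannot drift apart.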
What changed
The shift from 5 correction loops to 1–2 wasn't the most significant outcome. The more durable change was qualitative: AI prototyping went from a source of frustration — something that needed constant supervision — to a reliable capability. People started using it more, because they trusted it more.
The real lesson: AI tools don't fail because they're dumb — they fail because they're given documentation written for humans. Once the interface matched how AI actually processes instructions, the same model that produced 5 errors on first pass produced 0. The capability was always there. The missing piece was a format it could actually use.
For designers:
- Prototype directly in the right tokens, with no correction overhead
- Figma Make, Claude, Cursor all use the same skill
- First-pass output matches production design system

For engineers:
- AI-generated code references real token names
- HTML guide cross-references the same vocabulary
- Design-to-dev handoff requires less translation

For the organization:
- Methodology independently adopted by engineering into the shared token library
- Skill architecture replicated across 8 design system sections in kv-tokens
- Scalable: update the skill once, all AI tools benefit
The proof is in
the prototype.
Same AI model. Same design system. Same prompt. The only variable: whether the skill was loaded. Left is what came back before the skill existed — five correction rounds before the tokens matched. Right is what came back after — accurate on the first pass, tokens correct, no corrections needed.
Let's close
a gap together.
I'm looking for teams where research drives the roadmap and design ships — not just specs.