A year ago, if you had told me that today I wouldn’t be writing any code, I’d have asked whether I was out of work, had broken both of my hands, or had gone blind in both eyes. Six months ago, I would’ve said about the same. But five months ago, I revisited LLMs and noticed a fundamentally different level of output, especially when trying agentic LLM solutions such as Claude Code. My reactions came roughly in this order: “Oh wow” -> “Oh no” -> “What happens if I try this?!?” -> “Let the revolution begin”.

Fast-forward to today: I don’t write code anymore. 100% of my code is AI-generated. Does this mean that I blindly accept its output? Of course not. I know good code when “I sees it”. And generally, I know bad code when I have to use it. This is a skill I am grateful to have learned as a software engineer in test. But what does this mean for how I use an LLM effectively? Over the past several months, I have moved from a 100% human-generated coding workflow to a 100% AI-generated one. During this time, I have learned about various AI “primitives”: tools, commands, skills, and agents. While I am still learning about (and actively experimenting with) agents and, to a lesser extent, skills, I want to share my experiences from this process.

In this article, I want to share when to use which primitive and how to build skills that produce more than what a single prompt with Claude (or your LLM of choice) would give you.

Examples and conventions throughout (SKILL.md, MCP, subagents) come from the Claude ecosystem. However, the concepts should apply to any LLM-based agent framework with comparable primitives.

Prerequisites

This article references several Claude-ecosystem formats and protocols. If you’re unfamiliar with any of them, skim the relevant docs first:

  • SKILL.md — Claude’s convention for a markdown skill file: frontmatter metadata (name, description) plus a body of instructions and examples. The description is what Claude reads to decide whether to invoke a skill; the body is what it follows once invoked.
  • MCP (Model Context Protocol) — the protocol Claude uses to talk to external tools, either via local subprocess (stdio) or remote service (HTTP/SSE).
  • Subagents — Claude Code’s mechanism for invoking specialized agents with isolated context within a session.

You don’t need to be an expert in any of these topics to follow the article, but high-level knowledge will help.
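To make the SKILL.md shape concrete, here is a minimal Python sketch that splits a skill file into its frontmatter metadata and its body. The sample skill text, the `parse_skill` helper, and the naive `key: value` parsing are all assumptions for illustration; real skill files use YAML frontmatter, so a proper YAML parser would be the safer choice outside of a demo.

```python
# Hypothetical skill file, invented for this example.
SAMPLE_SKILL = """\
---
name: incident-triage
description: Triage a reported incident. Use when someone mentions an outage or alert.
---
# Incident Triage

1. Search the incidents channel for recent messages.
2. Summarize the timeline.
"""


def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md-style document into (metadata, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("expected frontmatter to start with '---'")
    try:
        end = lines.index("---", 1)  # closing delimiter of the frontmatter
    except ValueError:
        raise ValueError("frontmatter is missing its closing '---'") from None

    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        # Naive parsing: split on the first colon only (not real YAML).
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        metadata[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


metadata, body = parse_skill(SAMPLE_SKILL)
print(metadata["name"])          # what the skill is called
print(metadata["description"])   # what the LLM reads to decide whether to invoke it
print(body.splitlines()[0])      # first line of what it follows once invoked
```

The split matters because the two halves do different jobs: the description decides whether the skill fires at all, and the body only comes into play afterwards.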

Decision Guide by Context

flowchart TD
    BEGIN(("Start"))
    BEGIN --> START(["🤔 I want my LLM<br/>to do something<br/>repeatably or better"])
    START --> Q1{"Is this a<br/>one-off task?"}

    Q1 -->|Yes| ONOFF["💬 Unassisted prompt.<br/>Talk to the LLM directly.<br/>No primitive needed."]
    ONOFF --> HAPPY{"Are you happy<br/>with the answer?"}
    HAPPY -->|Yes| DONE(("Done"))
    HAPPY -->|No| IMPROVE{"Can more unassisted<br/>prompts improve<br/>the answer?"}
    IMPROVE -->|Yes| CLARIFY["✏️ Provide a clarifying<br/>unassisted prompt"]
    CLARIFY --> HAPPY
    IMPROVE -->|No| CONTEXT(["📎 Prompt with<br/>supplemental context<br/>example code, design docs,<br/>screenshots, error logs,<br/>relevant file paths"])
    CONTEXT -.->|"Same context every time?<br/>Make it durable"| DURABLE(["📌 Durable context<br/>CLAUDE.md, AGENTS.md<br/>auto-loaded<br/>every conversation"])
    DURABLE -.->|"Need a recipe or<br/>workflow on top?"| SKILL

    Q1 -->|No| Q2{"Who triggers it —<br/>you or the LLM?"}

    Q2 -->|"I type it explicitly<br/>(/command, hotkey, script)"| CMD["✅ COMMAND<br/>Single-step, fixed behavior.<br/>Slash command, shell script,<br/>Makefile target, npm task.<br/>You decide when it runs."]

    Q2 -->|"The LLM picks it up<br/>contextually when relevant"| Q3{"What's missing —<br/>a capability<br/>or a workflow?"}

    Q3 -->|"Capability gap —<br/>the LLM can't read, write,<br/>compute, or search<br/>without help"| SCRIPT["✅ SCRIPT<br/>Local executable<br/>(bash, python, node).<br/>No MCP overhead.<br/>Skills invoke it directly."]
    SCRIPT -->|"Need network access<br/>or to cross process /<br/>team boundaries? <br/> Script isn't cutting it?"| TOOL["✅ TOOL (MCP)<br/>Adds verbs the LLM can't<br/>perform with conversation <br/>alone: read state, <br/> write state, run code, <br/>search, query APIs."]

    Q3 -->|"Workflow gap —<br/>the LLM can already speak<br/>the verbs, but doesn't know<br/>the recipe"| SKILL["✅ SKILL<br/>Orchestrates primitives.<br/>Multi-step + domain judgment.<br/>Triggers contextually.<br/>Composes commands <br/>+ tools."]

    %% Composition — skills sit above and orchestrate
    SKILL -.->|"Skills compose scripts<br/>they own as workflow steps<br/>(the default — no MCP <br/>needed)"| SCRIPT
    SKILL -.->|"Skills call tools when<br/>capability is owned externally<br/>or needs network access"| TOOL
    CMD -.->|"Want the LLM to invoke<br/>this itself? Wrap it<br/>in a skill"| SKILL

    %% Multi-skill agent layer — gated on maturity, sustained by feedback
    SKILL -.->|"Skills mature enough<br/>(or paired with feedback loops)<br/>to coordinate with judgment?"| AGENT["🤖 AGENT / Subagent<br/> system<br/>Coordinates multiple skills.<br/>Isolated context per<br/> subagent. Extended context,<br/> multi-domain work."]
    AGENT -.->|"Feedback loops keep<br/>skills + coordination improving<br/>(see Skill Refinement Loop)"| SKILL

    %% Visual emphasis on the terminal primitive nodes (the "answers" the
    %% decision tree leads to). classDef adds the `terminal` class + thicker
    %% stroke-width; the theme-aware stroke color is set in theme.css so it
    %% tracks `--theme-accent` (purple in modern, neon green in terminal).
    classDef terminal stroke-width:3px
    class CMD,SCRIPT,TOOL,SKILL,AGENT terminal
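
The flowchart above can also be read as a small decision function. The Python sketch below is only one way to encode it: the `Need` fields and the `choose_primitive` name are hypothetical, and the agent layer is left out because the diagram gates it on skill maturity rather than on a yes/no question.

```python
from dataclasses import dataclass


@dataclass
class Need:
    """Answers to the questions in the decision guide (field names are illustrative)."""
    one_off: bool = False
    user_triggered: bool = False             # you type a /command or run a script yourself
    capability_gap: bool = False             # the LLM cannot read/write/compute/search unaided
    needs_network_or_crosses_teams: bool = False


def choose_primitive(need: Need) -> str:
    """Walk the same branches as the flowchart, top to bottom."""
    if need.one_off:
        return "unassisted prompt"
    if need.user_triggered:
        return "command"
    if need.capability_gap:
        # Scripts are the default; MCP only when the script is not enough.
        if need.needs_network_or_crosses_teams:
            return "tool (MCP)"
        return "script"
    # Workflow gap: the verbs exist, the recipe does not.
    return "skill"


print(choose_primitive(Need(one_off=True)))
print(choose_primitive(Need(user_triggered=True)))
print(choose_primitive(Need(capability_gap=True)))
print(choose_primitive(Need(capability_gap=True, needs_network_or_crosses_teams=True)))
print(choose_primitive(Need()))
```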

How the primitives compose

The decision tree above shows when AI tooling primitives apply. In practice they compose. Tools (or simple scripts) sit at the bottom as the foundation layer — the verbs the LLM can speak. Commands are single-step explicit triggers you invoke. Skills act as orchestrators, layering domain knowledge and contextual triggers over those lower primitives (scripts, tools, commands, even other skills).

Let’s visualize using a custom Slack workflow. Imagine an “incident triage” task: when someone reports an issue, the LLM should search #incidents for the last 2 hours of messages, find the original alert thread, summarize the timeline, post a status update to #status, DM the on-call, and file a follow-up ticket. None of those are new capabilities — the Slack MCP already exposes search, post, and DM as atomic “verbs”, and the task tracker exposes a “create” tool. All of this is publicly available. The real complexity is the business domain and how to solve the problem effectively: the order of operations, knowing which channel, which thread, what to summarize, and who to ping.

flowchart TD
    %% Flow A — user-prompted incident triage
    BEGIN_A(("Start: user"))
    BEGIN_A --> USER(["👤 'looks like an incident<br/>in #alerts — can you triage?'"])
    USER -.->|"contextual trigger"| SKILL
    SKILL["🧭 SKILL: Incident Triage<br/>Recognizes incident-flavored prompts.<br/>Knows which channels, the timeline,<br/>who to ping, what to file."]

    %% Skill invokes specific tools/commands with specific intent.
    %% Tool nodes stay pure capability names — the skill's *use* of the
    %% tool lives on the edge label, separating "what the tool is" from
    %% "what the skill does with it".
    SKILL -->|"summarize incident<br/>and update status page"| C1
    SKILL -->|"search #incidents,<br/>last 2 hours"| T1
    SKILL -->|"find the original<br/>alert thread"| T2
    SKILL -->|"post status update<br/>to #status"| T3
    SKILL -->|"DM the on-call engineer"| T4
    SKILL -->|"create a follow-up<br/>ticket about this incident"| T5
    SKILL -->|"acknowledge the<br/>PagerDuty alert<br/>(programmatic REST call)"| S1

    %% Flow B — scheduled standup poster (different entry, same tool layer,
    %% different intent per tool call).
    BEGIN_B(("Start: schedule"))
    BEGIN_B --> SCHED(["⏰ 9am every weekday<br/>cron-like trigger"])
    SCHED --> AGENT2["🤖 Daily standup poster<br/>(a separate agent)"]
    AGENT2 -.->|"post the standup<br/>thread to #engineering"| T3
    AGENT2 -.->|"create a recurring task<br/>for standup notes"| T5

    %% Flow C — automation invokes the same skill via command syntax
    %% (deterministic; no contextual matching).
    BEGIN_C(("Start: automation"))
    BEGIN_C --> AUTOMATION(["🪝 Automated trigger<br/>e.g., PagerDuty webhook,<br/>CI failure alert"])
    AUTOMATION -.->|"explicit /command invocation<br/>(skips contextual matching)"| SKILL

    subgraph CMDS ["⚡ Commands"]
        C1["/run-status-page-update"]
    end

    subgraph MCPS ["🔌 MCP Tools"]
        T1["slack.search_channel"]
        T2["slack.find_thread"]
        T3["slack.post_message"]
        T4["slack.dm_user"]
        T5["tasktracker.create"]
    end

    %% Scripts sit alongside MCP tools — same role (capability the skill calls),
    %% lower overhead. Often a script hitting a public REST API exposes surface
    %% that the published MCP server doesn't.
    subgraph SCRIPTS ["📜 Scripts"]
        S1["python pagerduty_ack.py"]
    end

    %% Commands themselves layer on MCP tools — not atomic primitives.
    C1 -.->|"internally calls"| T3

    %% A script can be promoted to an MCP tool later if ownership crosses
    %% team boundaries — until then it's the lighter-weight option.
    S1 -.->|"promote to MCP only if<br/>HTTP API / SDK<br/>isn't enough, or<br/>an agent uses it<br/>as a workflow step"| TOOL_NOTE(["🔁 same role,<br/>different overhead"])

    %% Visual emphasis matches Diagram 1: primitives get the theme-accent
    %% stroke so the categories pop out.
    classDef terminal stroke-width:3px
    class C1,S1,T1,T2,T3,T4,T5 terminal
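
To show the split between "verbs" and "recipe" from the incident triage example, here is a rough Python sketch. It is an illustration only: in practice the skill is prose that the LLM follows, not a Python function. The tool names are the illustrative ones from the diagram, and `RecordingTools`, its canned responses, and `triage_incident` are hypothetical stand-ins so the example runs offline.

```python
from typing import Any


class RecordingTools:
    """Stand-in for MCP tools: records each call instead of contacting Slack."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    def call(self, name: str, **kwargs: Any) -> dict[str, Any]:
        self.calls.append((name, kwargs))
        # Canned responses (made up) so the sketch needs no network access.
        canned: dict[str, dict[str, Any]] = {
            "slack.search_channel": {"messages": ["10:02 alert fired", "10:15 rollback started"]},
            "slack.find_thread": {"thread_id": "T-1"},
        }
        return canned.get(name, {"ok": True})


def triage_incident(tools: RecordingTools, on_call: str) -> str:
    """The recipe: the order of calls and their arguments are the domain knowledge."""
    found = tools.call("slack.search_channel", channel="#incidents", last_hours=2)
    thread = tools.call("slack.find_thread", channel="#incidents", query="original alert")
    summary = "Timeline: " + "; ".join(found["messages"])
    tools.call("slack.post_message", channel="#status", text=summary)
    tools.call("slack.dm_user", user=on_call, text=f"Please review thread {thread['thread_id']}")
    tools.call("tasktracker.create", title="Incident follow-up", body=summary)
    return summary


tools = RecordingTools()
summary = triage_incident(tools, on_call="@alice")
print(summary)
print([name for name, _ in tools.calls])
```

Nothing in `RecordingTools` knows about incidents; everything in `triage_incident` does. That is the same boundary the diagram draws between the tool nodes and the edge labels.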

Litmus test for “tool, command, or skill on top of the tool?”

  • “The work is deterministic: fixed steps, no LLM judgment needed during execution” → a script (one call or many; a Python script hitting the public REST API gets you there efficiently). Promote to a tool (MCP) only when you need to cross network boundaries or compose into fully agentic workflows. Even then, an in-house script often beats the published MCP: as a customer of an API, you tend to know your own use cases better than the tools a publicly available MCP server advertises can anticipate.
  • “You want to trigger a specific workflow automation yourself, or have external automation trigger it” → a command (slash command, cron job, webhook).
  • “You want to orchestrate multiple script invocations while taking contextual details into account” → a skill (the LLM matches the prompt to the skill description and runs the workflow, composing the scripts, tools, and commands it needs).
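
As a sketch of the first bullet, here is what a deterministic "acknowledge the alert" script might look like using only the Python standard library. The base URL, the `/incidents/{id}` path, the payload, and the bearer-token header are placeholders I made up for illustration, not the real API of PagerDuty or any other provider; check your provider's documentation for the actual endpoint and auth scheme.

```python
import json
import urllib.request


def build_ack_request(base_url: str, incident_id: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request that acknowledges an incident."""
    # Hypothetical payload shape; a real API will define its own.
    payload = json.dumps({"status": "acknowledged"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/incidents/{incident_id}",
        data=payload,
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def acknowledge(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Demo: build the request only, so the example runs without network access.
request = build_ack_request("https://alerts.example.invalid/api", "INC-123", "dummy-token")
print(request.get_method(), request.full_url)
```

Fixed steps, no judgment: a skill can call a script like this as one step of its workflow without any MCP server in between.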

The dotted edges at the bottom of the first diagram are this idea in shorthand: skills compose, they don’t replace. If you find yourself adding domain-specific judgment around a tool’s calls, you’re not making the tool less of a tool — you’re defining a skill that composes tools, scripts, commands, and the automation around them.


Skill Refinement Loop

So once a skill is created, is it finished? Almost never. Skills are living instructions that evolve with usage. The framework below adapts to any set of agentic instructions — and arguably to software development in general: define a spec → test it → gather feedback → iterate. The diagram’s “Done” terminal means stable in the wild for now, not ship and walk away forever. For evolving workflows you’ll come back through MONITOR repeatedly; “Done” is more commonly the terminal state for atomic tools than for orchestrating skills.

---
config:
  layout: dagre
---
flowchart TD
    BEGIN(("Start"))
    BEGIN --> IDEA(["💡 Idea / Need<br/>Identify repeatable task<br/>or workflow to capture"])

    IDEA --> INTENT["📋 Capture Intent<br/>What should it do?<br/>When should it trigger?<br/>What's the expected output?"]

    INTENT --> DRAFT["✍️ Draft SKILL.md<br/>Write name, description,<br/>instructions, examples,<br/>bundled assets if needed"]

    DRAFT --> CATALOG["🗂️ Catalog Use Cases<br/>List the categories the skill<br/>must cover:<br/>setup (first-run, prerequisites),<br/>standard (the 80% case),<br/>edge (failure modes, odd inputs)"]

    CATALOG --> TEST["🧪 Write Test Prompts<br/>Craft one or more realistic<br/>prompts per cataloged case<br/>(complex enough to need the skill)"]

    TEST --> RUN["▶️ Run the LLM + Skill<br/>on Test Prompts<br/>Execute each test case;<br/>note where output<br/>hits or misses the mark"]

    %% Parallel evaluation: both human and quantitative checks read the
    %% same RUN output. Neither blocks the other; they merge at DECIDE.
    RUN --> HUMAN["👁️ Human Review<br/>Look at outputs qualitatively.<br/>Does it feel right?<br/>Is anything missing or off?"]
    RUN --> QUANT["📊 Quantitative Check<br/>(Optional but valuable)<br/>Measure match accuracy,<br/>format adherence,<br/>assertion pass rate"]

    HUMAN --> DECIDE{"Outputs at<br/>desired quality<br/>across all<br/>cataloged cases?"}
    QUANT --> DECIDE

    DECIDE -->|"No"| DIAGNOSE{"What's the<br/>failure mode?"}

    %% DIAGNOSE is a decision: route to whichever earlier node owns the fix.
    DIAGNOSE -->|"Triggering —<br/>the LLM didn't fire the skill"| DESCOPT
    DIAGNOSE -->|"Content —<br/>wrong output,<br/>missing steps or examples"| FIX["🔧 Revise SKILL.md<br/>Add examples, clarify steps,<br/>fix edge cases"]
    DIAGNOSE -->|"Scope —<br/>missing or wrong cases"| CATALOG

    FIX --> TEST

    %% DESCOPT is reached only via DIAGNOSE, which handles both pre-deploy
    %% failures (from DECIDE No) and post-deploy drift (from OBSERVED Yes).
    %% It updates the description, then loops back through TEST so the new
    %% description is verified against the same cataloged cases. No separate
    %% match-accuracy gate — DECIDE already catches a description that
    %% broke triggering.
    DESCOPT["🎯 Optimize Description<br/>Tune the description metadata<br/>(the LLM's matching signal)<br/>to fire reliably across<br/>varied phrasings"]

    DESCOPT --> TEST

    %% Quality passes → ship.
    DECIDE -->|"Yes — all cataloged<br/>cases meet the bar"| PACKAGE["📦 Package & Share<br/>Version the skill,<br/>commit to repo,<br/>share with team or install<br/>as .skill file"]

    PACKAGE --> MONITOR["🔁 Monitor in the Wild<br/>Gather real-usage feedback.<br/>Skills are living documents —<br/>collect edge cases<br/>for future iterations."]

    %% Binary gate first — if no failures, we're done. If failures observed,
    %% reuse the same DIAGNOSE node as the pre-deploy path: same failure-mode
    %% taxonomy, same routing to DESCOPT/FIX/CATALOG.
    MONITOR --> OBSERVED{"Are failures<br/>observed in the wild?"}
    OBSERVED -->|"No — stable"| DONE(("Done"))
    OBSERVED -->|Yes| DIAGNOSE

    %% ANNOTATION CALLOUTS
    DRAFT -.->|"Progressive disclosure:<br/>Metadata → SKILL.md body<br/>→ Bundled resources"| NOTE1(["📎 Keep SKILL.md<br/>under ~500 lines"])

    TEST -.->|"Avoid trivial prompts —<br/>the LLM handles simple tasks<br/>without consulting a skill"| NOTE2(["⚠️ Use substantive,<br/>multi-step test cases"])
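
The "Quantitative Check" step in the loop above can start very small. Below is a minimal Python sketch, assuming you can run a prompt and observe two things: whether the skill fired, and what the output was. `SkillCase`, `RunResult`, `score`, and the keyword-matching `fake_run` are all hypothetical; `fake_run` stands in for a real LLM call so the example is runnable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillCase:
    category: str                  # "setup", "standard", or "edge"
    prompt: str
    should_trigger: bool
    check: Callable[[str], bool]   # assertion on the output text


@dataclass
class RunResult:
    triggered: bool
    output: str


def score(cases: list[SkillCase], run: Callable[[str], RunResult]) -> dict[str, float]:
    """Compute trigger accuracy and assertion pass rate over the cataloged cases."""
    results = [(case, run(case.prompt)) for case in cases]
    trigger_hits = sum(r.triggered == c.should_trigger for c, r in results)
    expected = [(c, r) for c, r in results if c.should_trigger]
    passed = sum(r.triggered and c.check(r.output) for c, r in expected)
    return {
        "trigger_accuracy": trigger_hits / len(results),
        "assertion_pass_rate": passed / len(expected) if expected else 1.0,
    }


def fake_run(prompt: str) -> RunResult:
    """Stand-in for the LLM: fires only when the prompt literally says 'incident'."""
    triggered = "incident" in prompt.lower()
    return RunResult(triggered, "Timeline: alert fired" if triggered else "Sure.")


def has_timeline(output: str) -> bool:
    return output.startswith("Timeline:")


cases = [
    SkillCase("standard", "There is an incident in #alerts, can you triage it?", True, has_timeline),
    SkillCase("edge", "Prod looks down and the pager is going off, help!", True, has_timeline),
    SkillCase("standard", "Rename this variable for me", False, lambda out: True),
]

metrics = score(cases, fake_run)
print(metrics)
```

In this made-up run, the second prompt never fires the skill, which is the "Triggering" branch of the diagnosis step: the fix belongs in the description, not in the body.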

Closing Remarks

The four primitives (command, tool, skill, agent) aren’t a hierarchy you climb. Two analogies that might help: (1) they’re a set of tools you reach for based on which part of the house you’re building, and (2) each primitive is an instrument used by you, the designer, who knows when to use which one and when to compose them together. Pick based on (a) your need, (b) your comfort, and (c) the ownership boundary you’re working across. Over time, experiment, iterate, and evolve. When you do outgrow a primitive, a common graduation path is: start with a script, promote to a tool when the capability needs network access or crosses process or team boundaries and the script just isn’t cutting it, wrap in a command when you want an explicit trigger, lift into a skill when you want contextual invocation, and eventually compose multiple skills into an agent (or subagents).

You may be wondering whether you really need a tool or a command in many cases. I wonder the same thing. Please reach out if you have examples of when a tool or command is needed, other than a remote MCP for a service that has no public SDK.

The cheatsheet below is the quick lookup when you’re choosing the right instrument for the job.

Quick Reference Cheatsheet

| | Skill | Command / Script | Skill + Direct Script | MCP Tool (stdio) | MCP Tool (HTTP/SSE) | Agent / Subagent System |
|---|---|---|---|---|---|---|
| What it is | Markdown instruction set for an LLM | Shell / CLI invocation | Skill orchestrating scripts it owns (no MCP) | Live capability (local subprocess) | Live capability (remote service) | Orchestrator of multiple skills + judgment |
| Self-contained? | ✅ Yes | ⚠️ Depends on env | ⚠️ Depends on script env | ✅ Local subprocess | ❌ Remote service | ❌ No, sits on top of others |
| Best for solo dev? | ✅ First choice | ✅ For shell tasks | ✅ First choice when scripts are owned | ✅ Local capabilities | ⚠️ Only if needed | ⚠️ Overkill unless multi-domain |
| Best for teams? | ✅ Version + share | ✅ Wrap in CI/scripts | ⚠️ Works while scripts stay owned | ⚠️ Each user spawns own | ✅ Shared SaaS access | ✅ Yes, for cross-domain workflows |
| Iteration needed? | ✅ Yes, always | ➡️ Test and fix script | ✅ Yes, refine skill + scripts | ➡️ Test integration | ➡️ Test integration | ✅ Yes, coordination + skills |
| Portability | ✅ .skill file, any env | ⚠️ OS/env dependent | ⚠️ Skill portable, scripts env-dependent | ⚠️ Need binary on host | ⚠️ Host-portable; service-bound | ⚠️ Inherits from components |
| Composable? | ✅ Skill can call both | ➡️ Skill can describe when | ✅ Skill orchestrates the recipe | ➡️ Skill can orchestrate | ➡️ Skill can orchestrate | ✅ Composes multiple skills |