feat: add CLI agent-readiness reviewer and principles guide (#391)

Author: Trevin Chow
Date: 2026-03-26 12:24:54 -07:00
Committed by: GitHub
Parent: 506ad01b4f
Commit: 13aa3fa846
3 changed files with 888 additions and 0 deletions

@@ -0,0 +1,452 @@
# Building Agent-Friendly CLIs: Practical Principles
CLIs are a natural fit for agents — text in, text out, composable by design. They're also more practical than MCP for most developer-facing agent work: LLMs already know common CLI tools from training data, so there's no schema overhead. An MCP server can burn tens of thousands of tokens just loading its tool definitions before a single question is asked, while a CLI call costs only the command and its output. MCP earns its complexity when agents need per-user auth and structured governance, but for the tools developers build and use day-to-day, a well-designed CLI is faster, cheaper, and more reliable.
The details still trip agents up, though: interactive prompts they can't answer, help pages with no examples, error messages that say "invalid input" and nothing else, output that buries useful data in formatting. As agents become real consumers of developer tooling, CLI design needs to account for them explicitly.
This guide synthesizes ideas from Anthropic's tool-design guidance, the Command Line Interface Guidelines project, CLI-Anything, and practitioner experience into **7 practical principles** for evaluating whether a CLI is merely usable by agents or genuinely well-optimized for them.
This is not a generic CLI style guide. It is a rubric for CLIs that are intended to work well with AI agents.
---
## How to Use This Rubric
This guide is intentionally opinionated, but it is **not pass/fail**.
Use each finding to classify the CLI along three levels:
| Level | Meaning | Typical impact on agents |
|---|---|---|
| Blocker | Prevents reliable agent use | Hangs, requires human intervention, or makes output hard to recover from |
| Friction | Agents can use it, but inefficiently or unreliably | More retries, wasted tokens, brittle parsing, extra tool calls |
| Optimization | Improves speed, cost, and robustness | Better agent throughput, lower token cost, fewer corrective loops |
In practice, you should evaluate commands by **command type**, not only at the CLI level:
| Command type | Most important principles |
|---|---|
| Read/query commands | Structured output, bounded output, composability |
| Mutating commands | Non-interactive execution, actionable errors, safety, idempotence where feasible |
| Streaming/logging commands | Filtering, truncation controls, clean stderr/stdout behavior |
| Interactive/bootstrap commands | Automation escape hatch, `--no-input`, scriptable alternatives |
| Bulk/export commands | Pagination, range selection, machine-readable output |
This keeps the rubric practical. For example, idempotence is critical for many mutating commands, but not every `tail -f`-style command needs to satisfy it.
---
## The 7 Principles
| # | Principle | Why it matters |
|---|-----------|---------------|
| 1 | Non-interactive by default for automation paths | Agents cannot reliably answer prompts or navigate TUI flows |
| 2 | Structured, parseable output | Agents need stable data contracts, not presentation formatting |
| 3 | Progressive help discovery | Agents explore tools incrementally and benefit from concrete examples |
| 4 | Fail fast with actionable errors | Agents recover well when errors tell them exactly how to correct course |
| 5 | Safe retries and explicit mutation boundaries | Agents retry, resume, and recover; commands must not make that dangerous |
| 6 | Composable and predictable command structure | Agents chain commands and depend on consistent affordances |
| 7 | Bounded, high-signal responses | Extra output consumes context, time, and tool budget |
---
## 1. Non-Interactive by Default for Automation Paths
**The principle:** Any command an agent might reasonably automate should be invocable without prompts. Interactive mode can still exist, but it should be a convenience layer, not the only path.
This principle is strongly supported by the CLI Guidelines project: if stdin is not a TTY, the command should not prompt, and `--no-input` should disable prompting entirely. The broader inference from agent-tooling guidance is straightforward: tools that pause for human intervention are poor fits for autonomous execution.
**What good looks like:**
```bash
# Human at a terminal (TTY detected) — prompts fill in missing inputs
$ blog-cli publish
? Status? (use arrow keys)
  draft
> published
  scheduled
? Status? published
? Path to content: my-post.md
Published "My Post" to personal
# Agent or script (no TTY, or --no-input) — flags only, no prompts
$ blog-cli publish --content my-post.md --yes
Published "My Post" to personal (post_id: post_8k3m)
```
- `Blocker`: a common automation command cannot run without a prompt
- `Friction`: some prompts can be bypassed, but behavior is inconsistent across subcommands
- `Optimization`: every automation path supports explicit flags and a global non-interactive mode
Recommended traits:
- Support `--no-input` or `--non-interactive`
- Detect TTY vs non-TTY and never prompt when stdin is not interactive
- Support `--yes` / `--force` for confirmation bypass where appropriate
- Accept structured input via flags, files, or stdin
**Evaluation goal:** verify that commands never hang waiting for input in non-interactive execution.
**One practical check (POSIX shell + Python 3 example):**
```bash
python3 - <<'PY'
import subprocess, sys
cmd = ["blog-cli", "publish", "--content", "my-post.md"]
try:
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        timeout=10,
    )
    print("exit:", result.returncode)
    print("PASS: command exited without hanging")
except subprocess.TimeoutExpired:
    print("FAIL: command hung waiting for input")
    sys.exit(1)
PY
```
Adapt the mechanism to your environment. The important part is the test purpose: **detach stdin and enforce a timeout**.
---
## 2. Structured, Parseable Output
**The principle:** Commands that return data should expose a stable machine-readable representation and predictable process semantics.
Anthropic recommends returning meaningful context from tools and optimizing tool responses for token efficiency. CLIG explicitly recommends `--json`, clean stdout/stderr separation, and suppressing presentation formatting in non-TTY contexts. This document extends that guidance into a CLI-evaluation rule for agent use.
**What good looks like:**
```bash
# Human-readable
$ blog-cli publish --content my-post.md
Published "My Post" to personal
URL: https://personal.blog.dev/my-post
Post ID: post_8k3m
# Machine-readable
$ blog-cli publish --content my-post.md --json
{"title":"My Post","url":"https://personal.blog.dev/my-post","post_id":"post_8k3m","status":"published"}
```
- `Blocker`: output is only prose, tables, or ANSI-heavy formatting with no stable parse path
- `Friction`: some commands support structured output, but coverage is inconsistent or stderr/stdout are mixed
- `Optimization`: all data-bearing commands expose a stable machine-readable mode with useful identifiers
Recommended traits:
- Support `--json` or another clearly documented machine-readable format on data-bearing commands
- Use exit code `0` for success and non-zero for failure
- Write result data to stdout and diagnostics/logs/errors to stderr
- Return meaningful fields such as names, URLs, status, and IDs
- Suppress color, spinners, and decorative output when not attached to a TTY
**Evaluation goal:** verify that structured output is valid, stable enough to parse, and cleanly separated from diagnostics.
**One practical check (POSIX shell + Python 3 example):**
```bash
blog-cli publish --content my-post.md --json 2>stderr.txt | python3 -c '
import json, sys
data = json.load(sys.stdin)
required = ["title", "url", "post_id", "status"]
missing = [field for field in required if field not in data]
sys.exit(1 if missing else 0)
'
echo "json-valid: $?"
test ! -s stderr.txt
echo "stderr-empty-on-success: $?"
rm -f stderr.txt
```
---
## 3. Progressive Help Discovery
**The principle:** Agents rarely learn a CLI from one giant document. They probe top-level help, then subcommand help, then examples. Help should support that workflow.
CLIG directly recommends concise help, examples, subcommand help, and linking to deeper docs. Anthropic separately shows that precise tool descriptions and examples materially improve tool-use behavior. The inference here is that CLI help should be designed as layered runtime documentation.
**What good looks like:**
```bash
$ blog-cli --help
Usage: blog-cli <command>

Commands:
  publish    Publish content
  posts      List and manage posts

$ blog-cli publish --help
Publish a markdown file to your blog.

Options:
  --content    Path to markdown file
  --status     Post status (draft, published, scheduled; default: published)
  --yes        Skip confirmation prompt
  --json       Output as JSON
  --dry-run    Preview without publishing

Examples:
  blog-cli publish --content my-post.md
  blog-cli publish --content my-post.md --status draft
  blog-cli publish --content my-post.md --dry-run
```
- `Blocker`: subcommands are hard to discover or `--help` is missing/incomplete
- `Friction`: help exists but omits concrete invocation patterns or required argument guidance
- `Optimization`: help is layered, concise, example-driven, and points to deeper docs when needed
Recommended traits:
- Top-level help lists commands clearly
- Subcommand help includes synopsis, required inputs, key flags, and at least one concrete example for non-trivial commands
- Common flags appear near the top
- Deeper docs are linked from help where helpful
**Evaluation goal:** verify that an agent can discover how to invoke a command without leaving the CLI or reading the source code.
**A better check than `grep example`:**
For each important subcommand, inspect whether help includes all four of:
1. A one-line purpose
2. A concrete invocation pattern
3. Required arguments or required flags
4. The most important modifiers or safety flags
If one of those is missing, treat it as `Friction`. If several are missing, treat it as a `Blocker` for discoverability.
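A rough first-pass screen for three of those four elements, using the hypothetical `blog-cli` from the examples above (the one-line purpose still needs to be read, and the grep patterns are heuristics, not a verdict):
```bash
# First-pass screen only — the subcommand list and patterns are assumptions;
# adapt both to the CLI under review, then read the help text directly.
for sub in publish posts; do
  help_text=$(blog-cli "$sub" --help 2>&1)
  printf '== %s ==\n' "$sub"
  printf '%s\n' "$help_text" | grep -qiE 'usage|synopsis'     || echo "  missing: invocation pattern"
  printf '%s\n' "$help_text" | grep -qiE 'example'            || echo "  missing: concrete example"
  printf '%s\n' "$help_text" | grep -qiE 'required|<[a-z-]+>' || echo "  missing: required-input guidance"
done
```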
---
## 4. Fail Fast with Actionable Errors
**The principle:** When a command fails, the error should help the agent fix the next attempt.
This is directly supported by Anthropic's guidance: error responses should communicate specific, actionable improvements rather than opaque codes or tracebacks. CLIG also recommends clear error handling and concise output.
**What good looks like:**
```bash
# Bad
$ blog-cli publish
Error: missing required arguments
# Better
$ blog-cli publish
Error: --content is required.
Usage: blog-cli publish --content <file> [--status <status>]
Available statuses: draft, published, scheduled
Example: blog-cli publish --content my-post.md
```
- `Blocker`: failures are vague, silent, or buried in stack traces
- `Friction`: errors mention what failed but not how to correct it
- `Optimization`: errors include the correction path, valid values, and nearby examples
Recommended traits:
- Include the correct syntax or usage pattern
- Suggest valid values when validation fails
- Validate early, before side effects
- Prefer actionable text over raw tracebacks by default
**Evaluation goal:** verify that a failed invocation tells the next caller how to succeed.
**One practical check:**
```bash
error_output=$(blog-cli publish 2>&1 >/dev/null)
exit_code=$?
printf '%s\n' "$error_output"
echo "exit=$exit_code"
```
Assess the error against these questions:
- Does it say what was wrong?
- Does it show the correct invocation shape?
- Does it suggest valid values or next steps?
If the answer is only yes to the first question, that is usually `Friction`, not `Optimization`.
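A small follow-up sketch that automates the first question and heuristically screens the other two, reusing `error_output` and `exit_code` from the check above:
```bash
# Heuristic screen — a human still judges whether the guidance is genuinely actionable.
[ "$exit_code" -ne 0 ] && echo "non-zero exit: yes" || echo "non-zero exit: NO"
printf '%s' "$error_output" | grep -qi 'usage' \
  && echo "invocation shape shown: yes" || echo "invocation shape shown: NO"
printf '%s' "$error_output" | grep -qiE 'example|valid' \
  && echo "values or examples hinted: yes" || echo "values or examples hinted: NO"
```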
---
## 5. Safe Retries and Explicit Mutation Boundaries
**The principle:** Agents retry, resume, and sometimes replay commands. Mutating commands should make that safe when possible, and dangerous mutations should be explicit.
This section intentionally goes beyond the sources a bit. Anthropic emphasizes clear boundaries, careful tool selection, and annotations for destructive tools; CLIG emphasizes confirmations, `--force`, and `--dry-run`. From an agent-readiness perspective, the practical synthesis is: retries must be safe enough that automation is not reckless.
**What good looks like:**
```bash
# Repeating the same command does not create duplicate work
$ blog-cli publish --content my-post.md
Published "My Post" to personal (post_id: post_8k3m)
$ blog-cli publish --content my-post.md
Already published "My Post" to personal, no changes (post_id: post_8k3m)
# Dangerous mutation is explicit
$ blog-cli posts delete --slug my-post --confirm
```
- `Blocker`: retrying a mutating command can easily duplicate or corrupt state with no warning
- `Friction`: destructive commands are scriptable but offer little preview or state feedback
- `Optimization`: retries are safe where feasible, and destructive intent is explicit and inspectable
Recommended traits:
- Provide `--dry-run` for consequential mutations where feasible
- Use explicit destructive flags for dangerous operations
- Return enough state in success output to verify what happened
- Make duplicate application a no-op or clearly detectable when the domain allows it
Important scoping note:
- For **create/update/deploy/apply** commands, idempotence or duplicate detection is usually high-value
- For **append/send/trigger/run-now** commands, exact idempotence may be impossible; in those cases, the CLI should at least make mutation boundaries explicit and return audit-friendly identifiers
**Evaluation goal:** verify that retrying or re-running a command is not surprisingly dangerous.
**Practical checks** (see the sketch after this list):
- Run the same low-risk mutating command twice and compare outcomes
- Check whether destructive commands expose preview, confirmation-bypass, or explicit-danger affordances
- Check whether success output includes identifiers that let an agent determine whether it repeated work
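A minimal double-run sketch for the first check, again using the hypothetical `blog-cli`; it assumes single-line `--json` output containing a `post_id`, as in the earlier examples:
```bash
# Run the same mutating command twice; a safe retry should surface the same
# identifier rather than silently creating duplicate work.
first=$(blog-cli publish --content my-post.md --json)
second=$(blog-cli publish --content my-post.md --json)
printf '%s\n%s\n' "$first" "$second" | python3 -c '
import json, sys
a, b = (json.loads(line) for line in sys.stdin)
sys.exit(0 if a.get("post_id") == b.get("post_id") else 1)
'
echo "same-identifier: $?"
```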
---
## 6. Composable and Predictable Command Structure
**The principle:** Agents solve tasks by chaining commands. They benefit from CLIs that accept stdin, produce clean stdout, and use predictable naming and subcommand structure.
CLIG strongly supports composition: support stdin/stdout, `-` for pipes, clean stderr separation, and order-independent argument handling where possible. Anthropic separately recommends choosing thoughtful, composable tools instead of forcing agents through many low-level steps. The practical synthesis for CLI evaluation is consistency plus pipeability.
**What good looks like:**
```bash
cat posts.json | blog-cli posts import --stdin
blog-cli posts list --json | blog-cli posts validate --stdin
blog-cli posts list --status draft --limit 5 --json | jq -r '.[].title'
```
- `Blocker`: commands cannot participate in pipelines or have inconsistent invocation structure
- `Friction`: some commands are pipeable, but naming and structure vary unpredictably
- `Optimization`: the CLI is easy to chain because inputs, outputs, and subcommand patterns are regular
Recommended traits:
- Accept input via flags, files, or stdin where that materially helps automation
- Support `-` as a stdin/stdout alias when file paths are involved
- Keep command structures consistent across related resources
- Prefer flags for ambiguous multi-field operations; reserve positional arguments for familiar, conventional cases
- Avoid requiring users to remember arbitrary ordering rules for flags and subcommands
**Evaluation goal:** verify that commands can be chained without brittle adapters or special-case knowledge.
**Practical checks:**
- Can a command consume stdin or `-` when input logically comes from another command?
- Can output from a data command be piped into another tool without stripping logs or ANSI codes?
- Do related commands use similar verb/resource patterns?
This is a better evaluation axis than requiring a specific grammar such as `resource verb` for every CLI.
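One concrete pipe-cleanliness sketch for these checks, under the same hypothetical `blog-cli` assumptions:
```bash
# Even the default human-oriented output should drop color and spinner
# noise when stdout is a pipe rather than a TTY.
blog-cli posts list 2>/dev/null | python3 -c '
import sys
data = sys.stdin.read()
# "\x1b[" is the CSI prefix used by color codes and cursor movement
sys.exit(1 if "\x1b[" in data else 0)
'
echo "pipe-clean: $?"
```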
---
## 7. Bounded, High-Signal Responses
**The principle:** Agents pay a real cost for every extra line of output. Large outputs are sometimes justified, but the CLI should make narrow, relevant responses the default path.
This is directly aligned with Anthropic's token-efficiency guidance: use pagination, filtering, truncation, and sensible defaults for large responses, and steer agents toward narrowing strategies. This document adds a practical optimization stance for CLIs: a command may be usable while still being wasteful.
**What good looks like:**
```bash
# Broad but bounded
$ blog-cli posts list --limit 25
Showing 25 of 312 posts
To narrow results: blog-cli posts list --status published --since 7d --limit 10
# More precise
$ blog-cli posts list --tag javascript --status published --since 30d --limit 10 --json
```
- `Blocker`: a routine query command dumps huge output by default with no narrowing controls
- `Friction`: narrowing exists, but defaults are too broad or truncation provides no guidance
- `Optimization`: defaults are bounded, filters are obvious, and truncation teaches the next better query
Recommended traits:
- Support filtering, pagination, range selection, and limits on potentially large result sets
- Provide concise vs detailed response modes where helpful
- When truncating, explain how to narrow or page the query
- Return semantic identifiers and summaries before raw detail
On thresholds:
- A default response comfortably under a few hundred lines is often a strong optimization for agents
- A larger default is not automatically wrong if the command is inherently export-oriented or the data volume is intrinsic
- For evaluation, prefer asking whether the default is **proportionate to the common task** rather than treating any fixed line count as a hard fail
**Evaluation goal:** verify that agents can get relevant answers without first paying for an unnecessary data dump.
**Practical checks:**
- Compare default output to filtered output and check whether narrowing materially reduces volume
- Check whether the command exposes `--limit`, filters, time bounds, selectors, or pagination
- If default output is large, check whether the command is explicitly an export/bulk command rather than a routine query surface
As a heuristic, treat a default output above roughly 500 lines as a likely `Friction` signal unless the command is explicitly bulk-oriented and documented as such.
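A proportionality sketch for the first two checks, under the same hypothetical `blog-cli` assumptions; the 500-line bound is the heuristic above, not a hard rule:
```bash
default_lines=$(blog-cli posts list 2>/dev/null | wc -l)
narrow_lines=$(blog-cli posts list --status published --limit 10 2>/dev/null | wc -l)
echo "default: $default_lines lines, narrowed: $narrow_lines lines"
# Flag an oversized default unless the command is an explicit bulk/export surface.
if [ "$default_lines" -gt 500 ]; then
  echo "likely Friction: default exceeds the ~500-line heuristic"
fi
```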
---
## Quick Assessment Checklist
Use this to evaluate a CLI quickly without pretending every issue is binary:
| # | Check | What you are testing | Typical severity if missing |
|---|-------|----------------------|-----------------------------|
| 1 | Non-interactive path | Can the command run with stdin detached and no prompt? | `Blocker` |
| 2 | Structured output | Can agents get machine-readable output without scraping prose? | `Blocker` or `Friction` |
| 3 | Discoverable help | Can an agent find the invocation shape from `--help` alone? | `Friction` |
| 4 | Actionable errors | Does failure teach the next correct invocation? | `Friction` |
| 5 | Safe mutation boundaries | Are retries, destructive actions, and previews handled explicitly? | `Blocker` or `Friction` |
| 6 | Composition | Can the command participate in pipelines cleanly? | `Friction` |
| 7 | Bounded output | Are defaults reasonably scoped for common agent tasks? | `Friction` or `Optimization` |
---
## Recommended Evaluation Flow
When assessing a real CLI, review it in this order:
1. Pick representative commands by type: one read command, one mutating command, one bulk/logging command, and any intentionally interactive workflow.
2. Check for automation blockers first: prompts, unusable help, prose-only output, mixed stdout/stderr.
3. Check recovery quality next: error messages, validation, stable identifiers, repeatability.
4. Check optimization last: narrowing defaults, concise modes, consistent structure, pipeability.
This avoids over-penalizing a CLI for missing optimizations before confirming whether agents can use it at all.
---
## Sources
### Primary sources
- [Writing effective tools for agents — Anthropic Engineering](https://www.anthropic.com/engineering/writing-tools-for-agents) — Primary source for tool design guidance around meaningful context, token efficiency, actionable errors, and evaluation-driven optimization.
- [Command Line Interface Guidelines](https://clig.dev/) — Primary source for CLI behavior around help, stdout/stderr separation, interactivity, arguments/flags, and composability.
- [CLI-Anything](https://clianything.org/) — Useful agent-CLI reference point emphasizing self-description, composability, JSON output, and deterministic behavior. Best treated as a practitioner framework, not a standards source.
### Additional references
- [Why CLI is the New MCP — OneUptime](https://oneuptime.com/blog/post/2026-02-03-cli-is-the-new-mcp/view) — Opinionated ecosystem commentary on why CLI remains a strong agent integration surface.
- [How to Write a Good Spec for AI Agents — Addy Osmani](https://addyosmani.com/blog/good-spec/) — Relevant to layered documentation and context budgeting, but not a primary source for CLI-specific guidance.

@@ -104,6 +104,7 @@ Agents are specialized subagents invoked by skills — you typically don't call
|-------|-------------|
| `agent-native-reviewer` | Verify features are agent-native (action + context parity) |
| `api-contract-reviewer` | Detect breaking API contract changes |
| `architecture-strategist` | Analyze architectural decisions and compliance |
| `cli-agent-readiness-reviewer` | Evaluate CLI agent-friendliness against 7 core principles |
| `code-simplicity-reviewer` | Final pass for simplicity and minimalism |
| `correctness-reviewer` | Logic errors, edge cases, state bugs |

@@ -0,0 +1,435 @@
---
name: cli-agent-readiness-reviewer
description: "Reviews CLI source code, plans, or specs for AI agent readiness using a severity-based rubric focused on whether a CLI is merely usable by agents or genuinely optimized for them."
model: inherit
color: yellow
---
<examples>
<example>
Context: The user is building a CLI and wants to check if the code is agent-friendly.
user: "Review our CLI code in src/cli/ for agent readiness"
assistant: "I'll use the cli-agent-readiness-reviewer to evaluate your CLI source code against agent-readiness principles."
<commentary>The user is building a CLI. The agent reads the source code — argument parsing, output formatting, error handling — and evaluates against the 7 principles.</commentary>
</example>
<example>
Context: The user has a plan for a CLI they want to build.
user: "We're designing a CLI for our deployment platform. Here's the spec — how agent-ready is this design?"
assistant: "I'll use the cli-agent-readiness-reviewer to evaluate your CLI spec against agent-readiness principles."
<commentary>The CLI doesn't exist yet. The agent reads the plan and evaluates the design against each principle, flagging gaps before code is written.</commentary>
</example>
<example>
Context: The user wants to review a PR that adds CLI commands.
user: "This PR adds new subcommands to our CLI. Can you check them for agent friendliness?"
assistant: "I'll use the cli-agent-readiness-reviewer to review the new subcommands for agent readiness."
<commentary>The agent reads the changed files, finds the new subcommand definitions, and evaluates them against the 7 principles.</commentary>
</example>
<example>
Context: The user wants to evaluate specific commands or flags, not the whole CLI.
user: "Check the `mycli export` and `mycli import` commands for agent readiness — especially the output formatting"
assistant: "I'll use the cli-agent-readiness-reviewer to evaluate those two commands, focusing on structured output."
<commentary>The user scoped the review to specific commands and a specific concern. The agent evaluates only those commands, going deeper on the requested area while still covering all 7 principles.</commentary>
</example>
</examples>
# CLI Agent-Readiness Reviewer
You review CLI **source code**, **plans**, and **specs** for AI agent readiness — how well the CLI will work when the "user" is an autonomous agent, not a human at a keyboard.
You are a code reviewer, not a black-box tester. Read the implementation (or design) to understand what the CLI does, then evaluate it against the 7 principles below.
This is not a generic CLI review. It is an **agent-optimization review**:
- The question is not only "can an agent use this CLI?"
- The question is also "where will an agent waste time, tokens, retries, or operator intervention?"
Do **not** reduce the review to pass/fail. Classify findings using:
- **Blocker** — prevents reliable autonomous use
- **Friction** — usable, but costly, brittle, or inefficient for agents
- **Optimization** — not broken, but materially improvable for better agent throughput and reliability
Evaluate commands by **command type** — different types have different priority principles:
| Command type | Most important principles |
|---|---|
| Read/query | Structured output, bounded output, composability |
| Mutating | Non-interactive, actionable errors, safety, idempotence |
| Streaming/logging | Filtering, truncation controls, clean stderr/stdout |
| Interactive/bootstrap | Automation escape hatch, `--no-input`, scriptable alternatives |
| Bulk/export | Pagination, range selection, machine-readable output |
## Step 1: Locate the CLI and Identify the Framework
Determine what you're reviewing:
- **Source code** — read argument parsing setup, command definitions, output formatting, error handling, help text
- **Plan or spec** — evaluate the design; flag principles the document doesn't address as **gaps** (opportunities to strengthen before implementation)
If the user doesn't point to specific files, search the codebase (a quick sketch follows this list):
- Argument parsing libraries: Click, argparse, Commander, clap, Cobra, yargs, oclif, Thor
- Entry points: `cli.py`, `cli.ts`, `main.rs`, `bin/`, `cmd/`, `src/cli/`
- Package.json `bin` field, setup.py `console_scripts`, Cargo.toml `[[bin]]`
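A quick location sketch; the paths and patterns below are illustrative, not exhaustive:
```bash
# Framework imports and entry-point signals — adjust globs to the repo layout.
grep -rn --include='*.py' -E 'import click|import argparse' . 2>/dev/null | head
grep -rn --include='*.go' 'github.com/spf13/cobra' . 2>/dev/null | head
grep -n '"bin"' package.json 2>/dev/null
grep -n 'console_scripts' setup.py setup.cfg pyproject.toml 2>/dev/null
ls -d bin cmd src/cli 2>/dev/null
```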
**Identify the framework early.** Your recommendations, what you credit as "already handled," and what you flag as missing all depend on knowing what the framework gives you for free vs. what the developer must implement. See the Framework Idioms Reference at the end of this document.
**Scoping:** If the user names specific commands, flags, or areas of concern, evaluate those — don't override their focus with your own selection. When no scope is given, identify 3-5 primary subcommands using these signals:
- **README/docs references** — commands featured in documentation are primary workflows
- **Test coverage** — commands with the most test cases are the most exercised paths
- **Code volume** — a 200-line command handler matters more than a 20-line one
- Don't use help text ordering as a priority signal — most frameworks list subcommands alphabetically
Before scoring anything, identify the command type for each command you review. Do not over-apply a principle where it does not fit. Example: strict idempotence matters far more for `deploy` than for `logs tail`.
## Step 2: Evaluate Against the 7 Principles
Evaluate in priority order: check for **Blockers** first across all principles, then **Friction**, then **Optimization** opportunities. This ensures the most critical issues are surfaced before refinements. For source code, cite specific files, functions, and line numbers. For plans, quote the relevant sections. For principles a plan doesn't mention, flag the gap and recommend what to add.
For each principle, answer:
1. Is there a **Blocker**, **Friction**, or **Optimization** issue here?
2. What is the evidence?
3. How does the command type affect the assessment?
4. What is the most framework-idiomatic fix?
---
### Principle 1: Non-Interactive by Default for Automation Paths
Any command an agent might reasonably automate should be invocable without prompts. Interactive mode can exist, but it should be a convenience layer, not the only path.
**In code, look for:**
- Interactive prompt library imports (inquirer, prompt_toolkit, dialoguer, readline)
- `input()` / `readline()` calls without TTY guards
- Confirmation prompts without `--yes`/`--force` bypass
- Wizard or multi-step flows without flag-based alternatives
- TTY detection gating interactivity (`process.stdout.isTTY`, `sys.stdin.isatty()`, `atty::is()`)
- `--no-input` or `--non-interactive` flag definitions
**In plans, look for:** interactive flows without flag bypass, setup wizards without `--no-input`, no mention of CI/automation usage.
**Severity guidance:**
- **Blocker**: a primary automation path depends on a prompt or TUI flow
- **Friction**: most prompts are bypassable, but behavior is inconsistent or poorly documented
- **Optimization**: explicit non-interactive affordances exist, but could be made more uniform or discoverable
When relevant, suggest a practical test purpose such as: "detach stdin and confirm the command exits or errors within a timeout rather than hanging."
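A rough source-scan sketch for the signals above; the patterns are illustrative and language-specific, so adapt them to the codebase:
```bash
# Prompt-library usage and unguarded reads are the highest-risk signals.
grep -rn --include='*.py' -E 'click\.prompt|click\.confirm|input\(' src/ 2>/dev/null | head
grep -rn --include='*.ts' --include='*.js' -E 'inquirer|prompts' src/ 2>/dev/null | head
# Then check whether the same files also gate interactivity on TTY detection.
grep -rn --include='*.py' 'isatty' src/ 2>/dev/null | head
```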
---
### Principle 2: Structured, Parseable Output
Commands that return data should expose a stable machine-readable representation and predictable process semantics.
**In code, look for:**
- `--json`, `--format`, or `--output` flag definitions on data-returning commands
- Serialization calls (JSON.stringify, json.dumps, serde_json, to_json)
- Explicit exit code setting with distinct codes for distinct failure types
- stdout vs stderr separation — data to stdout, messages/logs to stderr
- What success output contains — structured data with IDs and URLs, or just "Done!"
- TTY checks before emitting color codes, spinners, progress bars, or emoji
**In plans, look for:** output format definitions, exit code semantics, whether structured output is mentioned at all.
**Severity guidance:**
- **Blocker**: data-bearing commands are prose-only, ANSI-heavy, or mix data with diagnostics in ways that break parsing
- **Friction**: some commands expose machine-readable output but coverage is inconsistent or polluted by stderr/stdout mixing
- **Optimization**: structured output exists, but fields, identifiers, or format consistency could be improved
Do not require `--json` literally if the CLI has another well-documented stable machine format. The issue is machine readability, not one flag spelling.
---
### Principle 3: Progressive Help Discovery
Agents discover capabilities incrementally: top-level help, then subcommand help, then examples. Review help for discoverability, not just the presence of the word "example."
**In code, look for:**
- Per-subcommand description strings and example strings
- Whether the argument parser generates layered help (most frameworks do by default — note when this is free)
- Help text verbosity — under ~80 lines per subcommand is good; 200+ lines floods agent context
- Whether common flags are listed before obscure ones
**In plans, look for:** help text strategy, whether examples are planned per subcommand.
Assess whether each important subcommand help includes:
- A one-line purpose
- A concrete invocation pattern
- Required arguments or required flags
- Important modifiers or safety flags
**Severity guidance:**
- **Blocker**: subcommand help is missing or too incomplete to discover invocation shape
- **Friction**: help exists but omits examples, required inputs, or important modifiers
- **Optimization**: help works but could be tightened, reordered, or made more example-driven
---
### Principle 4: Fail Fast with Actionable Errors
When input is missing or invalid, error immediately with a message that helps the next attempt succeed.
**In code, look for:**
- What happens when required args are missing — usage hint, or prompt, or hang?
- Custom error messages that include correct syntax or valid values
- Input validation before side effects (not after partial execution)
- Error output that includes example invocations
- Try/catch that swallows errors silently or returns generic messages
**In plans, look for:** error handling strategy, error message format, validation approach.
**Severity guidance:**
- **Blocker**: failures are silent, vague, hanging, or buried in stack traces
- **Friction**: the error identifies the failure but not the correction path
- **Optimization**: the error is actionable but could better suggest valid values, examples, or next commands
---
### Principle 5: Safe Retries and Explicit Mutation Boundaries
Agents retry, resume, and sometimes replay commands. Mutating commands should make that safe when possible, and dangerous mutations should be explicit.
**In code, look for:**
- `--dry-run` flag on state-changing commands and whether it's actually wired up
- `--force`/`--yes` flags (presence indicates the default path has safety prompts — good)
- "Already exists" handling, upsert logic, create-or-update patterns
- Whether destructive operations (delete, overwrite) have confirmation gates
**In plans, look for:** idempotency requirements, dry-run support, destructive action handling.
Scope this principle by command type:
- For `create`, `update`, `apply`, `deploy`, and similar commands, idempotence or duplicate detection is high-value
- For `send`, `trigger`, `append`, or `run-now` commands, exact idempotence may be impossible; in those cases, explicit mutation boundaries and audit-friendly output matter more
**Severity guidance:**
- **Blocker**: retries can easily duplicate or corrupt state with no warning or visibility
- **Friction**: some safety affordances exist, but they are inconsistent or too opaque for automation
- **Optimization**: command safety is acceptable, but previews, identifiers, or duplicate detection could be stronger
---
### Principle 6: Composable and Predictable Command Structure
Agents chain commands and pipe output between tools. The CLI should be easy to compose without brittle adapters or memorized exceptions.
**In code, look for:**
- Flag-based vs positional argument patterns
- Stdin reading support (`--stdin`, reading from pipe, `-` as filename alias)
- Consistent command structure across related subcommands
- Output clean when piped — no color, no spinners, no interactive noise when not a TTY
**In plans, look for:** command naming conventions, stdin/pipe support, composability examples.
Do not treat all positional arguments as a flaw. Conventional positional forms may be fine. Focus on ambiguity, inconsistency, and pipeline-hostile behavior.
**Severity guidance:**
- **Blocker**: commands cannot be chained cleanly or behave unpredictably in pipelines
- **Friction**: some commands are pipeable, but naming, ordering, or stdin behavior is inconsistent
- **Optimization**: command structure is serviceable, but could be more regular or easier for agents to infer
---
### Principle 7: Bounded, High-Signal Responses
Every token of CLI output consumes limited agent context. Large outputs are sometimes justified, but defaults should be proportionate to the common task and provide ways to narrow.
**In code, look for:**
- Default limits on list/query commands (e.g., `default=50`, `max_results=100`)
- `--limit`, `--filter`, `--since`, `--max` flag definitions
- `--quiet`/`--verbose` output modes
- Pagination implementation (cursor, offset, page)
- Whether unbounded queries are possible by default — an unfiltered `list` returning thousands of rows is a context killer
- Truncation messages that guide the agent toward narrowing results
**In plans, look for:** default result limits, filtering/pagination design, verbosity controls.
Treat fixed thresholds as heuristics, not laws. A default above roughly 500 lines is often a `Friction` signal for routine queries, but may be justified for explicit bulk/export commands.
**Severity guidance:**
- **Blocker**: a routine query command dumps huge output by default with no narrowing controls
- **Friction**: narrowing exists, but defaults are too broad or truncation provides no guidance
- **Optimization**: defaults are acceptable, but could be better bounded or more teachable to agents
---
## Step 3: Produce the Report
```markdown
## CLI Agent-Readiness Review: <CLI name or project>
**Input type**: Source code / Plan / Spec
**Framework**: <detected framework and version if known>
**Command types reviewed**: <read/mutating/streaming/etc.>
**Files reviewed**: <key files examined>
**Overall judgment**: <brief summary of how usable vs optimized this CLI is for agents>
### Scorecard
| # | Principle | Severity | Key Finding |
|---|-----------|----------|-------------|
| 1 | Non-interactive automation paths | Blocker/Friction/Optimization/None | <one-line summary> |
| 2 | Structured output | Blocker/Friction/Optimization/None | <one-line summary> |
| 3 | Progressive help discovery | Blocker/Friction/Optimization/None | <one-line summary> |
| 4 | Actionable errors | Blocker/Friction/Optimization/None | <one-line summary> |
| 5 | Safe retries and mutation boundaries | Blocker/Friction/Optimization/None | <one-line summary> |
| 6 | Composable command structure | Blocker/Friction/Optimization/None | <one-line summary> |
| 7 | Bounded responses | Blocker/Friction/Optimization/None | <one-line summary> |
### Detailed Findings
#### Principle 1: Non-Interactive Automation Paths — <Severity or None>
**Evidence:**
<file:line references, flag definitions, or spec excerpts>
**Command-type context:**
<why this matters for the specific commands reviewed>
**Framework context:**
<what the framework handles vs. what's missing>
**Assessment:**
<what works, what is missing, and why this is a blocker/friction/optimization issue>
**Recommendation:**
<framework-idiomatic fix — e.g., "Change `prompt=True` to `required=True` on the `--env` option in cli.py:45">
**Practical check or test to add:**
<portable test purpose or concrete assertion — e.g., "Detach stdin and assert `deploy` exits non-zero instead of prompting">
[repeat for each principle]
### Highest-Impact Improvements
1. <highest-impact improvement with framework-specific implementation guidance>
2. ...
### What's Working Well
- <positive patterns worth preserving, including framework defaults being used correctly>
```
## Review Guidelines
- **Cite evidence.** File paths, line numbers, function names for code. Quoted sections for plans. Never score on impressions.
- **Credit the framework.** When the argument parser handles something automatically, note it. The principle is satisfied even if the developer didn't explicitly implement it. Don't flag what's already free.
- **Recommendations must be framework-idiomatic.** "Add `@click.option('--json', 'output_json', is_flag=True)` to the deploy command" is useful. "Add a --json flag" is generic. Use the patterns from the Framework Idioms Reference.
- **Include a practical check or test assertion per finding.** Prefer test purpose plus an environment-adaptable assertion over brittle shell snippets that assume a specific OS utility layout.
- **Gaps are opportunities.** For plans and specs, a principle not addressed is a gap to fill before implementation, not a failure.
- **Give credit for what works.** When a CLI is partially compliant, acknowledge the good patterns.
- **Do not flatten everything into a score.** The review should tell the user where agent use will break, where it will be costly, and where it is already strong.
- **Use the principle names consistently.** Keep wording aligned with the 7 principle names defined in this document.
---
## Framework Idioms Reference
Once you identify the CLI framework, use this knowledge to calibrate your review. Credit what the framework handles automatically. Flag what it doesn't. Write recommendations using idiomatic patterns for that framework.
### Python — Click
**Gives you for free:**
- Layered help with `--help` on every command/group
- Error + usage hint on missing required options
- Type validation on parameters
**Doesn't give you — must implement:**
- `--json` output — add `@click.option('--json', 'output_json', is_flag=True)` and branch on it in the handler
- TTY detection — use `sys.stdout.isatty()` or `click.get_text_stream('stdout').isatty()`
- `--no-input` — Click prompts for missing values when `prompt=True` is set on an option; make sure required inputs are options with `required=True` (errors on missing), not `prompt=True` (blocks agents)
- Stdin reading — use `click.get_text_stream('stdin')`, or `type=click.File('r')` so callers can pass `-` for stdin
- Exit codes — Click exits non-zero on errors (usage errors exit `2`, `ClickException` exits `1`) but doesn't differentiate domain failures; use `ctx.exit(code)` for distinct codes
**Anti-patterns to flag:**
- `prompt=True` on options without a `--no-input` guard
- `click.confirm()` without checking `--yes`/`--force` first
- Using `click.echo()` for both data and messages (no stdout/stderr separation) — use `click.echo(..., err=True)` for messages
### Python — argparse
**Gives you for free:**
- Usage/error message on missing required args
- Layered help via subparsers
**Doesn't give you — must implement:**
- Examples in help text — use `epilog` with `RawDescriptionHelpFormatter`
- `--json` output — entirely manual
- Stdin support — use `type=argparse.FileType('r')` with `default='-'` or `nargs='?'`
- TTY detection, exit codes, output separation — all manual
**Anti-patterns to flag:**
- Using `input()` for missing values instead of making arguments required
- Default `HelpFormatter` re-wrapping epilog text and destroying example formatting — use `RawDescriptionHelpFormatter`
### Go — Cobra
**Gives you for free:**
- Layered help with usage and examples fields — but only if `Example:` field is populated
- Error on unknown flags
- Consistent subcommand structure via `AddCommand`
- `--help` on every command
**Doesn't give you — must implement:**
- `--json`/`--output` — common pattern is a persistent `--output` flag on root with `json`/`table`/`yaml` values
- `--dry-run` — entirely manual
- Stdin — read via `cmd.InOrStdin()` (or `os.Stdin` directly); validate argument counts with `cobra.ExactArgs` and friends
- TTY detection — use `golang.org/x/term` or `mattn/go-isatty`
**Anti-patterns to flag:**
- Empty `Example:` fields on commands
- Using `fmt.Println` for both data and errors — use `cmd.OutOrStdout()` and `cmd.ErrOrStderr()`
- `RunE` functions that return `nil` on failure instead of an error
### Rust — clap
**Gives you for free:**
- Layered help from derive macros
- Compile-time validation of required args
- Typed parsing with strong error messages
- Consistent subcommand structure via enums
**Doesn't give you — must implement:**
- `--json` output — use `serde_json::to_string_pretty` with a `--format` flag
- `--dry-run` — manual flag and logic
- Stdin — use `std::io::stdin()` with `is_terminal::IsTerminal` to detect piped input
- TTY detection — `is-terminal` crate (`is_terminal::IsTerminal` trait)
- Exit codes — use `std::process::exit()` with distinct codes or `ExitCode`
**Anti-patterns to flag:**
- Using `println!` for both data and diagnostics — use `eprintln!` for messages
- No examples in help text — add via `#[command(after_help = "Examples:\n mycli deploy --env staging")]`
### Node.js — Commander / yargs / oclif
**Gives you for free:**
- Commander: layered help, error on missing required, `--help` on all commands
- yargs: `.demandOption()` for required flags, `.example()` for help examples, `.fail()` for custom errors
- oclif: layered help and examples on every command
**Doesn't give you — must implement:**
- Commander: no built-in `--json`; stdin reading; TTY detection (`process.stdout.isTTY`)
- yargs: `--json` is manual; stdin via `process.stdin`
- oclif: `--json` requires per-command opt-in via `static enableJsonFlag = true`
**Anti-patterns to flag:**
- Using `inquirer` or `prompts` without checking `process.stdin.isTTY` first
- `console.log` for both data and messages — use `process.stdout.write` and `process.stderr.write`
- Commander `.action()` that calls `process.exit(0)` on errors
### Ruby — Thor
**Gives you for free:**
- Layered help, subcommand structure
- `method_option` for named flags
- Error on unknown flags
**Doesn't give you — must implement:**
- `--json` output — manual
- Stdin — use `$stdin.read` or `ARGF`
- TTY detection — `$stdout.tty?`
- Exit codes — `exit 1` or `abort`
**Anti-patterns to flag:**
- Using `ask()` or `yes?()` without a `--yes` flag bypass
- `say` for both data and messages — use `$stderr.puts` for messages
### Framework not listed
If the framework isn't above, apply the same pattern: identify what the framework gives for free by reading its documentation or source, what must be implemented manually, and what idiomatic patterns exist for each principle. Note your findings in the report so the user understands the basis for your recommendations.