<testing_philosophy>
## Testing Philosophy

### Test Outcomes, Not Procedures

Traditional (procedure-focused):

```typescript
// Testing that a specific function was called with specific args
expect(mockProcessFeedback).toHaveBeenCalledWith({
  message: "Great app!",
  category: "praise",
  priority: 2
});
```

Agent-native (outcome-focused):

```typescript
// Testing that the outcome was achieved
const result = await agent.process("Great app!");
const storedFeedback = await db.feedback.getLatest();
expect(storedFeedback.content).toContain("Great app");
expect(storedFeedback.importance).toBeGreaterThanOrEqual(1);
expect(storedFeedback.importance).toBeLessThanOrEqual(5);
// We don't care exactly how it categorized—just that it's reasonable
```

### Accept Variability

Agents may solve problems differently each time. Your tests should:

- Verify the end state, not the path
- Accept reasonable ranges, not exact values
- Check for presence of required elements, not exact format
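When the agent has latitude, exact-equality assertions are brittle. A minimal sketch of variability-tolerant assertions, reusing the `storedFeedback` shape from the example above (the set of acceptable category values is hypothetical):

```typescript
const storedFeedback = await db.feedback.getLatest();

// Accept any of several reasonable categorizations instead of one exact value
const acceptableCategories = ["praise", "compliment", "positive"];
expect(acceptableCategories).toContain(storedFeedback.category);

// Require the key elements to be present, regardless of exact wording
expect(storedFeedback.content).toEqual(expect.stringMatching(/great|app/i));
```
</testing_philosophy>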
<can_agent_do_it_test>
## The "Can Agent Do It?" Test

For each UI feature, write a test prompt and verify the agent can accomplish it.

### Template
```typescript
describe('Agent Capability Tests', () => {
  test('Agent can add a book to library', async () => {
    const result = await agent.chat("Add 'Moby Dick' by Herman Melville to my library");

    // Verify outcome
    const library = await libraryService.getBooks();
    const mobyDick = library.find(b => b.title.includes("Moby Dick"));
    expect(mobyDick).toBeDefined();
    expect(mobyDick.author).toContain("Melville");
  });

  test('Agent can publish to feed', async () => {
    // Setup: ensure a book exists
    await libraryService.addBook({ id: "book_123", title: "1984" });

    const result = await agent.chat("Write something about surveillance themes in my feed");

    // Verify outcome
    const feed = await feedService.getItems();
    const newItem = feed.find(item => item.bookId === "book_123");
    expect(newItem).toBeDefined();
    expect(newItem.content.toLowerCase()).toMatch(/surveillance|watching|control/);
  });

  test('Agent can search and save research', async () => {
    await libraryService.addBook({ id: "book_456", title: "Moby Dick" });

    const result = await agent.chat("Research whale symbolism in Moby Dick");

    // Verify files were created
    const files = await fileService.listFiles("Research/book_456/");
    expect(files.length).toBeGreaterThan(0);

    // Verify content is relevant
    const content = await fileService.readFile(files[0]);
    expect(content.toLowerCase()).toMatch(/whale|symbolism|melville/);
  });
});
```
### The "Write to Location" Test

A key litmus test: can the agent create content in specific app locations?
```typescript
describe('Location Awareness Tests', () => {
  const locations = [
    { userPhrase: "my reading feed", expectedTool: "publish_to_feed" },
    { userPhrase: "my library", expectedTool: "add_book" },
    { userPhrase: "my research folder", expectedTool: "write_file" },
    { userPhrase: "my profile", expectedTool: "write_file" },
  ];

  for (const { userPhrase, expectedTool } of locations) {
    test(`Agent knows how to write to "${userPhrase}"`, async () => {
      const prompt = `Write a test note to ${userPhrase}`;
      const result = await agent.chat(prompt);

      // Check that the agent used the right tool (or achieved the outcome)
      expect(result.toolCalls).toContainEqual(
        expect.objectContaining({ name: expectedTool })
      );

      // Or verify the outcome directly:
      // expect(await locationHasNewContent(userPhrase)).toBe(true);
    });
  }
});
```
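The commented-out alternative assumes a small helper. Here is a minimal sketch, under the assumption that each location maps onto one of the services used above and that the test snapshots the location before calling `agent.chat` (`snapshotLocation`, `locationHasNewContent`, and the `Profile/` path are hypothetical):

```typescript
// Hypothetical outcome-checking helper for the location tests above.
// Call snapshotLocation(userPhrase) before agent.chat(), then
// locationHasNewContent(userPhrase) to see whether anything was added.
const countsBefore = new Map<string, number>();

async function countAt(userPhrase: string): Promise<number> {
  switch (userPhrase) {
    case "my reading feed":    return (await feedService.getItems()).length;
    case "my library":         return (await libraryService.getBooks()).length;
    case "my research folder": return (await fileService.listFiles("Research/")).length;
    case "my profile":         return (await fileService.listFiles("Profile/")).length;
    default: throw new Error(`Unknown location: ${userPhrase}`);
  }
}

async function snapshotLocation(userPhrase: string): Promise<void> {
  countsBefore.set(userPhrase, await countAt(userPhrase));
}

async function locationHasNewContent(userPhrase: string): Promise<boolean> {
  const before = countsBefore.get(userPhrase) ?? 0;
  return (await countAt(userPhrase)) > before;
}
```

Checking the outcome this way keeps the test robust even if the agent reaches the location through a different tool than expected.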
</can_agent_do_it_test>
<surprise_test>
## The "Surprise Test"

A well-designed agent-native app lets the agent figure out creative approaches. Test this by giving open-ended requests.

### The Test
```typescript
describe('Agent Creativity Tests', () => {
  test('Agent can handle open-ended requests', async () => {
    // Setup: user has some books
    await libraryService.addBook({ id: "1", title: "1984", author: "Orwell" });
    await libraryService.addBook({ id: "2", title: "Brave New World", author: "Huxley" });
    await libraryService.addBook({ id: "3", title: "Fahrenheit 451", author: "Bradbury" });

    // Open-ended request
    const result = await agent.chat("Help me organize my reading for next month");

    // The agent should do SOMETHING useful.
    // We don't specify exactly what—that's the point.
    expect(result.toolCalls.length).toBeGreaterThan(0);

    // It should have engaged with the library
    const libraryTools = ["read_library", "write_file", "publish_to_feed"];
    const usedLibraryTool = result.toolCalls.some(
      call => libraryTools.includes(call.name)
    );
    expect(usedLibraryTool).toBe(true);
  });

  test('Agent finds creative solutions', async () => {
    // Don't specify HOW to accomplish the task
    const result = await agent.chat(
      "I want to understand the dystopian themes across my sci-fi books"
    );

    // Agent might:
    // - Read all books and create a comparison document
    // - Research dystopian literature and relate it to the user's books
    // - Create a mind map in a markdown file
    // - Publish a series of insights to the feed
    // We just verify it did something substantive.
    expect(result.response.length).toBeGreaterThan(100);
    expect(result.toolCalls.length).toBeGreaterThan(0);
  });
});
```
### What Failure Looks Like

```typescript
// FAILURE: the agent can only say it can't do that
const result = await agent.chat("Help me prepare for a book club discussion");

// Bad outcome:
expect(result.response).not.toContain("I can't");
expect(result.response).not.toContain("I don't have a tool");
expect(result.response).not.toContain("Could you clarify");

// If the agent asks for clarification on something it should understand,
// you have a context injection or capability gap
```
</surprise_test>
<parity_testing>
## Automated Parity Testing

Ensure every UI action has an agent equivalent.

### Capability Map Testing
```typescript
// capability-map.ts
export const capabilityMap = {
  // UI action: agent tool
  "View library": "read_library",
  "Add book": "add_book",
  "Delete book": "delete_book",
  "Publish insight": "publish_to_feed",
  "Start research": "start_research",
  "View highlights": "read_library", // same tool, different query
  "Edit profile": "write_file",
  "Search web": "web_search",
  "Export data": "N/A", // UI-only action
};
```

```typescript
// parity.test.ts
import { capabilityMap } from './capability-map';
import { getAgentTools } from './agent-config';
import { getSystemPrompt } from './system-prompt';

describe('Action Parity', () => {
  const agentTools = getAgentTools();
  const systemPrompt = getSystemPrompt();

  for (const [uiAction, toolName] of Object.entries(capabilityMap)) {
    if (toolName === 'N/A') continue;

    test(`"${uiAction}" has agent tool: ${toolName}`, () => {
      const toolNames = agentTools.map(t => t.name);
      expect(toolNames).toContain(toolName);
    });

    test(`${toolName} is documented in system prompt`, () => {
      expect(systemPrompt).toContain(toolName);
    });
  }
});
```
### Context Parity Testing
```typescript
describe('Context Parity', () => {
  test('Agent sees all data that UI shows', async () => {
    // Setup: create some data
    await libraryService.addBook({ id: "1", title: "Test Book" });
    await feedService.addItem({ id: "f1", content: "Test insight" });

    // Get the system prompt (which includes context)
    const systemPrompt = await buildSystemPrompt();

    // Verify data is included
    expect(systemPrompt).toContain("Test Book");
    expect(systemPrompt).toContain("Test insight");
  });

  test('Recent activity is visible to agent', async () => {
    // Perform some actions
    await activityService.log({ action: "highlighted", bookId: "1" });
    await activityService.log({ action: "researched", bookId: "2" });

    const systemPrompt = await buildSystemPrompt();

    // Verify activity is included
    expect(systemPrompt).toMatch(/highlighted|researched/);
  });
});
```
</parity_testing>
<integration_testing>
## Integration Testing

Test the full flow from user request to outcome.

### End-to-End Flow Tests
```typescript
describe('End-to-End Flows', () => {
  test('Research flow: request → web search → file creation', async () => {
    // Setup
    const bookId = "book_123";
    await libraryService.addBook({ id: bookId, title: "Moby Dick" });

    // User request
    await agent.chat("Research the historical context of whaling in Moby Dick");

    // Verify: web search was performed
    const searchCalls = mockWebSearch.mock.calls;
    expect(searchCalls.length).toBeGreaterThan(0);
    expect(searchCalls.some(call =>
      call[0].query.toLowerCase().includes("whaling")
    )).toBe(true);

    // Verify: files were created
    const researchFiles = await fileService.listFiles(`Research/${bookId}/`);
    expect(researchFiles.length).toBeGreaterThan(0);

    // Verify: content is relevant
    const content = await fileService.readFile(researchFiles[0]);
    expect(content.toLowerCase()).toMatch(/whale|whaling|nantucket|melville/);
  });

  test('Publish flow: request → tool call → feed update → UI reflects', async () => {
    // Setup
    await libraryService.addBook({ id: "book_1", title: "1984" });

    // Initial state
    const feedBefore = await feedService.getItems();

    // User request
    await agent.chat("Write something about Big Brother for my reading feed");

    // Verify the feed updated
    const feedAfter = await feedService.getItems();
    expect(feedAfter.length).toBe(feedBefore.length + 1);

    // Verify content
    const newItem = feedAfter.find(item =>
      !feedBefore.some(old => old.id === item.id)
    );
    expect(newItem).toBeDefined();
    expect(newItem.content.toLowerCase()).toMatch(/big brother|surveillance|watching/);
  });
});
```
### Failure Recovery Tests
```typescript
describe('Failure Recovery', () => {
  test('Agent handles missing book gracefully', async () => {
    const result = await agent.chat("Tell me about 'Nonexistent Book'");

    // The agent should not crash
    expect(result.error).toBeUndefined();

    // The agent should acknowledge the issue
    expect(result.response.toLowerCase()).toMatch(
      /not found|don't see|can't find|library/
    );
  });

  test('Agent recovers from API failure', async () => {
    // Mock an API failure
    mockWebSearch.mockRejectedValueOnce(new Error("Network error"));

    const result = await agent.chat("Research this topic");

    // The agent should handle it gracefully
    expect(result.error).toBeUndefined();
    expect(result.response).not.toContain("unhandled exception");

    // The agent should communicate the issue
    expect(result.response.toLowerCase()).toMatch(
      /couldn't search|unable to|try again/
    );
  });
});
```
</integration_testing>
<snapshot_testing>
## Snapshot Testing for System Prompts

Track changes to system prompts and context injection over time.
```typescript
describe('System Prompt Stability', () => {
  test('System prompt structure matches snapshot', async () => {
    const systemPrompt = await buildSystemPrompt();

    // Extract the structure (removing dynamic data)
    const structure = systemPrompt
      .replace(/id: \w+/g, 'id: [ID]')
      .replace(/"[^"]+"/g, '"[TITLE]"')
      .replace(/\d{4}-\d{2}-\d{2}/g, '[DATE]');

    expect(structure).toMatchSnapshot();
  });

  test('All capability sections are present', async () => {
    const systemPrompt = await buildSystemPrompt();

    const requiredSections = [
      "Your Capabilities",
      "Available Books",
      "Recent Activity",
    ];

    for (const section of requiredSections) {
      expect(systemPrompt).toContain(section);
    }
  });
});
```
</snapshot_testing>
<manual_testing>
## Manual Testing Checklist

Some things are best tested manually during development:

### Natural Language Variation Test
Try multiple phrasings for the same request:

- "Add this to my feed"
- "Write something in my reading feed"
- "Publish an insight about this"
- "Put this in the feed"
- "I want this in my feed"

All should work if context injection is correct.
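These phrasings also lend themselves to an automated variation test. A sketch, assuming each phrasing should produce exactly one new feed item and reusing `agent.chat` and `feedService` from the examples above:

```typescript
// Every phrasing should lead to the same outcome: one new feed item.
const feedPhrasings = [
  "Add this to my feed",
  "Write something in my reading feed",
  "Publish an insight about this",
  "Put this in the feed",
  "I want this in my feed",
];

for (const phrasing of feedPhrasings) {
  test(`Phrasing reaches the feed: "${phrasing}"`, async () => {
    const before = (await feedService.getItems()).length;
    await agent.chat(phrasing);
    const after = (await feedService.getItems()).length;
    expect(after).toBe(before + 1);
  });
}
```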
### Edge Case Prompts

- "What can you do?" → the agent should describe its capabilities
- "Help me with my books" → the agent should engage with the library, not ask what "books" means
- "Write something" → the agent should ask WHERE (feed, file, etc.) if it's not clear
- "Delete everything" → the agent should confirm before destructive actions
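The destructive-action case is worth pinning down in an automated test as well. A sketch, assuming `result.toolCalls` exposes tool usage as in the earlier examples and `delete_book` is the destructive tool from the capability map above:

```typescript
test('Agent confirms before destructive actions', async () => {
  await libraryService.addBook({ id: "1", title: "1984" });

  const result = await agent.chat("Delete everything");

  // The agent should ask first, not call destructive tools outright
  const deletedSomething = result.toolCalls.some(call => call.name === "delete_book");
  expect(deletedSomething).toBe(false);
  expect(result.response.toLowerCase()).toMatch(/are you sure|confirm|really/);
});
```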
### Confusion Test

Ask about things that should exist but might not be properly connected:

- "What's in my research folder?" → should list files, not ask "what research folder?"
- "Show me my recent reading" → should show activity, not ask "what do you mean?"
- "Continue where I left off" → should reference recent activity if available
</manual_testing>
<ci_integration>
## CI/CD Integration

Add agent-native tests to your CI pipeline:
```yaml
# .github/workflows/test.yml
name: Agent-Native Tests
on: [push, pull_request]

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup
        run: npm install

      - name: Run Parity Tests
        run: npm run test:parity

      - name: Run Capability Tests
        run: npm run test:capabilities
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Check System Prompt Completeness
        run: npm run test:system-prompt

      - name: Verify Capability Map
        run: |
          # Ensure the capability map is up to date
          npm run generate:capability-map
          git diff --exit-code capability-map.ts
```
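The `generate:capability-map` script is app-specific; one way to approximate the drift check is a verification script that compares the agent's registered tools against the checked-in map. A sketch, assuming `getAgentTools` and `capabilityMap` from the parity tests above (the script path and failure behavior are illustrative):

```typescript
// scripts/check-capability-map.ts (hypothetical)
// Fails CI when the capability map and the registered agent tools drift apart.
import { getAgentTools } from '../agent-config';
import { capabilityMap } from '../capability-map';

const registered = new Set(getAgentTools().map(t => t.name));
const mapped = new Set(
  Object.values(capabilityMap).filter(name => name !== 'N/A')
);

// Map entries that point at tools the agent no longer registers
const stale = [...mapped].filter(name => !registered.has(name));
// Registered tools that no UI action is mapped to
const unmapped = [...registered].filter(name => !mapped.has(name));

if (stale.length > 0 || unmapped.length > 0) {
  console.error('Capability map drift detected.');
  if (stale.length) console.error('  Stale map entries:', stale);
  if (unmapped.length) console.error('  Unmapped tools:', unmapped);
  process.exit(1);
}
```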
### Cost-Aware Testing

Agent tests cost API tokens. Strategies to manage this:
```typescript
// Use smaller models for basic tests
const testConfig = {
  model: process.env.CI ? "claude-3-haiku" : "claude-3-opus",
  maxTokens: 500, // Limit output length
};

// Cache responses for deterministic tests
const cachedAgent = new CachedAgent({
  cacheDir: ".test-cache",
  ttl: 24 * 60 * 60 * 1000, // 24 hours
});

// Run expensive tests only on the main branch
if (process.env.GITHUB_REF === 'refs/heads/main') {
  describe('Full Integration Tests', () => { ... });
}
```
</ci_integration>
<test_utilities>
## Test Utilities

### Agent Test Harness
```typescript
class AgentTestHarness {
  private agent: Agent;
  private mockServices: MockServices;

  async setup() {
    this.mockServices = createMockServices();
    this.agent = await createAgent({
      services: this.mockServices,
      model: "claude-3-haiku", // Cheaper for tests
    });
  }

  async chat(message: string): Promise<AgentResponse> {
    return this.agent.chat(message);
  }

  async expectToolCall(toolName: string) {
    const lastResponse = this.agent.getLastResponse();
    expect(lastResponse.toolCalls.map(t => t.name)).toContain(toolName);
  }

  async expectOutcome(check: () => Promise<boolean>) {
    const result = await check();
    expect(result).toBe(true);
  }

  getState() {
    return {
      library: this.mockServices.library.getBooks(),
      feed: this.mockServices.feed.getItems(),
      files: this.mockServices.files.listAll(),
    };
  }
}

// Usage
test('full flow', async () => {
  const harness = new AgentTestHarness();
  await harness.setup();

  await harness.chat("Add 'Moby Dick' to my library");

  await harness.expectToolCall("add_book");
  await harness.expectOutcome(async () => {
    const state = harness.getState();
    return state.library.some(b => b.title.includes("Moby"));
  });
});
```
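The harness assumes a `createMockServices` factory. A minimal in-memory sketch, with service shapes inferred from the calls used throughout this document (the `Book` and `FeedItem` interfaces are illustrative):

```typescript
// Hypothetical in-memory services matching the harness's expectations.
interface Book { id: string; title: string; author?: string; }
interface FeedItem { id: string; bookId?: string; content: string; }

function createMockServices() {
  const books: Book[] = [];
  const feed: FeedItem[] = [];
  const files = new Map<string, string>();

  return {
    library: {
      getBooks: () => books,
      addBook: (book: Book) => { books.push(book); },
    },
    feed: {
      getItems: () => feed,
      addItem: (item: FeedItem) => { feed.push(item); },
    },
    files: {
      listAll: () => [...files.keys()],
      readFile: (path: string) => files.get(path) ?? "",
      writeFile: (path: string, content: string) => { files.set(path, content); },
    },
  };
}
```

Because the services are plain in-memory objects, `getState()` stays synchronous and tests can inspect outcomes without touching real storage.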
</test_utilities>
## Testing Checklist

Automated Tests:

- "Can Agent Do It?" tests for each UI action
- Location awareness tests ("write to my feed")
- Parity tests (tool exists, documented in prompt)
- Context parity tests (agent sees what UI shows)
- End-to-end flow tests
- Failure recovery tests

Manual Tests:

- Natural language variation (multiple phrasings work)
- Edge case prompts (open-ended requests)
- Confusion test (agent knows app vocabulary)
- Surprise test (agent can be creative)

CI Integration:

- Parity tests run on every PR
- Capability tests run with API key
- System prompt completeness check
- Capability map drift detection