claude-projects/agent-builder/.claude/playbook_testing_methodology.md
Backtalk6858 db6cbbdec1 init: add claude-config and agent-builder context files
Initial commit tracking session context, playbooks, and automation specs
for claude-config and agent-builder Claude Code conversations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-17 23:08:23 -05:00

name: Agent & Automation Testing Methodology
description: Mandatory testing methodology for all built automations before production promotion; covers N8N automations, claude agents, and scripts. Self-evolving document, updated after every test session.
type: project
version: 1.0

Playbook: Agent & Automation Testing Methodology

Self-evolution rule: After every test session, update this playbook — add new known failure modes, refine assertion patterns, increment the version number. The methodology improves every time something breaks in a new way.

Applies to: All automations in the automation_ideas table with types: n8n_automation, claude_agent, script


Before You Test — Required Reading Gate

| Task type | Read first |
| --- | --- |
| Testing any automation | Sandbox isolation rule · Four test levels · Type-specific section |
| Promoting to production | Promotion gate checklist |
| Adding a new failure mode | Known failure modes section + update rule |

Sandbox isolation rule (HARD): All testing happens in sandbox (server-01, 192.168.1.90). Sandbox Vault, Postgres, and N8N are test-only. No production credentials, no production data. See feedback_sandbox_isolation.md.


The Four Test Levels

Every automation must pass all four levels before promotion. Run in order — stop and log failure at the first level that fails.
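
The run-in-order, stop-at-first-failure rule can be sketched as a small runner. This is a minimal illustration, not the project's actual harness: `run_levels` and the `(passed, message)` return shape are hypothetical names chosen here, and the result dicts only borrow column names from the agent_test_results schema.

```python
from typing import Callable, List, Tuple

# A check takes no arguments and returns (passed, message).
LevelCheck = Callable[[], Tuple[bool, str]]


def run_levels(levels: List[Tuple[int, LevelCheck]]) -> List[dict]:
    """Run checks in ascending level order and stop at the first failure."""
    results = []
    for number, check in sorted(levels, key=lambda pair: pair[0]):
        try:
            passed, message = check()
        except Exception as exc:  # a crashing check counts as a failure
            passed, message = False, f"{type(exc).__name__}: {exc}"
        results.append({
            "test_level": number,
            "status": "pass" if passed else "fail",
            "error_message": None if passed else message,
        })
        if not passed:
            break  # later levels are not run
    return results
```

Each returned dict corresponds to one row to log; levels after the first failure produce no row at all.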

Level 1 — Structure Test

Does the built artifact have valid structure?

| Type | Check |
| --- | --- |
| n8n_automation | Workflow JSON is valid JSON; contains nodes, connections, settings keys; all node types exist in sandbox N8N |
| claude_agent | The claude -p call string is syntactically valid; prompt references correct tools/paths; output schema is defined |
| script | Python syntax check passes (python3 -m py_compile script.py); all imports are available in the target container image |

Pass criteria: No structural errors. Fail action: Log to agent_test_results, NTFY user, do NOT proceed to Level 2.
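
As a rough sketch of the offline parts of Level 1, the two functions below check the workflow JSON keys and the Python syntax. The function names are hypothetical. The sketch does not cover the checks that need a live environment (node types existing in sandbox N8N, imports being available in the container image).

```python
import json
import py_compile

REQUIRED_WORKFLOW_KEYS = ("nodes", "connections", "settings")


def check_workflow_structure(raw: str) -> list:
    """Return a list of structural problems; an empty list means pass."""
    try:
        workflow = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(workflow, dict):
        return ["top-level JSON value is not an object"]
    return [f"missing key: {key}"
            for key in REQUIRED_WORKFLOW_KEYS if key not in workflow]


def check_script_syntax(path: str) -> list:
    """Equivalent of `python3 -m py_compile <path>`, reported as a list."""
    try:
        py_compile.compile(path, doraise=True)
    except py_compile.PyCompileError as exc:
        return [f"syntax error: {exc.msg}"]
    return []
```

Returning a list of problems instead of a boolean makes it easy to copy the details into the error_message column when the level fails.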


Level 2 — Deployment Test

Does it deploy to sandbox without errors?

| Type | Check |
| --- | --- |
| n8n_automation | POST /api/v1/workflows succeeds; workflow appears in sandbox N8N UI; all credentials are assigned (no empty credential IDs) |
| claude_agent | Container builds and starts; docker ps shows healthy; claude -p "echo ok" returns without error from within the agent's execution context |
| script | Docker image builds; container starts; first log line appears within 30 seconds; exit code is 0 for one-shot scripts or container stays running for daemon scripts |

Pass criteria: No deployment errors, artifact is reachable. Fail action: Log to agent_test_results, NTFY user, tear down partial deployment in sandbox.
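
The "no empty credential IDs" check can be run against the workflow JSON returned after deployment. The sketch below assumes each node stores its credentials as a mapping of credential type to an object with an `id` field; confirm that layout against an export from your own sandbox before relying on it. `find_unassigned_credentials` is a hypothetical helper name.

```python
def find_unassigned_credentials(workflow: dict) -> list:
    """Return (node name, credential type) pairs whose credential id is empty.

    Assumed layout (verify against a real export):
    node["credentials"] == {"<type>": {"id": "...", "name": "..."}}
    """
    unassigned = []
    for node in workflow.get("nodes", []):
        for cred_type, cred in (node.get("credentials") or {}).items():
            cred_id = cred.get("id") if isinstance(cred, dict) else None
            if not cred_id:  # None, missing, or empty string
                unassigned.append((node.get("name", "<unnamed>"), cred_type))
    return unassigned
```

An empty return value means every credential slot that a node declares has an ID assigned.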


Level 3 — Smoke Test

Does it execute without crashing on minimal input?

| Type | Check |
| --- | --- |
| n8n_automation | Trigger one manual execution via N8N API (POST /api/v1/workflows/{id}/run); execution completes with status success (not error or crashed) |
| claude_agent | Run agent with a trivial, safe test input defined in the task_description; agent completes without exception; output is non-empty |
| script | Run with --dry-run flag if supported, or with a clearly safe test input; exits 0; no unhandled exceptions in logs |

Pass criteria: Execution completes, no crashes, no unhandled exceptions. Fail action: Capture full execution log, log to agent_test_results, NTFY user with error excerpt.
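
For the script type, the smoke test can be sketched with the standard library alone. The helper name `smoke_test`, the 60-second default timeout, and the use of a Python traceback header as the marker for an unhandled exception are all assumptions of this sketch, not part of the methodology.

```python
import subprocess
import sys


def smoke_test(command: list, timeout_s: int = 60) -> dict:
    """Run a command once; pass only on exit 0 with no traceback in the output."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "fail",
                "error_message": f"timed out after {timeout_s}s",
                "execution_log": ""}
    log = proc.stdout + proc.stderr  # full log is kept for the failure record
    problems = []
    if proc.returncode != 0:
        problems.append(f"exit code {proc.returncode}")
    if "Traceback (most recent call last)" in log:
        problems.append("unhandled exception in logs")
    return {"status": "fail" if problems else "pass",
            "error_message": "; ".join(problems) or None,
            "execution_log": log}


if __name__ == "__main__":
    # Hypothetical usage: replace with the script under test and its safe input.
    print(smoke_test([sys.executable, "-c", "print('dry run ok')"])["status"])
```

The returned dict reuses the status, error_message, and execution_log column names so it can be logged directly.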


Level 4 — Assertion Test

Does it produce the correct side effects?

This is the type-specific level. For each automation, the task_description must include at least one verifiable assertion. The builder agents are responsible for generating these assertions at build time.

| Type | Assertion patterns |
| --- | --- |
| n8n_automation | DB row was written/updated · NTFY notification received · HTTP response status was 200 · File was created at expected path |
| claude_agent | Output JSON contains required fields · Built artifact exists and passes its own Level 1 structure check · Side-effect DB row exists |
| script | Expected output file exists · DB was updated · Expected log line present |

LLM output validation rule (claude_agent): Never assert exact string match on LLM output — outputs are non-deterministic. Assert on: JSON schema validity, presence of required keys, value types, side effects produced.

Pass criteria: All assertions defined in task_description pass. Fail action: Log which assertions failed, NTFY user with details.
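
The LLM output validation rule can be illustrated with a checker that looks only at JSON validity, required keys, and value types. `validate_agent_output` and the key names in the usage example are hypothetical; a real assertion set would come from the task_description.

```python
import json


def validate_agent_output(raw: str, required: dict) -> list:
    """Check required keys and value types; never compares exact strings.

    `required` maps key name to expected Python type, e.g. {"summary": str}.
    Returns a list of failed assertions; an empty list means pass.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["output JSON is not an object"]
    failures = []
    for key, expected_type in required.items():
        if key not in data:
            failures.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            failures.append(f"{key}: expected {expected_type.__name__}, "
                            f"got {type(data[key]).__name__}")
    return failures


if __name__ == "__main__":
    # Two differently worded outputs both pass, because wording is not asserted.
    schema = {"summary": str, "files_written": list}
    print(validate_agent_output('{"summary": "Built it.", "files_written": []}', schema))
    print(validate_agent_output('{"summary": "Done!", "files_written": ["a.json"]}', schema))
```

Note that in Python `bool` is a subclass of `int`, so a schema entry of `int` also accepts `true`/`false`; tighten the check if that distinction matters for an assertion.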


Promotion Gate

When all four levels pass, the following checklist must be completed before the automation goes to production.

  • All 4 test levels logged as pass in agent_test_results
  • NTFY notification sent to user with test summary
  • User reviews and approves (NTFY → user replies or confirms in next session)
  • For n8n_automation: all sandbox credentials re-pointed to production equivalents (see project_sandbox_workflow_credential_rule.md)
  • For claude_agent: production paths/URLs substituted for sandbox paths
  • For script: production env vars set in Coolify; no hardcoded sandbox values
  • Production deployment verified (Level 2 re-run against production)
  • automation_ideas status updated to deployed
  • agent_test_results promotion record written

Promotion is not automatic. User approval is required after Level 4 passes. This is the v1.0 rule; it can be relaxed to auto-promotion for specific low-risk automation types once a track record is established.
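
The first and third checklist items (all four levels logged as pass, user approval given) lend themselves to a mechanical pre-check. The sketch below is illustrative only: `ready_for_promotion` is a hypothetical helper, and it assumes the rows for one automation are supplied oldest to newest so that a later re-run overrides an earlier failure.

```python
def ready_for_promotion(results: list, user_approved: bool) -> tuple:
    """Return (ready, blockers) from test rows for a single automation.

    Each row is a dict with "test_level" and "status" keys, ordered oldest
    to newest (an assumption of this sketch).
    """
    latest = {}
    for row in results:
        latest[row["test_level"]] = row["status"]  # newest row per level wins
    blockers = [f"level {n} is {latest.get(n, 'missing')}"
                for n in (1, 2, 3, 4) if latest.get(n) != "pass"]
    if not user_approved:
        blockers.append("user approval missing")
    return (not blockers, blockers)
```

The remaining checklist items (credential re-pointing, path substitution, production Level 2 re-run) still need their own verification; this only guards the gate's preconditions.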


Test Result Storage

All test results are written to the agent_test_results table (to be created in the api_business DB).

Required schema:

CREATE TABLE agent_test_results (
    id              SERIAL PRIMARY KEY,
    automation_id   INTEGER NOT NULL REFERENCES automation_ideas(id),
    test_level      INTEGER NOT NULL CHECK (test_level BETWEEN 1 AND 4),
    status          TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')),
    error_message   TEXT,
    execution_log   TEXT,
    tested_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    promoted_at     TIMESTAMPTZ,
    notes           TEXT
);
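
To try the logging flow without touching Postgres, the sketch below uses an in-memory SQLite database as a stand-in. This is an assumption made purely for illustration: the column types are adapted (SERIAL and TIMESTAMPTZ have no direct SQLite equivalent), the foreign key to automation_ideas is omitted, and `open_results_db` and `log_result` are hypothetical helpers. The CHECK constraints behave the same way in both engines.

```python
import sqlite3

# SQLite adaptation of the schema above; not the production DDL.
SCHEMA = """
CREATE TABLE agent_test_results (
    id            INTEGER PRIMARY KEY,
    automation_id INTEGER NOT NULL,
    test_level    INTEGER NOT NULL CHECK (test_level BETWEEN 1 AND 4),
    status        TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')),
    error_message TEXT,
    execution_log TEXT,
    tested_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    promoted_at   TEXT,
    notes         TEXT
)
"""


def open_results_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def log_result(conn, automation_id, test_level, status,
               error_message=None, execution_log=None) -> int:
    """Insert one test result row and return its id."""
    cur = conn.execute(
        "INSERT INTO agent_test_results "
        "(automation_id, test_level, status, error_message, execution_log) "
        "VALUES (?, ?, ?, ?, ?)",
        (automation_id, test_level, status, error_message, execution_log),
    )
    conn.commit()
    return cur.lastrowid
```

An out-of-range test_level or an unknown status raises `sqlite3.IntegrityError`, which mirrors how the CHECK constraints would reject the row in Postgres.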

NTFY Notification Patterns

| Event | Topic | Message format |
| --- | --- | --- |
| Level fail | homelab-alerts | [AGENT TEST FAIL] {name} — Level {N}: {error excerpt} |
| All levels pass | homelab-alerts | [AGENT TEST PASS] {name} — ready for your review and promotion |
| Promotion complete | homelab-alerts | [AGENT DEPLOYED] {name} — now live in production |
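
The three message formats can be produced by small formatter functions, with the HTTP request built separately so it can be inspected before anything is sent. This is a sketch under assumptions: the base URL in the usage example is a placeholder, the 200-character excerpt limit is arbitrary, and publishing is assumed to be a plain POST of the message body to `<base_url>/<topic>`.

```python
import urllib.request

TOPIC = "homelab-alerts"


def fail_message(name: str, level: int, error_excerpt: str, max_len: int = 200) -> str:
    # Excerpt is truncated so the notification stays readable (limit is arbitrary).
    return f"[AGENT TEST FAIL] {name} — Level {level}: {error_excerpt[:max_len]}"


def pass_message(name: str) -> str:
    return f"[AGENT TEST PASS] {name} — ready for your review and promotion"


def deployed_message(name: str) -> str:
    return f"[AGENT DEPLOYED] {name} — now live in production"


def build_request(base_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST of the message body to the topic URL."""
    url = f"{base_url.rstrip('/')}/{TOPIC}"
    return urllib.request.Request(url, data=message.encode("utf-8"), method="POST")


if __name__ == "__main__":
    # Placeholder host; sending would be urllib.request.urlopen(req, timeout=10).
    req = build_request("http://ntfy.example.invalid", pass_message("demo-automation"))
    print(req.get_method(), req.full_url)
```

Keeping formatting and sending apart means the formats can be unit-tested without a notification server.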

Known Failure Modes

(Updated as new failures are discovered during testing)

| ID | Type | Failure | Root cause | Fix |
| --- | --- | --- | --- | --- |
| (none yet; the first test session will populate this table) | | | | |

Self-Evolution Instructions

After every test session:

  1. Add any new failure mode to the Known Failure Modes table with ID, type, failure, root cause, and fix
  2. If a Level assertion was too loose (passed but shouldn't have) or too strict (failed but should have passed), update the assertion pattern for that level and type
  3. Increment the version number in the frontmatter
  4. Note the date and what changed at the bottom of this file

Change log:

  • v1.0 (2026-06-16): Initial methodology — four levels, user-approval promotion gate, NTFY notifications, self-evolution rule