claude-projects/agent-builder/.claude/playbook_testing_methodology.md

---
name: Agent & Automation Testing Methodology
description: Mandatory testing methodology for all built automations before production promotion; covers N8N automations, Claude agents, and scripts; self-evolving document updated after every test session
type: project
version: 1.0
---

# Playbook: Agent & Automation Testing Methodology

**Self-evolution rule:** After every test session, update this playbook — add new known failure modes, refine assertion patterns, increment the version number. The methodology improves every time something breaks in a new way.

**Applies to:** All automations in the `automation_ideas` table with types: `n8n_automation`, `claude_agent`, `script`

---

## Before You Test — Required Reading Gate

| Task type | Read first |
|---|---|
| Testing any automation | Sandbox isolation rule · Four test levels · Type-specific section |
| Promoting to production | Promotion gate checklist |
| Adding a new failure mode | Known failure modes section + update rule |

**Sandbox isolation rule (HARD):** All testing happens in the sandbox (server-01, 192.168.1.90). The sandbox Vault, Postgres, and N8N instances are test-only. Never use production credentials or production data in a test. See `feedback_sandbox_isolation.md`.

---

## The Four Test Levels

Every automation must pass all four levels before promotion. Run in order — stop and log failure at the first level that fails.
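
The run-in-order, stop-at-first-failure rule can be sketched as a small driver. This is a minimal illustration, not the project's actual runner: `run_levels` and the shape of each check (a callable returning `(passed, message)`) are hypothetical, and the result dictionaries only borrow the `test_level`, `status`, and `error_message` column names from the `agent_test_results` schema.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shape: each level check returns (passed, message).
LevelCheck = Callable[[], Tuple[bool, str]]


def run_levels(checks: List[LevelCheck]) -> List[Dict]:
    """Run level checks in order and stop at the first level that fails."""
    results = []
    for level, check in enumerate(checks, start=1):
        try:
            passed, message = check()
        except Exception as exc:  # a check that crashes counts as a failure
            passed, message = False, f"{type(exc).__name__}: {exc}"
        results.append({
            "test_level": level,
            "status": "pass" if passed else "fail",
            "error_message": None if passed else message,
        })
        if not passed:
            break  # later levels are not attempted
    return results
```

With four checks where the second fails, the returned list has two entries and Levels 3 and 4 are never run.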

### Level 1 — Structure Test
Does the built artifact have valid structure?

| Type | Check |
|---|---|
| `n8n_automation` | Workflow JSON is valid JSON; contains `nodes`, `connections`, `settings` keys; all node types exist in sandbox N8N |
| `claude_agent` | The `claude -p` call string is syntactically valid; prompt references correct tools/paths; output schema is defined |
| `script` | Python syntax check passes (`python3 -m py_compile script.py`); all imports are available in the target container image |

**Pass criteria:** No structural errors. **Fail action:** Log to `agent_test_results`, NTFY user, do NOT proceed to Level 2.
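
A rough sketch of the offline parts of the Level 1 check, assuming the workflow JSON and the script source are available as strings. The function names are hypothetical. The syntax check uses the built-in `compile()` as an in-process stand-in for `python3 -m py_compile`; it does not verify that N8N node types exist in the sandbox or that imports resolve in the container image, which still need to be checked against the real environment.

```python
import json
from typing import List

REQUIRED_WORKFLOW_KEYS = {"nodes", "connections", "settings"}


def check_workflow_structure(raw: str) -> List[str]:
    """Return structural errors for a workflow JSON string; empty list means pass."""
    try:
        workflow = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(workflow, dict):
        return ["top-level JSON value must be an object"]
    missing = sorted(REQUIRED_WORKFLOW_KEYS - set(workflow))
    return [f"missing key: {key}" for key in missing]


def check_script_syntax(source: str, filename: str = "script.py") -> List[str]:
    """Syntax-check Python source without executing it; empty list means pass."""
    try:
        compile(source, filename, "exec")
    except (SyntaxError, ValueError) as exc:
        return [f"{type(exc).__name__}: {exc}"]
    return []
```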

---

### Level 2 — Deployment Test
Does it deploy to sandbox without errors?

| Type | Check |
|---|---|
| `n8n_automation` | `POST /api/v1/workflows` succeeds; workflow appears in sandbox N8N UI; all credentials are assigned (no empty credential IDs) |
| `claude_agent` | Container builds and starts; `docker ps` shows healthy; `claude -p "echo ok"` returns without error from within the agent's execution context |
| `script` | Docker image builds; container starts; first log line appears within 30 seconds; exit code is 0 for one-shot scripts or container stays running for daemon scripts |

**Pass criteria:** No deployment errors, artifact is reachable. **Fail action:** Log to `agent_test_results`, NTFY user, tear down partial deployment in sandbox.
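
The "no empty credential IDs" check for `n8n_automation` could look roughly like the sketch below. It assumes each node stores credentials as a mapping of credential type to an object with an `id` field, which should be confirmed against a workflow exported from the sandbox N8N. `find_empty_credentials` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def find_empty_credentials(workflow: Dict) -> List[Tuple[str, str]]:
    """Return (node name, credential type) pairs whose credential ID is missing or empty.

    Assumed node shape: {"name": ..., "credentials": {"<type>": {"id": ..., "name": ...}}}
    """
    problems = []
    for node in workflow.get("nodes", []):
        credentials = node.get("credentials") or {}
        for cred_type, cred in credentials.items():
            cred_id = cred.get("id") if isinstance(cred, dict) else None
            if not cred_id:
                problems.append((node.get("name", "<unnamed>"), cred_type))
    return problems
```

An empty return value means every credential slot that appears in the workflow has an ID assigned; nodes without a `credentials` key are ignored.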

---

### Level 3 — Smoke Test
Does it execute without crashing on minimal input?

| Type | Check |
|---|---|
| `n8n_automation` | Trigger one manual execution via N8N API (`POST /api/v1/workflows/{id}/run`); execution completes with status `success` (not `error` or `crashed`) |
| `claude_agent` | Run agent with a trivial, safe test input defined in the `task_description`; agent completes without exception; output is non-empty |
| `script` | Run with `--dry-run` flag if supported, or with a clearly safe test input; exits 0; no unhandled exceptions in logs |

**Pass criteria:** Execution completes, no crashes, no unhandled exceptions. **Fail action:** Capture full execution log, log to `agent_test_results`, NTFY user with error excerpt.
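
For the `script` type, the smoke test can be approximated with a subprocess wrapper like the one below. This is a sketch under assumptions: `smoke_test` is a hypothetical name, the 60-second default timeout is arbitrary, and scanning for a Python traceback header is only a heuristic for "unhandled exceptions in logs".

```python
import subprocess
import sys
from typing import Dict, List

TRACEBACK_MARKER = "Traceback (most recent call last)"


def smoke_test(cmd: List[str], timeout_s: int = 60) -> Dict:
    """Run a command once; pass only on exit code 0 with no traceback in the output."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "fail", "error_message": f"timed out after {timeout_s}s",
                "execution_log": ""}
    log = proc.stdout + proc.stderr
    if proc.returncode != 0:
        error = f"exit code {proc.returncode}"
    elif TRACEBACK_MARKER in log:
        error = "unhandled exception in logs"
    else:
        error = None
    return {"status": "fail" if error else "pass",
            "error_message": error,
            "execution_log": log}  # the full log is kept for agent_test_results


# Example: run the current interpreter on a trivial, safe input.
example = smoke_test([sys.executable, "-c", "print('ok')"])
```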

---

### Level 4 — Assertion Test
Does it produce the correct side effects?

This is the type-specific level. For each automation, the `task_description` must include at least one verifiable assertion. The builder agents are responsible for generating these assertions at build time.

| Type | Assertion patterns |
|---|---|
| `n8n_automation` | DB row was written/updated · NTFY notification received · HTTP response status was 200 · File was created at expected path |
| `claude_agent` | Output JSON contains required fields · Built artifact exists and passes the Level 1 structure check for its own type · Side-effect DB row exists |
| `script` | Expected output file exists · DB was updated · Expected log line present |

**LLM output validation rule (claude_agent):** Never assert exact string match on LLM output — outputs are non-deterministic. Assert on: JSON schema validity, presence of required keys, value types, side effects produced.

**Pass criteria:** All assertions defined in `task_description` pass. **Fail action:** Log which assertions failed, NTFY user with details.
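
The LLM output validation rule can be illustrated with a check that looks only at structure: valid JSON, required keys, and value types. The field names in the usage note are hypothetical; real ones would come from the automation's `task_description`.

```python
import json
from typing import Dict, List


def validate_agent_output(raw: str, required: Dict[str, type]) -> List[str]:
    """Return failed assertions for an agent's JSON output; empty list means pass.

    Only structure is checked (keys and types), never exact string content.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["output JSON must be an object"]
    failures = []
    for key, expected_type in required.items():
        if key not in data:
            failures.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            failures.append(
                f"{key}: expected {expected_type.__name__}, got {type(data[key]).__name__}"
            )
    return failures
```

For example, `validate_agent_output(output, {"artifact_path": str, "files_written": list})` passes for any wording the model chooses, as long as both fields are present with the right types.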

---

## Promotion Gate

When all four levels pass, the following checklist must be completed before the automation goes to production.

- [ ] All 4 test levels logged as `pass` in `agent_test_results`
- [ ] NTFY notification sent to user with test summary
- [ ] **User reviews and approves** (NTFY → user replies or confirms in next session)
- [ ] For `n8n_automation`: all sandbox credentials re-pointed to production equivalents (see `project_sandbox_workflow_credential_rule.md`)
- [ ] For `claude_agent`: production paths/URLs substituted for sandbox paths
- [ ] For `script`: production env vars set in Coolify; no hardcoded sandbox values
- [ ] Production deployment verified (Level 2 re-run against production)
- [ ] `automation_ideas` status updated to `deployed`
- [ ] `agent_test_results` promotion record written

**Promotion is not automatic.** User approval is required after Level 4 passes. This is the v1.0 rule; it can be relaxed to auto-promote specific low-risk automation types once a track record is established.
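
A minimal sketch of the first and third checklist items as a pre-promotion guard. It assumes the rows for one automation are supplied oldest first, so a later re-run of a level overrides an earlier result. `ready_for_promotion` and the `user_approved` flag are hypothetical; how approval is actually recorded is not defined here.

```python
from typing import Dict, List, Tuple


def ready_for_promotion(results: List[Dict], user_approved: bool) -> Tuple[bool, List[str]]:
    """Return (ready, blockers) from agent_test_results-style rows for one automation."""
    latest = {}
    for row in results:  # assumed ordered oldest to newest
        latest[row["test_level"]] = row["status"]
    blockers = []
    for level in (1, 2, 3, 4):
        status = latest.get(level, "missing")
        if status != "pass":
            blockers.append(f"level {level} is {status}")
    if not user_approved:
        blockers.append("user approval not recorded")
    return (not blockers, blockers)
```

The remaining checklist items (credential re-pointing, production env vars, the Level 2 re-run against production) still need their own checks.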

---

## Test Result Storage

All test results are written to the `agent_test_results` table, which is to be created in the `api_business` DB.

**Required schema:**
```sql
CREATE TABLE agent_test_results (
    id              SERIAL PRIMARY KEY,
    automation_id   INTEGER NOT NULL REFERENCES automation_ideas(id),
    test_level      INTEGER NOT NULL CHECK (test_level BETWEEN 1 AND 4),
    status          TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')),
    error_message   TEXT,
    execution_log   TEXT,
    tested_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    promoted_at     TIMESTAMPTZ,
    notes           TEXT
);
```
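
Writing a result row from Python might look like the sketch below. It assumes a psycopg-style driver that uses `%s` placeholders. `build_result_row` is a hypothetical helper that mirrors the two `CHECK` constraints so bad values fail before reaching the database; `id` and `tested_at` are left to their column defaults.

```python
from typing import Optional, Tuple

INSERT_SQL = """
INSERT INTO agent_test_results
    (automation_id, test_level, status, error_message, execution_log, notes)
VALUES (%s, %s, %s, %s, %s, %s)
"""

VALID_STATUSES = {"pass", "fail", "skip"}


def build_result_row(automation_id: int, test_level: int, status: str,
                     error_message: Optional[str] = None,
                     execution_log: Optional[str] = None,
                     notes: Optional[str] = None) -> Tuple:
    """Validate against the schema's CHECK constraints and return INSERT parameters."""
    if test_level not in (1, 2, 3, 4):
        raise ValueError(f"test_level must be 1-4, got {test_level!r}")
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}, got {status!r}")
    return (automation_id, test_level, status, error_message, execution_log, notes)
```

Usage, given an open cursor from such a driver: `cursor.execute(INSERT_SQL, build_result_row(12, 3, "fail", error_message="exit code 1"))`.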

---

## NTFY Notification Patterns

| Event | Topic | Message format |
|---|---|---|
| Level fail | `homelab-alerts` | `[AGENT TEST FAIL] {name} — Level {N}: {error excerpt}` |
| All levels pass | `homelab-alerts` | `[AGENT TEST PASS] {name} — ready for your review and promotion` |
| Promotion complete | `homelab-alerts` | `[AGENT DEPLOYED] {name} — now live in production` |
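
The three message formats can be generated with a small helper such as the sketch below. The event keys (`fail`, `pass`, `deployed`), the 200-character excerpt limit, and the `base_url` parameter are assumptions. `send_ntfy` relies on NTFY accepting a plain HTTP POST to `<server>/<topic>` with the message as the request body.

```python
import urllib.request
from typing import Optional

DASH = "\u2014"  # the dash character used in the message formats in the table above
TOPIC = "homelab-alerts"
MAX_EXCERPT = 200  # assumption: the playbook does not specify an excerpt length


def format_message(event: str, name: str, level: Optional[int] = None, error: str = "") -> str:
    """Build the notification text for one of the three events."""
    if event == "fail":
        excerpt = " ".join(error.split())[:MAX_EXCERPT]  # collapse newlines and whitespace
        return f"[AGENT TEST FAIL] {name} {DASH} Level {level}: {excerpt}"
    if event == "pass":
        return f"[AGENT TEST PASS] {name} {DASH} ready for your review and promotion"
    if event == "deployed":
        return f"[AGENT DEPLOYED] {name} {DASH} now live in production"
    raise ValueError(f"unknown event: {event!r}")


def send_ntfy(message: str, base_url: str, topic: str = TOPIC) -> int:
    """POST the message to the NTFY topic and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```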

---

## Known Failure Modes

*(Updated as new failures are discovered during testing)*

| ID | Type | Failure | Root cause | Fix |
|---|---|---|---|---|
| — | — | None yet — first test session will populate this | — | — |

---

## Self-Evolution Instructions

After every test session:
1. Add any new failure mode to the Known Failure Modes table with ID, type, failure, root cause, and fix
2. If an assertion at any level was too loose (passed but shouldn't have) or too strict (failed but should have passed), update the assertion pattern for that level and type
3. Increment the version number in the frontmatter
4. Note the date and what changed at the bottom of this file
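
Step 3 could be automated with a helper along these lines. It assumes the frontmatter keeps a `version: MAJOR.MINOR` line and that "increment" means bumping the minor part, which the playbook does not specify. `bump_minor_version` is a hypothetical name.

```python
import re

# Matches a frontmatter line such as "version: 1.0" (assumed MAJOR.MINOR form).
_VERSION_RE = re.compile(r"^version:[ \t]*(\d+)\.(\d+)[ \t]*$", re.MULTILINE)


def bump_minor_version(text: str) -> str:
    """Return the document text with the first version line's minor number incremented."""
    def _bump(match):
        major, minor = match.group(1), int(match.group(2))
        return f"version: {major}.{minor + 1}"

    new_text, count = _VERSION_RE.subn(_bump, text, count=1)
    if count == 0:
        raise ValueError("no 'version: MAJOR.MINOR' line found")
    return new_text
```

The minor part is treated as an integer, so `1.9` becomes `1.10` rather than `2.0`.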

**Change log:**
- v1.0 (2026-06-16): Initial methodology — four levels, user-approval promotion gate, NTFY notifications, self-evolution rule