db6cbbdec1
Initial commit tracking session context, playbooks, and automation specs for claude-config and agent-builder Claude Code conversations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
157 lines
7.3 KiB
Markdown
157 lines
7.3 KiB
Markdown
---
|
|
name: Agent & Automation Testing Methodology
|
|
description: Mandatory testing methodology for all built automations before production promotion — covers N8N automations, claude agents, and scripts; self-evolving document updated after every test session
|
|
type: project
|
|
version: 1.0
|
|
---
|
|
|
|
# Playbook: Agent & Automation Testing Methodology
|
|
|
|
**Self-evolution rule:** After every test session, update this playbook — add new known failure modes, refine assertion patterns, increment the version number. The methodology improves every time something breaks in a new way.
|
|
|
|
**Applies to:** All automations in the `automation_ideas` table with types: `n8n_automation`, `claude_agent`, `script`
|
|
|
|
---
|
|
|
|
## Before You Test — Required Reading Gate
|
|
|
|
| Task type | Read first |
|
|
|---|---|
|
|
| Testing any automation | Sandbox isolation rule · Four test levels · Type-specific section |
|
|
| Promoting to production | Promotion gate checklist |
|
|
| Adding a new failure mode | Known failure modes section + update rule |
|
|
|
|
**Sandbox isolation rule (HARD):** All testing happens in sandbox (server-01, 192.168.1.90). Sandbox Vault, Postgres, and N8N are test-only. No production credentials, no production data. See `feedback_sandbox_isolation.md`.
|
|
|
|
---
|
|
|
|
## The Four Test Levels
|
|
|
|
Every automation must pass all four levels before promotion. Run in order — stop and log failure at the first level that fails.
|
|
|
|
### Level 1 — Structure Test
|
|
Does the built artifact have valid structure?
|
|
|
|
| Type | Check |
|
|
|---|---|
|
|
| `n8n_automation` | Workflow JSON is valid JSON; contains `nodes`, `connections`, `settings` keys; all node types exist in sandbox N8N |
|
|
| `claude_agent` | The `claude -p` call string is syntactically valid; prompt references correct tools/paths; output schema is defined |
|
|
| `script` | Python syntax check passes (`python3 -m py_compile script.py`); all imports are available in the target container image |
|
|
|
|
**Pass criteria:** No structural errors. **Fail action:** Log to `agent_test_results`, NTFY user, do NOT proceed to Level 2.
|
|
|
|
---
|
|
|
|
### Level 2 — Deployment Test
|
|
Does it deploy to sandbox without errors?
|
|
|
|
| Type | Check |
|
|
|---|---|
|
|
| `n8n_automation` | `POST /api/v1/workflows` succeeds; workflow appears in sandbox N8N UI; all credentials are assigned (no empty credential IDs) |
|
|
| `claude_agent` | Container builds and starts; `docker ps` shows healthy; `claude -p "echo ok"` returns without error from within the agent's execution context |
|
|
| `script` | Docker image builds; container starts; first log line appears within 30 seconds; exit code is 0 for one-shot scripts or container stays running for daemon scripts |
|
|
|
|
**Pass criteria:** No deployment errors, artifact is reachable. **Fail action:** Log to `agent_test_results`, NTFY user, tear down partial deployment in sandbox.
|
|
|
|
---
|
|
|
|
### Level 3 — Smoke Test
|
|
Does it execute without crashing on minimal input?
|
|
|
|
| Type | Check |
|
|
|---|---|
|
|
| `n8n_automation` | Trigger one manual execution via N8N API (`POST /api/v1/workflows/{id}/run`); execution completes with status `success` (not `error` or `crashed`) |
|
|
| `claude_agent` | Run agent with a trivial, safe test input defined in the `task_description`; agent completes without exception; output is non-empty |
|
|
| `script` | Run with `--dry-run` flag if supported, or with a clearly safe test input; exits 0; no unhandled exceptions in logs |
|
|
|
|
**Pass criteria:** Execution completes, no crashes, no unhandled exceptions. **Fail action:** Capture full execution log, log to `agent_test_results`, NTFY user with error excerpt.
|
|
|
|
---
|
|
|
|
### Level 4 — Assertion Test
|
|
Does it produce the correct side effects?
|
|
|
|
This is the type-specific level. For each automation, the `task_description` must include at least one verifiable assertion. The builder agents are responsible for generating these assertions at build time.
|
|
|
|
| Type | Assertion patterns |
|
|
|---|---|
|
|
| `n8n_automation` | DB row was written/updated · NTFY notification received · HTTP response status was 200 · File was created at expected path |
|
|
| `claude_agent` | Output JSON contains required fields · Built artifact exists and passes Level 1 structure check of the artifact it built · Side-effect DB row exists |
|
|
| `script` | Expected output file exists · DB was updated · Expected log line present |
|
|
|
|
**LLM output validation rule (claude_agent):** Never assert exact string match on LLM output — outputs are non-deterministic. Assert on: JSON schema validity, presence of required keys, value types, side effects produced.
|
|
|
|
**Pass criteria:** All assertions defined in `task_description` pass. **Fail action:** Log which assertions failed, NTFY user with details.
|
|
|
|
---
|
|
|
|
## Promotion Gate
|
|
|
|
When all four levels pass, the following checklist must be completed before the automation goes to production.
|
|
|
|
- [ ] All 4 test levels logged as `pass` in `agent_test_results`
|
|
- [ ] NTFY notification sent to user with test summary
|
|
- [ ] **User reviews and approves** (NTFY → user replies or confirms in next session)
|
|
- [ ] For `n8n_automation`: all sandbox credentials re-pointed to production equivalents (see `project_sandbox_workflow_credential_rule.md`)
|
|
- [ ] For `claude_agent`: production paths/URLs substituted for sandbox paths
|
|
- [ ] For `script`: production env vars set in Coolify; no hardcoded sandbox values
|
|
- [ ] Production deployment verified (Level 2 re-run against production)
|
|
- [ ] `automation_ideas` status updated to `deployed`
|
|
- [ ] `agent_test_results` promotion record written
|
|
|
|
**Promotion is not automatic.** User approval is required after Level 4 pass. This is the v1.0 rule — can be relaxed to auto-promote for specific low-risk automation types after track record is established.
|
|
|
|
---
|
|
|
|
## Test Result Storage
|
|
|
|
All test results write to `agent_test_results` table (to be created in `api_business` DB).
|
|
|
|
**Required schema:**
|
|
```sql
|
|
CREATE TABLE agent_test_results (
|
|
id SERIAL PRIMARY KEY,
|
|
automation_id INTEGER NOT NULL REFERENCES automation_ideas(id),
|
|
test_level INTEGER NOT NULL CHECK (test_level BETWEEN 1 AND 4),
|
|
status TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')),
|
|
error_message TEXT,
|
|
execution_log TEXT,
|
|
tested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
|
promoted_at TIMESTAMPTZ,
|
|
notes TEXT
|
|
);
|
|
```
|
|
|
|
---
|
|
|
|
## NTFY Notification Patterns
|
|
|
|
| Event | Topic | Message format |
|
|
|---|---|---|
|
|
| Level fail | `homelab-alerts` | `[AGENT TEST FAIL] {name} — Level {N}: {error excerpt}` |
|
|
| All levels pass | `homelab-alerts` | `[AGENT TEST PASS] {name} — ready for your review and promotion` |
|
|
| Promotion complete | `homelab-alerts` | `[AGENT DEPLOYED] {name} — now live in production` |
|
|
|
|
---
|
|
|
|
## Known Failure Modes
|
|
|
|
*(Updated as new failures are discovered during testing)*
|
|
|
|
| ID | Type | Failure | Root cause | Fix |
|
|
|---|---|---|---|---|
|
|
| — | — | None yet — first test session will populate this | — | — |
|
|
|
|
---
|
|
|
|
## Self-Evolution Instructions
|
|
|
|
After every test session:
|
|
1. Add any new failure mode to the Known Failure Modes table with ID, type, root cause, and fix
|
|
2. If a Level assertion was too loose (passed but shouldn't have) or too strict (failed but should have passed), update the assertion pattern for that level and type
|
|
3. Increment the version number in the frontmatter
|
|
4. Note the date and what changed at the bottom of this file
|
|
|
|
**Change log:**
|
|
- v1.0 (2026-06-16): Initial methodology — four levels, user-approval promotion gate, NTFY notifications, self-evolution rule
|