| name | description | type | version |
|---|---|---|---|
| Agent & Automation Testing Methodology | Mandatory testing methodology for all built automations before production promotion — covers N8N automations, claude agents, and scripts; self-evolving document updated after every test session | project | 1.0 |
Playbook: Agent & Automation Testing Methodology
Self-evolution rule: After every test session, update this playbook — add new known failure modes, refine assertion patterns, increment the version number. The methodology improves every time something breaks in a new way.
Applies to: All automations in the automation_ideas table with types: n8n_automation, claude_agent, script
Before You Test — Required Reading Gate
| Task type | Read first |
|---|---|
| Testing any automation | Sandbox isolation rule · Four test levels · Type-specific section |
| Promoting to production | Promotion gate checklist |
| Adding a new failure mode | Known failure modes section + update rule |
Sandbox isolation rule (HARD): All testing happens in sandbox (server-01, 192.168.1.90). Sandbox Vault, Postgres, and N8N are test-only. No production credentials, no production data. See feedback_sandbox_isolation.md.
The Four Test Levels
Every automation must pass all four levels before promotion. Run in order — stop and log failure at the first level that fails.
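The run-in-order, stop-at-first-failure rule can be sketched as a small runner. This is a minimal illustration, not the playbook's actual harness: `run_levels` and the `(passed, error)` return shape are assumptions, and the result keys simply mirror column names from the `agent_test_results` schema.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shape: each level check returns (passed, error_message).
LevelCheck = Callable[[], Tuple[bool, str]]


def run_levels(checks: List[LevelCheck]) -> List[Dict]:
    """Run level checks in order (Level 1 first) and stop at the first failure."""
    results = []
    for level, check in enumerate(checks, start=1):
        try:
            passed, error = check()
        except Exception as exc:  # a check that crashes counts as a failure
            passed, error = False, f"{type(exc).__name__}: {exc}"
        results.append({
            "test_level": level,
            "status": "pass" if passed else "fail",
            "error_message": None if passed else error,
        })
        if not passed:
            break  # later levels are not run
    return results
```

Passing four callables, one per level, yields at most four result rows; a failure at Level 2 yields exactly two.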
Level 1 — Structure Test
Does the built artifact have valid structure?
| Type | Check |
|---|---|
| `n8n_automation` | Workflow JSON is valid JSON; contains `nodes`, `connections`, `settings` keys; all node types exist in sandbox N8N |
| `claude_agent` | The `claude -p` call string is syntactically valid; prompt references correct tools/paths; output schema is defined |
| `script` | Python syntax check passes (`python3 -m py_compile script.py`); all imports are available in the target container image |
Pass criteria: No structural errors. Fail action: Log to agent_test_results, NTFY user, do NOT proceed to Level 2.
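As one possible shape for the `n8n_automation` structure check, the sketch below validates the JSON and the three required keys. The function name is hypothetical, and the set of node types available in sandbox N8N is assumed to be supplied by the caller, since this playbook does not specify how to obtain it.

```python
import json
from typing import List, Set

REQUIRED_KEYS = ("nodes", "connections", "settings")


def check_workflow_structure(raw: str, known_node_types: Set[str]) -> List[str]:
    """Return a list of structural errors; an empty list means Level 1 passes."""
    try:
        workflow = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(workflow, dict):
        return ["top-level value is not an object"]

    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in workflow]

    nodes = workflow.get("nodes", [])
    if not isinstance(nodes, list):
        return errors + ["nodes is not a list"]
    for node in nodes:
        node_type = node.get("type") if isinstance(node, dict) else None
        # known_node_types is an assumption: the caller provides the sandbox's list
        if node_type not in known_node_types:
            errors.append(f"unknown node type: {node_type}")
    return errors
```

Returning a list of errors rather than a boolean makes it straightforward to copy the details into `error_message` when logging the failure.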
Level 2 — Deployment Test
Does it deploy to sandbox without errors?
| Type | Check |
|---|---|
| `n8n_automation` | `POST /api/v1/workflows` succeeds; workflow appears in sandbox N8N UI; all credentials are assigned (no empty credential IDs) |
| `claude_agent` | Container builds and starts; `docker ps` shows healthy; `claude -p "echo ok"` returns without error from within the agent's execution context |
| `script` | Docker image builds; container starts; first log line appears within 30 seconds; exit code is 0 for one-shot scripts or container stays running for daemon scripts |
Pass criteria: No deployment errors, artifact is reachable. Fail action: Log to agent_test_results, NTFY user, tear down partial deployment in sandbox.
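The "no empty credential IDs" check might look like the sketch below. It assumes each node carries a `credentials` mapping of the form `{"<credential type>": {"id": ..., "name": ...}}`; verify that shape against the workflow JSON the sandbox actually returns before relying on it. The function name is hypothetical.

```python
from typing import Dict, List


def find_unassigned_credentials(workflow: Dict) -> List[str]:
    """List "<node name>: <credential type>" for every credential with no ID."""
    problems = []
    for node in workflow.get("nodes", []):
        # Assumed layout: node["credentials"] maps credential type -> {"id", "name"}
        for cred_type, ref in (node.get("credentials") or {}).items():
            cred_id = ref.get("id") if isinstance(ref, dict) else None
            if not cred_id:
                problems.append(f"{node.get('name', '<unnamed>')}: {cred_type}")
    return problems
```

Nodes without a `credentials` key are treated as needing none, so they never appear in the output.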
Level 3 — Smoke Test
Does it execute without crashing on minimal input?
| Type | Check |
|---|---|
| `n8n_automation` | Trigger one manual execution via N8N API (`POST /api/v1/workflows/{id}/run`); execution completes with status `success` (not `error` or `crashed`) |
| `claude_agent` | Run agent with a trivial, safe test input defined in the `task_description`; agent completes without exception; output is non-empty |
| `script` | Run with `--dry-run` flag if supported, or with a clearly safe test input; exits 0; no unhandled exceptions in logs |
Pass criteria: Execution completes, no crashes, no unhandled exceptions. Fail action: Capture full execution log, log to agent_test_results, NTFY user with error excerpt.
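For the `script` type, a smoke test could be sketched with the standard library's `subprocess` module. The 60-second timeout and the traceback marker are assumptions made for illustration; the marker only detects unhandled Python exceptions that were logged while the process still exited 0.

```python
import subprocess
import sys
from typing import List, Tuple

TRACEBACK_MARKER = "Traceback (most recent call last)"


def smoke_test(cmd: List[str], timeout_s: float = 60.0) -> Tuple[bool, str]:
    """Run cmd once; pass only if it exits 0 and logs no unhandled exception."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    log = proc.stdout + proc.stderr  # full log, kept for the failure record
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}\n{log}"
    if TRACEBACK_MARKER in log:
        return False, f"unhandled exception in logs\n{log}"
    return True, log


if __name__ == "__main__":
    # Stand-in for a real script invocation such as ["python3", "script.py", "--dry-run"]
    print(smoke_test([sys.executable, "-c", "print('dry run ok')"]))
```

The second element of the returned tuple doubles as the execution log to capture on failure.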
Level 4 — Assertion Test
Does it produce the correct side effects?
This is the type-specific level. For each automation, the task_description must include at least one verifiable assertion. The builder agents are responsible for generating these assertions at build time.
| Type | Assertion patterns |
|---|---|
| `n8n_automation` | DB row was written/updated · NTFY notification received · HTTP response status was 200 · File was created at expected path |
| `claude_agent` | Output JSON contains required fields · Built artifact exists and passes its own Level 1 structure check · Side-effect DB row exists |
| `script` | Expected output file exists · DB was updated · Expected log line present |
LLM output validation rule (claude_agent): Never assert exact string match on LLM output — outputs are non-deterministic. Assert on: JSON schema validity, presence of required keys, value types, side effects produced.
Pass criteria: All assertions defined in task_description pass. Fail action: Log which assertions failed, NTFY user with details.
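The LLM output validation rule can be illustrated with a check on keys and value types rather than exact strings. This is a minimal sketch: `validate_agent_output` and the field names in the usage example are hypothetical, and a real assertion set would come from the automation's `task_description`.

```python
import json
from typing import Dict, List


def validate_agent_output(raw: str, required: Dict[str, type]) -> List[str]:
    """Check JSON validity, required keys, and value types; never exact text."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    for key, expected in required.items():
        if key not in data:
            failures.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            actual = type(data[key]).__name__
            failures.append(f"{key}: expected {expected.__name__}, got {actual}")
    return failures


if __name__ == "__main__":
    schema = {"artifact_path": str, "steps": list}  # hypothetical required fields
    print(validate_agent_output('{"artifact_path": "/tmp/x", "steps": []}', schema))
```

Two runs of the same agent can word their output differently and still pass, which is the point of asserting on structure and side effects.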
Promotion Gate
When all four levels pass, the following checklist must be completed before the automation goes to production.
- All 4 test levels logged as `pass` in `agent_test_results`
- NTFY notification sent to user with test summary
- User reviews and approves (NTFY → user replies or confirms in next session)
- For `n8n_automation`: all sandbox credentials re-pointed to production equivalents (see `project_sandbox_workflow_credential_rule.md`)
- For `claude_agent`: production paths/URLs substituted for sandbox paths
- For `script`: production env vars set in Coolify; no hardcoded sandbox values
- Production deployment verified (Level 2 re-run against production)
- `automation_ideas` status updated to `deployed`
- `agent_test_results` promotion record written
Promotion is not automatic. User approval is required after Level 4 passes. This is the v1.0 rule; it can be relaxed to auto-promote specific low-risk automation types once a track record is established.
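The two preconditions that can be checked mechanically (all four levels passed, user approved) might be sketched as follows. The function name is hypothetical, rows are assumed to be ordered oldest to newest, and the remaining checklist items (credential re-pointing, production re-run of Level 2) are deliberately out of scope here.

```python
from typing import Dict, List


def ready_for_promotion(results: List[Dict], user_approved: bool) -> bool:
    """True only if the latest result for every level 1-4 is 'pass' and the user approved."""
    latest = {}
    for row in results:  # assumed ordered oldest to newest
        latest[row["test_level"]] = row["status"]
    all_passed = all(latest.get(level) == "pass" for level in range(1, 5))
    return all_passed and user_approved
```

Because only the most recent status per level counts, a later failure at any level blocks promotion even if that level passed in an earlier session.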
Test Result Storage
All test results are written to the agent_test_results table (to be created in the api_business DB).
Required schema:
```sql
CREATE TABLE agent_test_results (
    id SERIAL PRIMARY KEY,
    automation_id INTEGER NOT NULL REFERENCES automation_ideas(id),
    test_level INTEGER NOT NULL CHECK (test_level BETWEEN 1 AND 4),
    status TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')),
    error_message TEXT,
    execution_log TEXT,
    tested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    promoted_at TIMESTAMPTZ,
    notes TEXT
);
```
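A result row could be prepared along the lines below. This sketch only builds a parameterized statement and mirrors the table's CHECK constraints on the client side; the `%s` placeholder style assumes a psycopg-style Postgres driver, and `build_result_insert` is a hypothetical helper, not part of any existing tooling.

```python
from typing import Optional, Tuple

VALID_STATUSES = ("pass", "fail", "skip")

INSERT_SQL = (
    "INSERT INTO agent_test_results "
    "(automation_id, test_level, status, error_message, execution_log, notes) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def build_result_insert(
    automation_id: int,
    test_level: int,
    status: str,
    error_message: Optional[str] = None,
    execution_log: Optional[str] = None,
    notes: Optional[str] = None,
) -> Tuple[str, tuple]:
    """Return (sql, params) for one row; tested_at is left to the column default."""
    if not 1 <= test_level <= 4:
        raise ValueError(f"test_level must be 1-4, got {test_level}")
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    params = (automation_id, test_level, status, error_message, execution_log, notes)
    return INSERT_SQL, params
```

Validating before the round trip gives a clearer error than a constraint violation from the database, while the CHECK constraints remain the source of truth.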
NTFY Notification Patterns
| Event | Topic | Message format |
|---|---|---|
| Level fail | homelab-alerts | `[AGENT TEST FAIL] {name} — Level {N}: {error excerpt}` |
| All levels pass | homelab-alerts | `[AGENT TEST PASS] {name} — ready for your review and promotion` |
| Promotion complete | homelab-alerts | `[AGENT DEPLOYED] {name} — now live in production` |
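The three message formats can be produced by a small helper, sketched below. The event keys (`fail`, `pass`, `deployed`), the 200-character excerpt limit, and the `base_url` parameter are assumptions; `publish` relies on ntfy accepting a plain HTTP POST of the message body to `<base_url>/<topic>`.

```python
import urllib.request

TOPIC = "homelab-alerts"
MAX_EXCERPT = 200  # assumed limit for the error excerpt


def format_message(event: str, name: str, level: int = 0, error: str = "") -> str:
    """Build the notification text for one of the three events in the table."""
    if event == "fail":
        return f"[AGENT TEST FAIL] {name} — Level {level}: {error[:MAX_EXCERPT]}"
    if event == "pass":
        return f"[AGENT TEST PASS] {name} — ready for your review and promotion"
    if event == "deployed":
        return f"[AGENT DEPLOYED] {name} — now live in production"
    raise ValueError(f"unknown event: {event}")


def publish(base_url: str, message: str, topic: str = TOPIC) -> int:
    """POST the message to the ntfy topic and return the HTTP status code."""
    url = f"{base_url.rstrip('/')}/{topic}"
    req = urllib.request.Request(url, data=message.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Keeping formatting separate from publishing lets the message text be asserted in tests without a reachable NTFY server.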
Known Failure Modes
(Updated as new failures are discovered during testing)
| ID | Type | Failure | Root cause | Fix |
|---|---|---|---|---|
| — | — | None yet — first test session will populate this | — | — |
Self-Evolution Instructions
After every test session:
- Add any new failure mode to the Known Failure Modes table with ID, type, failure, root cause, and fix
- If a Level assertion was too loose (passed but shouldn't have) or too strict (failed but should have passed), update the assertion pattern for that level and type
- Increment the version number in the frontmatter
- Note the date and what changed at the bottom of this file
Change log:
- v1.0 (2026-06-16): Initial methodology — four levels, user-approval promotion gate, NTFY notifications, self-evolution rule