Backtalk6858 db6cbbdec1 init: add claude-config and agent-builder context files

Initial commit tracking session context, playbooks, and automation specs
for claude-config and agent-builder Claude Code conversations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-17 23:08:23 -05:00

7.3 KiB

| name | description | type | version |
| --- | --- | --- | --- |
| Agent & Automation Testing Methodology | Mandatory testing methodology for all built automations before production promotion; covers N8N automations, Claude agents, and scripts; self-evolving document updated after every test session | project | 1.0 |

Playbook: Agent & Automation Testing Methodology

Self-evolution rule: After every test session, update this playbook — add new known failure modes, refine assertion patterns, increment the version number. The methodology improves every time something breaks in a new way.

Applies to: All automations in the automation_ideas table with types: n8n_automation, claude_agent, script

Before You Test — Required Reading Gate

| Task type | Read first |
| --- | --- |
| Testing any automation | Sandbox isolation rule · Four test levels · Type-specific section |
| Promoting to production | Promotion gate checklist |
| Adding a new failure mode | Known failure modes section + update rule |

Sandbox isolation rule (HARD): All testing happens in sandbox (server-01, 192.168.1.90). Sandbox Vault, Postgres, and N8N are test-only. No production credentials, no production data. See feedback_sandbox_isolation.md.

The Four Test Levels

Every automation must pass all four levels before promotion. Run in order — stop and log failure at the first level that fails.
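The run-in-order, stop-at-first-failure rule could be sketched roughly as follows. This is a minimal illustration, not the playbook's actual tooling: `LevelResult`, `run_levels`, and the check callables are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LevelResult:
    level: int                          # 1..4, matching the four test levels
    status: str                         # "pass" or "fail"
    error_message: Optional[str] = None


# A check returns (ok, error_message). The callables are placeholders for the
# real structure, deployment, smoke, and assertion checks.
Check = Callable[[], Tuple[bool, Optional[str]]]


def run_levels(checks: List[Check]) -> List[LevelResult]:
    """Run checks in order and stop at the first level that fails."""
    results: List[LevelResult] = []
    for level, check in enumerate(checks, start=1):
        try:
            ok, error = check()
        except Exception as exc:  # a crashing check counts as a failure
            ok, error = False, f"{type(exc).__name__}: {exc}"
        results.append(LevelResult(level, "pass" if ok else "fail", None if ok else error))
        if not ok:
            break  # do not proceed to the next level
    return results
```

Each returned `LevelResult` maps onto one row to be logged, so levels after the first failure simply produce no result.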

Level 1 — Structure Test

Does the built artifact have valid structure?

| Type | Check |
| --- | --- |
| `n8n_automation` | Workflow JSON is valid JSON; contains `nodes`, `connections`, `settings` keys; all node types exist in sandbox N8N |
| `claude_agent` | The `claude -p` call string is syntactically valid; prompt references correct tools/paths; output schema is defined |
| `script` | Python syntax check passes (`python3 -m py_compile script.py`); all imports are available in the target container image |

Pass criteria: No structural errors. Fail action: Log to agent_test_results, NTFY user, do NOT proceed to Level 2.
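As a rough sketch of the Level 1 checks for `n8n_automation` and `script`, the functions below test JSON validity, the three required keys, and Python syntax. The function names are hypothetical. The sketch does not cover the "node types exist in sandbox N8N" or "imports available in the container image" checks, which need access to the sandbox itself.

```python
import json
import py_compile
from typing import List

REQUIRED_WORKFLOW_KEYS = ("nodes", "connections", "settings")


def check_workflow_structure(raw: str) -> List[str]:
    """Return structural errors for a workflow export (empty list means pass)."""
    try:
        workflow = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(workflow, dict):
        return ["top-level JSON value is not an object"]
    return [f"missing key: {key}" for key in REQUIRED_WORKFLOW_KEYS if key not in workflow]


def check_script_syntax(path: str) -> List[str]:
    """Same check as `python3 -m py_compile <path>`, returning errors instead of exiting."""
    try:
        py_compile.compile(path, doraise=True)
    except py_compile.PyCompileError as exc:
        return [exc.msg]
    return []
```

Returning a list of errors rather than a boolean keeps the failure details available for the log entry and the notification.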

Level 2 — Deployment Test

Does it deploy to sandbox without errors?

| Type | Check |
| --- | --- |
| `n8n_automation` | `POST /api/v1/workflows` succeeds; workflow appears in sandbox N8N UI; all credentials are assigned (no empty credential IDs) |
| `claude_agent` | Container builds and starts; `docker ps` shows healthy; `claude -p "echo ok"` returns without error from within the agent's execution context |
| `script` | Docker image builds; container starts; first log line appears within 30 seconds; exit code is 0 for one-shot scripts or container stays running for daemon scripts |

Pass criteria: No deployment errors, artifact is reachable. Fail action: Log to agent_test_results, NTFY user, tear down partial deployment in sandbox.
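The "no empty credential IDs" check for `n8n_automation` might look like the sketch below. It assumes each node stores credentials as a mapping of credential type to an object with `id` and `name` fields; verify that shape against an actual export from the sandbox before relying on it. The function name is hypothetical.

```python
from typing import Any, Dict, List


def find_unassigned_credentials(workflow: Dict[str, Any]) -> List[str]:
    """List "<node name>: <credential type>" for every credential reference with an empty ID."""
    problems: List[str] = []
    for node in workflow.get("nodes", []):
        # Assumed shape: node["credentials"] == {"<type>": {"id": "...", "name": "..."}}
        for cred_type, ref in (node.get("credentials") or {}).items():
            cred_id = ref.get("id") if isinstance(ref, dict) else None
            if not cred_id:
                problems.append(f"{node.get('name', '<unnamed>')}: {cred_type}")
    return problems
```

An empty return value means every credential reference in the workflow carries an ID; nodes without a `credentials` field are skipped.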

Level 3 — Smoke Test

Does it execute without crashing on minimal input?

| Type | Check |
| --- | --- |
| `n8n_automation` | Trigger one manual execution via N8N API (`POST /api/v1/workflows/{id}/run`); execution completes with status `success` (not `error` or `crashed`) |
| `claude_agent` | Run agent with a trivial, safe test input defined in the `task_description`; agent completes without exception; output is non-empty |
| `script` | Run with `--dry-run` flag if supported, or with a clearly safe test input; exits 0; no unhandled exceptions in logs |

Pass criteria: Execution completes, no crashes, no unhandled exceptions. Fail action: Capture full execution log, log to agent_test_results, NTFY user with error excerpt.
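For the `script` type, a smoke test along these lines would cover the "exits 0, no unhandled exceptions in logs" criteria. This is a sketch under assumptions: `smoke_test` is a hypothetical helper, the 60-second timeout is an arbitrary default, and detecting unhandled exceptions by searching for the Python traceback header only works for Python scripts.

```python
import subprocess
from typing import List, Tuple


def smoke_test(command: List[str], timeout_s: float = 60.0) -> Tuple[bool, str]:
    """Run a command once; pass only if it exits 0 and logs no Python traceback."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    log = proc.stdout + proc.stderr  # full execution log, kept for the result record
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}\n{log}"
    if "Traceback (most recent call last)" in log:
        return False, f"unhandled exception in logs\n{log}"
    return True, log
```

A call such as `smoke_test(["python3", "script.py", "--dry-run"])` returns the pass/fail flag together with the captured log, which can then be stored in `execution_log`.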

Level 4 — Assertion Test

Does it produce the correct side effects?

This is the type-specific level. For each automation, the task_description must include at least one verifiable assertion. The builder agents are responsible for generating these assertions at build time.

| Type | Assertion patterns |
| --- | --- |
| `n8n_automation` | DB row was written/updated · NTFY notification received · HTTP response status was 200 · File was created at expected path |
| `claude_agent` | Output JSON contains required fields · Built artifact exists and passes the Level 1 structure check for its own type · Side-effect DB row exists |
| `script` | Expected output file exists · DB was updated · Expected log line present |

LLM output validation rule (claude_agent): Never assert an exact string match on LLM output, because outputs are non-deterministic. Assert instead on JSON schema validity, presence of required keys, value types, and side effects produced.
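A structural assertion of this kind can be sketched as below: it checks that the output parses as a JSON object and that the required keys exist with the expected types, and it ignores the wording of free-text values. The function name and the example schema are hypothetical.

```python
import json
from typing import Any, Dict, List


def validate_agent_output(raw: str, required: Dict[str, type]) -> List[str]:
    """Check that raw is a JSON object with the required keys and value types."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    errors: List[str] = []
    for key, expected in required.items():
        if key not in data:
            errors.append(f"missing required key: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"{key}: expected {expected.__name__}, got {type(data[key]).__name__}")
    return errors
```

For example, with `{"artifact_path": str, "node_count": int}` as the required mapping, any output carrying those two keys passes regardless of how the agent phrases other fields. Note that Python treats `bool` as a subclass of `int`, so a stricter check is needed if booleans must be rejected for integer fields.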

Pass criteria: All assertions defined in task_description pass. Fail action: Log which assertions failed, NTFY user with details.

Promotion Gate

When all four levels pass, the following checklist must be completed before the automation goes to production.

- [ ] All 4 test levels logged as pass in agent_test_results
- [ ] NTFY notification sent to user with test summary
- [ ] User reviews and approves (NTFY → user replies or confirms in next session)
- [ ] For n8n_automation: all sandbox credentials re-pointed to production equivalents (see project_sandbox_workflow_credential_rule.md)
- [ ] For claude_agent: production paths/URLs substituted for sandbox paths
- [ ] For script: production env vars set in Coolify; no hardcoded sandbox values
- [ ] Production deployment verified (Level 2 re-run against production)
- [ ] automation_ideas status updated to deployed
- [ ] agent_test_results promotion record written

Promotion is not automatic. User approval is required after a Level 4 pass. This is the v1.0 rule; it can be relaxed to auto-promote specific low-risk automation types once a track record is established.
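Two of the gate conditions lend themselves to a mechanical pre-check: all four levels logged as pass plus explicit user approval, and no leftover sandbox values. The sketch below is illustrative only. The function names are hypothetical, and the sandbox markers are the host name and IP from the sandbox isolation rule.

```python
from typing import Dict, List

# Sandbox host values named in the sandbox isolation rule.
SANDBOX_MARKERS = ("192.168.1.90", "server-01")


def ready_for_promotion(level_status: Dict[int, str], user_approved: bool) -> bool:
    """True only if levels 1-4 are all logged as 'pass' and the user has approved."""
    all_passed = all(level_status.get(level) == "pass" for level in (1, 2, 3, 4))
    return all_passed and user_approved


def find_sandbox_references(text: str) -> List[str]:
    """Return lines that still mention a sandbox host (hardcoded sandbox values)."""
    return [
        line.strip()
        for line in text.splitlines()
        if any(marker in line for marker in SANDBOX_MARKERS)
    ]
```

A missing or skipped level counts as not ready, which matches the requirement that all four levels be logged as pass. The remaining checklist items still need manual confirmation.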

Test Result Storage

All test results are written to the agent_test_results table (to be created in the api_business DB).

Required schema:

```sql
CREATE TABLE agent_test_results (
    id              SERIAL PRIMARY KEY,
    automation_id   INTEGER NOT NULL REFERENCES automation_ideas(id),
    test_level      INTEGER NOT NULL CHECK (test_level BETWEEN 1 AND 4),
    status          TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')),
    error_message   TEXT,
    execution_log   TEXT,
    tested_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    promoted_at     TIMESTAMPTZ,
    notes           TEXT
);
```
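One way a test harness might prepare rows for this table is sketched below. The `AgentTestResult` class and `INSERT_SQL` constant are hypothetical, and the `%s` placeholder style assumes a DB-API driver such as psycopg. The class mirrors the table's CHECK constraints so that invalid rows are rejected before they reach the database.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

VALID_STATUSES = ("pass", "fail", "skip")

# Placeholder style (%s) is an assumption; adjust it for the driver in use.
# tested_at is omitted so the column's DEFAULT now() applies.
INSERT_SQL = (
    "INSERT INTO agent_test_results "
    "(automation_id, test_level, status, error_message, execution_log, notes) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


@dataclass(frozen=True)
class AgentTestResult:
    automation_id: int
    test_level: int
    status: str
    error_message: Optional[str] = None
    execution_log: Optional[str] = None
    notes: Optional[str] = None

    def __post_init__(self) -> None:
        # Mirror the CHECK constraints from the schema.
        if not 1 <= self.test_level <= 4:
            raise ValueError(f"test_level must be 1-4, got {self.test_level}")
        if self.status not in VALID_STATUSES:
            raise ValueError(f"status must be one of {VALID_STATUSES}, got {self.status!r}")

    def as_params(self) -> Tuple:
        """Parameters in the same column order as INSERT_SQL."""
        return (self.automation_id, self.test_level, self.status,
                self.error_message, self.execution_log, self.notes)
```

With a DB-API cursor, the row would then be written with `cursor.execute(INSERT_SQL, result.as_params())`.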

NTFY Notification Patterns

| Event | Topic | Message format |
| --- | --- | --- |
| Level fail | `homelab-alerts` | `[AGENT TEST FAIL] {name} — Level {N}: {error excerpt}` |
| All levels pass | `homelab-alerts` | `[AGENT TEST PASS] {name} — ready for your review and promotion` |
| Promotion complete | `homelab-alerts` | `[AGENT DEPLOYED] {name} — now live in production` |
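The three message formats could be produced by small helpers like the ones below, with a sender that POSTs the message body to the topic URL (the basic ntfy publish pattern). The helper names are hypothetical, and `base_url` stands in for whatever ntfy server the homelab uses, which this document does not specify.

```python
import urllib.request

TOPIC = "homelab-alerts"


def fail_message(name: str, level: int, error_excerpt: str) -> str:
    return f"[AGENT TEST FAIL] {name} — Level {level}: {error_excerpt}"


def pass_message(name: str) -> str:
    return f"[AGENT TEST PASS] {name} — ready for your review and promotion"


def deployed_message(name: str) -> str:
    return f"[AGENT DEPLOYED] {name} — now live in production"


def send_ntfy(base_url: str, message: str, topic: str = TOPIC) -> int:
    """POST the message body to <base_url>/<topic> and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the formatting separate from the sending makes the message strings easy to assert on in tests without touching the network.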

Known Failure Modes

(Updated as new failures are discovered during testing)

| ID | Type | Failure | Root cause | Fix |
| --- | --- | --- | --- | --- |
| - | - | None yet; the first test session will populate this | - | - |

Self-Evolution Instructions

After every test session:

1. Add any new failure mode to the Known Failure Modes table with ID, type, failure, root cause, and fix.
2. If a level's assertion was too loose (passed but shouldn't have) or too strict (failed but should have passed), update the assertion pattern for that level and type.
3. Increment the version number in the frontmatter.
4. Note the date and what changed in the change log at the bottom of this file.

Change log:

v1.0 (2026-06-16): Initial methodology — four levels, user-approval promotion gate, NTFY notifications, self-evolution rule
