init: add claude-config and agent-builder context files

Initial commit tracking session context, playbooks, and automation specs
for claude-config and agent-builder Claude Code conversations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Backtalk6858
2026-06-17 23:08:23 -05:00
commit db6cbbdec1
7 changed files with 732 additions and 0 deletions
@@ -0,0 +1,73 @@
project_name: agent-builder
# Agent Builder — Session Context
## What this project does
Design, build, and test autonomous N8N agents on server-01 sandbox before any production promotion.
First two agents: Agent Builder Agent + N8N Builder Agent.
## Scheduled work (2026-06-16, running behind — started ~6:38 PM)
1. Vision Alignment Grill-Me — Agent Builder + N8N Builder vision + testing methodology
2. Agent Builder Agent — Deploy + Test (server-01 sandbox)
3. N8N Builder Agent — Deploy + Test (server-01 sandbox)
## Architecture
- Agents run as N8N workflows on server-01 (n8n-sandbox, port 5679)
- Sandbox-first: all agents tested in sandbox before any production promotion
- server-01 sandbox stack: n8n-sandbox, postgres-sandbox, vault-sandbox, bitwarden-bridge-sandbox, vaultwarden-sandbox
- Sandbox N8N API key: prod Vault at secret/sandbox/n8n
- Sandbox reachable at 192.168.1.90
## Key decisions (set during vision grill-me — 2026-06-16)
- Agent Builder Agent: builds `claude_agent` and `script` types — Ollama (llama3.1:8b) does the building, `claude -p` is overseer/validator
- N8N Builder Agent: builds `n8n_automation` types — Ollama generates workflow JSON, imports via N8N API, assigns credentials
- automation_ideas schema changes needed: rename `description` → `task_description` (full structured spec), add `type` (n8n_automation/claude_agent/script), add `builder_status`
- New `agent_test_results` table needed in api_business DB
- Sandbox must mirror production: AppRole, Vaultwarden, bridge all configured before any agent deploys
- Promotion = user approval required after all 4 test levels pass (not auto-promote in v1)
- Dedicated backfill session needed for all 48 existing automation_ideas rows (type + task_description)
- claude -p uses SDK credits (Pro = $20/month hard limit) — use sparingly, Ollama does the heavy lifting
- Local model: llama3.1:8b already pulled on server-01 (4.9GB, fits in RTX 2060 Super 8GB VRAM)
## Testing methodology
- Four levels: Structure → Deployment → Smoke → Assertion
- LLM outputs validated on structure/side-effects only, never exact string match
- All results logged to agent_test_results table
- NTFY notification on pass and fail
- Full methodology: .claude/playbook_testing_methodology.md
## Agents
### Agent Builder Agent
- Status: pending — prereqs not complete
- Purpose: Receives automation spec from automation_ideas DB, uses Ollama to build claude_agent or script type automations, deploys to sandbox, runs automated tests, notifies user for promotion approval
- Builds: claude agents (via claude -p) and Python scripts (Docker containers)
### N8N Builder Agent
- Status: pending — prereqs not complete
- Purpose: Receives automation spec from automation_ideas DB, uses Ollama to generate N8N workflow JSON using n8n_automations playbook as context, imports to sandbox N8N via API, assigns credentials, runs automated tests
- Will be used to build: id=12 (Media Pipeline Learning), id=7 (Friday Research Session Prep)
## Related personal_projects DB rows
- id=4: N8N Workflow Builder Script (pending, weekend_block1)
## Prereq checklist (must complete before any agent deployment)
- [x] Schema: rename automation_ideas.description → task_description, add type, add builder_status, add priority
- [x] Create agent_test_results table in api_business
- [x] Sandbox Vault: set up AppRole auth method (credentials at /opt/appdata/docker/docker-compose/vault/approle/ on server-01)
- [x] Sandbox Vault: store sandbox N8N API key at secret/sandbox/n8n (key name: claude-sandbox, verified working)
- [x] Verify sandbox Bitwarden bridge ↔ Vaultwarden sandbox end-to-end (bridge on port 8080, returns [] for empty vault — correct)
- [x] Write Agent Builder Agent playbook → .claude/playbook_agent_builder_agent.md
- [x] Write N8N Builder Agent playbook → .claude/playbook_n8n_builder_agent.md
- [~] Backfill session: resume from priority 28 (id=34, CalDAV Auto-Refresh Trigger) next session. ~30 automations remain.
  - Reviewed this session: ids 28, 40, 53, 4, 7, 5, 14, 11, 44, 33, 29, 20, 41, 12, 1, 19, 39, 6.
  - Key changes:
    - id=7 and id=50 merged into id=18 (full 10-stage business pipeline)
    - id=18 expanded with Business Research Agent + Development Agent + parked idea email-to-Tyler flow
    - id=14 and id=20 blocked (already built by schedule workflows)
    - id=12, id=1, and id=41 blocked (redundant)
    - id=19 blocked (pending Jenkins)
    - id=5 blocked (Obsidian vault not set up yet)
    - id=40 pending with Jenkins conditional note
  - New rows added: id=54 (NTFY Topic Provisioner, p54), id=55 (Business Research Agent, p51), id=56 (Business Development Agent, p52), id=57 (Sandbox Environment Deployment Completion, p9). NOTE: id=57 conflicts with id=28 at priority 9; fix next session.
  - Calendar events pushed to Nextcloud: Thu 6/18 12:45-3PM backfill + 3-4:30PM readiness, Fri 6/19 12:45-4:30PM agent builds.
## Final readiness check items (scheduled June 18)
- All 8 prereq checklist items verified complete
- Sandbox mirrors production: AppRole, bridge, Vaultwarden all confirmed functional
- **Sensitive output interception system** — design and implement before any agent goes live:
  - Agents must scan their own stdout/logs before writing/sending output and redact anything matching secret patterns (tokens, keys, passwords, API keys)
  - Pattern list at minimum: `hvs\.`, `eyJ`, bearer tokens, anything from known env var names (BRIDGE_API_KEY, VAULT_TOKEN, N8N_ENCRYPTION_KEY, etc.)
  - Root cause: `docker inspect --format '{{range .Config.Env}}...'` dumps all env vars including secrets; agents will reach for broad diagnostic commands without filtering, local models even more so
  - Production exposure is a serious risk; sandbox exposure is acceptable but still undesirable
  - This system needs to exist at the agent level (not just Claude Code rules) because once agents run autonomously the user will not be watching
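A minimal sketch of the redaction pass described above, assuming agents can route their output through a single filter before it is written or sent. The function name `redact`, the `[REDACTED]` marker, and the exact regexes are illustrative assumptions; only the pattern families (`hvs\.`, `eyJ`, bearer tokens, known env var names) come from the list above.

```python
import re

# Env var names from the list above; the source says "etc.", so extend as needed.
SECRET_ENV_NAMES = ("BRIDGE_API_KEY", "VAULT_TOKEN", "N8N_ENCRYPTION_KEY")
REDACTED = "[REDACTED]"  # hypothetical marker

# (pattern, replacement) pairs. These deliberately over-match: redacting a
# harmless string is cheaper than leaking a secret.
_PATTERNS = [
    (re.compile(r"hvs\.[A-Za-z0-9_.\-]+"), REDACTED),             # Vault-style tokens
    (re.compile(r"eyJ[A-Za-z0-9_.\-]+"), REDACTED),               # JWT-like blobs
    (re.compile(r"(?i)\b(bearer\s+)\S+"), r"\1" + REDACTED),      # bearer tokens
    (re.compile(r"\b(%s)=\S+" % "|".join(SECRET_ENV_NAMES)),      # NAME=value pairs
     r"\1=" + REDACTED),
]


def redact(text: str) -> str:
    """Return text with anything matching a known secret pattern replaced."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Applying the filter at the last step before output (log write, NTFY body, DB insert) keeps it independent of which diagnostic command produced the text.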
## Update instructions
Update at the end of every agent-builder session. Keep agent status, key decisions, and prereq checklist current.
@@ -0,0 +1,154 @@
# Playbook: Agent Builder Agent
## Purpose
Builds `claude_agent` and `script` type automations from the `automation_ideas` table. Uses Ollama (llama3.1:8b) as the primary code generator and `claude -p` as the overseer/validator. Deploys to sandbox, runs all 4 test levels, notifies user for promotion approval.
## Trigger
Scheduled or manual. Queries `automation_ideas` for the next row where:
- `type IN ('claude_agent', 'script')`
- `status = 'ready_to_build'`
- `builder_status = 'not_started'`
- Ordered by `priority ASC NULLS LAST, id ASC`
Only processes one automation per run.
## Infrastructure
- Runs on: server-01 (n8n-sandbox, port 5679)
- Ollama endpoint: http://localhost:11434 (server-01 local)
- Model: llama3.1:8b
- Claude overseer: `claude -p` (non-interactive, SDK credits — use sparingly)
- Vault: sandbox AppRole at /opt/appdata/docker/docker-compose/vault/approle/
- Database: production api_business (read automation_ideas, write agent_test_results)
- NTFY: production NTFY instance for notifications
## Step-by-Step
### Step 1 — Claim the automation
```sql
UPDATE automation_ideas
SET builder_status = 'queued'
WHERE id = <selected_id> AND builder_status = 'not_started';
```
If 0 rows updated: another builder claimed it — stop, notify, exit.
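The claim works because the `WHERE` clause makes the update conditional, so only one builder can move a row out of `not_started`. A small sketch of the row-count check follows; it uses an in-memory SQLite database purely as a stand-in for the `api_business` Postgres DB, and the helper name `claim` is an assumption.

```python
import sqlite3


def claim(conn, automation_id):
    """Return True only if this caller moved the row from not_started to queued."""
    cur = conn.execute(
        "UPDATE automation_ideas SET builder_status = 'queued' "
        "WHERE id = ? AND builder_status = 'not_started'",
        (automation_id,),
    )
    conn.commit()
    # rowcount is 0 when another builder already claimed the row.
    return cur.rowcount == 1


# Demo: SQLite stands in for Postgres here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE automation_ideas (id INTEGER PRIMARY KEY, builder_status TEXT)")
conn.execute("INSERT INTO automation_ideas VALUES (34, 'not_started')")
first = claim(conn, 34)   # this run wins the claim
second = claim(conn, 34)  # a second attempt finds nothing to update
```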
### Step 2 — Build the prompt for Ollama
Construct a generation prompt using all available fields from the automation row:
- `name`: what the automation is called
- `task_description`: full structured spec — this is the primary instruction
- `type`: claude_agent or script
- `infrastructure_requirement`: what infra it needs access to
Prompt structure:
```
You are an expert automation engineer. Build a {type} automation with the following specification.
Name: {name}
Infrastructure: {infrastructure_requirement}
Specification:
{task_description}
Requirements:
- If type is claude_agent: output a complete shell-executable claude -p command with full system prompt and all logic. The agent must be self-contained.
- If type is script: output a complete Python script. Include a Dockerfile if the script has dependencies beyond stdlib.
- Output ONLY the code. No explanation, no markdown fences, no commentary.
- The code must handle its own error cases and log to stdout.
- Secrets must be fetched from Vault via AppRole — never hardcoded. AppRole credentials at /opt/appdata/docker/docker-compose/vault/approle/role-id and secret-id.
```
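One way the prompt assembly could look, as a sketch rather than the playbook's actual implementation. The helper name `build_prompt` and the choice to reject rows with missing fields are assumptions; the field names and header lines come from the structure above, and the requirements text is passed in unchanged.

```python
REQUIRED_FIELDS = ("name", "task_description", "type", "infrastructure_requirement")
ALLOWED_TYPES = ("claude_agent", "script")


def build_prompt(row, requirements):
    """Assemble the generation prompt from an automation_ideas row (a dict)."""
    missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
    if missing:
        raise ValueError("automation row is missing: " + ", ".join(missing))
    if row["type"] not in ALLOWED_TYPES:
        raise ValueError("unsupported type for this builder: " + row["type"])
    return (
        f"You are an expert automation engineer. Build a {row['type']} automation "
        "with the following specification.\n"
        f"Name: {row['name']}\n"
        f"Infrastructure: {row['infrastructure_requirement']}\n"
        "Specification:\n"
        f"{row['task_description']}\n"
        "Requirements:\n"
        f"{requirements}\n"
    )
```

Failing early on an incomplete row avoids spending a generation call on a spec the model cannot satisfy.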
### Step 3 — Generate with Ollama
```
POST http://localhost:11434/api/generate
{
"model": "llama3.1:8b",
"prompt": "<constructed prompt>",
"stream": false
}
```
Set builder_status = 'building' before calling.
If Ollama call fails or times out (>120s): set builder_status = 'failed', log error, notify via NTFY, stop.
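A sketch of the generation call using only the standard library. The function names are assumptions; the endpoint, model, and `stream: false` payload come from the step above, and the generated text is read from the `response` field of Ollama's non-streaming reply.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3.1:8b"):
    """Build the POST request without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120):
    """Send the request and return the generated text.

    Raises on connection errors and socket timeouts; the caller is expected to
    set builder_status = 'failed' and notify. Note that urlopen's timeout applies
    to each blocking socket operation, not to the call as a whole.
    """
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```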
### Step 4 — Overseer validation with claude -p
Pass the generated code to `claude -p` for structural review. Keep the prompt minimal to conserve SDK credits:
```bash
claude -p "Review this {type} automation code for the following only:
1. Does it correctly fetch secrets from Vault via AppRole (never hardcoded)?
2. Are there any obvious syntax errors or missing imports?
3. Does the logic match this spec summary: {name} — {task_description[:200]}
Respond with: PASS or FAIL, then one sentence explaining why.
Do not rewrite the code."
```
If FAIL: log claude's reason, set builder_status = 'failed', notify via NTFY with the failure reason, stop.
If PASS: proceed.
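The overseer reply is free text, so the PASS/FAIL decision needs a tolerant parser. The sketch below is an assumption about how that could be done: it reads the first word, ignores trailing punctuation, and treats anything unrecognized as a failure so that an ambiguous reply never lets a build proceed.

```python
def parse_verdict(output):
    """Return (passed, reason) from the overseer's reply. Fails closed."""
    text = output.strip()
    parts = text.split(None, 1)
    if not parts:
        return False, "empty overseer response"
    verdict = parts[0].strip(".:,;-").upper()
    reason = parts[1].strip() if len(parts) > 1 else ""
    if verdict == "PASS":
        return True, reason
    if verdict == "FAIL":
        return False, reason
    # Neither PASS nor FAIL: do not guess, treat as a failure.
    return False, "unrecognized overseer response: " + text[:80]
```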
### Step 5 — Deploy to sandbox
**For `script` type:**
1. Write the generated code to a temp directory on server-01
2. If a Dockerfile was generated, build the image: `docker build -t agent-{id}-{slug} .`
3. Run a test container: `docker run --rm agent-{id}-{slug}` (dry run, no side effects)
**For `claude_agent` type:**
1. Write the generated claude -p command to a shell script
2. Make it executable
3. Run it once with `--dry-run` flag if supported, or with a test input that produces no side effects
If deployment fails: set builder_status = 'failed', log error, notify via NTFY, stop.
### Step 6 — Run 4-level automated tests
Run each level in order. Stop and fail if any level fails. Log every result to `agent_test_results`.
**Level 1 — Structure**
Validate the generated artifact:
- For scripts: `python3 -m py_compile script.py` — must exit 0
- For claude agents: verify the shell script is syntactically valid bash
- For Dockerfiles: `docker build --check` if available, else verify FROM and key directives exist
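- Insert result: `INSERT INTO agent_test_results (automation_id, test_level, status, execution_log) VALUES ({id}, 1, {status}, {log})`, where `{status}` is `'pass'` or `'fail'` and all values are passed as bound parameters rather than interpolated into the SQL string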
**Level 2 — Deployment**
- Verify the artifact can be deployed cleanly (no missing dependencies, image builds successfully, script runs without import errors)
- Must complete without crashing
- Insert result to agent_test_results
**Level 3 — Smoke**
- Execute the automation with minimal/test inputs
- Must run to completion without an unhandled exception or non-zero exit
- Insert result to agent_test_results
**Level 4 — Assertion**
- Verify the correct side effect occurred (not string matching — check the actual system state)
- Examples: a file was created, a DB row was written, an API call returned 200, a container is running
- Insert result to agent_test_results
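The run-in-order, stop-at-first-failure rule can be sketched as a small driver. The names `run_levels`, `checks`, and `record` are hypothetical; in practice `record` would write a row to `agent_test_results` and each check would wrap the level-specific commands listed above.

```python
LEVELS = ((1, "Structure"), (2, "Deployment"), (3, "Smoke"), (4, "Assertion"))


def run_levels(checks, record):
    """Run the four levels in order and stop at the first failure.

    checks: dict mapping level number to a callable returning (ok, log).
    record: callable(level, status, log), invoked once per level that ran.
    Returns the failed level number, or None if all four passed.
    """
    for level, name in LEVELS:
        try:
            ok, log = checks[level]()
        except Exception as exc:  # a crashing check counts as a failed level
            ok, log = False, f"{name} check raised: {exc!r}"
        record(level, "pass" if ok else "fail", log)
        if not ok:
            return level
    return None
```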
### Step 7 — Notify user for promotion approval
If all 4 levels pass:
1. Set builder_status = 'awaiting_approval'
2. Send NTFY notification:
```
Title: Agent Ready for Promotion — {name}
Body: All 4 test levels passed in sandbox. Automation id={id} ({type}) is ready for production promotion. Reply to approve or reject.
```
User must explicitly approve before any production deployment. No auto-promotion in v1.
### Step 8 — On approval
Set builder_status = 'approved', then 'deployed' after production deployment completes.
Update automation_ideas status = 'deployed'.
## Error handling
- Any unhandled exception: set builder_status = 'failed', log to agent_test_results with test_level=0 and status='fail', send NTFY alert
- Always release the claim (reset builder_status to 'not_started') if failing before Step 3 so another run can retry
- After Step 3: leave as 'failed' — requires manual review before retry
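The release-or-hold rule above reduces to a comparison against the generation step. A tiny sketch, with the function name and the `generation_step` parameter as assumptions:

```python
def status_after_failure(failed_step, generation_step=3):
    """Pick the builder_status to write after a failure at the given step.

    Failures before generation release the claim so another run can retry;
    failures at or after generation are held for manual review.
    """
    return "not_started" if failed_step < generation_step else "failed"
```

Passing `generation_step=5` would cover a playbook in which generation happens at Step 5 instead.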
## NTFY notification patterns
- Build started: `[Agent Builder] Building {name} (id={id}, type={type})`
- Overseer FAIL: `[Agent Builder] FAIL — Overseer rejected {name}: {reason}`
- Test level fail: `[Agent Builder] FAIL — {name} failed Level {n}: {error}`
- Ready for approval: `[Agent Builder] READY — {name} passed all tests, awaiting your approval`
- Unhandled error: `[Agent Builder] ERROR — {name}: {exception}`
## SDK credit budget
`claude -p` is called once per automation (Step 4 only). Keep the overseer prompt under 500 tokens. Do not call claude -p for retries or debugging — only for the initial validation pass.
@@ -0,0 +1,209 @@
# Playbook: N8N Builder Agent
## Purpose
Builds `n8n_automation` type automations from the `automation_ideas` table. Uses Ollama (llama3.1:8b) to generate N8N workflow JSON, imports to sandbox N8N via API, assigns credentials, runs all 4 test levels, notifies user for promotion approval.
## Trigger
Scheduled or manual. Queries `automation_ideas` for the next row where:
- `type = 'n8n_automation'`
- `status = 'ready_to_build'`
- `builder_status = 'not_started'`
- Ordered by `priority ASC NULLS LAST, id ASC`
Only processes one automation per run.
## Infrastructure
- Runs on: server-01 (n8n-sandbox, port 5679)
- Ollama endpoint: http://localhost:11434 (server-01 local)
- Model: llama3.1:8b
- Claude overseer: `claude -p` (non-interactive, SDK credits — use sparingly)
- Sandbox N8N API: http://192.168.1.90:5679 — API key from Vault at secret/sandbox/n8n
- Vault: sandbox AppRole at /opt/appdata/docker/docker-compose/vault/approle/
- Database: production api_business (read automation_ideas, write agent_test_results)
- NTFY: production NTFY instance for notifications
## N8N Workflow JSON Structure (required knowledge)
Every valid N8N workflow JSON must include:
```json
{
"name": "Workflow Name",
"nodes": [...],
"connections": {...},
"active": false,
"settings": {"executionOrder": "v1"},
"tags": []
}
```
Nodes have: `id` (UUID), `name`, `type` (e.g. n8n-nodes-base.httpRequest), `typeVersion`, `position` ([x, y]), `parameters`.
Connections map node outputs to node inputs by node name.
All workflows imported as `active: false` — never activate automatically in sandbox.
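A local pre-check along these lines can catch malformed output before any API call or overseer credit is spent. This is a sketch under assumptions: the function name is hypothetical, and the connection layout (`{source: {"main": [[{"node": target}]]}}`) reflects how exported N8N workflows typically look rather than anything stated in this playbook.

```python
REQUIRED_TOP = ("name", "nodes", "connections", "active", "settings")
REQUIRED_NODE = ("id", "name", "type", "typeVersion", "position", "parameters")


def validate_workflow(wf):
    """Return a list of structural problems; an empty list means the checks passed."""
    errors = [f"missing top-level field: {k}" for k in REQUIRED_TOP if k not in wf]
    if wf.get("active", False) is not False:
        errors.append("active must be false on import")
    names = set()
    for i, node in enumerate(wf.get("nodes") or []):
        errors += [f"node {i} missing field: {k}" for k in REQUIRED_NODE if k not in node]
        names.add(node.get("name"))
    # Connections are keyed by source node name (assumed layout, see lead-in).
    for source, outputs in (wf.get("connections") or {}).items():
        if source not in names:
            errors.append(f"connection source not in nodes: {source}")
        for branches in outputs.values():
            for branch in branches:
                for link in branch or []:
                    if link.get("node") not in names:
                        errors.append(f"connection target not in nodes: {link.get('node')}")
    return errors
```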
## Step-by-Step
### Step 1 — Claim the automation
```sql
UPDATE automation_ideas
SET builder_status = 'queued'
WHERE id = <selected_id> AND builder_status = 'not_started';
```
If 0 rows updated: another builder claimed it — stop, notify, exit.
### Step 2 — Fetch sandbox N8N API key from Vault
Use sandbox AppRole to read secret/sandbox/n8n. Extract `api_key` and `base_url`.
Never log the key value. Pass it in memory only.
### Step 3 — Discover available N8N credentials
Before generating, query the sandbox N8N for existing credentials so the generated workflow references them by name:
```
GET {base_url}/api/v1/credentials
X-N8N-API-KEY: {api_key}
```
Extract credential names and types. Pass this list to the Ollama prompt so the generated workflow uses real credential names.
### Step 4 — Build the prompt for Ollama
```
You are an expert N8N workflow engineer. Generate a valid N8N workflow JSON for the following automation.
Name: {name}
Infrastructure available: {infrastructure_requirement}
Available N8N credentials: {credential_names_and_types}
Specification:
{task_description}
Requirements:
- Output ONLY valid N8N workflow JSON. No explanation, no markdown fences, no commentary.
- The workflow must be importable via the N8N API without modification.
- Set active: false.
- Reference credentials by the exact names listed above — do not invent credential names.
- Use realistic node positions (spread nodes 200px apart on x-axis starting at x=250).
- Every node must have a unique UUID for its id field.
- The workflow must fully implement the specification — do not stub or placeholder any steps.
```
### Step 5 — Generate with Ollama
```
POST http://localhost:11434/api/generate
{
"model": "llama3.1:8b",
"prompt": "<constructed prompt>",
"stream": false
}
```
Set builder_status = 'building' before calling.
Extract JSON from response — strip any surrounding text if Ollama adds it.
Validate it parses as JSON before proceeding. If invalid JSON: set builder_status = 'failed', log error, notify, stop.
If Ollama call fails or times out (>120s): set builder_status = 'failed', log error, notify, stop.
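Stripping surrounding text can be done without regexes by letting the JSON decoder find where the object ends. A minimal sketch, with the function name as an assumption: it tries each `{` in turn and returns the first object that parses.

```python
import json


def extract_json_object(text):
    """Return the first parseable JSON object embedded in text, else raise ValueError."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores
            # whatever follows it, which is what strips trailing commentary.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")
```

A `ValueError` here maps to the invalid-JSON branch above: set builder_status = 'failed', log, notify, stop.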
### Step 6 — Overseer validation with claude -p
Pass the generated JSON to `claude -p` for structural review. Keep prompt minimal to conserve SDK credits:
```bash
claude -p "Review this N8N workflow JSON for the following only:
1. Is it valid N8N workflow JSON with required fields (name, nodes, connections, active, settings)?
2. Do all nodes have id, name, type, typeVersion, position, parameters?
3. Do connections reference node names that exist in the nodes array?
4. Does the workflow logic match this spec: {name} — {task_description[:200]}
Respond with: PASS or FAIL, then one sentence explaining why.
Do not rewrite the workflow."
```
If FAIL: log claude's reason, set builder_status = 'failed', notify via NTFY with reason, stop.
If PASS: proceed.
### Step 7 — Import to sandbox N8N
```
POST {base_url}/api/v1/workflows
X-N8N-API-KEY: {api_key}
Content-Type: application/json
Body: {generated workflow JSON}
```
On success: capture the returned workflow `id` from N8N. Store in notes or a temp variable.
On failure (non-2xx): set builder_status = 'failed', log the N8N error response, notify, stop.
### Step 8 — Assign credentials
Fetch the imported workflow and inspect each node that references a credential:
```
GET {base_url}/api/v1/workflows/{workflow_id}
```
Verify credential references resolved correctly. If any credential reference is broken (credential name not found), attempt to match by type — if unresolvable, set builder_status = 'failed', notify user with list of missing credentials, stop.
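A sketch of the reference check, assuming each node carries its credentials as `{"<credential type>": {"id": ..., "name": ...}}`, which is how exported N8N workflows usually look but is not stated in this playbook. The function name is hypothetical. It compares each referenced name against the list gathered in Step 3.

```python
def missing_credentials(workflow, available_names):
    """Return (node name, credential type, credential name) for each unresolved reference."""
    available = set(available_names)
    missing = []
    for node in workflow.get("nodes", []):
        # Assumed node layout: {"credentials": {"postgres": {"name": "postgres-sandbox"}}}
        for cred_type, ref in (node.get("credentials") or {}).items():
            name = ref.get("name")
            if name not in available:
                missing.append((node.get("name"), cred_type, name))
    return missing
```

The returned list can be passed straight into the missing-credentials notification.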
### Step 9 — Run 4-level automated tests
Run each level in order. Stop and fail if any level fails. Log every result to `agent_test_results`.
**Level 1 — Structure**
Validate the imported workflow via the N8N API:
- `GET {base_url}/api/v1/workflows/{workflow_id}` returns 200
- Response contains correct node count
- All required fields present
- Insert result to agent_test_results (test_level=1)
**Level 2 — Deployment**
- Verify workflow exists in sandbox N8N and is not active
- Verify all credential references are valid (no broken credential links)
- Insert result to agent_test_results (test_level=2)
**Level 3 — Smoke**
- Trigger a manual execution via N8N API:
```
POST {base_url}/api/v1/workflows/{workflow_id}/run
```
- Poll execution status until complete or timeout (60s)
- Must reach status 'success' or 'waiting' (not 'error' or 'crashed')
- Insert result to agent_test_results (test_level=3)
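The poll-until-done-or-timeout loop can be sketched independently of how the status is fetched. Here `fetch_status` is a caller-supplied function (an assumption, since the exact execution lookup is not specified above), and `sleep` and `clock` are injectable so the loop can be exercised without waiting.

```python
import time

TERMINAL_OK = {"success", "waiting"}
TERMINAL_BAD = {"error", "crashed"}


def poll_execution(fetch_status, timeout=60.0, interval=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the execution reaches a terminal status or the timeout expires.

    Returns (passed, final_status); final_status is "timeout" if time ran out.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_OK:
            return True, status
        if status in TERMINAL_BAD:
            return False, status
        if clock() >= deadline:
            return False, "timeout"
        sleep(interval)
```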
**Level 4 — Assertion**
- Verify the correct side effect occurred based on what the workflow is supposed to do
- Check system state, not output strings: DB row written, API called, file created, webhook fired, etc.
- The specific assertion depends on the automation — derive it from task_description
- Insert result to agent_test_results (test_level=4)
### Step 10 — Notify user for promotion approval
If all 4 levels pass:
1. Set builder_status = 'awaiting_approval'
2. Send NTFY notification:
```
Title: N8N Workflow Ready for Promotion — {name}
Body: All 4 test levels passed in sandbox. Automation id={id} (n8n_automation) is ready for production promotion. Sandbox workflow id={n8n_workflow_id}. Reply to approve or reject.
```
User must explicitly approve before production import. No auto-promotion in v1.
### Step 11 — On approval
1. Import the same workflow JSON to production N8N (port 5678)
2. Assign production credentials (different credential names from sandbox)
3. Set builder_status = 'deployed'
4. Update automation_ideas status = 'deployed'
## Error handling
- Any unhandled exception: set builder_status = 'failed', log to agent_test_results (test_level=0, status='fail'), send NTFY alert
- Always release the claim (reset to 'not_started') if failing before Step 5 so another run can retry
- After Step 5: leave as 'failed' — requires manual review before retry
- If workflow was imported before failure: delete it from sandbox N8N to keep sandbox clean
```
DELETE {base_url}/api/v1/workflows/{workflow_id}
```
## NTFY notification patterns
- Build started: `[N8N Builder] Building {name} (id={id})`
- Overseer FAIL: `[N8N Builder] FAIL — Overseer rejected {name}: {reason}`
- Import FAIL: `[N8N Builder] FAIL — {name} failed N8N import: {error}`
- Missing credentials: `[N8N Builder] BLOCKED — {name} needs credentials: {list}`
- Test level fail: `[N8N Builder] FAIL — {name} failed Level {n}: {error}`
- Ready for approval: `[N8N Builder] READY — {name} passed all tests, awaiting your approval`
- Unhandled error: `[N8N Builder] ERROR — {name}: {exception}`
## SDK credit budget
`claude -p` is called once per automation (Step 6 only). Keep the overseer prompt under 500 tokens. Do not call claude -p for retries or debugging — only for the initial validation pass.
## N8N credential naming convention
Sandbox credentials must be named with a `-sandbox` suffix to distinguish from production:
- `postgres-sandbox` (not `postgres`)
- `vault-sandbox` (not `vault`)
- `n8n-internal-sandbox` (not `n8n-internal`)
This prevents the N8N Builder Agent from accidentally referencing production credentials when building sandbox workflows.
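The naming convention is easy to enforce mechanically before a workflow is imported. A minimal sketch, with the function name as an assumption:

```python
def non_sandbox_credentials(names):
    """Return credential names that break the `-sandbox` suffix convention."""
    return [name for name in names if not name.endswith("-sandbox")]
```

A non-empty result for a sandbox build would be a reason to stop before import.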
@@ -0,0 +1,156 @@
---
name: Agent & Automation Testing Methodology
description: Mandatory testing methodology for all built automations before production promotion — covers N8N automations, claude agents, and scripts; self-evolving document updated after every test session
type: project
version: 1.0
---
# Playbook: Agent & Automation Testing Methodology
**Self-evolution rule:** After every test session, update this playbook — add new known failure modes, refine assertion patterns, increment the version number. The methodology improves every time something breaks in a new way.
**Applies to:** All automations in the `automation_ideas` table with types: `n8n_automation`, `claude_agent`, `script`
---
## Before You Test — Required Reading Gate
| Task type | Read first |
|---|---|
| Testing any automation | Sandbox isolation rule · Four test levels · Type-specific section |
| Promoting to production | Promotion gate checklist |
| Adding a new failure mode | Known failure modes section + update rule |
**Sandbox isolation rule (HARD):** All testing happens in sandbox (server-01, 192.168.1.90). Sandbox Vault, Postgres, and N8N are test-only. No production credentials, no production data. See `feedback_sandbox_isolation.md`.
---
## The Four Test Levels
Every automation must pass all four levels before promotion. Run in order — stop and log failure at the first level that fails.
### Level 1 — Structure Test
Does the built artifact have valid structure?
| Type | Check |
|---|---|
| `n8n_automation` | Workflow JSON is valid JSON; contains `nodes`, `connections`, `settings` keys; all node types exist in sandbox N8N |
| `claude_agent` | The `claude -p` call string is syntactically valid; prompt references correct tools/paths; output schema is defined |
| `script` | Python syntax check passes (`python3 -m py_compile script.py`); all imports are available in the target container image |
**Pass criteria:** No structural errors. **Fail action:** Log to `agent_test_results`, NTFY user, do NOT proceed to Level 2.
---
### Level 2 — Deployment Test
Does it deploy to sandbox without errors?
| Type | Check |
|---|---|
| `n8n_automation` | `POST /api/v1/workflows` succeeds; workflow appears in sandbox N8N UI; all credentials are assigned (no empty credential IDs) |
| `claude_agent` | Container builds and starts; `docker ps` shows healthy; `claude -p "echo ok"` returns without error from within the agent's execution context |
| `script` | Docker image builds; container starts; first log line appears within 30 seconds; exit code is 0 for one-shot scripts or container stays running for daemon scripts |
**Pass criteria:** No deployment errors, artifact is reachable. **Fail action:** Log to `agent_test_results`, NTFY user, tear down partial deployment in sandbox.
---
### Level 3 — Smoke Test
Does it execute without crashing on minimal input?
| Type | Check |
|---|---|
| `n8n_automation` | Trigger one manual execution via N8N API (`POST /api/v1/workflows/{id}/run`); execution completes with status `success` (not `error` or `crashed`) |
| `claude_agent` | Run agent with a trivial, safe test input defined in the `task_description`; agent completes without exception; output is non-empty |
| `script` | Run with `--dry-run` flag if supported, or with a clearly safe test input; exits 0; no unhandled exceptions in logs |
**Pass criteria:** Execution completes, no crashes, no unhandled exceptions. **Fail action:** Capture full execution log, log to `agent_test_results`, NTFY user with error excerpt.
---
### Level 4 — Assertion Test
Does it produce the correct side effects?
This is the type-specific level. For each automation, the `task_description` must include at least one verifiable assertion. The builder agents are responsible for generating these assertions at build time.
| Type | Assertion patterns |
|---|---|
| `n8n_automation` | DB row was written/updated · NTFY notification received · HTTP response status was 200 · File was created at expected path |
| `claude_agent` | Output JSON contains required fields · Any artifact the agent built exists and passes its own Level 1 structure check · Side-effect DB row exists |
| `script` | Expected output file exists · DB was updated · Expected log line present |
**LLM output validation rule (claude_agent):** Never assert exact string match on LLM output — outputs are non-deterministic. Assert on: JSON schema validity, presence of required keys, value types, side effects produced.
**Pass criteria:** All assertions defined in `task_description` pass. **Fail action:** Log which assertions failed, NTFY user with details.
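To illustrate the LLM output validation rule, the sketch below checks structure (valid JSON, required keys, value types) and never compares text. The function name and the shape of the `required` mapping are assumptions for illustration.

```python
import json


def validate_output(raw, required):
    """Check LLM output structurally. `required` maps key name to expected Python type.

    Returns a list of problems; an empty list means the output passed.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    for key, expected in required.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")
    return problems
```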
---
## Promotion Gate
When all four levels pass, the following checklist must be completed before the automation goes to production.
- [ ] All 4 test levels logged as `pass` in `agent_test_results`
- [ ] NTFY notification sent to user with test summary
- [ ] **User reviews and approves** (NTFY → user replies or confirms in next session)
- [ ] For `n8n_automation`: all sandbox credentials re-pointed to production equivalents (see `project_sandbox_workflow_credential_rule.md`)
- [ ] For `claude_agent`: production paths/URLs substituted for sandbox paths
- [ ] For `script`: production env vars set in Coolify; no hardcoded sandbox values
- [ ] Production deployment verified (Level 2 re-run against production)
- [ ] `automation_ideas` status updated to `deployed`
- [ ] `agent_test_results` promotion record written
**Promotion is not automatic.** User approval is required after Level 4 pass. This is the v1.0 rule — can be relaxed to auto-promote for specific low-risk automation types after track record is established.
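The first checklist item can be checked mechanically from the logged results. A sketch, assuming results are read as `(test_level, status)` pairs in chronological order and that the most recent result per level is the one that counts; both the function name and that policy are assumptions.

```python
def ready_for_promotion(results):
    """True when the latest logged result for each of levels 1 to 4 is 'pass'.

    This only covers the first checklist item; user approval is still required.
    """
    latest = {}
    for level, status in results:
        latest[level] = status  # later rows overwrite earlier ones
    return all(latest.get(level) == "pass" for level in (1, 2, 3, 4))
```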
---
## Test Result Storage
All test results are written to the `agent_test_results` table (to be created in the `api_business` DB).
**Required schema:**
```sql
CREATE TABLE agent_test_results (
id SERIAL PRIMARY KEY,
automation_id INTEGER NOT NULL REFERENCES automation_ideas(id),
test_level INTEGER NOT NULL CHECK (test_level BETWEEN 0 AND 4), -- 0 = unhandled builder error (builder playbooks log these with test_level=0)
status TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')),
error_message TEXT,
execution_log TEXT,
tested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
promoted_at TIMESTAMPTZ,
notes TEXT
);
```
---
## NTFY Notification Patterns
| Event | Topic | Message format |
|---|---|---|
| Level fail | `homelab-alerts` | `[AGENT TEST FAIL] {name} — Level {N}: {error excerpt}` |
| All levels pass | `homelab-alerts` | `[AGENT TEST PASS] {name} — ready for your review and promotion` |
| Promotion complete | `homelab-alerts` | `[AGENT DEPLOYED] {name} — now live in production` |
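A sketch of publishing one of these messages with the standard library. ntfy accepts a plain POST to the topic URL with the message as the body and an optional `Title` header; the base URL and the function name here are placeholders, not values from this document.

```python
import urllib.request


def build_ntfy_request(base_url, topic, title, message):
    """Build (but do not send) an ntfy publish request.

    Assumption: the title is plain ASCII. The message body is sent as UTF-8,
    so non-ASCII characters belong there rather than in the header.
    """
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title},
        method="POST",
    )
```

Sending is then `urllib.request.urlopen(req, timeout=10)`, with the timeout value chosen by the caller.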
---
## Known Failure Modes
*(Updated as new failures are discovered during testing)*
| ID | Type | Failure | Root cause | Fix |
|---|---|---|---|---|
| — | — | None yet — first test session will populate this | — | — |
---
## Self-Evolution Instructions
After every test session:
1. Add any new failure mode to the Known Failure Modes table with ID, type, root cause, and fix
2. If a Level assertion was too loose (passed but shouldn't have) or too strict (failed but should have passed), update the assertion pattern for that level and type
3. Increment the version number in the frontmatter
4. Note the date and what changed at the bottom of this file
**Change log:**
- v1.0 (2026-06-16): Initial methodology — four levels, user-approval promotion gate, NTFY notifications, self-evolution rule