commit db6cbbdec1ecbc0e8c49621e08436f18d1bdb68e
Author: Backtalk6858
Date:   Wed Jun 17 23:08:23 2026 -0500

    init: add claude-config and agent-builder context files

    Initial commit tracking session context, playbooks, and automation specs for claude-config and agent-builder Claude Code conversations.

    Co-Authored-By: Claude Sonnet 4.6

diff --git a/.claude/context.md b/.claude/context.md
new file mode 100644
index 0000000..64c65ac
--- /dev/null
+++ b/.claude/context.md
@@ -0,0 +1,78 @@
+project_name: claude-config
+# Claude Config Audit — Context
+
+## What this project does
+Saturday/Sunday Claude configuration and improvement sessions — behavior_changes review, hook development, memory system upgrades, skills development, and config audit workflow.
+
+## Key files
+- `/opt/appdata/docker/.claude/hooks/`: all hooks live here (globally referenced)
+- `/opt/appdata/docker/.claude/scripts/session_log_insert.sh`: validated insert helper for session_log table
+- `/home/administrator/.claude/settings.json`: global settings, hook registration, plugin registration
+- `/home/administrator/.claude/projects/-opt-appdata-docker/memory/`: primary memory directory
+- `/home/administrator/.claude/projects/-home-administrator-Desktop-claude/memory/`: this session's memory
+- `project_claude_config_workflow.md`: Saturday workflow, behavior_changes schema
+- `AUDIT_CONTEXT.md` (Desktop/claude/): handoff file written at start of each audit session
+
+## Patterns to follow
+- Validate Python hooks with `python3 -c "import ast; ast.parse(open('file').read())"` before saving
+- All new memory files go in both project memory directories if globally accessible
+- behavior_changes INSERT: copy template from project_claude_config_workflow.md exactly
+- New rules go to behavior_changes DB with status='applied' after being written to memory
+- Evolution instructions required on every new rule, playbook, or system
+- Coolify API key: Vault `secret/coolify → api_key` via AppRole — NOT a file
+- Always back up hooks before modifying: `cp hook.sh hook.sh.bak.YYYYMMDD`
+
+## Known issues / gotchas
+- Semantic memory Phase 1 hooks NOT YET BUILT — Stop hook and SessionStart hook extensions pending
+- `/recall` skill IS built at `/opt/appdata/docker/.claude/skills/recall/SKILL.md` — registered under homelab-skills plugin; invoke via Skill tool with name "recall" (not "homelab-skills@homelab-skills-local:recall")
+- Feedback rule retrofit complete — all 35 files now have evolution instructions
+- session_log_insert.sh validates project_id before inserting — use --dry-run to test
+- pgvector ivfflat index warns about low recall with little data — normal until table is populated
+- N8N work that results from claude-config design decisions is IN SCOPE for Sunday dev session
+
+## What NOT to break
+- Hook registration in settings.json — all five hooks must remain wired
+- Memory file indexing in both MEMORY.md files — every new memory file must be indexed
+- Ollama on localhost:11434 (internal-only) — Coolify UUID: mbhuoyt968m23qt4x7lx01pc
+- pgvector extension in claude_config DB — required for claude_memories table
+
+## Current state
+2026-06-04 Thursday: git criteria universal playbook built (personal_projects id=38 → completed). playbook_git_criteria_universal.md created in both memory dirs — single authorized trigger (checklist only), partial commit on hard block, universal pre-stage registry, dynamic co-author line, two-layer evolution. playbook_git_commits.md updated to extend it. media-api.py + related files still uncommitted — containerized script, testing not confirmed. Next session: security vulnerability patching (~6:30 PM Thursday).
+
+## Sunday Dev Plan (13 items)
+**Hooks:**
+1. Stop hook — MEMORY_EMBED tag detection + Ollama embed (from prior session)
+2. SessionStart hook — semantic query + global handoff file + recent 48h summaries injection
+3. UserPromptSubmit hook — add [CURRENT TIME] injection + 60% context checkpoint threshold
+4. PostToolUse hook (new) — conflict-detector.py fires on Write/Edit to */memory/*.md
+
+**Skills:**
+5. /recall skill — confidence-gated semantic memory query
+
+**New files:**
+6. playbook_checklist_decision.md — merged checklist (grill-me finalized 2026-05-30):
+   DESIGN: old end-of-project + end-of-session checklists MERGED into one. Finishing a project = ending that conversation's session.
+   Trigger table:
+   - Built + verified working (project in DB) → Full merged checklist, all steps
+   - Built + verified working (no DB entry) → Full merged checklist, skip DB steps
+   - Planning/research only, nothing built → Summaries + context.md only
+   - Switching conversations mid-project → Summaries + context.md only
+   - 80% context hit, project mid-flight → Lightweight checkpoint: WIP commit, context.md "stopped at X", log follow-ups, routing
+   "Project finished" = code/config/infra built AND basic functional check passes. "User says done" alone is not sufficient.
+   Also update feedback_end_of_project_checklist.md + feedback_end_of_session_checklist.md to point to new merged playbook.
+7. feedback_secrets_lookup_order.md — Vault before Bitwarden for any credential lookup
+8. session_handoff.md template + write step added to conversation routing checklist
+
+**DB / Schema:**
+9. schedule_actuals table DDL in claude_config DB
+10. proposed_schedule JSON template update (time_estimate_minutes + estimate_basis) + WF1b N8N node
+
+**Playbook update:**
+11. playbook_vault_token_rotation.md — add N8N manual rotation section
+
+**Behavior changes entries:**
+12. Pre-task context estimation rule (flag if >15% remaining context)
+13. Per-project time estimation at session start (query completed-today projects)
+
+## Update instructions
+Update at the end of every config audit session. Keep "Current state" section and pending items current.
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..26d064f
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,7 @@
+*.env
+*.key
+*.secret
+*credentials*
+*token*
+__pycache__/
+*.pyc
diff --git a/AUDIT_CONTEXT.md b/AUDIT_CONTEXT.md
new file mode 100644
index 0000000..a38cbf4
--- /dev/null
+++ b/AUDIT_CONTEXT.md
@@ -0,0 +1,55 @@
+# Saturday 2026-05-23 — Config Audit Context Handoff
+
+## What happened today (API Idea conversation)
+Extended Saturday session covering Friday's missed work (user was sick).
+
+**Completed:**
+- N8N AppRole migration — all 3 outreach workflows migrated from expired static Vault token to dynamic AppRole auth (30-min TTL, revoked after each execution). Execution 86 confirmed working.
+- Dry run gate restructure — all 3 workflows restructured so dry run tests the full AppRole + Vault path
+- Orchestrator audit — `amHPv6iQ5HmIexVt` uses `$vars` (paid feature, never worked); 3 rotation sub-workflows still on expired static token → logged as personal_projects id=42
+- Vault Policy Registry DB (`vault_registry`) — new Postgres DB with pgcrypto encryption on sensitive columns; all 8 known policies pre-populated; passphrase at Vault `secret/data/vault-registry`
+- Postgres DB creation playbook written: `playbook_postgres_databases.md` (data classification + encryption rules for all future DBs)
+- business_projects id=7 (Follow-up Outreach workflow) and id=12 (Brevo Unsubscribe Handler) logged for next week — not built today, out of time
+- unsubscribed_at column added to contacted_leads
+
+**Not done (ran out of time):**
+- Claude config audit ← YOU ARE HERE
+- Scheduling playbook finalization → Nextcloud conversation after this
+- Schedule generation agent Phase 1 → Nextcloud conversation after this
+
+---
+
+## Config Audit Agenda
+
+### 1. behavior_changes DB review (normal Saturday start)
+Query: `docker exec $(docker ps --format '{{.Names}}' | grep '^postgres-') psql -U postgres -d claude_config -c "SELECT * FROM behavior_changes ORDER BY session_date DESC LIMIT 20"`
+
+Review recent behavior changes — identify what's working, what to retire, what needs development.
+
+### 2. Memory organization problem (DISCUSS — don't pre-solve)
+Today's session spent significant tokens trying to find Friday's session summaries. Root cause: session summaries are named by project (e.g. `session_summary_api_business.md`), not by date. Finding "what happened on Friday May 22" required brute-force JSONL searching.
+
+Discuss options during the audit, decide what to build tomorrow.
+
+### 3. Memory upgrade discussion
+User wants to discuss the broader memory upgrade (Obsidian, mem0, MemGPT candidates — see `project_claude_memory_upgrade.md`). May or may not be blocked on local AI hardware.
+
+### 4. CLAUDE.md GitHub repo (user will bring transcript)
+Someone posted their CLAUDE.md files to GitHub — repo got ~68k stars in 24 hours. User wants to review and see if any of the patterns are usable here. User will paste the transcript into the conversation.
+
+### 5. Tomorrow's development agenda (Sunday)
+Whatever the audit identifies as highest value to develop. Memory org fix is a strong candidate.
+
+---
+
+## Key project DB IDs for reference
+- personal_projects id=40: Vault Policy Registry DB — COMPLETED today
+- personal_projects id=42: N8N Rotation Workflows AppRole Migration — pending
+- personal_projects id=43: AI-Assisted DB Encryption Scanner — pending (blocked on local AI? user says Claude can do it now — discuss)
+- business_projects id=7: Follow-up Outreach Emails workflow — pending
+- business_projects id=12: Brevo Unsubscribe Handler — pending
+
+## After the audit
+Come back to the **Nextcloud conversation** (`~/Desktop` or wherever it lives) for:
+1. Scheduling playbook finalization (recurring_events DB table design + machine-readable rule format)
+2. Schedule generation agent Phase 1 build (N8N → NTFY approval → Nextcloud CalDAV push)
diff --git a/agent-builder/.claude/context.md b/agent-builder/.claude/context.md
new file mode 100644
index 0000000..c1661c6
--- /dev/null
+++ b/agent-builder/.claude/context.md
@@ -0,0 +1,73 @@
+project_name: agent-builder
+# Agent Builder — Session Context
+
+## What this project does
+Design, build, and test autonomous N8N agents on server-01 sandbox before any production promotion.
+First two agents: Agent Builder Agent + N8N Builder Agent.
+
+## Scheduled work (2026-06-16, running behind — started ~6:38 PM)
+1. Vision Alignment Grill-Me — Agent Builder + N8N Builder vision + testing methodology
+2. Agent Builder Agent — Deploy + Test (server-01 sandbox)
+3. N8N Builder Agent — Deploy + Test (server-01 sandbox)
+
+## Architecture
+- Agents run as N8N workflows on server-01 (n8n-sandbox, port 5679)
+- Sandbox-first: all agents tested in sandbox before any production promotion
+- server-01 sandbox stack: n8n-sandbox, postgres-sandbox, vault-sandbox, bitwarden-bridge-sandbox, vaultwarden-sandbox
+- Sandbox N8N API key: prod Vault at secret/sandbox/n8n
+- Sandbox reachable at 192.168.1.90
+
+## Key decisions (set during vision grill-me — 2026-06-16)
+- Agent Builder Agent: builds `claude_agent` and `script` types — Ollama (llama3.1:8b) does the building, `claude -p` is overseer/validator
+- N8N Builder Agent: builds `n8n_automation` types — Ollama generates workflow JSON, imports via N8N API, assigns credentials
+- automation_ideas schema changes needed: rename `description` → `task_description` (full structured spec), add `type` (n8n_automation/claude_agent/script), add `builder_status`
+- New `agent_test_results` table needed in api_business DB
+- Sandbox must mirror production: AppRole, Vaultwarden, bridge all configured before any agent deploys
+- Promotion = user approval required after all 4 test levels pass (not auto-promote in v1)
+- Dedicated backfill session needed for all 48 existing automation_ideas rows (type + task_description)
+- claude -p uses SDK credits (Pro = $20/month hard limit) — use sparingly, Ollama does the heavy lifting
+- Local model: llama3.1:8b already pulled on server-01 (4.9GB, fits in RTX 2060 Super 8GB VRAM)
+
+## Testing methodology
+- Four levels: Structure → Deployment → Smoke → Assertion
+- LLM outputs validated on structure/side-effects only, never exact string match
+- All results logged to agent_test_results table
+- NTFY notification on pass and fail
+- Full methodology: .claude/playbook_testing_methodology.md
+
+## Agents
+### Agent Builder Agent
+- Status: pending — prereqs not complete
+- Purpose: Receives automation spec from automation_ideas DB, uses Ollama to build claude_agent or script type automations, deploys to sandbox, runs automated tests, notifies user for promotion approval
+- Builds: claude agents (via claude -p) and Python scripts (Docker containers)
+
+### N8N Builder Agent
+- Status: pending — prereqs not complete
+- Purpose: Receives automation spec from automation_ideas DB, uses Ollama to generate N8N workflow JSON using n8n_automations playbook as context, imports to sandbox N8N via API, assigns credentials, runs automated tests
+- Will be used to build: id=12 (Media Pipeline Learning), id=7 (Friday Research Session Prep)
+
+## Related personal_projects DB rows
+- id=4: N8N Workflow Builder Script (pending, weekend_block1)
+
+## Prereq checklist (must complete before any agent deployment)
+- [x] Schema: rename automation_ideas.description → task_description, add type, add builder_status, add priority
+- [x] Create agent_test_results table in api_business
+- [x] Sandbox Vault: set up AppRole auth method (credentials at /opt/appdata/docker/docker-compose/vault/approle/ on server-01)
+- [x] Sandbox Vault: store sandbox N8N API key at secret/sandbox/n8n (key name: claude-sandbox, verified working)
+- [x] Verify sandbox Bitwarden bridge ↔ Vaultwarden sandbox end-to-end (bridge on port 8080, returns [] for empty vault — correct)
+- [x] Write Agent Builder Agent playbook → .claude/playbook_agent_builder_agent.md
+- [x] Write N8N Builder Agent playbook → .claude/playbook_n8n_builder_agent.md
+- [~] Backfill session: Resume from priority 28 (id=34, CalDAV Auto-Refresh Trigger) next session. ~30 automations remain. This session reviewed ids 28, 40, 53, 4, 7, 5, 14, 11, 44, 33, 29, 20, 41, 12, 1, 19, 39, 6. Key changes: id=7+50 merged into id=18 (full 10-stage business pipeline); id=18 expanded with Business Research Agent + Development Agent + parked idea email-to-Tyler flow; id=14+20 blocked (already built by schedule workflows); id=12+1+41 blocked (redundant); id=19 blocked (pending Jenkins); id=5 blocked (Obsidian vault not set up yet); id=40 pending with Jenkins conditional note. New rows added: id=54 (NTFY Topic Provisioner, p54), id=55 (Business Research Agent, p51), id=56 (Business Development Agent, p52), id=57 (Sandbox Environment Deployment Completion, p9 — NOTE: conflicts with id=28 priority 9, fix next session). Calendar events pushed to Nextcloud: Thu 6/18 12:45-3PM backfill + 3-4:30PM readiness, Fri 6/19 12:45-4:30PM agent builds.
+
+## Final readiness check items (scheduled June 18)
+- All 8 prereq checklist items verified complete
+- Sandbox mirrors production: AppRole, bridge, Vaultwarden all confirmed functional
+- **Sensitive output interception system** — design and implement before any agent goes live:
+  - Agents must scan their own stdout/logs before writing/sending output and redact anything matching secret patterns (tokens, keys, passwords, API keys)
+  - Pattern list at minimum: `hvs\.`, `eyJ`, bearer tokens, anything from known env var names (BRIDGE_API_KEY, VAULT_TOKEN, N8N_ENCRYPTION_KEY, etc.)
+  - Root cause: `docker inspect --format '{{range .Config.Env}}...'` dumps all env vars including secrets; agents will reach for broad diagnostic commands without filtering — local models even more so
+  - Production exposure is a serious risk; sandbox exposure is acceptable but still undesirable
+  - This system needs to exist at the agent level (not just Claude Code rules) because once agents run autonomously the user will not be watching
+
+## Update instructions
+Update at the end of every agent-builder session. Keep agent status, key decisions, and prereq checklist current.
diff --git a/agent-builder/.claude/playbook_agent_builder_agent.md b/agent-builder/.claude/playbook_agent_builder_agent.md
new file mode 100644
index 0000000..97bfbc8
--- /dev/null
+++ b/agent-builder/.claude/playbook_agent_builder_agent.md
@@ -0,0 +1,154 @@
+# Playbook: Agent Builder Agent
+
+## Purpose
+Builds `claude_agent` and `script` type automations from the `automation_ideas` table. Uses Ollama (llama3.1:8b) as the primary code generator and `claude -p` as the overseer/validator. Deploys to sandbox, runs all 4 test levels, notifies user for promotion approval.
+
+## Trigger
+Scheduled or manual. Queries `automation_ideas` for the next row where:
+- `type IN ('claude_agent', 'script')`
+- `status = 'ready_to_build'`
+- `builder_status = 'not_started'`
+- Ordered by `priority ASC NULLS LAST, id ASC`
+
+Only processes one automation per run.
+
+## Infrastructure
+- Runs on: server-01 (n8n-sandbox, port 5679)
+- Ollama endpoint: http://localhost:11434 (server-01 local)
+- Model: llama3.1:8b
+- Claude overseer: `claude -p` (non-interactive, SDK credits — use sparingly)
+- Vault: sandbox AppRole at /opt/appdata/docker/docker-compose/vault/approle/
+- Database: production api_business (read automation_ideas, write agent_test_results)
+- NTFY: production NTFY instance for notifications
+
+## Step-by-Step
+
+### Step 1 — Claim the automation
+```sql
+UPDATE automation_ideas
+SET builder_status = 'queued'
+WHERE id = {id} AND builder_status = 'not_started';
+```
+If 0 rows updated: another builder claimed it — stop, notify, exit.
+
+### Step 2 — Build the prompt for Ollama
+Construct a generation prompt using all available fields from the automation row:
+- `name`: what the automation is called
+- `task_description`: full structured spec — this is the primary instruction
+- `type`: claude_agent or script
+- `infrastructure_requirement`: what infra it needs access to
+
+Prompt structure:
+```
+You are an expert automation engineer. Build a {type} automation with the following specification.
+
+Name: {name}
+Infrastructure: {infrastructure_requirement}
+
+Specification:
+{task_description}
+
+Requirements:
+- If type is claude_agent: output a complete shell-executable claude -p command with full system prompt and all logic. The agent must be self-contained.
+- If type is script: output a complete Python script. Include a Dockerfile if the script has dependencies beyond stdlib.
+- Output ONLY the code. No explanation, no markdown fences, no commentary.
+- The code must handle its own error cases and log to stdout.
+- Secrets must be fetched from Vault via AppRole — never hardcoded. AppRole credentials at /opt/appdata/docker/docker-compose/vault/approle/role-id and secret-id.
+```
+
+### Step 3 — Generate with Ollama
+```
+POST http://localhost:11434/api/generate
+{
+  "model": "llama3.1:8b",
+  "prompt": "{prompt}",
+  "stream": false
+}
+```
+Set builder_status = 'building' before calling.
+
+If Ollama call fails or times out (>120s): set builder_status = 'failed', log error, notify via NTFY, stop.
+
+### Step 4 — Overseer validation with claude -p
+Pass the generated code to `claude -p` for structural review. Keep the prompt minimal to conserve SDK credits:
+
+```bash
+claude -p "Review this {type} automation code for the following only:
+1. Does it correctly fetch secrets from Vault via AppRole (never hardcoded)?
+2. Are there any obvious syntax errors or missing imports?
+3. Does the logic match this spec summary: {name} — {task_description[:200]}
+
+Respond with: PASS or FAIL, then one sentence explaining why.
+Do not rewrite the code."
+```
+
+If FAIL: log claude's reason, set builder_status = 'failed', notify via NTFY with the failure reason, stop.
+If PASS: proceed.
+
+### Step 5 — Deploy to sandbox
+**For `script` type:**
+1. Write the generated code to a temp directory on server-01
+2. If a Dockerfile was generated, build the image: `docker build -t agent-{id}-{slug} .`
+3. Run a test container: `docker run --rm agent-{id}-{slug}` (dry run, no side effects)
+
+**For `claude_agent` type:**
+1. Write the generated claude -p command to a shell script
+2. Make it executable
+3. Run it once with `--dry-run` flag if supported, or with a test input that produces no side effects
+
+If deployment fails: set builder_status = 'failed', log error, notify via NTFY, stop.
+
+### Step 6 — Run 4-level automated tests
+Run each level in order. Stop and fail if any level fails. Log every result to `agent_test_results`.
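
A minimal sketch of the run-in-order, stop-at-first-failure rule, assuming each level is supplied as a callable that returns `(passed, log)`. `run_levels` and the `record` callback are hypothetical names; `record` stands in for the `INSERT INTO agent_test_results` this playbook describes.

```python
from typing import Callable

# Level numbers and names as used throughout this playbook.
LEVELS = {1: "Structure", 2: "Deployment", 3: "Smoke", 4: "Assertion"}

Check = Callable[[], tuple[bool, str]]
Recorder = Callable[[int, str, str], None]


def run_levels(checks: dict[int, Check], record: Recorder) -> bool:
    """Run levels 1..4 in order, record every result, stop at the first failure."""
    for level in sorted(LEVELS):
        passed, log = checks[level]()
        record(level, "pass" if passed else "fail", log)
        if not passed:
            return False  # later levels are never executed
    return True


# Usage: level 2 fails, so levels 3 and 4 are skipped.
recorded: list[tuple[int, str, str]] = []
checks = {
    1: lambda: (True, "ok"),
    2: lambda: (False, "image build failed"),
    3: lambda: (True, "ok"),
    4: lambda: (True, "ok"),
}
result = run_levels(checks, lambda lvl, status, log: recorded.append((lvl, status, log)))
print(result, recorded)
```

Exceptions raised by a check are deliberately not caught here, so the caller can apply the unhandled-exception rule from the error-handling section.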
+
+**Level 1 — Structure**
+Validate the generated artifact:
+- For scripts: `python3 -m py_compile script.py` — must exit 0
+- For claude agents: verify the shell script is syntactically valid bash
+- For Dockerfiles: `docker build --check` if available, else verify FROM and key directives exist
+- Insert result: `INSERT INTO agent_test_results (automation_id, test_level, status, execution_log) VALUES ({id}, 1, 'pass'/'fail', '{log}')`
+
+**Level 2 — Deployment**
+- Verify the artifact can be deployed cleanly (no missing dependencies, image builds successfully, script runs without import errors)
+- Must complete without crashing
+- Insert result to agent_test_results
+
+**Level 3 — Smoke**
+- Execute the automation with minimal/test inputs
+- Must run to completion without an unhandled exception or non-zero exit
+- Insert result to agent_test_results
+
+**Level 4 — Assertion**
+- Verify the correct side effect occurred (not string matching — check the actual system state)
+- Examples: a file was created, a DB row was written, an API call returned 200, a container is running
+- Insert result to agent_test_results
+
+### Step 7 — Notify user for promotion approval
+If all 4 levels pass:
+1. Set builder_status = 'awaiting_approval'
+2. Send NTFY notification:
+   ```
+   Title: Agent Ready for Promotion — {name}
+   Body: All 4 test levels passed in sandbox. Automation id={id} ({type}) is ready for production promotion. Reply to approve or reject.
+   ```
+
+User must explicitly approve before any production deployment. No auto-promotion in v1.
+
+### Step 8 — On approval
+Set builder_status = 'approved', then 'deployed' after production deployment completes.
+Update automation_ideas status = 'deployed'.
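
The `builder_status` values used across Steps 1 to 8 form a small state machine. The sketch below is one reading of those steps plus the claim-release rule in this playbook's error handling, not an existing schema constraint; `ALLOWED` and `can_transition` are hypothetical names, and `failed` is treated as terminal because retries require manual review.

```python
# Allowed builder_status transitions, as read from this playbook.
ALLOWED: dict[str, set[str]] = {
    "not_started": {"queued"},                    # Step 1: claim
    "queued": {"building", "not_started"},        # Step 3 starts, or the claim is released
    "building": {"failed", "awaiting_approval"},  # any failure, or all 4 levels pass
    "awaiting_approval": {"approved"},            # explicit user approval only
    "approved": {"deployed"},                     # after production deployment completes
    "failed": set(),                              # manual review before any retry
    "deployed": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` matches the playbook's steps."""
    return new in ALLOWED.get(current, set())


print(can_transition("building", "awaiting_approval"))
print(can_transition("building", "deployed"))
```

Checking a transition before each `UPDATE` would make an accidental skip of the approval step fail loudly instead of silently promoting an automation.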
+
+## Error handling
+- Any unhandled exception: set builder_status = 'failed', log to agent_test_results with test_level=0 and status='fail', send NTFY alert
+- Always release the claim (reset builder_status to 'not_started') if failing before Step 3 so another run can retry
+- After Step 3: leave as 'failed' — requires manual review before retry
+
+## NTFY notification patterns
+- Build started: `[Agent Builder] Building {name} (id={id}, type={type})`
+- Overseer FAIL: `[Agent Builder] FAIL — Overseer rejected {name}: {reason}`
+- Test level fail: `[Agent Builder] FAIL — {name} failed Level {n}: {error}`
+- Ready for approval: `[Agent Builder] READY — {name} passed all tests, awaiting your approval`
+- Unhandled error: `[Agent Builder] ERROR — {name}: {exception}`
+
+## SDK credit budget
+`claude -p` is called once per automation (Step 4 only). Keep the overseer prompt under 500 tokens. Do not call claude -p for retries or debugging — only for the initial validation pass.
diff --git a/agent-builder/.claude/playbook_n8n_builder_agent.md b/agent-builder/.claude/playbook_n8n_builder_agent.md
new file mode 100644
index 0000000..89d8243
--- /dev/null
+++ b/agent-builder/.claude/playbook_n8n_builder_agent.md
@@ -0,0 +1,209 @@
+# Playbook: N8N Builder Agent
+
+## Purpose
+Builds `n8n_automation` type automations from the `automation_ideas` table. Uses Ollama (llama3.1:8b) to generate N8N workflow JSON, imports to sandbox N8N via API, assigns credentials, runs all 4 test levels, notifies user for promotion approval.
+
+## Trigger
+Scheduled or manual. Queries `automation_ideas` for the next row where:
+- `type = 'n8n_automation'`
+- `status = 'ready_to_build'`
+- `builder_status = 'not_started'`
+- Ordered by `priority ASC NULLS LAST, id ASC`
+
+Only processes one automation per run.
+
+## Infrastructure
+- Runs on: server-01 (n8n-sandbox, port 5679)
+- Ollama endpoint: http://localhost:11434 (server-01 local)
+- Model: llama3.1:8b
+- Claude overseer: `claude -p` (non-interactive, SDK credits — use sparingly)
+- Sandbox N8N API: http://192.168.1.90:5679 — API key from Vault at secret/sandbox/n8n
+- Vault: sandbox AppRole at /opt/appdata/docker/docker-compose/vault/approle/
+- Database: production api_business (read automation_ideas, write agent_test_results)
+- NTFY: production NTFY instance for notifications
+
+## N8N Workflow JSON Structure (required knowledge)
+Every valid N8N workflow JSON must include:
+```json
+{
+  "name": "Workflow Name",
+  "nodes": [...],
+  "connections": {...},
+  "active": false,
+  "settings": {"executionOrder": "v1"},
+  "tags": []
+}
+```
+Nodes have: `id` (UUID), `name`, `type` (e.g. n8n-nodes-base.httpRequest), `typeVersion`, `position` ([x, y]), `parameters`.
+Connections map node outputs to node inputs by node name.
+All workflows imported as `active: false` — never activate automatically in sandbox.
+
+## Step-by-Step
+
+### Step 1 — Claim the automation
+```sql
+UPDATE automation_ideas
+SET builder_status = 'queued'
+WHERE id = {id} AND builder_status = 'not_started';
+```
+If 0 rows updated: another builder claimed it — stop, notify, exit.
+
+### Step 2 — Fetch sandbox N8N API key from Vault
+Use sandbox AppRole to read secret/sandbox/n8n. Extract `api_key` and `base_url`.
+Never log the key value. Pass it in memory only.
+
+### Step 3 — Discover available N8N credentials
+Before generating, query the sandbox N8N for existing credentials so the generated workflow references them by name:
+```
+GET {base_url}/api/v1/credentials
+X-N8N-API-KEY: {api_key}
+```
+Extract credential names and types. Pass this list to the Ollama prompt so the generated workflow uses real credential names.
+
+### Step 4 — Build the prompt for Ollama
+```
+You are an expert N8N workflow engineer. Generate a valid N8N workflow JSON for the following automation.
+
+Name: {name}
+Infrastructure available: {infrastructure_requirement}
+Available N8N credentials: {credential_names_and_types}
+
+Specification:
+{task_description}
+
+Requirements:
+- Output ONLY valid N8N workflow JSON. No explanation, no markdown fences, no commentary.
+- The workflow must be importable via the N8N API without modification.
+- Set active: false.
+- Reference credentials by the exact names listed above — do not invent credential names.
+- Use realistic node positions (spread nodes 200px apart on x-axis starting at x=250).
+- Every node must have a unique UUID for its id field.
+- The workflow must fully implement the specification — do not stub or placeholder any steps.
+```
+
+### Step 5 — Generate with Ollama
+```
+POST http://localhost:11434/api/generate
+{
+  "model": "llama3.1:8b",
+  "prompt": "{prompt}",
+  "stream": false
+}
+```
+Set builder_status = 'building' before calling.
+
+Extract JSON from response — strip any surrounding text if Ollama adds it.
+Validate it parses as JSON before proceeding. If invalid JSON: set builder_status = 'failed', log error, notify, stop.
+
+If Ollama call fails or times out (>120s): set builder_status = 'failed', log error, notify, stop.
+
+### Step 6 — Overseer validation with claude -p
+Pass the generated JSON to `claude -p` for structural review. Keep prompt minimal to conserve SDK credits:
+
+```bash
+claude -p "Review this N8N workflow JSON for the following only:
+1. Is it valid N8N workflow JSON with required fields (name, nodes, connections, active, settings)?
+2. Do all nodes have id, name, type, typeVersion, position, parameters?
+3. Do connections reference node names that exist in the nodes array?
+4. Does the workflow logic match this spec: {name} — {task_description[:200]}
+
+Respond with: PASS or FAIL, then one sentence explaining why.
+Do not rewrite the workflow."
+```
+
+If FAIL: log claude's reason, set builder_status = 'failed', notify via NTFY with reason, stop.
+If PASS: proceed.
+
+### Step 7 — Import to sandbox N8N
+```
+POST {base_url}/api/v1/workflows
+X-N8N-API-KEY: {api_key}
+Content-Type: application/json
+Body: {generated workflow JSON}
+```
+
+On success: capture the returned workflow `id` from N8N. Store in notes or a temp variable.
+On failure (non-2xx): set builder_status = 'failed', log the N8N error response, notify, stop.
+
+### Step 8 — Assign credentials
+For each node in the workflow that references a credential:
+```
+GET {base_url}/api/v1/workflows/{workflow_id}
+```
+Verify credential references resolved correctly. If any credential reference is broken (credential name not found), attempt to match by type — if unresolvable, set builder_status = 'failed', notify user with list of missing credentials, stop.
+
+### Step 9 — Run 4-level automated tests
+Run each level in order. Stop and fail if any level fails. Log every result to `agent_test_results`.
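
The structural requirements this playbook lists (required top-level fields, required node fields, connections that point at real node names, `active: false`) can also be checked locally before spending an import call or SDK credits. The sketch below is an illustration under stated assumptions: `structure_errors` is a hypothetical helper, and the connection layout it walks (`{"Source": {"main": [[{"node": "Target", ...}]]}}`) is assumed from typical exported workflows rather than taken from this playbook.

```python
import json

REQUIRED_TOP = ("name", "nodes", "connections", "active", "settings")
REQUIRED_NODE = ("id", "name", "type", "typeVersion", "position", "parameters")


def structure_errors(raw: str) -> list[str]:
    """Return a list of structural problems; an empty list means the checks passed."""
    try:
        wf = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(wf, dict):
        return ["top level is not a JSON object"]

    errors = [f"missing top-level field: {k}" for k in REQUIRED_TOP if k not in wf]
    if wf.get("active") is not False:
        errors.append("active must be false")

    names = set()
    for i, node in enumerate(wf.get("nodes") or []):
        errors += [f"node {i} missing field: {k}" for k in REQUIRED_NODE if k not in node]
        names.add(node.get("name"))

    # Assumed layout: source name -> output type -> list of branches -> list of links.
    for source, outputs in (wf.get("connections") or {}).items():
        if source not in names:
            errors.append(f"connection source not in nodes: {source}")
        for branches in outputs.values():
            for branch in branches:
                for link in branch or []:
                    if link.get("node") not in names:
                        errors.append(f"connection target not in nodes: {link.get('node')}")
    return errors


def _node(name: str) -> dict:
    # Hypothetical minimal node used only for this illustration.
    return {"id": f"id-{name}", "name": name, "type": "n8n-nodes-base.noOp",
            "typeVersion": 1, "position": [250, 300], "parameters": {}}


good = json.dumps({
    "name": "Demo", "nodes": [_node("A"), _node("B")],
    "connections": {"A": {"main": [[{"node": "B", "type": "main", "index": 0}]]}},
    "active": False, "settings": {"executionOrder": "v1"}, "tags": [],
})
print(structure_errors(good))
```

A check like this only covers shape; whether the node types actually exist in the sandbox still has to be confirmed against the running N8N instance.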
+
+**Level 1 — Structure**
+Validate the imported workflow via the N8N API:
+- `GET {base_url}/api/v1/workflows/{workflow_id}` returns 200
+- Response contains correct node count
+- All required fields present
+- Insert result to agent_test_results (test_level=1)
+
+**Level 2 — Deployment**
+- Verify workflow exists in sandbox N8N and is not active
+- Verify all credential references are valid (no broken credential links)
+- Insert result to agent_test_results (test_level=2)
+
+**Level 3 — Smoke**
+- Trigger a manual execution via N8N API:
+  ```
+  POST {base_url}/api/v1/workflows/{workflow_id}/run
+  ```
+- Poll execution status until complete or timeout (60s)
+- Must reach status 'success' or 'waiting' (not 'error' or 'crashed')
+- Insert result to agent_test_results (test_level=3)
+
+**Level 4 — Assertion**
+- Verify the correct side effect occurred based on what the workflow is supposed to do
+- Check system state, not output strings: DB row written, API called, file created, webhook fired, etc.
+- The specific assertion depends on the automation — derive it from task_description
+- Insert result to agent_test_results (test_level=4)
+
+### Step 10 — Notify user for promotion approval
+If all 4 levels pass:
+1. Set builder_status = 'awaiting_approval'
+2. Send NTFY notification:
+   ```
+   Title: N8N Workflow Ready for Promotion — {name}
+   Body: All 4 test levels passed in sandbox. Automation id={id} (n8n_automation) is ready for production promotion. Sandbox workflow id={n8n_workflow_id}. Reply to approve or reject.
+   ```
+
+User must explicitly approve before production import. No auto-promotion in v1.
+
+### Step 11 — On approval
+1. Import the same workflow JSON to production N8N (port 5678)
+2. Assign production credentials (different credential names from sandbox)
+3. Set builder_status = 'deployed'
+4. Update automation_ideas status = 'deployed'
+
+## Error handling
+- Any unhandled exception: set builder_status = 'failed', log to agent_test_results (test_level=0, status='fail'), send NTFY alert
+- Always release the claim (reset to 'not_started') if failing before Step 5 so another run can retry
+- After Step 5: leave as 'failed' — requires manual review before retry
+- If workflow was imported before failure: delete it from sandbox N8N to keep sandbox clean
+  ```
+  DELETE {base_url}/api/v1/workflows/{workflow_id}
+  ```
+
+## NTFY notification patterns
+- Build started: `[N8N Builder] Building {name} (id={id})`
+- Overseer FAIL: `[N8N Builder] FAIL — Overseer rejected {name}: {reason}`
+- Import FAIL: `[N8N Builder] FAIL — {name} failed N8N import: {error}`
+- Missing credentials: `[N8N Builder] BLOCKED — {name} needs credentials: {list}`
+- Test level fail: `[N8N Builder] FAIL — {name} failed Level {n}: {error}`
+- Ready for approval: `[N8N Builder] READY — {name} passed all tests, awaiting your approval`
+- Unhandled error: `[N8N Builder] ERROR — {name}: {exception}`
+
+## SDK credit budget
+`claude -p` is called once per automation (Step 6 only). Keep the overseer prompt under 500 tokens. Do not call claude -p for retries or debugging — only for the initial validation pass.
+
+## N8N credential naming convention
+Sandbox credentials must be named with a `-sandbox` suffix to distinguish from production:
+- `postgres-sandbox` (not `postgres`)
+- `vault-sandbox` (not `vault`)
+- `n8n-internal-sandbox` (not `n8n-internal`)
+
+This prevents the N8N Builder Agent from accidentally referencing production credentials when building sandbox workflows.
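
The `-sandbox` suffix convention lends itself to a mechanical check before import. The following is a minimal sketch, assuming each node stores its credential references under a `credentials` key that maps a credential type to an object with a `name` field; that layout and the helper name `non_sandbox_credentials` are assumptions for illustration, not something this playbook specifies.

```python
SANDBOX_SUFFIX = "-sandbox"


def non_sandbox_credentials(workflow: dict) -> list[tuple[str, str]]:
    """Return (node name, credential name) pairs that break the -sandbox convention."""
    offenders = []
    for node in workflow.get("nodes", []):
        # Assumed layout: {"credentials": {"<type>": {"id": "...", "name": "..."}}}
        for cred in (node.get("credentials") or {}).values():
            name = cred.get("name", "")
            if not name.endswith(SANDBOX_SUFFIX):
                offenders.append((node.get("name", ""), name))
    return offenders


workflow = {
    "nodes": [
        {"name": "Read leads", "credentials": {"postgres": {"id": "1", "name": "postgres-sandbox"}}},
        {"name": "Fetch secret", "credentials": {"vault": {"id": "2", "name": "vault"}}},
        {"name": "No creds"},
    ]
}
print(non_sandbox_credentials(workflow))
```

Any non-empty result could be treated the same way as a missing credential in Step 8: fail the build and list the offending names in the NTFY message.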
diff --git a/agent-builder/.claude/playbook_testing_methodology.md b/agent-builder/.claude/playbook_testing_methodology.md new file mode 100644 index 0000000..03cec99 --- /dev/null +++ b/agent-builder/.claude/playbook_testing_methodology.md @@ -0,0 +1,156 @@ +--- +name: Agent & Automation Testing Methodology +description: Mandatory testing methodology for all built automations before production promotion — covers N8N automations, claude agents, and scripts; self-evolving document updated after every test session +type: project +version: 1.0 +--- + +# Playbook: Agent & Automation Testing Methodology + +**Self-evolution rule:** After every test session, update this playbook — add new known failure modes, refine assertion patterns, increment the version number. The methodology improves every time something breaks in a new way. + +**Applies to:** All automations in the `automation_ideas` table with types: `n8n_automation`, `claude_agent`, `script` + +--- + +## Before You Test — Required Reading Gate + +| Task type | Read first | +|---|---| +| Testing any automation | Sandbox isolation rule · Four test levels · Type-specific section | +| Promoting to production | Promotion gate checklist | +| Adding a new failure mode | Known failure modes section + update rule | + +**Sandbox isolation rule (HARD):** All testing happens in sandbox (server-01, 192.168.1.90). Sandbox Vault, Postgres, and N8N are test-only. No production credentials, no production data. See `feedback_sandbox_isolation.md`. + +--- + +## The Four Test Levels + +Every automation must pass all four levels before promotion. Run in order — stop and log failure at the first level that fails. + +### Level 1 — Structure Test +Does the built artifact have valid structure? 
+ +| Type | Check | +|---|---| +| `n8n_automation` | Workflow JSON is valid JSON; contains `nodes`, `connections`, `settings` keys; all node types exist in sandbox N8N | +| `claude_agent` | The `claude -p` call string is syntactically valid; prompt references correct tools/paths; output schema is defined | +| `script` | Python syntax check passes (`python3 -m py_compile script.py`); all imports are available in the target container image | + +**Pass criteria:** No structural errors. **Fail action:** Log to `agent_test_results`, NTFY user, do NOT proceed to Level 2. + +--- + +### Level 2 — Deployment Test +Does it deploy to sandbox without errors? + +| Type | Check | +|---|---| +| `n8n_automation` | `POST /api/v1/workflows` succeeds; workflow appears in sandbox N8N UI; all credentials are assigned (no empty credential IDs) | +| `claude_agent` | Container builds and starts; `docker ps` shows healthy; `claude -p "echo ok"` returns without error from within the agent's execution context | +| `script` | Docker image builds; container starts; first log line appears within 30 seconds; exit code is 0 for one-shot scripts or container stays running for daemon scripts | + +**Pass criteria:** No deployment errors, artifact is reachable. **Fail action:** Log to `agent_test_results`, NTFY user, tear down partial deployment in sandbox. + +--- + +### Level 3 — Smoke Test +Does it execute without crashing on minimal input? + +| Type | Check | +|---|---| +| `n8n_automation` | Trigger one manual execution via N8N API (`POST /api/v1/workflows/{id}/run`); execution completes with status `success` (not `error` or `crashed`) | +| `claude_agent` | Run agent with a trivial, safe test input defined in the `task_description`; agent completes without exception; output is non-empty | +| `script` | Run with `--dry-run` flag if supported, or with a clearly safe test input; exits 0; no unhandled exceptions in logs | + +**Pass criteria:** Execution completes, no crashes, no unhandled exceptions. 
**Fail action:** Capture full execution log, log to `agent_test_results`, NTFY user with error excerpt. + +--- + +### Level 4 — Assertion Test +Does it produce the correct side effects? + +This is the type-specific level. For each automation, the `task_description` must include at least one verifiable assertion. The builder agents are responsible for generating these assertions at build time. + +| Type | Assertion patterns | +|---|---| +| `n8n_automation` | DB row was written/updated · NTFY notification received · HTTP response status was 200 · File was created at expected path | +| `claude_agent` | Output JSON contains required fields · Built artifact exists and passes Level 1 structure check of the artifact it built · Side-effect DB row exists | +| `script` | Expected output file exists · DB was updated · Expected log line present | + +**LLM output validation rule (claude_agent):** Never assert exact string match on LLM output — outputs are non-deterministic. Assert on: JSON schema validity, presence of required keys, value types, side effects produced. + +**Pass criteria:** All assertions defined in `task_description` pass. **Fail action:** Log which assertions failed, NTFY user with details. + +--- + +## Promotion Gate + +When all four levels pass, the following checklist must be completed before the automation goes to production. 
+ +- [ ] All 4 test levels logged as `pass` in `agent_test_results` +- [ ] NTFY notification sent to user with test summary +- [ ] **User reviews and approves** (NTFY → user replies or confirms in next session) +- [ ] For `n8n_automation`: all sandbox credentials re-pointed to production equivalents (see `project_sandbox_workflow_credential_rule.md`) +- [ ] For `claude_agent`: production paths/URLs substituted for sandbox paths +- [ ] For `script`: production env vars set in Coolify; no hardcoded sandbox values +- [ ] Production deployment verified (Level 2 re-run against production) +- [ ] `automation_ideas` status updated to `deployed` +- [ ] `agent_test_results` promotion record written + +**Promotion is not automatic.** User approval is required after Level 4 pass. This is the v1.0 rule — can be relaxed to auto-promote for specific low-risk automation types after track record is established. + +--- + +## Test Result Storage + +All test results write to `agent_test_results` table (to be created in `api_business` DB). 
+ +**Required schema:** +```sql +CREATE TABLE agent_test_results ( + id SERIAL PRIMARY KEY, + automation_id INTEGER NOT NULL REFERENCES automation_ideas(id), + test_level INTEGER NOT NULL CHECK (test_level BETWEEN 0 AND 4), + status TEXT NOT NULL CHECK (status IN ('pass', 'fail', 'skip')), + error_message TEXT, + execution_log TEXT, + tested_at TIMESTAMPTZ NOT NULL DEFAULT now(), + promoted_at TIMESTAMPTZ, + notes TEXT +); +``` + +`test_level = 0` is not one of the four test levels: it records unhandled builder exceptions, which the N8N builder's error handling logs with `test_level=0, status='fail'`. + +--- + +## NTFY Notification Patterns + +| Event | Topic | Message format | +|---|---|---| +| Level fail | `homelab-alerts` | `[AGENT TEST FAIL] {name} — Level {N}: {error excerpt}` | +| All levels pass | `homelab-alerts` | `[AGENT TEST PASS] {name} — ready for your review and promotion` | +| Promotion complete | `homelab-alerts` | `[AGENT DEPLOYED] {name} — now live in production` | + +--- + +## Known Failure Modes + +*(Updated as new failures are discovered during testing)* + +| ID | Type | Failure | Root cause | Fix | +|---|---|---|---|---| +| — | — | None yet — first test session will populate this | — | — | + +--- + +## Self-Evolution Instructions + +After every test session: +1. Add any new failure mode to the Known Failure Modes table with ID, type, failure, root cause, and fix +2. If a Level assertion was too loose (passed but shouldn't have) or too strict (failed but should have passed), update the assertion pattern for that level and type +3. Increment the version number in the frontmatter +4. Note the date and what changed at the bottom of this file + +**Change log:** +- v1.0 (2026-06-16): Initial methodology — four levels, user-approval promotion gate, NTFY notifications, self-evolution rule
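The run-in-order rule from the four test levels (stop and log at the first level that fails) can be sketched as a small runner. This is an illustration under stated assumptions, not the project's implementation: `run_levels`, `LevelResult`, and the stand-in check functions are hypothetical names, and each check is assumed to signal failure by raising an exception. The result fields mirror the `test_level`, `status`, and `error_message` columns of `agent_test_results`; writing them to the database is left out.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class LevelResult(NamedTuple):
    test_level: int
    status: str                   # 'pass' or 'fail', as in the status CHECK
    error_message: Optional[str]  # None on pass


def run_levels(checks: Dict[int, Callable[[], None]]) -> List[LevelResult]:
    """Run checks in ascending level order and stop at the first failure."""
    results = []
    for level in sorted(checks):
        try:
            checks[level]()  # assumed contract: raise on failure
        except Exception as exc:
            results.append(LevelResult(level, "fail", str(exc)))
            break  # later levels are not attempted
        results.append(LevelResult(level, "pass", None))
    return results


def passing_check() -> None:
    """Stand-in for a level that succeeds."""


def failing_smoke_check() -> None:
    """Stand-in for a Level 3 smoke test that ends in an error status."""
    raise RuntimeError("execution status was 'error'")


results = run_levels({
    1: passing_check,
    2: passing_check,
    3: failing_smoke_check,
    4: passing_check,
})
for row in results:
    print(row)
```

Because the loop breaks on the first failure, Level 4 never runs in this example, so no row is produced for it. Whether skipped levels should instead be recorded with status `skip` is a design choice the playbook leaves open.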