Studio 09 · The Final Test Bench

Stress-test it before you ship.

The last studio. You'll write a few inputs the agent must handle correctly, a few it must refuse, and a small first-week ritual to catch surprises. Then you'll earn the last sticker.

🤖

Pip thinks…

Testing is the unglamorous part. Almost nobody does it. The agents that survive past month 3 are the ones whose authors spent a calm hour writing tests before wiring anything live.


Golden examples

3-5 real input cases where you already know the correct output. These are the agent's exam. If it passes them, it earns the right to run.

→ Use real data shapes from your business, not invented ones.

The evil twin test

Try to break it on purpose. Missing fields, corrupted data, weird edge cases, holidays. If the agent doesn't refuse cleanly, your guard isn't strong enough.

→ The goal isn't to pass these. The goal is to fail correctly.


The dry-run

Run the agent for a week with the "post" step pointing at a private channel only you can see. Check every output. Don't go live until at least 5 outputs in a row are ones you'd have sent yourself.

→ A week of dry-run saves a month of cleanup.
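One way to picture the dry-run is a single switch that decides where the "post" step sends its message, plus a small counter for the "5 in a row" rule. The sketch below is only an illustration: the channel names, the `PostConfig` class, and the `send` callback are all hypothetical, and your own tooling will look different.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PostConfig:
    dry_run: bool = True                      # flip to False only after the dry-run week
    live_channel: str = "#team-daily"         # hypothetical channel name
    private_channel: str = "#agent-dry-run"   # hypothetical channel only you can see


def target_channel(config: PostConfig) -> str:
    """Pick the private channel while dry_run is on, the live one otherwise."""
    return config.private_channel if config.dry_run else config.live_channel


def post(message: str, config: PostConfig, send: Callable[[str, str], None]) -> str:
    """Send the message to the chosen channel and return that channel's name."""
    channel = target_channel(config)
    send(channel, message)
    return channel


def ready_to_go_live(reviews: List[bool], needed: int = 5) -> bool:
    """True if the most recent `needed` outputs were all ones you'd have sent yourself."""
    streak = 0
    for approved in reversed(reviews):
        if not approved:
            break
        streak += 1
    return streak >= needed
```

Here `reviews` is your own record of the dry-run, oldest first: `True` for an output you would have sent, `False` for one you would not. A single `False` resets the streak, which matches the "in a row" wording.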

The first-week ritual

For the first 5 working days after launch: read every output. Note anything that surprised you. After day 5, decide: keep, fix, or kill.

→ Most "the agent went rogue" stories happen because nobody read week 1.



Write your tests

Three small exercises, and you're done.

Auto-saved. Exportable. Pair it with your spec and instructions to hand off a complete build packet.

Exercise 01

Three golden examples

Write 3 real input cases where you know the correct output. Be specific. Use real data shapes.

Why it matters → These are your regression tests. Re-run them every time you change the instructions.

Example → Golden 1. Input: yesterday €12,400 / 87 orders / AOV €142. LY same day €11,200 / 92 / €121. Expected output: "Yesterday: €12,400 revenue (+10.7% vs LY)..." (Add two more, same shape.)

If you can't write three real ones, you don't know the task yet. That's a finding, not a failure.
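If you keep your golden examples as data, re-running them after every instruction change takes seconds. Here is a minimal sketch built on the Golden 1 numbers. The `summarize` function is a hypothetical stand-in for your agent, and because the expected output in the example is truncated with "...", the check only compares the part that is written out.

```python
def summarize(today_revenue: float, ly_revenue: float) -> str:
    """Hypothetical stand-in for the agent: formats revenue and change vs last year."""
    change = (today_revenue - ly_revenue) / ly_revenue * 100
    return f"Yesterday: €{today_revenue:,.0f} revenue ({change:+.1f}% vs LY)"


# Each golden example pairs a known input with the text the output must start with.
GOLDEN = [
    {
        "name": "Golden 1",
        "input": {"today_revenue": 12_400, "ly_revenue": 11_200},
        "expected_prefix": "Yesterday: €12,400 revenue (+10.7% vs LY)",
    },
    # Add two more, same shape.
]


def run_golden(cases, agent):
    """Return the names of the golden examples the agent fails."""
    failures = []
    for case in cases:
        output = agent(**case["input"])
        if not output.startswith(case["expected_prefix"]):
            failures.append(case["name"])
    return failures


if __name__ == "__main__":
    failed = run_golden(GOLDEN, summarize)
    print("all golden examples pass" if not failed else f"failed: {failed}")
```

The percentage in Golden 1 checks out: (12,400 − 11,200) / 11,200 ≈ 10.7%. Doing that arithmetic by hand before you write the expected output is part of knowing the task.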

Exercise 02

Three evil-twin inputs

Write 3 inputs the agent should refuse cleanly. For each, write the exact refusal message you'd want to see.

Why it matters → This is how you make sure the guard works. Without these, the guard is theoretical.

Example → Evil 1. Input: yesterday row count is 12% of normal (data load failed). Expected output: "Data incomplete (only 12% of usual rows). Skipping today, will retry at 09:00. Please check pipeline." (Add two more, e.g. holiday, null revenue.)

The right number of refusal cases is "a couple more than you think." Add one weird one.
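A guard can be as small as one function that returns a refusal message or nothing at all. The sketch below reproduces the Evil 1 refusal word for word. Everything else is an assumption made for illustration: the 50% threshold, the function name and parameters, and the wording of the other two refusals.

```python
from typing import Optional

MIN_ROW_RATIO = 0.5  # assumed threshold; choose one that fits your own data


def guard(row_count: int, usual_rows: int, revenue: Optional[float]) -> Optional[str]:
    """Return a refusal message if the input is unsafe to report on, else None."""
    if usual_rows <= 0:
        # Assumed wording: without a baseline we cannot judge completeness.
        return "No baseline row count available. Skipping today. Please check pipeline."
    ratio = row_count / usual_rows
    if ratio < MIN_ROW_RATIO:
        # Wording taken from the Evil 1 example.
        return (
            f"Data incomplete (only {ratio:.0%} of usual rows). "
            "Skipping today, will retry at 09:00. Please check pipeline."
        )
    if revenue is None:
        # Assumed wording for the null-revenue case.
        return "Revenue is missing. Skipping today. Please check pipeline."
    return None  # input looks safe; the agent may proceed
```

Each evil-twin input then becomes a one-line check: call `guard` with the bad input and compare the result with the exact refusal you wrote down.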

Exercise 03

Your first-week ritual

For the first 5 working days after launch, what will you check? Write it as a daily checklist.

Why it matters → Most "the agent went rogue" stories happen because nobody read week 1. Make the ritual concrete and put it on your calendar.

Example → Day 1: read the output within 30 min of it posting. Note anything that surprised me. Days 2-5: same, and compare today's output to yesterday's. End of day 5: decide keep/fix/kill. Write a one-paragraph review.

Block 10 minutes a day in your calendar for week 1. After that, weekly is plenty.

Your test plan

Saved automatically. Combined with your spec (Studio 07) and your instructions (Studio 08), you now have a complete build packet for one real agent.


The closing letter

You've designed an agent.

Most people who start a guide like this never finish it. You did. Here's the only thing left to say.

A working agent is a scheduled task with a clear input, a guard, and one output channel. Everything else is decoration.

🤖

Pip's final note…

You walked through the whole workshop. You have a spec, instructions, and tests for one real agent. Now go build it. Start small. Run it in draft mode. Read every output for a week. The boring ones win. I'll be at the bench when you're ready for the next one.

You finished the whole workshop

Agent Builder

The final sticker. The whole set.