Stop Babysitting Your Agents

Verification Loops, Parallel Agents, and Background Routines (Claude Code 301)

Original source: https://x.com/i/status/2066969072818872627

Summary

As models get smarter, many of us are spending more time staring at screens, waiting for agents to finish, or acting as glorified QA testers. This talk (a "Claude Code 301" level session) gives practical strategies to stop babysitting and let agents do more of the heavy lifting autonomously.

The core idea is to move from "holding the agent's hand" to giving it complete, self-contained verification loops so it can check its own work, debug failures, and iterate until success.

Key themes:

Verification: Teach the agent how a human would verify the task (build → run → inspect side effects such as screenshots, logs, and DB state → run tests).
Packaging as Skills: Turn repeatable verification processes into reusable, self-improving SKILL.md files that the whole team (and future you) can use.
Multi-Clauding: Once agents are reliable at verification, you can confidently run many agents in parallel on different tasks.
Background Loops & Routines: Use /loop and remote Routines so agents keep working even when your laptop is closed — completely removing yourself from the hot path for routine work (PR babysitting, docs updates, triage, CI hygiene, etc.).

Prerequisites mentioned: a high-quality CLAUDE.md, connected tools (Slack, Linear, etc.), and a remote environment via Claude Code web.

Full Transcription

Below is the full transcription of the talk for reference.

how you can stop babysitting your agents.

As models have been getting smarter, I've noticed that we're increasingly spending a larger percentage of our time staring at the screen, waiting for Claude to finish its work, or just acting as a glorified QA tester for Claude.

And this can be quite unsatisfying and also just an inefficient use of your time.

And my goal for this talk is to give you strategies and help you take back some of this time so that you can manage your agents better.

You could also think of this as a more advanced Claude Code talk, so a Claude Code 301-type university class.

And because of that, we have some prerequisites and some table stakes that everyone here should have at least heard about if not implemented for your own projects, starting with a very high-quality CLAUDE.md file.

This is the single highest leverage thing that you can do to improve your Claude Code experience.

So if you haven't done this yet, I highly encourage you to try it out.

Number two is connecting your tools to Claude Code.

A good rule of thumb is that if a tool is useful for you in your day-to-day life, it will also be useful for Claude.

So things like Slack, Asana, Linear, Datadog, BigQuery, all of these things help Claude stitch together a much richer context for itself.

And it's able to perform much better if you give it access to these tools.

And finally, setting up your remote environment on Claude Code web.

This makes it so that the compute that's running your Claude Code is separated or decoupled from your laptop.

So you can close your laptop.

Your laptop could die.

You could spill some water on your laptop.

And your Claude Code sessions will still continue because they're running in the cloud.

I'd love to see a show of hands here.

How many people use Claude Code every day?

OK, that's almost everyone.

How many people have completed the first two things here?

So high-quality CLAUDE.md, and you've connected your tools?

OK, so about 50%, I'd say.

And then how many people have done all three?

OK, if you haven't raised your hand at all, don't worry.

You'll still get some value out of this talk.

But I would encourage you to start with these three things first.

OK, so why does your tooling need to change?

Most software tooling so far was built with humans in mind.

Whether it's linters, IDEs, Prettier, type checkers, even compilers, they were mostly written with the goal of making humans and human teams faster.

But the problem now is that humans aren't writing most of our code anymore.

It's agents.

So we have to take a step back, zoom out, and reconsider our tooling.

And when you do that, there's some good news, and then there's some bad news.

The good news is that a lot of these tools that we've built for ourselves translate over pretty well for agents as well.

So things like Prettier, and linters, and symbol servers, Claude and other agents can end up using these things quite effectively, and they serve them pretty well.

But the bad news is that we also have blind spots.

As human beings, we have some assumptions that we make about our tooling and our tool chain that Claude doesn't have.

And for that reason, it's important to ask the question, what does an agent need from your code base that a human takes for granted?

And I'd love for you guys to keep that question in mind as we continue to the rest of the talk, because it kind of frames the goal of not babysitting your agents as much in a much more clear way.

So this is our roadmap for today.

We'll be talking about three distinct things that build on top of each other.

And when you take all of these three things together, they become incredibly powerful and give you a set of tools that can help you work in a way that we just haven't worked before as human beings.

So we'll be talking about verification, which is how to teach Claude to check its own work.

Once Claude can check its own work and be more reliable, we can now run many Claudes at the same time and be confident that they'll be doing the right thing.

So we'll be talking about strategies for multi-Clauding or parallelizing your work.

And then finally, we'll end with background loops.

And background loops are a way for you to completely take your keyboard out of the hot path.

So your keyboard is not the bottleneck anymore, and Claude just keeps running in the background in a loop, doing useful work for you.

So I'd like to start the verification section with a brainstorm for a minute or so.

I'd like everyone here to think about the last software project or feature that you worked on.

And while you were working on that feature, how did you check your own work?

And I don't just mean, how did you check the final output of your work, but I also mean, how did you iterate on your work in a way that gave you confidence that you will end up in a place where you're expecting to go?

So let's take 30 seconds.

If you have a pen and paper in front of you, feel free to jot this down.

If you have a laptop and you want to put this in your notes, let's take 30 seconds together and just come up with your last project and how you verified your work there.

OK. I see some typing slowing down.

So hopefully, you've had a chance to think about it a little bit.

It's OK if you haven't completely.

But I've found that most software engineering tasks can be broken down into the series of steps that you see on the screen.

Some combination or sequence or subset of these things enable you to check your own work and build software.

So you start with designing and writing code.

You then usually end up building your code, running your compilers, type checkers, et cetera.

If they fail, you go back and change your code again and do that in a loop.

Then you might run your executable, whether that's a Docker container or a CLI application or a web server.

And then you might check for side effects.

So if you're running a web server, you might spin up your browser.

And you might see if the UI elements are showing up in the correct place.

You might even look at the logs to see whether a specific log line that you're looking for is present.

Or you might check the database to see what the state is and if state has been manipulated correctly.

And then hopefully, you run unit tests to make sure that you haven't made any regressions and your feature hasn't broken some other feature.

And hopefully, you also add a new unit test for the thing that you're working on.

And then finally, you deploy to staging.

Or if you're really brave, you go straight to prod.

And that's usually how humans verify their work and build software.
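As a rough sketch of the playbook just described (not something shown in the talk), the steps can be modeled as an ordered list of commands that stops at the first failure, which is the signal to go back and change the code. The step names and the stand-in commands below are hypothetical; in a real project they would be your own build, test, and deploy commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    output: str


def run_steps(steps: list[tuple[str, list[str]]]) -> list[StepResult]:
    """Run each (name, argv) step in order and stop at the first failure."""
    results: list[StepResult] = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        ok = proc.returncode == 0
        results.append(StepResult(name, ok, proc.stdout + proc.stderr))
        if not ok:
            break  # later steps are meaningless until this one passes
    return results


# Hypothetical stand-ins for a real build, test run, and staging deploy.
PY = sys.executable
demo_steps = [
    ("build", [PY, "-c", "print('build ok')"]),
    ("unit tests", [PY, "-c", "import sys; print('1 test failed'); sys.exit(1)"]),
    ("deploy to staging", [PY, "-c", "print('deployed')"]),
]

if __name__ == "__main__":
    for result in run_steps(demo_steps):
        print(f"{result.name}: {'pass' if result.ok else 'FAIL'}")
```

The captured output of the failing step is what a human (or an agent) would read before editing the code and running the steps again.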

And what's interesting is that this same exact playbook can be used by Claude quite effectively to also verify its own work and build software.

So as we go through the rest of this presentation, it's helpful to think about teaching Claude how to do things in a similar way that you would do them.

And the only thing that's required is giving Claude the right tools and instruction set to make this possible.

OK, so we've talked about verification, how humans do verification, and how Claude should theoretically do verification.

But loops are really what makes the whole thing go around.

And this is arguably the most important slide in this presentation.

So if you haven't been paying attention yet, this is a good time to get started.

A loop essentially is an autonomous circuit that you can complete for Claude.

And it allows Claude to hill climb on a given task or a given success criteria.

So you can think about it as giving Claude access to tools to verify its own work and to write code.

And what Claude will do is it will write some code.

It will check if there's a failure.

If there's a failure, it will debug that failure and write some more code.

And then it keeps doing that in a loop again and again and again until it gets to a success state.

And when it finally gets to a success state, you can be confident that the PR that it's sending you is higher quality and will actually work.
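The loop described here can be sketched in a few lines. This is an illustration of the control flow only, under the assumption that verification and fixing can each be expressed as a function; `hill_climb`, `verify`, and `fix` are hypothetical names, and in practice the agent itself performs both roles with its tools.

```python
from typing import Callable, Optional


def hill_climb(
    verify: Callable[[], Optional[str]],
    fix: Callable[[str], None],
    max_iterations: int = 5,
) -> bool:
    """Alternate verify and fix until verify reports no failure.

    verify() returns None on success, or a description of the failure.
    fix(failure) stands in for the agent debugging and editing code.
    """
    for _ in range(max_iterations):
        failure = verify()
        if failure is None:
            return True
        fix(failure)
    # One last check after the final fix attempt.
    return verify() is None


# Toy stand-in for an app whose button handler is missing.
app = {"signup_handler_wired": False}


def verify_signup() -> Optional[str]:
    if not app["signup_handler_wired"]:
        return "clicking Sign Up does nothing"
    return None


def fix_signup(failure: str) -> None:
    app["signup_handler_wired"] = True


if __name__ == "__main__":
    print("success" if hill_climb(verify_signup, fix_signup) else "gave up")
```

The iteration cap is a design choice added for the sketch: without it, a check that can never pass would keep the loop running forever.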

So in this image that you see on the screen, I faced an issue recently where on my personal website, the Sign Up button stopped working.

And what I told Claude was to make the Sign Up button work.

And this is kind of what it did.

There's more steps here too, but for brevity's sake, it basically started writing some code.

It built my app.

It clicked my Sign Up button, opened up a browser, and saw that clicking the Sign Up button isn't really doing anything.

It doesn't take you anywhere.

So then it decided to read some logs.

And it found out what the problem was.

It fixed the code, reloaded the app, and kept doing that until it got to a successful state.

And finally, what it came up with was a PR that indeed worked.

So the most important thing to take away from this slide is that wherever possible, our goal now is to get Claude into a loop by giving it the tools and instructions that are required for it to work effectively.

So verification comes in many flavors.

We talked about UX verification.

But you can have back-end verification.

You may want to verify your entire app end-to-end, including infra.

And the core concept here remains the same.

You want to give Claude the tools and instructions to get it into a loop.

And once you figure that piece out, all three of these flavors merge into one.

You don't have to be very specific about the instructions you give Claude.

As long as it has all the right tools and instructions, it'll be able to verify all of these things.

So we've talked a lot about theory.

And we've talked a lot about hypotheticals and jargon.

But I wanted this slide to be a little bit more concrete.

So what does it actually mean to give Claude the instructions and the tools to make it go in a loop?

And it usually boils down to four things.

And I'll go through the front-end or UX section from this slide.

The first thing is to run your application.

So for a front-end application or a front-end verification loop, this might correspond to running your dev server.

So running npm run start or whatever your dev server might be, it just spins up a dev server.

Once the dev server is up, you want Claude to actually use the web server.

And the way it does that is by opening up a browser.

My personal MCP tool of choice for this is the Claude in Chrome MCP tool.

You can access this with /chrome if you're using Claude Code.

You can also use Playwright or there's a bunch of other browser-control MCPs that you can use to do that.

Once Claude can drive your browser, the next step is to prove that something works.

So if it's a fix it's working on, you want to take a screenshot before the fix and after the fix and make sure that it's the right state.

And finally, there's unblocking it.

So if you've ever tried to create a verification loop in a production app, you'll very quickly find that there are some blockers you run into.

And some of the common blockers are, for example, auth and state.

So auth basically means you want to give Claude an identity that it can use to log in to your web application so it can actually start to use your app.

And then state means you may want to pre-configure some state.

For example, if you have an e-commerce store, you may want to populate the inventory for that store for Claude to be able to use your app meaningfully.

And this isn't very novel.

In fact, in traditional software engineering too, when you write end-to-end tests, writing these state setup scripts is quite common.

The only difference here is that you want to give Claude access to these scripts and you want to make them dynamic.

You don't want to be too prescriptive about what these scripts are doing.

And that allows Claude to do a much wider variety of things than you can do with static scripts.
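One way to read "dynamic" here is a setup script that takes parameters instead of hard-coding a single fixture. The sketch below assumes the e-commerce example from the talk; the field names (`sku`, `price_cents`, `stock`) and the command-line flags are hypothetical, and a real script would write to your actual datastore rather than print JSON.

```python
import argparse
import json
import random


def build_inventory(count: int, out_of_stock: int = 0, seed: int = 0) -> list[dict]:
    """Return `count` hypothetical products, the first `out_of_stock` with zero stock."""
    if not 0 <= out_of_stock <= count:
        raise ValueError("out_of_stock must be between 0 and count")
    rng = random.Random(seed)  # seeded so repeated runs produce the same state
    items = []
    for i in range(count):
        items.append({
            "sku": f"SKU-{i:04d}",
            "price_cents": rng.randrange(100, 10_000),
            "stock": 0 if i < out_of_stock else rng.randrange(1, 50),
        })
    return items


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Emit seed inventory as JSON")
    parser.add_argument("--count", type=int, default=5)
    parser.add_argument("--out-of-stock", type=int, default=0)
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args(argv)
    inventory = build_inventory(args.count, args.out_of_stock, args.seed)
    print(json.dumps(inventory, indent=2))


if __name__ == "__main__":
    main()
```

Because the scenario is controlled by flags, an agent verifying a "sold out" badge can ask for out-of-stock items, while one verifying checkout can ask for a fully stocked store, without anyone writing a new fixture.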

So now we know what a verification loop is.

We know how to write one.

How do you package it?

How do you distribute the script to your colleagues, to your co-workers, even to your future self?

And one of the best ways of doing this is by using a skill.

You can think of a skill as just a way to store some arbitrary context about a specific topic.

And in this case, that topic happens to be a verification loop.

The interesting thing about skills also is that you can make them self-improving.

So if you put in instructions into your skill about improving the skill every time Claude hits a blocker, you will end up creating this self-documenting, self-improving skill, which everyone on your team can contribute to, not just you.

And this makes it really powerful.

This is actually how we do verification in the Claude Code team as well.

We have a single verification skill.

And the skill is explicitly told to keep documenting itself.

So every time someone runs into a blocker, the skill will go back in and edit itself so that next time when you or your colleague run into the same issue, it's not a problem.
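In the talk the skill edits itself because its instructions tell the agent to do so; no helper code is involved. Purely to illustrate the mechanics of "document each blocker once", here is a minimal sketch that appends a note to a skill file and skips symptoms already listed. The `record_blocker` function and the "Known blockers" section name are hypothetical, and the sketch assumes that section is kept at the end of the file.

```python
from pathlib import Path

BLOCKERS_HEADING = "## Known blockers"  # hypothetical section name


def record_blocker(skill_path: Path, symptom: str, workaround: str) -> bool:
    """Append a blocker note unless the symptom is already listed.

    Returns True when the file was changed.
    """
    text = skill_path.read_text(encoding="utf-8") if skill_path.exists() else ""
    if f"- {symptom}:" in text:
        return False  # already documented; avoid duplicate entries
    if BLOCKERS_HEADING not in text:
        separator = "\n\n" if text.strip() else ""
        text = text.rstrip("\n") + separator + BLOCKERS_HEADING + "\n"
    # Assumes the blockers section is the last one, so appending lands inside it.
    text = text.rstrip("\n") + "\n" + f"- {symptom}: {workaround}" + "\n"
    skill_path.write_text(text, encoding="utf-8")
    return True
```

The duplicate check matters on a shared skill: several people hitting the same blocker should leave one entry, not several.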

OK, so we're going to jump into a demo next.

But before the demo, I want to talk about the application that I'm going to be using.

There is a type tester application called Monkeytype.

How many of you have heard of Monkeytype?

OK, I thought so.

It's a niche community.

But it's basically a type tester where it shows you a bunch of words, as you can see.

And you have to type those words as accurately and as fast as possible.

And the application just tracks your stats for you.

I like this as a demo app because it is representative of a real-world full-stack app.

It's written in TypeScript with an Express backend and MongoDB and Redis as persistence layers.

And it's open source.

So you guys can go to monkeytype.com right now.

You can even check out the source code if you want.

But what we'll be doing in this demo is we'll be creating a verification loop live.

So we'll tell Claude to spin up a new dev server.

We'll tell it to go and use the Chrome MCP to check some of its work.

And then once we create the verification skill, we'll also create a new feature and ask Claude to use the verification skill to verify itself.

So let's get started with the demo.

So we can switch over to my laptop screen.

OK, so this is a brand new Claude Code session.

I've already done the homework of setting up Monkeytype locally.

I've also installed some dependencies and curated a CLAUDE.md because I didn't want to do that in front of you guys and waste your time.

So let's tell Claude to spin up the dev server.

OK, so it says the dev server is already running.

And that's right, because I started it right before our talk.

And let's go and check out what's on the front end.

So if we go here, Monkeytype opens up.

I can start typing.

And there's a little timer that shows up.

I'm not very good at typing, so there's a lot of typos here.

But it's essentially what I would expect.

Let's also check out the back end link.

This just returns a JSON.

And it just basically means that the back end is up and running, which is good.
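The manual check above (open the back end link, see JSON, conclude the server is up) is easy to turn into something an agent can run itself. A minimal sketch using only the standard library follows; `wait_for_json` is a hypothetical helper, and the talk does not give the back end URL, so whatever URL you pass is your own.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Optional


def wait_for_json(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> dict:
    """Poll `url` until it answers with a JSON object, then return it."""
    deadline = time.monotonic() + timeout_s
    last_error: Optional[Exception] = None
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s + 1) as response:
                payload = json.loads(response.read().decode("utf-8"))
            if isinstance(payload, dict):
                return payload
            last_error = ValueError("response was JSON but not an object")
        except (urllib.error.URLError, ValueError, OSError) as exc:
            last_error = exc  # server not up yet, or body not JSON; retry
        time.sleep(interval_s)
    raise TimeoutError(f"{url} not healthy after {timeout_s}s: {last_error}")
```

Polling with a deadline, rather than checking once, covers the common case where the dev server was started a moment ago and is still booting.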

The next thing I'm going to do is I'm going to make sure that my Chrome MCP is enabled.

And the way you do that is just /chrome.

And as you can see here, it says status enabled, extension installed, which is exactly what we're looking for.

If you don't have it installed, it'll take you to the setup guide, and you can install it for yourself.

And now I'm going to say use the Chrome MCP to make sure that the front end is working.

Make it quick, please.

And what we should see now is that this is the tab that Claude is using.

And it should call the Chrome MCP tool.

So if you go back here, we can see two Chrome MCP tool calls.

I can Control-O and see exactly what it did.

So it navigated to localhost:3000, and then it's looking at the contents of the tab, which is great.

But we want to do something more exciting.

Just looking at a static web page isn't very helpful.

So let's say, before I do that, I'm going to resize these so you guys can see what's happening in the background.

Can you try typing and make sure everything works?

So Claude, apparently, is also not very good at typing.

But it typed in something, and it says that typing works.

That's great.

Let's do one more thing.

Let's say, can you also use the settings and change something?

So it navigated to the Settings page, and it's changing the difficulty to Expert.

Not a good idea, based on how it performed.

And it claims that the setting has persisted, and it's able to verify that.

So that's great.

What we did so far is we just held Claude's hand and told it exactly what to do.

So we were like, spin up the dev server, go and do these two or three things that we care about.

And that's basically verification.

What I can do next is I can tell Claude to take all the learnings from this session and put it into a skill file.

So I can say, take everything we learned and put it into a skill file in .claude demo verification.

I didn't have to give it the full path, but I chose to anyway.

OK, let's see.

It wants to create a new directory.

OK, so it's now proceeding to write a fairly large SKILL.md file.

And if you look at what's inside this file, we'll just skim through it real quick.

It says, number one, bring up the stack, which is basically what we did.

It has some commands to do that.

...[and the rest of the talk continues with the demo, multi-agent strategies, background loops with /loop and Routines, and wrapping up]...

Key Concepts

Prerequisites / Table Stakes

High-quality CLAUDE.md (highest leverage item)
Connect everyday tools (Slack, Linear, Asana, Datadog, etc.) so the agent has rich context
Run Claude Code in the remote web environment (so sessions survive laptop death/closure)

Verification Loops

Teach the agent the full human verification playbook:

Write code → Build / type-check / lint
Run the app (dev server, Docker, etc.)
Inspect side effects (browser, logs, database, UI screenshots)
Run tests (including new tests for the change)
Deploy / check staging

The goal is to close the loop so the agent can "hill climb" until the success criteria are met.

Packaging Verification as Skills

Store repeatable verification processes in SKILL.md files. Make the skill self-improving by instructing it to update itself whenever it hits a new blocker. This creates living documentation that the whole team benefits from.

Multi-Clauding (Parallel Agents)

Once verification is reliable, you can safely spin up multiple agents working on different tasks at the same time.
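The reason verification comes first is that parallel work is only trustworthy if each task reports whether its own checks passed. A minimal sketch of that idea, assuming each task can be represented as a function returning True when its verification succeeded; `run_in_parallel` and the task names are hypothetical, and this is not how Claude Code schedules agents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_in_parallel(
    tasks: dict[str, Callable[[], bool]], max_workers: int = 4
) -> dict[str, bool]:
    """Run independent tasks concurrently and collect each one's verification result."""
    results: dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                results[name] = False  # a crashed task counts as unverified
    return results


if __name__ == "__main__":
    outcome = run_in_parallel({
        "fix sign-up button": lambda: True,
        "update docs": lambda: True,
    })
    print(outcome)
```

Only the tasks that come back False need human attention, which is what makes running many at once practical.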

Background Loops & Routines

Use /loop <interval> <prompt> and remote Routines (time-based or event-based) so agents keep working on PR babysitting, docs updates, triage, CI hygiene, etc., even when you're away from the keyboard.
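Conceptually, an interval loop is "run this task, wait, run it again". The sketch below illustrates only that concept; it is not how /loop or Routines are implemented, and `run_every` with its parameters is hypothetical. The sleep function is injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable


def run_every(
    interval_s: float,
    task: Callable[[], None],
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `task` `max_runs` times, sleeping `interval_s` between runs; return the run count."""
    runs = 0
    while runs < max_runs:
        task()
        runs += 1
        if runs < max_runs:
            sleep(interval_s)  # no sleep after the final run
    return runs


if __name__ == "__main__":
    # Hypothetical task: in practice this would be a prompt such as checking on open PRs.
    run_every(0.1, lambda: print("checking PRs"), max_runs=3)
```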
