Discussion about this post

User's avatar
Pawel Jozefiak's avatar

The "bounded browser task" framing is useful - it's exactly where agents stop failing. When you scope the job tightly enough that there's no ambiguity about what "done" looks like, they actually finish things.

I've been running similar autonomous tasks overnight: research, writing drafts, deploying small updates while I sleep. The failure mode is almost always scope creep, not capability: https://thoughts.jock.pl/p/building-ai-agent-night-shifts-ep1

The Ofsted case works because the data has structure. What happened with edge cases - reports in non-standard formats, or schools with partial inspection histories?

No posts

Ready for more?