What I learned building an opinionated and minimal coding agent
verdverm
Glad to see more people doing this!
I built on ADK (Agent Development Kit), which comes with many of the features discussed in the post.
Building a full, custom agent setup is surprisingly easy and a great learning experience for this transformational technology. Getting into instruction and tool crafting was where I found the most ROI.
xcodevn
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
One thing I do find is that subagents are helpful for performance -- offloading tasks to smaller models (gpt-oss specifically for me) gets data to the bigger model quicker.
sghiassy
I always wonder what type of moat systems / business like these have
keyle
None, basically.
bschwarz
The only moat in all of this is capital.
mellosouls
It's open source. Where does it say he wants to monetise it?
I hadn't realized that Pi is the agent harness used by OpenClaw.
jeffrallen
As a user of a minimal, opinionated agent (https://exe.dev) I've observed at least 80% of this article's findings myself.
Small and observable is excellent.
Letting your agent read traces of other sessions is an interesting method of context trimming.
Especially, "always Yolo" and "no background tasks". The LLM can manage Unix processes just fine with bash (e.g. ps, lsof, kill), and if you want you can remind it to use systemd, and it will. (It even does it without rolling its eyes, which I normally do when forced to deal with systemd.)
Something he didn't mention is git: talk to your agent a commit at a time. Recently I had a colleague check in his minimal, broken PoC on a new branch with the commit message "work in progress". We pointed the agent at the branch and said, "finish the feature we started" and it nailed it in one shot. No context whatsoever other than "draw the rest of the f'ing owl" and it just.... did it. Fascinating.
zby
Pi probably has the best architecture, and being written in JavaScript, it is well positioned to use the browser sandbox architecture that I think is the future for AI agents.
“standardize the intersection, expose the union” is a great phrase I hadn’t heard articulated before
genie3io
[dead]
yosefk
"Also, it [Claude Code] flickers" - it does, doesn't it? Why?.. Did it vibe code itself so badly that this is hopeless to fix?..
falloutx
Claude Code programmers are very open that they vibe code it.
grim_io
Because they target a 60 fps refresh, with 11 of the 16 ms per-frame budget being wasted by React itself.
They are locked in this naive, horrible framework that would be embarrassing to open source even if they had the permission to do it.
valleyer
> If you look at the security measures in other coding agents, they're mostly security theater. As soon as your agent can write code and run code, it's pretty much game over.
At least for Codex, the agent runs commands inside an OS-provided sandbox (Seatbelt on macOS, and other stuff on other platforms). It does not end up "making the agent mostly useless".
beacon294
My Codex just uses Python to write files around the sandbox when I ask it to patch an SDK outside its path.
maleldil
Does Codex randomly decide to disable the sandbox like Claude Code does?
lvl155
You really shouldn’t be running agents outside of a container. That’s 101.
chr15m
Approval should be mandatory for any non-read tool call. You should read everything your LLM intends to do, and approve it manually.
"But that is annoying and will slow me down!" Yes, and so will recovering from disastrous tool calls.
xXSLAYERXx
I'm trying to understand this workflow. I have just started using Codex. Literally 2 days in. I have it hooked up to my GitHub repo and it just runs in the cloud and creates a PR. I have it touching only UI and middle-layer code. No DB changes; I always tell it to not touch the models.
jFriedensreich
I don't know how to feel about being the only one refusing to run YOLO mode until the tooling is there, which is still about 6 months away for my setup. Am I years behind everyone else by then? You can get pretty far without completely giving in. Agents really don't need to execute that many arbitrary commands: linting, search, edit, and web access should all be bespoke tools integrated into the permission and sandbox system. Agents should not even be allowed to start and stop applications that support dev mode; they edit files, can run tests, and can read the logs. What else would they need to do? Especially as the number of external dependencies that make sense shrinks to a handful, you can approve every new one without headache. If your runtime supports sandboxing and permissions, like Deno or workerd, this adds an initial layer of defense.
This makes it even more baffling that Anthropic went with Bun, a runtime without any sandboxing or security architecture, and will rely on Apple Seatbelt alone.
WhyNotHugo
You use YOLO mode inside some sandbox (VM, container). Give the container only access to the necessary resources.
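For instance, a container run along these lines (a sketch only: the image name, resource limits, and how pi is installed in the image are all assumptions, not anything from the article):

```shell
# Only the project directory is writable; everything else lives in the
# throwaway container. Note the agent still needs network access to
# reach the LLM API, so fully dropping networking is not an option
# unless you proxy the API through a separate channel.
docker run --rm -it \
  --memory 2g --cpus 2 \
  -v "$PWD":/work -w /work \
  my-agent-image \
  pi -p "Fix the failing tests"
```

The trade-off is that anything the agent breaks is confined to the mounted project directory and the disposable container filesystem.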
asyncadventure
[dead]
v0id_user
Being minimalist is real power these days as everything around us keeps shoving features in our face every week with a million tricks and gimmicks to learn. Something minimalist like this is honestly a breath of fresh air!
The YOLO mode is also good, but having a small ‘baby setting mode’ that’s not full-blown system access would make sense for basic security. Just a sensible layer of "pls don't blow my machine" without killing the freedom :)
Majromax
Pi supports restricting the set of tools given to an agent. For example, one of the examples in pi --help is:
# Read-only mode (no file modifications possible)
pi --tools read,grep,find,ls -p "Review the code in src/"
Otherwise, "yolo mode" inside a sandbox is perfectly reasonable. A basic bubblewrap configuration can expose read-only system tools and have a read/write project directory while hiding sensitive information like API keys and other home-directory files.
theturtletalks
Can I replace Vercel’s AI SDK with Pi’s equivalent?
I've seen a couple of power users already switching to Pi [1], and I'm considering that too. The premise is very appealing:
- Minimal, configurable context - including system prompts [2]
- Minimal and extensible tools; for example, todo tasks [3]
- No built-in MCP support; extensions exist [4]. I'd rather use mcporter [5]
Full control over context is a high-leverage capability. If you're aware of the many limitations of context on performance (in-context retrieval limits [6], context rot [7], etc.), you'd truly appreciate that Pi lets you fine-tune context for optimal performance.
It's clearly not for everyone, but I can see how powerful it can be.
Pi is the part of moltXYZ that should have gone viral. Armin is way ahead of the curve here.
The Claude sub is the only thing keeping me on Claude Code. It's not as janky as it used to be, but the hooks and context management support are still fairly superficial.
yuzhun
I really like pi and have started using it to build my agent.
Mario's article lays out the design trade-offs and complexities in building coding agents, and even general agents. I have benefited a lot!
athrowaway3z
The solution to the security issue is using `useradd`.
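One way to read the `useradd` suggestion (names are illustrative, and this requires root; it limits blast radius via file permissions rather than a full sandbox):

```shell
# Create a dedicated user with its own home and a work directory.
sudo useradd --create-home --shell /bin/bash agentuser
sudo install -d -o agentuser -g agentuser /home/agentuser/work

# Run the agent as that user; assuming sane permissions elsewhere,
# it cannot read your own $HOME, SSH keys, or credentials.
sudo -i -u agentuser sh -c 'cd ~/work && pi -p "Scaffold the project"'
```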
I would add subagents though. They allow for the pattern where the top agent directs / observes a subagent executing a step in a plan.
The top agent is both better at directing a subagent, and it keeps the context clean of details that don't matter - otherwise they'd be in the same step in the plan.
badlogic
There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.
> from copying and pasting code into ChatGPT, to Copilot auto-completions [...], to Cursor, and finally the new breed of coding agent harnesses like Claude Code, Codex, Amp, Droid, and opencode
Reading HN I feel a bit out of touch since I seem to be "stuck" on Cursor. Tried to make the jump further to Claude Code like everyone tells me to, but it just doesn't feel right...
It may be due to the size of my codebase -- I'm 6 months into solo developer bootstrap startup, so there isn't all that much there, and I can iterate very quickly with Cursor. And it's mostly SPA browser click-tested stuff. Comparatively it feels like Claude Code spends an eternity to do something.
(That said Cursor's UI does drive me crazy sometimes. In particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git -- I would have preferred that to instead actively use something integrated in git (Staged vs Unstaged hunks). More important to have a good code review experience than to remember which changes I made vs which changes AI made..)
sibellavia
Probably an ideal compromise solution for you would be to install the official Claude Code extension for VS Code, so you have an IDE for navigating large, complex codebases while still having CC integration.
mkreis
Bootstrapped solo dev here. I enjoyed using Claude to get little things done which I had on my TODO list below the important stuff, like updating a landing page, or in your case perhaps adding automated testing for the frontend stuff (so you don't have to click yourself). It's just nice having someone come up with a proposal on how to implement something; even if it's not the perfect way, it's good as a starter.
Also I have one Claude instance running to implement the main feature, in a tight feedback loop so that I know exactly what it's doing.
Yes, sometimes it takes a bit longer, but I use the time checking what the other Claudes are doing...
iterateoften
For me Cursor provides a much tighter feedback loop than Claude Code. I can review, revert, iterate, and change models to get what I need. It sometimes feels like Claude Code is presented more as a YOLO option where you put more trust in the agent about what it will produce.
I think the ability to change models is critical. Some models are better at designing frontend than others. Some are better at different programming languages, writing copy, blogs, etc.
I feel sabotaged if I can’t switch the models easily to try the same prompt and context across all the frontier options
andai
Claude Code spends most of its time poking around the files. It doesn't have any knowledge of the project by default (no file index etc), unless they changed it recently.
When I was using it a lot, I created a startup hook that just dumped a file listing into the context, or the actual full code on very small repos.
I also got some gains from using a custom edit tool I made which can edit multiple chunks in multiple files simultaneously. It was about 3x faster. I had some edge cases where it broke though, so I ended up disabling it.
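A startup hook like the one described might be as simple as dumping a capped file listing into the context (the excluded paths and the cap are illustrative choices, not what the commenter actually used):

```shell
# List project files, skipping VCS and dependency directories, and
# cap the output so huge repos don't blow the context window.
find . -type f \
  -not -path './.git/*' \
  -not -path './node_modules/*' \
  | head -n 200
```

On a very small repo you could swap the listing for the actual file contents, as the comment suggests.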
SatvikBeri
Seems like there's a speed/autonomy spectrum where Cursor is the fastest, Codex is the best for long-running jobs, and Claude is somewhere in the middle.
Personally, I found Cursor to be too inaccurate to be useful (possibly because I use Julia, which is relatively obscure) – Opus has been roughly the right level for my "pair programming" workflow.
leerob
> in particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git
We're making this better very soon! In the coming weeks hopefully.
pests
> remember which changes I made vs which changes AI made..
They are improving this use case too with their enhanced blame. I think it was mentioned in their latest update blog.
You'll be able to hover over lines to see if you wrote it, or an AI. If it was an AI, it will show which model and a reference to the prompt that generated it.
I do like Cursor quite a lot.
kloud
The OpenClaw/pi-agent situation seems similar to ollama/llama-cpp, where the former gets all the hype, while the latter is actually the more impressive part.
This is great work, and I am looking forward to how it evolves. So far Claude Code seems best despite its bugs, given the generous subscription, but when the market corrects and prices get closer to API prices, a pay-per-token premium with an optimized experience will probably be a better deal than suffering Claude Code's glitches and paper cuts.
The realization is that, in the end, an agent framework that is customizable and can be recursively improved by agents is going to be better than a rigid proprietary client app.
Aurornis
> but when the market corrects and the prices will get closer to API prices
I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous. We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think and there is no shortage of funding for R&D.
I also wouldn’t bet on Claude Code staying the same as it is right now with little glitches. All of the tools are going to improve over time. In my experience the competing tools aren’t bug free either but they get a pass due to underdog status. All of the tools are improving and will continue to do so.
andai
This is basically identical to the ChatGPT/GPT-3 situation ;) You know OpenAI themselves keep saying "we still don't understand why ChatGPT is so popular... GPT was already available via API for years!"
badlogic
FWIW, you can use subscriptions with pi. OpenAI has blessed pi allowing users to use their GPT subscriptions. Same holds for other providers, except Flicker Company.
And I'm personally very happy that Peter's project gets all the hype. The pi repo already gets enough vibesloped PRs from openclaw users as is, and it's still only 1/100th of what the openclaw repository has to suffer through.
ohyoutravel
And like ollama it will no doubt start to get enshittified.
jrm4
This is the first I'm hearing of this pi-agent thing and HOW DO PEOPLE TECH DECIDE TO NAME THINGS?
Seriously. Is the creator not aware that "pi" absolutely invokes the name of another very important thing? sigh.
mike_hearn
I'm writing my own agent too as a side project at work. This is a good article but simultaneously kinda disappointing. The entire agent space has disappeared down the same hole, with exactly the same core design used everywhere and everyone making the same mistakes. The focus on TUIs I find especially odd. We're at the dawn of the AI age and people are trying to optimize the framerate of Teletext? If you care about framerates use a proper GUI framework!
The agent I'm writing shares some ideas with Pi but otherwise departs quite drastically from the core design used by Claude Code, Codex, Pi etc, and it seems to have yielded some nice benefits:
• No early stopping ("shall I continue?", "5 tests failed -> all tests passed, I'm done" etc).
• No permission prompts but also no YOLO mode or broken Seatbelt sandboxes. Everything is executed in a customized container designed specifically for the model and adapted to its needs. The agent does a lot of container management to make this work well.
• Agent can manage its own context window, and does. I never needed to add compaction because I never yet saw it run out of context.
• Seems to be fast compared to other agents, at least in any environment where there's heavy load on the inferencing servers.
• Eliminates "slop-isms" like excessive error swallowing, narrative commenting, dropping fully qualified class names into the middle of source files etc.
• No fancy TUI. I don't want to spend any time fixing flickering bugs when I could be improving its skill at the core tasks I actually need it for.
It's got downsides too, it's very overfit to the exact things I've needed and the corporate environment it runs in. It's not a full replacement for CC or Codex. But I use it all the time and it writes nearly all my code now.
The agent is owned by the company and they're starting to ask about whether it could be productized so I suppose I can't really go into the techniques used to achieve this, sorry. Suffice it to say that the agent design space is far wider and deeper than you'd initially intuit from reading articles like this. None of the ideas in my agent are hard to come up with so explore!
msp26
> Special shout out to Google who to this date seem to not support tool call streaming which is extremely Google.
Google doesn't even provide a tokenizer to count tokens locally. The results of this stupidity can be seen directly in AI Studio, which makes an API call to count_tokens every time you type in the prompt box.
localhost
tbf, neither does Anthropic
haxel
AI Studio also has a bug where it continuously counts tokens, typing or not, at 100% CPU usage.
Sometimes I wonder who is drawing more power, my laptop or the TPU cluster on the other side.
bob1029
> The second approach is to just write to the terminal like any CLI program, appending content to the scrollback buffer
This is how I prototyped all of mine. Console.Write[Line].
I am currently polishing up one of the prototypes with WinForms (.NET 10) & WebView2. Building something that looks like a WhatsApp conversation in basic WinForms is a lot of work. This takes about 60 seconds in a web view.
I am not too concerned about cross-platform because a vast majority of my users will be on Windows when they'd want to use this tool.
PKop
If you use WPF you can have the Mica backdrop underneath your WebView2 content and set the WebView2 to have a transparent background color, which looks nice and a little more native, fyi. Though if you're doing more than just showing the WebView, maybe it isn't worth switching.
asyncadventure
[dead]
jatora
I was confused by him basically inventing his own skills, but I guess this is from Nov 2025, so it makes sense; skills were pretty new at that point.
Also please note this is nowhere on the terminal bench leaderboard anymore. I'd advise everyone reading the comments here to be aware of that. This isn't a CLI to use. Just a good experiment and write up.
haxel
It's batteries-not-included, by design. Here's what it looks like with batteries (and note who owns this repo):
I don't follow nor use pi, so no horse in this race, but I think the results were never submitted to Terminal-Bench? Not sure how the process works exactly, but it's entirely missing from the benchmark. Is this a sign of weakness? I honestly don't know.
bicepjai
Main reason I haven’t switched over to the new pi coding agent (or even fully to Claude Code alternatives) is the price point. I eat tokens for breakfast, lunch, and dinner.
I’m on a $100/mo plan, but the codex bar makes it look like I’m burning closer to $500 every 30 days. I tried going local with Qwen 3 (coding) on a Blackwell Pro 6000, and it still feels a beat behind, either laggy, or just not quite good enough for me to fully relinquish Claude Code.
Curious what other folks are seeing: any success stories with other agents on local models, or are you mostly sticking with proprietary models?
I’m feeling a bit vendor-locked into Claude Code: it’s pricey, but it’s also annoyingly good
cyanydeez
And it's doubtful they are anywhere near break-even costs.
0xbadcafebee
According to the article, Pi massively shrinks your context use (due to smaller system prompt and lack of MCPs) so your token use may drop. Also Pi seems to support Anthropic OAuth for your plan (but afaik they might ban you)
SatvikBeri
I particularly liked Mario's point about using tmux for long-running commands. I've found models to be very good at reading from / writing to tmux, so I'll do things like spin up a session with a REPL, use Claude to prototype something, then inspect it more deeply in the REPL.
yl_seeto
One aspect that resonates from stories like this is the tension between opinionated design and real-world utility.
When building something minimal, especially in areas like agent-based tooling or assistants, the challenge isn’t only about reducing surface area — it’s about focusing that reduction around what actually solves a user’s problem.
A minimal agent that only handles edge cases, or only works in highly constrained environments, can feel elegant on paper but awkward in practice. Conversely, a slightly less minimal system that still maintains clarity and intent often ends up being more useful without being bloated.
In my own experience launching tools that involve analysis and interpretation, the sweet spot always ends up being somewhere in the intersection of:
- clearly scoped core value,
- deliberately limited surface, and
- enough flexibility to handle real user variation.
Curious how others think about balancing minimalism and practical coverage when designing agents or abstractions in their own projects.
airstrike
begone, bot
0xbadcafebee
The best deep-dive into coding agents (and best architecture) I've seen so far. And I love the minimalism with this design, but there's so much complexity necessary already, it's kind of crazy. Really glad I didn't try to write my own :)
Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent. It would process read-only requests automatically but write requests would send a request to the user to authorize. I haven't yet found somebody else writing this, so I might as well give it a shot
Other than credentialed calls, I have Docker-in-Docker in a VM, so all other actions will be YOLO'd. I think this is the only reasonable system for long-running loops.
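A minimal sketch of that broker idea (function names and the allowlist are hypothetical, and a real broker would run as a separate process, not a shell function in the agent's own session): read-only commands pass automatically, anything else needs explicit human approval.

```shell
# Classify a command string as read-only or not (allowlist is illustrative).
is_readonly() {
  case "$1" in
    "git status"*|"git log"*|"git diff"*|"ls "*|"cat "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run a command the agent requested: auto-approve reads, ask for writes.
broker_run() {
  cmd="$1"
  if is_readonly "$cmd"; then
    sh -c "$cmd"
  else
    printf 'Approve "%s"? [y/N] ' "$cmd"
    read -r answer
    [ "$answer" = y ] && sh -c "$cmd"
  fi
}
```

The important property is that the credential lives only on the broker's side of the boundary; the agent sees command results, never the secret itself.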
pjm331
> Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent
This is a problem that Model Context Protocol solves.
Your MCP server has the creds, your agent does not.
benjaminfh
Really awesome and thoughtful thing you've built - bravo!
I'm so aligned with your take on context engineering / context management. I found the default linear flow of conversation turns really frustrating and limiting. In fact, I still do. Sometimes you know upfront that the next thing you're about to do will flood/poison the nicely crafted context you've built up... other times you realise after the fact. In both cases, you don't have many alternatives but to press on... Trees are the answer for sure.
I actually spent most of Dec building something with the same philosophy for my own use (aka me as the agent) when doing research and ideation with LLMs. Frustrated by most of the same limitations - want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff. Be able to traverse the tree forwards and back to understand how I got to a place...
Anyway, you've definitely built the more valuable incarnation of this - great work. I'm glad I peeled back the surface of the moltbot hysteria to learn about Pi.
visarga
> want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff
My attempt - a minimalist graph format that is a simple markdown file with inline citations. I load MIND_MAP.md at the start of work, and update it at the end. It reduces context waste to resume or spawn subagents. Memory across sessions.
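For illustration only (the comment doesn't specify the actual format; entries and paths here are invented), such a file might look like:

```markdown
# MIND_MAP.md
- auth
  - sessions stored in Redis, 24h TTL [src/auth/session.ts:42]
  - JWT refresh flow is custom, do not replace with a library [docs/adr/007.md]
- billing
  - Stripe webhooks are replayed on failure [src/billing/webhook.ts:18]
```

Each node is a short claim with an inline citation back to the code, so a fresh session (or subagent) can verify rather than trust it.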
I’m just curious why your writing is punctuated by lots of word breaks. I hardly see hyphenated word breaks across lines anymore and it made me pause on all those occurrences. I do remember having to do this with literal typewriters.
kirab
According to dev tools this is a simple `hyphens: auto` CSS
willswire
Minimal, intentional guidance is the cornerstone of my CLAUDE.md’s design philosophy document.
I'm curious about how the costs compare using something like this, where you're hitting APIs directly, vs my $20 ChatGPT plan which includes Codex.
kalendos
You can use your ChatGPT subscription with Pi!
asyncadventure
[dead]
amluto
I'm hoping someone makes an agent that fixes the container situation, better:
> If you're uncomfortable with full access, run pi inside a container or use a different tool if you need (faux) guardrails.
I'm sick of doing this. I also don't want faux guardrails. What I do want is an agent front-end that is trustworthy in the sense that it will not, even when instructed by the LLM inside, do anything to my local machine. So it should have tools that run in a container. And it should have really nice features like tools that can control a container and create and start containers within appropriate constraints.
In other words, the 'edit' tool is scoped to whatever I've told the front-end that it can access. So is 'bash' and therefore anything bash does. This isn't a heuristic like everyone running in non-YOLO-mode does today -- it’s more like a traditional capability system. If I want to use gVisor instead of Docker, that should be a very small adaptation. Or Firecracker or really anything else. Or even some random UART connection to some embedded device, where I want to control it with an agent but the device is neither capable of running the front-end nor of connecting to the internet (and may not even have enough RAM to store a conversation!).
I think this would be both easier to use and more secure than what's around right now. Instead of making a container for a project and then dealing with installing the agent into the container, I want to run the agent front-end and then say "Please make a container based on such-and-such image and build me this app inside." Or "Please make three containers as follows".
As a side bonus, this would make designing a container sandbox sooooo much easier, since the agent front-end would not itself need to be compatible with the sandbox. So I could run a container with -net none and still access the inference API.
Contrast with today, where I wanted to make a silly Node app. Step 1: Ask ChatGPT (the web app) to make me a Dockerfile that sets up the right tools including codex-rs and then curse at it because GPT-5.2 is really remarkably bad at this. This sucks, and the agent tool should be able to do this for me, but that would currently require a completely unacceptable degree of YOLO.
(I want an IDE that works like this too. vscode's security model is comically poor. Hmm, an IDE is kind of like an agent front-end except the tools are stronger and there's no AI involved. These things could share code.)
detroitwebsites
Great writeup on minimal agent architecture. The philosophy of "if I don't need it, it won't be built" resonates strongly.
I've been running OpenClaw (which sits on top of similar primitives) to manage multiple simultaneous workflows - one agent handles customer support tickets, another monitors our deployment pipeline, a third does code reviews. The key insight I hit was exactly what you describe: context engineering is everything.
What makes OpenClaw particularly interesting is the workspace-first model. Each agent has AGENTS.md, TOOLS.md, and a memory/ directory that persists across sessions. You can literally watch agents learn from their mistakes by reading their daily logs. It's less magic, more observable system.
The YOLO-by-default approach is spot on. Security theater in coding agents is pointless - if it can write and execute code, game over. Better to be honest about the threat model.
One pattern I documented at howtoopenclawfordummies.com: running multiple specialized agents beats one generalist. Your sub-agent discussion nails why - full observability + explicit context boundaries. I have agents that spawn other agents via tmux, exactly as you suggest.
The benchmark results are compelling. Would love to see pi and OpenClaw compared head-to-head on Terminal-Bench.
verdverm
Glad to see more people doing this!
I built on ADK (Agent Development Kit), which comes with many of the features discussed in the post.
Building a full, custom agent setup is surprisingly easy and a great learning experience for this transformational technology. Getting into instruction and tool crafting was where I found the most ROI.
xcodevn
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
One thing I do find is that subagents are helpful for performance -- offloading tasks to smaller models (gpt-oss specifically for me) gets data to the bigger model quicker.
sghiassy
I always wonder what type of moat systems / business like these have
keyle
None, basically.
bschwarz
The only moat in all of this is capital.
mellosouls
Its open source. Where does it say he wants to monetise it?
I hadn't realized that Pi is the agent harness used by OpenClaw.
jeffrallen
As a user of a minimal, opinionated agent (https://exe.dev) I've observed at least 80% of this article's findings myself.
Small and observable is excellent.
Letting your agent read traces of other sessions is an interesting method of context trimming.
Especially, "always Yolo" and "no background tasks". The LLM can manage Unix processes just fine with bash (e.g. ps, lsof, kill), and if you want you can remind it to use systemd, and it will. (It even does it without rolling it's eyes, which I normally do when forced to deal with systemd.)
Something he didn't mention is git: talk to your agent a commit at a time. Recently I had a colleague check in his minimal, broken PoC on a new branch with the commit message "work in progress". We pointed the agent at the branch and said, "finish the feature we started" and it nailed it in one shot. No context whatsoever other than "draw the rest of the f'ing owl" and it just.... did it. Fascinating.
zby
Pi has probably the best architecture and being written in Javascript it is well positioned to use the browser sandbox architecture that I think is the future for ai agents.
“standardize the intersection, expose the union” is a great phrase I hadn’t heard articulated before
genie3io
[dead]
yosefk
"Also, it [Claude Code] flickers" - it does, doesn't it? Why?.. Did it vibe code itself so badly that this is hopeless to fix?..
falloutx
Claude code programmers are very open that they vibe code it.
grim_io
Because they target 60 fps refresh, with 11 of the 16 ms budget per frame being wasted by react itself.
They are locked in this naive, horrible framework that would be embarrassing to open source even if they had the permission to do it.
valleyer
> If you look at the security measures in other coding agents, they're mostly security theater. As soon as your agent can write code and run code, it's pretty much game over.
At least for Codex, the agent runs commands inside an OS-provided sandbox (Seatbelt on macOS, and other stuff on other platforms). It does not end up "making the agent mostly useless".
beacon294
My codex just uses python to write files around the sandbox when I ask it to patch a sdk outside its path.
maleldil
Does Codex randomly decide to disable the sandbox like Claude Code does?
lvl155
You really shouldn’t be running agents outside of a container. That’s 101.
chr15m
Approval should be mandatory for any non-read tool call. You should read everything your LLM intends to do, and approve it manually.
"But that is annoying and will slow me down!" Yes, and so will recovering from disastrous tool calls.
xXSLAYERXx
I'm trying to understand this workflow. I have just started using codex. Literally 2 days in. I have it hooked up to my githbub repo and it just runs in the cloud and creates a pr. I have it touching only UI and middle layer code. No db changes, I always tell it to not touch the models.
jFriedensreich
I don't know how to feel about being the only one refusing to run yolo mode until the tooling is there, which is still about 6 months away for my setup. Am I years behind everyone else by then? You can get pretty far without completely giving in. Agents really don't need to execute that many arbitrary commands: linting, search, edit, and web access should all be bespoke tools integrated into the permission and sandbox system. Agents should not even be allowed to start and stop applications that support dev mode; they edit files, can test, and get the logs. What else would they need to do? Especially as the number of external dependencies that make sense shrinks to a handful, you can approve every new one without headache. If your runtime supports sandboxing and permissions, like Deno or workerd, this adds an initial layer of defense.
This makes it even more baffling why Anthropic went with Bun, a runtime without any sandboxing or security architecture, and will rely on Apple Seatbelt alone?
WhyNotHugo
You use YOLO mode inside some sandbox (VM, container). Give the container only access to the necessary resources.
asyncadventure
[dead]
v0id_user
Being minimalist is real power these days as everything around us keeps shoving features in our face every week with a million tricks and gimmicks to learn. Something minimalist like this is honestly a breath of fresh air!
The YOLO mode is also good, but having a small ‘baby setting mode’ that’s not full-blown system access would make sense for basic security. Just a sensible layer of "pls don't blow my machine" without killing the freedom :)
Majromax
Pi supports restricting the set of tools given to an agent. For example, one of the examples in pi --help is:
# Read-only mode (no file modifications possible)
pi --tools read,grep,find,ls -p "Review the code in src/"
Otherwise, "yolo mode" inside a sandbox is perfectly reasonable. A basic bubblewrap configuration can expose read-only system tools and have a read/write project directory while hiding sensitive information like API keys and other home-directory files.
theturtletalks
Can I replace Vercel’s AI SDK with Pi’s equivalent?
I've seen a couple of power users already switching to Pi [1], and I'm considering that too. The premise is very appealing:
- Minimal, configurable context - including system prompts [2]
- Minimal and extensible tools; for example, todo tasks [3]
- No built-in MCP support; extensions exist [4]. I'd rather use mcporter [5]
Full control over context is a high-leverage capability. If you're aware of the many limitations of context on performance (in-context retrieval limits [6], context rot [7], etc.), you'd truly appreciate that Pi lets you fine-tune context for optimal performance.
It's clearly not for everyone, but I can see how powerful it can be.
Pi is the part of moltXYZ that should have gone viral. Armin is way ahead of the curve here.
The Claude sub is the only thing keeping me on Claude Code. It's not as janky as it used to be, but the hooks and context management support are still fairly superficial.
yuzhun
I really like pi and have started using it to build my agent.
Mario's article lays out the design trade-offs and complexities involved in building coding agents, and even general agents. I have benefited a lot!
athrowaway3z
The solution to the security issue is using `useradd`.
I would add subagents though. They allow for the pattern where the top agent directs / observes a subagent executing a step in a plan.
The top agent is both better at directing a subagent, and it keeps the context clean of details that don't matter - otherwise they'd be in the same step in the plan.
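The `useradd` route, sketched out (the project path is hypothetical; this needs root, and a sandbox or container is still the stronger option):

```shell
# Hypothetical setup: a dedicated unprivileged user limits the blast
# radius to files that user can actually write.
sudo useradd --create-home --shell /bin/bash agent
sudo chown -R agent:agent /srv/myapp            # hypothetical project dir
sudo -u agent bash -c 'cd /srv/myapp && pi -p "Fix the failing tests"'
```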
badlogic
There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.
> from copying and pasting code into ChatGPT, to Copilot auto-completions [...], to Cursor, and finally the new breed of coding agent harnesses like Claude Code, Codex, Amp, Droid, and opencode
Reading HN I feel a bit out of touch since I seem to be "stuck" on Cursor. Tried to make the jump further to Claude Code like everyone tells me to, but it just doesn't feel right...
It may be due to the size of my codebase -- I'm 6 months into solo developer bootstrap startup, so there isn't all that much there, and I can iterate very quickly with Cursor. And it's mostly SPA browser click-tested stuff. Comparatively it feels like Claude Code spends an eternity to do something.
(That said Cursor's UI does drive me crazy sometimes. In particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git -- I would have preferred that to instead actively use something integrated in git (Staged vs Unstaged hunks). More important to have a good code review experience than to remember which changes I made vs which changes AI made..)
sibellavia
Probably an ideal compromise solution for you would be to install the official Claude Code extension for VS Code, so you have an IDE for navigating large, complex codebases while still having CC integration.
mkreis
Bootstrapped solo dev here. I enjoyed using Claude to get little things done which I had parked on my TODO list below the important stuff, like updating a landing page, or in your case perhaps adding automated testing for the frontend stuff (so you don't have to click yourself). It's just nice having someone come up with a proposal on how to implement something; even if it's not the perfect way, it's good as a starter.
Also I have one Claude instance running to implement the main feature, in a tight feedback loop so that I know exactly what it's doing.
Yes, sometimes it takes a bit longer, but I use the time checking what the other Claudes are doing...
iterateoften
For me Cursor provides a much tighter feedback loop than Claude Code. I can review, revert, iterate, and change models to get what I need. It sometimes feels like Claude Code is presented more as a yolo option where you put more trust in the agent about what it will produce.
I think the ability to change models is critical. Some models are better at designing frontend than others. Some are better at different programming languages, writing copy, blogs, etc.
I feel sabotaged if I can’t switch the models easily to try the same prompt and context across all the frontier options
andai
Claude Code spends most of its time poking around the files. It doesn't have any knowledge of the project by default (no file index etc), unless they changed it recently.
When I was using it a lot, I created a startup hook that just dumped a file listing into the context, or the actual full code on very small repos.
I also got some gains from using a custom edit tool I made which can edit multiple chunks in multiple files simultaneously. It was about 3x faster. I had some edge cases where it broke though, so I ended up disabling it.
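The startup hook can be as simple as this (the hook wiring and filenames are illustrative, not Claude Code's actual hook schema):

```shell
# Illustrative session-start hook: dump a file listing into a context file
# so the model skips the initial poking-around phase.
dump_context() {
  repo="$1"
  echo "## Files in this project"
  find "$repo" -type f \
    -not -path '*/.git/*' \
    -not -path '*/node_modules/*' | sort
}

dump_context . > /tmp/project-context.md
```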
SatvikBeri
Seems like there's a speed/autonomy spectrum where Cursor is the fastest, Codex is the best for long-running jobs, and Claude is somewhere in the middle.
Personally, I found Cursor to be too inaccurate to be useful (possibly because I use Julia, which is relatively obscure) – Opus has been roughly the right level for my "pair programming" workflow.
leerob
> in particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git
We're making this better very soon! In the coming weeks hopefully.
pests
> remember which changes I made vs which changes AI made..
They are improving this use case too with their enhanced blame. I think it was mentioned in their latest update blog.
You'll be able to hover over lines to see if you wrote it, or an AI. If it was an AI, it will show which model and a reference to the prompt that generated it.
I do like Cursor quite a lot.
kloud
The OpenClaw/pi-agent situation seems similar to ollama/llama-cpp, where the former gets all the hype, while the latter is actually the more impressive part.
This is great work, and I am looking forward to how it evolves. So far Claude Code seems best despite its bugs, given the generous subscription, but when the market corrects and prices get closer to API prices, the pay-per-token premium with an optimized experience will probably be a better deal than suffering Claude Code's glitches and paper cuts.
The realization is that, in the end, an agent framework that is customizable and can be recursively improved by agents is going to beat a rigid proprietary client app.
Aurornis
> but when the market corrects and the prices will get closer to API prices
I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous. We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think and there is no shortage of funding for R&D.
I also wouldn’t bet on Claude Code staying the same as it is right now with little glitches. All of the tools are going to improve over time. In my experience the competing tools aren’t bug free either but they get a pass due to underdog status. All of the tools are improving and will continue to do so.
andai
This is basically identical to the ChatGPT/GPT-3 situation ;) You know OpenAI themselves keep saying "we still don't understand why ChatGPT is so popular... GPT was already available via API for years!"
badlogic
FWIW, you can use subscriptions with pi. OpenAI has blessed pi allowing users to use their GPT subscriptions. Same holds for other providers, except Flicker Company.
And I'm personally very happy that Peter's project gets all the hype. The pi repo already gets enough vibesloped PRs from openclaw users as is, and it's still only 1/100th of what the openclaw repository has to suffer through.
ohyoutravel
And like ollama it will no doubt start to get enshittified.
jrm4
This is the first I'm hearing of this pi-agent thing, and HOW DO TECH PEOPLE DECIDE TO NAME THINGS?
Seriously. Is the creator not aware that "pi" absolutely invokes the name of another very important thing? Sigh.
mike_hearn
I'm writing my own agent too as a side project at work. This is a good article but simultaneously kinda disappointing. The entire agent space has disappeared down the same hole, with exactly the same core design used everywhere and everyone making the same mistakes. The focus on TUIs I find especially odd. We're at the dawn of the AI age and people are trying to optimize the framerate of Teletext? If you care about framerates, use a proper GUI framework!
The agent I'm writing shares some ideas with Pi but otherwise departs quite drastically from the core design used by Claude Code, Codex, Pi etc, and it seems to have yielded some nice benefits:
• No early stopping ("shall I continue?", "5 tests failed -> all tests passed, I'm done" etc).
• No permission prompts but also no YOLO mode or broken Seatbelt sandboxes. Everything is executed in a customized container designed specifically for the model and adapted to its needs. The agent does a lot of container management to make this work well.
• Agent can manage its own context window, and does. I never needed to add compaction because I never yet saw it run out of context.
• Seems to be fast compared to other agents, at least in any environment where there's heavy load on the inferencing servers.
• Eliminates "slop-isms" like excessive error swallowing, narrative commenting, dropping fully qualified class names into the middle of source files etc.
• No fancy TUI. I don't want to spend any time fixing flickering bugs when I could be improving its skill at the core tasks I actually need it for.
It's got downsides too, it's very overfit to the exact things I've needed and the corporate environment it runs in. It's not a full replacement for CC or Codex. But I use it all the time and it writes nearly all my code now.
The agent is owned by the company and they're starting to ask about whether it could be productized so I suppose I can't really go into the techniques used to achieve this, sorry. Suffice it to say that the agent design space is far wider and deeper than you'd initially intuit from reading articles like this. None of the ideas in my agent are hard to come up with so explore!
msp26
> Special shout out to Google who to this date seem to not support tool call streaming which is extremely Google.
Google doesn't even provide a tokenizer to count tokens locally. The results of this stupidity can be seen directly in AI studio which makes an API call to count_tokens every time you type in the prompt box.
localhost
tbf neither does anthropic
haxel
AI studio also has a bug that continuously counts the tokens, typing or not, with 100% CPU usage.
Sometimes I wonder who is drawing more power, my laptop or the TPU cluster on the other side.
bob1029
> The second approach is to just write to the terminal like any CLI program, appending content to the scrollback buffer
This is how I prototyped all of mine. Console.Write[Line].
I am currently polishing up one of the prototypes with WinForms (.NET10) & WebView2. Building something that looks like a WhatsApp conversation in basic winforms is a lot of work. This takes about 60 seconds in a web view.
I am not too concerned about cross platform because a vast majority of my users will be on windows when they'd want to use this tool.
PKop
If you use WPF you can have the Mica backdrop underneath your WebView2 content and set the WebView2 to have a transparent background color, which looks nice and a little more native, fyi. Though if you're doing more than just showing the WebView, maybe it isn't worth switching.
asyncadventure
[dead]
jatora
I was confused by him basically inventing his own skills but I guess this is from Nov 2025 so makes sense as skills were pretty new at that point.
Also please note this is nowhere on the terminal bench leaderboard anymore. I'd advise everyone reading the comments here to be aware of that. This isn't a CLI to use. Just a good experiment and write up.
haxel
It's batteries-not-included, by design. Here's what it looks like with batteries (and note who owns this repo):
I don't follow nor use pi, so I have no horse in this race, but I think the results were never submitted to Terminal-Bench? Not sure how the process works exactly, but it's entirely missing from the benchmark. Is this a sign of weakness? I honestly don't know.
bicepjai
Main reason I haven’t switched over to the new pi coding agent (or even fully to Claude Code alternatives) is the price point. I eat tokens for breakfast, lunch, and dinner.
I’m on a $100/mo plan, but the codex bar makes it look like I’m burning closer to $500 every 30 days. I tried going local with Qwen 3 (coding) on a Blackwell Pro 6000, and it still feels a beat behind, either laggy, or just not quite good enough for me to fully relinquish Claude Code.
Curious what other folks are seeing: any success stories with other agents on local models, or are you mostly sticking with proprietary models?
I’m feeling a bit vendor-locked into Claude Code: it’s pricey, but it’s also annoyingly good
cyanydeez
And it's doubtful they are anywhere near break-even costs
0xbadcafebee
According to the article, Pi massively shrinks your context use (due to smaller system prompt and lack of MCPs) so your token use may drop. Also Pi seems to support Anthropic OAuth for your plan (but afaik they might ban you)
SatvikBeri
I particularly liked Mario's point about using tmux for long-running commands. I've found models to be very good at reading from / writing to tmux, so I'll do things like spin up a session with a REPL, use Claude to prototype something, then inspect it more deeply in the REPL.
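That loop looks roughly like this (the session name is arbitrary; assumes tmux and python3 are installed):

```shell
# The tmux pattern from above: a detached session hosting a REPL that both
# you and the agent can send keystrokes to and read back with capture-pane.
repl_eval() {
  tmux new-session -d -s scratch 'python3 -q'
  tmux send-keys -t scratch "$1" Enter
  sleep 1
  tmux capture-pane -t scratch -p   # dumps the pane, result included
  tmux kill-session -t scratch
}

repl_eval '6 * 7'
```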
yl_seeto
One aspect that resonates from stories like this is the tension between opinionated design and real-world utility.
When building something minimal, especially in areas like agent-based tooling or assistants, the challenge isn’t only about reducing surface area — it’s about focusing that reduction around what actually solves a user’s problem.
A minimal agent that only handles edge cases, or only works in highly constrained environments, can feel elegant on paper but awkward in practice. Conversely, a slightly less minimal system that still maintains clarity and intent often ends up being more useful without being bloated.
In my own experience launching tools that involve analysis and interpretation, the sweet spot always ends up being somewhere in the intersection of:
- clearly scoped core value,
- deliberately limited surface, and
- enough flexibility to handle real user variation.
Curious how others think about balancing minimalism and practical coverage when designing agents or abstractions in their own projects.
airstrike
begone, bot
0xbadcafebee
The best deep-dive into coding agents (and best architecture) I've seen so far. And I love the minimalism with this design, but there's so much complexity necessary already, it's kind of crazy. Really glad I didn't try to write my own :)
Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent. It would process read-only requests automatically but write requests would send a request to the user to authorize. I haven't yet found somebody else writing this, so I might as well give it a shot
Other than credentialed calls, I have Docker-in-Docker in a VM, so all other actions will be YOLO'd. I think this is the only reasonable system for long-running loops.
pjm331
> Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent
This is a problem that model context protocol solves
Your MCP server has the creds, your agent does not.
benjaminfh
Really awesome and thoughtful thing you've built - bravo!
I'm so aligned on your take on context engineering / context management. I found the default linear flow of conversation turns really frustrating and limiting. In fact, I still do. Sometimes you know upfront that the next thing you're about to do will flood/poison the nicely crafted context you've built up; other times you realise after the fact. In both cases, you don't have many alternatives but to press on... Trees are the answer for sure.
I actually spent most of Dec building something with the same philosophy for my own use (aka me as the agent) when doing research and ideation with LLMs. Frustrated by most of the same limitations - want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff. Be able to traverse the tree forwards and back to understand how I got to a place...
Anyway, you've definitely built the more valuable incarnation of this - great work. I'm glad I peeled back the surface of the moltbot hysteria to learn about Pi.
visarga
> want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff
My attempt - a minimalist graph format that is a simple markdown file with inline citations. I load MIND_MAP.md at the start of work, and update it at the end. It reduces context waste to resume or spawn subagents. Memory across sessions.
I’m just curious why your writing is punctuated by lots of word breaks. I hardly see hyphenated word breaks across lines anymore and it made me pause on all those occurrences. I do remember having to do this with literal typewriters.
kirab
According to dev tools this is simply `hyphens: auto` in the CSS.
willswire
Minimal, intentional guidance is the cornerstone of my CLAUDE.md’s design philosophy document.
I'm curious how the costs compare using something like this, where you're hitting APIs directly, vs my $20 ChatGPT plan which includes Codex.
kalendos
You can use your ChatGPT subscription with Pi!
asyncadventure
[dead]
amluto
I'm hoping someone makes an agent that fixes the container situation, better:
> If you're uncomfortable with full access, run pi inside a container or use a different tool if you need (faux) guardrails.
I'm sick of doing this. I also don't want faux guardrails. What I do want is an agent front-end that is trustworthy in the sense that it will not, even when instructed by the LLM inside, do anything to my local machine. So it should have tools that run in a container. And it should have really nice features like tools that can control a container and create and start containers within appropriate constraints.
In other words, the 'edit' tool is scoped to whatever I've told the front-end that it can access. So is 'bash' and therefore anything bash does. This isn't a heuristic like everyone running in non-YOLO-mode does today -- it’s more like a traditional capability system. If I want to use gVisor instead of Docker, that should be a very small adaptation. Or Firecracker or really anything else. Or even some random UART connection to some embedded device, where I want to control it with an agent but the device is neither capable of running the front-end nor of connecting to the internet (and may not even have enough RAM to store a conversation!).
I think this would be both easier to use and more secure than what's around right now. Instead of making a container for a project and then dealing with installing the agent into the container, I want to run the agent front-end and then say "Please make a container based on such-and-such image and build me this app inside." Or "Please make three containers as follows".
As a side bonus, this would make designing a container sandbox sooooo much easier, since the agent front-end would not itself need to be compatible with the sandbox. So I could run a container with -net none and still access the inference API.
Contrast with today, where I wanted to make a silly Node app. Step 1: Ask ChatGPT (the web app) to make me a Dockerfile that sets up the right tools including codex-rs and then curse at it because GPT-5.2 is really remarkably bad at this. This sucks, and the agent tool should be able to do this for me, but that would currently require a completely unacceptable degree of YOLO.
(I want an IDE that works like this too. vscode's security model is comically poor. Hmm, an IDE is kind of like an agent front-end except the tools are stronger and there's no AI involved. These things could share code.)
detroitwebsites
Great writeup on minimal agent architecture. The philosophy of "if I don't need it, it won't be built" resonates strongly.
I've been running OpenClaw (which sits on top of similar primitives) to manage multiple simultaneous workflows - one agent handles customer support tickets, another monitors our deployment pipeline, a third does code reviews. The key insight I hit was exactly what you describe: context engineering is everything.
What makes OpenClaw particularly interesting is the workspace-first model. Each agent has AGENTS.md, TOOLS.md, and a memory/ directory that persists across sessions. You can literally watch agents learn from their mistakes by reading their daily logs. It's less magic, more observable system.
The YOLO-by-default approach is spot on. Security theater in coding agents is pointless - if it can write and execute code, game over. Better to be honest about the threat model.
One pattern I documented at howtoopenclawfordummies.com: running multiple specialized agents beats one generalist. Your sub-agent discussion nails why - full observability + explicit context boundaries. I have agents that spawn other agents via tmux, exactly as you suggest.
The benchmark results are compelling. Would love to see pi and OpenClaw compared head-to-head on Terminal-Bench.
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
https://github.com/NTT123/nano-agent
An excellent piece of writing.
One thing I do find is that subagents are helpful for performance -- offloading tasks to smaller models (gpt-oss specifically for me) gets data to the bigger model quicker.
I always wonder what type of moat systems / business like these have
None, basically.
The only moat in all of this is capital.
Its open source. Where does it say he wants to monetise it?
Capital, both social and economic.
Also data, see https://news.ycombinator.com/item?id=46637328
>The only way you could prevent exfiltration of data would be to cut off all network access for the execution environment the agent runs in
You can sandbox off the data.
Armin Ronacher wrote a good piece about why he uses Pi here: https://lucumr.pocoo.org/2026/1/31/pi/
I hadn't realized that Pi is the agent harness used by OpenClaw.
I only wish the author changed his stance on vendor extensions: https://github.com/badlogic/pi-mono/discussions/254
"Also, it [Claude Code] flickers" - it does, doesn't it? Why?.. Did it vibe code itself so badly that this is hopeless to fix?..
Claude code programmers are very open that they vibe code it.
Because they target 60 fps refresh, with 11 of the 16 ms budget per frame being wasted by react itself.
They are locked in this naive, horrible framework that would be embarrassing to open source even if they had the permission to do it.
It's not an API drop in replacement, if that's what you mean. But the pi-ai package serves the same purpose as Vercel's AI SDK. https://github.com/badlogic/pi-mono/tree/main/packages/ai
---
[1] https://lucumr.pocoo.org/2026/1/31/pi/
[2] https://github.com/badlogic/pi-mono/tree/main/packages/codin...
[3] https://github.com/mitsuhiko/agent-stuff/blob/main/pi-extens...
[4] https://github.com/nicobailon/pi-mcp-adapter
[5] https://github.com/steipete/mcporter
[6] https://github.com/gkamradt/LLMTest_NeedleInAHaystack
[7] https://research.trychroma.com/context-rot
The solution to the security issue is using `useradd`.
I would add subagents, though. They allow for the pattern where the top agent directs and observes a subagent executing a step in a plan.
The top agent is both better at directing a subagent, and it keeps its own context clean of details that don't matter beyond that one step of the plan.
There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.
Or you use any of the packages people provide, like this one: https://github.com/nicobailon/pi-subagents
> from copying and pasting code into ChatGPT, to Copilot auto-completions [...], to Cursor, and finally the new breed of coding agent harnesses like Claude Code, Codex, Amp, Droid, and opencode
Reading HN I feel a bit out of touch since I seem to be "stuck" on Cursor. Tried to make the jump further to Claude Code like everyone tells me to, but it just doesn't feel right...
It may be due to the size of my codebase -- I'm 6 months into a solo-developer bootstrapped startup, so there isn't all that much there, and I can iterate very quickly with Cursor. And it's mostly SPA browser click-tested stuff. Comparatively, it feels like Claude Code spends an eternity to do anything.
(That said Cursor's UI does drive me crazy sometimes. In particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git -- I would have preferred that to instead actively use something integrated in git (Staged vs Unstaged hunks). More important to have a good code review experience than to remember which changes I made vs which changes AI made..)
Probably an ideal compromise solution for you would be to install the official Claude Code extension for VS Code, so you have an IDE for navigating large, complex codebases while still having CC integration.
Bootstrapped solo dev here. I enjoyed using Claude to get little things done that I had parked on my TODO list below the important stuff, like updating a landing page, or in your case perhaps adding automated testing for the frontend (so you don't have to click through yourself). It's just nice having someone come up with a proposal for how to implement something; even if it's not the perfect way, it's a good starting point. I also keep one Claude instance running to implement the main feature, in a tight feedback loop so that I know exactly what it's doing. Yes, sometimes it takes a bit longer, but I use that time checking what the other Claudes are doing...
For me, Cursor provides a much tighter feedback loop than Claude Code. I can review, revert, iterate, and change models to get what I need. Claude Code sometimes feels like it's presented more as a yolo option, where you put more trust in the agent about what it will produce.
I think the ability to change models is critical. Some models are better at designing frontend than others. Some are better at different programming languages, writing copy, blogs, etc.
I feel sabotaged if I can’t switch the models easily to try the same prompt and context across all the frontier options
Claude Code spends most of its time poking around the files. It doesn't have any knowledge of the project by default (no file index etc), unless they changed it recently.
When I was using it a lot, I created a startup hook that just dumped a file listing into the context, or the actual full code on very small repos.
I also got some gains from using a custom edit tool I made which can edit multiple chunks in multiple files simultaneously. It was about 3x faster. I had some edge cases where it broke though, so I ended up disabling it.
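A multi-chunk, multi-file edit tool like the one described can be sketched in a few lines. This is a hypothetical reconstruction, not the commenter's actual tool; the `{path, old, new}` edit format and the validate-before-writing pass are my assumptions:

```python
from pathlib import Path

def apply_edits(edits: list[dict]) -> None:
    """Apply many {path, old, new} replacements across files in one call.

    All edits are validated before anything is written, so one bad
    edit can't leave the tree half-modified (a guess at the kind of
    edge case that bit the commenter's version).
    """
    contents: dict[str, str] = {}
    for e in edits:
        # Read each file once, then keep editing the in-memory copy.
        if e["path"] not in contents:
            contents[e["path"]] = Path(e["path"]).read_text()
        text = contents[e["path"]]
        if text.count(e["old"]) != 1:
            raise ValueError(f'{e["path"]}: "old" must occur exactly once')
        contents[e["path"]] = text.replace(e["old"], e["new"], 1)
    # Only write after every edit validated and applied in memory.
    for path, text in contents.items():
        Path(path).write_text(text)
```

Batching like this saves a model round-trip per chunk, which is where the ~3x speedup plausibly comes from.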
Seems like there's a speed/autonomy spectrum where Cursor is the fastest, Codex is the best for long-running jobs, and Claude is somewhere in the middle.
Personally, I found Cursor to be too inaccurate to be useful (possibly because I use Julia, which is relatively obscure) – Opus has been roughly the right level for my "pair programming" workflow.
> in particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git
We're making this better very soon! In the coming weeks hopefully.
> remember which changes I made vs which changes AI made..
They are improving this use case too with their enhanced blame. I think it was mentioned in their latest update blog.
You'll be able to hover over lines to see if you wrote it, or an AI. If it was an AI, it will show which model and a reference to the prompt that generated it.
I do like Cursor quite a lot.
The OpenClaw/pi-agent situation seems similar to ollama/llama-cpp, where the former gets all the hype, while the latter is actually the more impressive part.
This is great work; I'm looking forward to seeing how it evolves. So far Claude Code seems best despite its bugs, given the generous subscription. But when the market corrects and prices move closer to API prices, paying a per-token premium for an optimized experience will probably be a better deal than suffering through Claude Code's glitches and paper cuts.
The realization is that, in the end, an agent framework that is customizable and can be recursively improved by agents is going to beat a rigid proprietary client app.
> but when the market corrects and the prices will get closer to API prices
I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous. We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think and there is no shortage of funding for R&D.
I also wouldn’t bet on Claude Code staying the same as it is right now with little glitches. All of the tools are going to improve over time. In my experience the competing tools aren’t bug free either but they get a pass due to underdog status. All of the tools are improving and will continue to do so.
This is basically identical to the ChatGPT/GPT-3 situation ;) You know OpenAI themselves keep saying "we still don't understand why ChatGPT is so popular... GPT was already available via API for years!"
FWIW, you can use subscriptions with pi. OpenAI has blessed pi allowing users to use their GPT subscriptions. Same holds for other providers, except Flicker Company.
And I'm personally very happy that Peter's project gets all the hype. The pi repo already gets enough vibesloped PRs from openclaw users as it is, and it's still only 1/100th of what the openclaw repository has to suffer through.
And like ollama it will no doubt start to get enshittified.
This is the first I'm hearing of this pi-agent thing and HOW DO TECH PEOPLE DECIDE TO NAME THINGS?
Seriously. Is the creator not aware that "pi" absolutely invokes the name of another very important thing? sigh.
I'm writing my own agent too as a side project at work. This is a good article but simultaneously kinda disappointing. The entire agent space has disappeared down the same hole, with exactly the same core design used everywhere and everyone making the same mistakes. The focus on TUIs I find especially odd. We're at the dawn of the AI age and people are trying to optimize the framerate of Teletext? If you care about framerates use a proper GUI framework!
The agent I'm writing shares some ideas with Pi but otherwise departs quite drastically from the core design used by Claude Code, Codex, Pi etc, and it seems to have yielded some nice benefits:
• No early stopping ("shall I continue?", "5 tests failed -> all tests passed, I'm done" etc).
• No permission prompts but also no YOLO mode or broken Seatbelt sandboxes. Everything is executed in a customized container designed specifically for the model and adapted to its needs. The agent does a lot of container management to make this work well.
• Agent can manage its own context window, and does. I never needed to add compaction because I never yet saw it run out of context.
• Seems to be fast compared to other agents, at least in any environment where there's heavy load on the inferencing servers.
• Eliminates "slop-isms" like excessive error swallowing, narrative commenting, dropping fully qualified class names into the middle of source files etc.
• No fancy TUI. I don't want to spend any time fixing flickering bugs when I could be improving its skill at the core tasks I actually need it for.
It's got downsides too, it's very overfit to the exact things I've needed and the corporate environment it runs in. It's not a full replacement for CC or Codex. But I use it all the time and it writes nearly all my code now.
The agent is owned by the company and they're starting to ask about whether it could be productized so I suppose I can't really go into the techniques used to achieve this, sorry. Suffice it to say that the agent design space is far wider and deeper than you'd initially intuit from reading articles like this. None of the ideas in my agent are hard to come up with so explore!
> Special shout out to Google who to this date seem to not support tool call streaming which is extremely Google.
Google doesn't even provide a tokenizer to count tokens locally. The results of this stupidity can be seen directly in AI studio which makes an API call to count_tokens every time you type in the prompt box.
tbf neither does anthropic
AI studio also has a bug that continuously counts the tokens, typing or not, with 100% CPU usage.
Sometimes I wonder who is drawing more power, my laptop or the TPU cluster on the other side.
> The second approach is to just write to the terminal like any CLI program, appending content to the scrollback buffer
This is how I prototyped all of mine. Console.Write[Line].
I am currently polishing up one of the prototypes with WinForms (.NET 10) & WebView2. Building something that looks like a WhatsApp conversation in basic WinForms is a lot of work; it takes about 60 seconds in a web view.
I am not too concerned about cross platform because a vast majority of my users will be on windows when they'd want to use this tool.
If you use WPF you can have the Mica backdrop underneath your WebView2 content and set the WebView2 to a transparent background color, which looks nice and a little more native, FYI. Though if you're doing more than just showing the WebView, maybe it isn't worth switching.
[dead]
I was confused by him basically inventing his own skills, but I guess this is from Nov 2025, so it makes sense; skills were pretty new at that point.
Also please note this is nowhere on the terminal bench leaderboard anymore. I'd advise everyone reading the comments here to be aware of that. This isn't a CLI to use. Just a good experiment and write up.
It's batteries-not-included, by design. Here's what it looks like with batteries (and note who owns this repo):
https://github.com/mitsuhiko/agent-stuff/tree/main
Perhaps benchmarks aren't the best judge.
I don’t follow nor use pi, so no horse in this race, but I think the results were never submitted to Terminal-Bench? Not sure how the process works exactly, but it’s entirely missing from the benchmark. Is this a sign of weakness? I honestly don’t know.
Main reason I haven’t switched over to the new pi coding agent (or even fully to Claude Code alternatives) is the price point. I eat tokens for breakfast, lunch, and dinner.
I’m on a $100/mo plan, but the codex bar makes it look like I’m burning closer to $500 every 30 days. I tried going local with Qwen 3 (coding) on a Blackwell Pro 6000, and it still feels a beat behind, either laggy, or just not quite good enough for me to fully relinquish Claude Code.
Curious what other folks are seeing: any success stories with other agents on local models, or are you mostly sticking with proprietary models?
I’m feeling a bit vendor-locked into Claude Code: it’s pricey, but it’s also annoyingly good
And it's doubtful they are anywhere near break-even costs.
According to the article, Pi massively shrinks your context use (due to smaller system prompt and lack of MCPs) so your token use may drop. Also Pi seems to support Anthropic OAuth for your plan (but afaik they might ban you)
I particularly liked Mario's point about using tmux for long-running commands. I've found models to be very good at reading from / writing to tmux, so I'll do things like spin up a session with a REPL, use Claude to prototype something, then inspect it more deeply in the REPL.
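That tmux loop is easy to wire up as a small set of tools. A sketch under my own naming (the function names are mine; the `run` helper needs tmux installed, while the argv builders are pure):

```python
import subprocess

def tmux(*args: str) -> list[str]:
    """Build a tmux argv (kept separate so commands are easy to log/test)."""
    return ["tmux", *args]

def start_repl(session: str, command: str = "python3") -> list[str]:
    # Detached session running a REPL the agent can later talk to.
    return tmux("new-session", "-d", "-s", session, command)

def send_line(session: str, line: str) -> list[str]:
    # send-keys types into the pane; the trailing "Enter" submits the line.
    return tmux("send-keys", "-t", session, line, "Enter")

def read_pane(session: str) -> list[str]:
    # capture-pane -p prints the pane's visible contents to stdout.
    return tmux("capture-pane", "-p", "-t", session)

def run(argv: list[str]) -> str:
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
```

With tmux available: `run(start_repl("scratch"))`, then `run(send_line("scratch", "2 + 2"))`, then `print(run(read_pane("scratch")))` to inspect the REPL state — the same prototype-then-inspect loop described above.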
One aspect that resonates from stories like this is the tension between opinionated design and real-world utility.
When building something minimal, especially in areas like agent-based tooling or assistants, the challenge isn’t only about reducing surface area — it’s about focusing that reduction around what actually solves a user’s problem.
A minimal agent that only handles edge cases, or only works in highly constrained environments, can feel elegant on paper but awkward in practice. Conversely, a slightly less minimal system that still maintains clarity and intent often ends up being more useful without being bloated.
In my own experience launching tools that involve analysis and interpretation, the sweet spot always ends up being somewhere in the intersection of: - clearly scoped core value, - deliberately limited surface, and - enough flexibility to handle real user variation.
Curious how others think about balancing minimalism and practical coverage when designing agents or abstractions in their own projects.
begone, bot
The best deep-dive into coding agents (and best architecture) I've seen so far. And I love the minimalism with this design, but there's so much complexity necessary already, it's kind of crazy. Really glad I didn't try to write my own :)
Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent. It would process read-only requests automatically but write requests would send a request to the user to authorize. I haven't yet found somebody else writing this, so I might as well give it a shot
Other than credentialed calls, I have Docker-in-Docker in a VM, so all other actions will be YOLO'd. I think this is the only reasonable system for long-running loops.
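A first cut of that broker's decision logic might look like this. Everything here is an assumption on my part — the read-only allow-list, the approval callback, and the placeholder executor — but it shows the shape: the agent submits a command and only ever gets back output, never credentials:

```python
import subprocess

# Commands matching these prefixes are assumed safe to auto-approve.
# (Illustrative allow-list; a real broker would want something stricter.)
READ_ONLY_PREFIXES = ("git log", "git diff", "git status", "kubectl get")

def is_read_only(command: str) -> bool:
    return command.startswith(READ_ONLY_PREFIXES)

def execute_with_credentials(command: str) -> str:
    # Placeholder: a real broker would inject secrets from a separate
    # store (keyring, vault) into this subprocess's environment. The
    # agent only sees the returned output, never the secrets.
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

def broker(command: str, ask_user) -> str:
    """Run a credentialed command on the agent's behalf.

    Read-only commands run automatically; anything that might write
    requires explicit human approval, as described above.
    """
    if is_read_only(command) or ask_user(f"Agent wants to run {command!r}. Allow?"):
        return execute_with_credentials(command)
    return "denied: user rejected the command"
```

The broker runs as a segregated process, so even a fully compromised agent can only submit commands into this chokepoint.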
> Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent
This is a problem that model context protocol solves
Your MCP server has the creds, your agent does not.
Really awesome and thoughtful thing you've built - bravo!
I'm so aligned on your take on context engineering / context management. I found the default linear flow of conversation turns really frustrating and limiting. In fact, I still do. Sometimes you know upfront that the next thing you're to do will flood/poison the nicely crafted context you've built up... other times you realise after the fact. In both cases, you didn't have that many alternatives but to press on... Trees are the answer for sure.
I actually spent most of Dec building something with the same philosophy for my own use (aka me as the agent) when doing research and ideation with LLMs. Frustrated by most of the same limitations - want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff. Be able to traverse the tree forwards and back to understand how I got to a place...
Anyway, you've definitely built the more valuable incarnation of this - great work. I'm glad I peeled back the surface of the moltbot hysteria to learn about Pi.
> want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff
My attempt - a minimalist graph format that is a simple markdown file with inline citations. I load MIND_MAP.md at the start of work, and update it at the end. It reduces context waste to resume or spawn subagents. Memory across sessions.
https://pastebin.com/VLq4CpCT
I’m just curious why your writing is punctuated by lots of word breaks. I hardly see hyphenated word breaks across lines anymore and it made me pause on all those occurrences. I do remember having to do this with literal typewriters.
According to dev tools this is a simple `hyphens: auto` CSS
Minimal, intentional guidance is the cornerstone of my CLAUDE.md’s design philosophy document.
https://github.com/willswire/dotfiles/blob/main/claude/.clau...
I'm curious how the costs compare when hitting APIs directly with something like this, vs. my $20 ChatGPT plan, which includes Codex.
You can use your ChatGPT subscription with Pi!
[dead]
I'm hoping someone makes an agent that fixes the container situation, better:
> If you're uncomfortable with full access, run pi inside a container or use a different tool if you need (faux) guardrails.
I'm sick of doing this. I also don't want faux guardrails. What I do want is an agent front-end that is trustworthy in the sense that it will not, even when instructed by the LLM inside, do anything to my local machine. So it should have tools that run in a container. And it should have really nice features like tools that can control a container and create and start containers within appropriate constraints.
In other words, the 'edit' tool is scoped to whatever I've told the front-end that it can access. So is 'bash' and therefore anything bash does. This isn't a heuristic like everyone running in non-YOLO-mode does today -- it’s more like a traditional capability system. If I want to use gVisor instead of Docker, that should be a very small adaptation. Or Firecracker or really anything else. Or even some random UART connection to some embedded device, where I want to control it with an agent but the device is neither capable of running the front-end nor of connecting to the internet (and may not even have enough RAM to store a conversation!).
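The scoping described above is mechanically simple to enforce in the front-end. A sketch (my own illustration of the capability idea, not an existing tool) of file tools hard-scoped to one granted directory tree:

```python
from pathlib import Path

class ScopedFs:
    """File tools hard-scoped to one directory tree.

    The check lives in the harness, not in the prompt, so nothing the
    LLM says can widen the scope - unlike permission prompts, which
    rely on the model asking nicely.
    """
    def __init__(self, root: str):
        self.root = Path(root).resolve()

    def _resolve(self, path: str) -> Path:
        # Resolve symlinks and ".." before checking containment.
        p = (self.root / path).resolve()
        if self.root not in [p, *p.parents]:
            raise PermissionError(f"{path} escapes the granted scope")
        return p

    def read(self, path: str) -> str:
        return self._resolve(path).read_text()

    def write(self, path: str, text: str) -> None:
        p = self._resolve(path)
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text(text)
```

Swapping the backend for a container, gVisor, or a serial link then only changes what `read`/`write` talk to, not the capability check itself.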
I think this would be both easier to use and more secure than what's around right now. Instead of making a container for a project and then dealing with installing the agent into the container, I want to run the agent front-end and then say "Please make a container based on such-and-such image and build me this app inside." Or "Please make three containers as follows".
As a side bonus, this would make designing a container sandbox sooooo much easier, since the agent front-end would not itself need to be compatible with the sandbox. So I could run a container with -net none and still access the inference API.
Contrast with today, where I wanted to make a silly Node app. Step 1: Ask ChatGPT (the web app) to make me a Dockerfile that sets up the right tools including codex-rs and then curse at it because GPT-5.2 is really remarkably bad at this. This sucks, and the agent tool should be able to do this for me, but that would currently require a completely unacceptable degree of YOLO.
(I want an IDE that works like this too. vscode's security model is comically poor. Hmm, an IDE is kind of like an agent front-end except the tools are stronger and there's no AI involved. These things could share code.)
Great writeup on minimal agent architecture. The philosophy of "if I don't need it, it won't be built" resonates strongly.
I've been running OpenClaw (which sits on top of similar primitives) to manage multiple simultaneous workflows - one agent handles customer support tickets, another monitors our deployment pipeline, a third does code reviews. The key insight I hit was exactly what you describe: context engineering is everything.
What makes OpenClaw particularly interesting is the workspace-first model. Each agent has AGENTS.md, TOOLS.md, and a memory/ directory that persists across sessions. You can literally watch agents learn from their mistakes by reading their daily logs. It's less magic, more observable system.
The YOLO-by-default approach is spot on. Security theater in coding agents is pointless - if it can write and execute code, game over. Better to be honest about the threat model.
One pattern I documented at howtoopenclawfordummies.com: running multiple specialized agents beats one generalist. Your sub-agent discussion nails why - full observability + explicit context boundaries. I have agents that spawn other agents via tmux, exactly as you suggest.
The benchmark results are compelling. Would love to see pi and OpenClaw compared head-to-head on Terminal-Bench.
Glad to see more people doing this!
I built on ADK (Agent Development Kit), which comes with many of the features discussed in the post.
Building a full, custom agent setup is surprisingly easy and a great learning experience for this transformational technology. Getting into instruction and tool crafting was where I found the most ROI.
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
https://github.com/NTT123/nano-agent
An excellent piece of writing.
One thing I do find is that subagents are helpful for performance -- offloading tasks to smaller models (gpt-oss specifically for me) gets data to the bigger model quicker.
I always wonder what type of moat systems / business like these have
None, basically.
The only moat in all of this is capital.
Its open source. Where does it say he wants to monetise it?
Capital, both social and economic.
Also data, see https://news.ycombinator.com/item?id=46637328
>The only way you could prevent exfiltration of data would be to cut off all network access for the execution environment the agent runs in
You can sandbox off the data.
Armin Ronacher wrote a good piece about why he uses Pi here: https://lucumr.pocoo.org/2026/1/31/pi/
I hadn't realized that Pi is the agent harness used by OpenClaw.
As a user of a minimal, opinionated agent (https://exe.dev) I've observed at least 80% of this article's findings myself.
Small and observable is excellent.
Letting your agent read traces of other sessions is an interesting method of context trimming.
Especially, "always Yolo" and "no background tasks". The LLM can manage Unix processes just fine with bash (e.g. ps, lsof, kill), and if you want you can remind it to use systemd, and it will. (It even does it without rolling its eyes, which I normally do when forced to deal with systemd.)
Something he didn't mention is git: talk to your agent a commit at a time. Recently I had a colleague check in his minimal, broken PoC on a new branch with the commit message "work in progress". We pointed the agent at the branch and said, "finish the feature we started" and it nailed it in one shot. No context whatsoever other than "draw the rest of the f'ing owl" and it just.... did it. Fascinating.
Pi probably has the best architecture, and being written in JavaScript it is well positioned to use the browser-sandbox architecture that I think is the future for AI agents.
I only wish the author changed his stance on vendor extensions: https://github.com/badlogic/pi-mono/discussions/254
“standardize the intersection, expose the union” is a great phrase I hadn’t heard articulated before
[dead]
"Also, it [Claude Code] flickers" - it does, doesn't it? Why?.. Did it vibe code itself so badly that this is hopeless to fix?..
Claude Code's programmers are very open about the fact that they vibe code it.
Because they target a 60 fps refresh, with 11 of the 16 ms budget per frame being wasted by React itself.
They are locked into this naive, horrible framework that would be embarrassing to open source even if they had the permission to do it.
> If you look at the security measures in other coding agents, they're mostly security theater. As soon as your agent can write code and run code, it's pretty much game over.
At least for Codex, the agent runs commands inside an OS-provided sandbox (Seatbelt on macOS, and other stuff on other platforms). It does not end up "making the agent mostly useless".
My Codex just uses Python to write files around the sandbox when I ask it to patch an SDK outside its path.
Does Codex randomly decide to disable the sandbox like Claude Code does?
You really shouldn’t be running agents outside of a container. That’s 101.
Approval should be mandatory for any non-read tool call. You should read everything your LLM intends to do, and approve it manually.
"But that is annoying and will slow me down!" Yes, and so will recovering from disastrous tool calls.
I'm trying to understand this workflow. I have just started using Codex -- literally two days in. I have it hooked up to my GitHub repo, and it just runs in the cloud and creates a PR. I have it touching only UI and middle-layer code. No DB changes; I always tell it not to touch the models.
I don't know how to feel about being the only one refusing to run yolo mode until the tooling is there, which is still about 6 months away for my setup. Am I years behind everyone else by then? You can get pretty far without completely giving in. Agents really don't need to execute that many arbitrary commands: linting, search, edit, and web access should all be bespoke tools integrated into the permission and sandbox system. Agents shouldn't even be allowed to start and stop applications that support dev mode; they edit files, can run tests, and get the logs -- what else would they need to do? Especially as the number of external dependencies that make sense shrinks to a handful, you can approve every new one without headache. If your runtime supports sandboxing and permissions, like Deno or workerd, that adds an initial layer of defense.
This makes it even more baffling that Anthropic went with Bun, a runtime without any sandboxing or security architecture. Will they rely on Apple Seatbelt alone?
You use YOLO mode inside some sandbox (VM, container). Give the container only access to the necessary resources.
[dead]
Being minimalist is real power these days as everything around us keeps shoving features in our face every week with a million tricks and gimmicks to learn. Something minimalist like this is honestly a breath of fresh air!
The YOLO mode is also good, but having a small ‘baby setting mode’ that’s not full-blown system access would make sense for basic security. Just a sensible layer of "pls don't blow my machine" without killing the freedom :)
Pi supports restricting the set of tools given to an agent; the pi --help output includes an example of this.
Otherwise, "yolo mode" inside a sandbox is perfectly reasonable. A basic bubblewrap configuration can expose read-only system tools and have a read/write project directory while hiding sensitive information like API keys and other home-directory files.Can I replace Vercel’s AI SDK with Pi’s equivalent?
It's not an API drop in replacement, if that's what you mean. But the pi-ai package serves the same purpose as Vercel's AI SDK. https://github.com/badlogic/pi-mono/tree/main/packages/ai
I've seen a couple of power users already switching to Pi [1], and I'm considering that too. The premise is very appealing:
- Minimal, configurable context - including system prompts [2]
- Minimal and extensible tools; for example, todo tasks [3]
- No built-in MCP support; extensions exist [4]. I'd rather use mcporter [5]
Full control over context is a high-leverage capability. If you're aware of the many limitations of context on performance (in-context retrieval limits [6], context rot [7], etc.), you'd truly appreciate Pi lets you fine-tune context for optimal performance.
It's clearly not for everyone, but I can see how powerful it can be.
---
[1] https://lucumr.pocoo.org/2026/1/31/pi/
[2] https://github.com/badlogic/pi-mono/tree/main/packages/codin...
[3] https://github.com/mitsuhiko/agent-stuff/blob/main/pi-extens...
[4] https://github.com/nicobailon/pi-mcp-adapter
[5] https://github.com/steipete/mcporter
[6] https://github.com/gkamradt/LLMTest_NeedleInAHaystack
[7] https://research.trychroma.com/context-rot
Pi is the part of moltXYZ that should have gone viral. Armin is way ahead of the curve here.
The Claude sub is the only think keeping me on Claude Code. It's not as janky as it used to be, but the hooks and context management support are still fairly superficial.
I really like pi and have started using it to build my agent. Mario's article fully reveals some design trade-offs and complexities in the construction process of coding agents and even general agents. I have benefited a lot!
The solution to the security issue is using `useradd`.
I would add subagents though. They allow for the pattern where the top agent directs / observe a subagent executing a step in a plan.
The top agent is both better at directing a subagent, and it keeps the context clean of details that don't matter - otherwise they'd be in the same step in the plan.
There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.
Or you use any of the packages people provide, like this one: https://github.com/nicobailon/pi-subagents
> from copying and pasting code into ChatGPT, to Copilot auto-completions [...], to Cursor, and finally the new breed of coding agent harnesses like Claude Code, Codex, Amp, Droid, and opencode
Reading HN I feel a bit out of touch since I seem to be "stuck" on Cursor. Tried to make the jump further to Claude Code like everyone tells me to, but it just doesn't feel right...
It may be due to the size of my codebase -- I'm 6 months into solo developer bootstrap startup, so there isn't all that much there, and I can iterate very quickly with Cursor. And it's mostly SPA browser click-tested stuff. Comparatively it feels like Claude Code spends an eternity to do something.
(That said Cursor's UI does drive me crazy sometimes. In particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git -- I would have preferred that to instead actively use something integrated in git (Staged vs Unstaged hunks). More important to have a good code review experience than to remember which changes I made vs which changes AI made..)
Probably an ideal compromise solution for you would be to install the official Claude Code extension for VS Code, so you have an IDE for navigating large, complex codebases while still having CC integration.
Bootstrapped solo dev here. I enjoyed using Claude to get little things done which I happed on my TODO list below the important stuff, like updating a landing page, or in your case perhaps adding automated testing for the frontend stuff (so you don't have to click yourself). It's just nice having someone coming up with a proposal on how to implement something, even it's not the perfect way, it's good as a starter. Also I have one Claude instance running to implement the main feature, in a tight feedback loop so that I know exactly what it's doing. Yes, sometimes it takes a bit longer, but I use the time checking what the other Claudes are doing...
For me cursor provides a much tighter feedback loop than Claude code. I can review revert iterate change models to get what I need. It feels sometimes Claude code is presented more as a yolo option where you put more trust on the agent about what it will produce.
I think the ability to change models is critical. Some models are better at designing frontend than others. Some are better at different programming languages, writing copy, blogs, etc.
I feel sabotaged if I can’t switch the models easily to try the same prompt and context across all the frontier options
Claude Code spends most of its time poking around the files. It doesn't have any knowledge of the project by default (no file index etc), unless they changed it recently.
When I was using it a lot, I created a startup hook that just dumped a file listing into the context, or the actual full code on very small repos.
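A minimal sketch of that kind of listing dump (the hook wiring itself is tool-specific and omitted here; the skip-list and truncation limit are my assumptions, not the commenter's actual hook):

```python
#!/usr/bin/env python3
"""Dump a compact file listing for injection into an agent's context.

Sketch of the startup-hook idea: run this at session start and prepend
its stdout to the conversation so the model starts with a project map.
"""
import os

# Directories that add noise without adding signal (illustrative list)
SKIP = {".git", "node_modules", "__pycache__", ".venv", "dist"}

def file_listing(root: str, max_files: int = 500) -> str:
    """Return a newline-separated listing of repo files, noisy dirs pruned."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped dirs
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP)
        for name in sorted(filenames):
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
            if len(paths) >= max_files:
                return "\n".join(paths)  # truncate on large repos
    return "\n".join(paths)

if __name__ == "__main__":
    print(file_listing("."))
```

On a very small repo you could swap the listing for the full file contents, as the comment describes, at the cost of more context tokens.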
I also got some gains from using a custom edit tool I made which can edit multiple chunks in multiple files simultaneously. It was about 3x faster. I had some edge cases where it broke though, so I ended up disabling it.
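For illustration, a batch edit tool along those lines might validate every chunk before writing anything, which is one way to avoid half-applied edits -- the kind of edge case mentioned above. This is a sketch of the general technique, not the commenter's actual tool:

```python
"""Sketch of a multi-file, multi-chunk edit tool: the model emits a batch
of (path, old_text, new_text) edits, and the tool stages every replacement
in memory before touching disk, so one bad match aborts the whole batch."""
from pathlib import Path

def apply_edits(edits: list[tuple[str, str, str]]) -> None:
    """Apply all edits or none. Each old_text must occur exactly once."""
    staged: dict[str, str] = {}
    for path, old, new in edits:
        # Later chunks in the batch see earlier chunks' changes
        text = staged.get(path, Path(path).read_text())
        if text.count(old) != 1:
            raise ValueError(f"{path}: expected exactly one match for chunk")
        staged[path] = text.replace(old, new, 1)
    for path, text in staged.items():  # only write once everything matched
        Path(path).write_text(text)
```

The exactly-one-match check is the strict part; relaxing it (e.g. allowing zero matches as a no-op) is where ambiguity and breakage tend to creep in.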
Seems like there's a speed/autonomy spectrum where Cursor is the fastest, Codex is the best for long-running jobs, and Claude is somewhere in the middle.
Personally, I found Cursor to be too inaccurate to be useful (possibly because I use Julia, which is relatively obscure) – Opus has been roughly the right level for my "pair programming" workflow.
> in particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git
We're making this better very soon! In the coming weeks hopefully.
> remember which changes I made vs which changes AI made..
They are improving this use case too with their enhanced blame. I think it was mentioned in their latest update blog.
You'll be able to hover over lines to see if you wrote it, or an AI. If it was an AI, it will show which model and a reference to the prompt that generated it.
I do like Cursor quite a lot.
The OpenClaw/pi-agent situation seems similar to ollama/llama-cpp, where the former gets all the hype, while the latter is actually the more impressive part.
This is great work; I'm looking forward to seeing how it evolves. So far Claude Code seems best despite its bugs, given the generous subscription, but when the market corrects and prices get closer to API prices, the pay-per-token premium with an optimized experience will probably be a better deal than suffering Claude Code's glitches and paper cuts.
The realization is that, in the end, an agent framework that is customizable and can be recursively improved by agents is going to be better than a rigid proprietary client app.
> but when the market corrects and the prices will get closer to API prices
I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous. We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think and there is no shortage of funding for R&D.
I also wouldn’t bet on Claude Code staying the same as it is right now with little glitches. All of the tools are going to improve over time. In my experience the competing tools aren’t bug free either but they get a pass due to underdog status. All of the tools are improving and will continue to do so.
This is basically identical to the ChatGPT/GPT-3 situation ;) You know OpenAI themselves keep saying "we still don't understand why ChatGPT is so popular... GPT was already available via API for years!"
FWIW, you can use subscriptions with pi. OpenAI has blessed pi allowing users to use their GPT subscriptions. Same holds for other providers, except Flicker Company.
And I'm personally very happy that Peter's project gets all the hype. The pi repo already gets enough vibesloped PRs from openclaw users as is, and it's still only 1/100th of what the openclaw repository has to suffer through.
And like ollama it will no doubt start to get enshittified.
This is the first I'm hearing of this pi-agent thing, and HOW DO TECH PEOPLE DECIDE TO NAME THINGS?
Seriously. Is the creator not aware that "pi" absolutely invokes the name of another very important thing? Sigh.
I'm writing my own agent too as a side project at work. This is a good article but simultaneously kinda disappointing. The entire agent space has disappeared down the same hole, with exactly the same core design used everywhere and everyone making the same mistakes. The focus on TUIs I find especially odd. We're at the dawn of the AI age and people are trying to optimize the framerate of Teletext? If you care about framerates use a proper GUI framework!
The agent I'm writing shares some ideas with Pi but otherwise departs quite drastically from the core design used by Claude Code, Codex, Pi etc, and it seems to have yielded some nice benefits:
• No early stopping ("shall I continue?", "5 tests failed -> all tests passed, I'm done" etc).
• No permission prompts but also no YOLO mode or broken Seatbelt sandboxes. Everything is executed in a customized container designed specifically for the model and adapted to its needs. The agent does a lot of container management to make this work well.
• Agent can manage its own context window, and does. I never needed to add compaction because I never yet saw it run out of context.
• Seems to be fast compared to other agents, at least in any environment where there's heavy load on the inferencing servers.
• Eliminates "slop-isms" like excessive error swallowing, narrative commenting, dropping fully qualified class names into the middle of source files etc.
• No fancy TUI. I don't want to spend any time fixing flickering bugs when I could be improving its skill at the core tasks I actually need it for.
It's got downsides too: it's very overfit to the exact things I've needed and the corporate environment it runs in. It's not a full replacement for CC or Codex. But I use it all the time and it writes nearly all my code now.
The agent is owned by the company and they're starting to ask about whether it could be productized so I suppose I can't really go into the techniques used to achieve this, sorry. Suffice it to say that the agent design space is far wider and deeper than you'd initially intuit from reading articles like this. None of the ideas in my agent are hard to come up with so explore!
> Special shout out to Google who to this date seem to not support tool call streaming which is extremely Google.
Google doesn't even provide a tokenizer to count tokens locally. The results of this stupidity can be seen directly in AI studio which makes an API call to count_tokens every time you type in the prompt box.
tbf neither does anthropic
AI studio also has a bug that continuously counts the tokens, typing or not, with 100% CPU usage.
Sometimes I wonder who is drawing more power, my laptop or the TPU cluster on the other side.
> The second approach is to just write to the terminal like any CLI program, appending content to the scrollback buffer
This is how I prototyped all of mine. Console.Write[Line].
I am currently polishing up one of the prototypes with WinForms (.NET10) & WebView2. Building something that looks like a WhatsApp conversation in basic winforms is a lot of work. This takes about 60 seconds in a web view.
I am not too concerned about cross platform because a vast majority of my users will be on windows when they'd want to use this tool.
If you use WPF you can have the Mica backdrop underneath your WebView2 content and set the WebView2 to a transparent background color, which looks nice and a little more native, FYI. Though if you're doing more than just showing the WebView, it may not be worth switching.
I was confused by him basically inventing his own skills, but I guess this is from Nov 2025, so it makes sense -- skills were pretty new at that point.
Also, please note this is nowhere on the Terminal-Bench leaderboard anymore. I'd advise everyone reading the comments here to be aware of that. This isn't a CLI to use -- just a good experiment and write-up.
It's batteries-not-included, by design. Here's what it looks like with batteries (and note who owns this repo):
https://github.com/mitsuhiko/agent-stuff/tree/main
Perhaps benchmarks aren't the best judge.
I don't follow or use pi, so I have no horse in this race, but I think the results were never submitted to Terminal-Bench? Not sure how the process works exactly, but it's entirely missing from the benchmark. Is this a sign of weakness? I honestly don't know.
Main reason I haven’t switched over to the new pi coding agent (or even fully to Claude Code alternatives) is the price point. I eat tokens for breakfast, lunch, and dinner.
I’m on a $100/mo plan, but the codex bar makes it look like I’m burning closer to $500 every 30 days. I tried going local with Qwen 3 (coding) on a Blackwell Pro 6000, and it still feels a beat behind, either laggy, or just not quite good enough for me to fully relinquish Claude Code.
Curious what other folks are seeing: any success stories with other agents on local models, or are you mostly sticking with proprietary models?
I’m feeling a bit vendor-locked into Claude Code: it’s pricey, but it’s also annoyingly good
And it's doubtful they are anywhere near break-even costs.
According to the article, Pi massively shrinks your context use (due to smaller system prompt and lack of MCPs) so your token use may drop. Also Pi seems to support Anthropic OAuth for your plan (but afaik they might ban you)
I particularly liked Mario's point about using tmux for long-running commands. I've found models to be very good at reading from / writing to tmux, so I'll do things like spin up a session with a REPL, use Claude to prototype something, then inspect it more deeply in the REPL.
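The tmux subcommands involved (new-session, send-keys, capture-pane) are standard; here's a minimal sketch of the pattern, driving a REPL the way an agent's bash tool would. The session name and the choice of REPL are mine:

```python
"""Sketch of driving a long-running REPL through tmux: the agent sends
lines into a detached session and reads the visible screen back."""
import subprocess

SESSION = "agent-repl"  # arbitrary session name

def tmux_cmd(*args: str) -> list[str]:
    """Build the argv for a tmux subcommand (kept separate for testability)."""
    return ["tmux", *args]

def tmux(*args: str) -> str:
    return subprocess.run(tmux_cmd(*args), check=True,
                          capture_output=True, text=True).stdout

def start_repl() -> None:
    # Detached session running an interactive Python REPL
    tmux("new-session", "-d", "-s", SESSION, "python3 -i")

def send(line: str) -> None:
    tmux("send-keys", "-t", SESSION, line, "Enter")

def read_screen() -> str:
    # -p prints the pane contents to stdout instead of a buffer
    return tmux("capture-pane", "-t", SESSION, "-p")
```

The nice property is exactly what the comment describes: the REPL keeps its state between agent turns, and a human can attach to the same session to poke at it directly.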
One aspect that resonates from stories like this is the tension between opinionated design and real-world utility.
When building something minimal, especially in areas like agent-based tooling or assistants, the challenge isn’t only about reducing surface area — it’s about focusing that reduction around what actually solves a user’s problem.
A minimal agent that only handles edge cases, or only works in highly constrained environments, can feel elegant on paper but awkward in practice. Conversely, a slightly less minimal system that still maintains clarity and intent often ends up being more useful without being bloated.
In my own experience launching tools that involve analysis and interpretation, the sweet spot always ends up being somewhere in the intersection of: - clearly scoped core value, - deliberately limited surface, and - enough flexibility to handle real user variation.
Curious how others think about balancing minimalism and practical coverage when designing agents or abstractions in their own projects.
begone, bot
The best deep-dive into coding agents (and best architecture) I've seen so far. And I love the minimalism with this design, but there's so much complexity necessary already, it's kind of crazy. Really glad I didn't try to write my own :)
Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent. It would process read-only requests automatically but write requests would send a request to the user to authorize. I haven't yet found somebody else writing this, so I might as well give it a shot
Other than credentialed calls, I have Docker-in-Docker in a VM, so all other actions will be YOLO'd. I think this is the only reasonable system for long-running loops.
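A rough sketch of the broker idea: auto-approve a small allow-list of read-only commands and prompt a human for everything else, with the secret living only in the broker process. The allow-list, function names, and approval mechanism are all placeholders, not an existing tool:

```python
"""Sketch of a credential broker: the agent never holds the secret; it
sends a command string to this process, which auto-approves read-only
calls and asks a human before anything mutating."""
import shlex
import subprocess

# Illustrative read-only allow-list keyed on the first two argv tokens
READ_ONLY = {("git", "log"), ("git", "diff"), ("kubectl", "get")}

def is_read_only(command: str) -> bool:
    argv = shlex.split(command)
    return tuple(argv[:2]) in READ_ONLY

def broker_run(command: str,
               approve=lambda c: input(f"allow `{c}`? [y/N] ") == "y") -> str:
    """Run `command` on the agent's behalf; deny unapproved writes."""
    if not is_read_only(command) and not approve(command):
        return "denied"
    # A real broker would inject the credential into env/config here;
    # the agent process never sees it either way.
    return subprocess.run(shlex.split(command),
                          capture_output=True, text=True).stdout
```

Classifying commands as read-only by prefix is obviously a crude policy; the architectural point is just that approval and credentials both live outside the agent's process.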
> Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent
This is a problem that the Model Context Protocol solves.
Your MCP server has the creds; your agent does not.
Really awesome and thoughtful thing you've built - bravo!
I'm so aligned with your take on context engineering / context management. I found the default linear flow of conversation turns really frustrating and limiting. In fact, I still do. Sometimes you know upfront that the next thing you're about to do will flood/poison the nicely crafted context you've built up... other times you realise after the fact. In both cases, you don't have many alternatives but to press on... Trees are the answer for sure.
I actually spent most of Dec building something with the same philosophy for my own use (aka me as the agent) when doing research and ideation with LLMs. Frustrated by most of the same limitations - want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff. Be able to traverse the tree forwards and back to understand how I got to a place...
Anyway, you've definitely built the more valuable incarnation of this - great work. I'm glad I peeled back the surface of the moltbot hysteria to learn about Pi.
> want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff
My attempt - a minimalist graph format that is a simple markdown file with inline citations. I load MIND_MAP.md at the start of work, and update it at the end. It reduces context waste to resume or spawn subagents. Memory across sessions.
https://pastebin.com/VLq4CpCT
I’m just curious why your writing is punctuated by lots of word breaks. I hardly see hyphenated word breaks across lines anymore and it made me pause on all those occurrences. I do remember having to do this with literal typewriters.
According to dev tools, this is just `hyphens: auto` in CSS.
Minimal, intentional guidance is the cornerstone of my CLAUDE.md’s design philosophy document.
https://github.com/willswire/dotfiles/blob/main/claude/.clau...
I'm curious about how the costs compare using something like this where you're hitting api's directly vs my $20 ChatGPT plan which includes Codex.
You can use your ChatGPT subscription with Pi!
I'm hoping someone makes an agent that fixes the container situation, better:
> If you're uncomfortable with full access, run pi inside a container or use a different tool if you need (faux) guardrails.
I'm sick of doing this. I also don't want faux guardrails. What I do want is an agent front-end that is trustworthy in the sense that it will not, even when instructed by the LLM inside, do anything to my local machine. So it should have tools that run in a container. And it should have really nice features like tools that can control a container and create and start containers within appropriate constraints.
In other words, the 'edit' tool is scoped to whatever I've told the front-end that it can access. So is 'bash' and therefore anything bash does. This isn't a heuristic like everyone running in non-YOLO-mode does today -- it’s more like a traditional capability system. If I want to use gVisor instead of Docker, that should be a very small adaptation. Or Firecracker or really anything else. Or even some random UART connection to some embedded device, where I want to control it with an agent but the device is neither capable of running the front-end nor of connecting to the internet (and may not even have enough RAM to store a conversation!).
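A rough sketch of what that scoping could look like: every tool invocation is rewritten to run inside a named container, so no tool path can reach the host. The container name, docker usage, and function names here are all mine, not any existing agent's API; swapping in gVisor or Firecracker would mean changing only the argv builder:

```python
"""Sketch of capability-style tool scoping: the front-end runs on the
host, but the model's 'bash' tool can only execute inside a container."""
import subprocess

def build_argv(container: str, command: str) -> list[str]:
    # The single place that decides *where* tools run. The model can't
    # reach the host because no tool path bypasses this function.
    return ["docker", "exec", container, "bash", "-lc", command]

def bash_tool(container: str, command: str) -> str:
    """The 'bash' tool exposed to the model, scoped to `container`."""
    return subprocess.run(build_argv(container, command),
                          capture_output=True, text=True).stdout
```

Because the front-end itself stays outside the sandbox, the container can run with `--network none` while the front-end still reaches the inference API, which is the side bonus described below.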
I think this would be both easier to use and more secure than what's around right now. Instead of making a container for a project and then dealing with installing the agent into the container, I want to run the agent front-end and then say "Please make a container based on such-and-such image and build me this app inside." Or "Please make three containers as follows".
As a side bonus, this would make designing a container sandbox sooooo much easier, since the agent front-end would not itself need to be compatible with the sandbox. So I could run a container with -net none and still access the inference API.
Contrast with today, where I wanted to make a silly Node app. Step 1: Ask ChatGPT (the web app) to make me a Dockerfile that sets up the right tools including codex-rs and then curse at it because GPT-5.2 is really remarkably bad at this. This sucks, and the agent tool should be able to do this for me, but that would currently require a completely unacceptable degree of YOLO.
(I want an IDE that works like this too. vscode's security model is comically poor. Hmm, an IDE is kind of like an agent front-end except the tools are stronger and there's no AI involved. These things could share code.)
Great writeup on minimal agent architecture. The philosophy of "if I don't need it, it won't be built" resonates strongly.
I've been running OpenClaw (which sits on top of similar primitives) to manage multiple simultaneous workflows - one agent handles customer support tickets, another monitors our deployment pipeline, a third does code reviews. The key insight I hit was exactly what you describe: context engineering is everything.
What makes OpenClaw particularly interesting is the workspace-first model. Each agent has AGENTS.md, TOOLS.md, and a memory/ directory that persists across sessions. You can literally watch agents learn from their mistakes by reading their daily logs. It's less magic, more observable system.
The YOLO-by-default approach is spot on. Security theater in coding agents is pointless - if it can write and execute code, game over. Better to be honest about the threat model.
One pattern I documented at howtoopenclawfordummies.com: running multiple specialized agents beats one generalist. Your sub-agent discussion nails why - full observability + explicit context boundaries. I have agents that spawn other agents via tmux, exactly as you suggest.
The benchmark results are compelling. Would love to see pi and OpenClaw compared head-to-head on Terminal-Bench.