and the worse thing for me is that everything shows up as aggregate usage. Total tokens, total cost, maybe per model.
So I ended up hacking together a thin layer in front of OpenAI where every request is forced to carry some context (agent, task, user, team), and then just logging and calculating cost per call and putting some basic limits on top so you can actually block something if it starts going off the rails. It’s very barebones, but even just seeing “this agent + this task = this cost” was a big relief.
It uses your own OpenAI key, so it’s not doing anything magical on the execution side, just observing and enforcing.
I want to know you guys are dealing with this right now. Are you just watching aggregate usage and trusting it, or have you built something to break it down per agent / task?
If useful, here is the rough version I’m using : https://authority.bhaviavelayudhan.com/
honestly though, im getting to a point where im running custom project mds that flip between different models for different things, using list outputs depending on what it finds and runs. (I have two monorepo projects, and one thats a polyglot microengine that jumps using gRPC communication.)
The mds are highly specialized for each project as each project deals with vastly different issues. Cycling through the different pro accounts and keeping the mds in place over it all is helping me not kill my wallet.
I’m seeing a different failure mode though that even with good routing, agents are looping or retrying and burning my money.
But also ensuring you start new fresh context threads, instead of banging through a single one untill your whole feature is done .. working in small atomic incrementals works pretty good
But my issue wasn’t just inefficiency, it was agents retrying when they shouldn’t.
I needed visibility + limits per agent/task, and the ability to cut it off, not just optimize it.