Back in January, we briefly broke all our early access AI features: our AI incident chatbot, and our automated investigations agent.
How? We'd hit the billing limits.
Back then only we and a few early customers had access, so the impact was limited, but as we started to roll out the beta more widely, we wanted much more control over this spending. That's especially the case when systems like Investigations are around 100x as expensive as our existing AI features, with greater potential for runaway costs.
After investing in our tools, our account setup, and our processes, we're in a much better place. It's not just our production spending we now have control over, but also our development, testing, and training costs - and we've socialised 'cost' in a healthy way across the team.
This post shares our approach, with lessons that can be easily applied to other company contexts.
Regardless of how AI systems are implemented, there's one shared primitive at the bottom of the stack - prompts. If we could start tracking per-prompt cost, no matter how we implement things, it would be easy to attribute spend to features or exact code paths.
We wanted to get to the stage where we could, for each request, have some data like this:
{
  "name": "PromptDraftIncidentUpdate",
  "project": "copilot",
  "organisation_id": "01JQ4NHS8M01T51XNJPWMEQS6G",
  "usage_details": {
    "input": 41024,
    "output": 395,
    "cached": 0,
    "total": 41419
  },
  "input": "...",  // full body of data sent to the model
  "result": "..."  // full response from the model
}
With data like the above, we'd be able to calculate the full cost of the request by applying some maths to the token usage, connect that cost to a feature-specific project, or group cost by customer or by specific prompt.
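To make the 'some maths' concrete, here's a minimal Go sketch of turning those usage details into dollars. It assumes cached tokens are a subset of input tokens, and the per-million-token prices are placeholders rather than real rates:

// UsageDetails mirrors the usage_details payload above.
type UsageDetails struct {
	Input  int // total input tokens, including cached ones
	Output int
	Cached int
}

// CostUSD estimates the spend for a single prompt execution. The prices are
// illustrative placeholders: real per-token rates vary by model and over time.
func CostUSD(usage UsageDetails) float64 {
	const (
		inputPerMillion  = 2.50  // $ per 1M uncached input tokens (placeholder)
		cachedPerMillion = 1.25  // $ per 1M cached input tokens (placeholder)
		outputPerMillion = 10.00 // $ per 1M output tokens (placeholder)
	)

	return float64(usage.Input-usage.Cached)*inputPerMillion/1e6 +
		float64(usage.Cached)*cachedPerMillion/1e6 +
		float64(usage.Output)*outputPerMillion/1e6
}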
This required a small bit of wiring in our codebase, but should be doable for anyone working with LLM prompts.
We call LLMs in quite a few different ways at incident.io - from familiar chatbot-style interfaces to the complex agents powering our automated investigations.
Not everything happens in response to a customer interaction, either - we have one prompt that generates a technical analysis of pull requests in GitHub, which can be triggered hundreds of times a second. Another handles the daily processing of dashboards in Grafana, tracking changes and ensuring the agent's understanding of them stays correct.
We're a well-documented Go shop, and much of the publicised and favoured LLM/AI tooling is for Python or TypeScript. This means we've built up our own set of abstractions for calling LLMs, which we can tweak to fit exactly what we need.
This is roughly what it looks like to call a prompt at incident:
result, err := ai.RunPrompt(ctx, &PromptDraftIncidentUpdate{
Incident: incident,
User: identity.User,
})
What's worth noting here is the strongly-typed prompt definition. Thanks to the Go runtime's bundled reflection tooling, we can easily extract the type name of a prompt generically:
func RunPrompt[Result any](ctx context.Context, prompt Prompt[Result]) (Result, error) {
	completionRequest := CreateCompletionRequest(ctx, prompt)
	result, err := client.CreateChatCompletion(ctx, completionRequest)
	if err != nil {
		var empty Result
		return empty, err
	}

	promptName := reflect.TypeOf(prompt).Name() // e.g. PromptChat
	inputTokens := result.Usage.PromptTokens
	completionTokens := result.Usage.CompletionTokens

	// ... record usage against promptName, then parse and return the result
}
With token usage attributed to a given prompt name - handled inside the single API we use to run prompts across our codebase, without any change to the function signature - we're now able to track token usage per execution of a prompt and pipe it into our observability tooling.
We track this at incident.io in two ways:
It's a lesson in good abstractions, but also in pragmatic middle grounds. There's an air of 'reflection = bad' in general programming, but we're calling an LLM hosted on another continent - the microseconds spent accessing the name of the prompt type don't make a dent in the 6s+ spent waiting for generative AI.
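To give a flavour of what piping this into observability tooling can look like, here's a minimal sketch that records per-prompt token usage as counters. It assumes the Prometheus Go client (the prometheus and promauto packages); the metric name and labels are illustrative rather than our actual setup:

// Counts tokens consumed by LLM prompts, labelled by prompt name and token
// type, so spend can be attributed per prompt in dashboards and alerts.
var promptTokensUsed = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ai_prompt_tokens_used_total",
		Help: "Tokens consumed by LLM prompts, by prompt name and token type.",
	},
	[]string{"prompt", "token_type"},
)

func recordUsage(promptName string, inputTokens, completionTokens int) {
	promptTokensUsed.WithLabelValues(promptName, "input").Add(float64(inputTokens))
	promptTokensUsed.WithLabelValues(promptName, "output").Add(float64(completionTokens))
}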
Being able to see where we were spending money didn't stop it from actually being spent.
We decided to create individual projects in OpenAI, each with its own billing limit, so we could limit the damage from any runaway code. It also greatly improves our observability into aggregated buckets of functionality.
The important thing here is that prompts can be used in different contexts, so we don't assign an OpenAI project on a per-prompt basis, but by use case. We reuse the configuration files we already have for error attribution to match code defined in certain files to the appropriate AI 'features'.
An example is matching any code in files matching the pattern *incident*call* (e.g. process_incident_call_transcript_summary.go) to Scribe, our incident call transcriber:
{
"ai": [
{
"project": "copilot",
"files": ["subscriber_incident_activity*", "tool*"]
},
{
"project": "scribe",
"files": ["*incident*call*"]
},
{
"project": "processors",
"files": ["*process*"]
}
]
}
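For a sense of how that config gets applied, here's a hedged sketch of resolving a project from the calling file using path/filepath glob matching; the ProjectRule type and the lookup itself are illustrative, not our actual implementation:

// ProjectRule mirrors one entry in the "ai" config above.
type ProjectRule struct {
	Project string   `json:"project"`
	Files   []string `json:"files"`
}

// projectForFile returns the OpenAI project a prompt should be billed to,
// based on the source file the calling code lives in.
func projectForFile(rules []ProjectRule, callerFile string) string {
	base := filepath.Base(callerFile) // e.g. "process_incident_call_transcript_summary.go"
	for _, rule := range rules {
		for _, pattern := range rule.Files {
			if ok, _ := filepath.Match(pattern, base); ok {
				return rule.Project // e.g. "scribe"
			}
		}
	}
	return "default" // fall back to a catch-all project
}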
Thanks to a community Terraform provider, we could automate creating the OpenAI projects and their corresponding API keys in Google Secret Manager alongside the rest of our infrastructure. What could've been a painful set of hoops to jump through became a single-line addition to a tfvars file.
This has allowed us to contain any out-of-control spend to a single feature surface, while sleeping easy knowing our billing limits will prevent any bank-account-emptying incidents.
There is a natural bias within the industry - and active encouragement from leaders - to push for AI adoption. This is really exciting, especially as someone building AI tools: we've got a green light to go and create awesome products!
However, this investment isn't limitless, and there's always a commercial angle. 'Make it work, make it good, make it cheap' is a common adage in software development, but everyone knows that final step isn't always easy.
With the data we collect on AI usage, we're able to make really compelling arguments of the form 'if we spend this amount of money, we'll save this amount of time'. It's led to investment and buy-in from our leadership for spending money on the right things.
Part of this has been demonstrating that we can accurately extrapolate how much a feature will cost us to run, and make business decisions around that. We backed out of launching one feature because it would be insanely costly to run well, and the cost/benefit wasn't there - despite it being really cool (it's parked for another day, perhaps when models become cheaper!).
Cost management can be a sensitive subject, especially when management doesn't feel like engineering teams take it seriously. At incident.io we encourage teams to take ownership over every part of the product they run, including how much they're spending to run it. While our AI costs are modest for now, showing that our teams care about, and can accurately model, costs is key to earning the trust that enables that autonomy - and is why it's worth doing even early in development.
If you're working on AI products, you want to spend most of your time building, not worrying about costs. Counterintuitively, setting up a regular cadence for receiving a spend report can let you forget about cost in your normal day-to-day, relying instead on the check-in to ensure things are on track.
For us, that means that every working morning we receive a screenshot in a Slack channel (#ai-cost-pulse) breaking down the previous 7 days of spend across different environments. We also have a similar one for our projects, so we can attribute spend to individual features.
This has worked wonders for keeping everyone in the loop: it prompts early discussion and problem-solving when costs suddenly burst, and lets us gauge changes over time, with the ability to jump easily into specific investigations as needed.
Gone are the days of realising we've been burning hundreds of dollars on bugs for the past few weeks, and it's no longer possible to run a backfill you think is cheap only to find out you're wrong months later. Both are great outcomes for the team.
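If you want to set up something similar, the shape of the job is simple. Here's a hedged sketch using only the standard library, a hypothetical SpendByEnvironment helper over the per-prompt cost data, and a standard Slack incoming webhook (ours actually posts a dashboard screenshot; this text version just shows the idea):

// postSpendPulse summarises the last 7 days of spend per environment and posts
// it to Slack via an incoming webhook. SpendByEnvironment is a made-up helper
// standing in for however you aggregate per-prompt cost data.
func postSpendPulse(ctx context.Context, webhookURL string) error {
	now := time.Now()
	spend := SpendByEnvironment(now.AddDate(0, 0, -7), now) // e.g. {"production": 812.40, "staging": 96.10}

	lines := []string{"*AI spend, last 7 days*"}
	for env, dollars := range spend {
		lines = append(lines, fmt.Sprintf("• %s: $%.2f", env, dollars))
	}

	payload, err := json.Marshal(map[string]string{"text": strings.Join(lines, "\n")})
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, webhookURL, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}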
The first area where we started surfacing cost properly was our Chat product. In developer environments, and for our own organisation in production, we tag every message from our bot in Slack with the cost and latency of the interaction:
This detail, aside from creating some 'it cost how much?!?!' moments, creates a really tight feedback loop for what you're working on at any given time. You quickly learn what 'normal' looks like, and you quickly know when something has gone wrong.
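The mechanics of the tag itself are simple. Here's a rough sketch of the kind of footer that gets appended, where isDevelopment is a hypothetical helper standing in for however you detect your environment:

// debugFooter renders the cost/latency line appended to bot messages in
// development, and for our own organisation in production. Customers never
// see it.
func debugFooter(costUSD float64, latency time.Duration, isInternalOrg bool) string {
	if !isDevelopment() && !isInternalOrg {
		return ""
	}
	return fmt.Sprintf("_cost: $%.4f · latency: %.1fs_", costUSD, latency.Seconds())
}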
An iteration on this is predicting cost. We have a pattern of 'processors', which you can view as a mechanism for batch-processing data, chaining prompts and regular deterministic logic to generate a wide variety of metadata that we feed into context later on.
Given we know the historical cost of each processor run, we can extrapolate to give a grounded estimate of what a backfill will cost in LLM spend - giving us a heads-up before we spend $50,000.
Or a more modest $112.45, as you can see in this example:
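Behind that number, the estimate is simple arithmetic over historical runs. A sketch of the idea, where HistoricalAverageCostUSD is a made-up helper over the per-prompt cost data described earlier:

// EstimateBackfillCostUSD extrapolates the LLM spend of a backfill from the
// average historical cost of a single processor run.
func EstimateBackfillCostUSD(processor string, itemCount int) float64 {
	avgCostPerRun := HistoricalAverageCostUSD(processor) // e.g. $0.0022 per item
	return avgCostPerRun * float64(itemCount)
}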
Finally, before we even hit production, our eval suite - the 'tests' of the LLM prompting world - reports the exact cost of its run. Knowing exactly how much we're spending when we run our CI lets us run it every time we make a change, directly improving the reliability of our product and the speed at which we iterate.
To me, team agency is one of the most important pillars of an engineering organisation, and enabling teams to operate on their own is one of the best ways to increase their velocity, happiness, and impact.
By surfacing all this detail, in as automated a way as possible, teams are empowered to take bets on features they otherwise couldn't. The direct cost of a feature is no longer a question answered at the end of the development cycle, but one they can gauge as soon as the code starts running.
It makes building production AI-powered features easy: any team can ship with confidence, without worrying they've broken the bank.
Now that we have the confidence, the visibility, and the damage limitation in place, we've taken the opportunity to open the taps.
There's so much to be said for having the confidence, and the evidence, behind every cost-related decision for AI. Knowing the exact increase in cost of bumping a specific prompt to a more intelligent model makes the conversations around spending thousands more a lot more streamlined.
A healthy, measured culture around cost is extremely positive in any engineering organisation. Five years ago, that meant not buying the 128-core Postgres machine, while joking about how cool it would be. Now it's a snarky Slack message about how we could spend $128,000 calling OpenAI, while knowing it's only $7.53 to run this backfill.