Back in January, we briefly broke all our early access AI features: our AI incident chatbot, and our automated investigations agent.
How? We’d hit the billing limits.
Back then just us and a few early customers had access, so the impact was limited, but as we started rolling the beta out more widely, we wanted much more control over this spending. That’s especially true for systems like Investigations, which are roughly 100x more expensive than our existing AI features and carry far greater potential for runaway costs.
After investing in our tools, our account setup, and our processes, we’re in a much better place. It’s not just our production spending we now have control over, but also our development, testing, and training costs, and we’ve socialised ‘cost’ in a healthy way across the team.
This post shares our approach, with lessons that can be easily applied to other company contexts.
Regardless of how AI systems are implemented, there’s one shared primitive at the bottom of the stack: prompts. If we could track cost per prompt, then no matter how we implement things, it would be easy to attribute spend to features or to exact code paths.
We wanted to get to the stage where we could, for each request, have some data like this:
{
  "name": "PromptDraftIncidentUpdate",
  "project": "copilot",
  "organisation_id": "01JQ4NHS8M01T51XNJPWMEQS6G",
  "usage_details": {
    "input": 41024,
    "output": 395,
    "cached": 0,
    "total": 41419
  },
  "input": "...",  // full body of data sent to the model
  "result": "..."  // full response from the model
}
With data like the above, we’d be able to calculate the full cost of each request by applying some maths to the token usage, then connect that cost to a feature-specific project, or group it by customer or by specific prompt.
This required a small bit of wiring in our codebase, but should be doable for anyone working with LLM prompts.
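The ‘maths’ itself is just a weighted sum over the token counts. Here’s a minimal sketch with hypothetical per-million-token prices (not any real model’s rates) of what that calculation looks like:

package main

import "fmt"

// UsageDetails mirrors the usage_details block in the payload above.
type UsageDetails struct {
	Input  int
	Output int
	Cached int
	Total  int
}

// Hypothetical prices in USD per million tokens; real rates vary by model.
const (
	inputPerMillion  = 2.50
	cachedPerMillion = 1.25
	outputPerMillion = 10.00
)

// costUSD prices a single request. Cached input tokens are typically billed
// at a discount, so they're separated from freshly-processed input.
func costUSD(u UsageDetails) float64 {
	fresh := float64(u.Input-u.Cached) * inputPerMillion
	cached := float64(u.Cached) * cachedPerMillion
	output := float64(u.Output) * outputPerMillion
	return (fresh + cached + output) / 1_000_000
}

func main() {
	u := UsageDetails{Input: 41024, Output: 395, Cached: 0, Total: 41419}
	fmt.Printf("$%.4f\n", costUSD(u)) // $0.1065 at these made-up rates
}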
We have quite a few different ways we call LLMs at incident.io - from familiar chatbot-style interfaces to the complex agents powering our automated investigations.
Not everything happens in response to a customer interaction, either - we have one prompt that generates a technical analysis of pull requests in GitHub, which can be triggered hundreds of times a second. Another runs daily over our dashboards in Grafana to track changes and ensure the agent’s understanding of them stays correct.
We’re a well-documented Go shop, and much of the publicised and favoured LLM/AI tooling is for Python or TypeScript. This means we’ve built our own set of abstractions for calling LLMs, which we can tweak to fit exactly what we need.
This is roughly what it looks like to call a prompt at incident.io:
result, err := ai.RunPrompt(ctx, &PromptDraftIncidentUpdate{
	Incident: incident,
	User:     identity.User,
})
What’s worth noting here is the strongly-typed prompt definition. Thanks to the reflection tooling bundled with the Go runtime, we can easily extract the type name of a prompt generically:
func RunPrompt[Result any](ctx context.Context, prompt Prompt[Result]) (Result, error) {
	completionRequest := CreateCompletionRequest(ctx, prompt)
	result, err := client.CreateChatCompletion(ctx, completionRequest)
	if err != nil {
		return *new(Result), err
	}

	// Prompts are passed as pointers, so dereference the type to get its name.
	promptName := reflect.TypeOf(prompt).Elem().Name() // e.g. PromptChat
	inputTokens := result.Usage.PromptTokens
	completionTokens := result.Usage.CompletionTokens

	// ... record promptName and token usage, then parse and return the result ...
}
Because this attribution happens inside the single API we use to run prompts across our codebase, without any change to a function signature, we can now track token usage per execution of a prompt and pipe it into our observability tooling.
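For a sense of what that piping could look like, here’s a minimal sketch assuming a Prometheus-style counter - the metric name and helper here are illustrative, not our exact wiring:

package aicost

import "github.com/prometheus/client_golang/prometheus"

// promptTokensTotal counts tokens per prompt, so dashboards can break spend
// down by prompt name and by direction (input vs output).
var promptTokensTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ai_prompt_tokens_total",
		Help: "Tokens used per prompt execution, labelled by direction.",
	},
	[]string{"prompt", "direction"},
)

func init() {
	prometheus.MustRegister(promptTokensTotal)
}

// recordUsage is what RunPrompt would call with the values extracted above.
func recordUsage(promptName string, inputTokens, outputTokens int) {
	promptTokensTotal.WithLabelValues(promptName, "input").Add(float64(inputTokens))
	promptTokensTotal.WithLabelValues(promptName, "output").Add(float64(outputTokens))
}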
We track this at incident.io in two ways:
It’s a lesson in both good abstractions and pragmatic middle grounds. There’s an air of “reflection = bad” in general programming, but we’re calling an LLM hosted on another continent - the microseconds spent accessing the name of the prompt type don’t make a dent in the 6s+ spent waiting for generative AI.
Being able to see where we were spending money didn’t stop it from actually being spent.
We decided to create individual projects in OpenAI, each with their own billing limit, so we could limit the damage of any runaway code. It also greatly improved our observability into aggregated buckets of functionality.
The important thing here is that prompts can be used in different contexts, so we don’t assign an OpenAI project per prompt, but per use case. We reuse the same configuration files we already have for error attribution to match code defined in certain files to the appropriate AI ‘features’.
An example is matching any code in files matching the pattern *incident*call* (e.g. process_incident_call_transcript_summary.go) to Scribe, our incident call transcriber:
{
  "ai": [
    {
      "project": "copilot",
      "files": ["subscriber_incident_activity*", "tool*"]
    },
    {
      "project": "scribe",
      "files": ["*incident*call*"]
    },
    {
      "project": "processors",
      "files": ["*process*"]
    }
  ]
}
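To make the matching concrete, here’s a minimal sketch of how a filename could be resolved to a project with those glob patterns - assuming first-match-wins and a ‘default’ fallback, neither of which is necessarily how our real attribution code behaves:

package main

import (
	"fmt"
	"path/filepath"
)

// projectPatterns mirrors the config above: each project owns a set of
// filename globs. First match wins (an assumption made for this sketch).
var projectPatterns = []struct {
	Project  string
	Patterns []string
}{
	{"copilot", []string{"subscriber_incident_activity*", "tool*"}},
	{"scribe", []string{"*incident*call*"}},
	{"processors", []string{"*process*"}},
}

// projectForFile resolves a source filename to the AI project that owns it,
// falling back to "default" when nothing matches.
func projectForFile(filename string) string {
	for _, entry := range projectPatterns {
		for _, pattern := range entry.Patterns {
			if ok, _ := filepath.Match(pattern, filename); ok {
				return entry.Project
			}
		}
	}
	return "default"
}

func main() {
	fmt.Println(projectForFile("process_incident_call_transcript_summary.go")) // scribe
}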
Thanks to a community Terraform provider, we could automate creating the OpenAI projects and their corresponding API keys in Google Secret Manager alongside the rest of our infrastructure. What could’ve been a painful set of hoops to jump through became a single-line addition in a tfvars file.
This has allowed us to contain any out-of-control spend to a single feature surface, while sleeping easy knowing our billing limits will prevent any bank-account-emptying incidents.
There is a natural bias within the industry, and active encouragement from leaders, to push for AI adoption. This is really exciting, especially as someone building AI tools: we’ve got a green light to go and create awesome products!
However, this investment isn’t limitless, and there’s always a commercial angle. “Make it work, make it good, make it cheap” is a common adage within software development, but everyone knows that final step isn’t always easy.
With the data we collect on AI usage, we’re able to make really compelling arguments like “if we spend this amount of money, we’ll save this amount of time”. It’s led to investment and buy-in from our leadership for spending money on the right things.
Part of this has been demonstrating that we can accurately extrapolate how much a feature will cost us to run, and make business decisions around that. We backed out of launching one feature because it would be insanely costly to run well, and the cost/benefit wasn’t there - despite it being really cool (it’s parked for another day, perhaps when models become cheaper!).
Cost management can be a sensitive subject, especially when management doesn't feel like engineering teams take it seriously. At incident.io we encourage teams to take ownership over all parts of the product they run, including how much they're spending to run it. While our AI costs are modest now, showing our team care about and can accurately model costs is key to earning the trust that enables that autonomy, and is why it's worth doing even early in development.
If you’re working on AI products, you want to spend most of your time building, not worrying about costs. Counterintuitively, if you set up a regular cadence for receiving a spend report, you can forget about cost in your normal day-to-day and instead rely on the check-in to ensure things are on track.
For us, that means that every working morning we receive a screenshot in a Slack channel (#ai-cost-pulse) breaking down the previous 7 days of spend across different environments. We also have a similar one for our projects, so we can attribute spend to individual features.
This has worked wonders for keeping everyone in the loop: it prompts early discussion and problem-solving when costs suddenly spike, lets us gauge changes over time, and gives us an easy jumping-off point into specific investigations as needed.
Gone are the days of us realising that for the past few weeks we’ve been burning hundreds of dollars on bugs, and it’s no longer possible to run a backfill you think is cheap and only find out you’re wrong months later. Both great outcomes for the team.
The first area we started surfacing cost properly was within our Chat product. For developer environments, and for our organisation in production, we’ll tag every message from our bot in Slack with the cost and latency of an interaction:
This detail, aside from prompting some “it cost how much?!” moments, creates a really tight feedback loop for whatever you’re working on at any given time. You quickly learn what “normal” looks like, and you quickly know when something has gone wrong.
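The footer itself is just a formatted string built from the same per-prompt data. A tiny, hypothetical sketch - the real Slack message layout differs:

package main

import (
	"fmt"
	"time"
)

// costFooter renders the cost/latency line we attach to bot messages.
// Hypothetical formatting; the real Slack message layout differs.
func costFooter(costUSD float64, latency time.Duration) string {
	return fmt.Sprintf("$%.4f · %.1fs", costUSD, latency.Seconds())
}

func main() {
	fmt.Println(costFooter(0.1065, 5800*time.Millisecond)) // $0.1065 · 5.8s
}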
An iteration on this is predicting cost. We have a pattern of “processors”, which you can view as a mechanism for batch-processing data, chaining prompts and regular deterministic logic to generate a wide variety of metadata that we feed into context later on.
Given we know the historical cost of each processor run, we can extrapolate to give a grounded estimate of what a backfill will cost in LLM spend - giving us a heads-up before we spend $50,000.
Or a more modest $112.45, as you can see in this example:
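Under the hood, the prediction is simple arithmetic: average the historical per-run cost and multiply by the number of planned runs. A sketch with made-up numbers:

package main

import "fmt"

// Hypothetical historical per-run costs (USD) for one processor, pulled from
// the kind of per-prompt usage data described earlier.
var historicalRunCostsUSD = []float64{0.0041, 0.0047, 0.0044, 0.0052, 0.0039}

// estimateBackfillCostUSD extrapolates a backfill's LLM spend from the
// average historical cost of a single processor run.
func estimateBackfillCostUSD(history []float64, plannedRuns int) float64 {
	var total float64
	for _, c := range history {
		total += c
	}
	avg := total / float64(len(history))
	return avg * float64(plannedRuns)
}

func main() {
	fmt.Printf("$%.2f\n", estimateBackfillCostUSD(historicalRunCostsUSD, 10_000)) // $44.60
}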
Finally, before we even hit production, our eval suite - the “tests” of the LLM prompting world - reports the exact cost of each run. Knowing exactly how much we’re spending when we run our CI lets us run it on every change, directly impacting the reliability of our product and the speed at which we iterate.
Team agency is one of the most important pillars of an engineering organisation to me, and enabling teams to operate on their own is one of the best ways to increase their velocity, happiness and impact.
By surfacing all this detail, in as close to an automated way as possible, teams are empowered to take bets on features they otherwise couldn’t. The direct cost of a feature is no longer a question they answer at the end of the development cycle, but one they can gauge as soon as the code starts running.
It makes building production AI-powered features easy, and any team is able to ship with confidence, without worrying they’ve broken the bank.
Now that we have the confidence, the visibility, and the damage limitation in place, we’ve taken this as an opportunity to open the taps.
There’s so much to be said for having the confidence, and the evidence, behind every cost-related decision for AI. Knowing the exact cost increase of bumping a specific prompt to a more intelligent model makes the conversations around spending thousands more a lot more streamlined.
A healthy, measured culture around cost is extremely positive in any engineering organisation. Five years ago that meant not buying the 128-core Postgres machine, while joking about how cool it would be. Now it’s a snarky Slack message about how we could spend $128,000 calling OpenAI, while knowing it’s only $7.53 to run this backfill.