You’ve hit a wall. Despite your best efforts—distilling complex problems into step-by-step instructions, using one-shot prompting with examples of the desired output, and even bumping up to a larger model—the results just aren’t there.
If you’re new to the basics of prompt engineering, Cohere do a great job describing them in their Prompt Engineering guide.
I’ve been in these shoes, and over time, I’ve come across a few techniques that helped me break through the barrier to better results!
If your prompt asks for multiple outputs—like a classification and an explanation—you’re making the model’s job much harder.
// The result of a prompt which checks how likely a code change was
// to cause an incident.
{
"confidence": "high",
"reasoning": "This code change contributed to the deadlock issues because it introduced a new locking mechanism on the table."
}
The issue is that classification and natural language generation (NLG) rely on different kinds of reasoning.
When combined, these tasks can interfere with each other:
If the classification comes out as `confidence = high`, the model might stretch its explanation to justify it.

From experience: one task is great, two is alright, three or more is a mess. Smaller models especially struggle with this. If your results are inconsistent or the reasoning feels off, this could be why.
To avoid this, split your prompt in two: run the classification first, then use a second LLM call to explain that result. This can actually work out in your favour.
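As a rough sketch, the split can be as simple as two sequential calls, with the classification fed into the second prompt. The model name, prompts and function below are illustrative rather than our real implementation:

```go
// A sketch: classify first, then ask a second call to explain the
// classification it was handed. Each call has exactly one job.
func classifyThenExplain(ctx context.Context, client *openai.Client, incident, codeChange string) (string, string, error) {
	classification, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
		Model: "gpt-4o", // illustrative model name
		Messages: []openai.ChatCompletionMessage{
			{Role: "system", Content: "Classify how likely this code change was to have caused the incident. Reply with exactly one word: high, medium or low."},
			{Role: "user", Content: "Incident: " + incident + "\n\nCode change: " + codeChange},
		},
	})
	if err != nil {
		return "", "", err
	}
	confidence := classification.Choices[0].Message.Content

	explanation, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
		Model: "gpt-4o",
		Messages: []openai.ChatCompletionMessage{
			{Role: "system", Content: "This code change was classified as " + confidence + " confidence for causing the incident. Explain the reasoning behind that classification."},
			{Role: "user", Content: "Incident: " + incident + "\n\nCode change: " + codeChange},
		},
	})
	if err != nil {
		return "", "", err
	}
	return confidence, explanation.Choices[0].Message.Content, nil
}
```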
This one sounds obvious, but when multiple engineers are iterating on a prompt over time, it’s easy to end up with something that no longer makes sense.
Conflicting instructions make LLMs inconsistent. Sometimes, it’s really subtle. Here’s a real-world example from a prompt I wrote to find Slack discussions relevant to an incident. For example, if an alert comes in about database CPU being high, and someone recently posted a message in #engineering about how to handle situations like this, we'd want this prompt to highlight it.
messages := []openai.ChatCompletionMessage{
	{
		Role: "system",
		Content: `
# Task
You will be given a message that might be related to an incident.
Evaluate whether the message contains information that would be helpful for
responders to know about when investigating the incident.
...
`,
	},
	{
		Role:    "user",
		Content: `This is the incident being investigated: ` + incident,
	},
	{
		Role:    "user",
		Content: `This is the message that is related: ` + message,
	},
}
I was banging my head against the wall trying to figure out why the LLM was over-eager to identify messages as related, even though to me, it was obvious that the message wasn’t relevant.
I’d been so focussed on making the instructions in the system prompt clear that I hadn’t realised that the user prompt was influencing the LLM into thinking that the message must definitely be related.
Before: `This is the message that is related:`

After: `This is the message that might be related:`
Unluckily for me, LLMs give more attention to recent tokens, so the later instruction was “winning” most of the time!
LLMs can be pretty good at sifting through unstructured data, and that might work fine for you to begin with. But with more complex examples, or when consistency and accuracy is a concern, you’ll have to get much more professional.
Use this rule of thumb: if the hierarchy of your document isn’t clear enough to read as a human, an LLM will have a similarly hard time parsing it.
This is because LLMs create positional encodings to help them understand hierarchical patterns like headers and document structure. If your structure isn’t clear, you'll impact the LLM’s ability to connect tokens in one section of the doc with another, as the attention mechanism won’t have the positional encoding needed to prioritise the connections.
Here’s an example of a badly structured prompt:
You are helping to investigate an incident.
Here is some context about the incident:
{incident-details}
Here is a list of recent code changes:
{code-changes}
The goal is to determine which code change is most likely responsible.
Where this goes wrong: the instructions and the data run together with nothing marking where the incident details end and the code changes begin, and the goal only turns up after all of the data.
A better approach:
You are helping to investigate an incident.
The goal is to determine which code change is most likely responsible.
## Incident Details
{incident-details}
## Code Changes
For each code change, analyze its impact in relation to the incident.
{code-changes}
Reading a prompt top-to-bottom after every change isn’t really feasible for a team that frequently changes complex prompts. At the very least, the amount of concentration required to read intentionally and catch how an LLM might misinterpret your instructions is more than you might expect.
That's why you should consider handing the reviewing over to an LLM. We use LLMs to check the “health” of our prompts regularly, and you’d be surprised at the number of blunders we create when we’re outside the realm of a linter!
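A health check can be as small as a single extra call that asks a model to critique the prompt text itself. Here’s a minimal sketch, assuming the go-openai client used elsewhere in this post; the reviewer instructions are illustrative rather than our actual checks:

```go
// A sketch of a prompt "health check": ask an LLM to review a prompt for
// contradictions, ambiguity and hidden assumptions.
func reviewPrompt(ctx context.Context, client *openai.Client, prompt string) (string, error) {
	resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
		Model: "gpt-4o", // illustrative model name
		Messages: []openai.ChatCompletionMessage{
			{
				Role: "system",
				Content: `You are reviewing a prompt that will be given to an LLM.
List any contradictory instructions, ambiguous wording, or hidden assumptions
that rely on knowledge not included in the prompt itself.`,
			},
			{Role: "user", Content: prompt},
		},
	})
	if err != nil {
		return "", err
	}
	return resp.Choices[0].Message.Content, nil
}
```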
We rely on LLMs having a substantial amount of pretraining that helps them solve problems, even without the context that we’ll provide in the prompt. But you can’t assume they know what you know, because your expertise might not be in their training data.
Therefore you should always look at a prompt and ask “Could I solve this with only the info provided?”. A lot of the time you’ll realise that what you might think is obvious based on your own experience, isn’t in fact obvious at all!
For example: instead of `Only consider merged code changes`, say `Only consider code changes where status = closed`.
My pro tip: Drop your prompt into an LLM and ask it to flag any hidden assumptions. You’ll catch gaps you didn’t even realise were there.
If your prompt is quite complex, an LLM will have to think about a lot of things at once. If some aspects are more important to you than others, you should be saying so!
## Most important rules
* When summarising an incident, only include information mentioned by responders
* Never speculate about the cause of an incident
This is my oldest trick in the book and it works surprisingly well, but it should be used with caution. Hard-coding the model to focus on particular aspects can cause it to overfit; you can see a real-life example of this when we used vibe coding to engineer a prompt.
You might only support one model provider at the moment, but if you’re in a “right prompt, wrong model” scenario then you really ought to know about it!
It’s very low-effort to paste a prompt into the UI for a different model, and if you see better results then it’s good evidence for putting the work in to support it programmatically. Also, in my experience, the effort to support another provider can be less than trying to improve a really stubborn prompt.
Model providers have different strategies for building models, but even models within the same provider can behave very differently.
That's because:
Try not to be swayed by benchmarks and instead lean into evals that test your prompts with real data to empirically prove a model is better or worse for your use case.
After I’d spent hours debugging stubborn prompt evals, someone suggested I try Sonnet instead of 4o. It made that eval suite go from 50% failures to 100% passing, but in other situations the switch made no difference or things got worse!
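An eval harness for this doesn’t need to be sophisticated. Here’s a rough sketch; the test cases, pass criterion and model names are placeholders for whatever matters in your use case:

```go
// A sketch of a minimal eval: run the same system prompt and cases against a
// model and report the pass rate, so model comparisons are based on your data.
type evalCase struct {
	input    string
	expected string // e.g. the classification we expect to see in the answer
}

func passRate(ctx context.Context, client *openai.Client, model, systemPrompt string, cases []evalCase) float64 {
	passed := 0
	for _, c := range cases {
		resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
			Model: model,
			Messages: []openai.ChatCompletionMessage{
				{Role: "system", Content: systemPrompt},
				{Role: "user", Content: c.input},
			},
		})
		if err != nil {
			continue // a failed call counts as a failed case
		}
		if strings.Contains(strings.ToLower(resp.Choices[0].Message.Content), c.expected) {
			passed++
		}
	}
	return float64(passed) / float64(len(cases))
}
```

Running the same cases against two models, or against the same model before and after a prompt change, gives you a pass rate to compare rather than a leaderboard position to trust.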
When working with structured data, always ask yourself: How can I make this easier for the LLM?
For example, I learned this the hard way—here’s an early version of a prompt that didn’t do any preprocessing:
You are helping to investigate an incident.
The goal is to determine which code change is most likely responsible.
## Incident Details
name: Deadlock on alerts table
created_at: 2025-03-24T15:00:00Z
## Code Changes
For each code change, analyze its impact in relation to the incident.
- name: Add a lock to alerts table
  merged_at: 2025-03-24T14:00:00Z
- name: Start archiving old alerts
  merged_at: 2025-03-24T11:00:00Z
I noticed that the LLM wasn’t prioritising code changes which were merged just before the incident started. And when I asked the LLM for its reasoning to help debug, I noticed that it was calculating the duration incorrectly! Switching `merged_at` to `time_since_merged: 1 hour` instantly improved performance.
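The preprocessing itself is usually trivial: do the date arithmetic in code before the data ever reaches the prompt. A rough sketch, with a hypothetical CodeChange struct:

```go
// A sketch of preprocessing: do the duration maths in code so the prompt sees
// a relative time instead of raw timestamps it has to subtract itself.
type CodeChange struct {
	Name     string
	MergedAt time.Time
}

func describeCodeChange(change CodeChange, incidentStart time.Time) string {
	hoursSinceMerge := incidentStart.Sub(change.MergedAt).Hours()
	return fmt.Sprintf("- name: %s\n  time_since_merged: %.1f hours", change.Name, hoursSinceMerge)
}
```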
When a prompt isn’t working:
The key takeaway? Be methodical and be patient!