
We used GPT-4 during a hackathon—here's what we learned

We recently ran our first hackathon in quite some time.

Over two days, our team collaborated in groups on various topics. By the end of it, we had 12 demos to share with the rest of the team. These ranged from improvements in debugging HTTP request responses to the delightful “automatic swag sharer.”

Within our groups, a number of us tried integrating with OpenAI’s GPT to see what smarts we could bring to our product. With AI, and ChatGPT in particular, being such a hot topic, we had several things we wanted to try, including:

  • Auto-generating incident summaries using the messages sent in Slack channels.
  • Letting users describe the filters they want for our Insights, Incidents, or Follow-ups features in natural language.
  • Talking to our Slackbot in natural language, without having to know exact commands.
  • Auto-suggesting custom fields that seem to apply to an incident.
  • Automating incident handovers with contextual summaries for the user joining the incident.

Trialing GPT-4

Before the hackathon, one of our engineers, Lawrence, had already tried using GPT in our product: he had built a way to auto-generate incident summaries from each incident’s updates.

This feature-flagged code was already in production and could be triggered via a button in the dashboard.

Lawrence had also built a command line interface (CLI) that let you fetch responses for a prompt for a given incident. This gave us a short feedback loop that allowed us to iterate quickly on prompts.
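For a flavor of that feedback loop, here’s a minimal sketch of what such a CLI could look like (illustrative only, not Lawrence’s actual tool), using the OpenAI Python client as it existed at the time:

# A minimal prompt-testing CLI (illustrative, not Lawrence's actual tool),
# using the 2023-era OpenAI Python client.
import os
import sys

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def main():
    # Read a prompt (e.g. one built from an incident's updates) from stdin.
    prompt = sys.stdin.read()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

if __name__ == "__main__":
    main()

Run it as something like python prompt_cli.py < incident_prompt.txt and the model’s response prints straight to your terminal.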

We came to the hackathon with a few ideas on how we might use AI, some of them detailed above.

Our first port of call was to update Lawrence’s existing code to use GPT-4, so we were using the latest and greatest from OpenAI. Another engineer, Rob, took this on, while my colleague Aaron and I started hacking prompts together as proofs of concept for our various ideas.

Initially, I wanted to build upon Lawrence’s incident summary work to do two things: create more detailed summaries, and make summaries contextual to specific users, so I could say something like, “Generate a summary of the incident so far for Rory. Rory is an engineer who is taking over this incident.”

That way, we hoped to receive a summary homing in on the technical details and providing the right information for me to join an incident and get started.

To get this extra detail, I wanted to start ingesting full Slack channels rather than just incident updates. To start at a basic level, I simply selected the entire contents of a test Slack channel and copy/pasted it into my prompt. I tidied up the formatting with some fun multi-cursor editing and was ready to go.

The prompt was as follows:

Create a welcome message for Rory, who will take over from Milhouse to solve an incident.

For context, Rory is a product engineer, and Milhouse is a product engineer.

You should provide some context on what Milhouse was doing in the incident.

You should summarize what’s happened so far and give Rory any context they need going forward:

The incident name is: ‘Our application is repeatedly crashing’

The Slack conversation is as follows:

incident.io APP 12:46 joined #inc-2023-04-04-a-security-researcher-has-found-two-minor-security-issues-an along with milhouse … … and so on

With that prompt, I got the following response:

Welcome Rory! Milhouse has been working on resolving the ‘Our application is repeatedly crashing’ incident.

At 13:39, an asynchronous event caused an error that took one of our Heroku dynos down, which then caused a ‘poison pill’ effect that took down our entire app.

However, we deployed a fix that quickly caught the error and prevented it from happening again. We also made improvements to our app and infrastructure setup.

Here are the incident updates: at 16:26, we stopped triggers and clocks and started purging messages. At 17:08, Alex deployed a fix that stopped this error from happening again. At 17:22, we separated our web and worker dynos and are now testing it on staging.

Please let Lawrence know if you need any assistance.

At this point, I was pretty happy with the response for about 20 minutes of work. However, it wasn’t all smooth sailing. I had used GPT-3.5 Turbo for this request, and each response took about 30 seconds to generate: not quite the real-time “I’m logging off and need to quickly hand this over” action I was looking for.

Additionally, before getting that response, I had to do some non-trivial trimming of the Slack messages. I’d hoped to just paste in the full conversation, which was a few hundred messages, but OpenAI has a concept of a “token limit” that caps how much text you can send in a single request.

What are tokens?

Tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens. These tokens are not cut up exactly where the words start or end — tokens can include trailing spaces and sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ words
  • 100 tokens ~= 75 words

Different versions of OpenAI’s models support different numbers of tokens. The GPT-3.5 Turbo model I’d started with can handle up to 4,096 tokens; however, a beta GPT-4 model supports up to 32,000.
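If you want to count tokens before you hit the API (rather than finding out from an error message), OpenAI publishes an open-source tokenizer library, tiktoken. A quick sketch:

# Counting tokens locally with OpenAI's tiktoken library (illustrative).
import tiktoken

def num_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    # Look up the tokenizer used by the given model, then encode the text
    # and count the resulting tokens.
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print(num_tokens("The incident name is: 'Our application is repeatedly crashing'"))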

To make things more complicated, this limit doesn’t just apply to the request you make. It’s a limit on the combination of the request’s length and the response’s length.

As a user, you can’t be entirely sure about the response length (without a specific prompt and expected outcome), so you generally have to be quite conservative with your request.

To explain this more concretely, when I pasted in the full Slack conversation, I initially received an error saying I’d sent 11,000 tokens and couldn’t send more than 4,096.

I trimmed the Slack messages down to 4,000 tokens by removing many of the less interesting messages. That, however, results in half a response: the model will only generate up to 96 tokens of output (the 4,096 ceiling minus the 4,000 I used in the request), then cut off everything afterward.

To reliably receive responses, I had to reduce the prompt to around 3,900 tokens, only around 35% of the original Slack message content. This felt like a real limitation and something that others must have already run into.

I researched similar experiences and found that the standard workaround people have for this is to have ChatGPT generate summaries of smaller chunks, then summarize the summaries.

Very meta.

You effectively end up writing a recursive function, something like this:

max_model_tokens = 4096
expected_response_tokens = 200

# Leave room in the token budget for the model's response.
max_prompt_tokens = max_model_tokens - expected_response_tokens

def summarise(text):
    # Too long for one request? Split the text in half, summarise each
    # half recursively, and stitch the results back together.
    # (num_tokens is a token-counting helper, like the tiktoken sketch earlier.)
    if num_tokens(text) > max_prompt_tokens:
        first_half = text[: len(text) // 2]
        second_half = text[len(text) // 2 :]
        return summarise(first_half) + summarise(second_half)

    # Small enough to summarise in a single API call; create_summary
    # wraps the request to the model.
    return create_summary(text)
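Note that as sketched, this just concatenates the chunk summaries. To fully “summarize the summaries,” you’d pass the concatenated result back through create_summary (or through summarise again) as a final step.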


Considering that it was already taking ~20-30 seconds to generate a standard response, it simply wasn’t feasible to call the API several times over for each of these summaries.

Our conclusions

Catching up again with Aaron and Rob, we discussed where we’d gotten to and compared results.

Rob had merged a change to upgrade us to the latest OpenAI client libraries and had shared an account with us that had beta access to GPT-4 models. We’d hoped to use the new 32,000-token limit on one of the GPT-4 models, but we never got a response from the API.

Looking up the error we received, we concluded that although we had GPT-4 access, it unfortunately did not include the larger 32,000-token model.

After this, Rob built a feature where you could react to a Slack message with a ❓ (question mark) emoji, and we’d reply to that message in a thread with an explanation of what the person had said.
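For illustration, here’s a rough sketch of how that flow might be wired up, assuming Slack’s Bolt for Python framework and the OpenAI client of the time (this is not Rob’s actual implementation):

# A sketch of the ❓ reaction flow (illustrative, not Rob's actual code).
import os

import openai
from slack_bolt import App

openai.api_key = os.environ["OPENAI_API_KEY"]
app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.event("reaction_added")
def explain_message(event, client):
    if event["reaction"] != "question":  # ❓ is :question: in Slack
        return

    # Fetch the message that was reacted to.
    item = event["item"]
    history = client.conversations_history(
        channel=item["channel"], latest=item["ts"], inclusive=True, limit=1
    )
    message_text = history["messages"][0]["text"]

    # Ask the model to explain it, then reply in a thread.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Explain any technical jargon in plain language."},
            {"role": "user", "content": message_text},
        ],
    )
    client.chat_postMessage(
        channel=item["channel"],
        thread_ts=item["ts"],
        text=response.choices[0].message.content,
    )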

This felt like a useful feature for cases where someone is in an incident and can’t quite understand some piece of technical jargon. Unfortunately, we ran into the same issues with latency, where it would take ~20-30 seconds to respond.

With a use case like this, it felt easier to just go and Google something than wait for a response.

However, Rob added a second mode where the AI was briefed to act as a “frustrated kindergarten teacher explaining the term to a child,” which yielded some pretty funny results that generally started, “OK, Sweetie.”

Meanwhile, Aaron had been prompt-hacking a natural language interface for incident filters, so you could filter the incidents list with a sentence like, “Incidents in the last two weeks that affected the payments service.” And should this prove successful, we could even extend it to other parts of the product, like policies: someone could just give their policy a name and description, and we’d auto-generate all the conditions that encode that policy.

He got a fairly successful prototype of the feature working, with a text box in the UI that could accurately generate filters like, “Show me critical private incidents in the last two weeks.”

However, he also ran into the same token limitations as I had. Some of our filters allow you to say things like, “Where reporter is Rory,” which results in a filter with role["reporter"]=01ETNC3M9A89ERN6680GFTPPNA, where “Rory” has been replaced with my user ID. To generate filters like this, we’d have to provide mappings of all user names (and nicknames) to user IDs. This, alongside similar attributes like “custom fields,” means we’d need to provide huge prompts that bump into token limits again.
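To make that concrete, here’s an illustrative sketch of the kind of prompt you end up building (Milhouse’s ID and the mapping contents are made up for this example):

# Illustrative sketch of why filter generation blew the token budget:
# every user name (plus nicknames, custom field options, and so on) has
# to be mapped to an ID inside the prompt. Milhouse's ID here is made up.
users = {
    "Rory": "01ETNC3M9A89ERN6680GFTPPNA",
    "Milhouse": "01FAKEIDFORILLUSTRATION000",
    # ...hundreds more entries for a real organization
}

mappings = "\n".join(f"{name}: {user_id}" for name, user_id in users.items())

prompt = f"""Translate the request into filters for our incidents list.
When a filter refers to a person, use their ID from this mapping:
{mappings}

Request: Show me incidents from the last two weeks where the reporter is Rory.
"""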

Chat GP-Pass

By the end of a day trying out each of these features, we largely decided to scrap going any further with our implementations. That said, it didn’t feel like a waste of time.

We learned a lot about using OpenAI, and we know which things to keep an eye on to decide when it’s worth revisiting.

On latency, we did some investigating, and it sounds like responses in under 10 seconds are possible and were at one point the standard. But since OpenAI has blown up in popularity, few people seem to be getting fast responses, and select partner organizations are likely being prioritized for now.

Another takeaway was that prompt engineering is hard!

People on the internet joke about “prompt engineer” becoming a legitimate job. Having written a few prompts, I can imagine it. It doesn’t feel like an exact science, and it will likely change from model to model. What I found tricky was that when things went wrong, your intuition was to add more instructions, but as you layered on more and more, the prompt quickly became unwieldy and hard for a human to understand, let alone an AI model.

One final issue with taking these ideas further, which we deferred thinking about during the hackathon, is data privacy and security. As we were only testing with test incidents, this wasn’t a concern for building proofs of concept.

But had we wanted to release this as an actual feature, we’d likely have had to add safeguards to allow organizations to turn off these AI features, and perhaps even controls to filter what sort of data gets sent in prompts.


Rory Bain
Product Engineer
