
Practical guidance for getting started as a Site Reliability Engineer

At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move.

With only some high-level knowledge of the company and its systems before joining, it’s fair to say that I wasn’t certain what exactly I’d be working on or how I’d deliver it.

After a day or two of settling in, my mission became clear: build a roadmap for our infrastructure, and then set out to deliver it.

At this company, reliability is something that’s valued even more than at most others; as providers of the tooling you depend on to pick up when your own systems are broken, we become a critical dependency and need to have our product available whenever you need us.

This helps set some initial context for what might be going into the roadmap: an emphasis on availability and reliability.

But if you find yourself in this kind of position, how do you produce a roadmap when you’re starting from zero context?

Here are my tips and advice for breaking down this problem.

Getting started

Starting with a blank space where you might usually expect a roadmap is a different challenge if you come from organizations that have spent many years considering these kinds of topics, but it is not an insurmountable one.

Here are a few strategies that might help you build up context, find the problems that really matter and turn these into a plan of action.

Get well acquainted with infrastructure and the code, too

Before diving in and making any changes, it’s vital to get a good feel for the current setup within the organization.

Part of an SRE’s job, especially within a smaller team, is to enable and accelerate engineers, so you’ll need to both build empathy around the daily processes that engineers go through and have a sufficient technical understanding to make small changes to the product when required.

Definitely spend some time in “product land.” You should have a fully functioning development environment, a good understanding of the structure of the primary codebase and be able to put this into practice by picking up some smaller changes and delivering them from start to finish.

If your organization has a model like our Product Responders, pairing with them will give you a feel for some of the gnarlier issues.

Going through this process and stepping into the shoes of a product engineer should present some great opportunities to learn not just about the core product, but also about the details of deployments, observability tooling and data stores.

Talk with as many people as you can

Now this may sound obvious, but learning as much context as possible from those who’ve been living and breathing the systems you’re taking on responsibility for is going to be crucial.

Whenever you’re not working on onboarding or building up technical knowledge, try to fill the gaps with chats over coffee, walks outside or lunches together.

Beyond building relationships, which is important in itself, this is a great opportunity to find out about current pain points, tools people wish they had and any projects that have been deferred “until we have an SRE.”

It’s especially important not to limit yourself to the grizzled veteran engineers who’ve seen it all. Also talk to the new joiners, who may have useful viewpoints; the product and engineering managers, who get a good aggregate view from the engineers they work with; and leadership, who will have useful input on longer-term vision and vendor relationships.

Keep a finger on the pulse of your customers

Watch for occasions when your organization’s customers get in touch about issues that relate to infrastructure or shared concerns, whether by asking the customer support team to keep you in the loop or by monitoring your internal incidents channel.

If the opportunity is there, you should join the discussion and talk with the customer directly, which will allow you to dig into the details further.

These types of interactions may be less frequent than others, but they are very valuable, as they give you an insight into what customers value (such as latency vs. availability) and how they interact with the product, and allow you to start understanding what kinds of trade-offs you can make further down the line.

Don’t limit yourself to just your peers

As an SRE at an early-stage company, there’s a good chance that you’ll need to bring in new platforms, tools and processes. There’s also a reasonable chance that these will look different from where you’ve worked previously.

Perhaps the business needs are different, or it’s just that industry trends have evolved beyond the systems you’ve worked with previously.

As you start to build up ideas for what kinds of changes you’d like to make, you might find that being the sole SRE makes it tricky to know if you’re on the right track. It’s really useful to validate ideas like this with people outside of your organization too.

Is the hot, new container technology you’re looking at not as good as it’s cracked up to be?

Perhaps contacts at similar-sized companies will have some insight. Similarly, if your company has existing relationships with platform vendors, then lean on them by sending your thoughts and proposals over to the account manager. They may be able to tell you whether you’re following best practices and what their recommendations are based on similar-sized orgs.

Pulling it all together

If you’ve followed some of the suggestions above, then you should now have a good feel for the issues and missing building blocks within the organization and be able to make some informed decisions about the next steps.

You’ll also have a lot of diverse inputs from different stakeholders, so you’ll need to distill all of this into a sensible roadmap.

The key to this is grouping the information and picking out the common themes. After that, you’ll need to make some calls about how to tackle the problems.

The insights you’ve gained into the business, customers, engineer workflows and the product should help you out here.
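The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the notes, sources and theme labels are made up, and ranking themes by how many different kinds of people raised them is one possible heuristic rather than the method used for the roadmap below.

```python
from collections import defaultdict

# Hypothetical notes gathered from conversations. The sources, themes and
# pain points are illustrative and not taken from any real roadmap.
NOTES = [
    {"source": "product engineer", "theme": "compute", "pain": "deployments are hard to control"},
    {"source": "engineering manager", "theme": "compute", "pain": "request routing is unreliable"},
    {"source": "leadership", "theme": "compute", "pain": "no view of container resource usage"},
    {"source": "customer support", "theme": "database", "pain": "upgrades risk downtime"},
    {"source": "product engineer", "theme": "database", "pain": "little visibility into the database"},
    {"source": "new joiner", "theme": "observability", "pain": "logs are hard to use"},
]


def group_by_theme(notes):
    """Collect notes under their theme label."""
    grouped = defaultdict(list)
    for note in notes:
        grouped[note["theme"]].append(note)
    return dict(grouped)


def rank_themes(grouped):
    """Order themes by how many distinct sources raised them, most first.

    Ties are broken alphabetically so the ordering is stable.
    """
    def distinct_sources(theme):
        return len({note["source"] for note in grouped[theme]})

    return sorted(grouped, key=lambda theme: (-distinct_sources(theme), theme))


if __name__ == "__main__":
    grouped = group_by_theme(NOTES)
    for theme in rank_themes(grouped):
        print(theme)
        for note in grouped[theme]:
            print(f"  - {note['pain']} ({note['source']})")
```

Counting distinct sources rather than raw mentions is a deliberate choice in this sketch: a problem raised by engineers, managers and customers alike is probably a stronger signal than one person raising it several times.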

Be careful not to plan too far into the future. Focus on the problems that are causing pain right now and revisit the roadmap in a few months’ time, rather than trying to set out a multiyear strategy from the beginning.

To make this all more concrete, here’s the summarized roadmap that I created.

Theme: Compute — How we deploy and run our core codebase

  • Better control of deployments and how we cut over to new versions.
  • Improving reliability and performance of routing HTTP requests to our application code.
  • Improving observability around the system-level metrics of our containers (such as CPU, memory, open file handles).

Theme: Database — How we store the data that powers our product

  • Gaining the ability to do PostgreSQL major version upgrades with minimal downtime.
  • Improving observability into what’s happening within the database.

Theme: Observability — How we monitor the health of our product and systems

  • Gaining the ability to capture application-level metrics. We work with logs and traces, but we’d like to instrument the application further.
  • Improving how we store logs and how they can be used effectively.
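To make “application-level metrics” a little more concrete, here is a minimal sketch of what instrumenting a request handler could look like. Everything in it is an assumption for illustration: the `Metrics` class, the metric names and the `handle_request` function are hypothetical, the language is Python rather than whatever the product is written in, and a real setup would more likely use an established metrics library than a hand-rolled registry.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """A tiny in-process registry: counters plus recorded durations."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, value=1):
        self.counters[name] += value

    @contextmanager
    def timer(self, name):
        # Record how long the wrapped block took, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def handle_request(path):
    """Hypothetical handler that counts requests and errors and times itself."""
    with metrics.timer("http_request_seconds"):
        metrics.increment("http_requests_total")
        if path == "/broken":
            metrics.increment("http_errors_total")
            return 500
        return 200


if __name__ == "__main__":
    handle_request("/ok")
    handle_request("/broken")
    print(dict(metrics.counters))
```

Logs and traces describe individual requests. Counters and timings like these are cheap to aggregate, which is what makes them useful for dashboards and alerting.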

Distilling all of this into a document and sharing it with key stakeholders should give you the buy-in to go after the problems you need to tackle.

Ben Wheatley
Site Reliability Engineer
