All incident.io developers are given a MacBook for their development work.
That meant when Apple released the M3 MacBook Pros this past October, our team naturally started asking questions like, "Wow, how much more productive might I be if my laptop looked that good?", and "Perhaps we'd be more secure if our machines were Space Black".
Our CTO, Pete, responded: "If you can prove it's worthwhile, we'll do it."
So with an upgrade to a machine that Apple itself describes as "unmistakably pro" hanging in the balance and a personal challenge from Pete to find data to prove it was worthwhile, we set off on a journey that included:
With this, we had everything (and more!) needed to conclusively decide if the M3 was worth the upgrade, or if our team would be getting coal in their stockings this Christmas.
We wanted to know if upgrading existing developer machines from M1/M2 chips to M3 is worth it.
Quantifying developer productivity is difficult (see Measuring developer productivity), but we know (intuitively, and through research such as Maximizing Developer Effectiveness) that faster feedback loops make developers more effective.
For our team, the most common feedback loops in local development are:
All developers work with a fully-fledged incident.io environment locally on their laptops: it allows for a <30s feedback loop between changing code and running it, which is a key factor in how productively you can work with our codebase.
Compiling the Go app is one of the most frequent of these loops and one of the most expensive, with the codebase nearing 1M lines of code. As the most resource-hungry of these tasks, it would also likely benefit most from more powerful hardware.
For that reason we picked the speed of Go compilation as the key metric of MacBook performance. Now we just needed to measure it.
We've used codegangsta/gin as a hot-reloader for Go since first creating the incident.io GitHub repo. It worked well (its job being to compile and then restart the new version of the app whenever code files change) but we'd recently hit issues that made us consider moving.
When we looked at alternative hot-reloaders, though, we didn't find any that would provide the telemetry we wanted for tracking our build times.
For each build, what we wanted to track was:
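Broadly, that meant details about the machine itself, its state at build time (memory pressure, power source), and the build (duration, per-stage timings, exit status, and which files triggered it). As a rough sketch, you can picture each build producing an event along these lines; the field names are illustrative guesses based on the measurements discussed later, not our exact schema:

// BuildEvent is a sketch of the telemetry emitted per build.
// Field names are illustrative, not the real incident.io schema.
type BuildEvent struct {
	Platform        string   `json:"platform"`         // e.g. "Apple M2 Pro"
	MemoryTotalGB   int      `json:"memory_total_gb"`  // machine memory configuration
	MemoryFree      string   `json:"memory_free"`      // percentage, parsed from memory_pressure
	PowerSource     string   `json:"power_source"`     // AC vs battery
	DurationSeconds float64  `json:"duration_seconds"` // wall-clock build time
	ExitStatus      int      `json:"exit_status"`      // 0 on success
	Cancelled       bool     `json:"cancelled"`        // superseded by a newer build
	ChangedFiles    []string `json:"changed_files"`    // files that triggered this build
	BuildStages     map[string]struct {
		DurationSeconds float64 `json:"duration_seconds"`
	} `json:"build_stages"` // compile, link, ...
}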
In the absence of a ready-made alternative, we decided to build our own that we could tailor to our needs. As our primary motivation was to produce build telemetry, this began as a modest main.go that we tweaked until it could extract what we needed from the host.
Mac machines have several binaries that can help you with this (memory_pressure, docker, sysctl and pmset) and the code is mostly exec-ing these and parsing their output:
memoryFreeRegex := regexp.MustCompile(`^System-wide memory free percentage:\s+(\d+)%`)

// Run memory_pressure...
memoryPressure, _ := exec.Command("memory_pressure").Output()
for _, line := range strings.Split(string(memoryPressure), "\n") {
	if memoryFreeRegex.MatchString(line) {
		// ...parse result!
		ev.MemoryFree = memoryFreeRegex.FindStringSubmatch(line)[1]
	}
}
Once we'd built the system + runtime collectors, it was simple to wrap a Go build command so we could extract build-specific measures like time in stage (linker, compile, etc) and track which files had caused us to trigger a build (more on why this is useful later).
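Stripped right down, the wrapping looks something like the sketch below. This is a minimal illustration rather than our actual hot-reloader, and the output path and build target are placeholders:

package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	// Time a single build the way the hot-reloader wraps one.
	start := time.Now()
	cmd := exec.Command("go", "build", "-o", "/tmp/app", "./...")
	out, err := cmd.CombinedOutput()

	// Record duration and exit status for the telemetry event.
	duration := time.Since(start)
	exitCode := -1
	if cmd.ProcessState != nil {
		exitCode = cmd.ProcessState.ExitCode()
	}
	log.Printf("build finished in %.1fs (exit %d)", duration.Seconds(), exitCode)
	if err != nil {
		log.Printf("build output:\n%s", out)
	}
}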
The end result was a hot-reloader we ran from our existing make run target.
This was an invisible change for our engineering team, but it meant that, on completion of every build, we now send telemetry events to an HTTP endpoint of our choosing. We set up a Fivetran webhook endpoint (Fivetran is our ETL tool, see Modern data stack for startups) to push the events into our data warehouse, and that was it!
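Shipping an event is just a JSON POST once the build finishes. Here's a minimal sketch; the URL and field values are placeholders rather than our real configuration:

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// A finished build becomes one JSON event, POSTed to the webhook
	// receiver that feeds the data warehouse.
	event := map[string]any{
		"platform":         "Apple M2 Pro",
		"duration_seconds": 42.3,
		"exit_status":      0,
		"power_source":     "AC",
	}

	body, _ := json.Marshal(event)
	resp, err := http.Post(
		"https://webhooks.example.com/build-events", // placeholder endpoint
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatalf("sending build event: %v", err)
	}
	defer resp.Body.Close()
	log.Printf("telemetry accepted: %s", resp.Status)
}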
Finally, we have the data we need to judge our Mac Battle.
This is the part we were most excited about: having done the work to produce high-quality data and (impatiently) waited several weeks to build a large enough dataset, it was time to nerd out and see what it could tell us.
It was particularly exciting because of the type of analysis it would require: most developers will know that build time is highly variable, even on the same system. The Go compiler especially does a lot to cache builds, so even the most powerful M3 Max will be much slower building with no cache than an old Intel MacBook with a warm cache.
All this means we can't just average the build times across platforms and compare them directly.
Instead, we'll need to really dig into the underlying data, which presents a great opportunity (excuse?) to play with OpenAI's fancy new code interpreter.
It's almost a perfect use case for it, after all!
We use OpenAI to power features like auto-generating incident summaries and (soon!) a chat interface to your incident data, so we have experience analyzing data with AI tools.
Generally, the process includes exporting the data we want to analyze from our warehouse, with a query along the lines of:
select * except(payload) from developer__build_events
We're using the experimental gpt-4-1106-preview model to power our assistant, and enabled code interpreter to power the data analysis. It only took a few minutes before we had the assistant ready and responding to our questions in the playground.
This is the interface we'll use to query our data, simply by asking the assistant questions.
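We set ours up through the playground UI, but for the curious, creating an equivalent assistant via the API at the time looked roughly like this. The Assistants API was in beta (and has since changed), so treat this as an illustrative sketch; the file ID and instructions are placeholders:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Create an assistant with code interpreter enabled and a CSV of build
	// events attached. Beta-era request shape; the file ID is a placeholder.
	payload := []byte(`{
	  "name": "Build telemetry analyst",
	  "model": "gpt-4-1106-preview",
	  "tools": [{"type": "code_interpreter"}],
	  "file_ids": ["file-abc123"],
	  "instructions": "You analyse a CSV of Go build telemetry events."
	}`)

	req, err := http.NewRequest("POST", "https://api.openai.com/v1/assistants", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("OpenAI-Beta", "assistants=v1")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}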
We have about 25k builds in our dataset taken from all times of the day, a variety of laptops, and many different conditions.
For us to make good comparisons between platforms, we'd want the conditions to be as fair as possible, which means removing builds that have confounding factors.
Let's get a picture of how many of our build events are unsuccessful, along with a sense of how many builds there were for each platform.
Asking:
Show me the distribution of builds by machine platform, where the platforms are ordered by M1 to M3, and within the platform class Pro comes before Max.
Split each platform build count by whether the build was successful (exit status 0 and non-cancelled) vs failures.
I'm a visual learner, and would like to see these results where we order platforms by M1 to M3, and within each platform class Pro comes before Max.
Feels weird to be talking to a machine like this but…
This is exactly what we asked for. Pretty wild, right?
Failed builds are almost never what we want, so we can instruct the assistant to drop these builds:
From this moment on, please exclude failed builds from the dataset.
🤖 I have excluded the failed builds from the dataset. We are now working with a subset that contains 12,525 successful builds.
We have this assumption that machines on battery power would be throttled, causing the build times to be poorly representative of general performance. I'd love to confirm this by comparing the performance of the same platform across power sources.
Asking:
Focusing only on successful builds with the Apple M1 Pro and Apple M2 Max platforms, can you visually compare the build duration across different power sources please?
We can see that AC power is generally faster, even though the longest builds appear in the AC categories (we run many more builds on AC than on battery, so we capture more outliers there).
The cool thing about having an assistant, though, is you can ask it directly what it thinks:
In your opinion, does being on AC or battery power impact build performance?
🤖 I will compute the t-test for the successful builds dataset and report the p-value. A low p-value (typically less than 0.05) would suggest that there is a significant difference in build performance between AC and battery power.
…
The negative t-statistic indicates that the mean build duration on AC power is lower than on battery power. Moreover, with a p-value of approximately 0.0014, which is well below the common threshold of 0.05, we have sufficient evidence to reject the null hypothesis.
Given the results of this statistical test, we can conclude that being on AC or battery power does impact build performance across all platforms, with builds on AC power tending to be faster than those on battery power.
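For the curious, here's roughly what that test involves. The assistant ran it in Python via code interpreter; this Go sketch assumes a Welch's t-test (the unequal-variance variant) and uses toy durations rather than our real data:

package main

import (
	"fmt"
	"math"

	"gonum.org/v1/gonum/stat"
	"gonum.org/v1/gonum/stat/distuv"
)

// welchTTest returns the t statistic and two-sided p-value for the difference
// in means between two samples with (possibly) unequal variances.
func welchTTest(a, b []float64) (t, p float64) {
	meanA, varA := stat.MeanVariance(a, nil)
	meanB, varB := stat.MeanVariance(b, nil)
	nA, nB := float64(len(a)), float64(len(b))

	se := math.Sqrt(varA/nA + varB/nB)
	t = (meanA - meanB) / se

	// Welch-Satterthwaite approximation of the degrees of freedom.
	df := math.Pow(varA/nA+varB/nB, 2) /
		(math.Pow(varA/nA, 2)/(nA-1) + math.Pow(varB/nB, 2)/(nB-1))

	dist := distuv.StudentsT{Mu: 0, Sigma: 1, Nu: df}
	p = 2 * dist.CDF(-math.Abs(t))
	return t, p
}

func main() {
	// Toy build durations in seconds; a negative t means the first group is faster.
	ac := []float64{28.1, 30.4, 25.9, 27.2, 31.0, 26.5}
	battery := []float64{33.8, 36.2, 31.9, 35.1, 34.4, 37.0}

	t, p := welchTTest(ac, battery)
	fmt.Printf("t = %.2f, p = %.4f\n", t, p)
}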
That clears it up: we'll consider only successful, AC-powered builds from now on!
We keep a close eye on build performance for our Go monolith: while buying hardware is one way to speed things up, nothing beats removing work from the build or fine-tuning the build process itself.
For this reason, we're well aware that depending on what you try to build, you might get very different build times.
That's because Go projects comprise many "packages" (code modules) that the Go compiler will cache, only recompiling a package if it thinks something has changed. Our app is deliberately architected for a wide dependency graph with few packages at the base, so that most changes don't have to recompile the entire graph, but some level of dependency nesting is unavoidable.
This means builds tend to be either:
Instant (<3s)
Your change is not relevant to the Go compiler: you haven't modified any package symbols, so we can serve a binary directly from cache, with no compilation or expensive linking necessary.
Fast (<30s)
You're changing a single package with few dependents, so much of the app is cached and only your package needs compiling. The majority of the time is spent linking the newly compiled package against the existing cache.
Medium (30s-1m)
You've modified a feature package with a few sub-package dependencies, but the majority of the app remains unchanged and Go can reuse the cache.
Slow (1m+)
You've added a type to our base domain package, and now every package in the app needs recompiling.
Whatever measurements we use to draw comparisons, we need to be aware of these variations so we can avoid comparing apples to oranges.
Now that we've cleaned our data down to just successful, AC-powered builds and understand a bit more about how build times can vary, we can begin comparing platforms to one another.
Let's start by asking:
Focusing on just the M1 Pro and M2 Max (as these are the platforms we have the most data on) can you chart two histograms showing successful build durations, removing extreme outliers.
Overall, it's clear the M2 Max totally destroys the M1 Pro in terms of build speed. But what we haven't shared so far is that these machines differ in more than just chipset…
Can you show me how the build events split by machine platform and memory configuration please?
🤖 Here is the number of successful build events for each memory configuration within each platform:
Hmm, interesting. So the comparison between our M1 Pros and M2 Maxes is somewhat unfair, as the M2s all have 32GB of memory, twice that of the M1s (16GB).
That could be the reason the performance is so different (despite claims that Apple machines need little RAM…) and would be useful to know when deciding what specification of M3 we should upgrade to.
Thankfully we have 2k build events from an M2 Pro with 16GB of memory, so…
Can you redraw the histogram comparison of build durations, this time for the Apple M2 Pro 16GB and the Apple M2 Max 32GB normalized, please?
As a primer: the M2 Pro and M2 Max are mostly the same chip, with the main difference being that the Max has two additional "energy efficient" cores. Those cores are ~1/5th as powerful as the performance cores and will contribute minimally to compiling a Go program, and therefore to our build durations.
This comparison, then, is quite useful: it implies the extra memory is making little impact, perhaps only contributing to the higher density of fast build times, but nothing very noticeable overall.
From what we've seen so far, the M2 Pro with 16GB of memory seems the best value for money: almost twice as fast as the M1 and only marginally slower than the M2 Max (even though the Max has twice the memory).
Now we need to figure out if the M3 is a meaningful improvement over the M2, and for that we needed to buy some laptops.
We purchased:
Three laptops in total, with the Pros differing in memory, while the Max is an attempt to measure whether we'll see much difference from those extra 4 P-cores.
If we begin with the Pros:
Show me histograms of build duration for the M3 Pro 18GB and M3 Pro 36GB.
We've previously concluded that memory makes little difference to build performance, so it's unsurprising these graphs look similar.
In fact, I'd expect them to look even more alike than they do, but we're suffering from a lack of data, with far fewer builds for the M3s than we have for other platforms.
As we're low on data, let's combine the results for both M3 Pros, especially as we think they're quite similar.
Show me histograms of build duration for the M3 Pro 18GB and 36GB combined vs the M3 Max, removing very quick builds (<3s).
Again, we're suffering from much less data here, but even so it's not looking like the M3 Max is doing anything truly remarkable beyond the already-fast M3 Pro.
At least, not enough to justify the 60% price increase over the base M3 Pro.
By now, we've:
It's time to get a full picture of how these platforms compare…
Show me histograms comparing build duration across all platform and memory combinations, excluding builds that complete in <3s.
It's clear that variation within platforms (colored) is minimal, so for the purposes of our comparison…
Please repeat the above but with one histogram per-row and combining the M1, M2 and M3 platforms.
Which makes it clear that:
Suggesting that:
Done!
We've compared machines with different amounts of memory and couldn't see meaningful improvements from ~16GB to 32GB, at least not in the build durations we're measuring.
But this was surprising. When guessing how this experiment might pan out, we really thought the extra memory would make a big difference, and it feels weird that we can't see that in our graphs, or at least not very much.
After thinking a bit, we wondered whether the impact of more memory might show up somewhere else in what we measured, and whether it was simply less consequential to builds overall than it might be to other things.
So what might we measure that could help determine this?
Well, if you remember our categorization of fast/medium/slow builds, a consistently large part of each build was time spent in the linker, where the compiled Go packages are joined together into one executable binary.
This process is (from memory of this doc) minimally concurrent, and spends a lot of time building structures in memory to perform symbol resolution and check that each of the compiled modules is compatible with the others.
Itās very possible that additional memory could help here.
In our dataset is a build_stages column. Can we add a new column to the dataset which is called "linker_time", which is derived from build_stages.link.duration_seconds.
🤖 The linker_time column has been successfully added to the dataset. The new column reflects the duration in seconds of the linker stage, link.duration_seconds, extracted from the JSON structure within the build_stages column.
That's nice: our telemetry events contain durations of linking/compiling, so we can tell our assistant to build a new linker_time column from our build events.
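To make that concrete, here's a small sketch of pulling linker time out of one of these events. The exact JSON shape here is assumed from the build_stages.link.duration_seconds path above, not a verbatim event:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// An illustrative build event; only the build_stages shape matters here.
	raw := []byte(`{
	  "platform": "Apple M3 Pro",
	  "duration_seconds": 54.2,
	  "build_stages": {
	    "compile": {"duration_seconds": 31.7},
	    "link":    {"duration_seconds": 22.5}
	  }
	}`)

	var event struct {
		BuildStages map[string]struct {
			DurationSeconds float64 `json:"duration_seconds"`
		} `json:"build_stages"`
	}
	if err := json.Unmarshal(raw, &event); err != nil {
		panic(err)
	}

	// linker_time is simply the duration of the "link" stage.
	linkerTime := event.BuildStages["link"].DurationSeconds
	fmt.Printf("linker_time = %.1fs\n", linkerTime)
}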
Please graph histograms of linker_time for each platform and memory combination, in the order specified previously, colouring each histogram differently depending on whether the machine has 16-18GB or 32-36GB of memory.
Well, that looks quite different! Which is to say: every machine (M1, M2 or M3) with more than 30GB of memory almost always completes linking in under 20s, and only machines with 18GB of memory or less routinely take more than 20s to link.
This is cool because:
All this means that, for the nominal cost of the memory bump, it makes sense to buy the 36GB model, even if you consider it just "future proofing".
In the face of irrefutable evidence that new Space Black M3 Pro laptops would dramatically improve not only how cool we look on our commute, but also how fast we can ship changes and delight customers, the decision was an easy one and the new laptops are now on their way!
Specifically, we'll upgrade our M1s to the base M3 Pro with 36GB of memory, but wait a while for the M2s, as they seem to be performing really well already.
Besides the laptops, this was a really fun journey that helped us improve our tooling and get a much better understanding of our developer environment.
Along the way we:
I hope you enjoyed following along, and perhaps the conclusions we've drawn can be useful to you, too.