It’s been over 6 months since Lawrence’s excellent blog post on our data stack here at incident.io, and we thought it was about time for an update. This post runs through the tweaks we’ve made to our setup over the past 2 months and the challenges we’ve found as we’ve scaled from a company of 10 people to 30, now with a 2-person data team (soon to be 3 - we’re hiring)!
To summarise, we’ve:

- moved our Fivetran syncs (our raw source data) into a separate BigQuery project to avoid hitting concurrent query limits
- stopped overlapping CircleCI runs of our dbt pipeline from clashing with each other
- moved from the GOOGLEFINANCE function in Google Sheets to using an API feed for our GBP:USD FX rates
- changed our default dbt materialisation from views to tables
- introduced intermediate (int) models alongside our staging, dimension and fact models
- added sqlfluff as a linter
- restructured our Metabase folders

Our data stack can be broadly split into 3 sections:

- Extract & load: Fivetran, loading raw data into Google BigQuery
- Transform: dbt, run via CircleCI
- Analyse: Metabase
(Image: where a relationships test is specified in our dbt yml file, and the corresponding column in Metabase’s data model.)
We broadly stick with the ‘ELT’ approach to getting data into Google BigQuery, our analytics database, by using a mixture of extract & load tools to ingest raw data - then letting dbt do the heavy lifting of transforming everything via SQL.
There are a few exceptions:
Fivetran migration
At the end of June, we were starting to see some of our dbt test jobs fail in CircleCI without any clear error message. Each time, a single test (a different one each time) would time out after 10 minutes - despite each test taking less than 5 seconds to run in BigQuery itself.
On inspecting our BigQuery monitoring page, it turned out that more than 100 interactive queries were running concurrently in our main BigQuery project - over the maximum quota of 100.
Fivetran syncs, hourly CircleCI runs of our master dbt branch, Metabase, and any ad-hoc queries run by us were all competing for resources in the same BigQuery project - not ideal!
We had a couple of options:

1. Switch our dbt jobs to run as batch queries instead of interactive queries, which would prevent concurrency issues, but the code (and as a result any code changes) would be slower to run (see the sketch below)
2. Move our Fivetran syncs (our raw source data) into a separate BigQuery project

We went for option 2 as, in any case, it’s good practice to have your source data stored separately from your analytics data - both for keeping the BigQuery projects from becoming overloaded with several schemas (datasets) and for controlling who has access to raw data.
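For a sense of what option 1 would have involved: dbt’s BigQuery adapter lets you set the query priority per target in profiles.yml. The sketch below is illustrative only - the profile and project names are made up - and just shows where the batch/interactive switch and the separate analytics project would live.

```yaml
# profiles.yml - illustrative sketch, not our actual profile
incident_analytics:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: analytics-bq-project   # dbt builds models here; Fivetran now lands raw data in a separate project
      dataset: dbt_production
      threads: 8
      priority: interactive           # option 1 would have set this to "batch" to sidestep the concurrent query quota
```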
Fortunately, Fivetran is very easy to get set up for all of the different connectors, and this took us less than a day’s worth of work in total to get everything shifted across.
We’ll probably encounter some concurrency issues when we scale even further with more analysts, but for the time being this setup will keep us running nicely for the next 3-6 months.
Run concurrency
When code changes are merged, CircleCI runs our dbt pipeline from scratch. We also run the code every hour on a schedule. At the moment, our dbt pipeline takes ~5 minutes to run and test end to end (perks of being early stage B2B), so every now and then a code change and a scheduled run would overlap.
This presented some weird cases where 2 pipelines, both creating and dropping the same set of tables in the same dbt_production schema, were clashing.
The fix for this was fairly straightforward: we used the queue workflow orb developed by an engineer at CircleCI. This checks for concurrent workflows (our “deploy” workflow is essentially our dbt pipeline) running on the same branch, and makes a workflow wait for the prior one to finish before running.
(Screenshot: our scheduled dbt pipeline (workflow) waiting for a prior run to finish before running the rest of our code.)
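For illustration, wiring the orb into a CircleCI config looks something like the sketch below - the orb version and the dbt job name are placeholders rather than our actual setup.

```yaml
# .circleci/config.yml - illustrative sketch only
version: 2.1

orbs:
  queue: eddiewebb/queue@1.6.4   # the queue orb; pin whichever version you use

workflows:
  deploy:
    jobs:
      # Blocks until any earlier "deploy" workflow on this branch has finished (or the timeout is hit)
      - queue/block_workflow:
          time: "30"             # max minutes to wait
      # Our dbt run + test job (name is hypothetical) only starts once the queue job completes
      - dbt_run_and_test:
          requires:
            - queue/block_workflow
```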
As a side note, we’re going to look into testing dbt Cloud later in the year, as there are a lot of nice features that (unsurprisingly) integrate tightly with dbt itself.
FX rates
A lot of our Stripe subscriptions are denominated in GBP, but we report figures such as ARR in USD.
Previously, we had a simple Google Sheet set up using the GOOGLEFINANCE function, which pulled daily GBP:USD FX rates that we then linked into BigQuery and used in a dbt model.

However, there are some known issues with this function - primarily that it’s not reliable and frequently returns #N/A.
We needed a reliable set of FX rates that populates daily, so we solved this in a few steps:

1. Take a snapshot of the historical rates pulled by GOOGLEFINANCE and version control them in a CSV file, then load these into BigQuery using a dbt seed. A benefit here was being able to easily write tests to make sure this data never changed in terms of row count or values
2. Pull daily rates going forward into BigQuery via an API feed
3. Combine 1 and 2 together via a UNION ALL (a sketch of this is below)
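As a rough sketch of step 3 - the model, seed and column names here are made up for illustration, not lifted from our codebase:

```sql
-- fx_rates_gbp_usd.sql - illustrative sketch only
-- Historical rates snapshotted from GOOGLEFINANCE, loaded via a dbt seed
select
    rate_date,
    gbp_usd_rate
from {{ ref('seed_historical_gbp_usd_rates') }}

union all

-- Daily rates landing in BigQuery from the API feed
select
    rate_date,
    gbp_usd_rate
from {{ source('fx_api', 'daily_gbp_usd_rates') }}
where rate_date > (select max(rate_date) from {{ ref('seed_historical_gbp_usd_rates') }})
```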
Materialisation
A fairly straightforward change: we switched all of our models to materialise as tables by default, instead of views. This optimises performance downstream in Metabase when running queries - at the expense of a slightly longer pipeline runtime (which has little to no effect at our current size!).
There was a really interesting post from dbt Labs themselves on how they dynamically materialise tables depending on which environment their code is running in. In short, they materialise some models as views when running in development and staging (for speed of running code), but as tables when running in production (for downstream query speed).
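For reference, both the blanket default we’ve adopted and the dynamic approach from that post boil down to a one-line config in dbt_project.yml - something like the sketch below (the project and target names are placeholders):

```yaml
# dbt_project.yml - illustrative sketch only
models:
  my_project:
    # What we do today: everything materialises as a table by default
    +materialized: table

    # The dynamic alternative from the dbt Labs post would look roughly like:
    # +materialized: "{{ 'table' if target.name == 'prod' else 'view' }}"
```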
Staging and intermediate tables
Previously, every table in our dbt codebase was either a staging (stg) table, a dimension (dim) table, or a fact (fct) table.
One struggle we had was separating which staging tables were just a rename & recast operation - selecting from a single source - and which were more complex bits of SQL joining different models together.
So, in line with dbt’s style guide, we introduced intermediate (int) models into our codebase. However, we’ve made some design choices that can be seen below with a snapshot from our DAG:
int_product__user_action_incident_creation defines the logic for when a user declares a “real” (not test or tutorial) incident in one place, and reuses it in 2 separate places without having to define it twice (a rough sketch is at the end of this section):

- dim_users, on a timestamp level (e.g. first_incident_created_at)
- int_product__user_actions_by_day, at a daily level (e.g. number_incidents_created)

Given our pattern of enforcing fct_user_statistics to run before dim_users, so that the latter can pull the latest row’s worth of statistics data (e.g. number_incidents_created_to_date), if we didn’t have this intermediate table then in dim_users we’d have to either:

- duplicate the logic that fct_user_statistics uses as a CTE within dim_users to get timestamps for first_incident_created_at - which is an antipattern because we’re defining the same logic in 2 places, or
- settle for first_incident_created_date instead, which seems a bit arbitrary to lose granularity given how simple it is to fix

int_product__user_actions_by_day is then simply pulled into fct_user_statistics by a LEFT JOIN, as it’s already on a date & user_id level. There are other int_product__user_action_* tables that go into this rollup table that aren’t shown in the diagram.

int_product__user_stats_org_level gives us an easy way of rolling up daily user metrics to a date and organisation_id level, which is pulled into fct_organisation_statistics as a simple LEFT JOIN.
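As promised, here’s a rough sketch of what this pattern looks like in practice - the source and column names below are made up for illustration rather than lifted from our codebase:

```sql
-- int_product__user_action_incident_creation.sql - illustrative sketch only
-- One row per "real" (not test or tutorial) incident declared, so the
-- definition of "real" lives in exactly one place.
select
    creator_user_id as user_id,
    created_at      as incident_created_at
from {{ ref('stg_product__incidents') }}
where not is_test_incident
  and not is_tutorial_incident
```

dim_users can then take the min(incident_created_at) per user to get first_incident_created_at, while int_product__user_actions_by_day counts rows per user per day to get number_incidents_created - both reusing the same definition of a “real” incident.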
Linter
It’s easier to review someone else’s code when you both format your code in a similar way. It’s also easier to write code when formatting is automatically taken care of for you.
Enter sqlfluff, a Python package that automatically formats your SQL code for you and is even compatible with dbt’s Jinja syntax!
Both of us currently develop using VSCode, and there’s a really nice extension that actually runs sqlfluff as you type and highlights errors - and even autoformats your code with a keyboard shortcut.
We use this in conjunction with a version controlled .sqlfluff file that lets you configure sqlfluff rules and ignore rules you don’t like.
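We won’t reproduce our exact config here, but a minimal .sqlfluff file looks something like this - the dialect and templater match our stack, while the specific rule choices are just examples:

```ini
# .sqlfluff - illustrative sketch only
[sqlfluff]
dialect = bigquery
templater = dbt
# Rules switched off entirely (example rule codes, not our actual list)
exclude_rules = L034, L044

[sqlfluff:rules]
max_line_length = 120
```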
There are several benefits to this setup:
The only major changes to our Metabase setup have come from our folder structure. Beforehand, we had an “Our Analytics” folder (the default place you’d save to) and a scattering of folders for the different areas of incident (Product, Sales etc.).
We launched a couple of new folder structures which resulted in 4 main areas of Metabase:
The addition of the 2 new areas has made it simple to figure out where to go to see how any of our new product launches are going, and gives us somewhere to save all of our charts in a one-off folder when we’re running a deep dive analysis.
It also optimises for users who want to use Metabase as ‘read only’, as the core dashboards folder is basically the one-stop shop for everything we use to run the company.
Importantly, it also doesn’t discourage anyone in incident.io from getting hands-on with data and saving their own questions and dashboards somewhere. There’s a delicate balance to strike between having “data approved” sections and people feeling like there’s a high barrier for what goes into Metabase.
A really nice perk of Metabase is that you can save a chart or a table in one place (as a question) and reuse it across several dashboards. So a question sat in our “Core - product questions” folder could appear in both our key metrics dashboard and our growth dashboard, and edits made to it are reflected in both places.
We’ve spent a fair amount of time over the past couple of months building out the core data models we’ll need to answer most of our questions in a quick and repeatable way. We’ve also made sure that our data pipeline is robust and well tested, so more recently our work has shifted towards focusing heavily on the analytics side of things!
From a data stack perspective, there are still a couple of things we’d like to look at in the coming months:

- Split our dbt tests into two severity levels - error (get this fixed now) and warn (fix at some point this week) - and get our CI/CD process set up to send warn messages to a separate Slack channel. Currently we’re only notified of error messages via CircleCI, and only spot warn messages when we test locally (there’s a sketch of the severity config after this list)
- Use dbt’s state:modified+ selector to only run CI checks for modified files and their downstream dependencies, rather than our whole pipeline. This is a very minor point given how quick our pipeline runs in general, but it’s a relatively easy change to put in for probably quite a good performance gain
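On the first point, dbt lets you set a severity per test; a sketch of what that config might look like (the model, column and relationship names are placeholders):

```yaml
# schema.yml - illustrative sketch only
models:
  - name: dim_users
    columns:
      - name: organisation_id
        tests:
          - not_null:
              config:
                severity: warn   # reported as a warning instead of failing the run
          - relationships:
              to: ref('dim_organisations')
              field: organisation_id
              config:
                severity: error  # still fails the pipeline
```

For the second point, the CI step would end up running something like dbt build --select state:modified+ --state <path to production artifacts>, rather than building and testing the whole project every time.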
We hope you enjoyed reading through what we’ve been up to in the world of data! If you’d like to join us, we’re hiring into our data team (and a bunch of other roles too).