It’s been over 6 months since Lawrence’s excellent blog post on our data stack here at incident.io, and we thought it was about time for an update. This post runs through the tweaks we’ve made to our setup over the past 2 months and challenges we’ve found as we’ve scaled from a company of 10 people to 30, now with a 2 person data team (soon to be 3 - we’re hiring)!
To summarise, we’ve:
GOOGLEFINANCEfunction in Google Sheets to using an API feed for our GBP:USD FX rates
Our data stack can be broadly split into 3 sections:
Extract & load:
relationshipstest is specified.
In our dbt yml file
The column in Metabase’s data model
We broadly stick with the ‘ELT’ approach to getting data into Google BigQuery, our analytics database, by using a mixture of extract & load tools to ingest raw data - then let dbt do the heavy lifting of transforming everything via SQL.
There are a few exceptions:
At the end of June, we were starting to see some of our
dbt test jobs fail in CircleCI without any clear error message. Each time, a single test (a different one each time) would time out after 10 minutes - despite each test not taking >5 seconds to run in BigQuery itself.
Fivetran syncs, hourly CircleCI runs of our master dbt branch, Metabase, and any ad-hoc queries run by us were all competing for resource in the same BigQuery project - not ideal!
We had a couple of options:
interactivequeries, which would prevent concurrency issues but the code (and a result any code changes) would be slower to run
We went for option 2, as in any case, it’s good practice to have your source data stored separately from your analytics data - both for keeping the BigQuery projects from becoming overloaded with several schemas (datasets) and controlling who has access to raw data.
Fortunately, Fivetran is very easy to get set up for all of the different connectors, and this took us less than a day’s worth of work in total to get everything shifted across.
We’ll probably encounter some concurrency issues when we scale even further with more analysts, but for the time being this setup will keep us running nicely for the next 3-6 months.
When code changes are merged, CircleCI will run our dbt pipeline from scratch. We also run the code every hour on a schedule. At the moment, our dbt pipeline takes ~5 minutes to run and test end to end (perks of being early stage B2B), so infrequently we’d have cases where a code change and a schedules run overlapped.
This presented some weird cases where 2 pipelines, both creating and dropping the same set of tables in the same
dbt_production schema, were clashing.
The fix for this was fairly straightforward, we used the queue workflow orb developed by an engineer at CircleCI. This checks for concurrent workflows (our “deploy” workflow is essentially our dbt pipeline) running on the same branch, and makes a workflow wait for the prior one to finish before running.
Below is an example of our scheduled dbt pipeline (workflow) waiting for a prior run to finish before running the rest of our code.
As a side note, we’re going to look into testing dbt cloud later in the year as there are a lot of nice features that (unsurprisingly) tightly integrate with dbt itself.
A lot of our Stripe subscriptions are denominated in GBP, but we report figures such as ARR in USD.
Previously, we had a simple Google sheet set up using the
GOOGLEFINANCE function which pulled daily GBP:USD FX rates, which we could then linked into BigQuery and used in a dbt model.
However, there are some known issues with this function - primarily that it’s not reliable and frequently returns
We needed a reliable set of FX rates that populates daily. So we solved this in 4 steps:
Take a historical list of GBP:USD FX rates from
GOOGLEFINANCE and version control them in a CSV file, then load these into BigQuery using a dbt seed. A benefit here was being able to easily write tests to make sure this data never changed in terms of row count or values
Write a simple Python script to pull daily GBP:USD FX rates from the APILayer exchange rates API, write to a CSV file, then load this into BigQuery as a drop & replace. We can query up to 365 days’ worth of data from this API, so the only manual effort needed is, annually, to add a year’s worth of data to the seed file from step 1 and update the extraction dates
2 together via a
Write dbt tests to ensure that every combination of date and currency pair is unique, that every currency pair has data up until the current date, and that for every currency pair the dates are gapless - i.e. if there are 99 days between our earliest and latest GBP:USD rates data then there should be 100 rows of data
A fairly straightforward change, but we changed all of our models to by default materialise as tables instead of views. This optimises performance downstream in Metabase when running queries - at the expense of a slightly longer runtime in our pipeline (this has little to no effect when we’re at our current size!).
There was a really interesting post from dbt labs themselves on how they dynamically materialise tables depending on what environment their code was running in. In short, they materialise some models as views when running in development and staging (for speed of running code), but as tables when running in production (for downstream query speed).
Staging and intermediate tables
Previously, every table in our dbt codebase was either a staging (stg) table, a dimension (dim) table, or a fact (fct) table.
One struggle we had was separating which staging tables were just a rename & recast operation - selecting from a single source, and which were more complex bits of SQL that were joining different models together.
So, in line with dbt’s style guide, we introduced intermediate (int) models into our codebase. However, we’ve made some design choices that can be seen below with a snapshot from our DAG:
int_product__user_action_incident_creation defines the logic for when a user declares a “real” (not test or tutorial) incident in one place and reuses it in 2 separate places without having to define it twice:
dim_userson a timestamp level (e.g.
int_product__user_actions_by_dayat a daily level (e.g.
Given our pattern of enforcing
fct_user_statistics to run before
dim_users, so that the latter can pull the latest row’s worth of statistics data (e.g.
number_incidents_created_to_date), if we didn’t have this intermediate table then in
dim_users we’d have to either:
fct_user_statisticsuses as a CTE within
dim_usersto get timestamps for
first_incident_created_at- which is an antipattern because we’re defining the same logic in 2 places, or
first_incident_created_date, which seems a bit arbitrary to lose granularity given how simple it is to fix
int_product__user_actions_by_day is then simply pulled into
fct_user_statistics by a LEFT JOIN as it’s already on a date & user_id level. There are other
int_product__user_action_* tables that go into this rollup table that aren’t shown in the diagram.
int_product__user_stats_org_level gives us an easy way of rolling up daily user metrics to a date and organisation_id level, which is pulled into
fct_organisation_statistics as a simple LEFT JOIN.
It’s easier to review someone else’s code when you both format your code in a similar way. It’s also easier to write code when formatting is automatically taken care of for you.
Enter sqlfluff, a Python package that automatically formats your SQL code for you and is even compatible with dbt’s Jinja syntax!
Both of us currently develop using VSCode, and there’s a really nice extension that actually runs sqlfluff as you type and highlights errors - and even autoformats your code with a keyboard shortcut.
We use this in conjunction with a version controlled
.sqlfluff file that lets you configure sqlfluff rules and ignore rules you don’t like:
There are several benefits to this setup:
The only major changes to our Metabase setup have come from our folder structure. Beforehand, we had an “Our Analytics” folder (the default place you’d save to) and a scattering of folders for the different areas of incident (Product, Sales etc.).
We launched a couple of new folder structures which resulted in 4 main areas of Metabase:
The addition of the 2 new areas has made it simple to figure out where you need to go to look at how any of our new product launches are going, or if we’re running a deep dive analysis where we can save all of our charts in a one-off folder.
It also optimises for users who want to use Metabase as ‘read only’, as the core dashboards folder is basically the one-stop shop for everything we use to run the company with.
Importantly, it also doesn’t discourage anyone in incident.io from getting hands-on with data and saving their own questions and dashboards somewhere. There’s a delicate balance to strike between having “data approved” sections and people feeling like there’s a high barrier for what goes into Metabase.
A really nice perk of Metabase is that you can save a chart or a table in one place (as a question) and reuse it across several dashboards. So a question sat in our “Core - product questions” folder could appear in both our key metrics dashboard and our growth dashboard, and edits made to it are reflected in both places.
We’ve spent a fair amount of time over the past couple of months building out the core data models we’ll need to answer most of our questions in a quick and repeatable way. We’ve also made sure that our data pipeline is robust and well tested, so now our work more recently has shifted towards focusing very heavily on the analytics side of things!
From a data stack perspective, there are still a couple of things we’d like to look at in the coming months:
error(get this fixed now) and
warn(fix at some point this week), and get our CI/CD process set up to send
warnmessages to a separate slack channel - currently we’re only notified of
errormessages via CircleCI, and spot
warnmessages when we test locally
state:modified+selector to only run CI checks for modified files and downstream, rather than our whole pipeline. This is a very minor point given how quick our pipeline runs in general, but it’s a relatively easy change to put in for probably quite a good performance gain
We hope you enjoyed reading through what we’ve been up to in the world of data! If you’d like to join us, we’re hiring into our data team (and a bunch of other roles too).
Enter your details to receive our monthly newsletter, filled with incident related insights to help you in your day-to-day!
Intermittent downtime from repeated crashes
Making code-generation in Go more powerful with generics
Building great developer experience at a startup
Lisa Karlin Curtis