Updating our data stack

It’s been over 6 months since Lawrence’s excellent blog post on our data stack here at incident.io, and we thought it was about time for an update. This post runs through the tweaks we’ve made to our setup over the past 2 months and challenges we’ve found as we’ve scaled from a company of 10 people to 30, now with a 2 person data team (soon to be 3 - we’re hiring)!

To summarise, we’ve:

Migrated Fivetran to a separate BigQuery project to deal with concurrent query limits
Set up CircleCI to run the dbt pipeline sequentially rather than in parallel, in case the same pipeline is triggered twice (e.g. a scheduled run, and a manual run) and the runs would otherwise overlap
Moved from using the GOOGLEFINANCE function in Google Sheets to using an API feed for our GBP:USD FX rates
Materialised all of our dbt models as tables, instead of views, to optimise for query performance in Metabase
Introduced intermediate tables into our dbt pipeline, that sit in-between our staging tables and our reporting - dimension and fact - tables
Used the SQLFluff VSCode extension to automatically format our SQL code, and detect errors as we type
Introduced a new folder structure in Metabase that differentiates one-off analyses from those used by stakeholders across the company

A recap on our setup

Our data stack can be broadly split into 3 sections:

Extract & load:

We use Stitch to sync data from Close, our CRM tool
We use Segment to sync data from website activity and our web app (app.incident.io)
Our events data is written directly into BigQuery
All of our other external data sources, and our production data from our PostgreSQL database, are synced via Fivetran

Transform:

We’re big fans of dbt here, and use it for all of our SQL transformations and data model documentation & testing
We use CircleCI for running CI/CD (testing that code changes work before deploying, and deploying any changes), and generally running our dbt SQL code on a regular basis
We use hightouch to sync transformed data from BigQuery back into Vitally, our customer success tool - for metrics relating to customer health

Analyse:

We use Metabase for all of our internal analytics, as well as in our customer facing insights part of our web app
The excellent dbt-metabase Python package means that any changes to our data models in dbt are automatically reflected in Metabase whenever we merge a code change (via CircleCI). It can also set data types in Metabase from a dbt yml file, and even infers foreign key relationships whenever a relationships test is specified.

In our dbt yml file

The column in Metabase’s data model

We broadly stick with the ‘ELT’ approach to getting data into Google BigQuery, our analytics database, by using a mixture of extract & load tools to ingest raw data - then let dbt do the heavy lifting of transforming everything via SQL.

There are a few exceptions:

For LinkedIn, Twitter, and Intercom we use Fivetran’s dbt packages for social media reporting and Intercom. These do some very helpful standardised transformations that we’d have to create ourselves otherwise
For Google Analytics (web traffic data), Google doesn’t let you access the source data via Fivetran, so instead we use custom reports to bring in transformed data to BigQuery

What we’ve been updating

Extract & Load

Fivetran migration

At the end of June, we were starting to see some of our dbt test jobs fail in CircleCI without any clear error message. Each time, a single test (a different one each time) would time out after 10 minutes - despite each test not taking >5 seconds to run in BigQuery itself.

On inspecting our BigQuery monitoring page, it turned out that >100 concurrent interactive queries were running at the same time in our main BigQuery project, which is over the maximum quota of 100.

Fivetran syncs, hourly CircleCI runs of our master dbt branch, Metabase, and any ad-hoc queries run by us were all competing for resource in the same BigQuery project - not ideal!

We had a couple of options:

Change CircleCI to use batch instead of interactive queries, which would prevent concurrency issues but the code (and a result any code changes) would be slower to run
Move Fivetran to a separate BigQuery project

We went for option 2, as in any case, it’s good practice to have your source data stored separately from your analytics data - both for keeping the BigQuery projects from becoming overloaded with several schemas (datasets) and controlling who has access to raw data.

Fortunately, Fivetran is very easy to get set up for all of the different connectors, and this took us less than a day’s worth of work in total to get everything shifted across.

We’ll probably encounter some concurrency issues when we scale even further with more analysts, but for the time being this setup will keep us running nicely for the next 3-6 months.

Run concurrency

When code changes are merged, CircleCI will run our dbt pipeline from scratch. We also run the code every hour on a schedule. At the moment, our dbt pipeline takes ~5 minutes to run and test end to end (perks of being early stage B2B), so infrequently we’d have cases where a code change and a schedules run overlapped.

This presented some weird cases where 2 pipelines, both creating and dropping the same set of tables in the same dbt_production schema, were clashing.

The fix for this was fairly straightforward, we used the queue workflow orb developed by an engineer at CircleCI. This checks for concurrent workflows (our “deploy” workflow is essentially our dbt pipeline) running on the same branch, and makes a workflow wait for the prior one to finish before running.

Below is an example of our scheduled dbt pipeline (workflow) waiting for a prior run to finish before running the rest of our code.

As a side note, we’re going to look into testing dbt cloud later in the year as there are a lot of nice features that (unsurprisingly) tightly integrate with dbt itself.

FX rates

A lot of our Stripe subscriptions are denominated in GBP, but we report figures such as ARR in USD.

Previously, we had a simple Google sheet set up using the GOOGLEFINANCE function which pulled daily GBP:USD FX rates, which we could then linked into BigQuery and used in a dbt model.

However, there are some known issues with this function - primarily that it’s not reliable and frequently returns #N/A.

We needed a reliable set of FX rates that populates daily. So we solved this in 4 steps:

Take a historical list of GBP:USD FX rates from GOOGLEFINANCE and version control them in a CSV file, then load these into BigQuery using a dbt seed. A benefit here was being able to easily write tests to make sure this data never changed in terms of row count or values

Write a simple Python script to pull daily GBP:USD FX rates from the APILayer exchange rates API, write to a CSV file, then load this into BigQuery as a drop & replace. We can query up to 365 days’ worth of data from this API, so the only manual effort needed is, annually, to add a year’s worth of data to the seed file from step 1 and update the extraction dates
Stitch 1 and 2 together via a UNION ALL
Write dbt tests to ensure that every combination of date and currency pair is unique, that every currency pair has data up until the current date, and that for every currency pair the dates are gapless - i.e. if there are 99 days between our earliest and latest GBP:USD rates data then there should be 100 rows of data

Transform

Materialisation

A fairly straightforward change, but we changed all of our models to by default materialise as tables instead of views. This optimises performance downstream in Metabase when running queries - at the expense of a slightly longer runtime in our pipeline (this has little to no effect when we’re at our current size!).

There was a really interesting post from dbt labs themselves on how they dynamically materialise tables depending on what environment their code was running in. In short, they materialise some models as views when running in development and staging (for speed of running code), but as tables when running in production (for downstream query speed).

Staging and intermediate tables

Previously, every table in our dbt codebase was either a staging (stg) table, a dimension (dim) table, or a fact (fct) table.

One struggle we had was separating which staging tables were just a rename & recast operation - selecting from a single source, and which were more complex bits of SQL that were joining different models together.

So, in line with dbt’s style guide, we introduced intermediate (int) models into our codebase. However, we’ve made some design choices that can be seen below with a snapshot from our DAG:

int_product__user_action_incident_creation defines the logic for when a user declares a “real” (not test or tutorial) incident in one place and reuses it in 2 separate places without having to define it twice:

The data is pulled into dim_users on a timestamp level (e.g. first_incident_created_at)
The data is also rolled up in int_product__user_actions_by_day at a daily level (e.g. number_incidents_created)

Given our pattern of enforcing fct_user_statistics to run before dim_users, so that the latter can pull the latest row’s worth of statistics data (e.g. number_incidents_created_to_date), if we didn’t have this intermediate table then in dim_users we’d have to either:

Declare the same logic for “incident created” that fct_user_statistics uses as a CTE within dim_users to get timestamps for first_incident_created_at - which is an antipattern because we’re defining the same logic in 2 places, or
Lose the timestamp granularity and only be able to show first_incident_created_date, which seems a bit arbitrary to lose granularity given how simple it is to fix

int_product__user_actions_by_day is then simply pulled into fct_user_statistics by a LEFT JOIN as it’s already on a date & user_id level. There are other int_product__user_action_* tables that go into this rollup table that aren’t shown in the diagram.

int_product__user_stats_org_level gives us an easy way of rolling up daily user metrics to a date and organisation_id level, which is pulled into fct_organisation_statistics as a simple LEFT JOIN.

Linter

It’s easier to review someone else’s code when you both format your code in a similar way. It’s also easier to write code when formatting is automatically taken care of for you.

Enter sqlfluff, a Python package that automatically formats your SQL code for you and is even compatible with dbt’s Jinja syntax!

Both of us currently develop using VSCode, and there’s a really nice extension that actually runs sqlfluff as you type and highlights errors - and even autoformats your code with a keyboard shortcut.

We use this in conjunction with a version controlled .sqlfluff file that lets you configure sqlfluff rules and ignore rules you don’t like:

There are several benefits to this setup:

Enforces standards with indentation, table and column aliasing, and capitalisation
Very little effort, and the error highlights encourage you to fix as you go along
Single configuration that everyone uses

Analyse

The only major changes to our Metabase setup have come from our folder structure. Beforehand, we had an “Our Analytics” folder (the default place you’d save to) and a scattering of folders for the different areas of incident (Product, Sales etc.).

We launched a couple of new folder structures which resulted in 4 main areas of Metabase:

“Our Analytics” - save anything here, treat it like a sandbox
Area folders (Product, Sales etc.) - anything reusable that’s not a core dashboard we run the company with, or specific to a project or product launch
Data - an area for the data team to save anything relating to deep dive analyses, and dashboards used to monitor any new launches
Core Dashboards - a locked down area containing dashboards used to run the company with

The addition of the 2 new areas has made it simple to figure out where you need to go to look at how any of our new product launches are going, or if we’re running a deep dive analysis where we can save all of our charts in a one-off folder.

It also optimises for users who want to use Metabase as ‘read only’, as the core dashboards folder is basically the one-stop shop for everything we use to run the company with.

Importantly, it also doesn’t discourage anyone in incident.io from getting hands-on with data and saving their own questions and dashboards somewhere. There’s a delicate balance to strike between having “data approved” sections and people feeling like there’s a high barrier for what goes into Metabase.

A really nice perk of Metabase is that you can save a chart or a table in one place (as a question) and reuse it across several dashboards. So a question sat in our “Core - product questions” folder could appear in both our key metrics dashboard and our growth dashboard, and edits made to it are reflected in both places.

What’s next

We’ve spent a fair amount of time over the past couple of months building out the core data models we’ll need to answer most of our questions in a quick and repeatable way. We’ve also made sure that our data pipeline is robust and well tested, so now our work more recently has shifted towards focusing very heavily on the analytics side of things!

From a data stack perspective, there are still a couple of things we’d like to look at in the coming months:

How we can use dbt severity levels on tests to differentiate between error (get this fixed now) and warn (fix at some point this week), and get our CI/CD process set up to send warn messages to a separate slack channel - currently we’re only notified of error messages via CircleCI, and spot warn messages when we test locally
Using dbt’s state:modified+ selector to only run CI checks for modified files and downstream, rather than our whole pipeline. This is a very minor point given how quick our pipeline runs in general, but it’s a relatively easy change to put in for probably quite a good performance gain

We hope you enjoyed reading through what we’ve been up to in the world of data! If you’d like to join us, we’re hiring into our data team (and a bunch of other roles too).