You too can create a marketing data warehouse
Repeatable analyses, error reduction, and better outcomes can be yours, with no data engineer required
In this article: a guide to using modern data tooling and LLMs to enable semi-technical business folks to create their own data pipelines and data warehouses. I hope you enjoy it; please give me feedback on LinkedIn!
I’m no data engineer, but I am the proud creator of multiple sales/marketing data warehouses. Thanks to maturing, easier-to-use data tools and large language models that are pretty great at SQL, it’s now possible for semi-technical folks like me to stand up data warehouses that can be used for automated reporting, better analysis, and improved outcomes.
I’ve written previously about the jump in data sophistication many marketing teams are making by embracing media mix modeling (MMM) and incrementality measurement. Still, many marketing teams are in no position to do it: their data consists of a copy/pasted, manually combined metrics report, a process that gets repeated the next week, and the week after that.
There’s a better way. If you’re semi-technical (familiar with SQL and Zapier basics, say), oriented towards sales/marketing operations, and ready to learn more, this article is for you.
It’s not a step-by-step tutorial, but I will link to deeper guides, and illustrate how LLMs make navigating this process much faster.
For clarity and conciseness, I’ll stick with vendors I’ve used successfully in the past, but they’re not the only choices out there, and others may work better in your context!
Eventually, real engineering knowledge and know-how will be required to speed up processes, prevent errors, and control costs, but not yet!
How to build a sales and marketing data warehouse
⚠️ Concept Alert: “ETL” or “ELT”. As seen in the “Extract-Load-Transform” diagram above, data pipelines involve extracting data from its source format, loading it into a destination database, then transforming it into the clean formats your teams need.
1. Choose a data warehouse
Google’s BigQuery is a particularly good option because of its inexpensive pricing, its easy-to-use query interface for people who don’t like command lines, and its built-in integrations with Google Sheets and Google Analytics 4. Snowflake and Redshift are also popular.
BigQuery Tutorial (Coupler)
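To get a feel for the query interface, here’s a minimal sketch of a query you could paste into the BigQuery console. It runs against one of Google’s free public datasets, so you need nothing beyond a Google Cloud project:

```sql
-- Top 10 most common names in a BigQuery public dataset
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10;
```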
2. Extract and load data into the warehouse
Now that you’ve got somewhere for data to go, you need to get data into it. The backbone of this process is a tool like Airbyte, which retrieves data from the source and loads it into BigQuery or your data warehouse of choice.
⚠️ Concept Alert: IAM. One potentially confusing part of getting these pipelines going is security and permissions. As opposed to regular SaaS, where you might use Google Single Sign-On, in BigQuery you’ll be granting very specific permissions to “service accounts”: dedicated bot users that can log into the database to do those specified tasks. “IAM” (Identity and Access Management), service accounts, roles, permissions, and entitlements are all part of this world.
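As one simplified example, BigQuery lets you grant a role to a service account directly in SQL (in practice you may do this in the Cloud Console UI instead; the project, dataset, and account names below are placeholders):

```sql
-- Let a hypothetical Airbyte service account write into a raw dataset
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.raw_marketing_data`
TO "serviceAccount:airbyte-loader@my-project.iam.gserviceaccount.com";
```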
Within this “source” data, there are two broad types of data to think about for a sales or marketing use case with small/medium-sized data sources.
Automated source data
Large datasets, the basic ingredients of your data pipeline, come from platforms like Meta, Google, Salesforce, etc. These tend to have complex schemas that come across in a standardized way — usually not what you need out of the box.
To manipulate it, you need to get the data into your own warehouse using a tool that grabs it every night, every hour, or however often you need: Airbyte.
Airbyte is great at getting data from these standardized platforms, but as you get to your bespoke business data, things often need to be more customized.
Human-labeled data
If automated data are the basic ingredients, human-labeled data are the spice that makes them appetizing! To make the data match how folks actually think about the business, you’ll often need extra inputs that aren’t already captured anywhere. Some of this can come from internal systems, but often, it’s kept in Google Sheets or Excel. Some example use cases:
Sorting: Adding a “Focus Accounts” label so you can see high-priority accounts versus the rest;
Tagging: Tagging ad creative with attributes like its visual style, script, human-readable name, or theme;
Cleanup: Excluding test accounts, irrelevant competitors, etc.
Surprisingly, getting this human-labeled data into your data warehouse is sometimes trickier than the big stuff. If your org uses Airtable, Monday.com, or another user-friendly database app or customizable CRM, that can be cleanly synced to your warehouse. BigQuery’s native Google Sheets integration can work, but you’ll need CAST statements everywhere and it can be easy to break. There are other options like Coupler as well.
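To illustrate the CAST problem: a Sheets-backed external table tends to surface loosely typed values, so queries against it end up looking something like this sketch (the table and column names are made up):

```sql
-- Query a hypothetical Sheets-backed external table of focus accounts;
-- SAFE_CAST returns NULL instead of erroring when a cell contains junk
SELECT
  account_name,
  SAFE_CAST(is_focus_account AS BOOL) AS is_focus_account,
  SAFE_CAST(target_spend AS NUMERIC) AS target_spend
FROM `my-project.sheets_inputs.focus_accounts`
WHERE account_name IS NOT NULL;
```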
3. Transform it by going from raw to readable with dbt and LLMs
Next up, it’s going to get a little engineering-y as you mix the ingredients together. In the spreadsheet world, analytical reports are generated by exporting from a vendor system and then summing things up with formulas or a pivot table. That’s quick and easy the first time, but as data updates and a new report is needed, the pivot table has to be re-created and updated, which quickly gets laborious and error-prone.
Start in reverse: what data do you need in the final report, and what does it need to look like? Can you do it in Google Sheets first? After that, you’ll build the reports out in SQL queries until there are a few final views that your dashboard can read.
⚠️ Concept Alert: Transformation. Transformation in a data pipeline is the equivalent of the “pivot table”: SQL queries that are run and refreshed on a regular basis, one covering each step. Fully describing how transformation should be designed is a whole topic (check out a guide here), but you’ll frequently encounter these repeating themes:
Joining like data together: Meta and YouTube can finally live together in harmony, but only if you join the data into a free-standing `combined` table (see the SQL sketch after this list);
Consistency through renaming: Is this ad spend called `television` or `broadcast`? `Spend` versus `Amount` versus `amt`?
Cleaning: Need to exclude those first campaigns with `[TEST]` in the name?
“Types”: Spreadsheets are typically very flexible about the type of data in a cell (a number versus text, for instance). Databases are not. Numbers need to be numbers, text needs to be strings, and sometimes that varies across sources.
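Here’s a minimal sketch of what those themes look like in one query, assuming two raw tables loaded by Airbyte (all table and column names are illustrative):

```sql
-- Combine hypothetical raw Meta and YouTube tables into one spend table,
-- renaming columns to a shared vocabulary and excluding [TEST] campaigns
SELECT
  'meta' AS channel,
  campaign_name,
  DATE(date_start) AS spend_date,
  CAST(spend AS NUMERIC) AS spend
FROM `my-project.raw_meta.ads_insights`
WHERE campaign_name NOT LIKE '%[TEST]%'

UNION ALL

SELECT
  'youtube' AS channel,
  campaign AS campaign_name,
  DATE(day) AS spend_date,
  CAST(cost AS NUMERIC) AS spend
FROM `my-project.raw_google_ads.campaign_stats`
WHERE campaign NOT LIKE '%[TEST]%';
```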
The backbone tool here is dbt Cloud, which enables you to write those queries in plain SQL, test and deploy them, and have it all run and refresh regularly.
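In dbt, a query like the one above becomes a “model”: a SQL file with a bit of templating. A sketch, with illustrative model names:

```sql
-- models/combined_ad_spend.sql
-- {{ ref() }} tells dbt this model depends on two upstream staging models
{{ config(materialized='table') }}

SELECT channel, campaign_name, spend_date, spend
FROM {{ ref('stg_meta_spend') }}

UNION ALL

SELECT channel, campaign_name, spend_date, spend
FROM {{ ref('stg_youtube_spend') }}
```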
Leverage LLMs for errors, SQL questions, and optimizations
While I’m assuming some comfort with basic SQL queries for this article, an LLM really helps with next-level work that’d traditionally be the province of a data engineer: complex queries, date-time operations, dbt-specific questions, and more. Here are a few sample prompts to think about for various challenges (the pro-rata one is worked through in a sketch after this list):
How do I write a “case statement” in BigQuery SQL?
Is there anything about this query that can be optimized for performance in BigQuery?
How do I write a SQL query that takes a table that looks like [table] and aggregates it to show differences in ad spending across accounts?
If I have an advertising campaign that has a start date, end date, and total amount spent, how can I “pro-rata” this among each implied day of spend to make it into a daily spend table?
Can you make this [query] into a dbt model?
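For instance, the pro-rata prompt might get you back something like this sketch (assuming a campaigns table with those three columns; the table name is made up):

```sql
-- Spread each campaign's total spend evenly across its run dates
SELECT
  campaign_id,
  spend_date,
  total_spend / (DATE_DIFF(end_date, start_date, DAY) + 1) AS daily_spend
FROM `my-project.marketing.campaigns`,
  UNNEST(GENERATE_DATE_ARRAY(start_date, end_date)) AS spend_date;
```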
👀 Like with everything LLM-related, you need to check the results. Even if a query runs, it may not be right; sometimes you can ask “was that right?” and it will fix itself.
Implementing Scheduling
Once you have SQL queries that create clean and usable versions of your data, put them on a schedule to run automatically. The free version of dbt Cloud can run them on recurring schedules, like once a night. If you upgrade to the paid plan (newly $100/mo), Airbyte can notify dbt that data is done loading and that it’s time to kick off the cleanup job.
Visualize and Analyze
Dashboarding tools, Google Sheets’ Connected Sheets, and BigQuery’s own SQL Editor are popular ways to digest the data.
Within the Google Cloud universe, Looker Studio (formerly Data Studio) is a solid and free option to get started with a dashboard. If your company is already using another dashboarding solution like Tableau or Metabase, use that.
Google Sheets’ Connected Sheets lets you access BigQuery datasets within the familiar Google Sheets confines.
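Whichever tool you choose, it helps to point it at a small number of stable, final views rather than raw tables. A sketch of one such view (names are illustrative):

```sql
-- A final reporting view for dashboards and Connected Sheets to read
CREATE OR REPLACE VIEW `my-project.reporting.daily_channel_spend` AS
SELECT channel, spend_date, SUM(spend) AS total_spend
FROM `my-project.analytics.combined_ad_spend`
GROUP BY channel, spend_date;
```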
Key Takeaways
With modern data infrastructure and large language models, enterprising sales ops and marketing folks can create their own data pipelines without deep engineering skills.
This democratization of data unlocks new analytics capabilities for marketing and growth teams, from saving stakeholders a lot of reporting time to more advanced analysis of program performance.