Your Data Warehouse Ignores Your Primary Key

2025-06-19

The agenda… Today we’re digging into the curious case of “Primary Keys” in modern data warehouses (think Redshift, BigQuery, Snowflake). On the menu: A high-stakes request from our VP What happens when a PRIMARY KEY doesn’t do what you think it does A hands-on experiment in Postgres to see why OLAP systems choose not to enforce a PK A dash of data quality, some opinions, and a tangent or two A little background Lets take Redshift for example.

Query Banks

2025-03-31

That query bank should be a semantic layer I’ve done it myself: “Oh we should just cache all these commonly used queries somewhere central, so we always have a nice definition for someone to use.” You could even extend the functionality of this “Query Bank” and train an LLM, or perform code reviews, or implement search, etc… Sweet, now anyone can grab some code and use it!

Table Creation

2025-03-17

Collection of thoughts on table creation. This is specifically regarding tables for analytics (OLAP queries, Data Warehouse, etc…). This is also specific to a larger organization, I don’t necessary apply these when working locally with duckDB per se. Get a good understanding of who’s going to use the table, what they need, and when they’ll need that data by. This will help inform your data model and manage dependencies. If you are working directly with another team have them list out all of the requirements and set a date when this list will be final.

Stop building on top of bad decisions

2025-01-16

One step back, two steps forward: Stop building on top off bad decisions and fix the bad decisions. Lets say you work with a bunch of people that are setting weekly goals. Each year goals are appended in a particular format to a table referenced by everyone to present goals throughout the organization. Long long ago in a galaxy far far away someone decided set the goal date column as type VARCHAR() so now you have goals associated with strings that range from ‘2025-05’ to ‘2025-02-05’ to ‘02/05/2025’.

DRY Analytics

2024-11-24

I was talking to one of our partner teams in a developing market. They are a newer analytics department of savvy Business Analysts & SCMs (Supply Chain Managers). Long story short, they are working on this new metric and I was like, How are you filtering for our specific products in your marketplace? And they were like: Oh we have this case statement that we include in all of our jobs…

Wide Tables

2024-10-06

A few considerations before building that extremely wide table. Every once in a while, an idea comes along to build a table with a wide array of columns at a particular granularity. This is usually associated with “self-serve analytics” and/or an attempt to define a “source of truth” table. Wide tables, self-serve analytics, and source of truth tables aren’t inherently good or bad… but there are some considerations to this type of table that I’m going to rant about.

[6 / 4 = 1 = 1.5]: Type Coercion and Precision in SQL

2024-07-06

An overview of what’s to come: I was originally writing this as a high-level introduction to send people when they encounter things like 6/4 returning 1 instead of 1.5 in some Database Management Systems (DBMS)… overtime its sprawled far beyond that. The points below serve as a TLDR, with a sprawling discussion expanding afterward. Operations like division in SQL can be confusing for novices or anyone without a programming background

Temporary Tables

2024-05-05

The database management system (DBMS) I’ll reference today is Redshift (AWS Cloud Datawarehouse offering) which is based on Postgres. I’ll dive into some of the anti-patterns around Temporary Tables I’ve seen abused during my work experience, why they don’t make sense, and how to fix them. Lets get it out of the way… Temporary Tables are not “better” than Common Table Expressions (CTEs) and CTEs are not “better” than temporary tables.