Your Data Warehouse Ignores Your Primary Key
The agenda…⌗
Today we’re digging into the curious case of “Primary Keys” in modern data warehouses (think Redshift, BigQuery, Snowflake). On the menu:
- A high-stakes request from our VP
- What happens when a `PRIMARY KEY` doesn't do what you think it does
- A hands-on experiment in Postgres to see why OLAP systems choose not to enforce a PK
- A dash of data quality, some opinions, and a tangent or two
A little background⌗
Let's take Redshift as an example. In Redshift, primary keys are informational only; they are not enforced. However, the query optimizer will trust our declaration and use it to create more efficient execution plans. How silly of Redshift to trust us like that.
This is not shocking to the seasoned data warehouse connoisseur, but to the uninitiated it may look unnerving. So for anyone unfamiliar, here's what I mean.
Our VP needs to know⌗
Our VP needs to know whether Bob is hungry or not to understand whether the company is doing well or poorly.
Shit, we deleted all our data due to PII (Personally Inflatable Information) concerns when the EPA FBI was hot on our ass. We can't tell our VP that we don't know anything about Bob, so let's insert some random records. We'll need to start from scratch!
First, let's make a new table and define a primary key.
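Something along these lines; the table and column names here are my reconstruction (they match the `check123`/`some_id`/`an_attribute` names that show up in the query plans later in the post):

```sql
CREATE TABLE check123 (
    some_id      INT PRIMARY KEY,   -- informational only; Redshift will not enforce this
    an_attribute VARCHAR(300)
);
```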
Alright, let's manually insert our correct data!
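A sketch of the kind of insert I mean; the values are made up, and note the duplicate `some_id` values we're about to swear are unique:

```sql
INSERT INTO check123 VALUES
    (1, 'bob is hungry'),
    (1, 'bob is not hungry'),   -- duplicate some_id: Redshift happily accepts it
    (2, 'bob is hungry'),
    (3, 'bob is ambivalent');
```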
And run the important query that our VP needs. We don't have time for data quality checks, so let's throw it into Excel and send it everywhere.
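Something like this, assuming the reconstructed names above:

```sql
SELECT DISTINCT some_id
FROM check123;
```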
Oh no… we have duplicate records even after writing `SELECT DISTINCT`. But I thought `SELECT DISTINCT` would remove duplicates? Well, yes, but… we declared SOME_ID as a `PRIMARY KEY`, effectively telling Redshift, "I solemnly swear this column contains no duplicates." The query optimizer, trusting us completely, chooses the most efficient execution plan: it just scans the column and hands it back without performing a costly deduplication step. Why would it? We already promised it was unique.
Imagine someone gives you a book with 1,000,000 unsorted names in it and tells you they are unique. Then they say, "cross out any duplicate names and hand the book back to me". Your initial reaction is to immediately throw that book back, as checking the uniqueness of one million names sounds… well… awful.
In this case, the query optimizer is just handing us the book back. If we were to run:
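The query in question would be something like (same reconstructed names as above):

```sql
EXPLAIN
SELECT DISTINCT some_id
FROM check123;
```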
We’d get something back like:
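Roughly this shape; the exact costs, row counts, and widths are illustrative and will differ on your cluster:

```
XN Seq Scan on check123  (cost=0.00..0.07 rows=7 width=4)
```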
That’s just a SEQ SCAN handing us back the entire column because we’ve already told Redshift these values are unique. In contrast, what if we scan a non-PK column for uniqueness:
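For example:

```sql
EXPLAIN
SELECT DISTINCT an_attribute
FROM check123;
```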
And there you have it, now we get:
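A plan in this shape (the costs shown are the ones discussed in the tangent below; yours will vary):

```
XN Unique  (cost=0.00..0.09 rows=7 width=304)
  ->  XN Seq Scan on check123  (cost=0.00..0.07 rows=7 width=304)
```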
You'll find `XN Unique` in query plans when `DISTINCT` or `UNION` (which implies distinctness, unlike `UNION ALL`) are used. Our DBMS now returns a unique result set from values it does not know are unique. I go on a tangent about this below if you're interested.
Click to view a tangent on XN Unique & XN HashAggregate
If we have an index on the column, deduplication can be fast; I'll link the Postgres docs on B-Tree indexes, specifically the section on deduplication. But what if we don't have an index… Redshift doesn't even have indexes… Redshift just has metadata (min-max values) for blocks written to disk, called zone maps.
The Redshift docs note that `XN Unique` "Removes duplicates for SELECT DISTINCT queries and UNION queries." This gets weird because I'd assume with this operator that we are sorting and plucking unique values from the column. However, there is no `XN Sort` in that plan. `XN Sort`: "Evaluates the ORDER BY clause and other sort operations, such as sorts required by SELECT DISTINCT queries." Our `SELECT DISTINCT` query above doesn't have an `XN Sort`; it only contains an `XN Unique`. Maybe `XN Unique` implies an `XN Sort`? I generated an explain plan for a relation whose rows were 100% unsorted according to STV_TBL_PERM and that plan did NOT include an `XN Sort` either…
I think a couple things could be happening:
- The optimizer can choose between different strategies to perform our `SELECT DISTINCT`, and in these cases it's actually choosing a hashing strategy but just labeling it as `XN Unique`.
- `XN Unique` implies a sorting operation, and we're just not being shown that in the plan.
For example if we take:
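Assuming the same example table as above:

```sql
EXPLAIN
SELECT DISTINCT an_attribute
FROM check123;
```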
And re-write it as:
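The GROUP BY form of the same question:

```sql
EXPLAIN
SELECT an_attribute
FROM check123
GROUP BY an_attribute;
```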
We get:
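A plan roughly like this (the `0.09..0.09` cost is the figure compared in the next paragraph; everything else is illustrative):

```
XN HashAggregate  (cost=0.09..0.09 rows=7 width=304)
  ->  XN Seq Scan on check123  (cost=0.00..0.07 rows=7 width=304)
```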
This is similar to `XN Unique` but the costs are different: `cost=0.00..0.09` and `cost=0.09..0.09` respectively. The first number is the relative cost of returning the first row for this operation. The second value, in this case 0.09, provides the relative cost of completing the operation. So I'd surmise that depending on how we write this query:
- `XN HashAggregate` is a blocking operator designed for GROUP BY clauses; it must consume its entire input to correctly compute aggregate functions like COUNT() before returning any rows.
- `XN Unique` is a specialized operator for SELECT DISTINCT that can implement a streaming (non-blocking) hash algorithm, allowing it to output unique rows as soon as they are encountered.
This seems to explain why Unique can have a zero startup cost while HashAggregate’s startup cost reflects processing its entire input.
On another note, I'm only able to get an `XN Sort` operator by explicitly using `ORDER BY`:
```
XN Merge  (cost=1000000000000.19..1000000000000.20 rows=7 width=304)
  Merge Key: an_attribute
  ->  XN Network  (cost=1000000000000.19..1000000000000.20 rows=7 width=304)
        Send to leader
        ->  XN Sort  (cost=1000000000000.19..1000000000000.20 rows=7 width=304)
              Sort Key: an_attribute
              ->  XN Unique  (cost=0.00..0.09 rows=7 width=304)
                    ->  XN Seq Scan on check123  (cost=0.00..0.07 rows=7 width=304)
```
Anyway, our VP is pissed because we sent trash, but why would an OLAP DBMS do something like this?
PK in name only⌗
We can see that our primary keys exist in name only. The DBMS is not going to enforce them; it's going to assume they are right, and if you want to enforce them then you'll need to write that logic yourself.
So what could be the reasons why thoughtful and well-paid engineers, TPMs, directors, and VPs building these cloud data warehouses decided to allow people to designate primary keys that aren't enforced? I'll boil the arguments down to the following:
Write performance⌗
What do we mean by write performance, and can we actually take a look at a DBMS and see this impact on performance?
Yes, we can:
- spin up Postgres inside of a container
- create tables with/without PK
- write data to said tables
- measure performance
- and even look a little deeper at what Postgres is doing
A couple of caveats before we start, so I don't get dunked on…
- Single Node vs. MPP: I'm running Postgres inside a container on my [kindle](Insert sponsored link) Mac, which is completely different than the orchestration/networking effort required for a cloud data warehouse with dozens of compute nodes
- Row-Oriented vs. Columnar: Postgres is row-oriented; an entire row (SOME_ID, AN_ATTRIBUTE, ANOTHER_ATTRIBUTE, SOME_NUMBER) is stored together. Redshift utilizes columnar storage; each column is stored in its own block(s). This is fantastic for aggregates like `SELECT SUM(SOME_NUMBER)` but makes single-row operations inefficient. If we wanted to update or insert a new row in Redshift while enforcing the primary key we'd have to:
- Search for the existence of/find the id of the record we are updating in its block (If it doesn’t exist then insert)
- Then find the corresponding attributes in their blocks, so on and so forth for every column
- Just to reconstruct one tuple for an update… the opposite of what columnar stores are designed for
This leads into the second point people will make… that this isn't what the DW is for. You can now see that these inserts are going to be gnarly, so why not have some sort of slowly changing dimension that allows for faster writes, and control changes to facts in an OLTP DBMS better suited to handling transactional workloads?
Insert with & w/o a PK⌗
Anyway… let's do our little test:
First we pull the docker image and run a container in detached mode:
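Something like the following; the container name matches what we use below, and the password is just a throwaway for this local experiment:

```bash
docker pull postgres
docker run --name postgres-test -e POSTGRES_PASSWORD=postgres -d postgres
```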
We can check the status of our new container with `docker ps -a`, then run `docker exec -it postgres-test psql -U postgres` to get a psql shell. We'll create two tables, one with a PK and one without, and use `generate_series()` to curate some faux data for us to write to these tables, something like:
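A sketch of the generator query; the md5/random expressions are my stand-ins for whatever the original used:

```sql
SELECT
    gs                      AS some_id,
    md5(gs::text)           AS an_attribute,
    md5((gs * 2)::text)     AS another_attribute,
    (random() * 1000)::int  AS some_number
FROM generate_series(1, 10) AS gs;
```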
So let's create our two tables, one with a PK and one without:
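A minimal version; the table names match what we inspect below, while the column types are assumptions:

```sql
CREATE TABLE no_pk_test (
    some_id           BIGINT,
    an_attribute      TEXT,
    another_attribute TEXT,
    some_number       INT
);

CREATE TABLE with_pk_test (
    some_id           BIGINT PRIMARY KEY,
    an_attribute      TEXT,
    another_attribute TEXT,
    some_number       INT
);
```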
Click to view a deep dive into Postgres internals
Now we have two tables, and `with_pk_test` has a primary key. We can see it has an associated index:
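One way to check (psql's `\d with_pk_test` works too):

```sql
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'with_pk_test';
-- indexname: with_pk_test_pkey
```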
Each table and index is stored in a separate file. We can also see the initial size of this index; Postgres has allocated the first 8 kB page for our B-tree:
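For example, using Postgres's size functions:

```sql
SELECT pg_size_pretty(pg_relation_size('with_pk_test_pkey'));
-- 8192 bytes: a single 8 kB page
```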
We can get the OIDs of all databases with:
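```sql
SELECT oid, datname FROM pg_database;
```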
Now we can exit psql (`\q`) and use bash instead: `docker exec -it postgres-test bash`. We can cd to the base directory and list the directories.
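A sketch of those shell steps, run inside the container:

```bash
# PGDATA defaults to /var/lib/postgresql/data in the official image;
# each database lives in a subdirectory of base/ named after its OID
cd /var/lib/postgresql/data/base
ls
```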
Nice… 1, 4, and 5 match the OIDs for the databases we listed above, and now if we cd to `5` and `ls -lh` we'll find:
```
-rw------- 1 postgres postgres    0 Jun 23 16:30 16388 <-- OID of our relation no_pk_test
-rw------- 1 postgres postgres    0 Jun 23 16:30 16391
-rw------- 1 postgres postgres 8.0K Jun 23 16:30 16392
-rw------- 1 postgres postgres    0 Jun 23 16:31 16393 <-- OID of our relation with_pk_test
-rw------- 1 postgres postgres    0 Jun 23 16:31 16396
-rw------- 1 postgres postgres 8.0K Jun 23 16:31 16397
-rw------- 1 postgres postgres 8.0K Jun 23 16:31 16398 <-- OID of our index with_pk_test_pkey
```
Alright, with that deep dive out of the way, let's insert 10MM records using `generate_series()` into these empty relations:
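Something like this (the faux-data expressions are assumptions; `\timing` in psql reports the durations quoted below):

```sql
\timing on

-- No primary key: just append the rows
INSERT INTO no_pk_test
SELECT gs, md5(gs::text), md5((gs * 2)::text), (random() * 1000)::int
FROM generate_series(1, 10000000) AS gs;

-- Primary key: every insert also maintains the B-tree index
INSERT INTO with_pk_test
SELECT gs, md5(gs::text), md5((gs * 2)::text), (random() * 1000)::int
FROM generate_series(1, 10000000) AS gs;
```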
Yep, the insert into the table with the PK takes closing in on twice as long (3846.560 ms vs 6038.888 ms). And just for fun we can check on the size of our tables/index.
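One way to get those numbers, sketched with Postgres's size functions (the aliases match the table below):

```sql
SELECT relname                                      AS table_name,
       pg_size_pretty(pg_table_size(oid))           AS table_size,
       pg_size_pretty(pg_indexes_size(oid))         AS indexes_size,
       pg_size_pretty(pg_total_relation_size(oid))  AS total_size
FROM pg_class
WHERE relname IN ('with_pk_test', 'no_pk_test');
```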
| table_name | table_size | indexes_size | total_size |
|---|---|---|---|
| "public"."with_pk_test" | 651 MB | 214 MB | 865 MB |
| "public"."no_pk_test" | 651 MB | 0 bytes | 651 MB |
And for even more fun, we can watch the insert time explode further after adding a secondary index.
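For example, something like this (the column choice and index name are arbitrary):

```sql
-- Hypothetical secondary index on a non-key column; re-running the same
-- 10MM-row insert now has to maintain two B-trees instead of one
CREATE INDEX with_pk_test_an_attribute_idx ON with_pk_test (an_attribute);
```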
There is NO free lunch: sure, you can make your reads faster, but at what cost?
- Are people even using the index?
- What's the ratio of writes to reads on this table anyway?
- I wonder how many B+ Trees I can fit in my 2012 Chevy impala LT?
- These are the questions we need to answer
Merge Upsert⌗
What happens if we insert additional records into our tables that already have 10,000,000 records in them? We could compare the data-warehouse-style insert, where we insert everything, vs. a merge upsert, where we update records that already exist and insert new records that don't.
Let’s regenerate IDs spanning 9,000,001 -> 11,000,000 so we get a mix of new and already existing records and perform our two writes.
We can perform our basic insert into `no_pk_test`:
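A sketch of that write; I'm generating the new 9,000,001 -> 11,000,000 batch inline rather than staging it first, which the original may have done differently:

```sql
\timing on

INSERT INTO no_pk_test
SELECT gs, md5(gs::text), md5((gs * 2)::text), (random() * 1000)::int
FROM generate_series(9000001, 11000000) AS gs;
```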
Now what about our upsert? We can use `ON CONFLICT`, courtesy of Peter Geoghegan, Heikki Linnakangas, Andres Freund, and Jeff Janes.
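A minimal sketch of the upsert, assuming the same generated batch and the `some_id` primary key:

```sql
INSERT INTO with_pk_test
SELECT gs, md5(gs::text), md5((gs * 2)::text), (random() * 1000)::int
FROM generate_series(9000001, 11000000) AS gs
ON CONFLICT (some_id) DO UPDATE
SET an_attribute      = EXCLUDED.an_attribute,
    another_attribute = EXCLUDED.another_attribute,
    some_number       = EXCLUDED.some_number;
```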
The merge upsert takes much longer (598.475 ms -> 3287.132 ms). These are two different kinds of writes, but it's generally indicative of what you might be trying to accomplish in an OLAP system vs. OLTP. Hopefully this sheds some light on what people mean when they bring up the "performance" argument.
These constraints should be enforced upstream⌗
This is idea #2.
And this is opening a can of opinions, hot takes, and worms. I think in theory maybe this works, but in the real world there are certainly people with relations sitting in their DW that could benefit from the safety of PK enforcement or, at least, from having that option. So on one hand we have quick writes/optimized plans, and on the other hand all those performance gains are a wash because we have to write those deduping checks anyway and people are confused that their `PRIMARY KEY` has duplicates in it.
Opinion Time! I don't understand why the syntax in Redshift couldn't have been changed to fully remove confusion. If you read the documentation it's clear, but why call it a `PRIMARY KEY` when it could have been `PRIMARY KEY NOT ENFORCED`… a little clarity over brevity. Databricks has the `RELY` option. Snowflake doesn't enforce them in their standard table offering but does enforce them for hybrid tables. BigQuery uses `NOT ENFORCED`.
(1048064189, Have fun out there!), (1048064189, Hope you enjoyed the blog!), (1048064189, Take Care!)