The agenda…

Today we’re digging into the curious case of “Primary Keys” in modern data warehouses (think Redshift, BigQuery, Snowflake). On the menu:

  • A high-stakes request from our VP
  • What happens when a PRIMARY KEY doesn’t do what you think it does
  • A hands-on experiment in Postgres to see why OLAP systems choose not to enforce a PK
  • A dash of data quality, some opinions, and a tangent or two

A little background

Let's take Redshift as an example. In Redshift, primary keys are informational only; they are not enforced. However, the query optimizer will trust our declaration and use it to create more efficient execution plans. How silly of Redshift to trust us like that.

This is not shocking to the seasoned data warehouse connoisseur, but to the uninitiated it may look unnerving. So, for anyone unfamiliar, here's what I mean.

Our VP needs to know

Our VP needs to know if Bob is hungry or not to understand whether the company is doing good or bad.

Shit, we deleted all our data due to PII (Personally Inflatable Information) concerns when the EPA FBI was hot on our ass. We can't tell our VP that we don't know anything about Bob, so we'll need to start from scratch and insert some random records!

First, let's make a new table and define a primary key.

-- REMEMBER! When storing strings, make sure to take up as much space as possible.
-- Defining these columns as CHAR pads strings with blanks to ensure we waste as much space as we can.
-- Everyone will love you for this, and you'll probably get promoted!
CREATE TABLE CHECK123 (
    SOME_ID CHAR(4096) PRIMARY KEY, -- <-- Ahh, a nice PK
    AN_ATTRIBUTE CHAR(4096),
    ANOTHER_ATTRIBUTE CHAR(4096),
    SOME_NUMBER INT4
);

Alright, let's manually insert our correct data!

INSERT INTO CHECK123
VALUES ('abc123', 'Bob', 'Hungry', 2) -- <-- Hmmm...
     , ('wow', 'Big Tom', 'Fast', 86)
     , ('this', 'Dale', 'Thoughtful', 12)
     , ('is', 'Doug', 'Tired', 9)
     , ('pretty', 'Hammer time', 'Happy', 6)
     , ('annoying', 'The bug unit', 'Angry', 6)
     , ('abc123', 'Whopper', 'Sad', 6); -- <-- Hmmm...

And run the important query our VP needs. We don't have time for data quality checks, so let's throw the results into Excel and send them everywhere.

SELECT SOME_ID
     , COUNT(1)
FROM CHECK123
GROUP BY SOME_ID;

abc123,2 -- UH OH!!!!
wow,1
this,1
is,1
pretty,1
annoying,1

SELECT DISTINCT SOME_ID FROM CHECK123;

abc123 -- UH OH!!!!
wow
this
is
pretty
annoying
abc123  -- UH OH!!!!

Oh no… we have duplicate records even after writing SELECT DISTINCT. But I thought SELECT DISTINCT would remove duplicates? Well, yes, but… we declared SOME_ID as a PRIMARY KEY, effectively telling Redshift, “I solemnly swear this column contains no duplicates.” The query optimizer, trusting us completely, chooses the most efficient execution plan: it just scans the column and hands it back without performing a costly deduplication step. Why would it? We already promised it was unique.

Imagine someone gives you a book with 1,000,000 unsorted names in it and tells you they are unique. Then they say, “cross out any duplicate names and hand the book back to me”. Your initial reaction is to immediately throw that book back, because checking one million names for uniqueness sounds… well… awful.

In this case, the query optimizer is just handing us the book back. If we were to run:

EXPLAIN SELECT DISTINCT SOME_ID FROM CHECK123;

We’d get something back like:

XN Seq Scan on check123 (cost=0.00..0.07 rows=7 width=304)

That’s just a SEQ SCAN handing us back the entire column because we’ve already told Redshift these values are unique. In contrast, what if we scan a non-PK column for uniqueness:

SELECT DISTINCT AN_ATTRIBUTE FROM CHECK123;

And there you have it, now we get:

XN Unique  (cost=0.00..0.09 rows=7 width=304)
    -> XN Seq Scan on check123 (cost=0.00..0.07 rows=7 width=304)

You’ll find XN Unique in query plans when DISTINCT or UNION (which implies distinctness, unlike UNION ALL) is used. Our DBMS now returns a unique result set from values it does not know are unique. I go on a tangent about this below if you’re interested.

A tangent on XN Unique & XN HashAggregate

If we have an index on the column, deduplication can be fast; I’ll link the Postgres docs on B-Tree indexes, specifically the section on deduplication. But what if we don’t have an index… Redshift doesn’t even have indexes… Redshift just keeps metadata (min/max values) for blocks written to disk, called Zone Maps.

The Redshift docs note that XN Unique “Removes duplicates for SELECT DISTINCT queries and UNION queries.” This gets weird because I’d assume this operator means we are sorting and plucking unique values from the column. However, there is no XN Sort in that plan. XN Sort “Evaluates the ORDER BY clause and other sort operations, such as sorts required by SELECT DISTINCT queries.” Our SELECT DISTINCT query above doesn’t have an XN Sort; it only contains an XN Unique. Maybe XN Unique implies an XN Sort? I generated an explain plan for a relation whose rows were 100% unsorted according to STV_TBL_PERM, and that plan did NOT include an XN Sort either…

I think a couple things could be happening:

  1. The optimizer can choose between different strategies to perform our SELECT DISTINCT and in these cases it’s actually choosing a hashing strategy but just labeling it as XN Unique.
  2. XN Unique implies a sorting operation, and we’re just not being shown that in the plan.

For example, if we take:

SELECT DISTINCT AN_ATTRIBUTE FROM CHECK123;

And re-write it as:

SELECT AN_ATTRIBUTE FROM CHECK123 GROUP BY AN_ATTRIBUTE;

We get:

XN HashAggregate  (cost=0.09..0.09 rows=7 width=304)
    -> XN Seq Scan on check123 (cost=0.00..0.07 rows=7 width=304)

This is similar to XN Unique, but the costs differ: cost=0.00..0.09 and cost=0.09..0.09, respectively. The first number is the relative cost of returning the first row for the operation. The second value, in this case 0.09, is the relative cost of completing the operation. So I’d surmise that, depending on how we write this query:

  • XN HashAggregate is a blocking operator designed for GROUP BY clauses; it must consume its entire input to correctly compute aggregate functions like COUNT() before returning any rows.
  • XN Unique is a specialized operator for SELECT DISTINCT that can implement a streaming (non-blocking) hash algorithm, allowing it to output unique rows as soon as they are encountered.

This seems to explain why Unique can have a zero startup cost while HashAggregate’s startup cost reflects processing its entire input.

On another note, I’m only able to get an XN Sort operator by explicitly using ORDER BY:

SELECT DISTINCT AN_ATTRIBUTE FROM CHECK123 ORDER BY AN_ATTRIBUTE;
XN Merge  (cost=1000000000000.19..1000000000000.20 rows=7 width=304)               
  Merge Key: an_attribute                                                          
  ->  XN Network  (cost=1000000000000.19..1000000000000.20 rows=7 width=304)       
        Send to leader                                                             
        ->  XN Sort  (cost=1000000000000.19..1000000000000.20 rows=7 width=304)    
              Sort Key: an_attribute                                               
              ->  XN Unique  (cost=0.00..0.09 rows=7 width=304)                    
                    ->  XN Seq Scan on check123  (cost=0.00..0.07 rows=7 width=304)

Anyway, our VP is pissed because we sent trash, but why would an OLAP DBMS do something like this?

PK in name only

We can see that our primary keys exist in name only. The DBMS is not going to enforce them; it’s going to assume they are right, and if you want them enforced you’ll need to write that logic yourself.
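For the curious, here's a rough sketch of what that DIY logic could look like against our toy CHECK123 table: first check whether the promise actually holds, then keep one row per key if it doesn't. The ORDER BY inside ROW_NUMBER() and the CHECK123_DEDUPED name are arbitrary choices for this sketch; pick whatever tie-breaking logic and naming make sense for your data.

-- Does our PRIMARY KEY column actually contain duplicates?
SELECT SOME_ID, COUNT(*) AS dupes
FROM CHECK123
GROUP BY SOME_ID
HAVING COUNT(*) > 1;

-- If so, keep one row per key and rebuild (CHECK123_DEDUPED is a made-up name for this sketch)
CREATE TABLE CHECK123_DEDUPED AS
SELECT SOME_ID, AN_ATTRIBUTE, ANOTHER_ATTRIBUTE, SOME_NUMBER
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SOME_ID ORDER BY SOME_NUMBER) AS rn
    FROM CHECK123
) ranked
WHERE rn = 1;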

So why would the thoughtful and well-paid Engineers, TPMs, Directors, and VPs building these cloud data warehouses decide to let people designate primary keys that aren’t enforced? I’ll boil the arguments down to the following:

Write performance

What do we mean by write performance, and can we actually take a look at a DBMS and see this impact on performance?

Yes, we can:

  • spin up Postgres inside of a container
  • create tables with/without PK
  • write data to said tables
  • measure performance
  • and even look a little deeper at what Postgres is doing

A couple of caveats first, so I don’t get dunked on…

  1. Single Node vs. MPP: I’m running Postgres inside a container on my [kindle](Insert sponsored link) Mac, which is completely different from the orchestration/networking effort required for a cloud data warehouse with dozens of compute nodes.
  2. Row-Oriented vs. Columnar: Postgres is row-oriented: an entire row (SOME_ID, AN_ATTRIBUTE, ANOTHER_ATTRIBUTE, SOME_NUMBER) is stored together. Redshift uses columnar storage: each column is stored in its own block(s). This is fantastic for aggregates like SELECT SUM(SOME_NUMBER) but makes single-row operations inefficient. If we wanted to update or insert a row in Redshift while enforcing the primary key, we’d have to:
  • Search the ID column’s block(s) for the record we’re updating (and if it doesn’t exist, insert instead)
  • Then find the corresponding attributes in their blocks, and so on for every column
  • Just to reconstruct one tuple for an update… the opposite of what columnar stores are designed for

This leads into the second point people will make… that this isn’t what the DW is for. You can now see that these inserts are going to be gnarly, so why not use some sort of slowly changing dimension that allows for faster writes, and control changes to facts in an OLTP DBMS better suited to handle transactional workloads?

Insert with & w/o a PK

Anyway… let’s do our little test:

First we pull the docker image and run a container in detached mode:

docker run --name postgres-test -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d postgres

We can check the status of our new container with docker ps -a, then run docker exec -it postgres-test psql -U postgres to get a psql shell. We’ll create two tables, one with a PK and one without, and use generate_series() to curate some faux data to write to these tables, something like:

SELECT i, 'some payload data for row ' || i PAYLOAD FROM generate_series(1, 10) s(i);

-- Which will produce the following ID + PAYLOAD 
 i  |           PAYLOAD          
----+------------------------------
  1 | some payload data for row 1
  2 | some payload data for row 2
  3 | some payload data for row 3
  4 | some payload data for row 4
  5 | some payload data for row 5
  6 | some payload data for row 6
  7 | some payload data for row 7
  8 | some payload data for row 8
  9 | some payload data for row 9
 10 | some payload data for row 10
(10 rows)

So let’s create our two tables, one with a PK and one without:

CREATE TABLE no_pk_test (
    id  INTEGER,
    payload TEXT
);

CREATE TABLE with_pk_test (
    id  INTEGER PRIMARY KEY,
    payload TEXT
);

A deep dive into Postgres internals

Now we have two tables and with_pk_test has a primary key. We can see it has an associated index:

SELECT oid, relname, relkind FROM pg_class WHERE relname IN ('with_pk_test', 'with_pk_test_pkey');

-- Returns:
  oid  |      relname      | relkind 
-------+-------------------+---------
 16393 | with_pk_test      | r <-- Our Relation
 16398 | with_pk_test_pkey | i <-- Our index
(2 rows)

Each table and index is stored in a separate file. We can also see the initial size of this index; Postgres has allocated the first 8kB page for our B-tree:

-- 8192 bytes
SELECT pg_size_pretty(pg_relation_size('with_pk_test_pkey')) AS index_size;

We can get the OID of all databases with:

SELECT * FROM pg_database;

-- Returns (truncated Columns):
 oid |  datname  
-----+-----------
   5 | postgres
   1 | template1
   4 | template0
(3 rows)

-- We can see our current database with: 
SELECT oid FROM pg_database WHERE datname = current_database();

-- Returns:
 oid 
-----
   5
(1 row)

Now we can exit psql (\q) and use bash instead: docker exec -it postgres-test bash. We can cd to the base directory and list the directories.

cd /var/lib/postgresql/data/base && ls  # shows 1  4  5

Nice… 1, 4, and 5 match the OIDs of the databases we listed above, and now if we cd to 5 and ls -lh we’ll find:

-rw------- 1 postgres postgres    0 Jun 23 16:30 16388 <-- OID of our relation no_pk_test
-rw------- 1 postgres postgres    0 Jun 23 16:30 16391
-rw------- 1 postgres postgres 8.0K Jun 23 16:30 16392
-rw------- 1 postgres postgres    0 Jun 23 16:31 16393 <-- OID of our relation with_pk_test
-rw------- 1 postgres postgres    0 Jun 23 16:31 16396
-rw------- 1 postgres postgres 8.0K Jun 23 16:31 16397
-rw------- 1 postgres postgres 8.0K Jun 23 16:31 16398 <-- OID of our index with_pk_test_pkey
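
If you'd rather not eyeball OIDs against filenames, Postgres can map a relation to its on-disk file for you. A quick sketch; the exact values depend on your instance, but they should line up with the listing above:

-- Paths are relative to the data directory (/var/lib/postgresql/data here)
SELECT pg_relation_filepath('with_pk_test')      AS table_file,
       pg_relation_filepath('with_pk_test_pkey') AS index_file;
-- e.g. base/5/16393 and base/5/16398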

Alright, with that deep dive out of the way, let’s insert 10MM records using generate_series() into these empty relations:

-- INSERT 0 10000000
-- Time: 3846.560 ms (00:03.847)
\timing on
INSERT INTO no_pk_test SELECT i, 'some payload data for row ' || i PAYLOAD FROM generate_series(1, 10000000) s(i);

-- INSERT 0 10000000
-- Time: 6038.888 ms (00:06.039)
INSERT INTO with_pk_test SELECT i, 'some payload data for row ' || i PAYLOAD FROM generate_series(1, 10000000) s(i);

Yep, the insert with a PK takes roughly 1.6x as long (3846.560 ms vs 6038.888 ms). And just for fun we can check on the size of our tables/index.

 table_name              | table_size | indexes_size | total_size
-------------------------+------------+--------------+------------
 "public"."with_pk_test" | 651 MB     | 214 MB       | 865 MB
 "public"."no_pk_test"   | 651 MB     | 0 bytes      | 651 MB
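
If you want to reproduce that summary yourself, here's one way to do it with Postgres's size functions (a sketch; it should land on roughly the same numbers as the table above):

SELECT relname                                     AS table_name,
       pg_size_pretty(pg_table_size(oid))          AS table_size,
       pg_size_pretty(pg_indexes_size(oid))        AS indexes_size,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('with_pk_test', 'no_pk_test');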

And for even more fun, we can watch the insert time explode further after adding a secondary index.

CREATE TABLE multi_index_test (
  id INTEGER PRIMARY KEY,
  name VARCHAR(256) NOT NULL, -- Our new table has a new column (name)
  payload TEXT
);
CREATE INDEX idx_name ON multi_index_test (name); -- Add a secondary B+Tree index
INSERT INTO multi_index_test
SELECT
  i,
  'user_name_' || (i % 100000),
  'some payload data for row ' || i
FROM generate_series(1, 10000000) s(i);
INSERT 0 10000000
Time: 30859.168 ms (00:30.859) <-- *cough*

There is NO free lunch: sure, you can make your reads faster, but at what cost?

  • Are people even using the index?
  • What’s the ratio of writes to reads on this table anyway?
  • I wonder how many B+ Trees I can fit in my 2012 Chevy Impala LT?
  • These are the questions we need to answer (a sketch for the first two follows below)
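
Postgres at least gives us the plumbing to answer the first two (no help on the Impala, sorry). A rough sketch using the cumulative statistics views:

-- Is anyone actually scanning the index?
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'multi_index_test';

-- Roughly how write-heavy vs read-heavy is the table?
SELECT relname,
       n_tup_ins + n_tup_upd + n_tup_del AS writes,
       seq_scan + idx_scan               AS read_scans
FROM pg_stat_user_tables
WHERE relname = 'multi_index_test';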

Merge Upsert

What happens if we insert additional records into tables that already have 10,000,000 records in them? We can compare the data-warehouse-style insert, where we simply insert everything, with a merge upsert, where we update records that already exist and insert the ones that don’t.

Let’s regenerate IDs spanning 9,000,001 -> 11,000,000 so we get a mix of new and already existing records and perform our two writes.

-- Creating a new table with the records as discussed above 
CREATE TEMP TABLE new_batch AS
SELECT i AS id, 'new payload for ' || i AS payload FROM generate_series(9000001, 11000000) s(i);

We can perform our basic insert into no_pk_test:

INSERT INTO no_pk_test
SELECT * FROM new_batch;
INSERT 0 2000000
Time: 598.475 ms 

-- 12MM Records now, 10MM old and 2MM new
SELECT LEFT(payload, 10), COUNT(1) FROM no_pk_test GROUP BY 1;
    left    |  count   
------------+----------
 new payloa |  2000000
 some paylo | 10000000
(2 rows)

Now what about our upsert? We can use ON CONFLICT, courtesy of Peter Geoghegan, Heikki Linnakangas, Andres Freund, and Jeff Janes.

INSERT INTO with_pk_test
SELECT * FROM new_batch
ON CONFLICT (id) DO UPDATE SET
    payload = EXCLUDED.payload;
INSERT 0 2000000
Time: 3287.132 ms (00:03.287) 

                                QUERY PLAN                                
--------------------------------------------------------------------------
 Insert on with_pk_test  (cost=0.00..33414.40 rows=0 width=0)
   Conflict Resolution: UPDATE
   Conflict Arbiter Indexes: with_pk_test_pkey
   ->  Seq Scan on new_batch  (cost=0.00..33414.40 rows=1869440 width=36)

-- 11MM Records now, 9MM old and 2MM new
    left    |  count  
------------+---------
 new payloa | 2000000
 some paylo | 9000000
(2 rows)

-- We can also use MERGE, which is slightly slower
-- Time: 4593.521 ms (00:04.594)
MERGE INTO with_pk_test pk
USING new_batch n
ON n.id = pk.id
WHEN MATCHED THEN
  UPDATE SET payload = n.payload
WHEN NOT MATCHED THEN
  INSERT (id, payload)
  VALUES (n.id, n.payload);
MERGE 2000000
Time: 4593.521 ms (00:04.594)
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 Merge on with_pk_test pk  (cost=398687.23..523481.91 rows=0 width=0)
   ->  Hash Left Join  (cost=398687.23..523481.91 rows=1869440 width=48)
         Hash Cond: (n.id = pk.id)
         ->  Seq Scan on new_batch n  (cost=0.00..33414.40 rows=1869440 width=42)
         ->  Hash  (cost=207833.88..207833.88 rows=10979388 width=10)
               ->  Seq Scan on with_pk_test pk  (cost=0.00..207833.88 rows=10979388 width=10)

The merge upsert takes much longer than the plain insert (598.475 ms vs 3287.132 ms). These are two different kinds of writes, but it’s generally indicative of what you might be trying to accomplish in an OLAP system vs an OLTP one. Hopefully this sheds some light on what people mean when they bring up the “performance” argument.

These constraints should be enforced upstream

This is idea #2.

And this is opening a can of opinions, hot takes, and worms. I think in theory maybe this works, but in the real world there are certainly people with relations sitting in their DW that could benefit from the safety of PK enforcement, or at least from having that option. So on one hand we have quick writes and optimized plans; on the other hand, those performance gains are a wash because we have to write the deduping checks ourselves anyway, and people are confused that their PRIMARY KEY has duplicates in it.

Opinion Time! I don’t understand why the syntax in Redshift couldn’t have been changed to fully remove the confusion. If you read the documentation it’s clear, but why call it a PRIMARY KEY when it could have been PRIMARY KEY NOT ENFORCED… a little clarity over brevity. Databricks has the RELY option. Snowflake doesn’t enforce primary keys in its standard table offering but does enforce them for hybrid tables. BigQuery uses NOT ENFORCED.
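
For reference, here’s roughly what that looks like in BigQuery. This is a sketch from memory (the dataset and column names are made up), so double-check the current DDL docs:

-- BigQuery: the same non-enforcement, but at least the DDL says so out loud
CREATE TABLE my_dataset.check123 (
  some_id STRING NOT NULL,
  an_attribute STRING,
  PRIMARY KEY (some_id) NOT ENFORCED
);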

(1048064189, Have fun out there!), (1048064189, Hope you enjoyed the blog!), (1048064189, Take Care!)