A few considerations before building that extremely wide table.

Every once in a while, an idea comes along to build a table with a wide array of columns at a particular granularity. This is usually associated with “self-serve analytics” and/or an attempt to define a “source of truth” table. Wide tables, self-serve analytics, and source of truth tables aren’t inherently good or bad… but there are some considerations to this type of table that I’m going to rant about.

When I talk about a single “wide table” I’m essentially talking about a denormalized table, just an especially wide one with many loosely related columns. The opposite of this would be a normalized design, where one would “store different but related pieces of information in separate logical tables (called relations)”. I don’t want to linger on this topic too much, but typically a denormalized schema speeds up reads at the expense of writes, and vice versa for a normalized schema.

I’ll also mostly be talking in the context of OLAP, not OLTP.

Let’s define our wide table

For the sake of the rest of this thing, let’s define a wide table.

You work for a company that sells things. You record data on what was sold, how much was sold, what was projected to be sold, how much of that thing you had, how many distinct customers bought that thing, how quickly you got that thing to your customer… I’m not going to hash out the full DDL for this table, but you get the picture.

  • It should be apparent just how many unique metrics can be generated from selling things!

You know what would be really nice? What if we decided on some granularity, say… for a particular thing in a particular week, we just record every metric about that thing for that week! So, a unique row in our table will be a combination of Thing + Week… and then we’ll store the name of the thing, the inventory, the shipments, the forecast, etc…
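To make that grain concrete, here’s a minimal sketch of what the DDL might look like. The table and column names are purely illustrative, and the real thing would have far more metric columns:

```sql
-- Hypothetical wide table: grain is one row per (thing_id, week_start_date)
CREATE TABLE weekly_thing_metrics (
    thing_id            BIGINT,
    week_start_date     DATE,
    thing_name          VARCHAR,          -- product attributes repeated every week
    thing_group         VARCHAR,
    thing_description   VARCHAR,
    units_sold          BIGINT,
    units_forecast      BIGINT,           -- NULL until data science publishes a forecast
    inventory_on_hand   BIGINT,
    outbound_shipments  BIGINT,
    distinct_customers  BIGINT,
    avg_days_to_deliver DOUBLE PRECISION
    -- ...and dozens more metric columns
);
```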

Note that this is a BATCH solution where data is written periodically at specific intervals in time… NOT a streaming solution where shipments are recorded in real time. Maybe the upstream sources this table is generated by are populated via some Kafka/Redpanda/Kinesis situation.

Why would you want to do this?

Maybe this table feeds an LLM or ML based query generation solution, pulling data for non-technical users with ad-hoc business questions; maybe it’s there to simplify the SQL associated with those ad-hoc queries by moving joins and other logic upstream, speeding up reads (SELECT); or maybe the table needs to have a ton of columns for some other reason. That’s fine! Standardizing definitions is generally a good thing, and improving the speed at which ad-hoc questions are answered is also good!

Dueling parties wielding Excel files generated by random queries stored and iterated on in Notepad, each producing different outputs, is always annoying. It’d be easier to troubleshoot if everyone was relying on the same source and using some form of version control (I know, I’m only dreaming at this point). It’d be even better if you didn’t need to join any other relations on that source: simply SELECT and filter and you’re good to go. Thus, our wide table.
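That’s the whole appeal: an ad-hoc question becomes a single-table query with no joins. Something like this, using the illustrative column names from the sketch above:

```sql
-- "How many units did each product group ship last quarter?"
-- No joins, no knowledge of upstream sources required.
SELECT
    thing_group,
    SUM(outbound_shipments) AS total_shipments
FROM weekly_thing_metrics
WHERE week_start_date >= DATE '2024-01-01'
  AND week_start_date <  DATE '2024-04-01'
GROUP BY thing_group
ORDER BY total_shipments DESC;
```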

What are some issues? [Begin Rant]

This table is now a dependency bomb.

All of these columns are derived from upstream tables: your inventory data is coming from scan events at your warehouse, outbound shipments come from another source, and data science is producing the forecast in a local Python notebook triggered via cron that writes to the Windows shared drive. You’ve asked if this could be moved to SageMaker and the resulting dataset cataloged and written directly to the data warehouse, but that’s “below the line” for this quarter.

You could very well now have a ball of 70 dependencies.

  • If any of those dependencies are delayed, every metric will be waiting… so instead of just having the guy who needs to report inventory levels to the director on Monday pissed, everyone is pissed.

Now you’re like, ok… if any dependency hasn’t been satisfied by Monday at 5:00 am, it doesn’t matter, I’m just going to force the job so that at least some people get their data. However, now you’ve written dirty rows to this table… the inventory guy is still pissed, because inventory is half of what it should be because you forced the dataset before the latest scan events were written to the upstream table. You also KNOW that by forcing dependencies you’ve got incorrect data in your table, and you’re just going to have to re-run your transform job and actually let the dependencies satisfy this time. If you have a single transform job, that means running everything again on your compute, even the columns that already landed in the table successfully. You’ll also need some mechanism to notify the inventory guy that his numbers are going to be garbage Monday am, so he doesn’t accidentally report something ridiculous to the director.
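One thing that at least makes the tradeoff explicit is a pre-flight freshness check against the upstream watermarks before the transform publishes anything. A rough sketch, assuming each upstream source exposes some kind of last-loaded date (the upstream_load_watermarks table and its columns are made up for illustration):

```sql
-- Hypothetical pre-flight check: which upstream sources have NOT yet
-- loaded data covering the week we are about to publish?
SELECT
    source_name,
    MAX(loaded_through_date) AS loaded_through
FROM upstream_load_watermarks
GROUP BY source_name
HAVING MAX(loaded_through_date) < DATE '2024-04-01';  -- the week being published
```

Any rows returned are stale sources. It won’t stop anyone from forcing the job, but at least you know exactly which columns will be garbage and who needs the heads-up.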

Inventory guy does not like you he thinks you’re an ass, you think he’s an ass, inventory guy keys the right side of your 2022 Toyota Avalon, the Toyota Avalon is now discontinued and you’re sad because you should’ve waited and purchased the Toyota Crown, but you didn’t realize the Crown was coming out so you bought an Avalon, anyway… right hook to inventory guy! Now you’re being charged with assault! Now you’re in jail, your arraignment isn’t until Tuesday, but you need to get inventory numbers out by Monday! Who’s going to force those dependencies???

What about storage?

With this wide table, we’ll have the name of the product; maybe the product belongs to a broader group of products, and we’ll throw the description of the product in there as well.

  • These facts about the product may change… but probably fairly infrequently.

In our table, the same values will be repeated for those products week after week. Instead, they could be stored in a separate table that contains each product, its description, color, etc… and that could be joined to our fact table. Now we are again beginning to talk about normalization, which is critical to think about when setting up your relations.
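A rough sketch of that alternative (names illustrative again): pull the slowly changing product attributes into a dimension table, keep the weekly fact table skinny, and join only when a query actually needs the attributes.

```sql
-- Product attributes live once per product, not once per product per week
CREATE TABLE dim_thing (
    thing_id          BIGINT,
    thing_name        VARCHAR,
    thing_group       VARCHAR,
    thing_description VARCHAR
);

-- Weekly metrics reference the product by key only
CREATE TABLE fact_weekly_metrics (
    thing_id           BIGINT,
    week_start_date    DATE,
    units_sold         BIGINT,
    inventory_on_hand  BIGINT,
    outbound_shipments BIGINT
);

-- Join back only when the attributes are needed
SELECT d.thing_group, SUM(f.units_sold) AS units_sold
FROM fact_weekly_metrics f
JOIN dim_thing d ON d.thing_id = f.thing_id
GROUP BY d.thing_group;
```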

A few solid rebuttals to the above, given that we are mostly talking about an OLAP situation:

  1. Modern columnar storage formats like Parquet are efficient at compressing data, especially repeated values in a column (run-length and dictionary encoding do most of the heavy lifting here)
  2. Data storage is relatively cheap… if we are only talking in the millions of rows, it’s just not going to be a major driver of cost
  3. Denormalized data often performs better for analytical queries; with columnar storage, queries only need to scan the columns they actually reference

A few points on Inserts/Updates/Deletes

This is where denormalization can be slower and more complicated. Let’s say we want to change the product group that one of our products belongs to in the denormalized table. We want all historical data to be attributed to the new group, so we’ll need to update the previous product group to the new one in every row. We can launch a new product, but it’s possible data science hasn’t finished the production forecast for that product yet… now the forecast column is NULL, because we aren’t going to stall inserting the rest of the columns for that product just because the forecast is NULL. However, people who read from this table will need to understand that the forecast is NULL and choose how to deal with it accordingly. If an analyst is tasked with deriving forecast accuracy, notices some NULLs, and just NVL(forecast_column, 0)s them… that’s just not true. Data science has a beta forecast for the product and it’s certainly > 0… there’s a reason we decided to launch the new product after all: we predicted customers will want to buy it!
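To illustrate the trap (same illustrative column names as before): coalescing NULL forecasts to zero silently turns “we don’t have a forecast yet” into “we forecast zero demand,” which drags any accuracy metric in the wrong direction. Scoring only the rows that actually have a forecast, and surfacing how much volume has none, is usually the more honest choice.

```sql
-- The trap: NULL forecast becomes "we forecast zero units"
SELECT
    1.0 * SUM(ABS(units_sold - NVL(units_forecast, 0))) / SUM(units_sold) AS wape_wrong
FROM weekly_thing_metrics;

-- More honest: score accuracy only where a forecast exists,
-- and report how much sold volume had no forecast at all
SELECT
    1.0 * SUM(CASE WHEN units_forecast IS NOT NULL
                   THEN ABS(units_sold - units_forecast) END)
        / SUM(CASE WHEN units_forecast IS NOT NULL THEN units_sold END) AS wape_forecasted_only,
    SUM(CASE WHEN units_forecast IS NULL THEN units_sold ELSE 0 END)    AS units_sold_without_forecast
FROM weekly_thing_metrics;
```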

To be fair, if you launch something like the Pontiac Aztek maybe anticipating 0 demand is the correct choice when attempting to sell an angry kitchen appliance of a car.

Where will your data model break down?

Q: Hey I know we have that inventory column, could we break it out into the different inventory dispositions (Sellable, In transit, and Damaged)?

A: Sure, now our table increases in... width...

Q: Hey I know we have shipments for all of last week, and it’s only Thursday, but I’d really like to see what we shipped out Mon/Tues/Wed of this week to see where we are at.

A: We can’t really answer that with the denormalized table at the current granularity… even if it were daily we’d probably be waiting on dependencies and wouldn’t be able to get back to this person until Friday. Now this ask simply becomes a query against the upstream shipments table for the latest data.
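In other words, the answer skips the wide table entirely and goes straight to the source. Something like this, against a hypothetical raw shipments table:

```sql
-- Partial-week question answered directly from the upstream shipments source
SELECT
    ship_date,
    SUM(units_shipped) AS units_shipped
FROM raw_outbound_shipments                                       -- hypothetical upstream table
WHERE ship_date BETWEEN DATE '2024-04-01' AND DATE '2024-04-03'   -- Mon/Tues/Wed of this week
GROUP BY ship_date
ORDER BY ship_date;
```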

Q: Hey I’m using that wide table you made but I want to see how the descriptions for this product have evolved over time; however, all the descriptions are the same in this table?

A: This question will need to be directed towards another table that contains that history, as the denormalized table has been updated to reflect only the latest description.

Just as with anything, there are tradeoffs. Above, I’m posing specifically tricky questions of the table, but it would do extremely well with basic aggregations and it reduces the knowledge burden on end users, who no longer need to understand all the different upstream tables.

I only bring up these points because I think they are worth considering when designing how you want to capture your data. Again, there is no right or wrong; it’s simply context/data dependent, and the right choice will differ across teams. However, there is a clear right and wrong when it comes to purchasing a Pontiac Aztek; I’ll leave that exercise to the reader.