I was talking to one of our partner teams in a developing market. They are a newer analytics department of savvy Business Analysts & SCMs (Supply Chain Managers).

Long story short, they are working on this new metric and I was like, "How are you filtering for our specific products in your marketplace?" And they were like, "Oh, we have this case statement that we include in all of our jobs…"

SELECT *
FROM (
    SELECT COL1
         , COL2
         , COL3
         , ...
         , CASE WHEN PRODUCT_WEIGHT > 45 THEN 'OUR_PRODUCT'
                WHEN PRODUCT_LENGTH >= 37 AND PRODUCT_WIDTH >= 33 THEN 'OUR_PRODUCT'
                WHEN CATEGORY = 'SOME_CATEGORY' OR EARTH_IS_ROUND_FLAG = 'Y' THEN 'OUR_PRODUCT'
                WHEN COLOR = 'RED' THEN 'OUR_PRODUCT'
                ELSE 'NOT_OUR_PRODUCT' END AS PRODUCT_DEF
    FROM BIG_TABLE
) T
WHERE PRODUCT_DEF = 'OUR_PRODUCT'

Quick aside: When I write Job, Profile, Model, or Asset I’m talking about SQL that creates some result set and is most likely scheduled to run at a specified interval or once all dependencies have been satisfied.

I was like, ohhhh man, can I save you some pain? Maintaining this kind of logic across tens, or worse hundreds, of jobs is inevitably going to suck! At some point someone’s going to come along and say:

 Hey… color = 'BLUE' products should also be in our list of products.

Now when this day comes and you’ve got a bunch of different teams that all maintain product definitions… someone is going to have to coordinate this migration and get the timing right. Heat flashes across your forehead and your palms get a little sweaty. For so long we’ve been copying and pasting case statements, but we’ve officially reached a breaking point of misaligned definitions and numbers not matching up. The CAT is way out of the bag on this one; it does NOT want to go back in the bag, and you need it back in the bag if you ever want to make it to the FIFe World Show.

  1. This change will need to be cascaded across all the teams that maintain such a definition

    a. Some teams might disagree and say, "Blue products shouldn’t be included."

    b. It may be difficult to locate everyone who is defining products, so there will be some level of uncertainty as to whether everyone has been notified

    c. And still others will ask, "Since when did we define OUR_PRODUCT with a color dimension????"

  2. Somebody won’t have the bandwidth to implement the change and will need to push the date back; others may need time to figure out where exactly they’re defining products in their pipeline; someone’s OOTO. This will bounce around a bunch of teams, and there’ll be several calls to organize.

  3. Some teams may have different data models: we only keep the latest definition; we keep a versioned definition over time; we have a denormalized table and QUALIFY ROW_NUMBER() OVER (PARTITION BY PRODUCT_ID ORDER BY LAST_UPDATED_DATE DESC) = 1 every single time to derive the most up-to-date definition (a sketch of that pattern follows this list).

  4. Now that we’ve updated the definition, do we need to rewrite all the historical tables that were derived using the previous definition? Those blue products are now OUR_PRODUCT, but at any point in time pre-change they would never have been OUR_PRODUCT and would not reside in any stored result in any relation. And don’t forget this is happening across multiple teams in parallel, and the change will need to be cascaded through several sequential jobs!
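
For reference, that ROW_NUMBER deduplication pattern from item 3 looks something like this (a sketch assuming a dialect that supports QUALIFY, such as Snowflake; the table and column names are illustrative):

-- Keep only the newest row per product to derive the current definition
SELECT PRODUCT_ID
     , PRODUCT_DEF
     , LAST_UPDATED_DATE
FROM DENORMALIZED_PRODUCT_TABLE
QUALIFY ROW_NUMBER() OVER (PARTITION BY PRODUCT_ID ORDER BY LAST_UPDATED_DATE DESC) = 1

Every consumer of that table pays this window-function tax on every single read, which is exactly the kind of repeated logic a centralized definition removes.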

Ok so how do we avoid pure pain?

Maintain a centralized definition.

  1. DENORM_PRODUCT_DEFS [Captures all product changes over time]: a denormalized table appending all changes to products over time. Our products may change color, or a vendor may adjust the packaging and alter its dimensions; all of this can impact the classification of a specific product. This table could hold when the records were first inserted, when they were most recently updated, and a flag denoting the active definition. That flag will save you a bunch of LEAD/LAG stuff if you’re reading from this table a lot.

  2. VERSIONED_PRODUCT_DEFS [Tracks historical versions for auditing and reproducibility]: a versioned table that contains a snapshot of the product definition at a specific point in time. Back in JUNE we used this definition, which would have resulted in this classification of these products; now in OCTOBER we use that definition, which results in that classification. This table would be nicely partitioned on version, and it would also be great to have columns denoting start and end dates for each version.

  3. FACT_PRODUCT_DEFS [Current, streamlined view for active use cases]: only the most up-to-date definition for products, pre-filtered to include only your products.
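
A rough sketch of what these three tables might look like (warehouse-flavored DDL; every column name here is illustrative, not prescriptive):

-- 1. Append-only change log: one row per change to a product's definition
CREATE TABLE DENORM_PRODUCT_DEFS (
    PRODUCT_ID       VARCHAR,
    PRODUCT_DEF      VARCHAR,   -- 'OUR_PRODUCT' / 'NOT_OUR_PRODUCT'
    INSERTED_AT      TIMESTAMP, -- when the record was first inserted
    LAST_UPDATED_AT  TIMESTAMP, -- when it was most recently updated
    IS_ACTIVE        BOOLEAN    -- denotes the active definition, no LEAD/LAG required
);

-- 2. Point-in-time snapshots: partition on VERSION, bookend with validity dates
CREATE TABLE VERSIONED_PRODUCT_DEFS (
    VERSION             INTEGER,
    PRODUCT_ID          VARCHAR,
    PRODUCT_DEF         VARCHAR,
    VERSION_START_DATE  DATE,
    VERSION_END_DATE    DATE    -- open-ended for the current version
);

-- 3. Current definitions only, pre-filtered so consumers just join and go
CREATE TABLE FACT_PRODUCT_DEFS AS
SELECT PRODUCT_ID
     , PRODUCT_DEF
FROM DENORM_PRODUCT_DEFS
WHERE IS_ACTIVE
  AND PRODUCT_DEF = 'OUR_PRODUCT';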

So now we have profiles that look like:

SELECT B.COL1
     , B.COL2
     , B.COL3
     , B...
FROM BIG_TABLE B
    INNER JOIN FACT_PRODUCT_DEFS F ON B.PRODUCT_ID = F.PRODUCT_ID

And even better, now we can update how this reference is configured, and if we have this query written fifty times, everything points back to that reference and gets updated [ref dynamically resolves table and schema names, so changes to a single reference propagate across all queries]:

SELECT B.COL1
     , B.COL2
     , B.COL3
     , B...
FROM {{ ref("BIG_TABLE") }} B
    INNER JOIN {{ ref("FACT_PRODUCT_DEFS") }} F ON B.PRODUCT_ID = F.PRODUCT_ID
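
Once these templates compile, each ref resolves to a fully qualified relation name. With a hypothetical database and schema, the FROM clause above would render as something like:

FROM ANALYTICS_DB.CORE.BIG_TABLE B
    INNER JOIN ANALYTICS_DB.CORE.FACT_PRODUCT_DEFS F ON B.PRODUCT_ID = F.PRODUCT_ID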

Distinct logic across teams is pain. Host a central definition. Argue about said definition, update said definition, cry about said definition, whatever… but allow everyone to read from a single definition.

Any time you see the same code written over and over, alarm bells should be going off; there is a fire in the kitchen! Not only is this just MORE lines of code, but every instance it’s rewritten is another opportunity for logic to become stale or be slightly miswritten. DRY DRY DRY. Do not pour water on the grease fire in the kitchen; suffocate it with a deluge of consistent data/reporting instead.

This is such a common source of the “This number doesn’t match that number” problem. This is the “but it works on my machine” equivalent for day to day analytics.

The numbers don’t match up because distributed, hard-coded definitions are hosted across teams, using different dependencies, running at different times… I would be FAR more perplexed if the numbers DID match up. In fact, I’d be completely dumbfounded. The numbers have absolutely no business matching up because they don’t have a data model/governance to support them actually doing so. It’s not even automatically testable, because there’s no way to determine where the code diverges without, I guess, literally parsing jobs as text to rip out the dependencies and business logic.

There are so many different “types” of what I described above. For instance, maybe you have a data migration campaign and need to prefix all of your jobs with a new database. In the current state, hundreds of profiles have FROM clauses like FROM SCHEMA.TABLE, and now FROM DATABASE.SCHEMA.TABLE needs to be written into every single one… hopefully this illustrates why the idea of ref above is so helpful.
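
With hard-coded relations, that migration is a hand edit (or a risky find-and-replace) across hundreds of files. With ref, it’s a single configuration change, and every templated query recompiles against the new location (the names below are placeholders):

-- Before the migration, {{ ref("BIG_TABLE") }} compiles to:
FROM SCHEMA.BIG_TABLE
-- After one config change repointing the project, the same model compiles to:
FROM DATABASE.SCHEMA.BIG_TABLE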

I’m not here to praise DBT, but its Jinja templating solution for making SQL more dynamic handles some of these problems nicely. I think a better data model can resolve some nasty issues, but if you’re repeating logic or referencing specific relations over and over across profiles, assets, whatever… be prepared for pain.

If you’re working with an analytics team, one of the worst things that can happen is an erosion of stakeholders’ trust in that team’s data quality. If you order crayons from Amazon and receive pencils, or you set up an AWS billing alarm to email you when your spend exceeds $100 and it doesn’t… you lose trust. One of the key pillars of a successful analytics team is building trust in its outputs (also, move quickly, I need this data yesterday, let’s go let’s go… but that’s a given).

This is the unglamorous plumbing that underpins a successful analytics organization. While you can never fully control the madness that occurs in a 30-tab Excel file authored by ten people off a Windows shared drive, you can rest easy knowing that at least whatever was copy-pasted in there was reputable and “correct”.