Query Banks
That query bank should be a semantic layer⌗
I’ve done it myself:
“Oh we should just cache all these commonly used queries somewhere central, so we always have a nice definition for someone to use.”
- You could even extend the functionality of this “Query Bank” and train an LLM, or perform code reviews, or implement search, etc…
Sweet, now anyone can grab some code and use it! They can paste it here and in this scheduled job there, science can use it and slam it into a notebook, and others can run ad hoc workloads against it.
No, ahh! The BANK IS CLOSED⌗
A query bank misses the entire point of DRY, you end up with a hairball of copy/paste code that has no reference to its origin.
Imagine we took all package management and said you know what instead of depending on our source code you should just copy and paste it into your project
and reference it there!
That is what the “Query Bank” is doing because you can't inject the latest definition(query) at run time
. If you CAN do this then you don’t have a query bank, you have an actual data model, fleshed out with materialized relations that have some established dependency chain which you can reference in your workflow.
The query bank is just a bandaid for broken process/data models:
- No lineage of where code has been copied
- No ability to update code across all copies
- Limited ability to notify users of changes to upstream code
- Most likely repeating ourselves which is extra work for people and DBMS
- Will 100% cause discrepancies between “Sources of Truth”
- No ability to ensure code is being copy/pasted into the correct context
Creating tables establishes consistency, durability and determinism that’s not possible with a “Query bank”.
The Bank is open⌗
Make a table, document the table, expose the table, manage access, and share the table.
Ask yourself a few questions before you act:
- Should I really send this code to someone or should I just DEFINE this once and allow them to read from the definition?
- This job, model, notebook whatever you want to call it is already like 1000 lines of SQL should I really pile on more logic to this or maybe its time ot break down into a series of relations?
- I’ve seen this Temporary table, CTE, etc… in a bunch of people’s queries should I just paste it into mine or is there something better I can do?
It’s your problem until you fix it or find the owner⌗
Here’s a challenge as an example. If you see a long case statement being shared around. Yes, literally just a case statement without any context… take the initiative and see if you can’t centralize that and store it is a relation for the people needing it.
It’s most likely they don’t know any better and will continue to do this until they have a nasty data quality issue/argument.