Collection of thoughts on table creation.

This is specifically regarding tables for analytics (OLAP queries, Data Warehouse, etc…). This is also specific to a larger organization, I don’t necessary apply these when working locally with duckDB per se.

  1. Get a good understanding of who’s going to use the table, what they need, and when they’ll need that data by. This will help inform your data model and manage dependencies. If you are working directly with another team have them list out all of the requirements and set a date when this list will be final.

When working on a request for our finance partners, I was starting on it and everyday they wanted to add this, change that…. nip that in the bud… have them curate what they want with the expectation that it will lock once you start building it. This will force them to stop, think critically, and figure out exactly what they need. I have only had to make significant updates to this table once in the last 14 months because of this. (I guess I can’t prove causation, but I think it helped)

  1. Don’t try and do too much in a single relation, you may think you are doing people favors by adding columns, nested data structures, and complexity but there’s a delicate balance here. A classic mistake is building extremely wide tables. These tables are for everybody and nobody because each aspect of the table wont be “good enough”. The goal is commonly to try and bring in hundreds of attributes at a specific granularity, but it turns into a dependency and data quality nightmare without being able to satisfy many questions without making comprises that nobody can accept.

  2. Document your stuff. I’m a firm believer that what you build is only as good as your documentation. Don’t continue to contribute to the sea of blank column descriptions and opaque DAGs that your stakeholder/peers will have to waste time figuring out. You will also waste your own time explaining things to your stakeholders over and over. Instead, create a thorough explanatory resource and point people to it.

  3. Publish it to a shared data catalog if possible… I’ve made this mistake countless times of creating local tables, them gaining a solid user base and then having to later share it and manage access across various people/software entities. A few caveats in that I would not do this if you’re still developing the table and making fast iterations… but otherwise it’s going to give you the ability to manage access, control updates, configure dependencies, and cascade relevant notifications so much easier.

  4. Write tests and perform code reviews. We are working on better testing frameworks to role out.

  • The last thing you want is people/down stream jobs consuming trash that could’ve been blocked from being written to the table in the first place.
  • Having someone review your code/table is a no-brainer. Writing SQL against the table and using it as it would be used in production will uncover oddities/issues that would otherwise plague consumer.