SQL Buckets
Case Statements:⌗
A common occurrence when writing SQL is needing to bucket some data and look at how it is distributed across said buckets.
Typically I’d write (or have an LLM spit out) something like this:
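Something along these lines — the table and column names here are placeholders, with `cd_ld` standing in for whatever value is being bucketed:

```sql
SELECT
    CASE
        WHEN cd_ld BETWEEN 0 AND 4   THEN '00. 0-4 Days'
        WHEN cd_ld BETWEEN 5 AND 9   THEN '01. 5-9 Days'
        WHEN cd_ld BETWEEN 10 AND 14 THEN '02. 10-14 Days'
        WHEN cd_ld BETWEEN 15 AND 19 THEN '03. 15-19 Days'
        ELSE '04. 20+ Days'  -- catch-all
    END AS bucket,
    COUNT(*) AS records
FROM some_table
GROUP BY 1
ORDER BY 1;
```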
I have to add a character/number prefix to each bucket label to allow for sorting, which makes writing and updating the lengthy CASE statement that much more annoying. And I have to write all this because, unlike other OLAP DBMSs, Redshift doesn’t have a bucketing function that I’m aware of.
Or do I…
Solution:⌗
Let’s say I have a table that records when a record was created and when it was last updated. I’ve never used this table before, and I’m interested in the distribution of the number of days between created and updated to see how common it is for a record to have an update n days after it was created.
Cool, so we want to know how these records are distributed across some buckets!
We could do this:
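A sketch of that approach, assuming hypothetical `created_at` / `last_updated_at` timestamp columns on a table I’ll call `records`:

```sql
WITH diffs AS (
    -- days between creation and last update
    SELECT DATEDIFF(day, created_at, last_updated_at) AS cd_ld
    FROM records
)
SELECT
    CASE
        WHEN cd_ld BETWEEN 0 AND 4   THEN '00. 0-4 Days'
        WHEN cd_ld BETWEEN 5 AND 9   THEN '01. 5-9 Days'
        WHEN cd_ld BETWEEN 10 AND 14 THEN '02. 10-14 Days'
        WHEN cd_ld BETWEEN 15 AND 19 THEN '03. 15-19 Days'
        ELSE '04. 20+ Days'  -- catch-all
    END AS bucket,
    COUNT(*) AS records
FROM diffs
GROUP BY 1
ORDER BY 1;
```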
That is pretty annoying… it is verbose and, if you’ve done this before, you’ll know that the catch-all bucket at the end will have some spike of records in it that you’ll want to investigate further. So I came up with the following:
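A sketch of the idea, using the same hypothetical `created_at` / `last_updated_at` columns on a table I’ll call `records`, with `5` as the bucket size:

```sql
WITH diffs AS (
    -- days between creation and last update
    SELECT DATEDIFF(day, created_at, last_updated_at) AS cd_ld
    FROM records
)
SELECT
    LPAD(FLOOR(cd_ld / 5.0)::INT::VARCHAR, 2, '0')      -- sortable prefix: '00', '01', ...
        || '. '
        || (FLOOR(cd_ld / 5.0) * 5)::INT::VARCHAR       -- bucket lower bound
        || '-'
        || (FLOOR(cd_ld / 5.0) * 5 + 4)::INT::VARCHAR   -- bucket upper bound
        || ' Days' AS bucket,
    COUNT(*) AS records
FROM diffs
GROUP BY 1
ORDER BY 1;
```

Each difference gets mapped to a label like `'00. 0-4 Days'` or `'03. 15-19 Days'`, and the buckets are generated on the fly instead of being enumerated by hand.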
We can pass in the column to bucket, adjust the bucket size, and control the number of buckets. This does use LPAD, which isn’t in every dialect, but FLOOR() is standard. Regardless, let’s break it down in detail in case you need to rewrite it for a different SQL dialect.
- FLOOR: rounds a number down to the nearest whole number
- LPAD: pads the left side of a string with a given character until it reaches a target length
On an input of “1”, this is what the flow looks like to go from 1 -> ‘00. 0-4 Days’:
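Step by step, assuming a bucket size of 5:

```sql
-- cd_ld = 1, bucket size = 5
FLOOR(1 / 5.0)        --> 0      (which bucket index we're in)
LPAD('0', 2, '0')     --> '00'   (sortable prefix)
0 * 5                 --> 0      (bucket lower bound)
0 * 5 + 4             --> 4      (bucket upper bound)
'00' || '. ' || '0' || '-' || '4' || ' Days'  --> '00. 0-4 Days'
```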
We’ve taken an input of CD_LD = 1 and assigned it a sortable, bucketable (probably not a word) value: ‘00. 0-4 Days’.
I’m worried about an unconstrained series of buckets… how do I cap it?⌗
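One way (a sketch, with an example cutoff of 50 days): fold everything at or above the cutoff into a single CASE branch, which leaves just one hand-written bucket instead of dozens. `cd_ld` is the day difference from earlier:

```sql
CASE
    WHEN cd_ld >= 50 THEN '10. 50+ Days'  -- everything past the cap lands here
    ELSE LPAD(FLOOR(cd_ld / 5.0)::INT::VARCHAR, 2, '0')
         || '. '
         || (FLOOR(cd_ld / 5.0) * 5)::INT::VARCHAR
         || '-'
         || (FLOOR(cd_ld / 5.0) * 5 + 4)::INT::VARCHAR
         || ' Days'
END AS bucket
```

Note the cap label’s prefix (`'10'`) is chosen to match what `FLOOR(50 / 5.0)` would produce, so it still sorts after the generated buckets.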
I want to change the increment size⌗
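Swap every occurrence of the bucket size together, and remember the upper bound is always `size - 1`. For example, weekly buckets (a sketch; `cd_ld` as before):

```sql
LPAD(FLOOR(cd_ld / 7.0)::INT::VARCHAR, 2, '0')
    || '. '
    || (FLOOR(cd_ld / 7.0) * 7)::INT::VARCHAR       -- lower bound: 0, 7, 14, ...
    || '-'
    || (FLOOR(cd_ld / 7.0) * 7 + 6)::INT::VARCHAR   -- upper bound: size - 1
    || ' Days' AS bucket
```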
I have more than 99 buckets, the two digit solution won’t work:⌗
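Widen the pad length to however many digits you need; three digits sorts correctly up to 1,000 buckets (a sketch):

```sql
LPAD(FLOOR(cd_ld / 5.0)::INT::VARCHAR, 3, '0')  -- '000', '001', ..., '999'
    || '. '
    || (FLOOR(cd_ld / 5.0) * 5)::INT::VARCHAR
    || '-'
    || (FLOOR(cd_ld / 5.0) * 5 + 4)::INT::VARCHAR
    || ' Days' AS bucket
```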
Hope that’s helpful! Maybe there’s a better solution; I didn’t do an exhaustive search!