SQL Buckets

Case Statements:⌗

A common occurrence when writing SQL is needing to bucket some data and look at how it is distributed across said buckets.

Typically I’d write (have an LLM spit out buckets) something like this:

SELECT CASE WHEN COLUMN_TO_BUCKET < 5 THEN 'a. < 5'
            WHEN COLUMN_TO_BUCKET < 10 THEN 'b. >=5 - <10'
            WHEN COLUMN_TO_BUCKET < 15 THEN 'c. >=10 - <15'
            ETC...
            ELSE 'u. >100' END AS COLUMN_TO_BUCKET_DIST
      , SUM(THIS_IS_ANNOYING) AS ANNOYED
FROM TABLE
GROUP BY 1

I have to add a character/number in each bucket to allow for sorting, making writing/updating the lengthy case statement that much more annoying. I have to write all this because unlike other OLAP DBMS, Redshift doesn’t have a bucketing function that I’m aware of.

Or do I…

Solution:⌗

Let’s say I have a table that records when a record was created and when it was last updated. I’ve never used this table before, and I’m interested in the distribution of the number of days between created and updated to see how common it is for a record to have an update n days after it was created.

CREATE TEMP TABLE #DELTA_DIST AS (
SELECT A_UNIQUE_ID_OF_A_ROW
     , DATE_DIFF('DAY', CREATION_DATE, LAST_UPDATED) AS CD_LD
FROM MY_TABLE
WHERE A_FILTER >= '2025-01-01'::DATE
  AND ANOTHER_FILTER = 1
);

Cool, so we want to know how these records look like in some buckets!

We could do this:

SELECT CASE WHEN CD_LD < 5 THEN 'a. < 5 Day'
            WHEN CD_LD < 10 THEN 'b. 5-10 Day'
            WHEN CD_LD < 15 THEN 'c. 10-15 Day'
            WHEN CD_LD < 20 THEN 'd. 15-20 Day'
            WHEN CD_LD < 25 THEN 'e. 20-25 Day'
            WHEN CD_LD < 30 THEN 'f. 25-30 Day'
            WHEN CD_LD < 35 THEN 'g. 30-35 Day'
            WHEN CD_LD < 40 THEN 'h. 35-40 Day'
            WHEN CD_LD < 45 THEN 'i. 40-45 Day'
            WHEN CD_LD < 50 THEN 'j. 45-50 Day'
            WHEN CD_LD < 55 THEN 'k. 50-55 Day'
            WHEN CD_LD < 60 THEN 'l. 55-60 Day'
            WHEN CD_LD < 65 THEN 'm. 60-65 Day'
            WHEN CD_LD < 70 THEN 'n. 65-70 Day'
            WHEN CD_LD < 75 THEN 'o. 70-75 Day'
            WHEN CD_LD < 80 THEN 'p. 75-80 Day'
            WHEN CD_LD < 85 THEN 'q. 80-85 Day'
            WHEN CD_LD < 90 THEN 'r. 85-90 Day'
            WHEN CD_LD < 95 THEN 's. 90-95 Day'
            WHEN CD_LD < 100 THEN 't. 95-100 Day'
            ELSE 'u. >100' END AS CD_LD_DIST
, COUNT(DISTINCT A_UNIQUE_ID_OF_A_ROW) AS NUM_ID
FROM DELTA_DIST
GROUP BY 1

That is pretty annoying… it is verbose and, if you’ve done this before, you’ll know that catch all bucket at the end will have some spike in records in it that you’ll want to further investigate. So I came up with the following:

SELECT LPAD(FLOOR(COLUMN_TO_BUCKET / 5::FLOAT)::TEXT, 2, '0') -- Creates the sortable digits at the beginning
            + '. ' +
       FLOOR(COLUMN_TO_BUCKET / 5::FLOAT) * 5 -- Generates the first day in the range
            + '-' +
       (FLOOR(COLUMN_TO_BUCKET / 5::FLOAT) * 5 + 4) -- Generates the last day in the range  (+4 as I'm bucketing by 5 days at a time)
            +
       ' Days' AS COLUMN_TO_BUCKET_DIST -- Adds the Days string at the end for clarity
     , COUNT(DISTINCT THING_TO_AGGREGATE) AS AGGREGATED_THING
FROM TABLE_I_WANT_TO_DISTRIBUTE
GROUP BY 1

We can pass in the column to bucket, adjust the bucket size, and also handle the number of buckets. This does use LPAD but FLOOR() is standard. Regardless, lets break it down in detail if you need to rewrite for a different SQL dialect.

FLOOR: rounds a number down to the next whole number
LPAD: prepends characters to a string

On an input of “1” this is what the flow would look like to go from 1 -> 00. 0-4 Days

-- FLOOR of 1/5 = 0 and cast to text
SELECT FLOOR(1 / 5::FLOAT)::TEXT -> '0'
-- LPAD adds a leading zero to single digits
SELECT LPAD(FLOOR(1 / 5)::TEXT, 2, '0') -> '00'
-- Just cosmetic, appending a "."
SELECT LPAD(FLOOR(1 / 5)::TEXT, 2, '0') + '. '  -> '00.'
-- Now we know that this will iteratively return 0,1,2,3,n... as our input increases
-- And if we multiple by 5 we get the beginning of our bucket
       -- 0 * 5 = 0 (for 0-4)
       -- 1 * 5 = 5 (for 5-9)
       -- etc...
SELECT LPAD(FLOOR(1 / 5)::TEXT, 2, '0') + '. ' + FLOOR(1 / 5::FLOAT) * 5 -> '00. 0'
-- Now we add another cosmetic piece ( a dash to separate start/end of bucket)
SELECT LPAD(FLOOR(1 / 5)::TEXT, 2, '0') + '. ' + FLOOR(1 / 5::FLOAT) * 5 + '-' -> '00. 0-'
-- Remember how we generated the start of the bucket 
-- If we know the bucket is going to span 5 days (inclusive of start and end) we just add 4 to the start
SELECT LPAD(FLOOR(1 / 5)::TEXT, 2, '0') + '. ' + FLOOR(1 / 5::FLOAT) * 5 + '-' + (FLOOR(1 / 5::FLOAT) * 5 + 4) -> '00. 0-4'
--- Add final cosmetic piece ' Days'
SELECT LPAD(FLOOR(1 / 5)::TEXT, 2, '0') + '. ' + FLOOR(1 / 5::FLOAT) * 5 + '-' + (FLOOR(1 / 5::FLOAT) * 5 + 4) + ' Days' -> '00. 0-4 Days'

We’ve transformed from an input of CD_LD = 1 and assigned it a sortable, bucketable (probably not a word) value -> ‘00. 0-4 Days’

I’m worried about an unconstrained series of buckets… how do I cap it?⌗

SELECT  -- Define our buckets when we are below our threshold
       CASE WHEN CD_LD < 100 THEN LPAD(FLOOR(CD_LD / 5::FLOAT)::TEXT, 2, '0')
            || '. ' ||
       FLOOR(CD_LD / 5::FLOAT) * 5
            || '-' ||
       (FLOOR(CD_LD / 5::FLOAT) * 5 + 4)
            ||
       ' Days'
       
       -- Define final bucket when threshold exceeded
       ELSE LPAD(FLOOR(100 / 5)::TEXT, 2, '0')
          || '. >100' END AS CD_LD_DIST
     , COUNT(DISTINCT A_UNIQUE_ID_OF_A_ROW) AS NUM_ID
FROM DELTA_DIST
GROUP BY 1

I want to change the increment size⌗

-- Change increment size from 5 days to 7 days
-- 5 -> 7
-- 4 -> 6
SELECT LPAD(FLOOR(CD_LD / 7::FLOAT)::TEXT, 2, '0')
            || '. ' ||
       FLOOR(CD_LD / 7::FLOAT) * 7 --
            || '-' ||
       (FLOOR(CD_LD / 7::FLOAT) * 7 + 6) -- Generates the last day in the range  (+6 as I'm bucketing by 7 days at a time)
            ||
       ' Days' AS CD_LD_DIST -- Adds the Days string at the end for clarity
      , COUNT(DISTINCT A_UNIQUE_ID_OF_A_ROW) AS NUM_ID
FROM DELTA_DIST
GROUP BY 1

I have more than 99 buckets, the two digit solution wont work:⌗

-- Pad an additional 0 with LPAD
-- 
SELECT LPAD(FLOOR(CD_LD / 5::FLOAT)::TEXT, 3, '0') -- 2 -> 3
            || '. ' ||
       FLOOR(CD_LD / 5::FLOAT) * 5 
            || '-' ||
       (FLOOR(CD_LD / 5::FLOAT) * 5 + 4) 
            ||
       ' Days' AS CD_LD_DIST -- Adds the Days string at the end for clarity
      , COUNT(DISTINCT A_UNIQUE_ID_OF_A_ROW) AS NUM_ID
FROM DELTA_DIST
GROUP BY 1

Hope that’s helpful! Maybe there’s a better solution, I didn’t do an exhaustive search!