
Having, a less understood SQL clause

galkk

Having is less understood?

After 20+ years of SQL usage (as an ordinary dev, not business/reporting/heavy SQL dev), I learned about `group by cube` from this article...

"group by cube/coalesce" is much more complicated thing in this article than "having" (that could be explained as where but for group by)

CodeIsTheEnd

I had never heard of GROUP BY CUBE either! It looks like it's part of a family of special GROUP BY operators—GROUPING SETS, CUBE, and ROLLUP—that basically issue the same query multiple times with different GROUP BY expressions and UNION the results together.

Using GROUP BY CUBE(a, b, c, ...) creates GROUP BY expressions for every element in the power set of {a, b, c, ...}, so GROUP BY CUBE(a, b) does separate GROUP BYs for (a, b), (a), (b) and ().
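
For example, a minimal sketch (assuming a hypothetical sales table with region, year and amount columns):

    -- CUBE(region, year) produces groupings for (region, year),
    -- (region), (year) and the grand total ()
    SELECT region, year, SUM(amount) AS total
    FROM sales
    GROUP BY CUBE (region, year);
    -- NULLs in region/year mark the super-aggregate (subtotal) rows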

It's like SQL's version of a pivot table, returning aggregations of data filtered along multiple dimensions, and then also the aggregations of those aggregations.

It seems like it's well supported by Postgres [1], SQL Server [2] and Oracle [3], but MySQL only has partial support for ROLLUP with a different syntax [4].

[1]: https://www.postgresql.org/docs/current/queries-table-expres...

[2]: https://docs.microsoft.com/en-us/sql/t-sql/queries/select-gr...

[3]: https://oracle-base.com/articles/misc/rollup-cube-grouping-f...

[4]: https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.h...

nerdponx

I would gladly buy a book of "SQL Recipes" ranging from beginner-level to advanced stuff that uses features like this, ideally with coverage of at least a few popular database systems, but at minimum Postgres.

Is there such a book?

hobs

In fact, Yugabyte is giving it away for free - https://downloads.yugabyte.com/marketing-assets/O-Reilly-SQL...

eatonphil

Joe Celko has a number of SQL for Smarties books I've been meaning to look through.

ttfkam

Not a pivot table equivalent. Most useful for calculating multiple related aggregates at once for reporting purposes, but ROLLUP doesn't substitute values for columns, i.e. it doesn't pivot results on an axis.

MS SQL Server has native pivot support, and Postgres has a limited emulation of it through the crosstab(...) function. https://stackoverflow.com/a/11751905 https://www.postgresql.org/docs/current/tablefunc.html
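
For the curious, a hedged crosstab sketch (assuming the tablefunc extension is installed and a hypothetical sales table; note that crosstab makes you spell out the output columns by hand):

    -- run once per database: CREATE EXTENSION tablefunc;
    SELECT *
    FROM crosstab(
      'SELECT region, year, SUM(amount) FROM sales GROUP BY 1, 2 ORDER BY 1, 2'
    ) AS ct(region text, y2021 numeric, y2022 numeric);
    -- each distinct year becomes a column, pivoting the result on an axis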

For folks just learning about ROLLUP et al, I highly recommend this comparison chart for an overview of major features offered by modern relational databases. https://www.sql-workbench.eu/dbms_comparison.html

There's a whole constellation of advanced features out there that arguably most application developers are largely unaware of. (Which explains why most app devs still treat relational databases like dumb bit buckets at the far end of their ORMs.)

ttfkam

I should amend for clarity: GROUPING SETS are really useful for speeding up complex report generation.
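
As a quick sketch of what that means (hypothetical sales table again), one pass over the data replaces three UNIONed queries:

    SELECT region, year, SUM(amount) AS total
    FROM sales
    GROUP BY GROUPING SETS ((region, year), (region), ());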

zweifuss

Yes, PostgreSQL (9 and later), SQL Server (2008 and later), Oracle, SQL Anywhere and DB2 support grouping sets. All have had it for longer than a decade.

Some also support the MySQL/MariaDB WITH ROLLUP syntax, introduced in 2018.

mastax

I had a situation recently where I had a huge amount of data stored in a MariaDB database and I wanted to create a dashboard where users could interactively filter subsets and view the data. The naive solution of computing the aggregate statistics directly from the users' filter parameters was too slow; most of the aggregation needed to be done ahead of time and cached. The website's backend code was a spaghetti house of horrors, so I wanted to do as much as possible in the DB. (The first time in my career I chose to write more SQL rather than more code.)

If I had a fancy DB I could use CUBE or GROUPING SETS and MATERIALIZED VIEWs to easily pre-calculate statistics for every combination of filter parameters, automatically updated whenever the source data changes. But I had MariaDB, so I made do. I ended up with something like this:

    SELECT ... SUM(ABS(r.ilength)) AS distance, COUNT(*) AS intervals FROM r
    GROUP BY average_retro_bucket, customer, `year`, lane_type, material_type, state, county, district WITH ROLLUP
    HAVING average_retro_bucket IS NOT NULL AND customer IS NOT NULL;
"The WITH ROLLUP modifier adds extra rows to the resultset that represent super-aggregate summaries. The super-aggregated column is represented by a NULL value. Multiple aggregates over different columns will be added if there are multiple GROUP BY columns."

So you can query like this to get stats for all districts in CA->Mendocino county:

    SELECT * FROM stats_table WHERE state = 'CA' AND county = 'Mendocino' AND district IS NULL
or like this to get a single aggregate of all the counties in CA put together:

    SELECT * FROM stats_table WHERE state = 'CA' AND county IS NULL AND district IS NULL
However, unlike CUBE, WITH ROLLUP doesn't create aggregate result sets for each combination of grouping columns. If one grouping column is a NULL aggregate, all the following ones are too. So if you want to query all the years put together but only in CA, you can't do:

    SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'
If `year` is null, all the following columns are as well. The solution was to manually implement wildcards before the last filtered group column by combining the rows together in the backend.

I worked around not having materialized views by creating an EVENT that would re-create the stats tables every night. The stats don't really need to be real-time. Re-writing the multiple-GB statistics tables every night will wear out the SSDs in 20 years or so, oh well.

kmac_

SQL opens up when used with OLAP schemas. Most devs are experienced in querying "object mapped" schemas where CUBE, ROLLUP, etc. are not useful. Nothing bad per se, but it can give the impression that SQL is a bad language, when actually it clicks well with a proper data schema.

rjbwork

Indeed. I think your mind can really be opened by having to answer complex business questions with an expansive and well designed data warehouse schema. It's a shame it's such a relatively niche and unknown topic, especially in the startup world.

nerdponx

This is why the data engineers get paid the big bucks, and also why having a good data engineer is a lot more important than a good data scientist in the early stages of a company.

OJFord

I've never used `cube` in any context, but if I may I'd suggest you're parsing this wrongly:

`group by cube`/`group by coalesce` aren't special advanced features, they're just `group by`. You can group on 'anything', e.g. maybe you want to group on a name regardless of case or extraneous whitespace - you can use functions like `lower` and `trim` in the `group by` no problem, it's not something to learn separately for every function.
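
A quick sketch of what I mean (hypothetical people table):

    -- grouping on a normalized expression, not a bare column
    SELECT lower(trim(name)) AS normalized_name, COUNT(*) AS n
    FROM people
    GROUP BY lower(trim(name));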

zweifuss

Cube gets all possible combinations of grouping sets. It's like showing all subtotals in a pivot table. That's different than just grouping on the lowest level without totals.

remram

Your suggestion is incorrect. CUBE is not part of the expression.

OJFord

Well, I did say I wasn't familiar with it, but it's correct for `coalesce` and I don't think `cube` is different: https://www.postgresql.org/docs/current/cube.html

In the context of `group by` it's treated as grouping sets, but that's not its only use. (Though that does seem to be special cased in terms of parsing, since afaict - I can't find the full query BNF on mobile - `grouping sets` is not optional.)

cm2187

Exactly the same for me. It makes me think I need to start reading the TSQL documentation cover to cover.

hobs

Not a huge fan of Microsoft, but the TSQL documentation is solid. If you're not using CROSS APPLY to tear things apart and put them back together, you've not lived.
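
For anyone who hasn't lived yet, a minimal T-SQL sketch (hypothetical orders and line_items tables): CROSS APPLY runs a correlated subquery per outer row, here a top-N-per-group.

    SELECT o.order_id, li.item_name, li.created_at
    FROM orders AS o
    CROSS APPLY (
        -- the three most recent line items for this particular order
        SELECT TOP (3) l.item_name, l.created_at
        FROM line_items AS l
        WHERE l.order_id = o.order_id
        ORDER BY l.created_at DESC
    ) AS li;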

Simon_O_Rourke

Heard of group by cube before, even tried it in production, but it was a total dog of a query to run, so I filed it away to try again in a few years.

RexM

Same!

I’ve been a developer for 15 years, consider myself pretty good with SQL, but I’m only now learning about group by cube.

ramraj07

The only “standard” feature I'd rather try not to understand is recursive CTEs lol

conceptme

It's pretty useful when working with hierarchical data, but you do need to put in some check for cyclical relations; I have seen those take an application down :D
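
A minimal sketch of such a guard (Postgres-flavored, hypothetical employees table): carry the path walked so far and refuse to revisit a row.

    WITH RECURSIVE chain AS (
        SELECT id, manager_id, ARRAY[id] AS path
        FROM employees
        WHERE manager_id IS NULL              -- roots
        UNION ALL
        SELECT e.id, e.manager_id, c.path || e.id
        FROM employees e
        JOIN chain c ON e.manager_id = c.id
        WHERE e.id <> ALL (c.path)            -- cycle guard
    )
    SELECT * FROM chain;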

dspillett

If you aren't careful you can cause that without infinite recursion. If the query optimiser can't push relevant predicates down to the root level where needed for best performance, or they are not sargable anyway, or the query optimiser simply can't do that (until v12, CTEs were an optimisation barrier in Postgres), then you end up scanning whole tables (or at least whole indexes) multiple times, where a few seeks might be all that is really needed. In fact, you don't even need recursion for this to have a bad effect.

CTEs are a great feature for readability, but when using them be careful to test your work on data at least as large as you expect to see in production over the life of the statements you are working on, in all DBMSs you support if your project is multi-platform.

eyelidlessness

They’re just self joins on views of the query.

teddyh

HAVING is less understood? There's nothing strange about HAVING; it's just like WHERE, but it applies after GROUP BY has grouped the rows, and it can use the grouped row values. (Obviously, this is only useful if you actually have a GROUP BY clause.) If HAVING did not exist, you could do the same thing with a subselect: SELECT * FROM (SELECT * FROM … WHERE … GROUP BY …) WHERE …; is, IIUC, equivalent to SELECT * FROM … WHERE … GROUP BY … HAVING …;
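
Concretely (a sketch with a hypothetical emp table), these two should be equivalent:

    SELECT dept, COUNT(*) AS n
    FROM emp
    GROUP BY dept
    HAVING COUNT(*) > 5;

    SELECT *
    FROM (SELECT dept, COUNT(*) AS n
          FROM emp
          GROUP BY dept) AS t
    WHERE n > 5;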

syntheticcdo

FWIW I just tested a somewhat complex query using HAVING vs a sub-select as you indicated and Postgres generated the same query plan (and naturally, results) for both.

djbusby

A DB wizard I used to work for showed me HAVING after looking at this nasty sub-sub select I did, with some app-layer-loop after.

Caesar, if you see this: thanks for being a great mentor.

nerdponx

Snowflake SQL also has the interesting feature QUALIFY.

Their docs go into more detail (https://docs.snowflake.com/en/sql-reference/constructs/quali...), but the short version is that typically SELECT is evaluated in the order FROM, WHERE, GROUP BY, HAVING, WINDOW, DISTINCT, ORDER BY, LIMIT.

But what happens if you want to filter on the result of a WINDOW? Sorry, time to write a nested query and bemoan the non-composability of SQL.

Snowflake adds QUALIFY, which is executed after WINDOW and before DISTINCT. Therefore you can write some pretty interesting tidy queries, like this one I've been using at work:

    SELECT
      *,
      row_number() OVER (
        PARTITION BY quux.record_time
        ORDER BY quux.projection_time DESC
      ) AS record_version
    FROM foo.bar.quux AS quux
    WHERE (NOT quux._bad OR quux._bad IS NULL)
    QUALIFY record_version = 1
Without QUALIFY, you'd have to nest queries (ugh):

    SELECT *
    FROM (
      SELECT
        *,
        row_number() OVER (
          PARTITION BY quux.record_time
          ORDER BY quux.projection_time DESC
        ) AS record_version
      FROM foo.bar.quux AS quux
      WHERE (NOT quux._bad OR quux._bad IS NULL)
    ) AS quux_versioned
    WHERE quux_versioned.record_version = 1
or use a CTE (depending on whether your database inlines CTEs).

I definitely pine for some kind of optimizing Blub-to-SQL compiler that would let me write my SQL like this instead:

    (query (from foo.bar.quux :as quux)
           (select *)
           (where (or (not quux._bad)
                      (null quux._bad)))
           (select (row-number :over (:partition-by quux.record-time
                                      :order-by (:desc quux.projection-time))
                               :as record-version))
           (where (= 1 record-version)))

someforevery

> I definitely pine for some kind of optimizing Blub-to-SQL compiler

I've been playing with malloy[1] that lets you pipeline/nest queries like you are describing here.

  source: quux as from_sql(..) {
    where: record_version = 1 
    where: _bad != null  
  }
1. https://looker-open-source.github.io/malloy/documentation/la...

larsrc

But HAVING can also act on aggregate results. In fact, the example in the article is not the most important use for HAVING. Subselects can't do something like:

SELECT year, COUNT(*) sales, SUM(price) income FROM sales GROUP BY year HAVING sales > 10 AND income < 1000;

p4bl0

Why wouldn't

    SELECT *
    FROM (SELECT year, COUNT(*) as nb_sales, SUM(price) as income
          FROM sales
          GROUP BY year) AS per_year
    WHERE nb_sales > 10 AND income < 1000;
work just like your example?

dspillett

For more complicated examples, HAVING can produce easier to read/understand/maintain statements.

There may also be performance differences depending on the rest of the query, but for simple examples like this exactly the same plan will be used, so the performance will be identical.

Unfortunately, the simplest examples do not always illustrate the potential benefits of less common syntax.

Linosaurus

This should work, right? Just a bit more unnecessary text.

SELECT year, sales, income FROM ( SELECT year, COUNT(*) sales, SUM(price) income FROM sales GROUP BY year ) AS innerquery WHERE sales > 10 AND income < 1000;

ww520

Having is the where-clause for Group By. It's easier to understand by thinking of the SQL query as a pipeline.

Stage 1: From returns the whole world of rows.

Stage 2: Where filters down to the desired set of rows.

Stage 3: Group By aggregates the filtered rows.

Stage 4: Having filters again on the aggregated result.

Stage 5: Select picks out the columns.
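
To make the stages concrete (a sketch over a hypothetical employees table with a boolean active column):

    SELECT dept, AVG(salary) AS avg_sal    -- stage 5: pick columns
    FROM employees                         -- stage 1: all rows
    WHERE active                           -- stage 2: filter rows
    GROUP BY dept                          -- stage 3: aggregate
    HAVING AVG(salary) > 50000;            -- stage 4: filter groups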

occamrazor

What I never understood is why HAVING and WHERE are different clauses. AFAIU, there are no cases where both could be used, so why can’t one simply use WHERE after a GROUP BY?

(I know that I am probably missing some important technical points, I would like to learn about them)

qsort

It doesn't improve the power of SQL, it's just syntactic sugar. Because of how the SQL syntax works, you'd have to do something like:

   WITH A AS (
        SELECT x, sum(y) AS z
          FROM SomeTable
      GROUP BY x
   )

   SELECT * FROM A WHERE z > 10
With a HAVING clause you can instead just tuck it after the GROUP BY clause.

Also, while it's not an issue these days given how good query planners are (any decent engine will produce exactly the same query plan with a subquery or a HAVING clause; it's indexes that fuck up stuff), you're signaling that the filter happens "at the end".

It's like having both "while" and "for" in a programming language. Technically you don't need it, but it's for humans and not for compilers.

snidane

    select product
         , sum(price) as price 
      from table
     where price<1000
    having price>10000

You can refer to the aliased price column before or after aggregation using WHERE or HAVING, depending on the SQL engine.

occamrazor

I didn’t know one could still refer to the value before aggregation after it is aliased.

Is there an implicit group by in this query?

mwexler

This is non-standard and highly dependent on the SQL engine. If you believe in portability, your WHERE clauses should (sadly) use the long form.

If that's not a requirement, this approach can add some clarity.

hobs

There are plenty of cases - remove the people who have some attribute, then aggregate them, then filter the aggregate.

Find me the list of non-deleted users who have more than 50 dollars worth of transactions in the transaction table.

Technically you can always subquery and use a WHERE instead of a HAVING, but it's nice to ... have.
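
That example as a sketch (hypothetical users/transactions schema):

    SELECT u.id, SUM(t.amount) AS total
    FROM users u
    JOIN transactions t ON t.user_id = u.id
    WHERE NOT u.deleted              -- row filter, before grouping
    GROUP BY u.id
    HAVING SUM(t.amount) > 50;       -- aggregate filter, after grouping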

croes

I pretty often have cases where I use both WHERE and HAVING: WHERE filters the rows before the GROUP BY, and HAVING filters afterwards.

kevincox

HAVING is a hack. Because SQL has such an inflexible pipeline we needed a second WHERE clause for after GROUP BY.

It would be nice if we could have WHERE clauses anywhere. I would like to put WHERE before joins, for example. The optimizer can figure out that the filter should happen before the join, but it's nice to have that clear when looking at the code, and it often makes the query easier to read because you have to hold less state in your head. You can quickly be sure that this query only deals with "users WHERE deleted" before worrying about what data gets joined in.

bonzini

You can put WHERE before joins using nested queries, i.e. "FROM (SELECT * FROM users WHERE deleted) INNER JOIN ...".

mastax

I like using HAVING just to have conditions that reference expressions from the SELECT columns.

e.g. rather than having to do

    SELECT
        COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
        ...
    FROM ...
    WHERE COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) IS NOT NULL
just do

    SELECT
        COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
        ...
    FROM ...
    HAVING district IS NOT NULL
Hopefully the optimizer understands that these are equivalent, I haven't checked.

p4bl0

This should work with WHERE too, at least as long as the name given using AS in the SELECT refers to a plain row value and not to an aggregate.

gigatexal

I use this all the time to find dups:

    SELECT pk_columns, COUNT(column_that_might_have_dups)
    FROM some_table
    GROUP BY pk_columns
    HAVING COUNT(*) > 1

redhale

Forget HAVING -- in my opinion RANK is the most useful SQL function that barely anyone knows about. Has gotten me out of more sticky query issues than I can count.

jupiter909

The entire family of window functions, which are nearly ubiquitous across modern SQL engines, seems to be glossed over by most.

mike256

RANK is a good example of a useful, less-known window function. Please also consider DENSE_RANK and ROW_NUMBER. All three have their advantages and disadvantages: RANK leaves holes after ties, DENSE_RANK gives consecutive numbers, and ROW_NUMBER just numbers the rows.
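
Side by side (a sketch over a hypothetical scores table):

    SELECT score,
           RANK()       OVER (ORDER BY score DESC) AS rnk,   -- 1, 1, 3, ...
           DENSE_RANK() OVER (ORDER BY score DESC) AS drnk,  -- 1, 1, 2, ...
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rnum   -- 1, 2, 3, ...
    FROM scores;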

tantalor

Calling each select "a sql" is really cute

OJFord

It doesn't irk me anything like as much as 'a Docker' (a Docker what? Usually container) for some reason.

Although perhaps that shouldn't annoy me anyway; it ought to be better than being specific but inaccurate - mixing up image/container as people do - which you could probably expect from someone unfamiliar enough to say 'a Docker'.

kaoD

Agree. Let me create a quick Jira for that.

OJFord

Thanks for your HN, appreciate it.

fzil

> Agree. Let me create a quick Jira for that.

Yeah, that really irritates me for some reason.

higeorge13

Having vs where is my first question to filter candidates who have less experience in sql than they claim.

notimetorelax

Well, I used SQL professionally for at least 7 years in different capacities, and “having” was a new construct to me. You might say that I suck at SQL, or you might realize that there's more than one way to achieve a result in SQL. I usually prefer “with” statements. I hope this is not the only criterion you use to filter candidates. In my mind a better question would be to state the problem and see if the candidate can come up with a working SQL statement.

higeorge13

I would ask you more questions about things like window functions, CTEs, joins, etc., or how you would address certain problems, and tbh I would be really curious how you know SQL well but haven't heard of HAVING.

In my hiring experience it has always acted as my first SQL impression of candidates, and a 100% filter for people who don't actually know SQL beyond some basic inner joins and aggregations. There are always exceptions, I guess. :-)

SnowHill9902

It’s also a hint that the candidate is not curious and doesn’t thoroughly read the docs, which is a minus.

xivzgrev

Can anyone explain why the query without HAVING needs 14 separate queries? That seemed insane to me.

It seems like the author is using one query per country, whereas you'd just group by country and year, with WHERE country <> 'USA'.

You would need some unions to bolt on the additional aggregations, but it's more like 4 queries, not 14.

E.g.:

    SELECT c.ctry_name, i.year_nbr,
           SUM(i.item_cnt) AS tot_cnt, SUM(i.invoice_amt) AS tot_amt
    FROM country c
    INNER JOIN invoice i ON (i.ctry_code = c.ctry_code)
    WHERE c.ctry_name <> 'USA'
    GROUP BY c.ctry_name, i.year_nbr

Linosaurus

> Can anyone explain why the query without having needs 14 separate queries? That seemed insane to me.

Yeah, four queries for the four requirements seems the most straightforward, and easier to maintain than the final version.

But still a very nice illustration of GROUP BY CUBE.

samatman

The generous reading, which I'm inclined to, is that it's illustrating OLAP techniques which one would actually apply to a relational database of high normalization.

It's tedious to construct an example database of, say, fifth normal form, which would show the actual utility of this kind of technique. So we're left with a highly detailed query, with some redundancies which wouldn't be redundant with more tables.

pedrow

My SQL knowledge is very limited - I had heard of HAVING but not GROUP BY CUBE or COALESCE - but one thing stood out: "The rewritten sql ... ran in a few seconds compared to over half an hour for the original query." I know there were four million rows in the dataset, but is 30 minutes the kind of run-time you would expect for a query like this?

noir_lord

> I know there were four million rows in the dataset, but is 30 minutes the kind of run-time you would expect for a query like this?

Clearly not since they got it down to a few seconds ;).

But tongue in cheek aside, it's incredibly dependent on what is in those 4M rows, how big they are, whether they are indexed, whether you join in weird ways, and whether the query planner does something unexpected.

SQL tooling is by and large pretty barbaric compared to most other tools these days. IntelliJ (and its various spin-off language-specific IDEs) has the best tooling I've seen in that area, but even then it's primitive.

fuy

I'd have to disagree with you on the tooling. Mature (especially commercial) SQL databases have a lot of great tooling built around them that in many areas surpasses most language tooling. For example, Extended Events, which comes with SQL Server by default, allows you to trace/profile/filter hundreds of different events in real time, from simple things like query execution tracing all the way to profiling locks, spinlocks, IO events and much more. There's another built-in tool called Query Store that allows you to track performance regressions, aggregate performance statistics, etc. And then there's a whole ecosystem of 3rd-party tools for things like execution plan analysis, capacity planning, etc. Oracle has a similarly rich set of tools. Postgres is lacking some of those, but it's getting better. IDE support in JetBrains products is not that far from, say, the Java experience, but from a pure coding perspective it's a bit behind.

mnsc

I think that when doing these kinds of report-style queries, cubed/rolled-up columns become labels and lose their data type. So it makes sense to convert+coalesce the year column as well, to get 'All years' instead of the hard-to-interpret NULL year.