Show HN: A benchmark for analytical databases (Snowflake, Druid, Redshift)

benchmark.clickhouse.com

I created a web page to compare different analytical databases (both self-managed and services, open-source and proprietary) on a realistic dataset. It covers 20+ databases, each with installation and data-loading scripts, and they can be compared with each other on a set of 43 queries, by data load time, or by storage size.
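For anyone curious what the measurement loop looks like per query, here is a minimal sketch. It uses sqlite3 purely as a stand-in for the systems under test and a synthetic table; it is not the actual ClickBench harness, and the query/run counts are illustrative.

```python
import sqlite3
import time

# Minimal sketch of a per-query measurement loop: run each query
# several times so cold and warm timings both show up in the results.
QUERIES = [
    "SELECT COUNT(*) FROM hits",
    "SELECT region, COUNT(DISTINCT user) FROM hits GROUP BY region",
]
TRIES = 3  # assumed number of runs per query

def run_benchmark(conn):
    results = []
    for sql in QUERIES:
        timings = []
        for _ in range(TRIES):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # force full result materialization
            timings.append(time.perf_counter() - start)
        results.append((sql, timings))
    return results

# Tiny synthetic table so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (region TEXT, user INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("eu", i % 50) for i in range(1000)])

for sql, timings in run_benchmark(conn):
    print(sql, ["%.5f" % t for t in timings])
```

The real benchmark additionally records data load time and on-disk size per system; this only shows the query-timing part.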

There are switches to select different types of databases for comparison - for example, only MySQL compatible or PostgreSQL compatible.

If you play with the switches, many interesting details will be uncovered.

Full description: https://github.com/ClickHouse/ClickBench/blob/main/README.md

gianm

This is impressive work: it's time-consuming to set up and benchmark so many different systems!

Impressiveness of the effort notwithstanding, I also want to encourage people to do their own research. As a database author myself (I work on Apache Druid) I have really mixed feelings about publishing benchmarks. They're fun, especially when you win. But I always want to caution people not to put too much stock in them. We published one a few months ago showing Druid being faster than Clickhouse (https://imply.io/blog/druid-nails-cost-efficiency-challenge-...) on a different workload, but we couldn't resist writing it in a tongue-in-cheek way that poked fun at the whole concept of published benchmarks. It just seems wrong to take them too seriously. I hope most readers took the closing message to heart: benchmarks are just one data point among many.

That's why I appreciate the comment "All Benchmarks Are Liars" on the "limitations" section of this benchmark -- something we can agree on :)

thomoco

Benchmarks can be quite difficult to interpret when the test parameters vary between tests. However, I think the point of providing the open-source ClickBench benchmark [https://github.com/ClickHouse/ClickBench] is exactly to allow users to do their own research, by providing a standardized client and workload across any SQL-based DBMS. Standardized benchmarking is an important technique for comparing across different applications, or for comparing the same application across different environments (compute, storage, cloud vs. self-managed, etc.). SPEC [https://www.spec.org] used to do a great job of developing and releasing standardized benchmarks, although their activity has waned of late.

AdamProut

Any idea why Druid performed so poorly, though? 100x slower seems odd. I thought Druid was reasonably good at single-table analytics like in this benchmark. Is it the small data size?

gianm

It does seem odd, especially since in real world cases I'm more accustomed to seeing Druid and ClickHouse be in the same ballpark of performance. Sometimes one is somewhat faster than the other. But in my experience that's more like 2–3x difference in one direction or the other depending on workload, not 100x.

Hard to say why more of a difference shows up here, since I haven't analyzed the benchmark. It's possible the Druid configuration is suboptimal in some way. It's also possible it has something to do with the setup. It appears that the ClickHouse tests were done using a local table, which there isn't an equivalent of in Druid. Druid treats every table like what ClickHouse would call a "distributed" table. My understanding is using a distributed table in ClickHouse adds overhead since the system can no longer assume all data is on a single server. It may be that using distributed queries in both systems would yield a different result. And of course it may be that some of the test queries exercise functionality where ClickHouse is legitimately better optimized. But, again, hard to say anything for certain without detailed analysis.

abhishcool

In this benchmark, Druid was killed and restarted after every query, because it seems to get into a degraded state otherwise. It's very, very likely that something is wrong with the Druid setup here. It would have been useful to know basic details such as the Druid version.

qoega

Many queries that did not run had aggregations over strings, like MIN/MAX. I don't know the specifics of why many Java-based DBMSs lack these aggregate functions.

lk888

Very well said!

AdamProut

It looks like the queries are all single table queries with group-bys and aggregates over a reasonably small data set (10s of GB)?

I'm sure some real workloads look like this, but I don't think it's a very good test case to show the strengths/weaknesses of an analytical database's query processor or query optimizer (no joins, unions, window functions, or complex query shapes?).

For example, if there were any queries with complex joins, ClickHouse would likely not do very well right now given its immature query optimizer (ClickHouse blogs always recommend denormalizing data into tables with many columns to avoid joins).

zX41ZdbW

There are many limitations of this benchmark, indeed: https://github.com/ClickHouse/ClickBench/#limitations

qoega

There are several existing benchmarks that test query optimisers with a lot of joins. That does not show the performance of the query engine so much as how well your optimiser was tailored for those queries.

AdamProut

I think you're missing my point. The page is titled "a Benchmark For Analytical DBMS", not "A Benchmark for Single-Table Query Execution". Most analytical workloads are more complex than single-table queries.

I didn't say it wasn't useful to test single-table columnstore performance on a workload that runs best on single-host databases, just that this isn't the be-all and end-all of analytical database performance testing.

zX41ZdbW

You are absolutely right. That's why this benchmark is named "a Benchmark For Analytical DBMS", not "the definitive benchmark for analytical DBMS".

ruw1090

There's a lot more involved in an execution engine running complex queries (beyond single-table group-bys) than just the query optimizer, though that is important: things like join implementations and their associated optimizations, shuffle performance (which matters even for single-table queries as you scale), etc.

riku_iki

> That does not show the performance of the query engine so much as how well your optimiser was tailored for those queries.

You can join just two large tables without leaving much room for the query optimizer.

doliveira

But isn't that the main goal of analytical databases? They're not for data warehousing.

FridgeSeal

I’m somewhat convinced that the “difference” between OLAP and “data warehouses” is shady advertising.

Structurally they're really similar. I suspect some vendors couldn't match the outright performance of existing OLAP DBs, so they added extra features to differentiate enough to justify a new product category, and then talked endlessly about how OLAP databases aren't capable of handling this brave new future, even though for the majority of workloads people would be better off just going with a "boring" OLAP database.

Large parts of this comment are directed pointedly at Snowflake.

abrazensunset

I think it's more a matter of comparing minivans (cloud "DWH" engines) to sports cars (Clickhouse et al) here.

Snowflake's performance characteristics & ops paradigm have always been more consistent with managed Spark than anything else, thus the competition with Databricks. They have only recently started pretending to be anything other than a low-maintenance batch processor with a nice managed storage abstraction, and their pricing model reinforces this.

That being said, for now it's pretty hard to find something that gives you:

- Bottomless storage
- Always "OK" performance
- Complete consistency without surprises (synchronous updates, cross-table transactions, snapshot isolation)
- The ability to happily chew through any size join and always return results
- Complete workload isolation

...all in one place, so people will probably be buying Snowflake credits for a few years yet.

I'm excited about the coming generation (cf. StarRocks and the ClickHouse roadmap), but the workloads and query patterns for OLAP and DWH only overlap due to marketing and the "I have a hammer" effect.

I don't think the slight misuse of either type of engine is bad at small-to-medium scale, either. It's healthy to make "get it done" stacks with fewer query engines, fewer integration points, and already-known system limitations.

AdamProut

I don't know where you draw the line between SQL analytics and SQL data warehousing.

I think your typical analytical workload definitely involves more data than this benchmark, though. Something like DuckDB is more ideal for this small a data set; 10s of GB of data can be analysed on a laptop, so you don't need a full-fledged database server.

zX41ZdbW

DuckDB is included in this benchmark, but there were many OOMs and it was not easy to make it work: https://github.com/duckdb/duckdb/issues/3969

The data size is 75 GiB in uncompressed CSV and 13.7 GiB in Parquet.

singhrac

A small suggestion: since the machines aren't always comparable (e.g. Redshift), maybe it makes sense to allow the user to make a plot of ($/hr) vs query time? I realize this adds complexity, but most people in this space are making a trade-off between maintainability/ease-of-use, long-running costs, and performance, and plotting the latter two would probably highlight the Pareto frontier.
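To make that concrete, here's a tiny sketch of the dominance check behind a Pareto frontier. The system names and the ($/hr, seconds) numbers are made up for illustration, not real benchmark results.

```python
# Keep only systems for which no other system is both cheaper AND
# faster; those survivors form the Pareto frontier of the cost/speed
# trade-off. All numbers below are illustrative.
systems = {
    "system-a": (0.68, 12.0),   # cheap but slow
    "system-b": (2.72, 4.5),
    "system-c": (3.26, 6.1),    # costs more AND is slower than system-b
    "system-d": (13.04, 1.9),   # expensive but fastest
}

def pareto_frontier(points):
    frontier = []
    for name, (cost, t) in points.items():
        dominated = any(
            c <= cost and s <= t and (c, s) != (cost, t)
            for other, (c, s) in points.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

print(pareto_frontier(systems))  # system-c drops out
```

Plotting cost on one axis and query time on the other and highlighting these survivors would show exactly the trade-off curve you describe.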

zX41ZdbW

Yes, this is what I want to include - the cost metric.

Although it will be tricky for systems with consumption-based pricing: BigQuery and Athena.

And Snowflake has almost consumption-based pricing: while you pay for warehouse time, it can quickly spin warehouses up and down based on load.

Maybe use an assumption like: the price for running queries on a consumption-based system is equalized to one hour of run time on the other systems...
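Something like this, with made-up numbers (none of these are real vendor prices):

```python
# Sketch of the equalization idea: charge a consumption-based system
# as if the whole benchmark occupied one hour of run time on a
# provisioned system. All prices here are illustrative assumptions.
def provisioned_cost(dollars_per_hour, hours=1.0):
    """Provisioned systems: pay for wall-clock time, assume a 1h run."""
    return dollars_per_hour * hours

def consumption_cost(tb_scanned, dollars_per_tb):
    """Consumption systems: pay per TB scanned, regardless of duration."""
    return tb_scanned * dollars_per_tb

# Hypothetical comparison: a 16-vCPU VM for the assumed one hour,
# vs. a pay-per-scan service reading 0.55 TB at $5/TB over the suite.
print(provisioned_cost(2.72))
print(consumption_cost(0.55, 5.0))
```

The sensitive assumption is the "one hour" normalization: a consumption-based system that answers the suite in minutes still gets compared against a full hour of VM time.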

ruw1090

Yea, Redshift looks like it was run on an ra3.xl (4 cores) and is comparing against clickhouse on a c6a.4xl (16 cores). I suspect if this were normalized Redshift would have an edge.

zX41ZdbW

The Redshift configuration was selected to get 16 cores in total: 4 nodes of ra3.xlplus (4 vCPUs each) for 16 vCPUs in total.

https://github.com/ClickHouse/ClickBench/tree/main/redshift

We can add more results on different configurations easily.

There is also Redshift Serverless, which does not have any hardware configuration to tune.

ruw1090

That makes sense, thanks for clarifying. I got confused by the UI.

derN3rd

My company recently started to invest more in analytics, and we had to decide which backend/database to build on. After some research we settled on ClickHouse, and we couldn't be happier.

- Super easy to set up
- Easy to back up
- Fast configuration (documentation could be better in some parts)
- SQL dialect similar to what our devs use in MySQL

The only negative points I could find so far:

- No "good" management GUI like phpMyAdmin, pgAdmin, or MySQL Workbench
- Caching/batching layers are not directly implemented, only available through external software

As we are a fairly small company, any of the other analytical databases would have cost us a lot more money/time. Friends of ours recently hired a group of data engineers/analysts who brought all their AWS knowledge and tooling with them, which got them basically the same outcome as us, while we have only 5% of their costs and all our devs are able to either ingest or query the data.

EDIT: Does anyone have recommendations on what GUIs I could give our PMs to work with ClickHouse instead of writing queries? Every SaaS I found either didn't support ClickHouse yet or would cost us a newborn. Also, what tools do your devs use when they work with ClickHouse data?

glogla

I like a lot of things about Clickhouse but one thing I'm afraid of is what happens when your data won't fit on a single machine. The replication and sharding seem pretty difficult, and from reading the documentation it feels like they might be pretty fragile.

I think once you reach that scale, systems that completely separate data and compute (like Snowflake or Trino+S3) are much less of a pain to run, since even if you completely blow up your compute, the data stays.

> Does anyone have some recommendations on what GUIs I could give our PMs to work with ClickHouse instead of writing queries? All SaaS I found didn't support ClickHouse yet or would cost us a newborn. Also what tools do your devs use when they work with ClickHouse data?

I think both Superset and Metabase were interesting choices - if you want to save money (at the expense of engineering time) you can self-host them.

hodgesrm

> I like a lot of things about Clickhouse but one thing I'm afraid of is what happens when your data won't fit on a single machine.

ClickHouse sharding and replication is not that hard to master--it's simple and the parts are visible. If you don't want the headache of distributed system management, run it in Kubernetes or a managed service. ClickHouse-as-a-service is widely available from multiple vendors.

ClickHouse is also fast and cost-efficient at scale. It's a very good fit for multi-tenant SaaS analytics where you need fixed latency on responses to users.

Disclaimer: I work for Altinity, who run a cloud platform for ClickHouse.

nojito

eBay wrote a great blog post about clustering ClickHouse:

https://tech.ebayinc.com/engineering/ou-online-analytical-pr...

lk888

> The replication and sharding seems pretty difficult and from reading the documentation feels like it might be pretty fragile.

That's why trip.com replaced ClickHouse with StarRocks, a free (open-source) OLAP database that handles sharding beautifully to give you linear scalability, and better query performance for single-table or joined queries.

gaploid

We built a managed ClickHouse service to address exactly these difficulties with the technology. We handle sharding, clustering, ZooKeeper, patching, updates without downtime, and hybrid storage based on S3. https://double.cloud

AdamProut

How many Clickhouse as a service offerings exist now? I stopped counting at 7 a few months ago (double.cloud was not on my list).

zX41ZdbW

ClickHouse can run in a classic shared-nothing setup and in a "cloud-native" setup with shared storage. Setting up a distributed system with hundreds of machines can be difficult... but it's actually no more difficult than for any other distributed system at this scale.

blinkov

Just in case: here's the built-in batching https://clickhouse.com/docs/en/operations/settings/settings/... a list of mature UIs https://clickhouse.com/docs/en/connect-a-ui (which all have SaaS offerings, I believe), and all third-party UIs https://clickhouse.com/docs/en/interfaces/third-party/gui

derN3rd

Thanks!

I remember trying the built-in batching, but we had some trouble with it, so we switched to <https://github.com/nikepan/clickhouse-bulk>, which has worked without any issue since then.

I'll have a look at the UIs listed there.
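For context, what a batching proxy like that does is conceptually simple: collect many small inserts and flush them as one big INSERT when either a row-count or an age threshold is hit. A rough sketch with hypothetical names (not clickhouse-bulk's actual code), using a plain list as a stand-in for the database:

```python
import time

# Buffer incoming rows; flush when the buffer is full or the oldest
# buffered row exceeds max_age_s. Illustrative sketch only.
class InsertBatcher:
    def __init__(self, sink, max_rows=1000, max_age_s=1.0):
        self.sink = sink          # stand-in for the database connection
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.first_row_at = None

    def insert(self, row):
        if not self.buffer:
            self.first_row_at = time.monotonic()
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_rows
                or time.monotonic() - self.first_row_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # one big INSERT
            self.buffer.clear()

batches = []
batcher = InsertBatcher(batches, max_rows=3)
for i in range(7):
    batcher.insert({"id": i})
batcher.flush()  # flush the final partial batch
print([len(batch) for batch in batches])  # [3, 3, 1]
```

The point of doing this in front of ClickHouse is that its MergeTree engines strongly prefer few large inserts over many tiny ones.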

ceeplusplus

Tableau is second to none for dataviz imo. Much more versatile than the competition. You can do interactive visualizations, set thresholds, color code/label data, etc. Of course it costs a lot, but if dataviz is something that is delivering real value the cost is more than worth it.

gbrits

ClickHouse support for TablePlus landed two months ago: https://github.com/TablePlus/TablePlus/issues/670

simonw

Anyone got a succinct explanation for why ClickHouse is doing so well in this kind of benchmark? What are the key tricks they're using to get great analytical performance, and why haven't those same tricks been adapted by other competing systems?

zX41ZdbW

I see that other systems often overlook many small details.

Imagine you are implementing the COUNT(DISTINCT) function, and imagine it is used with GROUP BY something, like GROUP BY region_id. On realistic datasets, you will have a large number of different region_ids and something like a power-law distribution of their frequency: https://en.wikipedia.org/wiki/Zipf%27s_law

Then you can expect most of the values of COUNT(DISTINCT) to be small, and the natural optimization is to use a linear array before switching to a hash table, or to place the smaller hash tables in something like a memory pool.

It comes down to a ton of optimizations like this, tuned to the distributions of real datasets.
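Roughly, the linear-array-before-hash-table idea looks like this. This is an illustrative sketch, not ClickHouse's actual implementation, and the switch-over threshold is an assumption:

```python
from collections import defaultdict

THRESHOLD = 8  # assumed switch-over point

# Per-group distinct counter: store the first few values in a plain
# list (cheap, cache-friendly linear scan) and upgrade to a hash set
# only once the group grows. Under a Zipf-like distribution most
# groups stay small and never pay for a hash table.
class AdaptiveDistinct:
    def __init__(self):
        self.small = []    # linear array while cardinality is low
        self.large = None  # hash set once it grows

    def add(self, value):
        if self.large is not None:
            self.large.add(value)
        elif value not in self.small:        # linear scan over tiny n
            self.small.append(value)
            if len(self.small) > THRESHOLD:  # upgrade to a hash table
                self.large = set(self.small)

    def count(self):
        return len(self.large) if self.large is not None else len(self.small)

# COUNT(DISTINCT user) ... GROUP BY region, on skewed data:
groups = defaultdict(AdaptiveDistinct)
for region, user in [("eu", 1), ("eu", 1), ("eu", 2), ("us", 7)]:
    groups[region].add(user)
print({r: g.count() for r, g in groups.items()})  # {'eu': 2, 'us': 1}
```

The win is that the common case (a small group) touches only a short contiguous array, and the hash-table cost is paid only by the few heavy groups.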

drchaim

I would say deep optimizations in every aspect of the data cycle: everything from storing to querying data is heavily optimized. I recall an online meetup where a 10ms gain from a patch was hugely celebrated. That kind of thing compounds over and over and makes a really fast data platform for these use cases.

kthejoker2

Seems like a great ad for c6a.metal !

Without dollars as part of the analysis, I don't find these comparisons useful.

drchaim

It's curious that there are no c6a.metal instances available in eu-west-2 today; I would say last week there were some in this region.

garciasn

Can BigQuery be added? I'm interested to know how it compares in these sorts of scenarios.

We use BQ because I do NOT want to have to deal w/management of any of the infrastructure; paying by query is convenient for our small team. But if there is true value that can be realized by something like ClickHouse, I want to see how it compares.

zX41ZdbW

I've added BigQuery, run the benchmark, and got the results... but due to their ToS, the benchmark results cannot be published. See https://cube.dev/blog/dewitt-clause-or-can-you-benchmark-a-d...

The benchmark instructions for BigQuery are here: https://github.com/ClickHouse/ClickBench/tree/main/bigquery

If you follow them, just add the .json with the results and regenerate the HTML with the generate-results.sh script. You will get a version of the benchmark page with BigQuery included, for your own use.

I wish more vendors would get rid of the DeWitt Clause sooner. For example, Snowflake removed it a few months ago (maybe due to pressure from Databricks), and SingleStore also removed the clause recently.

garciasn

Much appreciated.

throwawaylala1

Wow Snowflake absolutely crushes these benchmarks. Anyone know why? I don't know much about this space.

ruw1090

The Snowflake clusters are several factors bigger (and more expensive) than almost all of the comparison points.

mdcallag

README.md was interesting to read. I enjoyed learning about the history of the effort and I appreciate how much work got it to this point.

I enjoy both benchmarking and benchmarketing. Although benchmarketing is more fun when it isn't about the DBMS I am working on.

arikfr

This is great! Will it be possible to include Databricks in the benchmark?

zX41ZdbW

Databricks is in the list of systems I want to add: https://github.com/ClickHouse/ClickBench/#systems-included

Lucasoato

Yeah, also a comparison with SparkSQL on Kubernetes would be interesting! Let's get in touch, I may help with that.

ruw1090

Do you have any benchmark results with ClickHouse in a clustered configuration (and other DBs)? Or is ClickHouse expected to run only in single-node deployments?

zX41ZdbW

ClickHouse is typically run on a cluster; there are setups with over 1,000 machines and over 100 PB of data.

But for this benchmark, the simplest setup was selected. I can add a clustered ClickHouse setup to this benchmark.

riku_iki

> Regarding the joined table queries that are missing in the tests

Looks like a huge issue with this benchmark.
