
ClickHouse Cloud is now in Public Beta

164 comments

October 4, 2022

thomoco

I wanted to note that ClickHouse Cloud results are now also being reported in the public ClickBench results: https://benchmark.clickhouse.com/

It's good to see transparent comparisons now available for Cloud performance vs. self-hosted or bare-metal results, as well as results from our peers. The ClickHouse team will continue to optimize further - scale and performance are a relentless pursuit here at ClickHouse, and something we expect to pursue transparently and in a reproducible manner. Public benchmarking benefits all of us in the tech industry as we learn from each other by sharing the best techniques for attaining high performance within a cloud architecture.

Full disclosure: I do work for ClickHouse, although I have also been a past member of SPEC, developing and advocating for public, standardized benchmarks.

carlineng

To understand the results of the benchmark, I find it helpful to look at how it is constructed and what it tests for. From the README:

"The dataset is represented by one flat table. This is not representative of classical data warehouses, which use a normalized star or snowflake data model. The systems for classical data warehouses may get an unfair disadvantage on this benchmark."

Taking a look at the queries [0], the benchmark mostly consists of full table scans with filters, aggregations, and sorts. Since it's a single table, there are no joins.

[0]: https://github.com/ClickHouse/ClickBench/blob/main/snowflake...
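To give a feel for that query shape, here is an illustrative ClickHouse query in the same spirit (the `hits` table and column names are assumed, not copied from the suite):

    -- Illustrative only: full scan with a filter, an aggregation, and a sort on one flat table.
    SELECT SearchPhrase, count() AS c
    FROM hits
    WHERE SearchPhrase <> ''
    GROUP BY SearchPhrase
    ORDER BY c DESC
    LIMIT 10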

morelisp

Can you clarify what a "write unit" is? Naively it sounds like it might be blocks x partitions x replicas that actually hit disk. (Which is also probably not very clear to people not already using CH, but I have at least middling knowledge of CH's i/o patterns and I have no clue what a "write unit" is from the page's description.)

zX41ZdbW

One write unit is around 100..200 INSERT queries.

If you are doing INSERTs in batches of one million rows, it gives

    SELECT formatReadableQuantity(1000000 * 100 / 0.0125)
    
    8.00 billion
inserted rows per dollar. Pretty good, IMO.

If you are doing millions of INSERT queries with one record each, without the "async_insert" setting, it will cost much more.

That's why we have "write units" instead of just counting inserts.
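As a minimal sketch of that setting (the table and column names here are made up), enabling asynchronous inserts for a session looks like:

    -- Buffer many small INSERTs server-side and write them as larger batches.
    SET async_insert = 1;
    SET wait_for_async_insert = 1;  -- return only after the buffered data is flushed

    -- Each small INSERT now lands in a server-side buffer instead of creating its own part.
    INSERT INTO events (ts, user_id, action) VALUES (now(), 42, 'click');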

morelisp

More helpful would be answers to my questions at https://news.ycombinator.com/item?id=33081502 - async_insert is a relatively new feature, we're still using buffer tables for example - but also most of our "client" inserts are actually onto multi-MV-attached null engines. Those MVs are also often doing some pre-aggregation before hitting our MTs as well. So we might insert a million rows, but the MV aggregates that down into 50k, but then that gets inserted into five persistent tables, each of which has its own sharding/partitioning so that blows up to 200k or something "rows" again. (And at some point those inserts are also going to get compacted into stuff inserted previously / concurrently by the MT itself.)
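(For readers unfamiliar with that pattern, a rough hypothetical sketch - the table names and schema below are invented for illustration:)

    -- Clients insert into a Null-engine table; nothing is stored there directly.
    CREATE TABLE ingest_raw (ts DateTime, site_id UInt32, bytes UInt64)
    ENGINE = Null;

    -- A persistent table holds the pre-aggregated data.
    CREATE TABLE traffic_by_site (minute DateTime, site_id UInt32, bytes UInt64)
    ENGINE = SummingMergeTree
    ORDER BY (site_id, minute);

    -- The materialized view watches inserts into ingest_raw, aggregates each block,
    -- and writes the (much smaller) result into traffic_by_site.
    CREATE MATERIALIZED VIEW traffic_by_site_mv TO traffic_by_site AS
    SELECT toStartOfMinute(ts) AS minute, site_id, sum(bytes) AS bytes
    FROM ingest_raw
    GROUP BY minute, site_id;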

As I've said several times in this thread, I understand why you don't count inserts or rows. What I don't understand is what unit a WU does actually correspond to. In particular I don't understand its relation to e.g. parts or blocks, which are the units one would focus on optimizing self-hosted offerings.

tylerhannan

It's Tyler from ClickHouse.

Check out the response below that has a reference to some of our billing FAQs.

eloff

It doesn't mention anything about what a write unit is, except to say you can reduce write units by batching inserts (that part I had guessed already).

There's no way to think about what an actual write unit means. You could measure the costs on a sample workload, but that's far from ideal. Some transparency here would be nice.

I understand the answer is complicated, based on hairy implementation details, and subject to change. Give me the complexity and let me interpret it according to my needs.

morelisp

Right, that link covers read units, which is also what I expected - essentially the number of files I have to touch - but I still have no clue about write units.

Is one block on one non-partitioned non-distributed table one write unit? What about one insert that's two blocks on such a table? What about one block on a null engine with two MVs listening to insert into two non-partitioned non-distributed tables? What if the table is a replacing mergetree, do I incur WUs for compactions? etc.

My worry is that it is essentially 1 WU = 1 new part file, which I understand makes sense to bill on but is tremendously opaque for users - at least, I have no clue how often we roll new part files; instead I'm focused on total network and disk I/O performance on one side and client query latency on the other.
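(On a self-hosted cluster you can at least get a rough feel for how many parts your inserts produce by querying system.parts - an illustrative query, with a placeholder table name:)

    -- How many active parts, and how many rows, the table currently has.
    SELECT count() AS active_parts, sum(rows) AS total_rows
    FROM system.parts
    WHERE database = currentDatabase()
      AND table = 'my_table'
      AND active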

avereveard

Is lower time the right metric here? It seems like normalizing by price would make a more useful metric for big data, as long as the response time is reasonable.

thomoco

Yes, ClickBench results are presented as Relative Time, where lower is better. You can read more on the specifics of ClickBench methodology in the GitHub repository here: https://github.com/ClickHouse/ClickBench/

There are other responses from ClickHouse in the comments on pricing, so I'll defer to their expertise on that topic. Thank you for your feedback and ideas - normalizing a benchmark by price is an interesting concept (and one where ClickHouse would also expect to lead, given the architecture and efficiency).

tbragin

This benchmark focuses on analytical query latency for representative analytical queries, so yes - a lower number is better.

latchkey

Wow, I hadn't heard of StarRocks before... seems like an interesting competitor.

https://starrocks.io/blog/clickhouse_or_starrocks

loveapdb

See SelectDB, built from Apache Doris by the creators of Apache Doris. The performance is amazing. https://en.selectdb.com/blog/SelectDB%20Topped%20ClickBench%...

cliffcrosland

Looks really cool! Great work!

Had a pricing question. Say we connected a log forwarder like Vector to send data to Clickhouse Cloud, once per second. If each write unit is $0.0125, and we execute 86,400 writes over the course of the day, would we end up spending $1080? Do you only really support large batch, low frequency writes?

tbragin

Hi Cliff - A write unit does not correspond to a single INSERT. A single INSERT with less than 16MB of data generates ~0.01 of a "write unit", so a single "write unit" typically corresponds to ~100 INSERTs. In your example, that would come closer to $11 a day. Depending on how small the batches in that example are, there may be ways to reduce that spend further by batching even more or by turning on "async_insert" inside ClickHouse.
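(Spelling out the arithmetic behind that estimate, in the style of the earlier example - the ~0.01 write units per small INSERT is the figure mentioned above:)

    -- ~86,400 small INSERTs/day x ~0.01 write units each x $0.0125 per write unit
    SELECT round(86400 * 0.01 * 0.0125, 2)  -- ≈ 10.8 dollars per day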

menaerus

> and advocating for public, standardized benchmarks

For full transparency, I think you should do the same in ClickHouse. Or is there a strong reason not to run benchmarks on standard analytical workloads like TPC-H, TPC-DS or SSB?

qoega

You can't post results of TPC benchmarks without an official audit, which complicates publishing results. You also can't find the names that are usually compared with ClickHouse there [1]. So the open, standardized ClickBench tries to encourage benchmarking for everyone.

There are numerous benchmarks that use TPC-like queries, but those are not standardized and can be misleading. For example, a lot of work was done by Fivetran to produce this report [0], but they show only the overall geomean for those systems, so you can't understand how they actually differ. In any case, their queries are not the original TPC queries - variables are fixed in the queries, and they run only the first query where the official query is a multi-query.

Contributors from Altinity ran SSB with both flattened and original schemas [2]. SSB is not well standardized, and we see a lot of pairwise comparisons with controversial results - generally you can't just reproduce them and get all the results in a single place on the same hardware.

[0] https://www.fivetran.com/blog/warehouse-benchmark

[1] https://www.tpc.org/tpcds/results/tpcds_results5.asp?orderby...

[2] https://altinity.com/blog/clickhouse-nails-cost-efficiency-c...

zX41ZdbW

There is a good reference to the available benchmarks for analytical databases: https://github.com/ClickHouse/ClickBench#similar-projects

menaerus

On a couple of occasions I've seen TPC-H benchmarks with the remark that the results are not audited. Is that not possible?

gaploid

Why are you using 'threads' instead of vCPUs or AWS instance types, like it was for other benchmarks? That's really hard to compare and adds suspicion here.

zX41ZdbW

It is related to the "max_threads" setting of ClickHouse, which by default is the number of physical CPU cores - half the number of vCPUs.

For example, the c6a.4xlarge instance type in AWS has 16 vCPUs and 8 physical cores, so "max_threads" in ClickHouse will be 8.
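(A quick, illustrative way to check or override this on any ClickHouse server:)

    -- Show the effective max_threads value for the current session.
    SELECT getSetting('max_threads');

    -- It can also be overridden per session or per query.
    SET max_threads = 16;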

ignoramous

Interesting set of results. Ignoring ClickHouse, StarRocks seems to be better in almost all metrics.

I was curious to compare MonetDB, DuckDB, ClickHouse-Local, Elasticsearch, DataFusion, QuestDB, Timescale, and Athena. Amazingly, MonetDB shows up better than DuckDB in all metrics (except storage size), and Athena holds its own and fares admirably well, especially given that it is stateless. Timescale and Quest, meanwhile, did not come out as well as I had hoped.

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...

It'd be interesting to see how rockset, starburst (presto/trino), and tiledb fare, if and when they get added to the benchmark.

mytherin

The particular way in which the data is loaded into DuckDB and the particular machine configuration on which it is run trigger a problem in DuckDB related to memory management. Essentially, the standard Linux memory allocator does not like our allocation pattern when doing this load, which causes the system to run out of memory despite freeing more memory than we allocate. More info is provided here [1].

As it is right now the benchmark is not particularly representative of DuckDB's performance. Check back in a few months :)

[1] https://github.com/duckdb/duckdb/issues/3969#issuecomment-11...

ignoramous

Thanks. Btw, we use DuckDB (via Node/Deno) for analytics (on Parquet/JSON), and so I must point out that despite the dizzying variation among various language bindings (cpp and python seem more complete), the pace of progress, given the team size, is god-like. It has been super rewarding to follow the project. Also, thanks for permissively licensing it (unlike most other source-available databases).

It goes without saying that if there are cost advantages to be had due to DuckDB's unique strengths, then a serverless DuckDB Cloud couldn't come soon enough.

menaerus

> despite freeing more memory than we allocate

> despite DuckDB freeing more buffers than it is allocating

Can you please clarify how that is even possible?

vorillaz

Happy ClickHouse user here. This is one amazing piece of software, to be honest. For anyone who has ever wanted to parse, analyse, and query billions of time series data points, ClickHouse is the way to go.

The cloud offering seems like an amazing product for companies that can afford it. I'm not sure if my reading of the billing is right, but for 5M inserts per month the total bill would be $62K.

nemo44x

That can’t be right, that’s insane. I would find it easy to do 5m inserts in 30 minutes.

IanCal

A write unit is not the same thing as a single insert, if that's what you multiplied up to get that cost.

jcims

I hate the part of my brain that has allowed the name to interfere with my interest in even looking at it.

zX41ZdbW

No, a write unit is not a single INSERT. A single INSERT will take around 0.01 write units.

teacpde

That's quite a high price tag per insert - do you have to write a large amount of data per insert?

gsanderson

Checking the pricing, $0.0125 per write unit. It says each write "generates one or more write units". So ... $1.25 per 100 writes? That can't be right. I wondered if it meant writes per second (like how AWS DynamoDB or Azure CosmosDB work, with their unit-based billing).

tbragin

With an analytical database like ClickHouse, you can write many rows with a single INSERT statement (thousands of rows, millions of rows, and more). In fact, this kind of batching is recommended. Larger inserts will consume more write units than smaller inserts. Check out our billing FAQ (https://clickhouse.com/docs/en/manage/billing/) for some examples; we will be enhancing it with more detail as questions from our users come in, and we'll work on clarifying this specific point. We also provide a $300 free trial credit to try out your workload and see how it translates to the usage dimensions. Finally, this is a Beta product, so keep the feedback coming!

gsanderson

Thanks, I agree batching inserts would indeed be a good idea, and it makes sense that it's recommended. However, that link (as of now) does not specify what a write unit is, so if that could be clarified, that would be great. From your reply it sounds like one INSERT would indeed (at a minimum) incur one write unit, and thus 100 writes could indeed cost $1.25 - which could get expensive, fast.

tbragin

An INSERT can consume less than one write unit, depending on how many rows and bytes it writes, to how many partitions, columns, materialized views, data types, etc. So a "write unit", which corresponds to hundreds of low-level write operations, typically translates to many batched INSERTs. We are working to improve our examples in the FAQ to clarify - thank you so much for asking the question!

benjaminwootton

How about if you are streaming in from Kafka and inserting each event as it arrives? ClickHouse is ideal for rapid analytics over event data, so having to introduce batching would be disappointing.

Batch upload is of course more cost effective, but I would expect that to be more typical in a traditional data warehouse where we are uploading hourly or daily extracts. Clickhouse would likely show up in more real time and streaming architectures where events arrive sporadically.

I am a huge fan and advocate of Clickhouse, but the concept of a write unit is strange and the likely charges if you have to insert small batches would make this unviable to use. A crypto analytics platform I built on top of Clickhouse would cost in the $millions per month vs $X00 or low $X000 with competitors or self hosting.

mrwnmonm

"Each write operation (INSERT, DELETE, etc) consumes write units depending on the number of rows, columns, and partitions it writes to."

zX41ZdbW

A single INSERT takes around 0.01 write units.

MentallyRetired

Some unsolicited feedback:

1) I had no idea what Clickhouse was for the first 30 seconds looking at the homepage. I now understand it to be a database of some sort. I shouldn't have seen the words "performance" and "cloud" and "serverless" before seeing the word database, right? I'm starting off confused. There shouldn't be an assumption that I know what you all do.

2) I have no idea what a column oriented database is. I've been a developer for 29 years (mostly frontend but I do a lot of full stack too). If I need an explainer, a lot of devs will.

Aside from that, it looks like a nice offering and I wish you all the best!

lkrubner

"column oriented database"

We all have our specialties, and that is fine. It is a common pattern that a developer gets comfortable with a particular tech stack, and then uses it for many years without seeing the need for much else. Some developers use Ruby on Rails plus Postgres for everything, others use C# and .NET and SQL Server. It's fine, if that's all you need.

Still, this is the year 2022. Cassandra, to take one example, was released in 2008. For everyone who has needed these fast-read databases, they've been much discussed for 14 years, including here on Hacker News, and on every other tech forum. At this point I think a company can simply assume that most developers will have some idea what a column database is.

cachemiss

Cassandra is not a columnar database, columnar in this sense is about the storage layout. Values for a column are laid next to each other in storage, thus allowing for optimizations in compression and computation, at the expense of reconstructing entire rows. Postgres is a row store, meaning all the columns for a row are stored next to each other in storage, which makes sense if you need all of the values for a row (or the vast majority).

ipaddr

You never explained what a column database is. One row and unlimited columns?

cinbun8

If you don't know what a column oriented DB is, that page is probably not for you.

bagels

I know what a column oriented DB is, but ClickHouse was not on my radar before.

The pitch on the landing page is that ClickHouse Cloud is great if you love ClickHouse. If you don't know what ClickHouse is, you have to do some work to find out.

pqdbr

Care to elaborate? I've been a fullstack Rails dev for 12 years and I had no idea either, just like OP. Why alienate potential users from the get-go?

ipaddr

Isn't it just a database in Second normal form (2NF)?

bfung

No. Column-oriented storage DBs make it fast to read tens to billions of "rows", usually when the data you want to read can be treated independently of other columns. Ex: stock closing prices - row storage isn't going to buy you much here; you'd rather read the whole "stock close" column as fast as possible.

Whereas in traditional DBs, with data like [first name, last name], the columns may have far less meaning on their own, and you need both columns for the data to make sense.

A traditional DB with B-tree storage is much slower for that type of usage. Storing the stock close values in a single-column format makes that type of query much faster.
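(To make that concrete with a hypothetical table: the query below only has to read the `symbol` and `close` columns from storage, no matter how many other columns the table has.)

    -- Columnar layout: only the columns referenced here are scanned.
    SELECT avg(close)
    FROM stock_prices
    WHERE symbol = 'AAPL'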

TotoHorner

A column-oriented database really isn't esoteric knowledge. You should check out DDIA (Designing Data-Intensive Applications), especially if you're doing full stack.

tylerhannan

Thanks for the well wishes!

And thanks for the honest feedback.

It's always an interesting balance of promoting a new thing (Cloud) and explaining an existing thing. This might be helpful.

https://clickhouse.com/docs/en/home

(note: I work at ClickHouse)

piggybox

That's fine. You've probably never played with a data warehouse. No big deal; we all have our own areas mastered and gaps elsewhere.

shadowtree

Columnar DBs have been around since 1969 and predate relational DBs. Seriously, if you're not into large-scale data, it really isn't for you.

andrewmutz

Clickhouse was spun out of Yandex, which is a Russian corporation. Given existing geopolitical tensions, is there anything to worry about there?

Does anyone know how much of the Clickhouse team (or ownership) is still located in Russia?

46Bit

They have been extremely clear in their support of Ukraine: https://clickhouse.com/blog/we-stand-with-ukraine

risyachka

It's just a bunch of text, doesn't show any support whatsoever.

They can show support by donating to UA defence and showing proof (important - not some neutral org). Otherwise it is not support but a bunch of bs

ericb

In places like Russia, those words are very dangerous, and could get you jailed or sent to the front. Even calling it a "war" was/is a punishable offense.

So, yes, words, but the potential consequences of these words have more significance than empty air.

xwowsersx

Why do you get to define the acceptable forms of support...?

u2315

They support Ukraine; however, given that the company spun out of Yandex, the latter is most certainly benefitting financially from their success and is paying taxes that fund the war. AFAIR, Yandex also has 2 director seats on the ClickHouse board, although that could have changed.

MajimasEyepatch

> ClickHouse, Inc. is a Delaware company with headquarters in the San Francisco Bay Area. We have no operations in Russia, no Russian investors, and no Russian members of our Board of Directors. We do, however, have an incredibly talented team of Russian software engineers located in Amsterdam, and we could not be more proud to call them colleagues.

From their "We Stand With Ukraine" page. [1]

[1] https://clickhouse.com/blog/we-stand-with-ukraine

u2315

That's interesting, thanks for clarifying. Yandex do show up as an investor in Crunchbase, including in their most recent Series B. The cited blog post says:

> The formation of our company was made possible by an agreement under which Yandex NV, a Dutch company with operations in Russia

While Yandex NV is registered in the Netherlands, it's pretty clear that Yandex NV is directly related to Yandex (specifically, according to Wikipedia, it's a holding company of Yandex). For those who don't know, Yandex is basically the Google of Russia and holds 42% of market share among search engines in Russia.

The blog post does not seem to make any claim that Russia is not benefitting financially from the commercial success of ClickHouse, and given the above, such a claim would be unlikely to be true. As such, I still think it's pretty much safe to assume that a portion of any $$$ paid to ClickHouse ultimately goes to fund the war and kill people.

That said, I sincerely hope they could find a way to stop that flow of money from happening somehow, as otherwise it's a nice technology and a great technical team behind it...

kelp

You can see on their jobs page and 'our story' page that they are mostly in the US and the Netherlands.

mike_d

Yandex is technically incorporated in The Netherlands and has an engineering presence there, but they are as Russian controlled as can be.

risyachka

Just a bunch of words. Everyone says they support Ukraine if it benefits their business.

Show proof

danielbln

What kind of proof are you looking for here?

zX41ZdbW

> Does anyone know how much of the Clickhouse team (or ownership) is still located in Russia?

Zero.

No employees in Russia, no ownership from Russia, no influence from Russia.

zX41ZdbW

I have an (extremely boring, but quite hands-on) video about various ClickHouse optimizations on top of external storage: https://www.youtube.com/watch?v=rK2BsaaaOCA (starting from around 40:00).

loveapdb

I am from the SelectDB company. SelectDB is a cloud-native data warehouse developed by the founding team of Apache Doris. Recently we submitted test results to ClickBench and achieved good rankings. Later, we will incorporate all of the performance optimizations into Apache Doris and submit Apache Doris test results as well. For both SelectDB and Doris, we not only have outstanding performance on large wide tables, but also perform well in scenarios such as TPC-H where there are many joins. https://en.selectdb.com/blog/SelectDB%20Topped%20ClickBench%...

epberry

Great stuff! I know the ClickHouse team and they are world class. If you're more familiar with a relational database, some things will feel weird. For example, you should not insert 1 row at a time, there are over 1,000 built-in functions, and the default connection is often over HTTPS, but not on port 443.

zX41ZdbW

ClickHouse Cloud provides port 443 as well as 8443. You can insert one row at a time; it's perfectly OK with "async_insert=1".
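A minimal sketch of what that can look like per query (the table and columns are placeholders; the setting can also be applied at the session or profile level):

    -- Single-row insert with async_insert enabled just for this statement.
    INSERT INTO events (ts, message)
    SETTINGS async_insert = 1
    VALUES (now(), 'hello')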

encoderer

I’m a little sad to see them embrace magic “insert unit” pricing, instead of taking the approach Altinity uses where you are renting an instance size that you can compare apples-to-apples with running your own cluster on ec2.

aseipp

Honestly that's something they can probably offer separately, but I really prefer this pricing for most use cases where I want a database that is always available but has bursty request/response patterns. This means I can have an analytical database available for all my small services, websites, etc without having to think too much about availability, support, and a constant price overhead. But ClickHouse is so fast you can get pretty far with a $10 VPS, I admit.

Probably the best comparison is CockroachDB Cloud. They have a "serverless" offering based on unit pricing and a dedicated offering based on provisioned servers + support/maintenance overhead. I think that would be the ideal place to go long-term, but I'm super excited for this current one. I love ClickHouse and want to support them.

ClickHouse is also an interesting case because there's lots of options to migrate clusters, use S3 as long-term storage, etc to where I don't particularly feel locked into this offering if I ever wanted to shift into my own.

hodgesrm

> I’m a little sad to see them embrace magic “insert unit” pricing, instead of taking the approach Altinity uses where you are renting an instance size that you can compare apples-to-apples with running your own cluster on ec2.

Thanks for this comment. We'll be publishing a blog at Altinity to compare both models. My view is that they both have their place. The BigQuery pricing model is great for testing or systems that don't have constant load. The Altinity model is good for SaaS systems that run constantly and need the ability to bound costs to ensure margins.

Having a selection of vendors that offer different economics seems better for users than everyone competing for margin on the same operational model.

Disclaimer: I'm CEO of Altinity.

izrailev

Serverless "pay for usage" is different than the fixed-size dedicated cluster pricing model, and can be quite a bit cheaper, especially with spiky query traffic. It should also be more reliable, since you don't have to predict and provision capacity for your peak usage ahead of time.

Disclaimer: I work for ClickHouse.

encoderer

Ok but it can also be quite a bit more expensive and your bill is less predictable.

Disclaimer: I’m a customer of altinity.

izrailev

Start a free trial of ClickHouse Cloud (https://clickhouse.cloud/signUp) and compare price/performance with your current setup.

Pay-for-usage should be cheaper in most use cases, and you can further limit your costs by reducing your scale-up limits in the "Advanced Scaling" setting (it will impact your performance, though - best to just let the autoscaler do its job...)

ian-whitestone

Outside of being open source, how does ClickHouse differ from Snowflake/BigQuery? In what scenarios would I choose ClickHouse over those existing solutions?

62951413

Druid and Pinot are more likely to be the peer group (e.g. see https://leventov.medium.com/comparison-of-the-open-source-ol...)

zX41ZdbW

ClickHouse supports ad-hoc analytics, real-time reporting, and time series workloads at the same time. It is perfectly suited for user-facing analytics services. It supports low-latency (<100ms) queries for real-time analytics as well as high query throughput (500 QPS and more) - all of this with real-time data ingestion of logs, events, and time series.

Take some notable examples from the list (https://clickhouse.com/docs/en/about-us/adopters/) - anything around web analytics, APM, ad networks, telecom data... ClickHouse is perfectly suited for these use cases. But if you try to run these scenarios on, say, BigQuery, they become almost impossible, prohibitively expensive, or just slow.

There are specialized systems for real-time analytics like Druid and Pinot, but ClickHouse does it better: https://benchmark.clickhouse.com/

There are specialized systems for time-series workloads like InfluxDB and TimescaleDB, but ClickHouse does it better: https://gitlab.com/gitlab-org/incubation-engineering/apm/apm... https://arxiv.org/pdf/2204.09795.pdf http://cds.cern.ch/record/2667383/

There are specialized systems for logs and APM, but ClickHouse does it better: https://blog.cloudflare.com/log-analytics-using-clickhouse/

There are specialized systems for ad-hoc analytics, but ClickHouse does it better as well: https://github.com/crottyan/mgbench

Well, even if you want to process a text file, ClickHouse will do it better than any other tool: https://github.com/dcmoura/spyql/blob/master/notebooks/json_...

And ClickHouse looks like a normal relational database - there is no need for multiple components for different tiers (like in Druid), no need for manual partitioning into "daily" and "hourly" tables (like you do in Spark and BigQuery), and no need for a lambda architecture... It's refreshing how something can be both simple and fast.

Jemaclus

We are using Clickhouse Cloud at our company and the speed at which it serves up data is mind-blowing to our team, which had been using older systems for almost 10 years. Congrats on the public beta, and we can't wait to see what comes next!

jcuenod

Just wanted to chime in and say that I'm using a self-hosted instance of CH for a small hobby project and the performance is awesome.