CharlesW
Can you explain what "open-source Snowflake" means, since you don't explain it in this description, in the repo, or on the site?
Is your goal explicitly to replicate all Snowflake capabilities? https://docs.snowflake.com/en/user-guide/intro-supported-fea...
memset
This is good feedback on writing a clear message - I appreciate it.
The goal of this project is to build a superb developer experience around analytical databases. I think that is one of Snowflake's (many) value propositions. It is also a goal for users to be able to have full control of their data, how it is processed, and be able to make economical use of their compute.
This project does not have a goal of matching anyone else's features. But of course, as we grow, we'll end up building things which are important to enterprises.
willseth
Calling it "Open-Source Snowflake" implies you are trying to be a drop-in replacement for Snowflake for at least a subset of features, SQL syntax, etc.
skeeter2020
I too was confused by the title. If it doesn't aim for feature parity, nor approach the problem space from the same perspective as Snowflake, I don't think it's the open-source version.
az226
Snowflake started as a cloud-first (elastic) data warehouse. DX came later.
tbragin
Disclaimer: I work at ClickHouse.
Thank you! Looks really interesting!
I personally agree that real-time OLAP databases have the potential to better serve workloads that currently live in Postgres or cloud data warehouses but need real-time ingest and analytical queries. And simplifying the developer experience on top of that, so you don't have to learn all the details of a powerful database, really speeds up developer velocity.
I'm curious how you see your project differing from GraphJSON (https://www.graphjson.com/) and Tinybird (https://www.tinybird.co/)?
Congratulations again on the launch!
memset
Good to meet you! I hadn't seen graphjson before, I'll check that out. I'm also a fan of tinybird, and I think we have similar goals of wanting to make it easier for people to adopt OLAP.
On the technical side, I've made different design decisions when it comes to ingesting and processing data. For example, after a table is created, you can post new JSON with different columns, and we ingest it without needing a manual migration step. I also treat JSON arrays differently: we can split them into multiple ClickHouse rows (rather than using a ClickHouse array).
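To illustrate the no-migration ingestion idea, here's a hypothetical sketch (not ScratchDB's actual code) of how a system could generate its own DDL when a payload contains unseen keys:

```python
# Illustrative sketch: when a JSON payload arrives with keys the table
# hasn't seen, emit ALTER TABLE statements for the new columns instead
# of requiring a manual migration.

def new_column_ddl(table, known_columns, payload):
    """Return ALTER TABLE statements for keys not yet in the table."""
    ddl = []
    for key, value in payload.items():
        if key in known_columns:
            continue
        # Naive type inference; a real system would be more careful.
        col_type = "Float64" if isinstance(value, (int, float)) else "String"
        ddl.append(f"ALTER TABLE {table} ADD COLUMN {key} {col_type}")
    return ddl

print(new_column_ddl("events", {"user"}, {"user": "a", "plan": "pro"}))
# → ['ALTER TABLE events ADD COLUMN plan String']
```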
Philosophically, I think there is a lot of room for open-source software that also has an amazing UI and developer experience. I've been writing OSS for years and basically think this is the best way to build successful developer tools.
tbragin
> Philosophically, I think there is a lot of room for open-source software that also has an amazing UI and developer experience.
+1 to that! Welcome to the ClickHouse community!
hawk_
I would like to be an OSS dev but how do you pay the bills?
mike_d
> how you see your project differs from GraphJSON (https://www.graphjson.com/) and Tinybird (https://www.tinybird.co/)?
Not OP, but neither of these appear to be open source?
giovannibonetti
Great product! Thanks for sharing it!
Question: I thought Clickhouse already has native support for flattening JSON [1], although it was released recently (version 22.3.1). Did you start working on it [2] before that? Or is it a different take? I'm curious about the pros and cons of each one.
[1] https://clickhouse.com/docs/en/integrations/data-formats/jso... [2] https://scratchdb.com/blog/flatten-json/
memset
This is a really good question. My snarky answer is "the way we ingest JSON doesn't require 50 pages and n configuration settings to explain."
Here's a more factual one:
- We don't use tuples for nested JSON. We just use underscores to show parent_child relationships in keys.
- We don't use arrays. Instead, we make it easy to split into multiple rows so you can use regular SQL.
I haven't directly compared all of ClickHouse's different JSON treatments to what I've implemented, but my goal was to build something that you could run and that would "just work the way it should."
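A toy version of the array-splitting behavior described above (illustrative only; ScratchDB's actual column naming and type handling may differ):

```python
def explode(obj, array_key):
    """Split a JSON array field into one row per element,
    copying the scalar fields onto each row."""
    base = {k: v for k, v in obj.items() if k != array_key}
    rows = []
    for element in obj[array_key]:
        row = dict(base)
        if isinstance(element, dict):
            # Nested keys become underscore-joined parent_child columns.
            for k, v in element.items():
                row[f"{array_key}_{k}"] = v
        else:
            row[array_key] = element
        rows.append(row)
    return rows

order = {"order_id": 42, "items": [{"sku": "a"}, {"sku": "b"}]}
print(explode(order, "items"))
# → [{'order_id': 42, 'items_sku': 'a'}, {'order_id': 42, 'items_sku': 'b'}]
```

Each output row is plain and flat, so it can be queried with regular SQL instead of ClickHouse's array functions.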
antoniojtorres
ClickHouse JSON certainly has some sharp edges because they do a lot of schema inference to transparently column store the component parts, which makes it tricky with anything that could be irregular.
Big fan of CH on my end. Will follow your project closely.
zepolen
What happens with this?
{
  "some_key": "1",
  "some": {
    "key": "2"
  }
}
memset
Probably an error! Should create a ticket for that. What would we want the behavior to be?
(This hasn't come up with existing customers, but theoretically it could, so we should handle it.)
mosen
I tested it with the "explode" flattening mode, and it resulted in { "some_key": "2" }
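One way to see why: a naive underscore flattener (an illustration, not ScratchDB's actual code) collapses both keys to the same column name, and whichever value is merged last wins:

```python
def flatten(obj, prefix=""):
    """Flatten nested objects into underscore-joined keys."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value  # a colliding key silently overwrites
    return out

print(flatten({"some_key": "1", "some": {"key": "2"}}))
# → {'some_key': '2'} -- the nested value wins because it is merged last
```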
datatrashfire
as a huge clickhouse fan, i have to agree json is simply not a great experience even with all the changes in the last year
lionkor
maybe use the json flattening RFC instead of inventing your own?
zX41ZdbW
Schema inference for nested JSON objects is a feature of ClickHouse 23.9. I've made a video about it: https://www.youtube.com/watch?v=yS8YU-rBpMM&t=1846s
tiffanyh
AGPL-3.0 license, for those wondering.
throwaway295729
Congrats on the release! Can this be used for log data? How long is ingested data kept?
memset
You can use it for logs. I have a basic example here: https://scratchdb.com/blog/fly-logs-to-clickhouse/
It's just storing data in a database so it persists as long as you want. For the hosted version you just pay per GB of compressed data, so if you have tons of logs then you can keep them, or you can clear out old data to save on disk space.
pitah1
Thanks for sharing. Looks very clean and simple to use.
Do you plan on supporting non-JSON data types for insertion? For example, inserting CSV files, parquet files, Avro or Protobuf messages?
memset
Yes! Have an issue for that https://github.com/scratchdata/ScratchDB/issues/19
What would you want it to look like?
NortySpock
Not the parent poster, but I wouldn't race to solve this by supporting many more connectors, each requiring lots of config options - you'll just end up spending all your time maintaining connectors.
Instead, recommend that users use another (connector heavy) tool to get data in -- my mind jumped to using Benthos to convert a CSV or parquet file (or any other input stream) into a series of JSON calls -- and just ask users to hammer ingestion requests at your server. From there, your job is "just" to handle JSON ingestion as fast as you can, rather than maintain many connectors.
If JSON becomes a problem, then find exactly one other well-defined file format for bulk data loads (parquet, perhaps?), and support that.
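For instance, a few lines of client-side glue can turn a CSV into a stream of JSON documents to post at the ingestion endpoint (a sketch; note that CSV values arrive untyped, as strings):

```python
import csv
import io
import json

def csv_to_json_rows(csv_text):
    """Turn CSV text into one JSON document per row,
    ready to POST to a JSON ingestion endpoint."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row) for row in reader]

payloads = csv_to_json_rows("name,qty\nwidget,3\ngadget,5\n")
print(payloads)
# Each payload could then be POSTed to the insert endpoint;
# values like "3" stay strings unless the client casts them.
```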
When I saw this submission, I think I fell in love. JSON to get data in; SQL to get data out; what could be simpler?
pitah1
Agreed. I guess it depends on the target market. If your target is small to medium sized businesses, this is great to start doing analytics. But for large organisations, they generally have ETL/ELT jobs that do all the extraction from data sources to push to an analytics store or save in some format that is performant for analytics and storage (i.e. parquet). I'm also not sure how many data visualisation tools support API endpoints for query serving.
didip
You should submit your benchmarks to ClickBench.
shrubble
What does the license mean, if I don't change any of the code you provide, but use it to provide a public-facing service? Like if I use it for a forum, for instance, but am using a separate bit of code to push data into and retrieve out of ScratchDB?
ddorian43
Why is your storage 10X that of bigquery? How does your compute price compare to bigtable?
Edit: bigtable->bigquery
memset
I don't bill separately for the compute vs storage the way bigtable does. The pricing per GB of data is inclusive of compute. The goal is for pricing to be modeled similarly to DynamoDB - just pay for what you use. The other way I charge is by query wall time - so a 30s query will cost you more than a 500ms one.
I haven't used bigtable but it seems like the minimum charge is on the order of $300 before you have any data. With ScratchDB, the minimum charge is $10 for 30 GB.
Additionally, on average, data has a 25% compression ratio. So if your 1 TB of data only takes up 250GB, you only pay for that.
Bigtable isn't OLAP, so you would not use them for the same data. This competes more directly with GCP's BigQuery.
Finally, I'm interested in pricing feedback! The goal is to be able to sustain the development of this, so I want to do what makes sense.
ddorian43
I wrote it wrong - I meant BigQuery, sorry. It's 9x the price of BigQuery.
memset
I haven't calculated the break-even point for BigQuery vs ScratchDB. It would be impossible to do in advance, as I'd need to know how many rows BigQuery will scan to make an estimate. (Also, what is a "slot-hour"?)
That is why I chose to price in units of "GB" and "hours" for storage and compute - those are things you can easily observe.
It is a good question, though, and perhaps I can do an experiment and write a blog post using example data showing the differences. I might be surprised at how efficient bigquery is!
gbrits
Congrats on the launch. This looks great. Inferring schemas on the fly is awesome for getting started quickly, but are there ways to explicitly define a schema if I want to? For example, I'm thinking of setting column-specific compression.
memset
Currently no, but I’m open to the idea of being able to set this. Perhaps the ability to set it on the fly (or better yet, inform the user of which type of compression to use based on their actual data!) would be useful. Happy to discuss in a gh issue too.
jed_sanders12
This looks great. I have one question: when tables are created automatically, how do you choose the primary key order for the ClickHouse table?
memset
Great question. Each new table is given a __row_id column which ScratchDB populates and that is the default primary key.
Then, as data is added, I look at the cardinality of each column and set an ORDER BY appropriately. The goal (implementation in progress, not launched yet) is for ScratchDB to do this automatically - we should be able to automatically look at both the data in your table and queries and analytically come up with an appropriate sorting key.
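A minimal version of that cardinality heuristic might look like this (my sketch, not ScratchDB's implementation - low-cardinality columns first is a common choice for a ClickHouse sorting key):

```python
def pick_order_by(rows, max_columns=3):
    """Rank columns by ascending cardinality (distinct-value count),
    a common heuristic for choosing a ClickHouse sorting key."""
    if not rows:
        return []
    cardinality = {col: len({row[col] for row in rows}) for col in rows[0]}
    ranked = sorted(cardinality, key=cardinality.get)
    return ranked[:max_columns]

rows = [
    {"env": "prod", "status": 200, "url": "/a"},
    {"env": "prod", "status": 200, "url": "/b"},
    {"env": "prod", "status": 404, "url": "/c"},
]
print(pick_order_by(rows))  # → ['env', 'status', 'url']
```

A real implementation would sample the table rather than scan it, and would also weigh which columns actually appear in query filters.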
jed_sanders12
This sounds like a very interesting problem. How costly is it to modify the ORDER BY after data has been loaded into the table? It sounds expensive, because it would have to reorder the data after the ORDER BY changes.
memset
It is not a cheap operation - I basically create a new table and then "INSERT INTO SELECT..." Then again, it isn't something you typically need to do often.
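The rebuild pattern described above looks roughly like this sketch (my illustration; the final swap via ClickHouse's `EXCHANGE TABLES` is an assumption about one way to finish the rebuild, not necessarily what ScratchDB does):

```python
def resort_statements(table, new_order_by):
    """Generate statements for rebuilding a table with a new sorting
    key: create a copy with the new ORDER BY, backfill it, then swap."""
    tmp = f"{table}_resorted"
    order_by = ", ".join(new_order_by)
    return [
        f"CREATE TABLE {tmp} AS {table} ENGINE = MergeTree ORDER BY ({order_by})",
        f"INSERT INTO {tmp} SELECT * FROM {table}",
        f"EXCHANGE TABLES {tmp} AND {table}",
    ]

for stmt in resort_statements("events", ["env", "status"]):
    print(stmt)
```

The INSERT INTO SELECT step is the expensive part, since every row gets rewritten in the new sort order.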
anon3949494
Just signed up but didn't receive a confirmation email. Are you currently accepting new sign-ups for the managed service?
memset
Yes! I should set up confirmation emails - I'm reaching out to the folks who have signed up so you should hear back soon! If I miss it (apologies!) then feel free to email directly.
Hello! For the past year I’ve been working on a fully-managed data warehouse built on Clickhouse. I built this because I was frustrated with how much work was required to run an OLAP database in prod: re-writing my app to do batch inserts, managing clusters and needing to look up special CREATE TABLE syntax every time I made a change. I found pricing for other warehouses confusing (what is a “credit” exactly?) and worried about getting capacity-planning wrong.
I was previously building accounting software for firms with millions of transactions. I desperately needed to move from Postgres to an OLAP database but didn't know where to start. I eventually built abstractions around Clickhouse: my application code called an insert() function, but in the background I had to stand up Kafka for streaming, handle bulk loading, DB drivers, and Clickhouse configs, and manage schema changes.
This was all a big distraction when all I wanted was to save data and get it back. So I decided to build a better developer experience around it. The software is open-source: https://github.com/scratchdata/ScratchDB and the paid offering is a hosted version: https://www.scratchdb.com/.
It's called “ScratchDB” because the idea is to make it easy to get started from scratch. It’s a massively simpler abstraction on top of Clickhouse.
ScratchDB provides two endpoints [1]: one to insert data and another to query. When you send any JSON, it automatically creates tables and columns based on the structure [2]. Because table creation is automated, you can just start sending data and the system will just work [3]. It also means you can use Scratch as any webhook destination without prior setup [4,5]. When you query, just pass SQL as a query param and it returns JSON.
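As a sketch of what that looks like from the client side (the host, path, and `q` parameter name below are placeholders, not the documented API - see [1] for the real endpoints):

```python
import json
from urllib.parse import urlencode

# Placeholder host and paths; only the shape of the interaction is shown.
BASE = "https://api.scratchdb.example"

# Insert: POST arbitrary JSON, and tables/columns are created to match.
insert_body = json.dumps({"user": "alice", "event": "signup", "ms": 42})

# Query: SQL goes in a URL query parameter; results come back as JSON.
sql = "SELECT event, count(*) FROM events GROUP BY event"
query_url = f"{BASE}/query?{urlencode({'q': sql})}"
print(query_url)
```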
It handles streaming and bulk loading data. When data is inserted, I append it to a file on disk, which is then bulk loaded into Clickhouse. The overall goal is for the platform to automatically handle managing shards and replicas.
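The append-then-bulk-load pattern can be sketched like this (an illustration of the idea, not the actual Go implementation):

```python
import os
import tempfile

class Batcher:
    """Append incoming JSON lines to a file; flush to the database
    in bulk once enough rows accumulate (illustrative only)."""
    def __init__(self, flush_every=3):
        self.path = os.path.join(tempfile.mkdtemp(), "pending.jsonl")
        self.flush_every = flush_every
        self.pending = 0
        self.flushed = []  # stands in for bulk INSERTs into ClickHouse

    def insert(self, json_line):
        with open(self.path, "a") as f:
            f.write(json_line + "\n")
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()

    def flush(self):
        with open(self.path) as f:
            batch = f.read().splitlines()
        # In the real system this would be one bulk INSERT.
        self.flushed.append(batch)
        open(self.path, "w").close()  # truncate the pending file
        self.pending = 0

b = Batcher()
for i in range(3):
    b.insert('{"n": %d}' % i)
print(b.flushed)  # → one batch of three rows after the threshold is hit
```

Batching like this is what lets the app fire off single-row inserts while ClickHouse only ever sees efficient bulk loads.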
The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - I want folks to make economical use of their servers.
So far ScratchDB has ingested about 2 TB of data at up to 4,000 requests/second, on about $100 worth of monthly server costs.
Feel free to download it and play around - if you’re interested in this stuff then I’d love to chat! Really looking for feedback on what is hard about analytical databases and what would make the developer experience easier!
[1] https://scratchdb.com/docs
[2] https://scratchdb.com/blog/flatten-json/
[3] https://scratchdb.com/blog/scratchdb-email-signups/
[4] https://scratchdb.com/blog/stripe-data-ingest/
[5] https://scratchdb.com/blog/shopify-data-ingest/