Friendlier SQL with DuckDB

Daily Digest email

Get the top HN stories in your inbox every day.

roncohen

Lots of great additions. I will just highlight two:

Column selection:

When you have tons of columns these become useful. Clickhouse takes it to the next level and supports APPLY and COLUMN in addition to EXCEPT, REPLACE which DuckDB supports:

    - APPLY: apply a function to a set of columns
    - COLUMN: select columns by matching a regular expression (!)

Details here: https://clickhouse.com/docs/en/sql-reference/statements/sele...

Allow trailing commas:

I can't count how many times I've run into a problem with a trailing comma. There's a whole convention developed to overcome this: the prefix comma convention where you'd write:

    SELECT
      first_column
      ,second_column
      ,third_column

which lets you easily comment out a line without worrying about trailing comma errors. That's no longer necessary in DuckDB. Allowing for trailing commas should get included in the SQL spec.

nicoburns

> Allowing for trailing commas should get included in the SQL spec.

Yep! That would be my #1 request for SQL. Seems ridiculous that it's not supported already.

layer8

I agree, though you can always use a dummy value as a workaround:

  SELECT
      first_column,
      second_column,
      third_column,
      null

_dark_matter_

BigQuery also supports trailing commas!

snidane

Allow referencing columns defined previously in the same query would make duckdb competitive for data analytics. Without that one has to chain With statements for just the tiniest operations.

  select 1 as x, x + 2 as y, y/x as z;

flakiness

There is a bug for that and it looks someone is even working on it. https://github.com/duckdb/duckdb/issues/1547

karmakaze

There's also no need to make it left to right usage, as long as it's acyclic:

  select y-2 as x, 3 as y, y/x as z;

1egg0myegg0

Yes, good thought! That is listed at the bottom of the article as something we are looking at for the future.

gigatexal

Yep! Agreed!

zasdffaa

(nothing to do with DuckDB but..) SQL is complex enough, and allowing this (and acyclically as mentioned below) would do my $%^& nut implementing it.

But I know a user requirement when I hear one, so can you give me an large, real example of where allowing this would make things easier? That would be mega helpful, ta

wruza

SQL is complex enough

No, it is not. I mean it is, but not in parts where that could be seen as useful and/or convenient. [A]cyclic graph traversal/etc is one of the basic tests in a modern interview at any CRUD studio. How come it could do $%^& to any part of yours?

karmakaze

The thing that makes SQL simple for me is that I can think in set operations devoid of proceduralness. Once we make things more and more sequential the more it is like programming than a formula.

layer8

What does "do my $%^& nut" even mean? (looks like Perl ;))

yread

for example

    select id, count(...something complicated) as complicated_count
    from ....
    order by complicated_count

would help

gregmac

Yes, trailing commas should work everywhere!

JSON is the other one where it annoys me, but luckily I rarely hand-write any JSON anymore (and there are semi-solutions for this like json5).

In code I always add trailing commas to anything comma-separated. It makes editing simpler (you can shuffle lines without thinking about commas). In a diff or blame it doesn't show adding a comma as a change.

SQL is the one spot where this doesn't work, and it's a constant foot-gun as I often don't remember until I run and get a syntax error.

skrtskrt

JSONC allows comments and trailing commas, but adoption seems to be low.

VSCode uses it for configuration, but when I wanted to use it in Python (to add context to source-controlled Elasticsearch schemas) there were only a couple old barely-maintained libraries for parsing.

nicoburns

You can often find better maintained libraries for json5, which is a superset of jsonc

yunohn

> there were only a couple old barely-maintained libraries for parsing.

Do they work, though? If it’s a mostly stable standard, doesn’t seem like you’d need a frequently updated parser.

go_prodev

EXCEPT columns would get my vote for ansi standard SQL adoption. So much time is spent selecting all but a few columns.

skeeter2020

You can do the same thing with your WHERE clause and ANDs by always starting them WHERE 1=1 as well.

>> Allowing for trailing commas should get included in the SQL spec.

So there is no "SQL spec" per se, there's an ANSI specification with decades of convention and provider-specific customizations piled on top. This support for trailing commas is the best you're going to get.

1egg0myegg0

Thank you for the feedback! I will check those Clickhouse features out. I totally agree on the trailing commas, and I use commas first syntax for that same reason! But maybe not anymore... :-)

franga2000

> Allowing for trailing commas should get included in the SQL spec

Not just SQL, trailing commas are stupidly useful and convenient, so as far as I'm concerned every language should have them. To be fair, a decent amount of them have implemented them (I was pleasantly surprised by GCC C), but there are still notable holdouts (JSON!).

throw_away

Are leading commas allowed? Because otherwise, you've just traded out the inability to comment out the last element for the inability to comment out the first. I never understood this convention.

sagarm

I agree that it's ugly and don't use it myself, but I find that I modify the last item in a list far more frequently than the first. Probably because the grouping columns tend to go first by convention, and these change less.

Timpy

Wow this was definitely a pessimistic click for me, I was thinking "trying to replace SQL? How stupid!" But it just looks like SQL with all the stuff you wish SQL had, and some more stuff you didn't even know you wanted.

eis

I was just yesterday exploring DuckDB and it looked very promising but I was very surprised to find out that indexes are not persisted (and I assume that means they must fit in RAM).

> Unique and primary key indexes are rebuilt upon startup, while user-defined indexes are discarded.

The second part with just discarding previously defined indexes is super surprising.

https://duckdb.org/docs/sql/indexes

This was an instant showstopper for me or I assume most people whose databases grow to a bigger size at which point an OLAP DB becomes interesting in the first place.

Also the numerous issues in Github regarding crashes make me hesitant.

But I really like the core idea of DuckDB being a very simple codebase with no dependencies and still providing very good performance. I guess I just would like to see more SQLite-esque stability/robustness in the future and I'll surely revisit it at some point.

pholanda

Hey eis, you are correct. We do not support index storage in our latest release. I am currently implementing this, and it is in a fairly advanced stage. So it should be featured in the next release.

This took a little while because we use a fairly modern index structure with no literature definition on how to buffer manage it.

eis

It's good to hear that persistent indexes are coming soon. I saw it was on the roadmap but didn't know how far out this feature was. Do you have an idea when that release could be out?

BTW Do you have some kind of code/docs one can take a look at regarding the index structure? I'm a part-time data structure nerd :)

1egg0myegg0

Here is a paper that was used when building the DuckDB approach!

https://db.in.tum.de/~leis/papers/ART.pdf

There are some other papers and details about the architecture here: https://duckdb.org/why_duckdb

nojvek

Will the persistent indices still be built on ART indexes? ART is a great choice of data structure. Fast lookups, range queries and prefix queries.

Typesense DB (Typesense.org) under the hood also uses ART as the base datastructure for fast full text search queries.

So I assume it would be straightforward to build an text search engine on top of duck DB and utilize ART indices to do even more lovely things.

1egg0myegg0

Persistent indexes are being actively worked on! Stay tuned. As for the crashes - DuckDB is very well tested and used in production in many places. The core functionality is very mature! Let us know if you test it out! Happy to help if I can.

(disclaimer - on the DuckDB team)

eis

Hi, good to hear that you guys care about testing. One thing apart from the Github issues that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/ which make it look like it couldn't handle the 50GB case due to a out of memory error. I see that the benchmark and the used versions are about a year old so maybe things changed a lot since then. Can you chime in regarding the current story of running bigger DBs like 1TB on a machine with just 32GB or so RAM? Especially regardung data mutations and DDL queries. Thanks!

1egg0myegg0

Yes, that benchmark result is quite old in Duck years! :-)

We actually run that benchmark as a part of our test suite now, so I am certain that there is improvement from that version.

The biggest DuckDB I've used so far was about 400 GB on a machine with ~250 GB of RAM.

There is ongoing work that we are treating as a high priority for handling larger-than-memory intermediate results within a query. But we can handle larger than RAM in many cases already - we sometimes run into issues today if you are joining 2 larger than RAM tables together (depending on the join), or if you are aggregating a larger than RAM table with really high cardinality in one of the columns you are grouping on.

Would you be open to testing out your use case and letting us know how it goes? We always appreciate more test cases!

eyelidlessness

Often efforts and articles like this feel like minor affordances which don’t immediately jump out as a big deal, even if they eventually turn out to be really useful down the road. Seeing the article title, that’s what I expected. I did not expect to read through the whole thing with my inner voice, louder and more enthusiastically, saying “yes! this!” Very cool.

tosh

How does DuckDB compare to SQLite (e.g. which workloads are a good fit for what? Would it be a good idea to use both?)

I found https://duckdb.org/why_duckdb but I'm sure someone here can share some real world lessons learned?

1egg0myegg0

Excellent question! I'll jump in - I am a part of the DuckDB team though, so if other users have thoughts it would be great to get other perspectives as well.

First things first - we really like quite a lot about the SQLite approach. DuckDB is similarly easy to install and is built without dependencies, just like SQLite. It also runs in the same process as your application just like SQLite does. SQLite is excellent as a transactional database - lots of very specific inserts, updates, and deletes (called OLTP workloads). DuckDB can also read directly out of SQLite files as well, so you can mix and match them! (https://github.com/duckdblabs/sqlitescanner)

DuckDB is much faster than SQLite when doing analytical queries (OLAP) like when calculating summaries or trends over time, or joining large tables together. It can use all of your CPU cores for sometimes ~100x speedup over SQLite.

DuckDB also has some enhancements with respect to data transfer in and out of it. It can natively read Pandas, R, and Julia dataframes, and can read parquet files directly also (meaning without inserting first!).

Does that help? Happy to add more details!

OskarS

One of SQLite's most appealing aspects to me is using it as an application file format, as described in this article: https://www.sqlite.org/appfileformat.html

How does DuckDB compare in that aspect? Does it have the same kind of guarantees of robustness, incorruptibility and performance (especially reading/writing binary blobs) that SQLite does?

In any case: DuckDB looks great, nice work! Good to have more players in this space!

1egg0myegg0

It is a goal of ours to become a standard multi-table storage format! However, today we are still in beta and have made some breaking changes in the last few releases. (Exporting from the old version, then reimporting your DB in the new allows you to upgrade!) Those should happen less often as we move forward (the storage format was genericized a bit and is more resilient to future enhancements now), and locking in our format amd guaranteeing backwards compatibility will occur when we go to 1.0!

tosh

Thanks, it does help! I understand SQLite might be better/ideal for OLTP (?) but would DuckDB also work for use cases where I query for specific records (e.g. based on primary key) or would I rather use SQLite for OLTP stuff and then read SQLite from DuckDB for analytical workloads?

Basically I'm wondering: if I go all in on DuckDB instead of SQLite would I notice? Do I have to keep anything in mind?

I know, probably difficult to answer without a concrete example of data, schema, queries and so on.

The SQL query features in the article seem really neat. Kudos @ shipping.

1egg0myegg0

Good questions! You are correct that it depends. We do have indexes to help with point queries, but they are not going to be quite as fast as SQLite because DuckDB stores data in a columnar format. (Soon they will be persistent - see comments above!) That columnar format is really great for scanning many items, but not optimal for grabbing all of a single row.

With DuckDB, bulk inserts are your friend and are actually super fast.

Definitely let us know what you find! Just open up a discussion on Github if you'd like to share what you find out: https://github.com/duckdb/duckdb/discussions

1egg0myegg0

I should add that we can read/write Apache Arrow as well!

enjalot

one thing I love about DuckDB is that it supports Parquet files, which means you can get great compression on the data. Here's an examples getting a 1 million row CSV under 50mb and interactive querying in the browser: https://observablehq.com/@observablehq/bandcamp-sales-data?c...

the other big thing is better native data types, especially dates. With SQLite if you want to work with timeseries you need to do your own date/time casting.

1egg0myegg0

Yes, it is always difficult to use dates in SQLite... DuckDB makes dates easier - like they should be!

eatonphil

DuckDB: embedded OLAP, SQLite: embedded OLTP. For small datasets (<1M rows let's say) either would be similar in performance.

I need to do the benchmarks to substantiate this but this is my intuition.

IshKebab

I haven't used DuckDB yet but the biggest differences I've discovered if you aren't working on huge datasets where the column/row thing makes a difference (you're probably not) are:

1. SQLite has a great GUI and is really really widely supported.

2. DuckDB is properly statically typed with a much wider range of types than SQLite, which is dynamically typed and only just added support for any kind of type checking at all.

1egg0myegg0

If you'd like to work with DuckDB in a SQL IDE/GUI, we recommend DBeaver! It uses the DuckDB JDBC connector. A quick how to guide is here: https://duckdb.org/docs/guides/sql_editors/dbeaver

wenc

> huge datasets where the column/row thing makes a difference (you're probably not)

For programmers, it’s a tossup.

For most people working in data (databases, data engineering, ML etc) the column vs row thing makes a difference for datasets as small as a few hundred k records.

ergocoder

Sqlite's SQL is severely limited.

If you want to do something a bit more complex, you will have a bad time. Hello! With recursive.

mattrighetti

`EXCLUDE`

Extremely useful, is there a reason why this is something not implemented in SQL in the first place? I often find myself writing very long queries just to select basically all columns except for two or three of them.

heyda

because columns can be added to tables in production databases, so any time you use select * you run the chance the number of columns changing and breaking anything you wrote.

wruza

This seems reasonable on its own, but then you can add a compound index and forget to join on a second part, or refactor a column in two and only collect one value into aggregation. This spotted babysitting is just stupid. If you’re anxious about query integrity, get some tooling and check your sqls/ddls against some higher-level schema.

Even if that turns out to be a constant source of trouble worth not having, then why SQL can’t provide columnsets at least, so that queries could include, group or join on these predefined sets of columns instead of repeating tens of columns and/or expressions and/or aggregations many times across a single query. You had employees.bio_set=(name, dob), now you add `edu` to it and it just works everywhere, because you think in sets rather than in specific columns. Even group by bio_set works. Heck, I bet most of ORMs partially exist only to generate SQL, because it’s sometimes unbearable as is.

justin_oaks

That's a problem with select * in general, not a problem with using EXCLUDE with select *. So that still doesn't explain why it's not in SQL to begin with.

munk-a

I've always viewed SELECT * as a convinence for schema discovery and a huge bonus for subqueries - our shop excludes its use at the top level in production due to the danger of table definitions changing underneath... but we happily allow subqueries to use SELECT * so long as that column list is clearly defined before we leave the database.

Worst, by far, than a column you didn't expect being added is a column you did expect being removed. Depending on how thorough your integration tests are (and ideally they should be pretty thorough) you could suddenly start getting strange array key access (or object key unfound) errors somewhere on the other side of the codebase.

heyda

"A traditional SQL SELECT query requires that requested columns be explicitly specified, with one notable exception: the * wildcard. SELECT * allows SQL to return all relevant columns. This adds tremendous flexibility, especially when building queries on top of one another. However, we are often interested in almost all columns. In DuckDB, simply specify which columns to EXCLUDE:"

It appears how this works is that is selects all columns and then EXCLUDES only the column's specified, the reason this doesn't exist in normal SQL is because it is a terrible idea. This is something that will break at many companies with large technical debt if it is ever used.

gmueckl

That's why good database wrappers support referencing columns in result sets by column name. It's good practice.

aerzen

If anyone is interested in improvements to SQL, checkout PRQL https://github.com/prql/prql, a pipelined relational query language.

It supports:

- functions,

- using an alias in same `select` that defined it,

- trailing commas,

- date literals, f-strings and other small improvements we found unpleasant with SQL.

https://lang.prql.builders/introduction.html

The best part: it compiles into SQL. It's under development, though we will soon be releasing version 0.2 which would be "you can check it out"-version.

lijogdfljk

This is neat. Have you found this new found capability at odds with "good SQL"? Eg, i run a fairly large application that has a huge DB schema, and more often than not when the SQL gets huge and ugly it often means we're asking too much of the DB. "Too much" being more easy to run into poor indexes, giving more chances for it to pull in unexpectedly large number of rows, etc.

My fear with PRQL is that i'd more easily ask too much of the DB, given how easy it looks to write larger and more complex SQL. Thoughts?

aljazmerzen

That's true - when you hit 4th CTE you are probably doing something wrong.

But not always. Some analytical queries may actually need such complexity. Also, during development, you would sometimes pick only first 50 rows before joining and grouping, with intention of not overloading the db. To do this you need a CTE (or nested select), but in PRQL you just add a `take 50` transform to the top.

learndeeply

Since the DuckDB people are here, just want to say that what you're doing is going to be a complete game-changer in the next few years, much like SQLite changed the game. Thanks for making it open source!

wolf550e

That 750KB PNG can probably be a 50KB PNG. Even without resizing it compresses to less than half its size.

https://duckdb.org/images/blog/duck_chewbacca.png

1egg0myegg0

Thanks! Can you tell that my SQL-fu is stronger than my HTML-fu? :-) Much appreciated!

andai

In this case it should probably be a JPEG? (Unless it has a transparent background and the site responds to the user's dark-mode setting? :) Also, this image looks like it almost certainly was a JPEG, at some point!)

tracker1

The color palette is pretty limited, so using something that can have a specific/limited palette (say 48-64 color in this case) is probably going to have a better result than jpeg. Also, optimizing for display size would take it further still. Alpha transparency support is also a bonus for png over jpeg.

Oxodao

Came across this a few time but never got to try it out because the only golang binding is unofficial and I can't get CGO to work as expected...

That would be really neat to have an official one. This articles makes me want to try it even more

1egg0myegg0

Here is a solved Github Issue related to CGO for the Go bindings! If you have another issue, please feel free to post it on their Github page!

https://github.com/marcboeker/go-duckdb/issues/4

Oxodao

Thanks! I didn't see that, I'll give it a try again!

nlittlepoole

Ran into some similar issues with those bindings. We switched to using the ODBC drivers and its been great. Those are official and we just use an ODBC library in go to connect

_raoulcousins

I love love love DuckDB. When I can use DuckDB + pyarrow and not import pandas, it makes my day.

kristianp

I'm enjoying experimenting with Duckdb from python, it's a promising product and has a large list of data formats it can read, including pandas dataframes from in-memory with zero-copy. However its still quite the moving target, with a number of things not at maturity yet. e.g. the TimestampZ column type isn't implemented yet [1], although it is in the documentation.

Edit: I came across it via the podcast: https://www.dataengineeringpodcast.com/duckdb-in-process-ola...

Latest release notes: https://github.com/duckdb/duckdb/releases/tag/v0.3.3

[1] Error message: Not implemented Error: DataType TIMESTAMPZ not supported yet...

mytherin

The correct type is TIMESTAMPTZ. Is the type TimestampZ mentioned anywhere in our documentation? If so, that looks like a typo.

I fully agree that error message needs to improve, however. I will have a look at that.

kristianp

You're right, I was using 'timestampz' when I should have been using 'timestamptz', (or 'timestamp with time zone') thanks for that.

mytherin

On the topic of Friendlier SQL, I have extended the similarity search to types in this PR [1], so the system will now offer you this correction as well :)

[1] https://github.com/duckdb/duckdb/pull/3633

Daily Digest email

Get the top HN stories in your inbox every day.