
Show HN: Cozo – new Graph DB with Datalog, embedded like SQLite (github.com)

Hi HN, I have been building the Cozo database for the past half year, and it is now ready for public release.

My initial motivation was that I wanted a graph database. Lightweight and easy to use, like SQLite. Powerful and performant, like Postgres. I found none of the existing solutions good enough.

Having decided to roll my own, I needed to choose a query language. I am familiar with Cypher but consider it not much of an improvement over CTEs in SQL (Cypher is sometimes notationally more convenient, but not more expressive). I like Gremlin but would prefer something more declarative. Experiments with Datomic and its clones convinced me that Datalog is the way to go.

Then I needed a data model. I find the property graph model (Neo4j, etc.) over-constraining, and the triple store model (Datomic, etc.) suffers from inherent performance problems. Both also lack the most important property of the relational model: being an algebra. Non-algebraic models are not very composable: you may store data as property graphs or triples, but when you run a query, you always get back relations. So I settled on the relational algebra as the data model.

The end result I now present to you. Let me know what you think, good or bad, and I'll do my best to address your comments. This is the first time I have used Rust in a significant project, and I love the experience!


samuell

How I have waited for this: a simple, accessible library for graph-like data with Datalog (also in a statically compiled language, yay). I have even pondered using SWI-Prolog for this kind of stuff, but it seems so much nicer to be able to use it embedded in more "normal" types of languages.

Looking forward to playing with this!

The main thing I will be wondering now is how it will scale to really large datasets. Any input on that?

zh217

Thanks for your interest in this!

It currently uses RocksDB as the storage engine. If your server has enough resources, I believe it can store TBs of data with no problem.

Running queries on datasets this big is a complicated story. Point lookups should be nearly instant, whereas running complicated graph algorithms on the whole dataset is (currently) out of the question, since all the rows a query touches must reside in memory. Also, the algorithmic complexity of some of the graph algorithms is too high for big data and there's nothing we can do about it. We aim to provide a smooth way for big data to be distilled layer by layer, but we are not there yet.

bryanrasmussen

When you say "currently", that implies it will change? Does that mean all rows will not need to be in memory?

What if you had not so many nodes, but each node had a lot of data; would that improve it? Probably not, but normally I think of the number of nodes in your graph as the problem.

zh217

Yes. For example, in Postgres you can sort arbitrarily large tables, unconstrained by main memory: Postgres switches to an external merge sort when a table is really large. There are other situations in Postgres where the working data go to disk when they are too large. We will eventually be able to do that in Cozo as well, but no timetable is available yet.

For your second question, say you have a relation with lots of fields, one of them particularly large. As long as you don't use that field in your query, it will not impact memory usage. The query may be slower, though, since the RocksDB storage engine needs to read more pages from disk, but the fields that are loaded by RocksDB but not needed will be promptly evicted from memory.

samuell

Many thanks for the detailed answer!

samuell

For folks looking for documentation or getting-started examples, see:

- The tutorial: https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...

- The language documentation: https://cozodb.github.io/current/manual/

- The pycozo library README for some examples on how to run this from inline python: https://github.com/cozodb/pycozo#readme

ekidd

This is a really impressive piece of work! Congratulations!

I note that it appears to be a library, but it's licensed under the Affero GPL. I believe this means that if I link your library into a program, and if I then allow users to interact with that combined program in any way over a network, then I have to make it possible for users to download the source code to my entire program. Is that your goal here? Were you thinking of some kind of commercial licensing model for people writing server-side apps that use your library?

(I'm curious because I've been deciding whether or not to roll my own toy Datalog for a permissively-licensed open source Rust project.)

zh217

No, my understanding is that if you don't make any changes to the Cozo code, you don't need to release anything to the public. If you do, and you cannot release your non-Cozo code, then you must dynamically link to the library (and release your changes to the Cozo code). The Python, NodeJS and Java/Clojure libraries all use dynamic linking.

There is no plan for any commercial license - this is a personal project at the moment. My hope is for this project to grow into a true FOSS database with wide contributions and no company controlling it. If a community forms and after I understand the consequences a little bit more, the license may change if the community decides that it is better for the long-term good of the project. For the moment though, it is staying AGPL.

Cu3PO42

Let me preface this by saying that this seems like a great piece of software, and it is absolutely within your rights to license it however you would like, no matter what any of the commenters here think.

However, I don't believe your understanding of AGPL is accurate.

> No, my understanding is that if you don't make any changes to the Cozo code, you don't need to release anything to the public. If you do, and you cannot release your non-Cozo code, then you must dynamically link to the library (and release your changes to the Cozo code). The Python, NodeJS and Java/Clojure libraries all use dynamic linking.

This sounds like you're thinking of the LGPL, not the AGPL. The LGPL is less strict than the GPL because the exception you describe above applies; the AGPL, on the other hand, is more strict. Essentially, if you use any AGPL code to provide a service to users, then you must also make the source code available, even if the software itself is never delivered to users.

The intention here is that you can't get around GPL by hiding any use of the GPL code behind a server, so it makes perfect sense to use it for a database. But I don't think it does what you want.

Whichever way you decide to go, be it AGPL, LGPL or something else, I encourage you to make a choice before accepting any outside contributions. As soon as you have code from other authors without a CLA you will need to obtain their permission to change the license (with some exceptions).

(Disclaimer: I'm not a lawyer, just interested in licenses.)

zh217

It seems that I really did misunderstand the differences. It is now under the LGPL. The repo still requires a CLA for contributions for the moment, until I am really sure.

zh217

Thank you for your perspective.

Maybe I was confused about the case of using an executable vs. linking against a library. Let me double-check with a few friends who understand copyright law better than I do. If everything checks out, the next release will be under the LGPL.

About the CLA: at the earlier suggestion of a friend, the repo currently has a CLA requirement (even though nobody outside has contributed yet). This will be lifted once the situation becomes clearer.

ekidd

> If a community forms and after I understand the consequences a little bit more, the license may change if the community decides that it is better for the long-term good of the project. For the moment though, it is staying AGPL.

Yes, I do want to be clear: I encourage you to use whatever license you like. You wrote the code! I was just curious, because it would also affect the license of any hypothetical software I wrote that used the library.

Here's a super oversimplified version of the main license types (I am not a lawyer):

- Permissive: "Do whatever you want but don't sue me."

- LGPL: "If you give this library to other people, you must 'share and share alike' the source and your changes to this library."

- GPL: "If you use this code in your program, you must 'share and share alike' your entire program, but only if you give people copies of the program."

- AGPL: "If you use this code in your program, you must 'share and share alike' your entire program with anyone who can interact with it over a network."

The AGPL makes a ton of sense for an advanced database server, because otherwise AWS may make their own version and run it on their servers as a paid service, without contributing back.

But like I said, I'm simplifying way too much. Take a look at the FSF's license descriptions and/or talk to a lawyer. This shouldn't be stressful. Figure out what license supports the kind of users and community you want, pick it, and don't look back. :-)

(I may end up writing a super-simple non-persistent Datalog at some point for an open source project. My needs are much simpler than the things you support, anyways—I only ever need to run one particular query.)

zh217

I realized my mistake, as I said in the other comments. The main repo is now under LGPL. I'll see what I'll do with the bindings. Writing code is so much better than dealing with licenses!

dangoor

I am not a lawyer, but I work in an open source programs office and am currently working specifically on open source license compliance.

Beyond what the sibling comments have said about LGPL sounding more like what you're going for, I'll just note that if you'd like broad adoption of this while still ensuring that changes to your code remain open, you might also want to consider the Mozilla Public License.

From what I understand of the MPL and LGPL, the MPL is better for instances where dynamic linking isn't possible. The MPL basically says that any changes _to the files you created_ must be made available under the MPL, preserving their public availability.

That said, most organizations are fine with the LGPL, but it just gets gnarly if there are instances where you really want to statically link something but you still fully want to support the original library's openness.

georgewfraser

Licensing under AGPL will make it hard for any startup to use Cozo. Lawyers always ask about AGPL in venture financing diligence and it is considered a red flag. You can argue that they are wrong, the linking exception and so on, but you’re basically shouting into the wind.

pie_flavor

The AGPL is a variant of the GPL, not the LGPL. That means dynamic linking still constitutes (according to them) a derivative work, so even programs that dynamically link against it must themselves be AGPL in their entirety. Dynamic linking is also meaningfully complicated to do in Rust, and this licensing of the crates.io crate will be a footgun for anyone not using cargo-deny.

I think this is a very cool project, but its use of *GPL essentially ensures I'm not going to use it for anything. If you're planning on reducing it to LGPL, I'm not sure what the GPL is getting you over going with the Rust standard license set of MIT + Apache 2.0.

kylebarron

If I'm not mistaken that sounds more like LGPL than the AGPL?

zh217

Maybe, and maybe I need to consult a lawyer someday to get the facts straight. To tell you the truth, my head hurts when I attempt to understand what these licenses say. Regardless, I intend this project to be true FOSS; the "finer detail" of which FOSS license it uses may change.

philzook

Very cool! I love the SQLite install-everywhere model.

Could you compare use case with Souffle? https://souffle-lang.github.io/

I'd suggest putting the link to the docs more prominently on the GitHub page.

Is the "traditional" datalog `path(x,z) :- edge(x,y), path(y,z).` syntax not pleasant to the modern eye? I've grown to rather like it. Or is there something that syntax can't do?

I've been building a Datalog shim layer in Python to bridge across a couple of different Datalog systems https://github.com/philzook58/snakelog (including a Datalog built on top of the Python sqlite bindings), so I should look into including yours.

zh217

I find nothing wrong with the classical syntax, but there is a very practical, even stupid reason why the syntax is the way it is now. As you can see from the tutorial (https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...), you can run Cozo in Jupyter notebooks and mix it with Python code. This is the main way that I myself interact with Cozo. Since I don't fancy writing an unmaintainable mess of Jupyter frontend code that may become obsolete in a few years, CozoScript had better look enough like Python so as not to completely baffle the Jupyter syntax highlighter. That's why the syntax for comments is `#`, not `//`. That's also why the syntax for stored relations is `*stored`, not `&stored` or `%stored`.

This was a hack from the beginning, but over time I grew to like the syntax quite a bit. And hopefully, by being superficially similar to Python or JS, it causes less confusion for new users :)
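
For the curious, here is a rough side-by-side (my own sketch, not lifted from the docs; it assumes a stored relation *edge holding the edges):

```
# classical Datalog:
#   path(x, z) :- edge(x, z).
#   path(x, z) :- edge(x, y), path(y, z).

# roughly the same in CozoScript: rules use :=, stored relations get a * prefix
path[x, z] := *edge[x, z]
path[x, z] := *edge[x, y], path[y, z]
?[x, z] := path[x, z]   # `?` is the entry rule whose result is returned
```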

samuell

Interesting! I'm thinking ... perhaps a small syntax comparison of Prolog/classical Datalog vs. Cozo would help people used to the classical syntax get started quickly.

philzook

Ah, that's very interesting. Thank you. `s.add(path(x,z) <= edge(x,y) & path(y,z))` is what I chose as python syntax, but it is clunkier.

jitl

This is amazing!

Have you looked at differential-datalog? It's Rust-based, maintained by VMware, and has a very rich, well-typed Datalog language. differential-datalog is in-memory only right now, but it could be ideal to integrate your graph as a datastore or disk-spill cache.

https://github.com/vmware/differential-datalog

zh217

Differential-datalog is a cool project. I think its targeted use cases are different from Cozo's. The most important difference is that Cozo is focused on graphs, whereas differential-datalog is focused on incremental computation. These two goals are somewhat at odds with each other: for queries with lots of joins (very common in graph computations), you can't know whether it's better to compute new results incrementally or to recompute everything until you actually run the query. Also, Cozo caters to the exploratory phase of data analysis (no need to define types/tables beforehand), whereas in differential-datalog everything must be explicit upfront.

remram

For everyone else: it looks like parent submitted it and it's now on the frontpage: https://news.ycombinator.com/item?id=33521561

mark_l_watson

Thank you, this looks very useful. I will try the Python embedded mode when I have time.

I especially like the Datalog query examples in your project README. I usually use RDF/RDFS and the SPARQL query language, with much less use of property graphs via Neo4j. I expect an easy ramp-up learning your library.

BTW, I read the discussion of your use of the AGPL license. For what it is worth, that license is fine with me. I usually release my open source projects using Apache 2, but when required libraries use GPL or AGPL, I simply use those licenses.

canadiantim

You mention that Cypher is not much of an improvement over CTEs in SQL; I was wondering if you could expand on this point a bit?

Part of me is considering using the Apache AGE graph extension for Postgres, but another part wonders whether it's worth it, considering CTEs can do a lot very similarly.

I'll definitely be following Cozo's progress though; it sounds great on the face of it, and I will have to consider using it as well. I wonder if it could make sense to use Postgres and Cozo together?

zh217

Yes of course.

Perhaps I should start by clarifying that I am talking about the range of queries the Cypher language can express, without any vendor-specific extensions, since my consideration was whether to use it as the query language for my own database. And Cypher is of course much more convenient to _type_ than SQL for expressing graph traversals - it was built for that.

With that understanding, any Cypher pattern can be translated into a series of joins and projections in SQL, and any recursive query in Cypher can be translated into a recursive CTE. Theoretically, SQL with recursive CTEs is not Turing complete (unless you also add window functions inside recursive CTEs, which I don't think any of the Cypher databases currently provide), whereas Datalog with function symbols is. Practically, you can easily write a shortest-path query in pure Datalog without recourse to built-in algorithms (an example is shown in the README), and at least in Cozo it executes essentially as a variant of Dijkstra's algorithm. I'm not sure I can do that in Cypher; I don't think it is doable.
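
To give a flavor, such a query looks something like this (a sketch in the spirit of the README example, not a verbatim copy; it assumes a stored relation *route[a, b, dist] of weighted edges and a start node 'start'):

```
# min() in the rule head is a semi-lattice aggregation,
# so the rule is allowed to recurse through it
shortest[b, min(d)] := *route['start', b, d]
shortest[b, min(d)] := shortest[c, d1], *route[c, b, d2], d = d1 + d2
?[b, d] := shortest[b, d]
```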

samuell

Does Cypher even support nested and/or recursive queries? I remember asking the Neo4j guys at a meetup about that many years ago, and they didn't even seem to understand the question. Might have changed since then of course.

Otherwise, the thing I have noticed with the Datalog (as well as Prolog) syntax is that you are able to build a vocabulary of reusable queries, in a much more usable way than any of the solutions I've seen in SQL or other similar languages.

It thus allows you to raise your level of abstraction by building up your definitions (or "classes", if you will) layer by layer with well-crafted queries, which can then be used in further, more refined classifying queries.

zh217

Re Datalog syntax: yes, the "composability" is the main reason that I decided to adopt it as the query language. This is also the reason why we made storing query results back into the database very easy (no pre-declaration of "tables" necessary) so that intermediate results can be materialized in the database at will and be used by multiple subsequent queries.
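
As a sketch of what that looks like (not verbatim from the docs; it assumes the :replace query option and a stored relation *friends):

```
# materialize an intermediate result as a stored relation,
# without declaring its schema beforehand
?[a, b] := *friends[a, b], *friends[b, a]
:replace mutual_friends {a, b}
```

Subsequent queries can then simply refer to *mutual_friends.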

abc3354

This looks nice!

DataScript seems to be another Datalog engine (in-memory only):

https://github.com/tonsky/datascript

fsiefken

There are a few more, including ones supporting on-disk databases: https://en.wikipedia.org/wiki/Datalog#Systems_implementing_D...

Jeff_Brown

I thought there was a big class of queries Datalog could not express -- something about negation, queries like "all X for which not Y". Is that not true? Or if it is, is Datalog somehow Turing complete nonetheless?

zh217

Technically, Cozo is using something called "Datalog with stratified negation, stratified aggregation, and function symbols", allowing aggregations to be auto-recursive when they form a semi-lattice, together with built-in algorithms, which are black boxes that take in relations and give you back another relation. Your example is taken care of by the "stratified negation" part.
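
For instance, an "all X for which not Y" query looks something like this (a sketch; the stored relations *person and *smoker are made up for illustration):

```
# `not` negates an atom; stratification guarantees the negated
# relation is fully computed before this rule runs
?[x] := *person[x], not *smoker[x]
```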

I believe "Datalog with function symbols" is already Turing complete, but you are right, what they call "Datalog" without any qualification in academic papers is not.

syats

Great initiative, hope this takes off :)

Just FYI, the largest, most-used knowledge graphs in the world (Google and LinkedIn) are not running on RDF4J or any RDF triplestore, but on their proprietary graph stores, which also use Datalog as a query language.

For those who are looking for an enterprise-ready equivalent (also Datalog queries) and have a good wad of cash, consider https://www.oxfordsemantic.tech/product

stevesimmons

This does look very nice!

Especially (from my point of view) having the Python interface.

What are the max practical graph sizes you anticipate?

zh217

For the moment: you can have as much data as you want on disk, as long as the RocksDB storage engine can handle it, which I believe is quite a lot. For any single query, though, you want all the data it touches to fit in memory. The good news is that Rust is very efficient in its use of memory. This will be improved in future versions.

For the built-in graph algorithms, you are also limited by the algorithmic complexity, which for some of them is quite high (notably betweenness centrality). There is nothing the database can do to help in this case, though we may add some approximate algorithms with lower complexity later.

kobaroko

And what is the biggest size that you have tested?

zh217

Around 10 GB of data, on the standalone version. We will have systematic benchmarks when the API, syntax, etc. settle down a little bit.

wodenokoto

Looks incredibly polished, everything from the logo through to the ready-made bindings. Very impressive!

pgt

Good job! How do I transact? The examples only show queries.

zh217

Transactions are described in the manual: https://cozodb.github.io/current/manual/stored.html#chaining....
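
In short, several queries can be chained in a single script, and the chain executes in one transaction, all or nothing. A quick sketch (not verbatim from the docs; the relation name is made up):

```
{
    ?[a, b] <- [[1, 'one'], [2, 'two']]
    :create numbers {a => b}
}
{
    # runs in the same transaction as the block above
    ?[a, b] := *numbers[a, b]
}
```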

Sorry about the docs being all over the place at the moment! My only excuse is that Cozo is very young. The documentation (and the implementation) still needs a lot of work!
