
szarnyasg

I designed and maintain several graph benchmarks at the Linked Data Benchmark Council, including workloads aimed at databases [1]. We place no restrictions on implementations; they can use any query language, such as Cypher, SQL, etc.

In our last benchmark, aimed at analytical systems [2], we found that SQL queries using WITH RECURSIVE can work for expressing reachability and even weighted shortest path queries. However, formulating an efficient algorithm yields very complex SQL queries [3], and their execution requires a system with a sophisticated optimizer, such as Umbra, developed at TU Munich [4]. Industry SQL systems are not yet at this level, but they may attain it sometime in the future.
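A toy illustration of the WITH RECURSIVE reachability pattern being discussed (nothing like the benchmark queries in [3]; the schema and node names here are hypothetical, and SQLite stands in for a real analytical system):

```python
import sqlite3

# Hypothetical minimal schema: a single edges(src, dst) table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d');
""")

# Reachability from 'a': the recursive CTE repeatedly joins the
# current frontier back onto the edge table until no new rows appear.
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT node FROM reachable;
""").fetchall()
print(sorted(r[0] for r in rows))  # → ['a', 'b', 'c', 'd']
```

The UNION (rather than UNION ALL) deduplicates rows, which is what makes this safe on graphs with cycles.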

Another direction to include graph queries in SQL is the upcoming SQL/PGQ (Property Graph Queries) extension. I'm involved in a project at CWI Amsterdam to incorporate this language into DuckDB [5].

[1] https://ldbcouncil.org/benchmarks/snb/

[2] https://www.vldb.org/pvldb/vol16/p877-szarnyas.pdf

[3] https://github.com/ldbc/ldbc_snb_bi/blob/main/umbra/queries/...

[4] https://umbra-db.com/

[5] https://www.cidrdb.org/cidr2023/slides/p66-wolde-slides.pdf

JimmyRuska

This video goes through some of the nuances https://www.youtube.com/watch?v=tNgFMBzYcl8

Take a look at RDFox. Unlike Postgres, where you can only derive a column from another column once (you can't derive a column from a derived column), RDFox can derive unlimited columns, or even objects, through rules, incrementally. It's kind of like unlimited chaining of materialized views.

https://docs.oxfordsemantic.tech/

eternalban

Re [5]'s assertion, under "blunders", about diminished use cases post-SQL/PGQ: what do you think of [something] like Quine?

https://github.com/thatdot/quine

Their claim to fame is progressive incremental computation -- each node is an actor responding to events -- and I'm not sure how a relational DB could do that and match the latencies. That use case is pretty much pattern matching, forensics, and stuff like that.

https://docs.quine.io/core-concepts/architecture.html

szarnyasg

It's also worth mentioning the idea of WITH ITERATIVE. This is a proposed SQL syntax extension that allows formulating iterative computations. For many problems (including graph traversals), this makes queries easier to write and execute.

> We derive new iterative CTE variants from the simple loop-based, operational semantics of SQL:1999’s WITH RECURSIVE.

https://www.cidrdb.org/cidr2023/papers/p14-hirn.pdf

https://www.cidrdb.org/cidr2023/slides/p14-hirn-slides.pdf

https://www.youtube.com/watch?v=ZDtSGAeYKU0

szarnyasg

Quine seems to be an interesting project but I'm not too familiar with it. Its main feature, evaluating complex multi-way join queries on incoming streaming data, exists in the relational world as the Materialize database which leverages differential dataflow for computation.

Quine uses Cypher, so path queries can be expressed with the concise Kleene-star syntax, e.g. (p1:Person)-[:knows*]-(p2:Person).

Materialize is getting support for WITH RECURSIVE and WITH MUTUALLY RECURSIVE (their own SQL extension that fixes some issues of WITH RECURSIVE):

https://github.com/MaterializeInc/materialize/issues/11176

simonw

This same technique - WITH RECURSIVE - is also supported by SQLite.

I ported the example in this tutorial to SQLite here - now you can play with it in Datasette Lite: https://lite.datasette.io/?sql=https%3A%2F%2Fgist.githubuser...

Amusingly I ported the sample PostgreSQL code using the new ChatGPT "Browsing" alpha, with the following prompt:

> Read this tutorial and then output equivalent create table and insert statements for SQLite https://www.dylanpaulus.com/posts/postgres-is-a-graph-databa...

punnerud

If we had links to other SQLite3 databases inside SQLite itself, we could build a static-webpage distributed graph database: https://news.ycombinator.com/item?id=27016630

That's the static database trick combined with recursive graph queries.

henryfjordan

I implemented pretty much exactly this setup once.

For what it's worth, there are a lot of footguns with this approach. You need to be careful about infinite cycles and things like that. You also just don't get the ergonomics of a query language that is meant for graph data. Once you get it all set up and experience the issues you will no doubt be able to build whatever you want, but there's a learning curve.

In the end we switched to using Neo4j and were able to build new features a lot more quickly than we would've been able to on Postgres.

It's also worth mentioning that there are many ways to store graph data, and using an "adjacency list" as is being done here may or may not be the best way for your use case. Neo4j, I think, uses a different approach where nodes directly store info about the edges they are connected to, so you don't even need to hit an index to follow an edge.

cryptonector

> You need to be careful about infinite cycles and things like that.

Not really, just never forget this: always use `UNION`, and never `UNION ALL`, in recursive queries!
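A minimal illustration of why UNION terminates on a cyclic graph where UNION ALL would not (hypothetical schema, SQLite via Python):

```python
import sqlite3

# A two-node cycle: a -> b -> a. With UNION ALL the recursion would
# keep re-deriving the same rows forever; UNION deduplicates each new
# row against everything seen so far, so it bottoms out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','a');
""")
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'a'
        UNION              -- UNION ALL here would never terminate
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT node FROM reachable;
""").fetchall()
print(sorted(r[0] for r in rows))  # → ['a', 'b']
```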

twic

There are still ways to go wrong. Try writing a query which explores a graph from a starting node, building up a path from the start to each node as a string. Even using UNION you will get an infinite loop, because every visit to a node in the loop has a different path, so it makes a unique row.
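A sketch of that trap: once the recursive term carries a path string, every visit produces a distinct row, so UNION alone no longer breaks cycles. One common workaround (only one of several; schema and separator are hypothetical) is to refuse to extend a path onto a node it already contains:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','a');
""")
rows = conn.execute("""
    WITH RECURSIVE walk(node, path) AS (
        SELECT 'a', 'a'
        UNION ALL
        SELECT e.dst, w.path || '->' || e.dst
        FROM edges e JOIN walk w ON e.src = w.node
        WHERE instr(w.path, e.dst) = 0   -- don't revisit a node already on the path
    )
    SELECT path FROM walk;
""").fetchall()
# Single-character node names keep the instr() check unambiguous;
# multi-character names would need delimiter-aware matching.
print(sorted(r[0] for r in rows))  # → ['a', 'a->b', 'a->b->c']
```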

cryptonector

The aggregation needs to be done after you've computed the transitive closure, not while you're computing it.

magicalhippo

Seems to be Postgres-specific advice?

Others, for example SQLite[1], seem to advocate using UNION ALL over UNION.

[1]: https://sqlite.org/lang_with.html#recursive_query_examples

cryptonector

No, it's not Postgres-specific advice. Using UNION ALL in recursive queries is a mistake if you have cycles in your data. This follows from first principles and the standard.

mistermann

Piggybacking on your Neo4j reference: can anyone suggest resources that would help someone from the SQL world who is very uninformed on graph databases to quickly get their head around the key ideas, but even more importantly, around how to choose a platform among the offerings (licensing being, I think, a key issue; I've heard Neo4j can get expensive)? Assume a large, "social media" scale system, to be safe.

henryfjordan

Neo4j has a self-hosted community edition that is free, though you need to upgrade to Enterprise to expand your DB beyond one node. It's worth noting that scaling from a single node to a cluster is going to cause some pretty huge performance issues whether you are using SQL or Neo4j to store your graph. The power of graph databases is that it should be very inexpensive to follow an edge in the graph, but when that leads to a network hop you lose a lot of performance.

If you are comfortable fitting your data onto one box, then development speed is probably more important than other factors and I would just try a few databases out and especially pay attention to how good the libraries are in your programming language of choice. Neo4j for instance had high quality libraries in Java but the experience in Python was not as good.

If you have a lot of data or need next-level performance, I would start by doing some research on the various ways to store graph data and then pick a DB that supports the approach you want to take.

j45

Vaticle's TypeDB is intriguing

joshspankit

As someone who hated being in the SQL world and couldn’t figure out why until years later when I found out about graph databases, here’s one big shift:

Graph databases store true relationships.

The R in RDBMS and the concept of relationships by primary keys are lies (imo). They lead people away from what true relationships can be. Databases like Neo4j are not about doing any sort of PK or join or merge-before-query. Looking up by connection is the fastest way to find information. If your RDBMS had one table for phone numbers, one for email addresses, one for first names, one for last names, one for street, and so-on it would take huge amounts of time to query to find all the detail for a single person. With a graph database it takes a small amount of time to find the first bit of info (let’s say phone number since it’s easily sorted), and then every other bit of linked data is essentially free.

tejtm

Apologies in advance: the R for relational is with respect to the tuple, that is, the unordered items serialized within a row of a table, not the keys which link different tables. The lie you find does not originate in the definition.

As someone who has wanted to love graph DBs: keep to narrowly focused, top-down queries, where index-free adjacency lets you avoid touching the bulk of your data, and stay away from bulk queries that may need to visit everything multiple times, which is unfortunately where things get interesting.

eurasiantiger

Effectively, moving from an RDB to a graph database just shifts relationship resolution from query time to insert/update time. Sometimes this can be an advantage. I'd even wager usually.

WolfOliver

The R in RDBMS stands for the mathematical notion of a relation, which is defined as a subset of a Cartesian product.

throwaway689236

Sounds great, what are the downsides? When is it a definitely bad idea to use it?

bavell

I've been using EdgeDB in a greenfield project and loving it. Some rough edges (forgive the pun), but fantastic for prototyping / smaller projects, and it could be a good candidate for large projects. I like it better than Neo4j, which I used a few years back for a client project. Much better license, and I really like that it's built on top of Postgres.

pottertheotter

I was looking at graph databases/Neo4j a lot early last year and found Neo4j has a lot of good resources.

You can get this book for free through their website. [1]

They have some good documentation[2], including the page "Transition from relational to graph database".[3]

[1] https://neo4j.com/graph-databases-book/

[2] https://neo4j.com/docs/getting-started/current/

[3] https://neo4j.com/docs/getting-started/current/appendix/grap...

taubek

For some basics of graph modeling take a look at https://memgraph.com/docs/memgraph/tutorials/graph-modeling.

Some differences between SQL and graph databases are covered at https://memgraph.com/blog/graph-database-vs-relational-datab....

xpe

> how to choose a platform among the offerings (licensing being I think a key issue, I've heard Neo4J can get expensive)

Safest bet (knowing nothing else about your needs): Build at least three prototypes, preferably rapidly and in parallel. Only after doing so will you be able to ask the right questions that speak accurately to your specific needs and therefore be able to map actual requirements to the different graph databases.

I’d be curious what other experienced graph database people would say about their first bigger projects. Looking back, my first graph projects involved significant iteration and several restarts. And these were still necessary even after I had significant background knowledge about the algorithms, offerings, and trade-offs.

There are many people in roles/orgs that don’t have the humility or organizational cover to admit that these early projects are largely experimental learning experiences.

> ...assume a large, "social media" scale system, to be safe.

This broad assumption is going to cost you.

Instead, I’d encourage you to reframe this assumption in more specific terms such as read and write patterns; need for consistency and availability; support expectations; your talent pool; and lots more. Estimates can be on a log10 scale: 1, 10, 100, 1000, and so on.


mdavidn

I do love PostgreSQL, and I often reach for this approach when working with hierarchical data, but a word of warning: Recursive queries will not scale to handle _very_ large edge tables. The recursive query will always consider _all_ edges, even when the outer query seeks only a single node's relationships. The solution is to build and index denormalized n-th order relationship tables.

Others have already pointed out the issues of cycles.
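One way to read that denormalization suggestion, sketched below: precompute a closure(src, dst, depth) table once, so lookups become a plain indexed probe instead of a recursive scan over the whole edge table. Table and column names are hypothetical, and SQLite stands in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d');

    -- Materialize the transitive closure with the relationship order
    -- (number of hops) alongside each pair.
    CREATE TABLE closure AS
        WITH RECURSIVE t(src, dst, depth) AS (
            SELECT src, dst, 1 FROM edges
            UNION
            SELECT t.src, e.dst, t.depth + 1
            FROM t JOIN edges e ON t.dst = e.src
        )
        SELECT * FROM t;
    CREATE INDEX closure_src ON closure(src);
""")

# A single node's relationships are now an indexed lookup, not a walk.
rows = conn.execute(
    "SELECT dst, depth FROM closure WHERE src = 'a' ORDER BY depth"
).fetchall()
print(rows)  # → [('b', 1), ('c', 2), ('d', 3)]
```

The trade-off is exactly the one raised below in the thread: the closure table must be kept in sync as edges change.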

ellisv

It's not even that it always considers all edges, but that you can't exit the search early when a condition is met. In other words, you have to first find all the paths and then, outside the CTE, filter to the shortest. We push the node filter into the CTE by wrapping it in a function.

> The solution is to build and index denormalized n-th order relationship tables.

This sounds much more performant but also more difficult to maintain.

mdavidn

Yes, it is difficult to maintain. That might be a good moment to consider a proper graph database!

ellisv

I’ve considered using a column to indicate 2nd, 3rd, etc. degree relations instead of n-th order tables.

switchbak

"you can't exit the search early when a condition is met" - I have a DAG traversal written in a recursive CTE, and I can bail out early just fine when my traversal predicates no longer match. Not sure why I'd have to do that outside the (recursive part of the) CTE?

Obviously maintaining a flattened version is going to perform better in queries, but you're trading maintenance (write) time for query time. Or materialization time if you decide to do it lazily.
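The "bail early" pattern this comment describes can be sketched like so: a predicate in the recursive term stops the traversal from expanding past nodes that fail it, so pruned subtrees are never visited at all. The schema (a tree with an 'active' flag per edge) is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (parent TEXT, child TEXT, active INTEGER);
    INSERT INTO edges VALUES
        ('root','a',1), ('a','b',1), ('root','x',0), ('x','y',1);
""")
rows = conn.execute("""
    WITH RECURSIVE live(node) AS (
        SELECT 'root'
        UNION
        SELECT e.child FROM edges e
        JOIN live l ON e.parent = l.node
        WHERE e.active = 1   -- traversal never descends past node x
    )
    SELECT node FROM live;
""").fetchall()
# 'y' is active but unreachable, because its parent edge was pruned.
print(sorted(r[0] for r in rows))  # → ['a', 'b', 'root']
```

This prunes whole branches; what it cannot do (ellisv's point) is compare or aggregate *across* branches mid-recursion.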

ellisv

If you have an edge list like:

A->B

A->D

B->C

C->D

Postgres will walk both paths from A to D in the recursive CTE and then you can filter them afterwards to keep only the shortest.

You can’t use aggregate functions within the recursive CTE, so you can’t GROUP BY your starting node and stop once you find the first path. There isn’t a way to compare across multiple paths or iterations of the loop.
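The shape being described, sketched on that exact edge list (SQLite standing in for Postgres): the recursive CTE enumerates every path from A to D, and only afterwards, outside the recursive term, can the shortest be picked:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('A','B'), ('A','D'), ('B','C'), ('C','D');
""")
row = conn.execute("""
    WITH RECURSIVE paths(node, path, hops) AS (
        SELECT 'A', 'A', 0
        UNION ALL
        SELECT e.dst, p.path || '->' || e.dst, p.hops + 1
        FROM edges e JOIN paths p ON e.src = p.node
    )
    SELECT path, hops FROM paths
    WHERE node = 'D'
    ORDER BY hops LIMIT 1;   -- filtering happens after the full walk
""").fetchone()
print(row)  # → ('A->D', 1)
```

Both paths to D (one hop and three hops) are fully materialized before the ORDER BY/LIMIT discards the longer one, which is the wasted work being complained about.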

Nezteb

I learned from one of the comments on the post about AGE[1], "a PostgreSQL extension that provides graph database functionality."

[1] https://age.apache.org/

cryptonector

Cycles are not a problem: just use `UNION`, not `UNION ALL`.

I myself build transitive closure tables that I update incrementally as needed (sometimes in a deferred way), so that many of the recursive queries I do can be very fast, but I only build transitive closures for the smallest portion I can (basically group nesting).

liotier

> Recursive queries will not scale to handle _very_ large edge tables

What do you consider large ? We have a 12-year old telco application with a few million objects in Neo4J, doing fault impact calculations... Would PostgreSQL handle that easily nowadays ?

simonw

My hunch is that a few million objects is pretty tiny these days - you could probably write a script to import all of them into PostgreSQL in a few minutes and try it out yourself to see how it behaves.

simonw

I tried it against SQLite. I got ChatGPT to write me a script that would insert 1m edges and 1m nodes:

https://gist.github.com/simonw/c16ce01244760e186a3a0aa3fee04...

Then I ran that query again and it seems to return results in about 80ms:

https://lite.datasette.io/?sql=https://gist.github.com/simon...
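A rough sketch of that kind of experiment, at a smaller scale so it runs quickly (the node count, seed, and schema here are arbitrary choices, not simonw's actual script):

```python
import random
import sqlite3
import time

random.seed(0)
N = 100_000  # edges (and node id range); smaller than the 1m in the comment

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    ((random.randrange(N), random.randrange(N)) for _ in range(N)),
)
conn.execute("CREATE INDEX edges_src ON edges(src)")

# Time a full reachability query from node 0.
start = time.perf_counter()
count = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 0
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT count(*) FROM reachable;
""").fetchone()[0]
print(count, f"{time.perf_counter() - start:.3f}s")
```

The src index is what keeps each recursive step cheap; without it every iteration scans the whole edge table.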

ellisv

It depends a lot on how connected the graph is and whether you can fit it in memory.

runeks

> The solution is to build and index denormalized n-th order relationship tables.

Can you elaborate on this?

Does an n-th order relationship table contain all the nodes reachable from some node going through n edges?

And you'd have one such table for each integer in the range 2..n?

afandian

I'm sorry, could you spell it out? What exactly does "recursive query will always consider _all_ edges" mean? A table scan? I'd be very grateful if you could give some pseudocode or point to a doc page.

jeremyjh

I think GP means that it has to completely expand the recursive part for every branch before any where condition on the edge nodes can be applied. Graph databases presumably can optimize this.

I've found recursive queries to be difficult to scale in real-world queries past a few million edge nodes. We've denormalized several tree relationships so that the edge is connected both to its parent and to the root of its tree in several cases.
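The tree denormalization mentioned here can be sketched like this: each row keeps a root_id alongside its parent_id, so "all nodes in this tree" becomes a flat indexed lookup with no recursion at all. Column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER,   -- immediate parent (NULL for roots)
        root_id INTEGER      -- denormalized: root of this node's tree
    );
    CREATE INDEX nodes_root ON nodes(root_id);
    -- Two trees: 1 -> 2 -> 3, and 4 -> 5.
    INSERT INTO nodes VALUES (1, NULL, 1), (2, 1, 1), (3, 2, 1),
                             (4, NULL, 4), (5, 4, 4);
""")

# Fetch an entire tree without a recursive CTE.
rows = conn.execute(
    "SELECT id FROM nodes WHERE root_id = 1 ORDER BY id"
).fetchall()
print([r[0] for r in rows])  # → [1, 2, 3]
```

The cost, as with any denormalization, is keeping root_id correct when subtrees are moved between trees.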

afandian

Thanks! Sounds like you're saying that brute-force recursive algorithms without backtracking or early termination aren't a good match for path finding. That's not a surprise.

I'm using a recursive algorithm on Postgres to find trees in a graph, but only where I'm specifically interested in all of the nodes.

cpa

It's a good tool as long as your queries are simple, and your graph isn't too large (which is probably the case 99% of the time). And if that allows you to have one less tool to deal with, go for it!

But I've tried to push it to its limits (using PostGIS, my nodes being land plots connected to each other — the biggest query used joins, laterals, recursive CTEs, windows… you name it) and ran into many issues: can't branch out early, hard to optimize, no support for complex queries (writing anything beyond BFS and DFS is a pita). Yet, in the end I accepted the struggle (and the long queries) because it was so nice to work with my data directly where it was kept :)

amyres

Using the built-in PostGIS topology tools? Or something more custom? I'm curious as I just started digging into and using PostGIS for a land parcel use case. I've wondered about the graph structure of parcels adjacent to others, rather than just a big unordered table of geom objects.

cpa

Just using the geography stuff and doing joins on ST_Intersects. I couldn't guarantee that my data formed a partition of the plane, which is necessary for topology iirc.

Fun was had, and headaches too. The biggest speedup I got was computing the graph online (creating node/vertex tables from the geometries) and then running the recursive CTE on that table.

amyres

Cool thanks! I'm amazed at how much can be done in PostGIS (if my SQL queries are good enough...)

jghn

Over time I've come to take this a step further. I feel people reach for a graph database way too early. Most use cases that most people have will work fine in a standard relational setup. And that brings with it the benefits of using technologies like Postgres that have stood the test of time. So I treat it as I do any performance consideration: first demonstrate that you actually need to go down the path less traveled, don't just assume you need to do so.

rajman187

Completely agree. In fact, even massive graph data can make use of a relational database + caching [1]

[1] https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...

yooloo

Of course a relational DB can act like a graph DB. It's just not as efficient, due to how things are stored and queried. It would be great to have a graph DB plugin (and I found one: https://github.com/apache/age)

piyh

Having first-class PageRank and shortest-path functions in TinkerPop Gremlin, versus having to roll them yourself with recursive CTEs, feels like a graph DB win.

eurasiantiger

The real question is: is an RDB better at being a graph DB than a graph DB is at being an RDB?

jakewins

Random trivia: Johan wrote the original Neo4j implementation as a graph layer on top of Postgres. Then Java released NIO, he got excited, and he tried replacing the PG engine with one built for graphs from scratch.

dapearce

Also see Apache Age, a graph database extension for Postgres: https://github.com/apache/age

ChicagoDave

As a C# guy, I’ve figured out how to build in-memory graphs with generics and DSL/fluent queries. I’m working on a blog entry about it because it’s such a powerful skill to learn and leverage. No data store required. I can serialize/deserialize the data structure to/from binary or JSON.

taubek

I personally prefer native graph databases such as Memgraph[1]. Graph databases have their advantages and disadvantages. Sometimes you need an RDBMS, sometimes you need a NoSQL solution. So I pick my stack based on the situation and the issue that I need to solve.

[1] https://memgraph.com

rajman187

“Native” is a misleading or misused term in the context of graph databases, because a graph cannot be mapped in a maximally consistent way onto the underlying hardware (the von Neumann architecture represents data sequentially, and it is much faster to access it that way rather than randomly).

There are generally two families in the graph database world: those which use underlying traditional tables of nodes and many-to-many edges; and index-free adjacency which just means each node in the graph knows the memory address of its connections (other side of the edges).

Distributed graphs necessarily end up using the former because it’s difficult if not impossible for a node to know the memory address of its connection when that crosses a physical boundary. So typically index-free adjacency graphs have a master-slave setup with multiple read replicas but a single one to write to.

So with a “native graph” you don’t rely on potentially expensive join operations to find neighbors of neighbors and can traverse complex paths easily.

dtomicevic

This is spot on. Though, in Memgraph’s context, there is a combination of lock-free techniques and raw-pointer representation that can work exceptionally well for a large variety of use cases, and even more so compared to others on streaming / event-driven / write-heavy workloads. Using raw pointers, there are also optimization opportunities when storing or rebalancing the graph to maximize the number of adjacent nodes in a cache line, so it benefits from sequential representation too.

mashroom

Thanks for pointing me toward Memgraph. I hadn't heard of it before, and it appears to be a great alternative to Neo4j.

Rapzid

I've found this post pretty inspiring for what can be done with graphs in Postgres: https://www.alibabacloud.com/blog/postgresql-graph-search-pr...

There are a LOT of use cases that will fit the capabilities and performance characteristics.

Postgres as a graph database - Hacker News