
Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)

github.com

Hey HN, we want to share HelixDB (https://github.com/HelixDB/helix-db/), a project a college friend and I are working on. It’s a new database that natively intertwines graph and vector types, without sacrificing performance. It’s written in Rust and our initial focus is on supporting RAG. Here’s a video runthrough: https://screen.studio/share/szgQu3yq.

Why a hybrid? Vector databases are useful for similarity queries, while graph databases are useful for relationship queries. Each stores data in a way that’s best for its main type of query (e.g. key-value stores vs. node-and-edge tables). However, many AI-driven applications need both similarity and relationship queries. For example, you might use vector-based semantic search to retrieve relevant legal documents, and then use graph traversal to identify relationships between cases.

Developers of such apps have the quandary of needing to build on top of two different databases—a vector one and a graph one—plus you have to link them together and sync the data. Even then, your two databases aren't designed to work together—for example, there’s no native way to perform joins or queries that span both systems. You’ll need to handle that logic at the application level.

Helix started when we realized that there are ways to integrate vector and graph data that are both fast and suitable for AI applications, especially RAG-based ones. See this cool research paper: https://arxiv.org/html/2408.04948v1. After reading that and some other papers on graph and hybrid RAG, we decided to build a hybrid DB. Our aim was to make something better to use from a developer standpoint, while also making it fast as hell.

After a few months of working on this as a side project, our benchmarking shows that we are on par with Pinecone and Qdrant for vectors, and our graph is up to three orders of magnitude faster than Neo4j.

Problems where a hybrid approach works particularly well include:

- Indexing codebases: you can vectorize code snippets within a function (connected by edges) based on context, and then create an AST (in a graph) from function calls, imports, dependencies, etc. Agents can look up code by similarity or keyword and then traverse the AST to get only the relevant code, which reduces hallucinations and prevents the LLM from guessing object shapes or variable/function names.

- Molecule discovery: Model biological interactions (e.g., proteins → genes → diseases) using graph types and then embed molecule structures to find similar compounds or case studies.

- Enterprise knowledge management: you can represent organisational structure, projects, and people (e.g., employee → team → project) in graph form, then index internal documents, emails, or notes as vectors for semantic search and link them directly to employees/teams/projects in the graph.
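To make the hybrid pattern behind all three examples concrete, here is a minimal, self-contained sketch: a vector similarity search picks an entry point, and a graph hop from that node collects related items. Everything here (the `Doc` struct, `cosine`, `hybrid_query`) is illustrative; none of it is HelixDB's actual API.

```rust
use std::collections::HashMap;

// A node with an embedding and outgoing edges (by id).
struct Doc {
    id: u32,
    embedding: Vec<f32>,
    links: Vec<u32>, // graph edges, e.g. "cites" relationships
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// Step 1: vector search for the best-matching document.
// Step 2: one-hop graph traversal from that document.
fn hybrid_query(docs: &HashMap<u32, Doc>, query: &[f32]) -> (u32, Vec<u32>) {
    let best = docs
        .values()
        .max_by(|a, b| {
            cosine(&a.embedding, query)
                .partial_cmp(&cosine(&b.embedding, query))
                .unwrap()
        })
        .unwrap();
    (best.id, best.links.clone())
}

fn main() {
    let mut docs = HashMap::new();
    docs.insert(1, Doc { id: 1, embedding: vec![1.0, 0.0], links: vec![2, 3] });
    docs.insert(2, Doc { id: 2, embedding: vec![0.0, 1.0], links: vec![] });
    docs.insert(3, Doc { id: 3, embedding: vec![0.7, 0.7], links: vec![1] });

    let (hit, related) = hybrid_query(&docs, &[0.9, 0.1]);
    println!("best match: {hit}, linked: {related:?}");
}
```

The point of a native hybrid store is that both steps run inside one engine, rather than gluing a vector DB and a graph DB together at the application level as described above.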

When I was first learning about databases, I naively assumed that queries would be compiled and executed like functions in traditional programming. It turns out they usually aren't, and that creates unnecessary latency: the client sends extra data (the whole written query), which the database compiles at run time and then executes. With Helix, you write queries in our query language (HelixQL), which is transpiled into Rust code and built directly into the database server, where you can call a generated API endpoint.
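A toy illustration of the difference (nothing here is HelixDB code; the query text, the parser, and both functions are made up for this sketch): the interpreted path must parse the query string on every request, while the compiled path is just a native function that an endpoint can call directly.

```rust
// Runtime interpretation: parse the query text on every call.
// Toy parser for queries of the form "users WHERE age > N".
fn run_interpreted(query: &str, users: &[(&str, u32)]) -> Vec<String> {
    let n: u32 = query.rsplit(' ').next().unwrap().parse().unwrap();
    users
        .iter()
        .filter(|(_, age)| *age > n)
        .map(|(name, _)| name.to_string())
        .collect()
}

// Compiled ahead of time: the same logic as a plain Rust function,
// roughly the idea behind transpiling a query into the server binary.
fn users_older_than(n: u32, users: &[(&str, u32)]) -> Vec<String> {
    users
        .iter()
        .filter(|(_, age)| *age > n)
        .map(|(name, _)| name.to_string())
        .collect()
}

fn main() {
    let users = [("ada", 36), ("alan", 41), ("grace", 29)];
    // Both paths return the same result; only the compiled one skips
    // shipping and parsing the query text per request.
    assert_eq!(
        run_interpreted("users WHERE age > 30", &users),
        users_older_than(30, &users)
    );
    println!("{:?}", users_older_than(30, &users));
}
```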

Many people have a thing against “yet another query language” (doubtless for good reason!) but we went ahead and did it anyway, because we think it makes working with our database so much easier that it’s worth a bit of a learning curve. HelixQL takes from other query languages such as Gremlin, Cypher and SQL with some extra ideas added in. It is declarative while the traversals themselves are functional. This allows complete control over the traversal flow while also having a cleaner syntax. HelixQL returns JSON to make things easy for clients. Also, it uses a schema, so the queries are type-checked.

We took a crude approach to building the original graph engine as a way to get an MVP out, so we are now working on improving the graph engine by making traversals massively parallel and pipelined. This means data is only ever decoded from disk when it is needed, and parts of reads are all processed in parallel.

If you’d like to try it out in a simple RAG demo, you can follow this guide and run our Jupyter notebook: https://github.com/HelixDB/helix-db/tree/main/examples/rag_d...

Many thanks! Comments and feedback welcome!


rohanrao123

Congrats on the launch! I'm one of the authors of that paper you cited; glad it was useful and helped inspire this :) Let me know if we can support in any way!

GeorgeCurtis

Wow! I enjoyed reading it a lot and it was definitely inspiring for this project!

Would love to talk to you about it and make sure we capture all of the pain points if you're open to it? :)

rohanrao123

Absolutely, will DM you on X!

quantike

I spent a bit of time reading up on the internals and had a question about a small design choice (I am new to DB internals, specifically as they relate to vector DBs).

I notice that in your core vector type (`HVector`), you choose to store the vector data as a `Vec<f64>`. Given what I have seen from most embedding endpoints, they return `f32`s. Is there a particular reason for picking `f64` vs `f32` here? Is the additional precision a way to avoid headaches down the line or is it something I am missing context for?

Really cool project, gonna keep reading the code.

xavcochran

Thanks for the question! We chose f64 as the default for now just to cover all cases, and we believed basic vector operations wouldn't be our bottleneck initially. As we optimize our HNSW implementation, we're going to add support for f32 and binary vectors, and drop Vec<f64/f32> in favour of [f64/f32; {num_dimensions}] to avoid unnecessary heap allocation!
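The memory arithmetic behind this thread can be checked directly. The dimension constant below is an assumption for illustration; the size relationships are what matter: a `Vec<f64>` puts its elements on the heap behind a (pointer, length, capacity) header, while a fixed-size `[f32; N]` lives entirely on the stack at half the bytes per element.

```rust
fn main() {
    const DIMS: usize = 1536; // assumed embedding size for illustration

    // Heap-allocated, double precision: 8 bytes per element.
    let heap_f64: Vec<f64> = vec![0.0; DIMS];
    assert_eq!(heap_f64.len() * std::mem::size_of::<f64>(), 12_288);

    // Stack-allocated, single precision: half the bytes, no heap allocation.
    let stack_f32: [f32; DIMS] = [0.0; DIMS];
    assert_eq!(std::mem::size_of_val(&stack_f32), 6_144);

    // The Vec value itself is just the three-word (ptr, len, cap) header;
    // the element data lives elsewhere on the heap.
    assert_eq!(
        std::mem::size_of::<Vec<f64>>(),
        3 * std::mem::size_of::<usize>()
    );

    println!(
        "f64 heap: {} bytes, f32 stack: {} bytes",
        heap_f64.len() * 8,
        std::mem::size_of_val(&stack_f32)
    );
}
```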

quantike

I appreciate the reply! Yeah, it sounds like the correct path forward is swapping out the type for some enum of the numeric types you want to cover.

I'd be curious whether there's some benefit to runtime memory utilization from baking in the precision of the vector if it's known at comptime/runtime. In my own usage of vector DBs I've only ever used single precision (f32), and often a single, known dimension. But if Helix is aiming for something more general purpose, then it makes sense to offer mixing of precision and dimension in the internals.

Cheers

xavcochran

The benefit of baking in the dimension and size of individual elements (the precision) is the fact that the size will be known at compile time meaning it can be allocated on the stack instead of being heap allocated.

srameshc

I was thinking about intertwining vector and graph, because I have one specific use case that requires this combination. But I am not courageous or competent enough to build such a DB, so I am very excited to see this project and I am certainly going to use it. One question: what kind of hardware do you think this would require? I ask because, from what I understand, graph database performance is directly proportional to the amount of RAM it has, and vectors also need persistence and computational resources.

GeorgeCurtis

The fortunate thing about our vector DB, like I mentioned in the post, is that we store the HNSW on disk. So, it is much less intense on your memory. Similar thing to what turbo puffer has done.

With regard to the graph db, we mostly use our laptops to test it and haven't run into an issue with performance yet on any size dataset.

If you wanna chat DM me on X :)

UltraSane

Neo4j supports vector indexes

GeorgeCurtis

First of all, Neo4j is very slow for vectors, so if performance matters for your user experience they definitely aren't a viable option. This is probably why Neo4j themselves have released guides on how to build that middleman software I mentioned with Qdrant for viable performance.

Furthermore, vectors are capped at 4k dimensions, which, although that may be enough most of the time, is a problem for some of the users we've spoken to. Also, they don't allow pre-filtering, which is a problem for a few people we've spoken to, including Zep AI. They are on the right track, but there are a lot of holes that we are hoping to fill :)

Edit: AND, it is super memory intensive. People have had problems using extremely small datasets and have had memory overflows.

mauvo59

Hey, want to correct some of your statements here. :-)

Neo4j's vector index uses Lucene's HNSW implementation. So, the performance of vector search is the same as that of Lucene. It's worth noting that performance suffers when configured without sufficient memory, like all HNSW vector indexes.

>> This is probably why Neo4j themselves have released guides on how to build that middleman software I mentioned with Qdrant for viable performance.

No, this is about supporting our customers. Combining graphs and vectors in a single database is the best solution for many users - integration brings convenience, consistency, and performance. But we also recognise that customers might already have invested in a dedicated vector database, need additional vector search features we don't support, or benefit from separating graph and vector resources. Generally, integrating well with the broader data ecosystem helps people succeed.

>> Furthermore, the vectors is capped at 4k dimensions

We occasionally get asked about support for 8k vectors. But so far, whenever I've followed up with users, there doesn't seem to be a case for them. At ~32kb per embedding, they're often not practical in production. Happy to hear about use cases I've missed.

>> Also, they don't allow pre filtering which is a problem for a few people we've spoken to including Zep AI.

We support pre- and post-filtering. We're currently implementing metadata filtering, which may be what you're referring to.

>> AND, it is super memory intensive.

It's no more memory-intensive than other similar implementations. I get that different approaches have different hardware requirements. But in all cases, a misconfigured system will perform poorly.

hbcondo714

Congrats! Any chance Helixdb can be run in the browser too, maybe via WASM? I'm looking for a vector db that can be pre-populated on the server and then be searched on the client so user queries (chat) stay on-device for privacy / compliance reasons.

GeorgeCurtis

Interesting, we've had a few people ask about this. So essentially you'd call the server to retrieve the HNSW and then store it in the browser and use WASM to query it?

Currently the roadblock for that is the LMDB storage engine. We have our own storage engine on our roadmap, which we want to include WASM support with. If you wanna talk about it, reach out to me on Twitter: https://x.com/georgecurtiss

xavcochran

To add to George's reply: for Helix to run in the browser with WASM, the storage engine has to be completely in memory. At the moment we use LMDB, which uses file-based storage, so that doesn't work in the browser. As George said, we plan on making our own storage engine, and as part of that we aim to have an in-memory implementation.

hansworst

Not entirely sure if you could use it, but wondering if you’ve heard about the origin private file system feature of modern browsers? https://developer.mozilla.org/en-US/docs/Web/API/File_System...

xavcochran

very interesting, will look into this. I know for a fact that you cannot compile the likes of LMDB and RocksDB to work with WASM but this looks promising for our custom storage engine to be able to make it work with the browser. Thanks for this!

huevosabio

Can I run this as an embedded DB like sqlite?

Can I sidestep the DSL? I want my LLMs to generate queries and using a new language is going to make that hard or expensive.

GeorgeCurtis

Currently you can't run us embedded and I'm not sure how you could sidestep the DSL :/

We're working on putting our grammar into llama.cpp so that it only outputs grammatically correct HQL. But even without that, it shouldn't be hard or expensive to do. I wrote a Claude wrapper that had our docs in its context window, and it did a good job of writing queries most of the time.


tmpfs

This is very interesting. Are there any examples of interacting with LLMs? If the queries are compiled and loaded into the database ahead of time, the pattern of asking an LLM to generate a query from a natural-language request seems difficult, because current LLMs aren't going to know your query language yet, and compiling each query for each prompt would add unnecessary overhead.

GeorgeCurtis

This is definitely a problem we want to fix quickly. We're currently planning an MCP tool that can traverse the graph and decide for itself at each step where to go next, as opposed to having to generate actual text-written queries.

I mentioned in another comment that you can provide a grammar with constrained decoding to force the LLM to generate tokens that comply with the grammar. This ensures that only valid syntactic constructs are produced.

esafak

How does it compare with https://kuzudb.com/ ?

GeorgeCurtis

Kuzu doesn't support incremental indexing on the vectors. The vector index is completely separate and decoupled from the graph.

I.e: You have to re-index all of the vectors when you make an update to them.

wontonaroo

Firstly congratulations on your effort.

How does the graph component of your database perform compared to Kuzu? Do you have any benchmarks?

For RAG I've tried Qdrant, Meilisearch, and Kuzu. At the moment I wouldn't consider HelixDB because of HelixQL. Wondering why you didn't use OpenCypher?

At the moment you have this system which is aimed to support AI/LLM systems but by creating HelixQL you do not have an AI coding friendly query language.

With OpenCypher even older cheap models can generate queries. Or maybe some GraphQL layer.

GeorgeCurtis

Thanks for the support :)

We're currently working on benchmarks, so nothing exact on Kuzu right now with regard to performance. We've had quite a few requests for benchmark comparisons against different databases, so they should take a good few days. Will return here when they are ready.

When we used Cypher in the past, we didn't get on with the methodology of the language that well. A functional approach, like Gremlin's, suited our minds better. But Gremlin's syntax is awful (in our opinion), and we felt the amount of boilerplate code you need was unnecessary.

We wanted to make something that was easier to read than Gremlin, like Cypher, but that also has a functional aspect that makes traversals feel much more intuitive.

Another note, we're more fond of type-safe languages, and it didn't make much sense to us that out of all the programming languages that exist, query languages were the non-type-safe ones.

We know it's a pain learning a new language, but we really believe that our approach will pave the way for a better development experience and a better paradigm.

Onto the AI stuff, you're right, it isn't ideal (right now). We did make a gpt wrapper that did a pretty good job of writing queries based on a condensed version of our docs, but this isn't ideal. So, the next thing on our road map is a graph traversal MCP tool. Instead of the agent having to generate text written queries, it can use the traversal tools and determine where it should hop to at each step.

We know we're being quite ambitious here, but we think there's a lot we can improve on over existing solutions.

Thanks again :)

youdont

Looks very interesting, but I've seen these kinds of multi-paradigm databases, like Gel, Helix and Surreal, and I'm not sure that any of them quite hit the graph spot.

Does Helix support much of the graph algorithm world? For things like GraphRAG.

Either way, I'd be all over it if there was a Python SDK which worked with the generated types!

BlooIt

Shameless plug: If you're exploring graph+vector databases, check out https://github.com/Pometry/Raphtory/ — with a full Python SDK and built-in support for most common graph algorithms.

It's built in Rust with native vector support. The open-source version is in-memory, but the commercial version supports disk-based scaling (we tested it with a 3TB graph on an M1 MacBook, with inserts 100x faster than existing graph DBs).

mbrinman

When are you planning on releasing your commercial version? I couldn't find any information online with regard to pricing, etc.

xavcochran

Looking at your benchmarks, you say inserting 1k edges takes around 500,000 ns/iteration. Is this 500,000 ns per edge insertion or for all 1k of them?

BlooIt

Hello. These benchmarks are a bit outdated, we’re currently updating them this sprint.

The open-source in-memory version loads around 3 million edges/second, while the on-disk version does about 2 million edges/second with a WAL batch size of 100, and 3 million with no WAL.

GeorgeCurtis

We started as a graph database, so that's definitely the main thing we want to get right, and we want to prioritise capturing all that functionality.

We have a python SDK already! What do you mean by generated types though?

Onawa

I have been happily using Gel (formerly EdgeDB) for a few projects. I'm curious what you think it is missing in regards to hitting the "graph spot"?

GeorgeCurtis

Gel is a relational database; have you been building with it under a graph-type philosophy?

iannacl

Looks really interesting. A couple of questions: Can you explain how helix handles writes? What are you using for keys? UUIDs? I'm curious if you've done, or are thinking about, any optimizations here.

Feel free to point me to docs / code if these are lazy questions :)

xavcochran

We utilize some of LMDB's optimizations such as the APPEND put flags. We also make use of LMDB handling duplicates as a one-to-many key instead of duplicating keys. This means we can get all values for one key in one call rather than a call for each duplicate.

For keys we are using UUIDs, but using the v6 timestamped uuids so that they are easily lexicographically ordered at creation time. This means keys inserted into LMDB are inserted using the APPEND flag, meaning LMDB shortcuts to the rightmost leaf in its B-Tree (rather than starting at the root) and appends the new record. It can do this because the records are ordered by creation time meaning each new record is guaranteed to be larger (in terms of big-endian byte order) than the previous record.

We also store the UUIDs as u128 values for two reasons. The first is that a u128 takes up 16 bytes, whereas a string UUID takes up 36 bytes. This means we store 56% less data, and LMDB has to decode 56% fewer bytes on each access.

For the outgoing/incoming edges for nodes, we store them as fixed sizes which means LMDB packs them in, removing the 8 byte header per Key-Value pair.

In the future, we are also going to separate the properties from the stored value as empty property objects still take up 8 bytes of space. We will also make it so nothing is inserted if the properties are empty.

You can see most of this in action in the storage core file: https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...
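The size and ordering claims in this reply are easy to verify. A quick sketch (the example values are arbitrary): a u128 is 16 bytes versus 36 for a canonical hyphenated UUID string, roughly a 56% saving, and big-endian byte encoding preserves numeric order, which is what makes the append-to-rightmost-leaf insert path possible.

```rust
fn main() {
    // 16 bytes binary vs 36 bytes as a canonical hyphenated string.
    assert_eq!(std::mem::size_of::<u128>(), 16);
    let s = "01234567-89ab-cdef-0123-456789abcdef"; // arbitrary example
    assert_eq!(s.len(), 36);
    let saved = 1.0 - 16.0 / 36.0; // fraction of bytes saved, ~0.56

    // Big-endian bytes compare the same way as the integers themselves,
    // so a later (larger) timestamped id always sorts after an earlier one.
    let a: u128 = 1;
    let b: u128 = 2;
    assert!(a < b);
    assert!(a.to_be_bytes() < b.to_be_bytes());

    println!("binary form saves {:.0}% of the bytes", saved * 100.0);
}
```

This is why lexicographic key order in the B-tree lines up with creation order when keys are v6 (timestamp-first) UUIDs stored big-endian.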

Attummm

It sounds very intriguing indeed. However, the README makes some claims. Are there any benchmarks to support them?

> Built for performance we're currently 1000x faster than Neo4j, 100x faster than TigerGraph

GeorgeCurtis

Those were actual benchmarks that we ran; we just didn't get a chance to write them up properly before posting. I'll get on it now and notify you by replying to this comment when they're in the README :)

sitkack

Excellent work. Very excited to test this out. What are the limits or gotchas we should be aware of, and how do you want it pushed?

What other papers did you get inspiration from?

xavcochran

Thanks for the kind words! At the moment the query language transpilation is quite unstable, but we are in the process of a large remodel which we aim to finish in the next day or so. This will make the query language compilation far more robust and will return helpful error messages (like the Rust compiler's). The other thing is that the core traversals are currently single-threaded, so aggregating huge lists of graph items can take a bit of a hit. Note, however, that we are also implementing parallel LMDB iterators, with the help of the Meilisearch guys, to make aggregation of large results much faster.

ckugblenu

Very interesting project. Would be curious about a comparison with Memgraph. Will definitely give it a try for my knowledge graph use case.

GeorgeCurtis

I'll add Memgraph to our benchmarking list! Make sure you join our Discord; we'd love to help in any way we can and hear about any issues you run into.

