Sonic: Fast, lightweight and schema-less search backend

blacklight

While I really like their lightweight, SQL-like protocol instead of Elasticsearch's fat JSON, I really think that this project could have much more impact if it could be a drop-in replacement for ES.

Even if it offers only a fraction of the features offered by ES, that may be fair enough for at least half of the use-cases out there.

Sonic could have really had a strong selling point: "Use an ES-alternative that works fine in most of the real-world applications, but it's written in Rust and it only takes a fraction of the memory footprint required by ES, and it shouldn't require you to change your application code".

Instead, they are proposing yet another search protocol, that developers have to learn and adopt. That definitely increases the adoption barriers.

xvello

Since Elastic spitefully patched all of their client libraries to fail if the server is not a "genuine" ES server, I don't see what good a drop-in replacement with protocol compatibility would do.

Go client: https://github.com/elastic/go-elasticsearch/blob/3985f2a1554...

Python client: https://github.com/elastic/elasticsearch-py/commit/e72aa3e24...

snikolaev

Is it prohibited to include `X-Elastic-Product: Elasticsearch` in the output of your server if the user instructs the server to do so? :)
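A minimal sketch of the idea being discussed, assuming the client-side check amounts to comparing a single response header (which is what the comment above implies). The function names `add_product_header` and `client_accepts` are hypothetical and are not taken from any Elastic client library; the real check may look at more than this.

```python
PRODUCT_HEADER = "X-Elastic-Product"


def add_product_header(headers, enabled):
    """Return a copy of the response headers, opting in to the product header.

    `enabled` models a user-controlled server setting (an assumption for
    this sketch, not a documented option of any real server).
    """
    out = dict(headers)
    if enabled:
        out[PRODUCT_HEADER] = "Elasticsearch"
    return out


def client_accepts(headers):
    """Hypothetical stand-in for the client-side 'genuine server' check."""
    return headers.get(PRODUCT_HEADER) == "Elasticsearch"


base = {"Content-Type": "application/json"}
accepted = client_accepts(add_product_header(base, enabled=True))
rejected = client_accepts(add_product_header(base, enabled=False))
```

Under that assumption, compatibility becomes a one-line opt-in on the server side rather than a patch to every client.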

AbraKdabra

Those libraries are open source, just nuke those restrictions and you're good to go. Is it the best way? Maybe not, but it's better than modifying your server responses (and in the worst 1984 case, allowing Elastic to sue you), if you develop such a tool you can always put that distinction in your README.

hangonhn

I don't see how they can legally have any control over what a 3rd party's software outputs. And more importantly, how would they even enforce such restrictions?

leros

ElasticSearch is so much more than search. Sonic is very minimal in comparison, so a drop in replacement doesn't work here.

But yes, Sonic could replace lots of use cases.

markandrewj

Although not exactly the same, Elastic has an SQL query syntax which can be used now as well.

https://www.elastic.co/what-is/elasticsearch-sql

_boffin_

They’ve had it for a long time already. Was using it over a year and a half ago. Not really “new”

neodymiumphish

In fairness, the comment you're replying to didn't say it was new.

tensor

It's probably fairly easy to write an adapter here.

hardwaresofton

Wow it's weird that this comes up, I'm actually running a site I am going to repost to HN today that I want to use as a testbed for search engines (kind of like an extension to my recent collaboration with supabase[0]).

Right now I've got the site going on just Postgres FTS + trigram and it's pretty darn fast, looks like I need to test sonic too.

Going to burn some midnight oil (in my timezone, anyway) and get it out -- though sonic isn't implemented yet!

Anyway to make this comment useful to people, here's my short list of engines that I want to run in parallel:

- MeiliSearch (https://github.com/meilisearch/MeiliSearch)

- TypeSense (https://github.com/typesense/typesense)

- Lyra (https://github.com/LyraSearch/lyra)

- OpenSearch (https://github.com/opensearch-project/OpenSearch)

- ZincSearch (https://github.com/prabhatsharma/zinc)

- Sonic (https://github.com/valeriansaliou/sonic)

There isn't enough out there comparing all these for the simple typical fuzzy search/search box usecase, so I'm adapting a little podcast search site I made to try and use all of these at the same time. So far only Postgres though, will try and add Meilisearch today and post it!

Like other people are pointing out, most of these engines won't have all the features of ES (or more accurately Lucene) but I am pretty convinced that most of the time it doesn't actually matter and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).

[0]: https://supabase.com/blog/postgres-full-text-search-vs-the-r...

nightpool

> and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).

I don't understand this comment. Why would you search something that *isn't*, in some senses, a repository of information? I would say almost every website needs to have search in some sense, and it's *because* sites function as a repository of information that they need this search. Think about e.g. Stripe's documentation, or Github's repository / code search. HN is also another great example—I search for stories or comments all the time to try and remember something I read about recently or heard about last week, but couldn't quite remember. I'm hard-pressed to think of a web site I use regularly that *shouldn't* have full-text search, if I'm being honest.

TylerE

Most site searches are basically unusable. Either it isn't very good, is painfully slow, or both.

Just googling site:foo.com/baz <query> almost always produces better results.

hardwaresofton

I don't consider use cases like documentation a "repository" of information, but maybe this is just me phrasing it badly. In the literal sense, sure, it is, but when I think of a "repository of information" I think of Wikipedia, Amazon search items, etc.

The scale of a documentation site is a very different problem -- you can brute force it in ways that you can't at larger scales.

I agree that HN would be a case of the large repository, but even then what most people want out of HN search is pretty simple/basic keyword search. I think a decent non-frustrating HN search feature could be very basic and get by without most of the advanced features/rabbit holes available in search.

Basically I think most apps fall into the lighter search use case -- command palettes, search inside of apps with a small scale of information, etc.

My comment wasn't that apps shouldn't have full text search -- it was that most that have full text search don't need complex full text search with all the bells and whistles that lucene and other serious search engines provide. These up-and-comers might be enough for a bunch of apps for which search is not the main feature.

nightpool

I think the sibling comment on this thread kind of disproves this to some extent. Sure, some simple in-memory search library might be good enough for a command palette or for "apps", but the fact that even tiny docs and "marketing" sites that try to implement site search are almost always outdone by google "site:example.com query" really goes to show how much value a full stemming / synonym clustering / syntax normalizing search engine can bring.

Bilal_io

Hey that's a great list of tools.

Are you aware of any that can be used client side like Lyra and supports faceted search?

I've been looking for a solution and cannot find it, even an algorithm and/or a data structure can be helpful. I attempted coming up with a solution myself but ended up with frustration when it came to making the facets dynamic and update as other filters are applied.

I read a couple of papers and one stood out [0], which introduces category theory as a solution to faceted filtering. I understood it in theory, but it still does not seem straightforward to implement, and I haven't attempted it yet.

0. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5145200/#!po=28...

hardwaresofton

So for client-side search, I generally know of Lunr.js:

https://lunrjs.com/docs/index.html

There are some others but I can't find them at this moment -- a bunch of the other projects I find are somewhat abandoned, lunr is actually on my list of things to use (because it makes the most sense to just ship a pre-built index with the first like... 5 letters maybe of typeahead, no matter how fast the backend is)

Bilal_io

Thanks for the link. This unfortunately is not what I am looking for. Faceted filters are a different beast.

hawski

Thank you for this comparison. I would also like to know how Bleve Search (https://github.com/blevesearch/bleve) turns out.

I have had a small search engine project in my free-time pipeline for many years now, but I haven't even gotten to crawling yet, and I intend to sit down to the searching part after some of that is done.

hardwaresofton

You're right I should put bleve on there as well. This isn't even the whole list. Toshi (https://github.com/toshi-search/Toshi) is also out there...

snikolaev

If you decide to add Manticore Search to the list feel free to ping me at sergey@manticoresearch.com if you need help with preparing the ingestion scripts etc.

francoismassot

You can also consider lnx, which is based on tantivy and performs quite well (https://lnx.rs/).

MobiusHorizons

Would it make sense to include Sqlite FTS5 in that mix?

hardwaresofton

It would, I did for the supabase post but... This is already way too much! I have no idea when I'll actually be able to get to all this as-is.

Waiting for meilisearch to ingest documents right now and the Show HN is going up.

hardwaresofton

Meili is still ingesting documents but we're live:

https://news.ycombinator.com/item?id=33321268

Maybe I should have used their batch thing instead.

jojo_

Have you heard about Xapian?

hardwaresofton

Nope I hadn't, but it was mentioned here:

https://news.ycombinator.com/item?id=33318533

nathell

I've written a full-text search engine as well. I don't tout it as a replacement for Elasticsearch, but it does have a few advantages: it's fast; supports HTML documents; supports Polish inflection (via a full-blown morphological dictionary, not just a stemmer); and has a very compact on-disk format (pre-parsed HTML trees, Huffman-encoded over large alphabets). Oh, and it's 100% Clojure.

It underlies a concordancer GUI called Smyrna: https://github.com/nathell/smyrna, https://smyrna.danieljanus.pl

I haven't touched it in six years, other than a few small changes. But I do plan on revisiting it when time permits.

johnebgd

That’s very cool. I hope you consider open sourcing it so others can contribute.

nathell

It is open-source already (MIT)! I just need to make other languages more easily pluggable, and factor out the search engine so that it can be used on its own. :)

_tom_

Could your stemmer be ported to Lucene? Might get more usage there.

atesti

>Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

Does this mean that it only ever finds at most N documents per word? Even searches for "A and B" would probably not find everything, even if less than N documents contain A and B, because they might have been removed with the sliding window already for A or B alone. Is that correct?
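The concern can be illustrated with a toy model. This is only a sketch of the behaviour described in the quoted README sentence, not Sonic's actual implementation: the class `WindowedIndex` and its methods are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import defaultdict, deque


class WindowedIndex:
    """Toy inverted index that keeps only the N most recent doc IDs per word."""

    def __init__(self, window):
        # deque(maxlen=N) silently drops the oldest entry once it is full.
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def search_all(self, *words):
        """Return doc IDs still retained for every one of the given words."""
        sets = [set(self.postings.get(w.lower(), ())) for w in words]
        return set.intersection(*sets) if sets else set()


index = WindowedIndex(window=2)
index.push(1, "a b")   # the only document containing both words
index.push(2, "a")
index.push(3, "a")     # pushes doc 1 out of the window for "a"

both = index.search_all("a", "b")
```

Here only one document contains both words, which is fewer than the window of 2, yet `both` ends up empty because document 1 has already been evicted from the postings for "a".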

Aeolun

Huh? Yeah. I can keep my index size down by throwing results away as well.

Every time you think it’s somehow magic, someone has to dump a bucket of cold water over your head.

sanxiyn

As far as I can tell, yes, this is correct.

9dev

Every time someone comes up with an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me".

This is surely an impressive engineering feat, but hardly a replacement for the myriad of query possibilities Elasticsearch offers.

GrinningFool

From the opening line of the README:

"Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases."

Seems pretty up-front about it, and it doesn't claim to be a full-featured alternative.

lolinder

Agreed, they do a good job of hedging it. I think OP was probably pre-empting the usual comments along the lines of "yep, $tool is super bloated, $smallerTool proves that those other guys building $tool are bad engineers."

marginalia_nu

To be fair, there are often better reasons to replace only a portion of ES's functionality, since doing so can save a lot of computation and space, than to replace ES itself, since it already exists and does a good job if what you need is the full kit.

I found myself last week reimplementing 10% of RoaringBitmap's functionality as a homebrew replacement, because doing so was 500% faster. Not that RB isn't great, but it's designed for a general problem space, and not my particular problem.

unrealhoang

can you explain more of your problem and solution? Or is your solution open/published anywhere?

jamil7

Agreed, although to be fair to the actual author (assuming they didn't post this here) the readme is a lot more upfront about its capabilities.

> Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases.

ianbutler

I don't think your opinion is wrong, but I do think ElasticSearch has a lot of features that many people consider bloat depending on their work, and scaling and doing general dev ops for ES can be an absolute slog. Lightweight alternatives that cut down to a set of core features for some niche seem like a good idea to me.

9dev

It's totally fine that many people consider stuff bloat, but other people don't. I've built a highly specialised search engine for manufacturing companies on top of Elasticsearch, and I decidedly need vector queries, TF-IDF queries, geospatial range queries, and heaps of other, niche features you probably never used before.

Having a lightweight search engine is fine, but calling it an alternative to Elasticsearch is not doing either justice.

ianbutler

That's very assumptive of you. I have in fact used most of those features, and note I said their opinion was not wrong. In their readme they said it's a replacement for some use cases which is upfront and fine.

Vector queries aren't niche, Elastic however only tacked on a proper (non HNSW) implementation in the last year and a half. Geospatial isn't niche, anyone working with location data will work with those queries. TF-IDF is a basic ranking algo / signal.

Maybe Elasticsearch is good for you because they have all their features in aggregate. But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.

So my point still stands, if all you need are specific features Elastic is too much. You need all of it and that's fine too.

coldtea

> Every time someone comes up with an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me"

Which is perfectly fine. A lot of tools become so general and bloated, that there are large groups that would be fine with many different 10% subsets of their features...

Kind of like how I don't need MS Word or OpenOffice Write, any simple text editing program with a few basic features (like printing, bold/italics, and word count) will do for my needs...

9dev

I'm not opposed to that, however, the chance of their 10% and my 10% overlapping is rather slim. Just like you only need basic formatting, and I require footnotes in my documents. Nothing wrong with either, but I'd be upset if you tried to sell me GEdit as a replacement for OpenOffice Write.

RicoElectrico

Honestly, most of the "alternative to" programs do not meet the expectations they set by dropping a big, known name. So much so that I think people are doing FOSS a disservice by comparing themselves to those they can't meaningfully overtake.

The only exceptions could be small single feature utilities.

graftak

To me it seems the “alternative to” part is more damaging in that sense than dropping a big name. The name is used to put a complicated piece of software in a context many people are familiar with. The same thing happens with the “Tinder/Uber/Airbnb for <x>…” type of services.

The friction is introduced where it’s not made crystal clear how it’s similar, and which concepts are different or missing altogether. Then it will cause unmet expectations.

Perhaps it’s better to say “inspired by …” or “similar to …” to make a more precise statement.

tensor

My guess is that the majority of people using ES could actually use something simpler like this.

manigandham

True, but most deployments are also just generic searching of records like Algolia rather than using all the low-level functionality.

Typesense is probably the most complete competitor in that regard: https://typesense.org/

Other alternatives here: https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...

PedroBatista

While I get the wants-and-needs since ElasticSearch has a voracious appetite for RAM, I get the feeling most people think search engines are a simple thing where you can just import some lib, fool around for a bit and call it a day.

The truth is that ElasticSearch/Solr/Lucene is orders of magnitude more complex and powerful than these "alternatives". All this is mostly fine as long everyone is on the same page regarding the expectations.

Most people don't need ElasticSearch for their use cases on the surface, but I feel they expect top-notch mind-reading results and that requires something like ElasticSearch and someone who knows the field.

Having said all of that, Meilisearch and this are quite fine.

alessmar

I would like to suggest Typesense (https://typesense.org/). It has some features that make it a better choice than Meilisearch.

paraboul

Can you elaborate on said features?

I migrated from typesense to Meilisearch on a project after I found it had much better search accuracy. I can't exactly explain why, but overall Meilisearch results feel more relevant by default.

snikolaev

There are actually benchmarks that allow measuring search relevancy objectively, e.g. BEIR [1]. The Manticore Search team made an effort to submit a PR to get it included in the list. The results are here [2]. Unfortunately, the BEIR team seems to be too busy to review a whole pile of PRs, including one about Vespa. Nevertheless, it would be nice to have both Meilisearch and Typesense there too, since it would be interesting to see how those non-TF-IDF-based search engines perform compared to BM25-based and vector search engines.

[1] https://github.com/beir-cellar/beir [2] https://docs.google.com/spreadsheets/d/1_ZyYkPJ_K0st9FJBrjbZ...

jabo

I work on Typesense. Mind if I ask which version of Typesense and Meilisearch you tried this on? And if this was on some public dataset I can use?

I’d love to take a closer look.

keyle

Yeah there needs to be some kind of acid test that will compare these products on equal footing and show the pitfalls.

sanxiyn

This is very very difficult, but Tantivy tried: see https://github.com/quickwit-oss/search-benchmark-game

DeathArrow

Here is a performance benchmark: https://db-benchmarks.com/test-hn/#manticore-search-columnar...

jasfi

That would be great. However if you wanted to benchmark relevance ranking, how would you do that?

sanxiyn

You need a dataset and an evaluation metric. The usual evaluation metric is NDCG (Normalized Discounted Cumulative Gain): https://en.wikipedia.org/wiki/NDCG

An example dataset is BEIR (BEnchmarking Information Retrieval), published at NeurIPS 2021: https://github.com/beir-cellar/beir
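For readers unfamiliar with the metric, here is a minimal sketch of NDCG using the linear-gain form of DCG. It makes two simplifying assumptions: the input is a list of graded relevance labels in the order the engine returned the documents, and the ideal ranking is built from those same labels (a full evaluation would use all judged documents for the query). The function names are illustrative and not taken from BEIR.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each label is discounted by log2(rank + 1)."""
    # enumerate() is 0-based, so rank 1 gets a discount of log2(2) = 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG of the given order divided by the DCG of the best possible order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1])    # already in ideal order
reversed_ = ndcg([1, 2, 3])  # best document ranked last
```

A perfectly ordered result list scores 1.0, and any ordering that pushes relevant documents down scores lower, which is what makes the number comparable across engines on the same dataset.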

Spivak

I think the upshot is that if you have no idea what all the advanced features of ES even are then you probably don't need ES because it's not turnkey.

If you utter the phrase "I just want search" then it really is a matter of just using one of these lightweight projects and libs because your needs are simple.

ilyt

There is also the other important thing: the "search engine" in elasticsearch is just "searching the content of documents", not more.

It won't show you which one is "best" (for a given value of "best"), just one that looks most similar to the input.

Trying to index anything that can contain any trace of SEO would be doomed to failure; it also won't tell you which of the sites got linked the most, or do the million other things that web search engines do to give good results.

In the "just put documents in a DB and search them" case, it is barely enough to look through a corporate knowledge database, and it still won't get nuances like "this page is linked from 20 other pages, maybe it should be higher?"

marginalia_nu

* We imported ~1,000,000 messages of dynamic length (some very long, eg. emails);

* Once imported, the search index weights 20MB (KV) + 1.4MB (FST) on disk;

This is almost unbelievably succinct! If you encode the document features into 8 bits per document, and thus completely forego the need to store the document ID by indexing them implicitly, that alone is 1 MB.

Getting meaningful search out of, on average, 21 bytes per document is seriously impressive.

[For reference, this sentence is 42 bytes.]
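The arithmetic behind those figures, as a quick sketch. It assumes "MB" in the quoted README lines means 10^6 bytes and that "~1,000,000 messages" is exactly one million, so the results are approximate.

```python
DOCS = 1_000_000
MB = 1_000_000  # assumption: decimal megabytes

kv_bytes = 20 * MB       # "20MB (KV)"
fst_bytes = 1_400_000    # "1.4MB (FST)"

# Average on-disk index size per imported message.
bytes_per_doc = (kv_bytes + fst_bytes) / DOCS

# The 8-bits-per-document thought experiment: one byte each, no stored IDs.
one_byte_each_mb = DOCS * 1 / MB
```

Under these assumptions `bytes_per_doc` comes out at roughly 21.4 and `one_byte_each_mb` at 1.0, matching the "21 bytes" and "1 MB" figures in the comment.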

mattb314

Wonder if this has anything to do with the sliding window:

> Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

Default window looks like 1k documents. I read this as saying that super common words are basically dropped from the index (only 1k out of many thousands of docs retained), but I don’t know enough about the internals to be sure. Not sure if this actually hurts search results in practice, seems like an ok trade off for help docs at least.

411111111111111

It's definitely a great trade-off to make for efficiency, but it makes it inherently unusable for most of Elasticsearch's use cases.

Looking at it from a practical example such as log search (almost everyone I know has used kibana/logstash/elasticsearch at some point): you'd be able to search for things like tracingId/requestId but adding more filters such as logLevel, requestType or serviceName would be impossible

It has its niche, but calling it an Elasticsearch alternative really is a stretch.

rabuse

Also the ability to weight fields when fetching results to boost relevancy, which is needed for a lot of my use cases.

nightpool

I wonder how easy it would be to change "most recently pushed" to something like a redis sorted set where each document has a score and only the top N results are retained when sorted by their separate score value? That would allow you to sort by pageviews / popularity in a more useful way. But it fails entirely when looking for uncommon intersections of common words, which feels like it makes it useless for most actual full-text search use-cases :(
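A rough sketch of that variant, using an in-process min-heap per word in place of a Redis sorted set. Everything here is hypothetical (`TopNIndex` is not part of Sonic or Redis), and re-pushing an existing document with a new score is not handled.

```python
import heapq
from collections import defaultdict


class TopNIndex:
    """Toy inverted index retaining only the N highest-scored docs per word."""

    def __init__(self, n):
        self.n = n
        self.postings = defaultdict(list)  # word -> min-heap of (score, doc_id)

    def push(self, doc_id, text, score):
        entry = (score, doc_id)
        for word in set(text.lower().split()):
            heap = self.postings[word]
            if len(heap) < self.n:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                # Evict the current lowest-scored document for this word.
                heapq.heapreplace(heap, entry)

    def search(self, word):
        """Doc IDs retained for the word, highest score first."""
        retained = self.postings.get(word.lower(), [])
        return [doc_id for _, doc_id in sorted(retained, reverse=True)]


index = TopNIndex(n=2)
index.push("a", "rust search", score=10)  # low score, but the only "search" doc
index.push("b", "rust", score=50)
index.push("c", "rust", score=30)

popular = index.search("rust")
overlap = set(index.search("rust")) & set(index.search("search"))
```

The popular documents survive for "rust", but `overlap` is empty: document "a" matches both words and has still been evicted from the common word's postings, which is the intersection problem raised in the comment.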

syrusakbary

Long ago I was searching for lightweight search engines that could run on the Edge, as ElasticSearch, while very popular, is also quite heavy and relies on Lucene and the JVM.

Apart from Sonic, I also found Tantivy [1] and Meilisearch [2]... all delightfully made in Rust. My favorite, and the closest one to ElasticSearch (for its features) is probably Tantivy.

I'd recommend that anyone check out these three projects and choose what best fits their needs... it's awesome to see that more projects are becoming available by the day!

[1]: https://github.com/quickwit-oss/tantivy

[2]: https://github.com/meilisearch/meilisearch

alserio

I've looked up tantivy and quickwit. Quickwit uses tantivy as the engine. It has decoupled storage (awesome, only recently Elastic announced something comparable) but is oriented towards log processing and explicitly warns against its use to power a user-facing site search. Do you happen to know if there's anything like that with the same minimal footprint that can scale up and, importantly, down to serve the needs of highly variable traffic websites? Right now I'm looking at something with clustering capabilities and decoupled storage (e.g. on S3) like quickwit.

francoismassot

One of the reasons for not using Quickwit for user-facing search is the latency: for example, you pay 70ms of latency when you make a request on AWS S3... and generally you expect latency below that figure. Decoupling compute and storage while keeping very low latency may then be impossible unless you end up caching all your data on disk :).

You can have a look at lnx (https://lnx.rs/) that is based on tantivy and is performing quite well. It's not yet distributed but the author Chillfish8 has some thoughts about how to do it.

alserio

Thank you! I'll look into it

codedokode

There is also sphinx search which was open source before 3.0 version.

snikolaev

And its open source continuation - Manticore Search [1]

[1] https://manticoresearch.com/

croes

Do they support document access control like ES does?

sanxiyn

Yes, Meilisearch supports ES-like document access control.

dewey

Another interesting alternative: https://github.com/meilisearch/meilisearch - I'm using it in one of my (small) projects and I had a good experience with it, also very helpful community.

excsn

This is not a direct alternative to ElasticSearch. Tantivy is closer to an alternative to ElasticSearch since ES is built on top of Lucene. An alternative could be achieved if built on top of Tantivy.

Sonic here only returns document identifiers so you will never be able to get document information back. This is very useful though if all you want to do is index text data and then get the stored information from another data store.
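A sketch of that two-step pattern, with an in-memory SQLite table standing in for "another data store". The `search_ids` function is a hypothetical stand-in for an identifier-only search backend (here just a dictionary lookup), not a real Sonic client call.

```python
import sqlite3


def search_ids(index, query):
    """Hypothetical identifier-only search: returns doc IDs, nothing else."""
    return index.get(query.lower(), [])


def fetch_documents(conn, ids):
    """Load the full rows for the given IDs, preserving the search order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM docs WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


# The primary store keeps the documents; the search index keeps only IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("d1", "Intro to Rust"), ("d2", "Search basics")],
)
index = {"rust": ["d1"], "search": ["d2"]}

results = fetch_documents(conn, search_ids(index, "Search"))
```

The search side never stores titles or bodies, so the index stays small and the database remains the single source of truth for document content.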

sanxiyn

Quickwit is a search engine built on top of Tantivy (by the author of Tantivy): https://github.com/quickwit-oss/quickwit

Quickwit supports Elasticsearch compatible bulk indexing API.

codedokode

> Sonic here only returns document identifiers

In many cases that is what you want because you have the data in a database and don't want to duplicate it in Elasticsearch.

counttheforks

> Sonic here only returns document identifiers so you will never be able to get document information back

Why would you want that anyway? Always thought it was silly to duplicate all your data which will be stored in a real database anyway

excsn

Here's one from a use case I am not experienced with: if you index books, you want the search engine to return highlighted data like Google does.

Also, now that I think of it, typically logs/structured data is stored only in ES.

DeathArrow

>Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

If you discard many potential hits, why not use /dev/null as the search engine?

Someone1234

I believe you must have misread what you quoted, because whatever point you're trying to make doesn't really follow from what you quoted.

They let you configure the number of expected results to cache for a given query; the number of cached results is configurable based on your use case for the results (e.g. if your website only lists 100 results, don't store beyond that).

If more results than that for a given query are returned then they disregard additional results since you told it you won't make use of them. In essence, they're saving you from caching results that you'll never consume.

How you got from this to "just use /dev/null" is a mystery to me. It has to be a misread or misunderstanding.

nine_k

This thing looks like a very generic cache. You can of course use /dev/null as a degenerate cache, without any performance benefit though.

mhitza

The readme doesn't offer enough information to accept that it can be an alternative to elasticsearch. From what I can gather by skimming the information, it can only do word level matching and that it isn't some form of TF-IDF type index (as is Lucene, which stands behind Solr/ElasticSearch).

sanxiyn

Yes, it doesn't do any ranking at all. Results are returned in the reverse order of indexing.
