trengrj
pranay01
Thanks for the note. Our approach for this benchmark was to use the default configs that each of the logging platforms ships with.
This is also because we are not experts in Elastic or Loki, so we wouldn't know the possible impact of tuning their configs. To be fair, we also didn't tune SigNoz for this specific data or test scenario and ran it with default settings.
> Graph complaining about Elasticsearch using 60% of available memory. This is as configured; it could use less with not much impact on performance.
This is something we discussed, and we have added a note to the benchmark blog as well. Pasting it here for reference:
> For this benchmark, for Elasticsearch we kept the default recommended heap size of 50% of available memory (as shown in the Elastic docs here). This determines caching capability and hence query performance.
We could have experimented with different heap sizes (as a % of total memory), but that would have impacted query performance, so we kept the default Elastic recommendation.
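For readers following along, the 50% guidance is applied through Elasticsearch's JVM options; a minimal sketch is below (the 8 GB figure is an assumption for a 16 GB node, not a value from the benchmark):

```properties
# config/jvm.options.d/heap.options — set min and max heap to the same value,
# at most ~50% of RAM (and below the ~32 GB compressed-oops threshold)
-Xms8g
-Xmx8g
```

Elasticsearch leaves the rest of the memory to the OS page cache, which is why heap above 50% tends to hurt rather than help.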
nijave
Part of the issue is that Elasticsearch isn't an open-source logging platform; it's a search-oriented database. Using it effectively as a logging platform depends heavily on configuration, compared with tools that are optimized for logs out of the box.
I imagine you'd have similar issues with Postgres or any general purpose datastore without the correct configuration.
bdcravens
I'm not an Elastic expert either, just a developer responsible for a lot of things who can Google pretty well, and even I knew those configs seemed off. I've been hearing for years that Beats is preferable to Logstash. I don't even claim to work in the logging space :-)
wardb
Unfortunately, the benchmark severely misunderstands how Grafana Loki should be queried for high-cardinality data. See also https://github.com/SigNoz/logs-benchmark/issues/1
pranay01
Thanks for creating the issue. Yeah, this is what we also found: that Loki is not designed for querying high-cardinality data.
But since Loki is often used in observability use cases, where there is sometimes a need to query high-cardinality data, we thought we'd include it.
wardb
That's incorrect; Loki is designed for querying high-cardinality data.
The difference is that in Loki the index is only used for metadata about the source of the log lines (environment, team, cluster, host, pod, etc.) to select the right log stream to search in.
Parsing, aggregation, and/or filtering of log lines on high-cardinality data is all done at query time using LogQL. See also https://www.youtube.com/watch?v=UiiZ463lcVA and this live example where a 95th quantile is calculated using the request_time field of nginx logs: https://play.grafana.org/d/T512JVH7z/loki-nginx-service-mesh...
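To make the point concrete, here is a hedged sketch of the kind of LogQL the linked dashboard runs: the label matchers pick the stream, and the high-cardinality `request_time` field is parsed and unwrapped entirely at query time (the label names here are assumptions for illustration):

```logql
# p95 of nginx request_time over 5m windows, computed at query time —
# request_time is never indexed, only the stream labels are
quantile_over_time(0.95,
  {job="nginx", cluster="prod"} | json | unwrap request_time [5m]
) by (host)
```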
wstuartcl
This is kind of the issue with an interested party/vendor running benchmarks like these. Whether by pure dumb luck or malfeasance, you are much more likely to configure your own product knowledgeably than your competitors', and to put out results that are wildly inaccurate or misleading.
dig1
> While ELK was better at performing queries like COUNT, SigNoz is 13x faster than ELK for aggregate queries.
The author should also mention how much faster ES was than SigNoz at trace_id fetches (137x) and at fetching the first 100 logs (14x). Aggregation queries are a known pain point for ES and always will be, due to its design; people use additional tools for this, like Kafka Streams or Spark.
> ClickHouse provides various codecs and compression mechanism which is used by SigNoz for storing logs
What was "index.codec" for ES? Unfortunately, the default value does not provide the best compression ratio.
I won't say that ES (or OpenSearch) is perfect, but I was surprised it held up well here, considering ES was run in (I'll presume) a non-optimal environment. First, put Kafka in front of (or instead of) Logstash, and your ingestion rate will skyrocket. Second, learn how to tune the JVM.
Also, the author should use OpenSearch [1], because that is where all the open-source development is happening now.
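For reference, `index.codec` is a static index setting, so it has to be set at index creation; a hedged sketch (the index name is hypothetical):

```json
PUT logs-000001
{
  "settings": {
    "index": { "codec": "best_compression" }
  }
}
```

The default codec favors speed; `best_compression` trades some indexing/merge CPU for a noticeably better on-disk ratio, which matters for log workloads.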
nullify88
What about the index mapping, the number of primaries and replicas, and index rollover? Is your hot tier optimised for ingest, your warm tier for querying, and your cold tier for storage? There's so much to think about to get Elasticsearch running "optimally" and to keep it that way.
It highlights the operational cost of running Elasticsearch.
nityananda123
Hi, I think this question is pointed at Elasticsearch,
but here are some points for SigNoz (I am one of the maintainers at SigNoz):
> Directly ingesting to disk (hot tier) is faster than directly ingesting to S3 (cold storage).
> The query results were an average of cold + hot runs (for ELK as well). We didn't have an explicit concept of warm storage for SigNoz in our benchmark.
> Query performance for logs on cold storage is almost the same as on hot storage, but the operational cost is lower with cold storage. So ingesting to hot storage and moving data to cold storage after a certain amount of time is a good option for SigNoz.
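The hot-to-cold movement described above maps to ClickHouse's tiered storage; a minimal sketch, where the table shape, volume name, and storage policy are assumptions for illustration rather than SigNoz's actual schema:

```sql
-- Move log parts older than 7 days to an S3-backed 'cold' volume
CREATE TABLE logs (
    timestamp DateTime,
    body      String
)
ENGINE = MergeTree
ORDER BY timestamp
TTL timestamp + INTERVAL 7 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';
```

ClickHouse moves whole data parts between volumes in the background, so queries keep working transparently across tiers.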
EdwardDiego
Any tool handling large amounts of data has an operational cost.
nullify88
Sure, but in the end it boils down to whether that operational cost is worth the value received. Tools like Loki are worthwhile alternatives for centralising infrastructure logs with lower operational costs.
pranay01
Thanks for the feedback. We chose Elasticsearch because, in our experience, it is still the default tool people use to get started with logs. But I do understand that OpenSearch may be catching up now.
> The author should also mention how much faster ES was than SigNoz at trace_id fetches (137x) and at fetching the first 100 logs (14x).
We didn't mention this in the summary because, at the scale we tested, the difference would not be perceived by a user. E.g., for getting logs corresponding to a trace_id (a high-cardinality field), SigNoz took 0.137s and Elastic took 0.001s.
I think we read somewhere (will try to find the source) that anything below 200ms in server response time is not perceptible to the user.
Aeolun
That sounds like an excuse. I do agree they’d likely both feel more or less instant though.
snikolaev
You can also find Elasticsearch vs ClickHouse performance benchmarks for log data on db-benchmarks.com [1]. The corresponding article is here [2].
[1] https://db-benchmarks.com/?cache=fast_avg&engines=clickhouse...
pranay01
This is very interesting; I will go through it. Thanks for sharing.
Obertr
Isn't Elastic NOT truly open source since 7.10?
You are using 8.4.3 and building on top of it. Have you checked the terms & conditions?
OpenSearch is an open source alternative
https://opensearch.org/ https://www.elastic.co/pricing/faq/licensing
nullify88
I may have missed something; what have they done that builds on top of 8.4.3? SigNoz uses ClickHouse as its storage backend.
pranay01
> You are using 8.4.3 and building on top of it. Have you checked the terms & conditions?
We have just used Elastic 8.4.3 for the benchmark. We are not building on top of it, so I am not sure how the T&Cs apply. Can you share more?
ensignavenger
The Hacker News title says "open source" in it... but Elastic isn't open source, so it shouldn't qualify... though the article doesn't say open source... so maybe the Hacker News title was editorialized and is incorrect?
dijit
Free software is loaded with terms.
Open Source means concretely: that you can read the source.
Free and Open source usually means you are free to use it in many ways, sometimes with restrictions or agreements that contributions must be made available. This freedom is the point of contention with ElasticSearch and MongoDB.
Sorry to be pedantic, but terminology is important.
kris_wayton
I saw that Cloudflare moved from Elastic to Clickhouse for logging.
pranay01
Yes, and the perf improvements they achieved are also staggering. In the presentation linked in the blog, they mention:
> CPU and memory consumption on the inserter side got reduced by 8 times.
jpgvm
Clickhouse go brrr basically.
Saw similar results on a hand-rolled version of this purely for logs. Nice to see an OSS solution that also bundles in the other bits of the observability stack into Clickhouse.
pranay01
Thanks. Yeah, ClickHouse does have quite good perf, especially for observability use cases where aggregate queries tend to dominate.
Here's a blog from Uber: they had 70-80% aggregate queries in their production environment and saw a ~50% improvement in the resources required.
From their blog `We reduced the hardware cost of the platform by more than half compared to the ELK stack`
jpgvm
The system I worked with was acquired by Uber but built independently of that solution; they were constructed -very- similarly. (I worked at Uber for a short time after it was acquired.)
CSDude
What schema does SigNoz use with ClickHouse? The OpenTelemetry Collector uses this schema https://github.com/open-telemetry/opentelemetry-collector-co... and I found that accessing map attributes is much slower (10-50x) than accessing regular columns. I expected some slowdown, but this is too much.
srikanthccv
SigNoz follows a similar approach, since the attributes can be arbitrary and ClickHouse needs a fixed schema ahead of time. The options are maps, paired arrays, etc., but they are all slow depending on how much object unpacking ClickHouse needs to do. ClickHouse does its best on regular columns, as that's what it's built for. Access on Map/Array types is faster than in other DB systems but slower than regular columns.
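A common mitigation, sketched below with hypothetical table and attribute names, is to promote a frequently queried Map attribute to a real materialized column so ClickHouse can scan it natively:

```sql
-- 'logs' table and 'http.status_code' key are assumptions for illustration
ALTER TABLE logs
    ADD COLUMN status_code UInt16
    MATERIALIZED toUInt16OrZero(attributes['http.status_code']);

-- queries on status_code now hit a regular column
-- instead of unpacking the Map on every row
SELECT count() FROM logs WHERE status_code >= 500;
```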
dan-robertson
Hmm. I think benchmarking this sort of thing well is pretty hard. I’ll note:
- These clusters are tiny (circa 4 machines). E.g., one thing that is difficult when looking for alternatives to what we currently run is guessing how they would perform at a similar size; it's complicated when a comparison cluster requires rack space and millions of dollars of machines.
- There are lots of tuning options. If another system performs poorly, is that because you don't have many years of experience tuning it well?
- Perf will be quite sensitive to the shape of the data and queries. Here I got the impression that the data is very uniform: there are a small number of different fields, and most fields are set on each log line. If you have many different teams producing different log lines with different fields, then dense representations won't work as well as they do for this format, for example.
The experience reports from Uber/cloudflare are useful. One worry is that it is much more common to write a blog post about switching to some new system than to write one about how the shiny new system turned out to have significant flaws that needed to be worked around.
pranay01
Agreed with your points.
Performance benchmarks are not easy to execute. Each tool has nuances, and the testing environments must aim to provide a level playing field for all tools. We have tried our best to be transparent about the setup and configurations used in this performance benchmark.
I think you would need to test different solutions for your specific data and query patterns to understand better how different tools would fit.
But hopefully our results can give you pointers on where to poke.
nijave
Curious how this compares to S3 + Athena (object storage + Presto or Apache Drill). ES always seems like a bit of a weird fit for logs since:
- it's optimized for repeated searching (logs don't tend to be searched very often)
- it isn't optimized for aggregation (ad-hoc metrics)
- it has a fixed schema (logs can fit one, but it takes effort)
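For context, the S3 + Athena approach is schema-on-read: logs land in S3 as-is and a table definition is layered on top. A hedged sketch, where the bucket, SerDe choice, and fields are all assumptions:

```sql
-- Athena external table over newline-delimited JSON logs in S3
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  `timestamp` string,
  level       string,
  message     string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-log-bucket/logs/';

-- ad-hoc aggregation; Athena bills per byte scanned
SELECT level, count(*) FROM logs GROUP BY level;
```

Partitioning by date and converting to Parquet are the usual next steps to keep scan costs down.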
pranay01
That's a good question. We have not evaluated this stack for logs yet.
Do you use the `S3 + Athena (object storage + Presto or Apache Drill)` stack for logs currently? What do you like about it?
nijave
Not currently, but I did in a previous role. It's dirt cheap and almost infinitely scalable, and there's none of the added complexity of running and tuning something like ES.
francoismassot
One thing that I would want to see in this kind of benchmark is the performance of these engines when data is stored on object storage.
The main reason for that is that the amount of logs, metrics, traces data can be huge...
I think Loki was made to work on object storage.
wardb
+1. Loki is designed for object storage as a backend. Persisting all data on object storage (so not storage tiering) vs local storage gives you cost savings, increased durability, and simplified operations at scale.
nijave
ES works on object storage too, AFAIK, but it's a paid feature.
thewisenerd
I am interested in the `dummyBytes` generated by flog [2] in the linked benchmark result [1]. It's random words from `/usr/share/dict/words`, which may not be highly compressible.
I.e., 500 GB -> 207 GB (with zstd data compression + indexes) seems like a worst-case scenario. With "real" logs, I expect this to be much better (for the logs at least).
Does anyone have a similar size comparison with real-life examples? (Similar data size; interested in compressed log size and index size with ClickHouse.)
[1] https://signoz.io/blog/logs-performance-benchmark [2] https://github.com/signoz/flog [3] https://github.com/tjarratt/babble/blob/cbca2a4833c1dd0e0287...
pranay01
Yeah, agreed. This is a worst-case scenario; we also expect compression to be better with real-life data.
I'm not a fan of competitors creating benchmarks like this, as when faced with any tuning decision they will usually pick the one that makes their competitors slower. But anyway, let's take a look at how they tuned Elasticsearch.
Disclaimer: I used to work at Elastic!
- Used Logstash instead of Beats for the simple task of reading syslog JSON data. Beats (https://www.elastic.co/guide/en/beats/filebeat/current/fileb...) would have performed better, especially around resource usage.
- Set a very low Logstash heap of 256 MB: https://github.com/SigNoz/logs-benchmark/blob/0b2451e6108d8f...
- Added a grok processor: https://github.com/SigNoz/logs-benchmark/blob/0b2451e6108d8f... Dissect is faster here.
- No index template configuration. This would cause higher disk usage than needed due to duplicate mappings. Again, a Logstash vs Beats thing. For this test, more primary shards and a larger refresh interval would also improve things.
- Graph complaining about Elasticsearch using 60% of available memory. This is as configured; it could use less with not much impact on performance.
- Document counts do not match. This is probably due to using syslog with randomly generated data vs creating a test dataset on disk and reading the same data into all platforms.
- Aggregation queries were not provided in the repo (https://github.com/SigNoz/logs-benchmark), so I cannot validate them.
I'm actually surprised Elastic did so well in this benchmark given the misconfiguration.
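To illustrate the index-template point above, here is a hedged sketch of a composable template that pins explicit mappings, more primary shards, and a longer refresh interval (all names and values are assumptions for illustration, not recommendations from the thread):

```json
PUT _index_template/logs-benchmark
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 4,
      "refresh_interval": "30s"
    },
    "mappings": {
      "properties": {
        "message": { "type": "text" },
        "host":    { "type": "keyword" }
      }
    }
  }
}
```

Explicit mappings avoid per-index dynamic-mapping overhead, and a longer refresh interval reduces segment churn during heavy ingest.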