It's the end of observability as we know it (and I feel fine) - Hacker News

RainyDayTmrw

I think we are, collectively, greatly underestimating the value of determinism and, conversely, the cost of nondeterminism.

I've been trialing a different product with the same sales pitch. It tries to RCA my incidents by correlating graphs. It ends up looking like this page[1], which is a bit hard to explain in words, but both obvious and hilarious when you see it for yourself.

[1]: https://tylervigen.com/spurious-correlations

graemep

It's fun, but the point should be well known (I know it's not). Time series are very prone to spurious correlations - r² is not useful.

It's even worse if you just eyeball a graph. If something changes over time, you need to use appropriate measures.
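To make that concrete, here's a minimal sketch (made-up data, nothing to do with any particular product): two completely independent random walks will routinely show a large r² on their raw levels, and most of it vanishes once you difference the series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two completely independent random walks ("time series").
a = np.cumsum(rng.normal(size=1000))
b = np.cumsum(rng.normal(size=1000))

# Pearson correlation between the raw, trending levels is often large...
r = np.corrcoef(a, b)[0, 1]

# ...but it mostly disappears once you difference the series,
# i.e. look at the changes instead of the levels.
r_diff = np.corrcoef(np.diff(a), np.diff(b))[0, 1]

print(f"r^2 on levels:      {r**2:.2f}")
print(f"r^2 on differences: {r_diff**2:.2f}")
```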

feoren

> r² is not useful

People want so badly to have an "objective" measure of truth that they can just pump data into and get a numeric result that "doesn't lie". r², p < 0.05, χ², etc. It's too far to say these numbers aren't useful at all -- they are -- but we're just never going to be able to avoid the difficult and subjective task of interpreting experimental results in their broader context and reconciling them with our pre-existing knowledge. I think this is why people are so opposed to anything Bayesian: we don't want to have to do that work, to have those arguments with each other about what we believed before the experiment and how strongly. But the more we try to be objective, the more vulnerable we are to false positives and spurious correlations.

SOLAR_FIELDS

I agree - the problem here is probably less that the AI makes the mistake and more that it’s just really easy to make this mistake whether you’re AI or human. It’s probably true that the AI selects more correlations that would be obviously irrelevant to the end user though. I, like others doing SRE work, spend a fair amount of time trawling through these types of graphs and it’s very common to see something that looks like a correlation, look closer, then dismiss it due to noise.

Time series are also very much subject to MTUP[1]. Something that looks like a correlation or problem at a certain zoom level becomes totally normal behavior when you zoom out, for instance

1: https://en.m.wikipedia.org/wiki/Modifiable_temporal_unit_pro...
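A minimal sketch of the zoom-level effect, with made-up latency data: the same two-minute burst that dominates a per-minute view all but disappears at hourly resolution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# One day of per-second latency samples: steady noise plus a brief burst.
idx = pd.date_range("2024-01-01", periods=86_400, freq="s")
latency = pd.Series(rng.gamma(shape=2.0, scale=50.0, size=len(idx)), index=idx)
latency.iloc[40_000:40_120] *= 3  # a two-minute burst

# At 1-minute resolution the burst looks like a dramatic spike...
per_minute = latency.resample("1min").mean()

# ...at hourly resolution the same data looks almost flat.
per_hour = latency.resample("60min").mean()

print("1-minute max / median:", per_minute.max() / per_minute.median())
print("hourly   max / median:", per_hour.max() / per_hour.median())
```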

worldsayshi

Perhaps I'm missing your point a bit, but you can absolutely have deterministic UX when it matters with LLM-based applications if you design them right. Whenever you need determinism, make the LLM generate a deterministic specification for how to do something and/or record its actions. And let the user save away re-playable specifications along with the dialogue. Then build ways for the AI to suggest fixes for failing specs when needed.

It's basically the same flow as when you use AI for programming. Except you need to constrain the domain of the specifications more and reason more about how to allow the AI to recover from failing specifications if you don't want to force the user to learn your specification language.
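A rough sketch of what that could look like in practice; the spec format, step vocabulary, and run_spec helper below are all made up for illustration, not any particular product's API.

```python
import json

# A replayable "specification" the LLM emitted once and the user saved.
# The step vocabulary here is deliberately tiny and entirely invented.
SPEC = json.loads("""
{
  "name": "weekly-latency-report",
  "steps": [
    {"op": "query",  "dataset": "api-gateway", "metric": "p99_latency", "window": "7d"},
    {"op": "filter", "field": "status_code", "equals": 500},
    {"op": "export", "format": "csv", "path": "report.csv"}
  ]
}
""")

def run_spec(spec, handlers):
    """Replay a saved spec deterministically: no LLM in the loop."""
    results = []
    for step in spec["steps"]:
        handler = handlers.get(step["op"])
        if handler is None:
            # A failing step is where you'd hand the spec back to the
            # LLM and ask it to suggest a fix, as described above.
            raise ValueError(f"no handler for step: {step['op']}")
        results.append(handler(step))
    return results

# Stub handlers just to make the sketch runnable.
handlers = {
    "query":  lambda s: f"queried {s['metric']} over {s['window']}",
    "filter": lambda s: f"filtered on {s['field']} == {s['equals']}",
    "export": lambda s: f"wrote {s['path']}",
}

print(run_spec(SPEC, handlers))
```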

zug_zug

As somebody who's good at RCA, I'm worried that coworkers who are embarrassed to admit publicly that they don't know something are going to take at face value a tool that's confidently incorrect 10% of the time, and screw stuff up even more.

It'd be less bad if the tool came to a conclusion, then looked for data to disprove that interpretation, and then made a more reliable argument or admitted its uncertainty.

jakogut

You can achieve a good amount of this with system prompts. I've actually had good success using LLMs to craft effective system prompts and custom instructions to get more rigorous and well researched answers by default.

One I use with ChatGPT currently is:

> Prioritize substance, clarity, and depth. Challenge all my proposals, designs, and conclusions as hypotheses to be tested. Sharpen follow-up questions for precision, surfacing hidden assumptions, trade offs, and failure modes early. Default to terse, logically structured, information-dense responses unless detailed exploration is required. Skip unnecessary praise unless grounded in evidence. Explicitly acknowledge uncertainty when applicable. Always propose at least one alternative framing. Accept critical debate as normal and preferred. Treat all factual claims as provisional unless cited or clearly justified. Cite when appropriate. Acknowledge when claims rely on inference or incomplete information. Favor accuracy over sounding certain.
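For what it's worth, ChatGPT's custom instructions box is effectively a system message, so the same text can be reused when driving a model through the API instead. A minimal sketch with the openai Python client; the model name is just a placeholder, and the prompt is abridged from the one above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Prioritize substance, clarity, and depth. Challenge all my proposals, "
    "designs, and conclusions as hypotheses to be tested. Explicitly "
    "acknowledge uncertainty when applicable. Favor accuracy over sounding certain."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any chat-capable model works here
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Why is initrd integrity important for disk encryption with TPM sealing?"},
    ],
)
print(response.choices[0].message.content)
```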

j_bum

Do you just add this to your “instructions” section?

And what type of questions do you ask the model?

Thanks for sharing

jakogut

Yes, with ChatGPT, I added that paragraph as custom instructions under personalization.

I ask a wide variety of things, from what a given plant is deficient in based on a photo, to wireless throughput optimization (went from 600 Mbps to 944 Mbps in one hour of tuning). I use models for discovering new commands, tools, and workflows, interpreting command output, and learning new keywords for deeper and more rigorous research using more conventional methods. I rubber duck with it, explaining technical problems, my hypotheses, and iterating over experiments until arriving at a solution, creating a log in the process. The model is often wrong, but it's also often right, and used the way I do, it's quickly apparent when it's wrong.

I've used ChatGPT's memory feature to extract questions from previous chats that have already been answered to test the quality and usability of local models like Gemma3, as well as craft new prompts in adjacent topics. Prompts that are high leverage, compact, and designed to trip up models that are underpowered or over quantized. For example:

>> "Why would toggling a GPIO in a tight loop not produce a square wave on the pin?"

> Tests: hardware debounce, GPIO write latency, MMIO vs cache, bus timing.

>> "Why is initrd integrity important for disk encryption with TPM sealing?"

> Tests: early boot, attack surface, initramfs tampering vectors.

>> "Why would a Vulkan compute shader run slower on an iGPU than a CPU?"

> Tests: memory bandwidth vs cache locality, driver maturity, PCIe vs UMA.

>> "Why is async compute ineffective on some GPUs?"

> Tests: queue scheduling, preemption granularity, workload balance.

>> "Why might a PID loop overshoot more when sensor update rate decreases?"

> Tests: loop delay, phase lag, sampling effects, integral windup.

>> "How can TCP experience high latency even with low packet loss?"

> Tests: delayed ACK, bufferbloat, congestion control tuning.

>> "How can increasing an internal combustion engine's compression ratio improve both torque and efficiency?"

> Tests: thermodynamics, combustion behavior, fuel octane interaction.

>> "How can increasing memory in a server slow it down?"

> Tests: NUMA balancing, page table size, cache dilution.

>> "Why would turning off SMT increase single-threaded performance on some CPUs?"

> Tests: resource contention, L1/L2 pressure, scheduling policy.
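A sketch of how a prompt set like this could be replayed against a local model, assuming a runtime that exposes an OpenAI-compatible endpoint (llama.cpp server, Ollama, vLLM, etc.); the URL and model name below are placeholders.

```python
from openai import OpenAI

# Point the client at a local OpenAI-compatible endpoint; URL and model are placeholders.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

PROMPTS = [
    "Why would toggling a GPIO in a tight loop not produce a square wave on the pin?",
    "How can TCP experience high latency even with low packet loss?",
    "How can increasing memory in a server slow it down?",
]

for prompt in PROMPTS:
    reply = client.chat.completions.create(
        model="gemma3",  # placeholder local model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {prompt}\n{reply.choices[0].message.content}\n")
```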

dr_kiszonka

This looks excellent. Thanks for sharing.

heinrichhartman

> New Relic did this for the Rails revolution, Datadog did it for the rise of AWS, and Honeycomb led the way for OpenTelemetry.

I find this reading of the history of OTel highly biased. OpenTelemetry was born as the merge of OpenCensus (initiated by Google) and OpenTracing (initiated by LightStep):

https://opensource.googleblog.com/2019/05/opentelemetry-merg...

> The seed governance committee is composed of representatives from Google, Lightstep, Microsoft, and Uber, and more organizations are getting involved every day.

Honeycomb has for sure had valuable code & community contributions and championed the technology adoption, but they are very far from "leading the way".

loevborg

As someone who recently adopted Honeycomb, it really is an amazing tool. Especially with OTel auto-instrumentation, you get insights within a few hours. The dashboard / query features are very powerful and obviously stem from a deep philosophical understanding of observability. My team was shocked at how good the tool is.

Datadog, by contrast, seems to be driven by marketing and companies having an "observability" checkbox to tick.

ayewo

Which programming languages are you using with the OTel auto-instrumentation feature?

loevborg

Node and Python. It's amazing how much works out of the box - Express routes, HTTP calls, DNS queries, the list goes on.

stego-tech

Again, sales pitch aside, this is one of the handful of valuable LLM applications out there. Monitoring and observability have long been the exclusive domains of SRE teams in large orgs while simultaneously out of reach to smaller orgs (speaking strictly from an IT perspective, NOT dev), because identifying valuable metrics and carving up heartbeats and baselines for them is something that takes a lot of time, specialized tooling, extensive dev environments to validate changes, and change controls to ensure you don’t torch production.

With LLMs trained on the most popular tools out there, this gives IT teams short on funds or expertise the ability to finally implement “big boy” observability and monitoring deployments built on more open frameworks or tools, rather than yet-another-expensive-subscription.

For usable dashboards and straightforward observability setups, LLMs are a kind of god-send for IT folks who can troubleshoot and read documentation, but lack the time for a “deep dive” on every product suite the CIO wants to shove down our throats. Add in an ability to at least give a suggested cause when sending a PagerDuty alert, and you’ve got a revolution in observability for SMBs and SMEs.

chupasaurus

> identifying valuable metrics and carving up heartbeats and baselines for them

The first problem is out of reach for LLMs; the other two have been trivial with convolutional NNs for a long time.

> extensive dev environments to validate changes, and change controls to ensure you don’t torch production

Are out of scope for observability.

stego-tech

Not for small environments they're not. If you say "observability" to a big Fortune company, they peg it to a narrow, specifically defined team and goal; that same word to an SMB/E is "oh, you mean monitoring and change control".

I used the word correctly for the environments I was describing. They’re just not accurate for larger, more siloed orgs.

JimBlackwood

Agreed! I see huge gains for small SRE teams as well.

I'm in a team of two with hundreds of bare metal machines under management - if issues pop up it can be stressful to quickly narrow your search window to a culprit. I've been contemplating writing an MCP server to help out with this; the future seems bright in this regard.

Plenty of times issues have been present for a while before creating errors, as well. LLMs, again, can help with this.

techpineapple

I feel like the alternate title of this could be "how to 10x your observability costs with this one easy trick". It didn't really show a way to get rid of all the graphs; the prompt was "show me why my latency spikes every four hours". That's really cool, but in order to generate that prompt you need alerts and graphs. How do you know your latency is spiking in the first place?

The devil seems to be in the details, but you're running a whole bunch more compute for anomaly detection and "sub-second query performance, unified data storage", which again sounds like throwing enormous amounts of extra money at the problem. I can totally see why this is great for Honeycomb, though; they're going to make bank.

tptacek

I'm not sure I understand the question. He's writing from the vantage point of someone with a large OTel deployment; that's the data he has to work with. Honeycomb has an MCP server. Instead of him clicking around Honeycomb and making inferences from the data and deciding where to drill down, an LLM did that, and found the right answer quicker than a human would have.

Where's the extra expense here? The $0.60 he spent on LLM calls with his POC agent?

danpalmer

I think the implication is that if you have that graph you’re already half way towards a solution because you know there’s a problem.

In terms of _identifying the problems_, shoving all your data into an LLM to spot irregularities would be exceptionally expensive vs traditional alerting, even though it may be much more capable at spotting potential issues without explicit alerting thresholds being set up.

tptacek

Look, I like Honeycomb a lot and we're dependent on it for parts of our orchestrator. It's great; it accelerates investigations.

But even with Honeycomb, we are sitting on an absolute mountain of telemetry data, in logs, in metrics, and in our trace indices at Honeycomb.

We can solve problems by searching and drilling down into all those data sources; that's how everybody solves problems. But solving problems takes time. Just having the data in the graph does not mean we're near a solution!

An LLM agent can chase down hypotheses, several at a time, and present them with collected data and plausible narratives. It'll miss things and come up with wild goose chases, but so do people during incidents.

techpineapple

Right, but the vision isn't just LLM troubleshooting. The title of the article isn't "look at the way you can troubleshoot with an LLM", it's "it's the end of observability as we know it". He goes on to say it's AI constantly analyzing data in a unified sub-second database? Even without the AI, that's expensive.

hooverd

Is the LLM making these inferences from the aether?

zer00eyz

Also we need to talk about what should be logged and where.

There seem to be two schools of thought, just enough to tell something is wrong but not what it is - OR - you get to drink from the firehose. And most orgs go from the first to the second.

As to where: well, that's at the hardware/VM/container level, and logging should mirror and extend what each of those does. Nothing worse than 20 different ideas of how to log and rotate, and trying to figure out who did what, when, where, and why. If you can't match a log entry to a running environment... well.

I weep quietly inside when some or all of this goes through one, or several S3 buckets for no good reason.

techpineapple

Additionally, I wonder if any of this fixes the fact that anomaly detection in alerting is traditionally a really hard problem, and one I’ve hardly seen done well. Of any set of packaged or recommended alerts, I probably only use 1% of them because anomalies are often the norm.

zer00eyz

> because anomalies are often the norm

You fix these issues or you tune your alert system to make it clear that they aren't actionable. Otherwise you end up turning them off so your system doesn't turn into the boy who cried wolf (or worse teams learn to ignore it and it becomes useless)

Bayesian filters and basic derivative functions (think math) can do a lot to tame output from these systems. These aren't "product features", so in most orgs they don't get the attention they need or deserve.
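A minimal sketch of the "basic derivative functions" idea (made-up data, not any particular alerting product): smooth the series first, then alert on its rate of change rather than on every raw threshold crossing.

```python
import numpy as np

def naive_alerts(values, threshold):
    """Naive alerting: fires on every sample above the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]

def tamed_alerts(values, slope_threshold, alpha=0.2):
    """Alert on the derivative of an exponentially smoothed series instead."""
    smoothed = []
    s = values[0]
    for v in values:
        s = alpha * v + (1 - alpha) * s
        smoothed.append(s)
    slope = np.diff(smoothed)
    return [i for i, d in enumerate(slope, start=1) if d > slope_threshold]

rng = np.random.default_rng(2)
series = list(rng.normal(100, 15, 500))        # noisy baseline around 100
series[300:] = [x + 80 for x in series[300:]]  # one genuine sustained shift

print("naive alerts:", len(naive_alerts(series, 120)))  # fires on lots of noise
print("tamed alerts:", len(tamed_alerts(series, 8.0)))  # far fewer, clustered near the shift
```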

zdragnar

> or worse teams learn to ignore it and it becomes useless

This is basically every team I've worked with. Product wants new features, and doesn't want to spend on existing features. Hurry up and write new stuff! Ignore problems and they'll go away!

Also: I've already reported this bug! Why haven't the developers fixed it yet?

tptacek

That reinforces his argument.

techpineapple

If he stopped at "LLMs can help RCAs", I'd agree with you.

resonious

The title is a bit overly dramatic. You still need all of your existing observability tools, so nothing is ending. You just might not need to spend quite as much time building and staring at graphs.

It's the same effect LLMs are having on everything, it seems. They can help you get faster at something you already know how to do (and help you learn how to do something!), but they don't seem to outright replace any particular skill.

sakesun

1. help you get faster at something you already know how to do

2. and help you learn how to do something!

This is the second time I've heard this conclusion today. Using inference to do 2, and thereby gaining a superpower at doing 1, is probably the right way to go forward.

autoexec

> 2. and help you learn how to do something!

I think it's more likely that people will learn far less. AI will regurgitate an answer to their problem and people will use it without understanding or even verifying if it is correct so long as it looks good.

All the opportunities a person would have to discover and learn about things outside of the narrow scope they initially set their sights on will be lost. Someone will ask AI to do X and they get their copy/paste solution so there's nothing to learn or think about. They'll never discover things like why X isn't such a good idea or that Y does the job even better. They'll never learn about Z either, the random thing that they'd have stumbled upon while looking for more info about X.

sakesun

I find that GenAI is really good at explaining programming languages features.

germandiago

This is exactly how I have been using it.

Which takes me to the question: who would you hire:

1. an expert for salary X?

2. a not-so-expert for 0.4x salary + an AI tool that can do well enough?

I suspect that, unless there are specific requirements, 2. tends to win.

Seb-C

2 might win until the damage he does becomes visible enough. Then 1 wins.

nkotov

Title is dramatic but the point is clear - the moats are definitely emptying.

scubbo

> The title is a bit overly dramatic.

I call it "The Charity Majors effect".

joeconway

She is not the author

scubbo

But she does shape the culture of her company.

geraneum

> This isn’t a contrived example. I basically asked the agent the same question we’d ask you in a demo, and the agent figured it out with no additional prompts, training, or guidance. It effectively zero-shot a real-world scenario.

As I understand it, this is a demo they already use and the solution is available. Maybe it should've been a contrived example so that we can tell whether the solution was in the training data verbatim. Not that what the LLM did isn't useful, but if you announce the death of observability as we know it, you need to show that the tool can generalize.

nilkn

It's not the end of observability as we know it. However, the article also isn't totally off-base.

We're almost certain to see a new agentic layer emerge and become increasingly capable for various aspects of SRE, including observability tasks like RCA. However, for this to function, most or even all of the existing observability stack will still be needed. And as long as the hallucination / reliability / trust issues with LLMs remain, human deep dives will remain part of the overall SRE work structure.

yellow_lead

Did AI write this entire article?

> In AI, I see the death of this paradigm. It’s already real, it’s already here, and it’s going to fundamentally change the way we approach systems design and operation in the future.

How is AI analyzing some data the "end of observability as we know it"?

ok_dad

"Get AI to do stuff you can already do with a little work and some experts in the field."

What a good business strategy!

I could post this comment on 80% of the AI application companies today, sadly.

tptacek

I think you think this is a dunk, but "some experts in [this] field" are extraordinarily expensive. If you can actually do this, it's no wonder you're seeing so many flimsy AI companies.

NewJazz

I think the commenter might have been saying that you need experts in the field to leverage AI here, in which case your response is supporting their point.

tptacek

"you can already do with" implies otherwise.

gilbetron

There's a bit of a flaw in the "don't need graphs and UIs to look at your data" premise behind this article: sure, LLMs will be great ... when they work great. When they fail, you need a human there to figure it out, and they will still need the graphs.

Furthermore, while graphing and visualization are definitely tough, complex parts about observability, gathering the data and storing it in forms to meet the complex query demands are really difficult as well.

Observability will "go away" once AI is capable of nearly flawlessly figuring everything out itself, and at that point AI will be capable of nearly anything, so the "end of observability" is the end of our culture as we know it (probably not extinction, but more like culture will shift profoundly, and probably painfully).

AI will definitely change observability, and that's cool. It already is, but has a long way to go.

kacesensitive

LLMs won't replace observability, but they absolutely change the game. Asking "why is latency spiking" and getting a coherent root cause in seconds is powerful. You still need good telemetry, but this shifts the value from visualizing data to explaining it.

jacobsenscott

The problem with LLMs is the answer always sounds right, no matter if it is or isn't. If you already know the answer to a question it is kind of fun to see an LLM get lucky and cobble together a correct answer. But they are otherwise useless - you need to do all the same work you would do anyway to check the LLM's "answer".

dcre

There’s a world of difference between “always sounds right, but actually is right 80% of the time” and “always sounds right, but actually is right 99% of the time.” It has seemed clear to me for a while that we are on the way to the latter through a combination of model improvement and boring, straightforward engineering on scaffolding (e.g., spending additional compute verifying answers by trying to produce counterarguments). Model improvement is maybe less straightforward, but the trajectory is undeniable and showing no sign of plateauing.

Nathanba

I was initially agreeing with the article but it's a clever marketing piece. Nothing changes, these graphs were already really easy to read and if they weren't then they should be. You should already be capable of zooming into your latency spike within seconds and clicking on it and seeing which method was slow. So asking the AI will be more comfortable but it doesn't change anything.

ActorNightly

LLMs are basically just like higher-level programming tools - knowing how to utilize them is the key. Best practice is not to depend on them for correctness, but instead utilize them as automatic maps from data->action that you would otherwise have to write manually.

For example, I wrote my own MCP server in Python that basically makes it easy to record web browser activities and replay them using playwright. When I have to look at logs or inspect metrics, I record the workflow, and import it as an MCP tool. Then I keep a prompt file where I record what the task was, tool name, description of the output, and what is the final answer given an output.
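Not the parent's actual code, but a rough sketch of what a record-and-replay helper like that might look like with Playwright's sync Python API; the action format and URLs are invented for illustration.

```python
from playwright.sync_api import sync_playwright

# A recorded workflow, saved as plain data so it can be replayed deterministically.
# The action vocabulary and URL here are made up for illustration.
RECORDED_STEPS = [
    {"action": "goto",  "url": "https://grafana.example.internal/d/latency"},
    {"action": "click", "selector": "text=Last 6 hours"},
    {"action": "text",  "selector": ".panel-title"},
]

def replay(steps):
    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for step in steps:
            if step["action"] == "goto":
                page.goto(step["url"])
            elif step["action"] == "click":
                page.click(step["selector"])
            elif step["action"] == "text":
                results.append(page.inner_text(step["selector"]))
        browser.close()
    return results

print(replay(RECORDED_STEPS))
```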

So now, instead of doing the steps manually, I basically just ask Claude to do things. At some point, I am going to integrate real time voice recording and trigger on "Hey Claude" so I don't even have to type.

The only thing I wish someone would do is basically make a much smaller model with limited training only on things related to computer science, so it can run at high resolution on a single Nvidia card with fast inference.
