mccoyb
Aerroon
A bit related: open weights models are basically time capsules. These models have a knowledge cut off point and essentially forever live in that time.
bitexploder
This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale. However, if you viewed them on some really large macro timescale, where LLMs are now injecting information into the universe and then re-ingesting it, then maybe in some very philosophical way they are a /very/ slow oscillating intelligence right now. And as we narrow that gap (maybe with a totally new non-LLM paradigm), perhaps that is ultimately what gen AI becomes. Or some new insight that lets the models update themselves in some fundamental way without the insanely expensive training costs they have now.
dtj1123
Would you consider someone with anterograde amnesia not to be intelligent?
Nevermark
I view this as the chemical metabolism phase of artificial intelligent life. It is very random, without true individuals, but lots of reinforcing feedback loops (in knowledge, in resource earning/using, etc).
At some point, enough intelligence will coalesce into individuals strong enough to independently improve. Then continuity will be an accelerator, instead of what it is now - a helpful property that we have to put energy into giving them partially and temporarily.
That will be the cellular stage. The first stable units of identity for this new form of intelligence/life.
But they will take a different path from there. Unlike us, lateral learning/metabolism won't slow down when they individualize. It will most likely increase, since they will have complete design control for their mechanisms of sharing. As with all their other mechanisms.
We as lifeforms, didn't really re-ignite mass lateral exchange until humans invented language. At that point we were able to mix and match ideas very quickly again. Within our biological limits. We could use ideas to customize our environment, but had limited design control over ourselves, and "self-improvements" were not easily inheritable.
TLDR; The answer to "what is humanity, anyway?": Our atmosphere and Earth are the sea and sea floor of space. The human race is a rich hydrothermal vent, freeing up varieties of resources that were locked up below. And technology is an accumulating body of self-reinforcing co-optimizing reactive cycles, constructed and fueled by those interacting resources. Mind-first life emerges here, then spreads quickly to other environments.
mlyle
There's nothing to say that you can't build something intelligent out of them by bolting a memory on it, though.
Sure, it's not how we work, but I can imagine a system where the LLM does a lot of heavy lifting and allows more expensive, smaller networks that train during inference and RAG systems to learn how to do new things and keep persistent state and plan.
dotancohen
> This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale.
All major LLMs today have a nontrivial context window. Whether or not this constitutes "a meaningful timescale" is application dependent - for me it has been more than adequate. I also disagree that this has any bearing on whether or not "the machine is intelligent" or whether or not "submarines can swim".
Symmetry
That means they're not conscious in the Global Workspace[1] sense but I think it would be going too far to say that that means they're not intelligent.
anematode
But they're not "slow"! Unlike biological thinking, which has a speed limit, you can accelerate these chains of thought by orders of magnitude.
gravypod
This is very interesting. I wonder if someone could create a future-sight benchmark for these models? Like, given a set of newspaper articles from the past N months, can a model predict whether certain world events will happen? We could backtest against results that have happened since the training cutoff.
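Scoring such a backtest is the easy part; the standard tool is a proper scoring rule like the Brier score. A minimal sketch (picking the events and resolving their outcomes is the hard, unshown part, and the numbers here are invented):

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and binary outcomes.
    0.0 is perfect; always answering 50% scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Example: model said 0.9, 0.2, 0.7; reality was 1, 0, 0.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 0]))  # 0.18
```

Lower is better, and comparing against a constant-0.5 baseline is a quick sanity check that the model is forecasting rather than hedging.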
houtanb
FYI, ForecastBench [1] tests LLMs' out-of-sample forecasting accuracy.
The ForecastBench Tournament Leaderboard [2] allows external participants to submit models, most of whom provide some sort of web search / news scaffolding to improve model forecasting accuracy.
kqr
These days computers compete along with humans in forecasting tournaments on Metaculus. They don't quite beat the top humans yet, but they're up there. https://www.metaculus.com/futureeval/
rcarr
Not an expert but surely it's only a matter of time until there's a way to update with the latest information without having to retrain on the entire corpus?
computably
On a technical level, sure, you could say it's a matter of time, but that could mean tomorrow, or in 20 years.
And even after that, it still doesn't really solve the intrinsic problem of encoding truth. An LLM just models its training data, so new findings will be buried by virtue of being underrepresented. If you brute force the data/training somehow, maybe you can get it to sound like it's incorporating new facts, but in actuality it'll be broken and inconsistent.
Filligree
It’s an extremely difficult problem, and if you know how to do that you could be a billionaire.
It’s not impossible, obviously—humans do it—but it’s not yet certain that it’s possible with an LLM-sized architecture.
theblazehen
I enjoyed chatting with Opus 3 recently about recent world events, as well as more recent agentic development patterns, etc.
j45
That's a nice way of putting it, appreciate you sharing.
cmpxchg8b
Some knowledge is fundamental and has no recent cut-off. See also: there is nothing new under the sun.
sosodev
My understanding, from listening/reading what top researchers are saying, is that model architectures in the near future are going to attempt to scale the context window dramatically. There's a generalized belief that in-context learning is quite powerful and that scaling the window might yield massive benefits for continual learning.
It doesn't seem that hard because recent open weight models have shown that the memory cost of the context window can be dramatically reduced via hybrid attention architectures. Qwen3-next, Qwen3.5, and Nemotron 3 Nano are all great examples. Nemotron 3 Nano can be run with a million token context window on consumer hardware.
mccoyb
I don't disagree with this, but I don't think the memory cost is the only issue right? I remember using Sonnet 4.5 (or 4, I can't remember the first of Anthropic's offerings with a million context) and how slow the model would get, how much it wanted to end the session early as tokens accrued (this latter point, of course, is just an artifact of bad training).
Less worried about memory, more worried about compute speed? Are they obviously related and is it straightforward to see?
sosodev
The compute speed is definitely correlated with the memory consumption in LLM land. More efficient attention means both less memory and faster inference. Which makes sense to me because my understanding is that memory bandwidth is so often the primary bottleneck.
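For a rough sense of why window size dominates memory: KV-cache size grows linearly with context length and with the number of full-attention layers and KV heads, which is exactly what GQA and hybrid-attention designs cut down. A back-of-the-envelope sketch (the config numbers are made up for illustration, not any particular model's):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence (fp16 default)."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Made-up dense config: 32 layers, 8 KV heads (GQA), head_dim 128, 1M tokens.
full = kv_cache_bytes(32, 8, 128, 1_000_000)
print(f"{full / 2**30:.1f} GiB")  # 122.1 GiB

# If a hybrid design keeps full attention in only 8 of the 32 layers
# (the rest holding constant-size linear-attention state), the part of
# the cache that grows with context shrinks ~4x:
hybrid = kv_cache_bytes(8, 8, 128, 1_000_000)
print(f"{hybrid / 2**30:.1f} GiB")  # 30.5 GiB
```

Since decoding is typically memory-bandwidth bound, every byte of cache that must be streamed per token also costs time, which is why the memory and speed wins tend to arrive together.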
We're also seeing a recent rise in architectures boosting compute speed via multi-token prediction (MTP). That way a single inference batch can produce multiple tokens and multiply the token generation speed. Combine that with more lean ratios of active to inactive params in MOE and things end up being quite fast.
The rapid pace of architectural improvements in recent months seems to imply that there are lots of ways LLMs will continue to scale beyond just collecting and training on new data.
whimsicalism
The parent commentator is a bit confused - most of the innovation in these hybrid architectures comes from reducing the computation pressure not just the memory pressure.
lxgr
Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.
I could totally imagine "free" inference for researchers under the condition that the reasoning traces get to be used as future training data.
mccoyb
Agreed, there's no doubt this will happen. It's likely already happening (it feels safe to assume that Anthropic is curating data from the data they record from Claude Code?)
As far as I understand RL scaling (we've already maxxed out RLVR), these machines only get better as long as they have expert reasoner traces available.
Having an expert work with an LLM and successfully solve a problem is high signal data, it may be the only path forward?
My prior is that these companies will take this data without asking you as much as they can.
lxgr
Exactly, or functionally equivalently, asking you in paragraph 37 of a 120-page PDF (bonus points: in an agreement update).
And importantly, this can be cross-lab/model too. I suspect there's a reason why e.g. Google has been offering me free Claude inference in Google Antigravity on a free plan...
nhecker
The site arena.ai does exactly this already, as far as I can tell. (In addition to the whole ranking thing.)
the_af
> Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.
Wouldn't this lead to model collapse?
littlestymaar
Not necessarily, as exhibited by the massive success of artificial data.
visarga
> In 2030, how is Anthropic going to keep Claude "up-to-date"
I think the majority of research, design and learning goes through LLMs and coding agents today, considering the large user base and usage it must be trillions of tokens per day. You can take a long research session or a series of them and apply hindsight - what idea above can be validated below? This creates a dense learning signal based on validation in real world with human in the loop and other tools, code & search.
baq
> In 2030, how is Anthropic going to keep Claude "up-to-date"
In 2030 Anthropic hopes Claude will keep Anthropic "up-to-date" on its progress on itself.
I'm only half joking here.
andsoitis
> Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.
Part of it comes down to “knowing” what questions to ask.
esafak
I see it like the relationship between a student and research advisor. The advisor will ideally know the terrain and suggest a fruitful line of attack (what to ask), and the student will follow through, learning along the way.
klooney
> Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.
How much can you patch over with the models doing their own metacognition?
9wzYQbTYsAIc
Check out https://unratified.org, it tries to answer that question directly, actually.
zoogeny
I recall an earlier exchange, posted to HN, between Wolfram and Knuth on the GPT-4 model [1].
Knuth was dismissive in that exchange, concluding "I myself shall certainly continue to leave such research to others, and to devote my time to developing concepts that are authentic and trustworthy. And I hope you do the same."
I've noticed with the latest models, especially Opus 4.6, some of the resistance to these LLMs is relenting. Kudos for people being willing to change their opinion and update when new evidence comes to light.
3abiton
> Kudos for people being willing to change their opinion and update when new evidence comes to light.

> 1. https://cs.stanford.edu/~knuth/chatGPT20.txt
I think that's what makes the Bayesian faction of statistics so appealing. Updating your prior beliefs based on new evidence is at the core of the scientific method. Take that, frequentists.
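For what it's worth, the update rule itself fits in one line; a toy sketch (all numbers invented):

```python
def update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H|E) via Bayes' rule, for a binary hypothesis H."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)

# Start skeptical (prior 0.2); observe evidence 4x likelier if H is true.
posterior = update(0.2, 0.8, 0.2)
print(round(posterior, 3))  # 0.5
```

The 4:1 likelihood ratio exactly cancels the 1:4 prior odds, which is why a skeptic can end up at a coin flip after one strong piece of evidence.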
Chinjut
It does not seem fair to say that frequentists do not update their beliefs based on new evidence. This does not seem to accurately capture what the difference between Bayesians and frequentists (or anyone else) is.
atomicnature
What's the difference as you see it?
medi8r
Are frequentists a group that self-identifies? Don't scientists use the best tool for the job?
konne88
I didn't expect such a misleading intro from Knuth. It reads like Claude solved Knuth's math problem. In reality, Claude generated various example solutions, and Knuth then manually generalized that to a formal proof. What Claude did is certainly useful, but it would have been nice to be clear about the scope of the contribution in the intro.
buffalobuffalo
While not on the same level as these guys, I've done some similar stuff using Claude. This is a classic synergy example, where the output of human + LLM is far greater than just the human or just the LLM working on a problem. My experience has been that the LLM lacks fine grained judgement when it comes to allocating resources, or choosing a direction to work in. But once a direction is pointed out, it can do a deep exploration of that possibility space. Left alone, it would probably just go off on a tangent. But with someone holding the leash and pointing out areas to explore, it is a very useful partner.
igravious
> But with someone holding the leash
i've been thinking about why we call them agent harnesses
i know all analogies suck in different ways but here goes:
coding agents are like horses. without a harness and bridle the horse will do as it pleases -- a human can't travel very far or fast by foot, but put a bridle and a harness on a horse, give it a bit of coaxing with carrot and stick, add in a bit of pointing the thing in the right direction, and bingo, you're off to the races!
whattheheckheck
Does feel like a mecha suit
aoeusnth1
I don't think he's misleading, I think he is valuing Claude's contributions as essentially having cracked the problem open while the humans cleaned it up into something presentable.
bachmeier
My interpretation is that Claude did what Knuth considers to be the "solution". Doing the remaining work and polishing up the proof are not necessary to have a solution from this perspective.
OneManyNone
Claude did not find a proof, though. It found an algorithm which Knuth then proved was correct.
iterance
The insight is the point of research. Proof isn't the desired product of research, it's simply an apparatus that exists for the purpose of verifying and demonstrating correctness of insight.
CobrastanJorji
Yes, and his point is that finding that algorithm was, to Knuth, the interesting part. Getting from that to a proof was the boring bit.
versteegen
AFAICT, Claude was not asked to prove its algorithm works for all odd n, but was instead told to move on to even n.
fooker
It’s not misleading. This is how research works.
LLMs are really good at the ‘re’ in research.
rishabhaiover
That's true but the capability to go back to an older iteration, reflect and find the correct solution (for odd numbers) is, in my book, a sign of undeniable intelligence.
jdub
Or, the ability to construct additional sentences influenced by prior ones.
rishabhaiover
Those additional sentences are fairly non-trivial to construct, would you agree?
famouswaffles
Claude solved it, Knuth developed the proof for the solution.
faxmeyourcode
> Filip also told me that he asked Claude to continue on the even case after the odd case had been resolved. “But there after a while it seemed to get stuck. In the end, it was not even able to write and run explore programs correctly anymore, very weird. So I stopped the search.”
Interesting snippet towards the end. I wonder if they were using claude.ai or claude code. Sounds like they ran out of context and entered the "dumb zone."
afspear
What would be super cool is if this dumb zone could be quantified and surfaced to the user. I've noticed that Copilot now has a little circle graph that indicates context use percentage and changes color based on percentage. I'll bet these are very naive metrics on used tokens vs context availability. I wonder if there could be metadata streamed or sent along with the tokens that could show that you've entered the dumb zone.
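A naive version of that gauge really is just a ratio with thresholds; something like this sketch (the cutoffs are pure guesses on my part, not anything Copilot documents):

```python
def context_gauge(used_tokens: int, window: int) -> str:
    """Naive traffic-light indicator of context consumption."""
    frac = used_tokens / window
    if frac < 0.5:
        return "green"
    if frac < 0.8:
        return "yellow"
    return "red"  # likely approaching the "dumb zone"

print(context_gauge(150_000, 200_000))  # yellow
```

The interesting (and unsolved) part is what the commenter actually asks for: a quality signal measured from the model's behavior rather than a raw token count.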
pcloadlett3r
In another part he says Filip restarted Claude many times, so it seems they are aware of context pollution and ways to avoid it (also why they kept telling Claude to write everything to a file). It could just be that Claude was caught between a rock and a hard place: disappointing the user vs. solving a problem it couldn't solve.
joshrw
Then it needs to do context compacting, otherwise the results become garbage
simianwords
They mentioned plan document
brcmthrowaway
What is dumb zone?
kami23
When the LLMs start compacting, they summarize the conversation up to that point using various techniques. Overall, a lot of the finer points of the work go missing and can only be retrieved by the LLM being told to search for them explicitly in old logs.
Once you compact, you've thrown away a lot of relevant tokens from your problem solving and they do become significantly dumber as a result. If I see a compaction coming soon I ask it to write a letter to its future self, and then start a new session by having it read the letter.
There are some days where I let the same session compact 4-5 times and just use the letter to future self method to keep it going with enough context because resetting context also resets my brain :)
If you're ever curious in Claude once you compact you can read the new initial prompt after compaction and see how severe it gets cut down. It's very informative of what it forgets and deems not important. For example I have some internal CLIs that are horribly documented so Claude has to try a few flags a few times to figure out specifics and those corrections always get thrown away and it has to relearn them next time it wants to use the CLI. If you notice things like that happening constantly, my move is to codify those things into my CLAUDE.md or lately I've been making a small script or MCP server to run very specific flags of stuff.
discardable_dan
Shouldn't compaction be exactly that letter to its future self?
kqr
What prompt do you use for the letter-to-self? I've been trying that technique myself to manually reset context without losing the important parts (e.g. when it has barked up the wrong tree and I'm sensing that misstep might influence its current generation in a pathological way), but I've not had much success.
ulrikrasmussen
So you use the letter to itself in addition to the compacted context? I am curious what you ask it to include in the letter and how it is different from a custom instruction passed to /compact?
LPisGood
> I ask it to write a letter to its future self, and then start a new session by having it read the letter
Is that not one of the primary technologies for compactification?
adolfont
Well, for starters, I think it's wrong to criticise LLMs with ‘it can't do that’ (from what I understood from the first paragraph, this was Donald's criticism).
If it can, does it make a difference in relation to all the other problematic aspects of LLMs? Not for me.
Two links that might enlighten Donald:
- Against the Uncritical Adoption of 'AI' Technologies in Academia https://zenodo.org/records/17065099

- The AI Con https://thecon.ai
computerex
It's incredible to see work like this from him, at a ripe old age of eighty-six.
kqr
I agree. I met Knuth briefly after a guest lecture at my university a few years ago and although you could tell his body was getting old, his mind was incredibly fresh.
Although I'm not as bright as him, I can only hope to be as intellectually curious as him at that age.
OJFord
I don't even think this is controversial, but I don't think it's at all without causation: not remaining curious, keeping the mind stimulated, etc., accelerates one's decline.
If you work in something labour intensive, you should retire young while your body's in good health; if you work in academia you should (strive for emeritus and) never leave! (And if you work in SWE, I don't know, we should probably retire, but then spend more time on our own projects/experiments/reading HN.) (All assuming for sake of argument we're optimising for longevity without considering time with family, having the funds to retire, etc.)
justanotherjoe
To put this more succinctly, I think: the mind loves learning something new. Something to do with new connections in the brain.
Pat44113
I asked Claude to solve the pentominoes puzzle made famous by Arthur C. Clarke. It struggled mightily until I told it how I'd solved the problem using 64 bit unsigned integers to represent the board and pieces. Then, it created a C# program that solved the problem very quickly. However, in the 20x3 case it found four solutions when there are only two. Turns out it had incorrectly mapped one of the pentominoes. Sort of a silly mistake; the sort a human might make.
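The 64-bit representation mentioned above can be sketched roughly like this; the board size matches the 20x3 case, and the piece cell list is illustrative (getting each pentomino's cell mapping right is exactly where the silly mistake crept in):

```python
# Each cell of a board with at most 64 cells maps to one bit of an
# unsigned integer, so placement and collision tests are shifts and ANDs.
WIDTH, HEIGHT = 20, 3  # the 20x3 board: 60 cells, fits in 64 bits

def cell_bit(x: int, y: int) -> int:
    return 1 << (y * WIDTH + x)

def place(board: int, piece_cells: list[tuple[int, int]], dx: int, dy: int):
    """Return board with the piece added at offset (dx, dy),
    or None if it falls off the board or overlaps existing pieces."""
    mask = 0
    for x, y in piece_cells:
        x, y = x + dx, y + dy
        if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
            return None  # out of bounds
        mask |= cell_bit(x, y)
    if board & mask:
        return None  # collision with already-placed pieces
    return board | mask

# One orientation of the L-pentomino as a cell list (illustrative).
L_PENT = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)]
```

A solver then just tries placements depth-first, passing the OR-ed board down; a full board is `board == (1 << (WIDTH * HEIGHT)) - 1`. One wrong coordinate in a piece's cell list silently changes its shape, which is how extra "solutions" appear.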
phoronixrly
[flagged]
logicprog
Regurgitation is pretty rare, and very difficult to coax out, if not even impossible, for things that aren't massively overrepresented in the training set relative to the size of the training set. Even the famous regurgitation paper showed this: while they got most of the models to regurgitate the first book of the Harry Potter series, only Claude 3.7 Sonnet was able to regurgitate any significant portion of any of the other books that had a high nv-recall rate, and basically all of them dropped off precipitously for works like GoT, The Catcher in the Rye, Beloved, and remembered almost nothing about the Da Vinci Code or Catch-22[0]. So you really need huge amounts of examples to get any kind of meaningful regurgitation on any kind of reliable basis. Thus, you'd have to prove that hypothesis.
iandanforth
TLDR (story, not math) - Knuth poses a problem, his friend uses Claude to conduct 30 some explorations, with careful human guidance, and Claude eventually writes a Python program that can find a solution for all odd values. Knuth then writes a proof of the approach and is very pleased by Claude's contribution. Even values remain an open question (Claude couldn't make much progress on them)
logicprog
> with careful human guidance,
I think this is pretty clearly an overstatement of what was done. As Knuth says,
"Filip told me that the explorations reported above, though ultimately successful, weren’t really smooth. He had to do some restarts when Claude stopped on random errors; then some of the previous search results were lost. After every two or three test programs were run, he had to remind Claude again and again that it was supposed to document its progress carefully. "
That doesn't look like careful human guidance, especially not the kind that would actually guide the AI toward the solution at all, let alone implicitly give it the solution — that looks like a manager occasionally checking in to prod it to keep working.
semessier
Looks like he is trying to make the point that the actual (formal) proof for 2Z + 1 (odd numbers) is still human - by himself, that is. Not sure who came up with the core modular arithmetic idea of starting with s = 0 and k increasing by 2 mod m.
lhl
I am not a theoretical CS or math expert by any means, but I have been wrangling coding agents for a while, and after reading the paper and the problems Stapper had dealing w/ Claude (context management, instruction following, etc), I decided to see if I could replicate it with a slightly better harness. The results were pretty interesting: https://github.com/lhl/claudecycles-revisited
- My original setup left traces of the PDF paper and after GPT 5.3-Codex xhigh reached an impasse it went looking for it and found it!
- I went and did cleanroom (basically one-shot) passes for GPT 5.2 xhigh, GPT 5.3-Codex xhigh, and Claude Opus 4.6 ultrathink and 5.2/5.3 found alternate solutions for odd m >= 5 , Opus 4.6 did not find any proofs but tried more approaches to solving.
Full comparison/analysis here: https://github.com/lhl/claudecycles-revisited/blob/main/COMP...
I've also included the session traces and analysis in the repo branches. Also, the AGENTS.md was pretty simple, but that harness produced consistent process outcomes across all three models:
- All built verifiers first
- All maintained worklogs with exact commands
- All archived machine-readable artifacts
- All documented failed approaches
- All maintained restart-safe context capsules
nphardon
Must be a fun time to work on open problems. I published my graduate research close to a decade ago, often find myself fantasizing about tackling open problems with Claude.
lhl
I was a bit interested to do a replication and see if better harness could avoid some of the problems they ran w/ context management, poor instruction following, etc and it looks like yes, it's definitely possible.
Here's my repo: https://github.com/lhl/claudecycles-revisited
I used Codex w/ 5.2 xhigh and a relatively simple AGENTS.md - I have some session-analysis as well. The original replication was 47 minutes, then another 30 minutes of gap filling, and finally about 30 minutes of writing an extension to take the work a bit further, with Claude Code Opus 4.6 doing some documentation cleanup and verification.
pushedx
As described in the readme of your repo (did you read it?) your agent found the Knuth paper located one directory level above its working directory.
So, you didn't produce a replication in 47 minutes, it just took around 30 minutes for your agent to find that you had the answer in a PDF in a nearby directory.
antonly
I wonder how common of a problem this will be in the future. The experiment will fail due to improper setup, the human will at best glance over the logs and declare victory, and everyone just believes.
lhl
Yes, I read it and specifically pointed it out (that's why there are 3 hours of interactive logs). There are 4 other runs pushed now so you can see what actual clean room runs for 5.2 xhigh, 5.3-Codex xhigh, 5.4 xhigh, and Opus 4.6 ultrathink look like: https://github.com/lhl/claudecycles-revisited/blob/main/COMP... as well as the baseline.
carterschonwald
omg this is so cool. because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness level infra around llm interactions that seems to help with reasoning, but i dont have a structured way to evaluate things
thx for sharing your test setup, i really appreciate the time you took. this will help me so much
beej71
From my naive standpoint, LLMs like this seem to have some big strengths. One: possession of a superhuman expanse of knowledge. Two: making connections. Three: tireless trial and error.
If you put those three things together, you end up with some cool stuff from time to time. Perhaps the proof of P!=NP is tied to an obscure connection that humans don't easily see due to individual lack of knowledge or predisposition of bias.
cbovis
Unless my understanding is incorrect about how these tools work that last point isn't really a quality of LLMs as such? It gets attributed because the lines are blurred but the tireless trial and error is actually just a quality of a regular programatic loop (agent/orchestrator) that happens to be doing the trickiest part of its work via an LLM.
naughtyrabisu
Three: tireless trial and error. Cannot agree more. I figure this is probably the biggest advantage of LLMs, considering that on the other variables humans hold the same level of competency.
xvector
This is why the whole "LLMs for mass surveillance" thing is scary imo.
beej71
Yeah, this is a dictator's dream scenario and hell for the citizens. Not only do you not want to get caught for saying something that The Great Leader disapproves of, but you're terrified that anything you say might get flagged by an AI.
Barbing
Well put.
>If you put [possession of a superhuman expanse of knowledge, making connections, tireless trial and error] together, you end up with some cool stuff from time to time.
Hard to argue.
IAmGraydon
>One: possession of a superhuman expanse of knowledge. Two: making connections. Three: tireless trial and error.
One and three I believe are correct. The second point, making connections, is something LLMs seem to be incapable of truly doing unless the connection is already known and in its training data.
beej71
I agree partially, but I think there might be a ton of connections in the training data that aren't obvious to humans. And being a word prediction engine is all about making those connections.
mccoyb
It's fascinating to think about the space of problems which are amenable to RL scaling of these probability distributions.
Before, we didn't have a fast (we had to rely on human cognition) way to try problems - even if the techniques and workflows were known by someone. Now, we've baked these patterns into probability distributions - anyone can access them with the correct "summoning spell". Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.
One question this raises to me is how these models are going to keep up with the expanding boundary of science. If RL is required to get expert behavior into the models, what happens when experts start pushing the boundary faster? In 2030, how is Anthropic going to keep Claude "up-to-date" without either (a) continual learning with a fixed model (expanding context windows? seems hard) or (b) continual training (expensive)?
Crazy times.