Brian Lovin
/
Hacker News
Daily Digest email

Get the top HN stories in your inbox every day.

ndesaulniers

I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel. https://clangbuiltlinux.github.io/

This LLM did it in (checks notes):

> Over nearly 2,000 Claude Code sessions and $20,000 in API costs

It may build, but does it boot? (That was also a significant and distinct next milestone. Also, will it blend?) Looks like yes!

> The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.

The next milestone is:

Is the generated code correct? The jury is still out on that one for production compilers. And then there's the performance of the generated code.

> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

Still a really cool project!

shakna

> Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

Does it really boot...?

ndesaulniers

> Does it really boot...?

They don't need 16-bit x86 support for the RISC-V or ARM ports, so yes, but it depends on which 'it' we're talking about here.

Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (the GNU assembler). This blog post calls it the "GCC assembler and linker," but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then, GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already? (IIRC, there was some discussion about that a few years ago.)

shakna

Yeah, I didn't mention gas or ld, for similar reasons. I agree that a compiler doesn't necessarily "need" those.

I don't agree that all the claims are backed up by their own comments, which means there are probably other places where it falls down.

It's... misrepresentation.

Chicken, for example, is a Scheme compiler. But they're very up front that it depends on a C compiler.

Here, they wrote a C compiler that is at least sometimes reliant on having a different C compiler around. So is the project at 50%? 75%?

Even if it's 99%, that's not the same story as the one they tried to write. And if they had written that tale instead, it would be more impressive, rather than "There are some holes. How many?"

TheCondor

The assembler seems like nearly the easiest part. Slurp arch manuals and knock it out, it’s fixed and complete.

jakewins

I am surprised by the number of comments that say the assembler is trivial - it is admittedly perhaps simpler than some other parts of the compiler chain, but it’s not trivial.

What you are doing is kind of serialising a self-referential graph structure of machine-code entries that reference each other's addresses, but you don't know the addresses, because the (x86) instructions are variable-length, so you can't know them until you generate the machine code: a chicken-and-egg problem.

Personally I find writing parsers much much simpler than writing assemblers.
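
The chicken-and-egg problem described above can be sketched as a grow-only fixpoint loop. This is a hypothetical Python illustration, not code from the project: the 2-byte vs. 5-byte sizes stand in for x86 short (rel8) vs. near (rel32) jumps, and real assemblers track many more encodings.

```python
def assemble_sizes(instrs):
    """instrs: list of ('op', None) fixed 1-byte ops or ('jmp', target_index).
    Returns the final size in bytes of each instruction."""
    # Start optimistic: every jump is a 2-byte short jump.
    sizes = [1 if op != 'jmp' else 2 for op, _ in instrs]
    while True:
        # Compute each instruction's address from the current size guesses.
        addrs, pc = [], 0
        for s in sizes:
            addrs.append(pc)
            pc += s
        changed = False
        for i, (op, target) in enumerate(instrs):
            if op != 'jmp':
                continue
            # rel8 displacement is measured from the end of the jump.
            offset = addrs[target] - (addrs[i] + sizes[i])
            need = 2 if -128 <= offset <= 127 else 5
            # Sizes only ever grow, so the loop terminates.
            if need > sizes[i]:
                sizes[i] = need
                changed = True
        if not changed:
            return sizes

# A backward jump that fits in rel8, and one 200+ bytes away that doesn't.
program = [('op', None)] * 3 + [('jmp', 0)] + [('op', None)] * 200 + [('jmp', 0)]
sizes = assemble_sizes(program)
```

Because a jump growing from 2 to 5 bytes can push another jump's target out of rel8 range, the loop must rerun until nothing changes; monotone growth guarantees it settles.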

shakna

Huh. A second person mentioning the assembler. I don't think I ever referred to one...?

brundolf

One thing people have pointed out is that well-specified (even if huge and tedious) projects are an ideal fit for AI, because the loop can be fully closed and it can test and verify the artifact by itself with certainty. Someone was saying they had it generate a rudimentary JS engine because the available test suite is so comprehensive

Not to invalidate this! But it's toward the "well-suited for AI" end of the spectrum

HarHarVeryFunny

Yes - the gcc "torture test suite" that is mentioned must have been one of the enablers for this.

It's notable that the article says Claude was unable to build a working assembler (& linker), which is nominally a much simpler task than building a compiler. I wonder if this was at least in part due to not having a test suite, although it seems one could be auto-generated during bootstrapping with gas (the GNU assembler) by creating gas-generated (asm, ELF) pairs as the necessary test suite.
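
That auto-generation idea might look something like this sketch (hypothetical: `build_golden_pairs` and the directory layout are made up, and the default `as --64` invocation assumes GNU binutils is installed):

```python
import pathlib
import subprocess

def build_golden_pairs(corpus_dir, out_dir, assembler_cmd=("as", "--64")):
    """For each .s file in corpus_dir, run the reference assembler (gas by
    default) and keep the (asm, object) pair as a golden test case.
    assembler_cmd is assumed to accept gas-style `... -o OUT IN` arguments."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pairs = []
    for src in sorted(pathlib.Path(corpus_dir).glob("*.s")):
        obj = out / (src.stem + ".o")
        # Skip inputs the reference assembler itself rejects.
        result = subprocess.run([*assembler_cmd, "-o", str(obj), str(src)])
        if result.returncode == 0:
            pairs.append((src, obj))
    return pairs

# pairs = build_golden_pairs("corpus/", "testsuite/")  # requires gas installed
```

A candidate assembler then passes a case if its output matches the golden object, byte for byte or after normalizing nondeterministic ELF fields.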

It does raise the question of how they got the compiler to the point of correctly generating a valid C -> asm mapping before tackling the issue of gcc compatibility, since the generated code apparently bears no relation to what gcc generates. I wonder which compilers' source code Claude has been trained on, and how closely this compiler's code generation and attempted optimizations compare to theirs.

spullara

I'm sure Claude has been trained on every open-source compiler.

qarl

> Still a really cool project!

Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.

The fact that the optimizations aren't as good as the 40-year-old GCC project's? Eh - I think people who focus on that are probably still in some serious denial.

PostOnce

It's amazing that it "works", but viability is another issue.

It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.

Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.

On top of that, Anthropic is losing money on it.

All of those things combined, viability remains a serious question.

ryanjshaw

> You won't know until you've finished spending the money whether it will fail or not.

How do you conclude that? You start off with a bunch of tests and build these things incrementally, why would you spend 20k before realizing there’s a problem?

qarl

> It cost $20,000

I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

You should look it up. :)

tumdum_

> On top of that, Anthropic is losing money on it.

It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e

chamomeal

That's a good point! Here claude opus wrote a C compiler. Outrageously cool.

Earlier today, I couldn't get opus to replace useEffect-triggered-redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect rube goldberg machine.

To be fair, it was a pretty horrible mess of useEffects. But just another data point.

Also I was hoping opus would finally be able to handle complex typescript generics, but alas...

georgeven

It's $20,000 in 2026; with the price of tokens halving every year (at a given performance level), this will be around $1,000 in 2030.

RA_Fisher

Progress can be reviewed over time, and I'd think that'd take a lot of the risk out.

nly

Also, heaven knows if the result is maintainable or easy to change.

bdangubic

> On top of that, Anthropic is losing money on it

This has got to be my favorite one of them all that keeps coming up in too many comments... You know who else was losing money in the beginning? Every successful company that ever existed! Some, like Uber, were losing billions for a decade. And when was the last time you rode in a taxi? (I still do; my kid never will.) Not sure how old you are and if you remember "Facebook will never be able to monetize on mobile..." - they all lose money, until they don't.

thesz

> This test sorta definitely proves that AI is legit.

This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

The "out of distribution" test would be something like "implement a (self-bootstrapping, Linux-kernel-compatible) C compiler in J." J is different enough from C, and I know of no such compiler.

disgruntledphd2

> This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

It's still really, really impressive though.

Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.

Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.

Rudybega

There are two compilers that can handle the Linux kernel: GCC and LLVM. Both are written in C and C++, not Rust. It's "in distribution" only if you really stretch the meaning of the term. A generic C compiler isn't going to be anywhere near the level of rigour of this one.

LinXitoW

How does spending $20K to replicate code available by the thousands online (toy C compilers) prove anything? It requires a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.

soperj

Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their shop-and-go stores, but then finding out it was thousands of people in Pakistan just watching you via camera.

cardanome

I will write you a C compiler by hand for $19k, and it will be better than what Claude made.

Writing a toy C compiler isn't that hard. Any decent programmer can write one in a few weeks or months. The optimizations are the actually interesting part and Claude fails hard at that.

kvemkon

> optimizations aren't as good as the 40 year gcc project

with all optimizations disabled:

> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

qarl

That distinction doesn't change my point. I am not surprised that a 40 year old project generates better code than this brand new one.

dwaite

It is legit - with some pretty severe caveats. I am hard-pressed to come up with an example that has more formal specification, more published source implementations, and more public unit-test coverage than a C compiler.

It is not feasible that someone will use AI to tackle genuinely new software and provide a tenth of the guardrails Anthropic had for this project. They were able to keep the million monkeys at their million typewriters on an extremely short leash, and to have it do the vast majority of iteration without human intervention.

byzantinegene

It cost $20,000 to reinvent the wheel, one it probably trained on. If that's your definition of legit, sure.

organicUser

Well, if today it is a matter of cost, tomorrow it won't be anymore. 4 GB of RAM in the '80s would have cost tens of millions of dollars; now even your car runs 4 GB of memory just for the infotainment system, and dozens of GB of RAM for the most complex assistants. So I would see this achievement more as a warning: the final result is not what's concerning, it is the premonition behind it.

ip26

I’m excited and waiting for the team that shows with $20k in credits they can substantially speed up the generated code by improving clang!

byzantinegene

I'm sorry, but that will take another $20 billion in AI capex to train our latest SOTA model so that it will cost $20k to improve the code.

9rx

> I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel.

How much of that time was spent writing the tests they found and used in this experiment? You (or someone like you) were a major contributor to this. All Opus had to do here was keep brute-forcing a solution until the tests passed.

It is amazing that it is possible at all, but it remains an impossibility without a heavy human hand. One could easily still spend a good part of their career reproducing this if they first had to rewrite all of the tests from scratch.

beambot

This is getting close to Ken Thompson "Trusting Trust" territory -- AI could soon embed itself into the compilers themselves.

bopbopbop7

A pay-to-use, non-deterministic compiler. Sounds amazing, you should start one.

Aurornis

Application-specific AI models can be much smaller and faster than the general purpose, do-everything LLM models. This allows them to run locally.

They can also be made to be deterministic. Some extra care is required to avoid computation paths that lead to numerical differences on different machines, but this can be accomplished reliably with small models that use integer math and use kernels that follow a specific order of operations. You get a lot more freedom to do these things on the small, application-specific models than you do when you're trying to run a big LLM across different GPU implementations in floating point.
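
The determinism argument can be illustrated with a toy quantized dot product (a hypothetical sketch; `quantize` and `int_dot` are made-up names, not any real library's API):

```python
def quantize(vec, scale=127):
    """Map floats in [-1, 1] to int8-range integers (symmetric quantization)."""
    return [max(-127, min(127, round(v * scale))) for v in vec]

def int_dot(a_q, b_q):
    # Plain left-to-right integer accumulation: exact and order-independent,
    # so every platform produces the same bits.
    acc = 0
    for x, y in zip(a_q, b_q):
        acc += x * y
    return acc

a = quantize([0.5, -0.25, 0.125])   # -> [64, -32, 16]
b = quantize([1.0, 1.0, -1.0])      # -> [127, 127, -127]
out = int_dot(a, b)                 # -> 2032, identical on every machine
```

Integer accumulation in a fixed order is exact, so the result is bit-identical everywhere; a float32 reduction, by contrast, can vary with the summation order that different GPUs or SIMD widths choose.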

ndesaulniers

Some people care more about compile times than the performance of generated code. Perhaps even the correctness of generated code. Perhaps more so than determinism of the generated code. Different people in different contexts can have different priorities. Trying to make everyone happy can sometimes lead to making no one happy. Thus dichotomies like `-O2` vs `-Os`.

EDIT (since HN is preventing me from responding):

> Some people care more about compiler speed than the correctness?

Yeah, I think plenty of people writing code in languages that have concepts like Undefined Behavior technically don't really care as much about correctness as they may claim otherwise, as it's pretty hard to write large volumes of code without indirectly relying on UB somewhere. What is correct in such case was left up to interpretation of the implementer by ISO WG14.

ndesaulniers

We're already starting to see people experimenting with applying AI towards register allocation and inlining heuristics. I think that many fields within a compiler are still ripe for experimentation.

https://llvm.org/docs/MLGO.html

int_19h

What I want to know is when we get AI decompilers

Intuitively it feels like it should be a straightforward training setup - there's lots of code out there, so compile it with various compilers, flags etc and then use those pairs of source+binary to train the model.

jojobas

Sorry, clang 26.0 requires an Nvidia B200 to run.

psychoslave

Hmm, well, they're already embedded in fonts: https://hackaday.com/2024/06/26/llama-ttf-is-ai-in-a-font/

greenavocado

Then I'll be left wondering why my program requires 512TB of RAM to open.

andai

The asymmetry will be between the frontier AI's ability to create exploits vs find them.

iberator

Claude did not write it. You wrote it, with PREVIOUS EXPERIENCE, with 20,000 long commands telling it exactly what to do.

Real, usable AI would create it from a simple 'make c compilers c99 faster than GCC'.

AI usage should be banned in general. It takes jobs faster than creating new ones ..

arcanemachiner

That's actually pretty funny. They're patting it on the back for using, in all likelihood, some significant portions of code that they actually wrote, which was stolen from them without attribution so that it could be used as part of a very expensive parlour trick.

whynotminot

Did you do diffs to confirm the code was stolen, or are you just speculating?

embedding-shape

> AI usage should be banned in general. It takes jobs faster than creating new ones ..

I don't have a strong opinion about that in either direction, but I'm curious: do you feel the same about everything, or is it just about this specific technology? For example, should the nail gun have been banned if it were invented today, since one person with a nail gun can probably replace 3-4 people with ordinary "manual" hammers?

Do you feel the same about programmers who automate others out of work without the use of AI?

wiseowise

> It takes jobs faster than creating new ones ..

You think compiler engineer from Google gives a single shit about this?

They’ll automate millions out of career existence for their amusement while cashing out stock money and retiring early comfortably.

benterix

> It takes jobs faster than creating new ones ..

I have no problem with tech making some jobs obsolete; that's normal. The problem is that the work being done by the current generation of LLMs is, at least for now, mostly of inferior quality.

The tools themselves are quite useful as helpers in several domains if used wisely though.

7thpower

Businesses do not exist to create jobs; jobs are a byproduct.

jaccola

Even that is underselling it; jobs are a necessary evil that should be minimised. If we can have more stuff with fewer people needing to spend their lives providing it, why would we NOT want that?

unglaublich

Jobs are a means, not a goal.

sc68cal

Jobs are the only way that you survive in this society (food, shelter). Look how we treat unhoused people without jobs. AI is taking jobs away and that is putting people's survival at risk.

MaskRay

I want to verify the claim that it builds the Linux kernel. It quickly runs into errors, but yeah, still pretty cool!

    make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc -j30 defconfig all

```
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:44:184: error: expected ';' after expression before 'pto_tmp__'
  do { u32 pto_val__ = ((u32)(((unsigned long) ~0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (~0x80000000); (void)pto_tmp__; } asm ("and" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
  ^~~~~~~~~
  fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:49:183: error: expected ';' after expression before 'pto_tmp__'
  do { u32 pto_val__ = ((u32)(((unsigned long) 0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (0x80000000); (void)pto_tmp__; } asm ("or" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
  ^~~~~~~~~
  fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:61:212: error: expected ';' after expression before 'pao_tmp__'
```

silver_sun

They said it builds Linux 6.9; maybe you are trying to compile a newer version there?

MaskRay

git switch v6.9

The riscv build succeeded. For the x86-64 build I ran into

    % make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 HOSTCC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 LDFLAGS=-fuse-ld=bfd LD=ld.bfd -j30 vmlinux -k
    make[1]: Entering directory '/tmp/linux/x86'
    ...
      CC      arch/x86/platform/intel/iosf_mbi.o
    ccc: error: lgdtl requires memory operand
      AR      arch/x86/platform/intel-mid/built-in.a
    make[6]: *** [/home/ray/Dev/linux/scripts/Makefile.build:362: arch/x86/realmode/rm/wakeup_asm.o] Error 1
    ld.bfd: arch/x86/entry/vdso/vdso32/sigreturn.o: warning: relocation in read-only section `.eh_frame'
    ld.bfd: error in arch/x86/entry/vdso/vdso32/sigreturn.o(.eh_frame); no .eh_frame_hdr table will be created
    ld.bfd: warning: creating DT_TEXTREL in a shared object
    ccc: error: unsupported pushw operand
There are many other errors.

tinyconfig and allnoconfig have fewer errors.

    RELOCS  arch/x86/realmode/rm/realmode.relocs
    Invalid absolute R_386_32 relocation: real_mode_seg

Still very impressive.

NitpickLawyer

This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:

> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis

> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.

And the very open points about limitations (and hacks, as cc loves hacks):

> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

> It does not have its own assembler and linker;

> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

Ending with a very down to earth take:

> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case, as the author says: "The resulting compiler has nearly reached the limits of Opus's abilities." Yeah, that's fair, but still highly impressive IMO.

geraneum

> This was a clean-room implementation

This is really pushing it, considering it's trained on... the internet, with all available C compilers. The work is already impressive enough; no need for such misleading statements.

TacticalCoder

I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.

It's anything but a clean-room design. "Clean-room design" is a very well-defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."

https://en.wikipedia.org/wiki/Clean-room_design

The "without infringing any of the copyrights" contains "any".

We know for a fact that models are extremely good at storing information, with the highest compression rates ever achieved. That the model typically decompresses that information in a lossy way doesn't mean it didn't use that information in the first place.

Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there is information being stored.

It's not a clean-room design, plain and simple.

raincole

It's not a clean-room implementation, but not because it's trained on the internet.

It's not a clean-room implementation because of this:

> The fix was to use GCC as an online known-good compiler oracle to compare against

Calavar

The classical definition of a clean-room implementation is one made by looking at the output of a prior implementation but not at its source.

I agree that having a reference compiler available is a huge caveat, though. Even if we completely put training-data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man-hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.
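
The oracle setup described here amounts to differential testing. A minimal hypothetical sketch, where `ref_cc` and `new_cc` are assumed to be gcc-style command prefixes accepting `-o OUT IN`:

```python
import pathlib
import subprocess
import tempfile

def differential_test(source, ref_cc, new_cc):
    """Compile `source` with a reference compiler and a candidate compiler,
    run both binaries, and report whether their observable behavior agrees."""
    with tempfile.TemporaryDirectory() as d:
        src = pathlib.Path(d) / "t.c"
        src.write_text(source)
        outputs = []
        for i, cc in enumerate((ref_cc, new_cc)):
            exe = pathlib.Path(d) / f"t{i}"
            # A compile failure is itself a bug report for the candidate.
            subprocess.run([*cc, "-o", str(exe), str(src)], check=True)
            run = subprocess.run([str(exe)], capture_output=True)
            outputs.append((run.returncode, run.stdout))
        return outputs[0] == outputs[1]

# differential_test(program, ("gcc", "-O2"), ("ccc", "-O2"))  # hypothetical usage
```

Fed with randomly generated or fuzzed C programs, a loop like this gives an agent a fully closed verification signal without any human in the loop.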

array_key_first

If you read the entire GCC source code and then create a compatible compiler, it's not clean-room. Which is basically what Opus did, since, I'm assuming, its training set contained the entire source of GCC. So even if they weren't actively referencing GCC, I think that counts.

cryptonector

Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.

GorbachevyChase

https://arxiv.org/abs/2505.03335

Check out the paper above on Absolute Zero. Language models don't just repeat code they've seen; they can learn to code given the right training environment.

iberator

This. The last sane person on HN.

teaearlgraycold

With just a few thousand dollars of API credits you too can inefficiently download a lossy copy of a C compiler!

modeless

There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago. There have been continuous improvements for many years now, and there is no reason to believe progress stops here. If you project out just one year, even assuming progress stops after that, the implications are staggering.

zamadatix

The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.

LinXitoW

The main issue with improvements in the last year is that a lot of it is based not on the models strictly becoming better, but on tooling being better, and simply using a fuckton more tokens for the same task.

Remember that all these companies can only exist because of massive (over)investments in the hope of insane returns and AGI promises. While all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. They will generate decent returns, and the tools are useful.

modeless

I disagree. A year ago the models would not come close to doing this, no matter what tools you gave them or how many tokens you generated. Even three months ago. Effectively using tools to complete long tasks required huge improvements in the models themselves. These improvements were driven not by pretraining like before, but by RL with verifiable rewards. This can continue to scale with training compute for the foreseeable future, eliminating the "data wall" we were supposed to be running into.

nozzlegear

Every S-curve looks like an exponential until you hit the bend.

NitpickLawyer

We've been hearing this for three years now. And 2025 especially was full of "they've hit a wall," "no more data," "running out of data," "plateau this," "saturated that." And yet, here we are. Models keep getting better, at broader tasks, and more useful by the month.

raincole

This quote would be more impactful if people haven't been repeating it since gpt-4 time.

esafak

What if it plateaus smarter than us? You wouldn't be able to discern where it stopped. I'm not convinced it won't be able to create its own training data to keep improving. I see no ceiling on the horizon, other than energy.

famouswaffles

Cool, I guess. Kind of a meaningless statement, yeah? Let's hit the bend, then we'll talk. Until then, repeating "It's an S-curve, guys, and what's more, we're near the bend! Trust me" ad infinitum is pointless. It's not some wise revelation lol.

chasd00

I have to admit, even if model and tooling progress stopped dead today, the world of software development has forever changed and will never go back.

uywykjdskn

Yeah, the software engineering profession is over, even if all improvements stop now.

gmueckl

The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

Prove this statement wrong.

libraryofbabel

Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.

nicoburns

"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.

gmueckl

The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests as an explicit filter on those outputs and limiting the acceptable solution space, which excluded unwanted interpolations of the training set that also result from the lossy input compression.

The fact that the implementation language for the compiler is rust doesn't factor into this. ML based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM and the keyword rust in the input activates one specific to that programming language.

NitpickLawyer

> Prove this statement wrong.

If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

shakna

Well, "reimplement the c4 compiler - C in four functions" is absolutely something older models can do, because most are trained on that quite small program - it's 20kb.

But reimplementing that isn't impressive, because it's not a clean-room implementation if you trained on that data to make the model that regurgitates the effort.

gmueckl

This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.

geraneum

Perhaps 4.5 could also do it? We don't really know until we try. I don't trust the marketing material much. The fact that previous (smaller) versions could or couldn't do it does not really disprove that claim.

hn_acc1

Are you really arguing that "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?

Marha01

Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

jesse__

This sounds very wrong to me.

Take the C4 training dataset for example. The uncompressed, uncleaned size of the dataset is ~6 TB, and it contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1 TB.

I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

kgeist

A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

gmueckl

This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.

0xCMP

I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.

hn_acc1

The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.

Now, if it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc. - that would be something.

brutalc

No one needs to prove you wrong. That’s just personal insecurity trying to justify one's own worth.

linuxtorvals

[flagged]

panzi

> clean-room implementation

Except it's trained on all the source out there, so I assume on GCC and Clang. I wonder how similar the code is to either.

pertymcpert

I'm familiar with both compilers. There's more similarity to LLVM, it even borrows some naming such as mem2reg (which doesn't really exist anymore) and GetElementPtr. But that's pretty much where things end. The rest of it is just common sense.

shubhamjain

Yeah, I am amazed how people are brushing this off simply because GCC exists. This was a far more challenging task than the browser thing, because of how few open-source compilers there are out there. Add to that no internet access and no dependencies.

At this point, it’s hard to deny that AI has become capable of completing extremely difficult tasks, provided it has enough time and tokens.

bjackman

I don't think this is more challenging than the browser thing. The scope is much smaller. The fact that this is "only" 100k lines is evidence for this. But, it's still very impressive.

I think this is Anthropic seeing the Cursor guy's bullshit and saying "but, we need to show people that the AI _can actually_ do very impressive shit as long as you pick a more sensible goal"

kelnos

Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.

The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.

Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.

steveklabnik

> Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.

Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.

https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.

simonw

Are you a frequent user of coding agents?

I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.

I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.

bdangubic

I am all-daily user (multiple claude max accounts). this fits my mental model mostly but not model I had before but developed with daily use. my job revolves around two core things:

1. data analysis / visualization / …

2. “is this possible? can this even be done?”

for #1 - I don’t do much anymore, for #2 I mostly do it still all “by hand” not for the lack of serious trying. so “it can do #1 1000x better than me cause it is generally solved problem(s) it is trained on while it can’t effectively do #2 otherwise” fits perfectly

dyauspitr

> Claude did not have internet access at any point during its development

Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.

simonw

It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".

chamomeal

What's making these models so much better on every iteration? Is it new data? Different training methods?

Kinda waiting for them to plateau so I can stop feeling so existential ¯\_(ツ)_/¯

esafak

More compute (bigger models, and prediction-time scaling), algorithmic advances, and ever more data (including synthetic).

Remember that all white collar workers are in your position.

andrewshawcare

It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.

Hard to find fully specified problems like this in the wild.

I think this is more a testament to small, well-written tests than it is agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.

I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

> Write extremely high-quality tests

> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

> For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.

tantalor

Why didn't Claude realize on its own that it needed a continuous integration pipeline?

Far too much human intervention here.

sublimefire

> Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

My thinking as well. IMO it is because you need to wait longer for results; you basically want to shorten the loops to improve the system. It hints at a real problem: most of what we see is the challenge of seeding a good context so it can successfully do something over many iterations.

krzat

You know what else is well specified? LLM improving on itself.

widdershins

I wouldn't describe intelligence as well specified. We can't even agree on what it is.

GalaxyNova

> Hard to find fully specified problems like this in the wild.

This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.

anematode

Impressive, my sarcasm/bait detector almost failed me.

hmry

If I, a human, read the source code of $THING and then later implement my own version, that's not a "clean-room" re-implementation. The whole point of "clean-room" is that no single person has access to both the original code and the new code. (That way, you can legally prove that no copyright infringement took place.)

But when an AI does it, now it counts? Opus is trained on the source code of Clang, GCC, TCC, etc. So this is not "clean-room".

zamalek

Agree, but to an even further degree.

At one point there were issues with LLMs regurgitating licensed code verbatim. I have no doubt that Claude could parrot a large portion of GCC given correct prompting.

Being able to memorize the various C compiler implementations, alongside the sum of human knowledge, is an incredible feat. However, this is in a distinctly different domain to what a human does when writing a clean-room compiler implementation in the absence of near perfect recall of all C compiler implementations. The way that Claude solved this is probably something a human can't do, the way a human would solve this is definitely something Claude can't do.

astrange

Copyright doesn't protect ideas, it protects writing. Avoiding reading LLVM or GCC is to protect you from other kinds of IP issues, but it's not a copyright issue. The same people contribute to both projects despite their different licenses.

hmry

They don't call Clang a "clean-room implementation". Unlike Anthropic, who are calling their project exactly that

A clean-room implementation is when you implement a replacement by only looking at the behavior and documentation (possibly written by another person on your team who is not allowed to write code, only documentation).

bmandale

That's not the only way to protect yourself from accusations of copyright infringement. I remember reading that the GNU utils were designed to be as performant as possible in order to force themselves to structure the code differently from the unix originals.

Crestwave

Yes, but Anthropic is specifically claiming their implementation is clean-room, while GNU never made that claim AFAIK.

rishabhaiover

[flagged]

hmry

Just tired of AI companies having more rights than natural people when it comes to copyright infringement. Let us have some of the fun too!

rishabhaiover

I apologize for making that assumption.

whinvik

It's weird to see the expectation that the result should be perfect.

All said and done, that its even possible is remarkable. Maybe these all go into training the next Opus or Sonnet and we start getting models that can create efficient compilers from scratch. That would be something!

regularfry

This is firmly where I am. "The wonder is not how well the dog dances, it is that it dances at all."

the8472

"It's like if a squirrel started playing chess and instead of "holy shit this squirrel can play chess!" most people responded with "But his elo rating sucks""

LinXitoW

It's more like "We were promised, over and over again, that the squirrel would be autonomous grandmaster level. We spent insane amounts of money, labour, and opportunity costs of human progress on this. Now, here's a very expensive squirrel, that still needs guidance from a human grandmaster, and most of its moves are just replications of existing games. Oh, it also can't move the pieces by itself, so it depends on Piece Mover library."

knollimar

I'm not trying to get coached in chess by the squirrel for 200 per month though.

echelon

"The squirrel can do my job and more? It can do five years of my work in a month? For only $20k? Pssh, but I bet it copied someone's homework."

Developer salaries are about to tank.

This is the end of the line. People are just in denial.

Soon companies will hire the squirrel instead of you. And the squirrel will transform into enormous infrastructure we can't afford ourselves.

"One mega squirrel to implement your own operating system overnight. Just $100k."

It's going to be out of the reach of humans / ICs soon. Purely industrial. And all innovation will accrue to the capital holders.

Open weights models are our only hope of keeping a foot in the door.

amlib

But the Squirrel is only playing chess because someone stuffed the pieces with food and it has learned that the only way to release it is by moving them around in some weird patterns.

emp17344

But people have been telling us for years that the squirrel was going to improve at chess at an exponential rate and take over the world through sheer chess-mastery.

sumitkumar

I was also startled when I learned about the human ancestor who was the first to see a mirror.

The brilliance of AI is that it copies(mirrors) imperfectly and you can only look at part_of_the_copy(inference) at a time.

viccis

>It's weird to see the expectation that the result should be perfect.

Given that they spent $20k on it and it's basically just advertising targeted at convincing greedy execs to fire as many of us as they can, yeah it should be fucking perfect.

minimaxir

A symptom of the increasing backlash against generative AI (both in creative industries and in coding) is that any flaw in the resulting product becomes grounds to call it AI slop, even if it's very explicitly upfront that it's an experimental demo/proof of concept and not the NEXT BIG THING being hyped by influencers. That nuance is dead even outside of social media.

stonogo

AI companies set that expectation when their CEOs ran around telling anyone who would listen that their product is a generational paradigm shift that will completely restructure both labor markets and human cognition itself. There is no nuance in their own PR, so why should they benefit from any when their product can't meet those expectations?

minimaxir

Because it leads to poor and nonconstructive discourse that doesn't educate anyone about the implications of the tech, which is expected on social media but has annoyingly leaked to Hacker News.

There's been more than enough drive-by comments from new accounts/green names even in this HN submission alone.

itay-maman

My first reaction: wow, incredible.

My second reaction: still incredible, but noting that a C compiler is one of the most rigorously specified pieces of software out there. The spec is precise, the expected behavior is well-defined, and test cases are unambiguous.

I'm curious how well this translates to the kind of work most of us do day-to-day where requirements are fuzzy, many edge cases are discovered on the go, and what we want to build is a moving target.

ndesaulniers

> C compiler is one of the most rigorously specified pieces of software out there

/me Laughs in "unspecified behavior."

ori_b

There's undefined behavior, which is quite well specified. What do you mean by unspecified behavior? Do you have an example?

irishcoffee

Undefined is absolutely clear in the spec.

Unspecified is whatever you want it to mean. I am also laughing, having never heard "unspecified" before.

LiamPowell

Unspecified behaviour is defined in the glossary at the start of the spec and the term "unspecified" appears over a hundred times...

astrange

The C spec is certainly not formal or precise.

https://www.ralfj.de/blog/2020/12/14/provenance.html

Another example is that it's unclear from the standard if you can write malloc() in C.

butterNaN

Sure but the point OP is making is that it is still more spec'd than most real world problems

astrange

You're welcome to try writing a C compiler and standard library doing no research other than reading the spec.

cryptonector

> My second reaction:

This is the key: the more you constrain the LLM, the better it will perform. At least that's my experience with Claude. When working with existing code, the better the code to begin with, the better Claude performs, while if the code has issues then Claude can end up spinning its wheels.

softwaredoug

Yes I think any codegen with a lot of tests and verification is more about “fitting” to the tests. Like fitting an ML model. It’s model training, not coding.

But a lot of programming we discover correctness as we go, one reason humans don’t completely exit the loop. We need to see and build tests as we go, giving them particular care and attention to ensure they test what matters.

uywykjdskn

The agent can obviously do that

boring-human

People focused on the flaws are missing the picture. Opus wasn't even trained to be "a member of a team of engineers," it was adapted to the task by one person with a shell script loop. Specific training for this mode of operation is inevitable. And model "IQ" is increasing with every generation. If human IQ is increasing at all, it's only because the engineer pool is shrinking more at one end than the other.

This is a five-alarm fire if you're a SWE and not retiring in the next couple years.

smithcoin

> This is a five-alarm fire if you're a SWE and not retiring in the next couple years.

I’m sorry, but this is such a hype beast take. In my opinion this is equivalent to telling people not to learn to drive five years ago because of self driving from Tesla. How is that going?

Every single line of code produced is a liability. This idea that you’re going to have “gas town” like agents running and building apps without humans in the loop at any point to generate liability free revenue is insane to me.

Are humans infallible? Obviously not. But if you are telling me that ‘magic probability machines’ are creating safe, secure, and compliant software that has no need for engineers to participate in the output- first I’d like to see a citation and second I have a bridge to sell you.

boring-human

> In my opinion this is equivalent to telling people not to learn to drive five years ago because of self driving

Self-driving has different economics. We're reading tea leaves, true, but it's also true that software has zero marginal cost and that $20K pays for an engineer-month in SF.

> Every single line of code produced is a liability.

Do you have a hard spec and rock-solid test cases? If you do, you have two options to a working prototype: 2-6 engineer-years, or $20K. The second option will greatly increase in quality and likely decrease in price over the next few years.

What if the spec and the test cases are the new software? Assembly programmers used to make an argument against compiled code that's somewhat parallel to yours: every instruction is a (performance) liability.

> without humans in the loop

There will be humans, just fewer and fewer. The spec and test cases are AI-eligible too.

> safe, secure, and compliant software

I'm not sure humans' advantage here is safe, if it even exists still.

polyglotfacto

So let’s say you fund a single engineer for an open‑source project with $20k. The outcome will be a prototype with some interesting ideas. And yes, with a few hundred bucks' worth of AI assistance that single engineer might get much further than without (but not using any of the techniques presented in this blog). People can coalesce around the project as contributors. A seed was planted and watered a bit.

In this case, the $20k has been burned and produced zero value. Just look at the repo issues: looks like someone trying to get attention by spamming the issue tracker and opening hundreds of PRs. As an open source project, it’s a dead end.

So it doesn't matter that this will "likely decrease in price over the next few years"? The value is zero, so even if superintelligence can produce this in an instant at zero cost in six months, the outcome is still worth zero.

You’re assuming a kind of inverse relationship between production cost and value.

In terms of quality, to anyone using those coding agents, it should be clear by now that letting them run autonomously and in parallel is a bad idea. That’s not going to change unless you believe LLMs will turn into something entirely different over time.

Note that what works with humans—social interaction creating some emergent properties like innovation—doesn’t translate to LLM agents for a simple reason: they don’t have agency, shared goals, or accountability, so the social dynamics that generate innovation can’t form.

201984

Philpax

The issue is that it's missing the include paths. The compiler itself is fine.

krupan

Thank you. That was a long article that opened with a claim backed by no proof, dismissing it as not the most interesting thing under discussion, when in fact it's the baseline of the whole discussion.

Retr0id

Looks like these users are just missing glibc-devel or equivalent?

delusional

Naa, it looks like it's failing to include the standard system include directories. If you take them from gcc and pass them as -I, it'll compile.

Retr0id

Can confirm (on aarch64 host)

    $ ./target/release/ccc-arm -I /usr/include/ -I /usr/local/include/ -I /usr/lib/gcc/aarch64-redhat-linux/15/include/ -o hello hello.c 

    $ ./hello
    Hello from CCC!

zamadatix

Hmm, I didn't have to do that. https://i.imgur.com/OAEtgvr.png

But yeah, either way it just needs to know where to find the stdlib.

undefined

[deleted]

worldsavior

AI is the future.

suddenlybananas

This is truly incredible.

ZeWaka

lol, lmao

btown

> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

This is incredible!

But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.

_qua

Interesting how the concept of a clean room implementation changes when the agent has been trained on the entire internet already

falcor84

To the best of my knowledge, there's no Rust-based compiler that comes anywhere close to 99% on the GCC torture test suite, or able to compile Doom. So even if it saw the internals of GCC and a lot of other compilers, the ability to recreate this step-by-step in Rust is extremely impressive to me.

jsheard

The impressiveness of converting C to Rust by any means is kind of contingent on how much unnecessary unsafe there is in the end result though.

D-Machine

I think the careful response to this is:

(1) There are compilers written in C in the training set

(2) LLMs demonstrably can near-perfectly memorize training-set inputs (see other comments here)

(3) LLMs are very good at translation tasks (natural language or code, e.g.: C to Rust)

I don't think this necessarily completely deflates the impressiveness of this accomplishment, but it does qualify it to some degree.

undefined

[deleted]

jillesvangurp

You can use ai coding tools to create test suites, specifications, documentation, etc. And you can use them to scrutinize those, review them, criticize them, etc. Not having a test suite just means you start with creating one. Then the next question of course becomes "for what?".

This indeed puts human prompters in a position where their job is to set the goals, outline the vision, ask for the right things, ask critical questions, and to correct where needed.

Human contractors are a good analogy. Because they tend to come in without too much context into a new job. Their context is mainly what they've done before. But it takes time to get up to speed with whatever the customer is asking for and their context. People are slightly better at getting information out of other people. AI coding tools don't ask enough critical questions, yet. But that sounds fixable. The breakthroughs here are as much in the feedback loops and plumbing around the models as they are in the models themselves. It's all about getting the right information in and out of the context.

socalgal2

You would spend years verifying that the tests actually work, whereas the tests for this accomplishment were already verified by humans over decades.

falcor84

Agreed, but the next step is having an AI agent actually run the business and get the business context it needs as a human would. Obviously we're not quite there, but with the rapid progress on benchmarks like Vending-Bench [0], and especially with this team's approach, it doesn't seem far-fetched anymore.

As a particular near-term step, I imagine that it won't be long before we see a SaaS company using an AI product manager, which can spawn agents to directly interview users as they utilize the app, independently propose and (after getting approval) run small product experiments, and come up with validated recommendations for changing the product roadmap. I still remember Tay, and wouldn't give something like that the keys to the kingdom any time soon, but as long as there's a human decision maker at the end, I think that the tech is already here.

[0] https://andonlabs.com/evals/vending-bench-2

OsrsNeedsf2P

This is like a working version of the Cursor blog. The evidence - it compiling the Linux kernel - is much more impressive than a browser that didn't even compile (until manually intervened)

ben_w

It certainly slightly spoils what I was planning to be a fun little April Fool's joke (a daft but complete programming language). Last year's AI wasn't good enough to get me past the compiler-compiler even for the most fundamental basics, now it's all this.

I'll still work on it, of course. It just won't be so surprising.

underdeserver

> when agents started to compile the Linux kernel, they got stuck. [...] Every agent would hit the same bug, fix that bug, and then overwrite each other's changes.

> [...] The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel

This is a remarkably creative solution! Nicely done.
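
At its core, the narrowing step is a bisection over the set of source files, with GCC serving as the known-good oracle. A simplified sketch (the `kernel_works` callback is hypothetical, standing in for the real build-and-boot harness, and a single bad file is assumed):

```python
def find_bad_file(files, kernel_works):
    """Bisect to a file that the candidate compiler miscompiles.

    kernel_works(subset) is a hypothetical oracle: it builds the kernel
    compiling `subset` with the candidate compiler and everything else
    with GCC, then reports whether the resulting kernel boots.
    Assumes exactly one bad file.
    """
    suspects = list(files)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if kernel_works(half):
            # This half builds a working kernel, so the bug is elsewhere.
            suspects = suspects[len(suspects) // 2:]
        else:
            suspects = half
    return suspects[0]
```

With several independent bugs, agents can partition the file set and run this in parallel, which matches the parallel-agent setup the post describes.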

akrauss

I would like to see the following published:

- All prompts used

- The structure of the agent team (which agents / which roles)

- Any other material that went into the process

This would be a good source for learning, even though I'm not ready to spend $20k just to replicate the experiment.

password4321

Yes unfortunately these days most are satisfied with just the sausage and no details about how it was made.

a456463

Just claims with nothing to back them. Steal people's years of work, then turn around and say "I made it so much better". Support this compiler for 20 years, then.
