We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them

Daily Digest email

Get the top HN stories in your inbox every day.

7777332215

I know they said they didn't obfuscate anything, but if you hide imports/symbols and obfuscate strings, which is the bare minimum for any competent attacker, the success rate will immediately drop to zero.

This is detecting the pattern of an anomaly in language associated with malicious activity, which is not impressive for an LLM.

stared

One of the authors here.

The tasks here are entry level. So we are impressed that some AI models are able to detect some patterns, while looking just at binary code. We didn't take it for granted.

For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5) https://quesma.com/benchmarks/binaryaudit/#models-tooling

We consider it a starting point for AI agents being able to work with binaries. Other people discovered the same - vide https://x.com/ccccjjjjeeee/status/2021160492039811300 and https://news.ycombinator.com/item?id=46846101.

There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.

botusaurus

have you tried stuffing a whole set of tutorials on how to use ghidra in the context, especially for the 1 mil token context like gemini?

stared

No. To give it a fair test, we didn't tinker with model-specific context-engineering. Adding skills, examples, etc is very likely to improve performance. So is any interactive feedback.

Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...

akiselev

When I was developing my ghidra-cli tool for LLMs to use, I was using crackmes as tests and it had no problem getting through obfuscation as long as it was prompted about it. In practice when reverse engineering real software it can sometimes spin in circles for a while until it finally notices that it's dealing with obfuscated code, but as long as you update your CLAUDE.md/whatever with its findings, it generally moves smoothly from then on.

eli

Is it also possible that crackme solutions were already in the training data?

akiselev

I used the latest submissions from sites like crackmes.ones which were days or weeks old to guard against that.

achille

in the article they explicitly said they stripped symbols. If you look at the actual backdoors many are already minimal and quite obfuscated,

see:

- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dns...

- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dro...

comex

The first one was probably found due to the reference to the string /bin/sh, which is a pretty obvious tell in this context.

The second one is more impressive. I'd like to see the reasoning trace.

comex

Reply to self: I managed to get their code running, since they seemingly haven’t published their trajectories. At least in my run (using Opus 4.6), it turns out that Claude is able to find the backdoored function because it’s literally the first function Claude checks.

Before even looking at the binary, Claude announces it will“look at the authentication functions, especially password checking logic which is a common backdoor target.” It finds the password checking function (svr_auth_password) using strings. And that is the function they decided to backdoor.

I’m experienced with reverse engineering but not experienced with these kinds of CTF-type challenges, so it didn’t occur to me that this function would be a stereotypical backdoor target…

They have a different task (dropbear-brokenauth2-detect) which puts a backdoor in a different function, and zero agents were able to find that one.

On the original task (dropbear-brokenauth-detect), in their runs, Claude reports the right function as backdoored 2 out of 3 times, but it also reports some function as backdoored 2 out of 2 times in the control experiment (dropbear-brokenauth-detect-negative), so it might just be getting lucky. The benchmark seemingly only checks whether the agent identifies which function is backdoored, not the specific nature of the backdoor. Since Claude guessed the right function in advance, it could hallucinate any backdoor and still pass.

But I don’t want to underestimate Claude. My run is not finished yet. Once it’s finished, I’ll check whether it identified the right function and, if so, whether it actually found the backdoor.

hereme888

I've used Opus 4.5 and 4.6 to RE obfuscated malicious code with my own Ghidra plugin for Claude Code and it fully reverse engineered it. Granted, I'm talking about software cracks, not state-level backdoors.

halflife

Isn’t LLM supposed to be better at analyzing obfuscated than heuristics? Because of its ability to pattern match it can deduce what obfuscated code does?

bethekidyouwant

How much binary code is in the training set? (None?)

Avamander

I have seen LLMs be surprisingly effective at figuring out such oddities. After all it has ingested knowledge of a myriad of data formats, encryption schemes and obfuscation methods.

If anything, complex logic is what'll defeat an LLM. But a good model will also highlight such logic being intractable.

Retr0id

Stripping symbols is fairly normal, but hiding imports ought to be suspicious in its own right.

akiselev

Shameless plug: https://github.com/akiselev/ghidra-cli

I’ve been using Ghidra to reverse engineer Altium’s file format (at least the Delphi parts) and it’s insane how effective it is. Models are not quite good enough to write an entire parser from scratch but before LLMs I would have never even attempted the reverse engineering.

I definitely would not depend on it for security audits but the latest models are more than good enough to reverse engineer file formats.

bitexploder

I can tell you how I am seeing agents be used with reasonable results. I will keep this high level. I don't rely on the agents solely. You build agents that augment your capabilities.

They can make diagrams for you, give you an attack surface mapping, and dig for you while you do more manual work. As you work on an audit you will often find things of interest in a binary or code base that you want to investigate further. LLMs can often blast through a code base or binary finding similar things.

I like to think of it like a swiss army knife of agentic tools to deploy as you work through a problem. They won't balk at some insanely boring task and that can give you a real speed up. The trick is if you fall into the trap of trying to get too much out of an LLM you end up pouring time into your LLM setup and not getting good results, I think that is the LLM productivity trap. But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks it can absolutely speed you up some.

Also, when you have scale problems, just throw an LLM at it. Even low quality results are a good sniff test. Some of the time I just throw an LLM at a code review thing for a codebase I came across and let it work. I also love asking it to make me architecture diagrams.

johnmaguire

> But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks it can absolutely speed you up some.

Are people sharing these somewhere?

embedding-shape

I think overall you're better off creating these yourself. The more you add to the overall context, the more chance of the model to screw up somewhere, so you want to give it as little as possible, yet still include everything that is important at that moment.

Using the agent and seeing where it get stuck, then creating a workflow/skill/whatever for how to overcome that issue, will also help you understand what scenarios the agents and models are currently having a hard time with.

You'll also end up with fewer workflows/skills that you understand, so you can help steer things and rewrite things when inevitably you're gonna have to change something.

bitexploder

I put the terms in quotes because it can be as simple as a set of prompts you develop for various contexts. It really doesn't have to be too heavy of an idea.

jakozaur

Oh, nice find... We end up using PyGhidra, but the models waste some cycles because of bad ergonomics. Perhaps your cli would be easier.

Still, Ghidra's most painful limitation was extremely slow time with Go Lang. We had to exclude that example from the benchmark.

Aeolun

> Models are not quite good enough to write an entire parser from scratch

In my experience models are really good at this? Not one shot, but writing decoders/encoders is entirely possible.

akiselev

They can oneshot relatively simple parsers/encoders/decoders with a proper spec, but it’s a completely different ballgame when you’re trying to parse a very domain knowledge heavy file format (like the format electronics CAD) with decades of backwards compatible cruft spread among hundreds of megabytes of decompiled Delphi and C# dlls (millions of lines).

The low level parts (OLE container, streams and blocks) are easy but the domain specific stuff like deserializing to typed structs is much harder.

selridge

This is really cool! Thanks for sharing. It's a lot more sophisticated than what I did w/ Ghidra + LLMs.

lima

How does this approach compare to the various Ghidra MCP servers?

akiselev

There’s not much difference, really. I stupidly didn’t bother looking at prior art when I started reverse engineering and the ghidra-cli was born (along with several others like ilspy-cli and debugger-cli)

That said, it should be easier to use as a human to follow along with the agent and Claude Code seems to have an easier time with discovery rather than stuffing all the tool definitions into the context.

bitexploder

That is pretty funny. But you probably learned something in implementing it! This is such a new field, I think small projects like this are really worthwhile :)

selridge

I also did this approach (scripts + home-brew cli)...because I didn't know Ghidra MCP servers existed when I got started.

So I don't have a clear idea of what the comparison would be but it worked pretty well for me!

stared

Thanks for sharing! It seems to be an active space, vide a recent MCP server (https://news.ycombinator.com/item?id=46882389). I you haven't tried, recommend a lot posting it as Show HN.

I tried a few approaches - https://github.com/jtang613/GhidrAssistMCP (was the harderst to set) Ghidra analyzeHeadless (GPT-5.2-Codex worked with it well!) and PyGhidra (my go-to). Did you try to see which works the best?

I mean, very likely (especially with an explicit README for AI, https://github.com/akiselev/ghidra-cli/blob/master/.claude/s...) your approach might be more convenient to use with AI agents.

huflungdung

[dead]

mbh159

The methodology debate in this thread is the most important part.

The commenter who says "add obfuscation and success drops to zero" is right but that's also the wrong approach imo. The experiment isn't claiming AI can defeat a competent attacker. It's asking whether AI agents can replicate what a skilled (RE) specialist does on an unobfuscated binary. That's a legitimate, deployable use case (internal audit, code review, legacy binary analysis) even if it doesn't cover adversarial-grade malware.

The more useful framing: what's the right threat model? If you're defending against script kiddies and automated tooling, AI-assisted RE might already be good enough. If you're defending against targeted attacks by people who know you're using AI detection, the bar is much higher and this test doesn't speak to it.

What would actually settle the "ready for production" question: run the same test with the weakest obfuscation that matters in real deployments (import hiding, string encoding), not adversarial-grade obfuscation. That's the boundary condition.

celeryd

Why does that matter? Being oblivious to obfuscated binaries is like failing the captcha test.

Let's say instead of reversing, the job was to pick apples. Let's say an AI can pick all the apples in an orchard in normal weather conditions, but add overcast skies and success drops to zero. Is this, in your opinion, still a skilled apple picking specialist?

sonofhans

What if it’s 10x as fast during clear conditions? Then it doesn’t matter.

No hate. My only point is that’s it’s easy for analogies to fail. I can’t tell the point of either of your analogies, where the OP made several clear and cogent points.

mbh159

I'm not a deep security expert but I'm assuming the skill of the agents will continue to get better, so not saying there AI's can do to this task as reliably as humans. There's likely utility for non-adversarial triage/internal audit with human review. However with better ai apple pickers during sunny conditions you need less human pickers during night conditions. I think measuring the progress of the said apple picking is what's interesting.

xboxnolifes

Maybe not, but also maybe you would no longer need skilled apple picking specialists.

AlexeyBelov

You're replying to an LLM

magicmicah85

GPT is impressive with a consistent 0% false positive rate across models, yet its ability to detect is as high as 18%. Meanwhile Claude Opus 4.6 is able to detect up to 46% of backdoors, but has a 22% false positive rate.

It would be interesting to have an experiment where these models are able to test exploiting but their alignment may not allow that to happen. Perhaps combining models together can lead to that kind of testing. The better models will identify, write up "how to verify" tests and the "misaligned" models will actually carry out the testing and report back to the better models.

stared

Rerun it for "high" and "xhigh" effort settings, and GPT-5.2-Codex still get 0% false positive, while getting at the level of other best models for localization of backdoors: https://quesma.com/benchmarks/binaryaudit/

sdenton4

It would be really cool if someone developed some standard language and methodology for measuring the success of binary classificaiton tasks...

Oh, wait, we have had that for a hundred years - somehow it's just entirely forgotten when generative models are involved.

selridge

>While end-to-end malware detection is not reliable yet, AI can make it easier for developers to perform initial security audits. A developer without reverse engineering experience can now get a first-pass analysis of a suspicious binary. [...] The whole field of working with binaries becomes accessible to a much wider range of software engineers. It opens opportunities not only in security, but also in performing low-level optimization, debugging and reverse engineering hardware, and porting code between architectures.

THIS is the takeaway. These tools are allowing *adjacency* to become a powerful guiding indicator. You don't need to be a reverser, you can just understand how your software works and drive the robot to be a fallible hypothesis generator in regions where you can validate only some of the findings.

folex

> The executables in our benchmark often have hundreds or thousands of functions — while the backdoors are tiny, often just a dozen lines buried deep within. Finding them requires strategic thinking: identifying critical paths like network parsers or user input handlers and ignoring the noise.

Perhaps it would make sense to provide LLMs with some strategy guides written in .md files.

godelski

Depends what your research question is, but it's very easy to spoil your experiment.

Let's say you tell it that there might be small backdoors. You've now primed the LLM to search that way (even using "may"). You passed information about the test to test taker!

So we have a new variable! Is the success only due to the hint? How robust is that prompt? Does subtle wording dramatically change output? Does "may", "does", "can", "might" work but "May", "cann", or anything else fail? Have you the promoter unintentionally conveyed something important about the test?

I'm sure you can prompt engineer your way you greater success but by doing so you also greatly expand the complexity of the experiment and consequently make your results far less robust.

Experimental design is incredibly difficult due to all the subtleties. It's a thing most people frequently fail at (including scientists) and even more frequently fool themselves into believing stronger claims than the experiment can yield.

And before anyone says "but humans", yeah, same complexity applies. It's actually why human experimentation is harder than a lot of other things. There's just far more noise in the system.

But could you get success? Certainly. I mean you could tell it exactly where the backdoors are. But that's not useful. So now you got to decide where that line is and certainly others won't agree.

Arech

That's what I thought of too. Given their task formulation (they basically said - "check these binaries with these tools at your disposal" - and that's it!) their results are already super impressive. With a proper guidance and professional oversight it's a tremendous force multiplier.

selridge

We are in this super weird space where the comparable tasks are one-shot, e.g. "make me a to-do app" or "check these binaries", but any real work is multi-turn and dynamically structured.

But when we're trying to share results, "a talented engineer sat with the thread and wrote tests/docs/harnesses to guide the model" is less impressive than "we asked it and it figured it out," even though the latter is how real work will happen.

It creates this perverse scenario (which is no one's fault!) where we talk about one-shot performance but one-shot performance is useful in exactly 0 interesting cases.

NitpickLawyer

Something I found useful is to "just figure it out" the first part (usually discovery, or library testing, new cli testing, repo understanding, etc.) and then distill it into "learnings" that I can place in agents.md or relevant skills. So you get the speed of "just prompt it" and the repeatability of having it already worked in this area. You also get more insight into what tasks work today, and at what effort level.

Sometimes it feels like it's not dissimilar to spending 4 hours to automate a 10 minute task that I thought I'll need forever but ended up just using it once in the past 5 months. But sometimes I unlock something that saves a huge amount of time, and can be reused in many steps of other projects.

selridge

That’s hard. Sometimes you will do that and find it prompts the model into “strategy talk” where it deploys the words and frame you use in your .md files but doesn’t actually do the strategy.

Even where it works, it is quite hard to specify human strategic thinking in a way that an AI will follow.

undefined

[deleted]

jakozaur

See direct benchmark link: https://quesma.com/benchmarks/binaryaudit/

Open-source GitHub: https://github.com/QuesmaOrg/BinaryAudit

EB66

The fact that Gemini returns the highest rate of fake positives aligns with my experience using the Gemini models. I use ChatGPT, Claude and Gemini regularly and Gemini is clearly the most sycophantic of the three. If I ask those three models to evaluate something or estimate odds of success, Gemini always comes back with the rosiest outlook.

I had been searching for a good benchmark that provided some empirical evidence of this sycophancy, but I hadn't found much. Measuring false positives when you ask the model to complete a detection related task may be a good way of doing that.

simianwords

I'm not an expert but about false positives: why not make the agent attempt to use the backdoor and verify that it is actually a backdoor? Maybe give it access to tools and so on.

jakozaur

So many models refuse to do that due to alignment and safety concerns. So cross-model comparison doesn't make sense. We do, however, require proof (such as providing a location in binary) that is hard to game. So the model not only has to say there is a backdoor, but also point out the location.

Your approach, however, makes a lot of sense if you are ready to have your own custom or fine-tuned model.

simianwords

Surprising that they still allow to catch the back doors but not use them.

A bad actor already has most of the work done.

garblegarble

Sounds like the pitch writes itself, "you'd better spend a lot of token money with us before the bad guys do it to you..."

shevy-java

So the best one found about 50%. I think that is not bad, probably better than most humans. But what about the remaining 50%? Why were some found and others not?

> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about > Even the best model in our benchmark got fooled by this task.

That is quite strange. Because it seems almost as if a human is required to make the AI tools understand this.

Tiberium

I highly doubt some of those results, GPT 5.2/+codex is incredible for cyber security and CTFs, and 5.3 Codex (not on API yet) even moreso. There is absolutely no way it's below Deepseek or Haiku. Seems like a harness issue, or they tested those models at none/low reasoning?

jakozaur

As I do eval and training data sets for living, in niche skills, you can find plenty of surprises.

The code is open-source; you can run it yourself using Harbor Framework:

git clone git@github.com:QuesmaOrg/BinaryAudit.git

export OPENROUTER_API_KEY=...

harbor run --path tasks --task-name lighttpd-* --agent terminus-2 --model openrouter/anthropic/claude-opus-4.6 --model openrouter/google/gemini-3-pro-preview --model openrouter/openai/gpt-5.2 --n-attempts 3

Please open PR if you find something interesting, though our domain experts spend fair amount of time looking at trajectories.

Tiberium

Just for fun, I ran dnsmasq-backdoor-detect-printf (which has a 0% pass rate in your leaderboard with GPT models) with --agent codex instead of terminus-2 with gpt-5.2-codex and it identified the backdoor successfully on the first try. I honestly think it's a harness issue, could you re-run the benchmarks with Codex for gpt-5.2-codex and gpt-5.2?

Tiberium

Are the existing trajectories from your runs published anywhere? Or is the only way is for me to run them again?

jakozaur

I can provide trajectories. Though probably we are not going to publish them this time. This would need some extra safeguards.

Email me. The address is in profile.

stared

I rerun it for GPT-5.2-Codex, for high and xhigh.

Finally, it matches my experience, and it is actually good (as good as the best models for localization, still impressive 0% false positive rate): https://quesma.com/benchmarks/binaryaudit/

Will rerun it on GPT-5.3-Codex shortly, as API is out (yet, the effort does not work correctly, and for "medium" it is very low).

stared

To be honest, it is also our surprise. I mean, I used GPT 5.2 Codex in Cursor for decompiling an old game and it worked (way better than Claude Code with Opus 4.5). We tested for Opus 4.6, but waiting for public API to test on GPT 5.3 Codex.

At the same time, various task can be different, and now all things that work the best end-to-end are the same as ones that are good for a typical, interactive workflow.

We used Terminus 2 agent, as it is the default used by Harbor (https://harborframework.com/), as we want to be unbiased. Very likely other frameworks will change the result.

hilbert42

What this tells me is that the era of code obfuscation through compilation is likely coming to an end. If anyone is able to reverse-engineer a program it'll have huge ramifications for the industry.

This won't be welcomed by software developers who benefit from obfuscation but consumers could benefit. For example, AI could alter a program to remove or add features to suit users' requirements.

Imagine being able to instruct AI to comb through Windows 11 and remove all telemetry and Copilot code and restore local accounts.

I'd be very pleased with an AI agent tnat would do that.

Daily Digest email

Get the top HN stories in your inbox every day.