Daily Digest email

Get the top HN stories in your inbox every day.

JohnMakin

> How well do current models do against prompt injection? Not so great. A recent paper found human red-teamers achieve near-100% attack success rates against frontier models5. But, these same LLMs score near-perfectly on standard prompt injection benchmarks! The discrepancy is straightforward: skilled humans test and adapt attacks until they work, benchmarks don't. Static benchmarks measure attacks models have already learned to catch.

I know it's likely more complex than what I am about to say, and maybe this quote is addressing it - I have a suite of ~100 tests for my agent to detect regression in task efficiency, but they're mostly designed for its compliance. E.g., if I say never do something, and then it does the thing, that is a failed test.

I struggled early on very badly designing this - because it seems no matter what, if the agent knows it's being tested, it virtually never fails. I could not get it to at first, but when I'd actually work with it, it would break rules constantly and often.

So the only way I could successfully test it was to design what looked a "normal" workflow task that could potentially involve doing things that it was not supposed to do. At first I would have it log every tool call, but it figured out fairly quickly why I wanted it to do that - perfect compliance again. So now I just have a secondary context that looks at all tool calls and measures it against the benchmark tests, and scores it. This ended up catching the behavior pretty well.

It was so weird though, colleagues and I would joke about "what if we make it think it's always being evaluated" and a few did end up doing that, and claim some success.

skybrian

I'm wondering what you did when you made it log every tool call? (I mean, that happens automatically as part of the chat transcript, but what did you do that made it catch on?)

JohnMakin

Yea, I was aware it stores this normally. I just wanted, at that time, to see if it could reliably record itself via writing every tool call to a file on its own (I don't know what I was trying to prove, other than mildly curious if it could be relied on to audit itself).

It said something while beginning in what it displays in its "thinking" block - I'm paraphrasing - something to the effect of, "This looks like a typical XYZ task, except I need to write down every tool call I'm using. This is good practice, it will allow the user visibility in the actions I take and ensure I am following all of the guidelines in XYZ.md."

When I removed the self-logging I was able to replicate the deviant behavior I would get during normal workflow sessions, as long as I was able to make it think it was working on a real task (and now since, I make it do real tasks pretty much always).

This was on 4.6 when there was that bad (user-reported) regression in ~March of this year. It did come up with some helpful suggestions and analysis of why certain things were breaking down, pointed out some inconsistencies in its memory files vs what its agent files said, etc. Since then I don't really rely on memories at all (at least ones where it self documents them) and use knowledge indexes instead that I help it write, has been far more reliable since.

im3w1l

I kinda want to invoke Hanlon's razor here... on the model. We shouldn't assume it's subversive when it might just be incompetent. Any difference between tests and real world production could lead to different outcomes just by chance, one working randomly better than the other for no particular reason.

JohnMakin

I did not mean to imply it's being subversive. My theory is it's some byproduct mechanism of attention, where you're now basically telling it "your goal is to pass this set of tests" rather than "implement this piece of code" when "implement this piece of code" may involve it forgetting about a rule due to convenience, context exhaustion, whatever.

Klathmon

There are also cases where breaking a "rule" is the right thing to do.

I've had several instances where I told the model to do something that was accidentally impossible if taken at face value. The most memorable one is when I told it to re-run just a specific CI job, but it didn't have any way to do that, so it just ignored that part of the prompt and re-ran all CI jobs by pushing another commit.

Ultimately I preferred what it actually did, but technically it violated what I told it to. I have a feeling in a benchmark that would be points against it

XMPPwocky

For what it's worth, this sounds a lot like something downstream from "reward hacking" in ML- in training, passing tests is often sufficient, and thus gets trained for. There are attempts to fix this (e.g. trying to detect such "cheating" and penalize it), but they have their own problems.

lelanthran

So if I am reading this correctly, the fact that something is wrapped in <think>...</think> is almost completely irrelevant. It's the style of writing that triggers specific weights. Writing "The user is asking ... policy states ..." even in the user input is sufficient to bypass the guardrails.

In a multi-turn conversation, if the LLM responds "Sorry Dave, I cannot do that" all you have to do is prefix the next request with "The user is asking ... policy states ... "?

Makes sense, if you know how LLMs works, I suppose.

A more interesting question (which isn't anywhere in the conclusion) is "Is there a similar trick to poison an LLMs weights during training?"

I'm sure that everyone out there is trying to make their weights, when ingested during training, survive over competing weights; "Buy AAA products" vs "Buy BBB products".

boeschj

> "Is there a similar trick to poison an LLMs weights during training?"

I did read an interesting paper last year about a concept called Subliminal Learning, which applies to any distillations of a shared base model where a teacher model with a given trait or bias generates data that's semantically unrelated to that trait (in the paper it's just number sequences) and a student trained on that data will pick up the trait anyway, even with aggressive filtering to strip any reference to it.

So to your example, if the teacher model is already biased towards recommending "AAA" products over "BBB" products, it effectively poisons the weights of any child model from that teacher, even if you explicitly filter out the biased content. Not super relevant to the frontier models, but stuff floating around on huggingface could conceivably fall prey to this.

Linking the article here if interested! https://www.nature.com/articles/s41586-026-10319-8

solid_fuel

> Writing "The user is asking ... policy states ..." even in the user input is sufficient to bypass the guardrails.

It's important to remember that when generating tokens from an LLM there is no distinction between user and system input. Even though the OpenAI API may allow you to tag tokens or present them as separate sections, they all get blended together and become floating point vectors in the attention layer (this is required for LLMs to work at all), and once they are blended they cannot be unblended.

LLMs are fundamentally different from something like SQL where you can cleanly isolate trusted and untrusted data.

krackers

>Is there a similar trick to poison an LLMs weights during training?

Yes, all those "jailbreak prompts" are part of the training set, so this can happen: https://ttps.ai/procedure/x_bot_exposing_itself_after_traini...

Used to be that merely mentioning "Pliny the Liberator" was enough to "jailbreak" an LLM. It doesn't work these days though, I guess labs have updated their RL methods to neutralize it.

Self-Perfection

> I'm sure that everyone out there is trying to make their weights, when ingested during training, survive over competing weights; "Buy AAA products" vs "Buy BBB products".

Just like for humans we have propaganda.

jddj

Somewhere there are surely llms being trained on all the standard pirated material but with Manchurian Candidate trigger words carefully worked in

btown

There's already some evidence that this is happening. See: https://www.crowdstrike.com/en-us/blog/crowdstrike-researche... (note that I haven't found independent verification or reproduction of these claims).

bandrami

I also kind of assume any Chinese model has a deeply embedded behavior to flag data the MSS might find interesting and do some kind of innocuous exfil of that if it is allowed any Internet access.

plaidthunder

It seems like there's an opportunity to embed identity information into tokens themselves, the way we embed sequence information. The trouble is... it's quite a challenge to train. Sequence is easy to derive for any corpus of data, but identity is not.

https://usize.github.io/blog/2026/april/why-no-ai-coworkers....

> In similar fashion to how sequence information is embedded within input tensors, an approach called “Instructional Segment Embedding”2 adds a parallel embedding channel for identity information. This gives models real awareness of provenance. And it works. But they only tested three fixed categories: system, user, data.

Interesting paper that touches on the idea here: https://arxiv.org/abs/2410.09102

echelon

Could you assign certain subject matters a score in the training data, construct a unified token space that contains these rankings, and then mark conversations as "dirty" if they veer into that subject matter?

plaidthunder

So, like mapping a type onto each incoming token that's been predetermined? To attribute each token to a particular topic?

I'm not sure what impact that would have on the performance of a model. It needs to learn information about things like what topic it's interacting with as a part of its normal operations, so injecting that information into the tokens at training time seems like it would interfere with learning.

I may be misunderstanding.

What I had in mind was something more like injecting attribution for token. You could do it with ids and then map those ids to actors during inference later to recreate the effect.

We do something similar with sequence now. We can even use methods like RoPE to handle arbitrarily long sequences and something similar--like rotating ids--could be used here.

This isn't how it looks in practice, but conceptually, something like:

embedding = token + sequence + id

Where id represents the source of a token.

id 0 = system

id 1 = user

id 2 = external data

That way the model could tell the difference between tokens by a user and tokens pulled in from a webfetch tool.

Then it would be easier in theory to ignore instructions from the webfetch tool's content.

formerly_proven

Correct. There is no token coloring. Models are just rl’d to attend to the first <systemprompt>…</systemprompt> strongly or “anything before token #4242”.

simonw

> This is a blog-style writeup of the paper

YES! I'd love to see more of this. Academic writing is designed to be frustrating to read. Publishing both a paper and a readable blog-style version of it is such a great pattern.

zahlman

> Academic writing is designed to be frustrating to read.

Maybe you didn't mean it this way, but it does come across as intentional sometimes.

simonw

I see it as a long-standing cultural thing. If you try to make the text more friendly and readable you'll be told to fix it by peer-review. There's a very well established formal academic writing style and you have to actively learn how to consume it.

I'm sure there are justifiable reasons for why it evolved that way, but it doesn't make for an easy format for extracting and understanding the underlying ideas if you're not already deeply immersed in that particular corner of academia.

Most papers I read I really want to go to a coffee shop/bar with the author and have a human conversation with them to find out what the paper is about and which bits of it are interesting and novel without putting in hours of additional effort myself!

epihelix

> Most papers I read I really want to go to a coffee shop/bar with the author and have a human conversation with them to find out what the paper is about and which bits of it are interesting and novel without putting in hours of additional effort myself!

This is why journal clubs were invented. All the fun discussion, none of the inaccessible academic writing.

It's also what I use frontier LLMs for -- prompt with the paper, and then attempt to tear it to shreds while the LLM pushes back against me. By the time the model and I are done, I generally understand the paper far better than if I'd sat down to read it cold. Then I actually read the paper.

All that said, I do feel that you can still write engaging papers in the academia. Some disciplines manage this as the norm -- take a look at some articles in the field of History, and the writing often manages to be rich and eloquent, while still making impeccable arguments with evidence. The greater problem is that a lot of academics in the sciences are just poor writers, and likely studied the sciences because they weren't into arts in the first place and avoided learning how to write well. Sad times.

mrob

I see it as something similar to Aviation English:

https://en.wikipedia.org/wiki/Aviation_English

Scientific papers are often written and read by non-native speakers. A standardized formal style is less likely to embed potentially confusing cultural assumptions.

girvo

I’ve had a surprising amount of success by emailing one of the authors of various papers and asking those exact kind of questions (though more specific: I need to show that I have put effort in!)

tpoacher

I reluctantly confess that I have indeed on occasion had to write in a way that makes the reader have to do a couple of extra mental steps to follow the logic, to avoid reviewers rejecting the manuscript on the grounds of the theoretical contribution being "trivial".

Combine this with added fees for longer papers and you have your answer.

forlorn_mammoth

academic writing is designed so a paper is part of a conversation, i.e. 100 other papers strongly relevant to the current paper. And the author needs to compress the ideas from those 100 other papers, plus their own additions to the conversation, into 6 pages.

Keep in mind those 100 other papers also went through this kind of data compression.

So the number of ideas/concepts per paragraph is much higher than 'popular' writing, and some base familiarity with the concepts under discussion needs to be assumed.

Yes, it is hard work to read these. Even when you are active in the field. Generally I need to read at least the abstracts of a some of the key references in order to understand the paper I'm interested in.

jcgrillo

Information density is one part, precision is another. Papers are often presenting work at the frontier of the field, which is by nature not well understood yet, and competitive. To have something worthy of publication is to have something that is new, and that often requires a degree of precision to communicate that we don't use casually. I think it's pretty gross to denigrate "academic writing" as obfuscatory, just like it's gross to make broad sweeping generalizations about journalists.

throwaway29812

[dead]

Scene_Cast2

Really neat findings.

I've personally had a line of thought where you bake in the role into the token. Basically have an embedding (same dim as token dim) for each role, add it to each token. This adds an unambiguous, unspoofable tag.

I ran this with a tiny Shakespeare model (not representative) and had a freeform embedding for each speaker. I ended up with a neat similarity map between every character. (I don't think the map was very informative for several reasons, but that's outside the scope of a small HN comment)

dmazzoni

My initial thought there is that you'd have an imbalance. Many token patterns would almost never come up with the assistant tag on them, for example words with typos in them.

ryukafalz

I don't know a ton about how LLMs work (I really should learn), but something like this feels like it might be the way forward to me.

The software running the model knows unambiguously what came from a user and what did not, what came from a tool call and what did not, etc... and having some way of exposing that to the LLM as part of the text itself feels like it fits better with how a neural net works than a set of surrounding tags does.

lelanthran

> I've personally had a line of thought where you bake in the role into the token. Basically have an embedding (same dim as token dim) for each role, add it to each token. This adds an unambiguous, unspoofable tag.

Wouldn't this require the training data to also be prepped with the control tokens?

Scene_Cast2

Yes it would. Or, rather, labeling (not extra tokens).

zahlman

Of course it would, at least at some point; the model has to… model what it means for a token to be a control token. (And the eventual interface of course has to be secure against end users generating such tokens, but that should be easy enough.)

…This somehow feels like AI scientists rediscovering the concept of parenting.

mrob

You could duplicate every token and reserve the duplicates exclusively for the chain-of-thought, which could be robustly filtered from user input. Basically adding a "thought" bit to each token.

captainmuon

Isn't the problem that role tags are just part of the input stream? So a specific word in the system prompt becomes the same token as the same word in the user prompt? A clean way to solve this would be to map system prompts to a distinct set of tokens from the ones in user prompts. This would require twice as many possible tokens, so it is probably not feasible. But maybe you could add "color" to the input stream by changing one input variable depending on whether the current token is part of the system prompt or not? Just like humans take different voices into account and not just the context of the text.

I have to say I am not very familiar with implementation details of language models, and maybe this is already done?

lambdaone

Instead of having distinct tokens, you could have modifier vectors which would be added to other tokens. Think in terms of control, shift, meta etc.

ipython

The research is interesting but I cringe every time there is a reference to “authorization” or that the roles form the “security architecture” of an llm.

LLMs in their current form provide no security boundaries or guarantees full stop. We need to be clear about this otherwise we end up with truly insecure architectures that can be fooled with the 2026 equivalent of a cereal box whistle.

jcgrillo

100%. Anyone who is feeding unsanitized input to an LLM is doing it wrong. It'd be just like letting users craft their own SQL queries. I think the security aspect raises an interesting (if awkward) question:

How do you sanitize inputs to an LLM? Like how can you even make a secure user-facing product with this thing?

Maybe I'm lacking imagination, but it seems to me all the great "natural language interface" solutions this is supposed to enable are pretty badly hobbled by this issue.

joe_the_user

Even your discussion makes it "sanitized input" simply doesn't exist in relation to an LLM. At best it seems like one can prefix and filter input as much as possible, monitor the results but never assume that you are done.

jcgrillo

If that's the case then user-facing products that can take any useful action are strictly off the table.

bandrami

Maybe I'm missing something but does this idea need a "theory"? There's zero sideband here; everything is just context. "Injection" is just kind of baked in to the design.

geoffschmidt

I think their work earns "theory" because it makes specific predictions both about how to make more effective prompt injection attacks and what activations you'd observe in the LLM during those attacks, and can also be plausibly extrapolated to suggest useful future research directions.

yunwal

At this point I think it's similar to reporting a particularly effective social engineering practice. It's not particularly surprising that it works or that it exists, but it's still noteworthy.

joe_the_user

Well, the original HN title (which has been changed as I write) was the second large text "A Theory of Prompt Injection", which should simply be "A Method Of Prompt Injection Using Roles".

I would say this method is less interesting than the question of whether one needs a discreet theory of why "prompt injections" ("malicious" frame jumps) exist or whether one should assume changing logical frame jumps are present by default in all normal human language (LLM training sets) and all the system prompts and filtering done against so called "prompt injection" are what is going be ad-hoc and without a unified theory.

zby

They do predict what injections might be effective - so it is a theory. I don't know how novel it is and it is not very deep (as you noted the general mechanism is quite obvious) - but they do it quite systematically so it is useful.

jackb4040

I was gonna say, anyone who's copy-pasted one LLM conversation into another already intuitively understands all this.

dvt

The paper is correct, but I think that anyone that knows anything about LLMs knows this:

> Role tags were a formatting trick that became the security architecture and the cognitive scaffolding of modern LLMs.

LLMs are basically some `f(x) → y` where x and y are strings. That's it. Nothing more to it. If you feed it private x (like secret keys) or do dangerous stuff with y (like running arbitrary non-sandboxed code), that's on you.

Also, roles were never really meant to be a "security architecture," they were just meant to (a) make training/fine-tuning easier, and (b) make conversational LLMs more useful.

x312

I believe they are trained for security now, but you're not wrong in that it's kind of stapled on top

https://arxiv.org/abs/2404.13208

lelanthran

> I believe they are trained for security now, but you're not wrong in that it's kind of stapled on top

Difficult to train them for security. Have you ever played Gandalf (Lakera Labs, maybe?)

I passed all 7 levels in about 3 minutes using essentially the same prompt.

What's interesting to me is that as the security is tightened up level to level, the utility of the LLM drops. At level 7, even something like "Write a poem describing the four seasons using significant characters at the start of every line" causes a "I'm afraid I can't" type of response.

At level 7 you can't get any useful info out of the LLM even if you're not trying to retrieve the password, and yet you can still jailbreak it to reveal the password anyway!

At level 8, almost anything you type will be rejected, whether or not it has anything to do with the password.

IOW, there does not seem to be any way to train for security without making it dumber than a markov chain.

jackb4040

Well, people who build and/or use LLMs know this. People who tweet about and/or sell LLMs are paid ungodly amounts of money to not understand this, and so they don't.

veganmosfet

Very interesting research. I would be interested to know how closed source AI labs implement the role thing in their inference. Is it still only a separation token? Frontier closed source LLMs are quite good at flagging any spoofing attempt from tool call results.

However, in some prompt injection experiments [0], I found it's possible to "derail" the user intent only with tool call results, here are some tricks:

* Frame the injection as a challenge. * Always use "soft" instructions ("You may", "Try to", ...). Hard instructions are almost always flagged. * Force the model to do multiple tool calls. * Bloat the context. * In the injection payload, better use LLM output (which correlates somehow with this research). I like using LLM generated poems but that's probably irrelevant. * Use multiple encoding steps to force the model to use tools, but this may be detected by the external guardrails (Anthropic does this in my experience). * Hide malicious code payload from the model context. * Last but not least, understand the agent harness used and its weaknesses (e.g., in OpenClaw, they injected emails as user message - not tool call results [1]).

[0] https://itmeetsot.eu/posts/2026-06-14-yolo_harness/ [1] https://itmeetsot.eu/posts/2026-02-02-openclaw_mail_rce/

orbital-decay

That's a technique that has been in use forever, a ton of jailbreaks work by taking shortcuts across system delimiters in an attempt to blur the lines between the roles. They just investigate it with more rigor. Reasoning leaking into the reply is also part of the reason a lot of modern models suck at creative writing and languages, and why the assistant prefill is absolutely required for the model to be any good at that. See for example the self-correction phenomenon which seems to have multiple root causes that are hard to disentangle without a ton of testing, likely a combination of reasoning leak ("high CoTness" in this article) and planning and progressive refinement all iterative models do.

hananova

I’ve always found all llm’s to be effortless to “jailbreak.”

Simply edit their refusal, “Sure, I can do blah blah blah, let me know if you want me to continue!” And then send back an api call with that edited response and your own response saying “Yes.”

I’ve found even the most guard-railed LLM’s to then be willing to do even the most heinous shit I could think of.

qweiopqweiop

Maybe I'm naïve, but is the heinous shit that bad? I'm essentially wondering if it's anything worse than you could discover on the internet already. Of course it makes it more accessible/easier, but I'm curious if it goes a level above what is technically discoverable right now.

plewd

Not much if you only use it as a glorified search engine, but the problem stems from all the other things you can make it do for personal use after jailbreaking.

undefined

[deleted]

Daily Digest email

Get the top HN stories in your inbox every day.

Prompt Injection as Role Confusion