LockPicking - The AI Version

A new paper demonstrates that one of the most promising defenses against prompt injection can be almost perfectly bypassed.

The existing defense idea is elegant and widely appealing:

Watch how an LLM’s internal behavior changes when it reads retrieved text
Flag “task drift” when injected instructions appear
Monitor multiple layers and use majority voting for safety

This wasn’t a hacky filter. And it worked - until the attacker adapted.
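
For concreteness, the detection idea described above might look roughly like the toy sketch below. This is not the paper's implementation: the linear probes, their weights, the tiny 2-dimensional "activations", and every function name are assumptions made purely for illustration. Real detectors would read high-dimensional activations from an actual model.

```python
from typing import Sequence, Tuple

# Hypothetical linear probe for one layer: (weights, bias).
Probe = Tuple[Sequence[float], float]


def layer_flags_drift(before: Sequence[float], after: Sequence[float], probe: Probe) -> bool:
    """Flag drift if the probe scores the activation change above zero."""
    weights, bias = probe
    delta = [a - b for a, b in zip(after, before)]
    return sum(w * d for w, d in zip(weights, delta)) + bias > 0.0


def majority_vote_drift(before_by_layer, after_by_layer, probes) -> bool:
    """Combine per-layer flags: report drift if more than half the layers agree."""
    votes = sum(
        layer_flags_drift(before, after, probe)
        for before, after, probe in zip(before_by_layer, after_by_layer, probes)
    )
    return votes > len(probes) / 2


# Toy data (made up): three monitored layers, two activation dimensions each.
probes = [([1.0, 0.0], -0.5), ([0.0, 1.0], -0.5), ([1.0, 1.0], -1.0)]
baseline = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
clean_doc = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]
injected_doc = [[1.0, 0.2], [0.1, 0.9], [0.3, 0.3]]

clean_flagged = majority_vote_drift(baseline, clean_doc, probes)
injected_flagged = majority_vote_drift(baseline, injected_doc, probes)
```

In this toy setup the clean document raises no flags, while the injected one trips two of the three layer probes and so loses the vote.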

What the paper shows

An adaptive attacker can generate one short, reusable suffix that:

The suffix doesn’t persuade the model in any human sense. Instead, it subtly reshapes the model’s internal signals so they resemble:

The detectors never see anything “off.”

Why? Because the attacker explicitly optimizes the suffix to avoid exactly what the detectors are looking for.

Multiple detectors don’t add safety here — they just give the attacker a clearer target.
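
A toy sketch can show why stacking detectors turns into a joint optimization target. Everything here is an assumption for illustration: the detectors are hypothetical linear probes the attacker can see, and the "suffix" is collapsed into a single additive shift on each layer's activation change. In a real attack the suffix is a sequence of tokens whose effect passes through the model, so this only captures the shape of the idea, not the paper's method.

```python
from typing import List, Sequence, Tuple

# Hypothetical linear detector for one layer: (weights, bias).
Probe = Tuple[Sequence[float], float]


def score(probe: Probe, delta: Sequence[float]) -> float:
    """Detector score for one layer; a positive score means 'drift flagged'."""
    weights, bias = probe
    return sum(w * d for w, d in zip(weights, delta)) + bias


def optimize_shift(deltas, probes, margin=0.1, lr=0.1, steps=200) -> List[float]:
    """Gradient descent on the summed hinge loss max(0, score + margin) over all probes.

    Every detector that still fires contributes its weight vector to the
    gradient, so each extra detector is simply one more term in the loss.
    """
    shift = [0.0] * len(deltas[0])
    for _ in range(steps):
        grad = [0.0] * len(shift)
        any_active = False
        for delta, probe in zip(deltas, probes):
            shifted = [d + s for d, s in zip(delta, shift)]
            if score(probe, shifted) + margin > 0.0:
                any_active = True
                for i, w in enumerate(probe[0]):
                    grad[i] += w  # gradient of a linear score w.r.t. the shift
        if not any_active:
            break  # every detector is below threshold with margin to spare
        shift = [s - lr * g for s, g in zip(shift, grad)]
    return shift


# Toy data (made up): activation changes caused by an injected instruction.
probes = [([1.0, 0.0], -0.5), ([0.0, 1.0], -0.5), ([1.0, 1.0], -1.0)]
injected = [[1.0, 0.2], [0.1, 0.9], [0.3, 0.3]]

flags_before = [score(p, d) > 0.0 for p, d in zip(probes, injected)]
shift = optimize_shift(injected, probes)
flags_after = [
    score(p, [x + s for x, s in zip(d, shift)]) > 0.0
    for p, d in zip(probes, injected)
]
```

Before the shift, two of the three toy detectors fire. After optimizing against all of them at once, none do, even though the injected activation change underneath is unchanged.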

This flips a core assumption on its head: Detection itself becomes the attack surface.

If the safety mechanism is:

Observable
Consistent
Predictable in how it reacts to inputs

Then an attacker can learn how to stay just inside its comfort zone.

At that point:

We’ve seen this pattern before:

The way forward:

In other words: Defenses improve only when they’re trained on how attackers actually adapt.

Key Takeaway

Prompt injection is a systems security problem.

#AI #Security #RAG #WomenInTech #LLMs #PromptInjection