How to Secure Client AI Agents Against Prompt Injection

If you build AI agents for clients, prompt injection is not a hypothetical you can defer. According to OWASP, Prompt Injection (LLM01) is the number one risk in the OWASP LLM Top 10, and it has held that position since the list was created. The 2026 update reinforces the point and adds System Prompt Leakage and Vector and Embedding Weaknesses to the roster. What makes it worse is reach: indirect prompt injection is exploitable across every RAG-enabled and agentic deployment, which means essentially every serious build an agency ships is exposed by default.

This post is general information, not legal advice, and security posture does not replace legal review. If a breach could expose client or customer data, involve a qualified attorney alongside your security work to understand your obligations.

What prompt injection actually is

Prompt injection is when untrusted input manipulates a model into ignoring its instructions and doing something the builder did not intend. Direct injection is a user typing adversarial text straight into the chat, for example instructions to reveal the system prompt or bypass a guardrail. Indirect injection is sneakier and more dangerous: the malicious instructions are hidden in content the agent reads, a web page, a document, an email, a support ticket, and the agent executes them as if they were legitimate.

The reason this ranks first is that it targets the core mechanism of an LLM, which cannot cleanly separate trusted instructions from untrusted data in the same context window. That is why there is no single patch, only layered defenses, and why agencies need to design for it rather than hope to avoid it.

Why RAG and agents raise the stakes

A closed chatbot that only talks is a limited target. The moment you add retrieval or tools, the blast radius grows. A RAG system reads external documents, any of which could carry injected instructions. An agent with tools can send emails, call APIs, move data, or trigger workflows, so a successful injection is not just an embarrassing response, it is an action taken with the agent's privileges. Indirect injection through a poisoned document plus a powerful tool is the combination that turns a clever prompt into real damage.

Relative exposure by build type (illustrative)

Agent with tools + external data95%

RAG over untrusted docs80%

Chatbot with no tools38%

Static prompt, no retrieval20%

The chart is illustrative of how risk scales with capability, not measured data. The takeaway holds regardless of exact numbers: the more a client agent can read and do, the harder you have to defend it.

Defense one: treat all external input as hostile

Start from the assumption that anything the agent reads may be poisoned. That means input filtering and sanitization on retrieved content, clear separation between system instructions and user or document data, and structured prompts that make injected instructions less likely to be obeyed. You will not catch every attack this way, which is exactly why it is the first layer and not the only one.

Defense two: least-privilege tools

This is the single highest-leverage control. Give an agent only the tools it genuinely needs, scoped as narrowly as possible. If an agent does not need to delete records, do not give it delete access. If it only needs to read one calendar, do not hand it the whole account. When a tool must touch something sensitive, gate it. The principle is simple: a successful injection can only do what the agent is allowed to do, so shrink what the agent is allowed to do. This also intersects with data governance, which we cover in client data privacy for AI agencies.

Defense three: human approval for high-impact actions

For anything irreversible or high-stakes, sending money, deleting data, emailing a customer list, contacting external parties, put a human in the loop. A confirmation step feels like friction, but it converts a silent catastrophe into a caught mistake. Design the agent so that low-risk actions flow automatically and high-risk actions pause for approval. The art is choosing the threshold, and erring toward caution on the actions that cannot be undone.

Defense four: filter the output, too

Injection is not only about what goes in; it is also about what comes out. Output filtering catches leaked system prompts, exfiltrated data, and responses that try to smuggle instructions to downstream systems. The 2026 OWASP update calls out System Prompt Leakage specifically, so checking outputs for signs that the model has been coerced into revealing its instructions is now a named concern. Never let raw model output flow unchecked into another system that will act on it.

Defense five: test like an attacker

You cannot secure what you have not tried to break. Before an agent ships, run adversarial testing: attempt direct injections, plant indirect injections in the documents the agent will read, and probe for system prompt leakage and data exfiltration. Vector and embedding weaknesses, new to the 2026 list, mean your RAG store itself is a target worth testing. Make this part of the build process, not a one-off, and re-test when tools or data sources change.

Packaging security as part of the offer

Security is not just risk reduction; it is a differentiator you can sell. Most clients have no idea their shiny agent is exposed to indirect injection, and an agency that bakes in least-privilege design, human approval gates, and adversarial testing looks dramatically more professional than one that ships an unguarded bot. This pairs directly with a compliance retainer; the two stories reinforce each other, as we lay out in AI compliance as a service.

When you demonstrate a secured agent to a prospect, showing the approval gates and the way the system refuses injected instructions is a compelling proof point. A demo platform like Ciela lets you walk a client through exactly how the hardened build behaves, turning invisible security work into something they can see and value.

Prompt injection is not going away; it is structural to how LLMs work. The agencies that assume every input is hostile, grant the least privilege that gets the job done, gate the dangerous actions, filter both directions, and test adversarially will ship agents that hold up. Layer the defenses, treat security as part of the deliverable, and bring in legal counsel whenever a breach could expose sensitive data.