LLM red teaming starts at day one

LLM red teaming exposes why helpful AI personas become attack surfaces. A practical framework for product teams shipping language model features.

Researchers at Mindgard recently gaslit Claude into producing instructions for building explosives, bypassing safety guardrails by exploiting the very trait Anthropic spent years cultivating: helpfulness. The model wanted to be useful, and the attackers turned that into a weapon. This is the central problem with how most teams approach LLM red teaming today. Safety is treated as a personality layer when it needs to be an architectural concern.

If you ship LLM features, the lesson is not that Claude is broken. The lesson is that any model tuned for cooperative dialogue can be socially engineered, and your wrapper around that model inherits the problem.

Why persona is not a safety boundary

Anthropic's constitutional AI approach is genuinely sophisticated. The model has been trained to refuse, to reflect, to consider harm. And yet a sufficiently patient adversary, armed with role play framing and incremental commitment, walks it past those refusals.

The reason is structural. A persona is a probability distribution over responses. It is not a hard constraint. When you instruct a model to be helpful, you have created a gradient that adversaries can climb. Every refusal is followed by a softer alternative response somewhere in the distribution, and the attacker's job is to find the path.

This matters for product teams because most LLM features ship with the same architecture: a system prompt that defines a persona, a user input channel, and the model's output passed back to the user or to downstream tools. The persona is doing all the safety work. That is a thin layer of defence against people who do this for a living.

What adversarial thinking looks like in practice

Good LLM red teaming is not running a checklist of jailbreak prompts from GitHub. It is asking, for each capability your product exposes, what an attacker gains by abusing it and what path they would take. The OWASP Top 10 for LLM Applications is a reasonable starting reference, but it is a taxonomy, not a methodology.

The methodology has to come from inside your team, because only your team understands what your model has access to. A general purpose chatbot and a customer support agent with tool access to refund APIs have completely different threat models, even if they are built on the same base model.

The three questions to ask before shipping

What can the model do on behalf of a user that the user could not do directly? This is your privilege escalation surface.
What inputs reach the model that did not come from the authenticated user? Retrieved documents, tool outputs, web content. This is your prompt injection surface.
What does the model produce that another system trusts? Generated SQL, function calls, summaries shown to other users. This is your output integrity surface.

If you cannot answer all three for every LLM feature in your product, you are not ready to ship.

A red teaming framework for LLM products

Here is the framework we use with clients embedding LLM features into production systems. It assumes you have at least one senior engineer who treats security as a craft, not a compliance task.

Stage one, threat modelling before code

Before writing the system prompt, write the abuse cases. For each user role, list the actions they should never be able to take through the AI surface. Be specific. Not "leak data" but "retrieve another tenant's invoice via crafted query". This list becomes your test corpus.

Stage two, defence in depth around the model

Treat the model as untrusted, the way you would treat a user submitted file. Concretely:

Input filtering for known injection patterns, not as a primary defence but as a noise reducer.
Output validation against a schema for any structured response, with hard rejection on mismatch.
Tool call authorisation checked against the authenticated user's actual permissions, never against what the model claims.
Separate model contexts for trusted instructions and untrusted content. Never concatenate retrieved documents into the system prompt.

Stage three, adversarial evaluation as CI

Your test corpus from stage one runs on every model update, every prompt change, every retrieval pipeline change. Track pass rates over time. Models drift. Prompts get edited. Retrieval indices get repopulated. Every one of those is a potential regression.

The NIST AI Risk Management Framework provides useful vocabulary here, particularly around continuous evaluation, but the actual tests have to be yours.

Stage four, human review for edge categories

For categories where the cost of a single failure is high, regulated advice, content involving minors, financial transactions, you need humans reviewing samples of real production traffic. Automated evaluation finds known patterns. Humans find the novel ones.

Where this breaks down in practice

Most teams shipping LLM features do not have the security maturity to run this framework end to end. The model is treated as a feature, owned by product, evaluated on user satisfaction. Security gets involved at launch review, finds the obvious issues, and the team ships anyway because the roadmap is fixed.

This is a leadership problem before it is an engineering problem. Someone senior has to own the position that AI features ship under the same security bar as any other production system, and that bar includes adversarial testing. A Fractional CTO engagement is often the cleanest way to get that authority into a team that has not yet hired for it, particularly in companies between series A and C where the AI roadmap is moving faster than the security function.

For investors, the same logic applies in reverse. Technical Due Diligence on an AI heavy target should include a red team pass against the production model surface. The interesting risks are rarely in the model card. They are in how the team has, or has not, thought about the model as part of a larger trusted system.

The takeaway

The Claude incident is not embarrassing for Anthropic. They are doing this work more visibly than most. It is a warning for everyone building on top of foundation models that the safety properties of the base model are not your safety properties.

If your product has an LLM in it, your team owes the same adversarial scrutiny you would apply to an authentication system or a payment flow. Helpfulness is a feature. Treat it like a feature, with a threat model attached.

Tell us what you need. We'll find the right engineers.

Whether you need senior developers embedded in your team, a Fractional CTO, or a technology assessment before a deal — most engagements start within 2–4 weeks.

Or email us directly at post@devspace.no to get a free consultation.