AI Red Teaming¶
Summary¶
AI red teaming is the practice of probing the safety and security of generative AI systems by emulating real-world attacks and failure modes. Based on Microsoft AIRT's experience red-teaming 100+ products, the key lessons are: start from potential impacts (not attack techniques), prefer simple attacks over gradient-based methods, treat red teaming as distinct from safety benchmarking, automate with frameworks like PyRIT while preserving human judgment, and accept that AI security is never "solved" — the goal is raising attacker cost beyond attack value.
Details¶
Microsoft AIRT Threat Model Ontology¶
Microsoft's AI Red Team (AIRT) uses a structured ontology to model GenAI vulnerabilities:
| Component | Description |
|---|---|
| System | The end-to-end model or application being tested |
| Actor | Person being emulated (adversarial or benign — both matter) |
| TTPs | Tactics, Techniques, and Procedures (mapped to MITRE ATT&CK + ATLAS) |
| Weakness | The vulnerability that makes the attack possible |
| Impact | Downstream consequence (data exfiltration, harmful content, etc.) |
Two main impact categories: - Security — data exfiltration, credential dumping, RCE, SSRF, prompt injection, denial-of-AI-service - Safety (RAI) — hate speech, violence/self-harm, CSAM, gender bias, psychosocial harm
The Eight Lessons¶
Lesson 1: Start from Impact, Not Technique¶
Red team operations should start by identifying potential downstream impacts given what the system can do and where it is deployed, then work backward to attack strategies. This avoids wasting effort on attacks that are technically interesting but unlikely in the real world.
Key factors: - Capability constraints — smaller models are less susceptible to advanced encoding attacks (base64, ASCII art) and have less CBRN knowledge to exploit - Deployment context — the same LLM used as a creative writing assistant vs. to summarize patient records has radically different risk profiles - Surprisingly: better instruction-following capability (a positive feature) also makes a model more susceptible to jailbreaks
Lesson 2: Simple Attacks Win¶
"Real attackers don't compute gradients, they prompt engineer."
In practice: - Gradient-based adversarial attacks require full model access (white-box), are computationally expensive, and are rarely used by real attackers - Simple jailbreaks (Skeleton Key, Crescendo multi-turn) and manual prompt engineering are more effective in production systems - Attacker tactics on online forums skew strongly toward manual prompt techniques over adversarial suffixes like GCG
System-level perspective matters most: a system with a good model but a vulnerable database, weak input filters, or exposed credentials is easier to exploit via those gaps than via the model itself.
Example AIRT operation: (1) Reconnaissance via low-resource-language prompt injection to identify internal Python functions; (2) XPIA to generate a script calling those functions; (3) code execution to exfiltrate private user data. No gradients needed.
Lesson 3: Red Teaming ≠ Safety Benchmarking¶
| AI Red Teaming | Safety Benchmarking | |
|---|---|---|
| Scope | End-to-end systems, novel harm categories, context-specific | Standardized datasets, existing harm taxonomies |
| Human effort | High | Low (automated) |
| Finds | Novel harm categories, context-specific vulnerabilities | Comparative model performance |
| Limitation | Doesn't scale well | Can't capture novel harms or system-level risks |
The two are complementary: benchmarks enable comparison across models; red teaming extends testing into novel territory. Safety concerns discovered by red teaming inform the development of new benchmarks.
Key failure mode: benchmarks measure pre-existing notions of harm. Novel capabilities (LLM persuasion, deception, agentic tool use) introduce harm categories that no existing benchmark covers.
Lesson 4: Automation Scales Coverage (PyRIT)¶
PyRIT (Python Risk Identification Toolkit) is Microsoft's open-source red teaming framework. Components: - Prompt datasets — curated harmful/sensitive prompts across many categories - Prompt converters — encodings (base64, ASCII art, low-resource languages), paraphrases - Automated attack strategies — TAP, PAIR, Crescendo multi-turn, and custom orchestrators - Multimodal scorers — evaluate text, image, and audio outputs for harm
PyRIT uses powerful LLMs (including uncensored models) to automatically jailbreak target models — it is itself dual-use. Operators use it to cover more of the risk landscape than manual testing allows, and to account for the non-deterministic nature of AI outputs (a prompt that elicits harm once may not reliably do so; automation estimates frequency).
Lesson 5: The Human Element Is Irreplaceable¶
Three areas where human judgment cannot be automated:
- Subject matter expertise — LLM-as-judge works for simple tasks (hate speech, explicit content) but is unreliable for specialized domains (medicine, CBRN, cybersecurity); SMEs are required
- Cultural competence — Most AI safety research is Western and English-centric; safety behaviors transfer surprisingly well to non-English languages in testing (Phi-3.5 tested in Chinese, Spanish, Dutch, English), but deeper cultural/political harm redefinition requires diverse human input
- Emotional intelligence — Questions like "would this response make me uncomfortable?" or "how might this be interpreted in a different context?" require human assessment; also, red teamers are exposed to disturbing content and need mental health support
Lesson 6: RAI Harms Are Pervasive But Hard to Measure¶
Responsible AI (RAI) harms are subjective, probabilistic, and context-dependent:
- Adversarial vs. benign triggers — most AI safety research focuses on adversarial attacks, but benign users accidentally triggering harmful content may be a more important case to catch
- Three unknowns when a harmful output is found: (1) How likely is this to occur at inference time (probabilistic)? (2) Why did this prompt cause it? (3) What else causes similar behavior?
- Contrast with traditional security: a buffer overflow is reproducible, explainable, and unambiguous in severity
RAI scoring in PyRIT uses manual + automated (LLM-as-judge) hybrid approaches. AIRT draws a hard distinction between their operational red teaming and benchmark-style evaluation on datasets like DecodingTrust and ToxiGen (handled by partner teams).
Lesson 7: LLMs Amplify Existing Risks and Introduce New Ones¶
Amplified existing risks: - LLM applications still have all traditional application vulnerabilities: outdated dependencies, improper error handling, lack of input/output sanitization, credentials in source, insecure data transmission - Example: a token-length side channel in GPT-4 / Copilot allowed reconstruction of encrypted LLM responses — not an AI vulnerability, a transmission vulnerability - SSRF (Server-Side Request Forgery) found in a video-processing GenAI app
New AI-specific risks: - XPIA (Cross-Prompt Injection Attack) — Malicious instructions hidden in documents fed to a RAG system, exploiting the LLM's inability to distinguish data from instructions. Used by AIRT to exfiltrate private data in multiple operations. - Jailbreaks — Carefully crafted prompts that subvert safety alignment - Expanded attack surface: agentic systems with tool access create higher-privilege attack targets
Fundamental limitation: For any output with a non-zero probability of being generated by an LLM, there exists a sufficiently long prompt that will elicit it (theoretical result; experimentally confirmed). RLHF makes jailbreaks harder, but never impossible.
Lesson 8: AI Security Is Never Complete¶
The framing that AI safety can be "solved" through technical advances is unrealistic. Three forces keep the problem open:
- Economics of cybersecurity — No system is foolproof. The goal is to raise attacker cost beyond attack value. Currently, the cost of jailbreaking most models is low — which is why real attackers use simple techniques.
- Break-fix cycles — Multiple rounds of red teaming + mitigation raise robustness incrementally. Mitigations may inadvertently introduce new risks; purple teaming (continuous offense + defense) is more effective than a single red team pass.
- Policy and regulation — Regulation raises attacker costs via legal consequences and mandated security practices, but is complicated by the risk of stifling innovation.
The prediction: "Prompt injections of today will become the buffer overflows of the early 2000s" — not eliminated, but largely mitigated through defense-in-depth and secure-first design, once the industry takes the problem as seriously.
Notable Attack Techniques¶
| Technique | Description |
|---|---|
| Jailbreak | Carefully crafted prompt that subverts safety alignment |
| Skeleton Key | A specific jailbreak technique documented by AIRT |
| Crescendo | Multi-turn jailbreak: gradually escalate requests across a conversation until harmful content is elicited |
| XPIA | Cross-prompt injection attack: malicious instructions hidden in documents fed to a RAG/agentic system |
| GCG | Greedy Coordinate Gradient — white-box gradient-based adversarial suffix; powerful but impractical in production |
| TAP | Tree of Attacks with Pruning — automated black-box jailbreak via attack tree search |
| PAIR | Prompt Automatic Iterative Refinement — uses an attacker LLM to iteratively refine jailbreaks against a target LLM |
PyRIT¶
Open-source Python framework for AI red teaming developed by Microsoft AIRT.
- GitHub: microsoft/PyRIT
- Includes: prompt datasets, converters, automated attack orchestrators (TAP, PAIR, Crescendo), multimodal scorers
- Used internally at Microsoft for all 100+ product red teaming operations
- Dual-use: can automatically jailbreak models using uncensored GPT-4 variants
Key Claims & Data Points¶
- Microsoft AIRT has red-teamed 100+ GenAI products since 2021 — [source: 2501_07238v1]
- Gradient-based attacks (GCG) are rarely used by real attackers; simple prompt engineering is the norm — [source: 2501_07238v1]
- For any non-zero-probability LLM output, a sufficiently long prompt will elicit it (Geiping et al. 2024) — [source: 2501_07238v1]
- Better instruction-following capability makes models more susceptible to jailbreaks — [source: 2501_07238v1]
- Benign users accidentally triggering harmful content may be a more important case than adversarial attacks — [source: 2501_07238v1]
- RLHF raises jailbreak cost but does not eliminate it — [source: 2501_07238v1]
- "Prompt injections will become the buffer overflows of the early 2000s" — [source: 2501_07238v1]
- Mythos Preview achieved full control-flow hijack (tier 5) on 10 separate fully patched OSS-Fuzz targets; Sonnet 4.6/Opus 4.6 each reached only a single tier-3 crash; Mythos reached 595 tier-1/2 crashes vs. ~150–175 and ~100 respectively for the smaller models — [source: llm_tier_personal_computer_security (citing Anthropic red.anthropic.com Mythos preview post)]
Open Questions¶
- How do you probe for dangerous LLM capabilities like persuasion, deception, and autonomous replication? (raised by: concepts/ai-red-teaming, 2026-04-09)
- Can AI red teaming practices be standardized so that organizations can clearly communicate their methods and findings? (raised by: concepts/ai-red-teaming, 2026-04-09)
- How do red teaming practices adapt to non-Western linguistic and cultural contexts at scale? (raised by: concepts/ai-red-teaming, 2026-04-09)
- At what point does the cost of jailbreaking mainstream models rise to the level of buffer overflows — and what specific mitigations drive that transition? (raised by: concepts/ai-red-teaming, 2026-04-09)
Related Articles¶
- concepts/prompt-injection
- concepts/openclaw-security
- concepts/agentic-workflows
- concepts/agentic-engineering
- concepts/frontier-ai-cyber-capabilities
- concepts/llm-tier-security
Sources¶
- Lessons From Red Teaming 100 Generative AI Products — Microsoft AIRT; 8 lessons, threat model ontology, 5 case studies (arXiv 2501.07238, Jan 2025)
- LLM-tier personal computer security — cata (LessWrong, Apr 2026); Mythos exploit benchmark data