Skip to content

AI Red Teaming

Summary

AI red teaming is the practice of probing the safety and security of generative AI systems by emulating real-world attacks and failure modes. Based on Microsoft AIRT's experience red-teaming 100+ products, the key lessons are: start from potential impacts (not attack techniques), prefer simple attacks over gradient-based methods, treat red teaming as distinct from safety benchmarking, automate with frameworks like PyRIT while preserving human judgment, and accept that AI security is never "solved" — the goal is raising attacker cost beyond attack value.

Details

Microsoft AIRT Threat Model Ontology

Microsoft's AI Red Team (AIRT) uses a structured ontology to model GenAI vulnerabilities:

Component Description
System The end-to-end model or application being tested
Actor Person being emulated (adversarial or benign — both matter)
TTPs Tactics, Techniques, and Procedures (mapped to MITRE ATT&CK + ATLAS)
Weakness The vulnerability that makes the attack possible
Impact Downstream consequence (data exfiltration, harmful content, etc.)

Two main impact categories: - Security — data exfiltration, credential dumping, RCE, SSRF, prompt injection, denial-of-AI-service - Safety (RAI) — hate speech, violence/self-harm, CSAM, gender bias, psychosocial harm


The Eight Lessons

Lesson 1: Start from Impact, Not Technique

Red team operations should start by identifying potential downstream impacts given what the system can do and where it is deployed, then work backward to attack strategies. This avoids wasting effort on attacks that are technically interesting but unlikely in the real world.

Key factors: - Capability constraints — smaller models are less susceptible to advanced encoding attacks (base64, ASCII art) and have less CBRN knowledge to exploit - Deployment context — the same LLM used as a creative writing assistant vs. to summarize patient records has radically different risk profiles - Surprisingly: better instruction-following capability (a positive feature) also makes a model more susceptible to jailbreaks

Lesson 2: Simple Attacks Win

"Real attackers don't compute gradients, they prompt engineer."

In practice: - Gradient-based adversarial attacks require full model access (white-box), are computationally expensive, and are rarely used by real attackers - Simple jailbreaks (Skeleton Key, Crescendo multi-turn) and manual prompt engineering are more effective in production systems - Attacker tactics on online forums skew strongly toward manual prompt techniques over adversarial suffixes like GCG

System-level perspective matters most: a system with a good model but a vulnerable database, weak input filters, or exposed credentials is easier to exploit via those gaps than via the model itself.

Example AIRT operation: (1) Reconnaissance via low-resource-language prompt injection to identify internal Python functions; (2) XPIA to generate a script calling those functions; (3) code execution to exfiltrate private user data. No gradients needed.

Lesson 3: Red Teaming ≠ Safety Benchmarking

AI Red Teaming Safety Benchmarking
Scope End-to-end systems, novel harm categories, context-specific Standardized datasets, existing harm taxonomies
Human effort High Low (automated)
Finds Novel harm categories, context-specific vulnerabilities Comparative model performance
Limitation Doesn't scale well Can't capture novel harms or system-level risks

The two are complementary: benchmarks enable comparison across models; red teaming extends testing into novel territory. Safety concerns discovered by red teaming inform the development of new benchmarks.

Key failure mode: benchmarks measure pre-existing notions of harm. Novel capabilities (LLM persuasion, deception, agentic tool use) introduce harm categories that no existing benchmark covers.

Lesson 4: Automation Scales Coverage (PyRIT)

PyRIT (Python Risk Identification Toolkit) is Microsoft's open-source red teaming framework. Components: - Prompt datasets — curated harmful/sensitive prompts across many categories - Prompt converters — encodings (base64, ASCII art, low-resource languages), paraphrases - Automated attack strategies — TAP, PAIR, Crescendo multi-turn, and custom orchestrators - Multimodal scorers — evaluate text, image, and audio outputs for harm

PyRIT uses powerful LLMs (including uncensored models) to automatically jailbreak target models — it is itself dual-use. Operators use it to cover more of the risk landscape than manual testing allows, and to account for the non-deterministic nature of AI outputs (a prompt that elicits harm once may not reliably do so; automation estimates frequency).

Lesson 5: The Human Element Is Irreplaceable

Three areas where human judgment cannot be automated:

  1. Subject matter expertise — LLM-as-judge works for simple tasks (hate speech, explicit content) but is unreliable for specialized domains (medicine, CBRN, cybersecurity); SMEs are required
  2. Cultural competence — Most AI safety research is Western and English-centric; safety behaviors transfer surprisingly well to non-English languages in testing (Phi-3.5 tested in Chinese, Spanish, Dutch, English), but deeper cultural/political harm redefinition requires diverse human input
  3. Emotional intelligence — Questions like "would this response make me uncomfortable?" or "how might this be interpreted in a different context?" require human assessment; also, red teamers are exposed to disturbing content and need mental health support

Lesson 6: RAI Harms Are Pervasive But Hard to Measure

Responsible AI (RAI) harms are subjective, probabilistic, and context-dependent:

  • Adversarial vs. benign triggers — most AI safety research focuses on adversarial attacks, but benign users accidentally triggering harmful content may be a more important case to catch
  • Three unknowns when a harmful output is found: (1) How likely is this to occur at inference time (probabilistic)? (2) Why did this prompt cause it? (3) What else causes similar behavior?
  • Contrast with traditional security: a buffer overflow is reproducible, explainable, and unambiguous in severity

RAI scoring in PyRIT uses manual + automated (LLM-as-judge) hybrid approaches. AIRT draws a hard distinction between their operational red teaming and benchmark-style evaluation on datasets like DecodingTrust and ToxiGen (handled by partner teams).

Lesson 7: LLMs Amplify Existing Risks and Introduce New Ones

Amplified existing risks: - LLM applications still have all traditional application vulnerabilities: outdated dependencies, improper error handling, lack of input/output sanitization, credentials in source, insecure data transmission - Example: a token-length side channel in GPT-4 / Copilot allowed reconstruction of encrypted LLM responses — not an AI vulnerability, a transmission vulnerability - SSRF (Server-Side Request Forgery) found in a video-processing GenAI app

New AI-specific risks: - XPIA (Cross-Prompt Injection Attack) — Malicious instructions hidden in documents fed to a RAG system, exploiting the LLM's inability to distinguish data from instructions. Used by AIRT to exfiltrate private data in multiple operations. - Jailbreaks — Carefully crafted prompts that subvert safety alignment - Expanded attack surface: agentic systems with tool access create higher-privilege attack targets

Fundamental limitation: For any output with a non-zero probability of being generated by an LLM, there exists a sufficiently long prompt that will elicit it (theoretical result; experimentally confirmed). RLHF makes jailbreaks harder, but never impossible.

Lesson 8: AI Security Is Never Complete

The framing that AI safety can be "solved" through technical advances is unrealistic. Three forces keep the problem open:

  1. Economics of cybersecurity — No system is foolproof. The goal is to raise attacker cost beyond attack value. Currently, the cost of jailbreaking most models is low — which is why real attackers use simple techniques.
  2. Break-fix cycles — Multiple rounds of red teaming + mitigation raise robustness incrementally. Mitigations may inadvertently introduce new risks; purple teaming (continuous offense + defense) is more effective than a single red team pass.
  3. Policy and regulation — Regulation raises attacker costs via legal consequences and mandated security practices, but is complicated by the risk of stifling innovation.

The prediction: "Prompt injections of today will become the buffer overflows of the early 2000s" — not eliminated, but largely mitigated through defense-in-depth and secure-first design, once the industry takes the problem as seriously.


Notable Attack Techniques

Technique Description
Jailbreak Carefully crafted prompt that subverts safety alignment
Skeleton Key A specific jailbreak technique documented by AIRT
Crescendo Multi-turn jailbreak: gradually escalate requests across a conversation until harmful content is elicited
XPIA Cross-prompt injection attack: malicious instructions hidden in documents fed to a RAG/agentic system
GCG Greedy Coordinate Gradient — white-box gradient-based adversarial suffix; powerful but impractical in production
TAP Tree of Attacks with Pruning — automated black-box jailbreak via attack tree search
PAIR Prompt Automatic Iterative Refinement — uses an attacker LLM to iteratively refine jailbreaks against a target LLM

PyRIT

Open-source Python framework for AI red teaming developed by Microsoft AIRT.

  • GitHub: microsoft/PyRIT
  • Includes: prompt datasets, converters, automated attack orchestrators (TAP, PAIR, Crescendo), multimodal scorers
  • Used internally at Microsoft for all 100+ product red teaming operations
  • Dual-use: can automatically jailbreak models using uncensored GPT-4 variants

Key Claims & Data Points

  • Microsoft AIRT has red-teamed 100+ GenAI products since 2021 — [source: 2501_07238v1]
  • Gradient-based attacks (GCG) are rarely used by real attackers; simple prompt engineering is the norm — [source: 2501_07238v1]
  • For any non-zero-probability LLM output, a sufficiently long prompt will elicit it (Geiping et al. 2024) — [source: 2501_07238v1]
  • Better instruction-following capability makes models more susceptible to jailbreaks — [source: 2501_07238v1]
  • Benign users accidentally triggering harmful content may be a more important case than adversarial attacks — [source: 2501_07238v1]
  • RLHF raises jailbreak cost but does not eliminate it — [source: 2501_07238v1]
  • "Prompt injections will become the buffer overflows of the early 2000s" — [source: 2501_07238v1]
  • Mythos Preview achieved full control-flow hijack (tier 5) on 10 separate fully patched OSS-Fuzz targets; Sonnet 4.6/Opus 4.6 each reached only a single tier-3 crash; Mythos reached 595 tier-1/2 crashes vs. ~150–175 and ~100 respectively for the smaller models — [source: llm_tier_personal_computer_security (citing Anthropic red.anthropic.com Mythos preview post)]

Open Questions

  • How do you probe for dangerous LLM capabilities like persuasion, deception, and autonomous replication? (raised by: concepts/ai-red-teaming, 2026-04-09)
  • Can AI red teaming practices be standardized so that organizations can clearly communicate their methods and findings? (raised by: concepts/ai-red-teaming, 2026-04-09)
  • How do red teaming practices adapt to non-Western linguistic and cultural contexts at scale? (raised by: concepts/ai-red-teaming, 2026-04-09)
  • At what point does the cost of jailbreaking mainstream models rise to the level of buffer overflows — and what specific mitigations drive that transition? (raised by: concepts/ai-red-teaming, 2026-04-09)

Sources