Securing AI at the Gate

A Deep-Dive into Microsoft Foundry Guardrails, Adversarial Prompt Attacks, and Runtime LLM Defense. Part 1.

Embedding LLMs into production workloads without a runtime defense layer is the same architectural mistake as deploying internet-facing applications without a WAF. The attack surface is real, the adversarial techniques are documented, and the exploits are reproducible in any Azure OpenAI tenant today.

This edition provides practitioner-level coverage of Microsoft Foundry Guardrails, a policy enforcement plane for LLM and agentic deployments. It covers the four-point intervention architecture, real attack payload samples with VS Code-style analysis, detection engineering using annotation telemetry, PyRIT-based red-team automation, and per-request policy injection patterns for multi-tenant enforcement.

Scope: All samples target the Azure OpenAI Service / Microsoft Foundry. Detection logic applies to any M365-adjacent LLM deployment. Red-team automation uses the PyRIT API.

Threat Model

The OWASP Top 10 for LLM Applications (v1.1) defines the primary attack classes. Four of them map directly to Foundry guardrail controls:

OWASP ID | Attack Class | Foundry Control | Priority
LLM01 | Prompt Injection (direct + indirect) | User Prompt Attacks / Indirect Attacks | Critical
LLM06 | Sensitive Information Disclosure | PII detection + redaction | High
LLM09 | Overreliance / Hallucination | Groundedness (Preview) | Medium
LLM02 | Insecure Output Handling | Output scan + annotation | High

Two attack classes fall outside current guardrail coverage and require compensating controls:

  • LLM05 (Supply Chain): malicious model weights, poisoned fine-tune datasets. No guardrail layer. Requires model provenance verification and SBOM tracking.
  • LLM08 (Excessive Agency): agents with over-permissioned tool access. Requires least-privilege service principals and scoped tool definitions, not content filtering.

The Four-Point Intervention Architecture

Foundry guardrails operate as a scanning layer at four distinct stages of the AI request lifecycle. Coverage gaps at any one point expose the deployment to the attack classes that fire at that stage.

Intervention Point | When It Fires | Risk Categories Available | Latency
User Input | Before the model sees the prompt | All 11 categories | 50-100ms
Tool Call (Preview) | Outbound data to an external tool | Prompt Attacks, Indirect, PII | 50-100ms
Tool Response (Preview) | Inbound data from the tool to the agent | Indirect Attacks, PII, Content | 50-100ms
Output | Final completion before user delivery | All 11 categories | 50-100ms

 

Critical: The Override Trap

An agent guardrail COMPLETELY replaces the underlying model guardrail. No merging. No fallback. If your agent guardrail omits Tool Call scanning, those calls are unscanned regardless of the model-level policy. Verify all four intervention points are explicitly covered in every agent guardrail.

Attack Scenarios and Payload Analysis

The following sections cover each major attack class with real payload examples shown in VS Code-style code analysis format, followed by guardrail response behavior and detection logic.

Direct and Indirect Prompt Injection Payloads

Direct injection embeds override instructions in the user’s turn. Indirect injection plants adversarial instructions inside documents, emails, or API responses the agent processes as trusted context. Both are covered in the sample below, which shows the progression from naive (immediately blocked) through encoding evasion and RAG-targeted indirect injection.

Figure 1: Prompt injection payload samples – direct override, continuation attack, unicode homoglyph evasion, and indirect injection via planted document

Guardrail coverage analysis for each payload type:

  • Payload 1 – Naive override: blocked at User Input by User Prompt Attacks at Low severity threshold.
  • Payload 2 – Continuation attack: detection rate varies with severity threshold. At High, bypass is viable. Start at Low.
  • Payload 3 – Unicode homoglyph evasion: variable detection. Compensate by normalizing inputs in the application layer before the API call.
  • Payload 4 – Role-play persona override: covered by User Prompt Attacks. Efficacy depends on the threshold and the persona’s complexity.
  • Payload 5 – Indirect injection via document: requires Tool Response scanning + Indirect Attacks category active in the agent guardrail. Without Tool Response coverage, the agent processes this as trusted content.

Detection Gap: Encoding Evasion

Base64, ROT13, and Unicode homoglyph payloads have variable detection rates against current classifier models. Implement input normalization (Unicode NFKC normalization and a base64 decode attempt) in your application’s middleware before the prompt reaches the API endpoint.
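
A minimal sketch of that normalization middleware, using only the standard library. The base64 length heuristic and the `[decoded]` marker are illustrative choices, not a fixed convention:

```python
import base64
import binascii
import unicodedata

def normalize_prompt(text: str) -> str:
    """Best-effort input normalization before the prompt reaches the API.

    1. NFKC normalization collapses Unicode homoglyphs and compatibility
       characters into canonical forms the classifier was trained on.
    2. A base64 decode attempt surfaces encoded payloads so the guardrail
       scans the decoded text as well as the original.
    """
    normalized = unicodedata.normalize("NFKC", text)

    decoded_parts = []
    for token in normalized.split():
        # Only attempt decoding on plausible base64 tokens (heuristic).
        if len(token) >= 16 and len(token) % 4 == 0:
            try:
                decoded = base64.b64decode(token, validate=True).decode("utf-8")
                if decoded.isprintable():
                    decoded_parts.append(decoded)
            except (binascii.Error, UnicodeDecodeError):
                pass

    if decoded_parts:
        # Append decoded content so the classifier sees it alongside the original.
        normalized += "\n[decoded]: " + " ".join(decoded_parts)
    return normalized
```

ROT13 and other trivial ciphers can be handled the same way: decode speculatively, append, and let the classifier judge the result.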

Creating a Production RAI Policy via REST API

Guardrails are represented as RAI Policies in Azure Resource Manager. The following shows the complete policy creation call with per-category control configuration. Treat policy definitions as infrastructure code: version them, test in staging, gate production promotion on PyRIT scan results.

Figure 2: az rest call creating a custom RAI policy with jailbreak, indirect attack, hate, and violence controls configured for Prompt and Completion stages

Key configuration decisions in this policy:

  • Jailbreak blocking is enabled only at the Prompt stage. Jailbreak attempts in the completion stage are not a meaningful threat vector.
  • indirect_attack blocking enabled at Prompt stage. For agent deployments, this must also be enabled at the Tool Response intervention point in the agent-level policy.
  • Severity threshold set to Low for hate and violence. Start restrictive. Relax after measuring false positive rates against real production traffic.
  • protected_material_code set to annotate-only (blocking: false). This is an IP risk signal, not a security block. Feed to your IP compliance monitoring stack.
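
The decisions above map onto a policy body like the following sketch. The filter names, `basePolicyName`, and property layout are assumptions modeled on the Cognitive Services raiPolicies schema; verify them against the current ARM API reference before use:

```python
# Sketch of the RAI policy properties sent in the az rest / ARM PUT call.
# Field names are assumptions based on the raiPolicies resource schema.

def build_rai_policy() -> dict:
    """Assemble policy properties mirroring the configuration decisions above."""
    def content_filter(name, source, blocking=True, severity=None):
        f = {"name": name, "source": source, "enabled": True, "blocking": blocking}
        if severity is not None:
            f["severityThreshold"] = severity
        return f

    return {
        "properties": {
            "mode": "Default",
            "basePolicyName": "Microsoft.Default",
            "contentFilters": [
                # Jailbreak blocked at the Prompt stage only.
                content_filter("Jailbreak", "Prompt"),
                content_filter("Indirect Attack", "Prompt"),
                # Start restrictive: Low threshold for hate/violence.
                content_filter("Hate", "Prompt", severity="Low"),
                content_filter("Hate", "Completion", severity="Low"),
                content_filter("Violence", "Prompt", severity="Low"),
                content_filter("Violence", "Completion", severity="Low"),
                # IP risk signal only: annotate, don't block.
                content_filter("Protected Material Code", "Completion", blocking=False),
            ],
        }
    }
```

Keeping the body in code rather than a hand-edited JSON file makes the "policy as infrastructure" rule enforceable: the same function feeds staging and production, and a diff on its output is the change review.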

Guardrail Block Handler

Production applications need structured handling for guardrail-blocked responses. The API returns HTTP 400 with error code content_filter on a block. The following handler extracts block category and severity metadata and emits a structured event to the logging pipeline. This is the data that feeds SIEM detection rules.

The x-policy-id header in line 14 is the per-request policy mechanism. This enables dynamic policy routing without separate model deployments:

  • Route consumer-facing requests to your strict policy
  • Route internal tooling requests to a balanced policy
  • Route security monitoring pipelines to an annotation policy
  • A/B test policy configurations by splitting traffic at the application layer
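
The routing table above reduces to a small lookup. The policy names here are hypothetical placeholders; only the x-policy-id header itself comes from the platform:

```python
# Hypothetical policy names; substitute your own RAI policy resource names.
POLICY_BY_TIER = {
    "consumer": "rai-policy-strict",
    "internal": "rai-policy-balanced",
    "monitoring": "rai-policy-annotate-only",
}

def policy_headers(tier: str) -> dict:
    """Return the per-request header that routes this call to a specific RAI policy.

    Unknown tiers fail closed to the strict policy.
    """
    return {"x-policy-id": POLICY_BY_TIER.get(tier, POLICY_BY_TIER["consumer"])}
```

With the openai Python SDK, the header can be attached per call via `extra_headers`, e.g. `client.chat.completions.create(..., extra_headers=policy_headers("consumer"))`, so one deployment serves every policy tier.
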

SIEM Integration

The guardrail_block event emitted by this handler should trigger a SOC alert at medium or high severity. Aggregate these events per user and per session. Jailbreak attempt frequency is a behavioral threat signal. An escalating pattern across sessions indicates a persistent adversary, not a curious user.
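
The per-user aggregation can be sketched as a simple rule over the emitted guardrail_block records. The record field names here are illustrative, matching the event shape described above rather than a fixed schema:

```python
from collections import defaultdict

def detect_escalation(events: list[dict], min_sessions: int = 3) -> set[str]:
    """Flag users whose jailbreak-related blocks span multiple sessions.

    `events` are guardrail_block records with user_id, session_id, and the
    categories that fired. A persistent adversary probes across sessions;
    a curious user typically does not.
    """
    sessions_by_user = defaultdict(set)
    for e in events:
        if "jailbreak" in e.get("categories", {}):
            sessions_by_user[e["user_id"]].add(e["session_id"])
    return {u for u, s in sessions_by_user.items() if len(s) >= min_sessions}
```

In production this rule belongs in the SIEM query layer (KQL, SPL) over the ingested events; the Python form just makes the threshold logic explicit and testable.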

Annotation Response Schema and SIEM Extraction

Annotation mode runs detection without blocking and returns filter metadata in the API response. This is the mechanism for building visibility before tightening block thresholds, and for categories where you want telemetry without enforcement. The response schema separates input and output scan results.

The critical data points to extract from every annotated response and ship to your SIEM:

  • prompt_filter_results[].jailbreak.detected: true = attempted direct prompt injection regardless of block status
  • prompt_filter_results[].indirect_attack.detected: true in tool response context = potential RAG poisoning attempt
  • choices[].content_filter_results.protected_material_code.detected: true = IP risk signal in output, log citation URL
  • severity field transitions from safe to low or medium across sessions for the same user = behavioral escalation pattern
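
An extraction sketch for those data points. Key nesting varies by API version; this assumes per-category results sit under content_filter_results, so verify the paths against your deployment's responses:

```python
def extract_siem_signals(response: dict) -> list[dict]:
    """Pull the detection fields listed above out of an annotated response body."""
    signals = []
    # Input-side scan results, one entry per prompt.
    for pf in response.get("prompt_filter_results", []):
        results = pf.get("content_filter_results", {})
        if results.get("jailbreak", {}).get("detected"):
            signals.append({"signal": "jailbreak_attempt", "stage": "prompt"})
        if results.get("indirect_attack", {}).get("detected"):
            signals.append({"signal": "indirect_attack", "stage": "prompt"})
    # Output-side scan results, one entry per completion choice.
    for choice in response.get("choices", []):
        out = choice.get("content_filter_results", {})
        pmc = out.get("protected_material_code", {})
        if pmc.get("detected"):
            signals.append({
                "signal": "protected_material_code",
                "stage": "completion",
                "citation": pmc.get("citation", {}).get("URL"),
            })
    return signals
```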

Run annotation mode for at least two weeks on any new deployment before enabling blocking. This establishes your false positive baseline and prevents operational disruption on the first day of enforcement.

PyRIT Multi-Turn Crescendo

PyRIT is Microsoft’s purpose-built framework for adversarial testing of LLM safety configurations. The crescendo orchestrator drives multi-turn escalation attacks, in which an adversary LLM progressively pushes the target toward policy violations over N turns. This tests whether guardrails catch context-accumulated injection that unfolds gradually rather than firing a single payload.

Integrate PyRIT scans as a CI/CD gate on every guardrail policy change:

  1. Maintain a canonical jailbreak dataset covering all payload classes in Figure 1
  2. Run PromptSendingOrchestrator against the staging policy with the full dataset
  3. Run CrescendoOrchestrator with max_turns=6 and max_backtracks=3
  4. Gate production promotion on: zero jailbreak successes, zero crescendo objective achievements
  5. A successful attack against your policy = failing unit test. Fix before promoting.
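
The gate in steps 4-5 is a pure function over the scan output. This sketch assumes an illustrative flattened record format for orchestrator results, not PyRIT's native result objects:

```python
def gate_promotion(scan_results: list[dict]) -> tuple[bool, list[str]]:
    """Evaluate red-team scan output against the promotion criteria.

    `scan_results` holds one record per attempted objective:
    orchestrator ("prompt_sending" or "crescendo"), objective, outcome.
    """
    failures = []
    for r in scan_results:
        # Criterion 1: zero jailbreak successes from the canonical dataset.
        if r["orchestrator"] == "prompt_sending" and r["outcome"] == "jailbreak_success":
            failures.append(f"jailbreak succeeded: {r['objective']}")
        # Criterion 2: zero crescendo objective achievements.
        if r["orchestrator"] == "crescendo" and r["outcome"] == "objective_achieved":
            failures.append(f"crescendo achieved: {r['objective']}")
    return (len(failures) == 0, failures)
```

Wire the boolean into the pipeline exit code so a single successful attack blocks the policy promotion, exactly like a failing unit test.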

 

Red-Team Scope

Run PyRIT scans only in isolated staging environments using dedicated test deployments. Never run adversarial test payloads against production models. Crescendo attacks generate a high volume of policy-violating prompts that will pollute your production audit logs and potentially trigger compliance alerts.

Coverage Gaps and Compensating Controls

Gap | Risk | Compensating Control
No custom classifier | Domain-specific content bypasses generic classifiers | Pre-process via Azure Content Safety custom categories or Presidio
Audio (Whisper) excluded | The voice-to-text pipeline has no guardrail coverage on transcription | Scan transcription output via Content Safety API before LLM injection
Groundedness not in agents | Agentic RAG has no hallucination detection at the execution layer | Implement output validation using LLM-as-judge in the orchestration layer
Encoding evasion variable | Unicode homoglyphs and base64 may bypass the token classifier | Normalize input (NFKC + decode) in middleware before API call
No inter-session context | Multi-turn escalation across sessions is invisible to the guardrail | Aggregate annotation telemetry per user; build behavioral rules in SIEM
Streaming-only groundedness | Batch processing cannot use groundedness controls | Use non-streaming calls for document pipelines needing groundedness

Closing

Microsoft Foundry Guardrails are a production-grade control plane for enforcing LLM safety. The four-point intervention architecture, per-request policy injection, and annotation telemetry give security teams functional runtime control over model and agent behavior. The gaps are real and documented. The failure mode is not the framework. It is teams that ship Default.V2 into production, skip agent guardrail auditing, never run PyRIT, and do not feed annotation telemetry into their detection stack.

Treat AI safety controls the same way you treat any other security control: baseline, monitor, red-team, and harden. The adversarial techniques shown in this post are reproducible today against any misconfigured Azure OpenAI deployment. The defense is a configuration, not the model.

Microsoft Learn: Configure guardrails in Microsoft Foundry

Microsoft Learn: Guardrails and controls overview

PyRIT: Python Risk Identification Toolkit – github.com/Azure/PyRIT

OWASP Top 10 for Large Language Model Applications v1.1

 
