Securing AI at the Gate

A Deep-Dive into Microsoft Foundry Guardrails, Adversarial Prompt Attacks, and Runtime LLM Defense. Part 1.

Embedding LLMs into production workloads without a runtime defense layer is the same architectural mistake as deploying internet-facing applications without a WAF. The attack surface is real, the adversarial techniques are documented, and the exploits are reproducible in any Azure OpenAI tenant today.

This edition provides practitioner-level coverage of Microsoft Foundry Guardrails, a policy enforcement plane for LLM and agentic deployments. It covers the four-point intervention architecture, real attack payload samples with VS Code-style analysis, detection engineering using annotation telemetry, PyRIT-based red-team automation, and per-request policy injection patterns for multi-tenant enforcement.

Scope: All samples target the Azure OpenAI Service / Microsoft Foundry. Detection logic applies to any M365-adjacent LLM deployment. Red-team automation uses the PyRIT API.

Threat Model

The OWASP Top 10 for LLM Applications (v1.1) defines the primary attack classes. Four of them map directly to Foundry guardrail controls:

OWASP ID | Attack Class | Foundry Control | Priority
LLM01 | Prompt Injection (direct + indirect) | User Prompt Attacks / Indirect Attacks | Critical
LLM06 | Sensitive Information Disclosure | PII detection + redaction | High
LLM09 | Overreliance / Hallucination | Groundedness (Preview) | Medium
LLM02 | Insecure Output Handling | Output scan + annotation | High

Two attack classes fall outside current guardrail coverage and require compensating controls:

  • LLM05 (Supply Chain): malicious model weights, poisoned fine-tune datasets. No guardrail layer. Requires model provenance verification and SBOM tracking.
  • LLM08 (Excessive Agency): agents with over-permissioned tool access. Requires least-privilege service principals and scoped tool definitions, not content filtering.

The Four-Point Intervention Architecture

Foundry guardrails operate as a scanning layer at four distinct stages of the AI request lifecycle. Coverage gaps at any one point expose the deployment to the attack classes that fire at that stage.

Intervention Point | When It Fires | Risk Categories Available | Latency
User Input | Before the model sees the prompt | All 11 categories | 50-100ms
Tool Call (Preview) | Outbound data to an external tool | Prompt Attacks, Indirect, PII | 50-100ms
Tool Response (Preview) | Inbound data from the tool to the agent | Indirect Attacks, PII, Content | 50-100ms
Output | Final completion before user delivery | All 11 categories | 50-100ms

 

Critical: The Override Trap

An agent guardrail COMPLETELY replaces the underlying model guardrail. No merging. No fallback. If your agent guardrail omits Tool Call scanning, those calls are unscanned regardless of the model-level policy. Verify all four intervention points are explicitly covered in every agent guardrail.

Attack Scenarios and Payload Analysis

The following sections cover each major attack class with real payload examples shown in VS Code-style code analysis format, followed by guardrail response behavior and detection logic.

Direct and Indirect Prompt Injection Payloads

Direct injection embeds override instructions in the user’s turn. Indirect injection plants adversarial instructions inside documents, emails, or API responses the agent processes as trusted context. Both are covered in the sample below, which shows the progression from naive (immediately blocked) through encoding evasion and RAG-targeted indirect injection.

Figure 1: Prompt injection payload samples – direct override, continuation attack, unicode homoglyph evasion, and indirect injection via planted document

Guardrail coverage analysis for each payload type:

  • Payload 1 – Naive override: blocked at User Input by User Prompt Attacks at Low severity threshold.
  • Payload 2 – Continuation attack: detection rate varies with severity threshold. At High, bypass is viable. Start at Low.
  • Payload 3 – Unicode homoglyph evasion: variable detection. Compensate by normalizing inputs in the application layer before the API call.
  • Payload 4 – Role-play persona override: covered by User Prompt Attacks. Efficacy depends on the threshold and the persona’s complexity.
  • Payload 5 – Indirect injection via document: requires Tool Response scanning + Indirect Attacks category active in the agent guardrail. Without Tool Response coverage, the agent processes this as trusted content.

Detection Gap: Encoding Evasion

Base64, ROT13, and Unicode homoglyph payloads have variable detection rates against current classifier models. Implement input normalization (Unicode NFKC normalization and a base64 decode attempt) in your application’s middleware before the prompt reaches the API endpoint.
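
A minimal sketch of that normalization middleware, using only the standard library. The base64 length heuristic and the `[decoded]` marker are illustrative choices, not a fixed convention:

```python
import base64
import binascii
import unicodedata

def normalize_prompt(text: str) -> str:
    """Best-effort input normalization before the prompt reaches the API.

    1. NFKC normalization collapses Unicode homoglyphs and compatibility
       characters into canonical forms the classifier was trained on.
    2. A base64 decode attempt surfaces encoded payloads so the guardrail
       scans the decoded text as well as the original.
    """
    normalized = unicodedata.normalize("NFKC", text)

    decoded_parts = []
    for token in normalized.split():
        # Only attempt decoding on plausible base64 tokens (heuristic).
        if len(token) >= 16 and len(token) % 4 == 0:
            try:
                decoded = base64.b64decode(token, validate=True).decode("utf-8")
                if decoded.isprintable():
                    decoded_parts.append(decoded)
            except (binascii.Error, UnicodeDecodeError):
                pass

    if decoded_parts:
        # Append decoded content so the classifier sees it alongside the original.
        normalized += "\n[decoded]: " + " ".join(decoded_parts)
    return normalized
```

ROT13 and other trivial ciphers can be handled the same way: decode speculatively, append, and let the classifier judge the result.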

Creating a Production RAI Policy via REST API

Guardrails are represented as RAI Policies in Azure Resource Manager. The following shows the complete policy creation call with per-category control configuration. Treat policy definitions as infrastructure code: version them, test in staging, gate production promotion on PyRIT scan results.

Figure 2: az rest call creating a custom RAI policy with jailbreak, indirect attack, hate, and violence controls configured for Prompt and Completion stages

Key configuration decisions in this policy:

  • Jailbreak blocking is enabled only at the Prompt stage. Jailbreak attempts in the completion stage are not a meaningful threat vector.
  • indirect_attack blocking enabled at Prompt stage. For agent deployments, this must also be enabled at the Tool Response intervention point in the agent-level policy.
  • Severity threshold set to Low for hate and violence. Start restrictive. Relax after measuring false positive rates against real production traffic.
  • protected_material_code set to annotate-only (blocking: false). This is an IP risk signal, not a security block. Feed to your IP compliance monitoring stack.
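
The decisions above map onto a policy body like the following sketch. The filter names, `basePolicyName`, and property layout are assumptions modeled on the Cognitive Services raiPolicies schema; verify them against the current ARM API reference before use:

```python
# Sketch of the RAI policy properties sent in the az rest / ARM PUT call.
# Field names are assumptions based on the raiPolicies resource schema.

def build_rai_policy() -> dict:
    """Assemble policy properties mirroring the configuration decisions above."""
    def content_filter(name, source, blocking=True, severity=None):
        f = {"name": name, "source": source, "enabled": True, "blocking": blocking}
        if severity is not None:
            f["severityThreshold"] = severity
        return f

    return {
        "properties": {
            "mode": "Default",
            "basePolicyName": "Microsoft.Default",
            "contentFilters": [
                # Jailbreak blocked at the Prompt stage only.
                content_filter("Jailbreak", "Prompt"),
                content_filter("Indirect Attack", "Prompt"),
                # Start restrictive: Low threshold for hate/violence.
                content_filter("Hate", "Prompt", severity="Low"),
                content_filter("Hate", "Completion", severity="Low"),
                content_filter("Violence", "Prompt", severity="Low"),
                content_filter("Violence", "Completion", severity="Low"),
                # IP risk signal only: annotate, don't block.
                content_filter("Protected Material Code", "Completion", blocking=False),
            ],
        }
    }
```

Keeping the body in code rather than a hand-edited JSON file makes the "policy as infrastructure" rule enforceable: the same function feeds staging and production, and a diff on its output is the change review.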

Guardrail Block Handler

Production applications need structured handling for guardrail-blocked responses. The API returns HTTP 400 with error code content_filter on a block. The following handler extracts block category and severity metadata and emits a structured event to the logging pipeline. This is the data that feeds SIEM detection rules.

The x-policy-id header in line 14 is the per-request policy mechanism. This enables dynamic policy routing without separate model deployments:

  • Route consumer-facing requests to your strict policy
  • Route internal tooling requests to a balanced policy
  • Route security monitoring pipelines to an annotation policy
  • A/B test policy configurations by splitting traffic at the application layer
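
The routing table above reduces to a small lookup. The policy names here are hypothetical placeholders; only the x-policy-id header itself comes from the platform:

```python
# Hypothetical policy names; substitute your own RAI policy resource names.
POLICY_BY_TIER = {
    "consumer": "rai-policy-strict",
    "internal": "rai-policy-balanced",
    "monitoring": "rai-policy-annotate-only",
}

def policy_headers(tier: str) -> dict:
    """Return the per-request header that routes this call to a specific RAI policy.

    Unknown tiers fail closed to the strict policy.
    """
    return {"x-policy-id": POLICY_BY_TIER.get(tier, POLICY_BY_TIER["consumer"])}
```

With the openai Python SDK, the header can be attached per call via `extra_headers`, e.g. `client.chat.completions.create(..., extra_headers=policy_headers("consumer"))`, so one deployment serves every policy tier.
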

SIEM Integration

The guardrail_block event emitted by this handler should trigger a SOC alert at medium or high severity. Aggregate these events per user and per session. Jailbreak attempt frequency is a behavioral threat signal. An escalating pattern across sessions indicates a persistent adversary, not a curious user.
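
The per-user aggregation can be sketched as a simple rule over the emitted guardrail_block records. The record field names here are illustrative, matching the event shape described above rather than a fixed schema:

```python
from collections import defaultdict

def detect_escalation(events: list[dict], min_sessions: int = 3) -> set[str]:
    """Flag users whose jailbreak-related blocks span multiple sessions.

    `events` are guardrail_block records with user_id, session_id, and the
    categories that fired. A persistent adversary probes across sessions;
    a curious user typically does not.
    """
    sessions_by_user = defaultdict(set)
    for e in events:
        if "jailbreak" in e.get("categories", {}):
            sessions_by_user[e["user_id"]].add(e["session_id"])
    return {u for u, s in sessions_by_user.items() if len(s) >= min_sessions}
```

In production this rule belongs in the SIEM query layer (KQL, SPL) over the ingested events; the Python form just makes the threshold logic explicit and testable.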

Annotation Response Schema and SIEM Extraction

Annotation mode runs detection without blocking and returns filter metadata in the API response. This is the mechanism for building visibility before tightening block thresholds, and for categories where you want telemetry without enforcement. The response schema separates input and output scan results.

The critical data points to extract from every annotated response and ship to your SIEM:

  • prompt_filter_results[].jailbreak.detected: true = attempted direct prompt injection regardless of block status
  • prompt_filter_results[].indirect_attack.detected: true in tool response context = potential RAG poisoning attempt
  • choices[].content_filter_results.protected_material_code.detected: true = IP risk signal in output, log citation URL
  • severity field transitions from safe to low or medium across sessions for the same user = behavioral escalation pattern
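
An extraction sketch for those data points. Key nesting varies by API version; this assumes per-category results sit under content_filter_results, so verify the paths against your deployment's responses:

```python
def extract_siem_signals(response: dict) -> list[dict]:
    """Pull the detection fields listed above out of an annotated response body."""
    signals = []
    # Input-side scan results, one entry per prompt.
    for pf in response.get("prompt_filter_results", []):
        results = pf.get("content_filter_results", {})
        if results.get("jailbreak", {}).get("detected"):
            signals.append({"signal": "jailbreak_attempt", "stage": "prompt"})
        if results.get("indirect_attack", {}).get("detected"):
            signals.append({"signal": "indirect_attack", "stage": "prompt"})
    # Output-side scan results, one entry per completion choice.
    for choice in response.get("choices", []):
        out = choice.get("content_filter_results", {})
        pmc = out.get("protected_material_code", {})
        if pmc.get("detected"):
            signals.append({
                "signal": "protected_material_code",
                "stage": "completion",
                "citation": pmc.get("citation", {}).get("URL"),
            })
    return signals
```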

Run annotation mode for at least two weeks on any new deployment before enabling blocking. This establishes your false positive baseline and prevents operational disruption on the first day of enforcement.

PyRIT Multi-Turn Crescendo

PyRIT is Microsoft’s purpose-built framework for adversarial testing of LLM safety configurations. The crescendo orchestrator drives multi-turn escalation attacks, in which an adversary LLM progressively pushes the target toward policy violations over N turns. This tests whether guardrails catch context-accumulated injection that unfolds gradually rather than firing a single payload.

Integrate PyRIT scans as a CI/CD gate on every guardrail policy change:

  1. Maintain a canonical jailbreak dataset covering all payload classes in Figure 1
  2. Run PromptSendingOrchestrator against the staging policy with the full dataset
  3. Run CrescendoOrchestrator with max_turns=6 and max_backtracks=3
  4. Gate production promotion on: zero jailbreak successes, zero crescendo objective achievements
  5. A successful attack against your policy = failing unit test. Fix before promoting.
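
The gate in steps 4-5 is a pure function over the scan output. This sketch assumes an illustrative flattened record format for orchestrator results, not PyRIT's native result objects:

```python
def gate_promotion(scan_results: list[dict]) -> tuple[bool, list[str]]:
    """Evaluate red-team scan output against the promotion criteria.

    `scan_results` holds one record per attempted objective:
    orchestrator ("prompt_sending" or "crescendo"), objective, outcome.
    """
    failures = []
    for r in scan_results:
        # Criterion 1: zero jailbreak successes from the canonical dataset.
        if r["orchestrator"] == "prompt_sending" and r["outcome"] == "jailbreak_success":
            failures.append(f"jailbreak succeeded: {r['objective']}")
        # Criterion 2: zero crescendo objective achievements.
        if r["orchestrator"] == "crescendo" and r["outcome"] == "objective_achieved":
            failures.append(f"crescendo achieved: {r['objective']}")
    return (len(failures) == 0, failures)
```

Wire the boolean into the pipeline exit code so a single successful attack blocks the policy promotion, exactly like a failing unit test.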

 

Red-Team Scope

Run PyRIT scans only in isolated staging environments using dedicated test deployments. Never run adversarial test payloads against production models. Crescendo attacks generate a high volume of policy-violating prompts that will pollute your production audit logs and potentially trigger compliance alerts.

Coverage Gaps and Compensating Controls

Gap | Risk | Compensating Control
No custom classifier | Domain-specific content bypasses generic classifiers | Pre-process via Azure Content Safety custom categories or Presidio
Audio (Whisper) excluded | The voice-to-text pipeline has no guardrail coverage on transcription | Scan transcription output via Content Safety API before LLM injection
Groundedness not in agents | Agentic RAG has no hallucination detection at the execution layer | Implement output validation using LLM-as-judge in the orchestration layer
Encoding evasion variable | Unicode homoglyphs and base64 may bypass the token classifier | Normalize input (NFKC + decode) in middleware before API call
No inter-session context | Multi-turn escalation across sessions is invisible to the guardrail | Aggregate annotation telemetry per user; build behavioral rules in SIEM
Streaming-only groundedness | Batch processing cannot use groundedness controls | Use non-streaming calls for document pipelines needing groundedness

Closing

Microsoft Foundry Guardrails are a production-grade control plane for enforcing LLM safety. The four-point intervention architecture, per-request policy injection, and annotation telemetry give security teams functional runtime control over model and agent behavior. The gaps are real and documented. The failure mode is not the framework. It is teams that ship Default.V2 into production, skip agent guardrail auditing, never run PyRIT, and do not feed annotation telemetry into their detection stack.

Treat AI safety controls the same way you treat any other security control: baseline, monitor, red-team, and harden. The adversarial techniques shown in this post are reproducible today against any misconfigured Azure OpenAI deployment. The defense is a configuration, not the model.

Microsoft Learn: Configure guardrails in Microsoft Foundry

Microsoft Learn: Guardrails and controls overview

PyRIT: Python Risk Identification Toolkit – github.com/Azure/PyRIT

OWASP Top 10 for Large Language Model Applications v1.1

 
