A step-by-step walkthrough of an insurance claims triage deployment — from the business problem through architecture decisions, the 4-week build, and 90-day production results. Everything that worked, and three things we'd do differently.
The client: a regional insurer with 180,000 active policies across personal lines and commercial property. 3,200 new claims arriving each week. 18 FTEs whose primary role was manual triage — reading claim documents, cross-referencing policy terms, assigning claims to the correct team, and flagging missing documents.
The problem: six-day average from claim submission to first triage decision. Three root causes — (1) document volume exceeded team capacity during peak periods, (2) every claim required policy lookup that took 15–20 minutes manually, (3) fraud signal checking was manual and inconsistent.
Three complicating factors emerged in the first discovery session:
Claims arrived in 12+ formats: PDFs (digital and scanned), photographs of handwritten forms, email threads, Excel worksheets, and faxed images. A single claim often contained multiple formats. Standard OCR produced inconsistent extraction quality across formats — Azure AI Document Intelligence was needed as a preprocessing layer, not just a GPT-4o call.
Accurate triage required the agent to hold three things in context simultaneously: the extracted claim fields, the relevant policy terms (from an 8-year archive of 400,000 policy documents), and the fraud signal indicators (from a separate scoring dataset). Standard single-turn RAG wasn't sufficient — multi-step retrieval was required.
Incorrect triage direction — especially wrong fraud classification or wrong policy coverage determination — created downstream regulatory exposure. This meant the agent needed a human-in-the-loop escalation path for low-confidence decisions, not just a binary pass/fail output.
Three factors made Azure AI Foundry the unambiguous choice:
Azure AI Document Intelligence deployed as the preprocessing layer for all 12 claim document formats. Custom extraction models trained on 300 historical claim samples for each format category. A field normalisation layer built to standardise output structure regardless of source format — ensuring GPT-4o always receives consistently structured data.
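A minimal sketch of what that normalisation layer might look like. The `NormalisedClaim` structure, the per-format field maps, and the simplified dict view of the Document Intelligence result are illustrative assumptions, not the client's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative canonical structure; the production schema carried more fields.
@dataclass
class NormalisedClaim:
    claim_id: str
    policy_number: Optional[str]
    claimant_name: Optional[str]
    incident_date: Optional[str]        # ISO 8601
    claim_amount: Optional[float]
    source_format: str                  # e.g. "scanned_pdf", "handwritten_photo"
    extraction_confidence: dict = field(default_factory=dict)  # per-field scores

# Hypothetical per-format mappings: each custom extraction model emits its own
# field names, so every format gets a translation table into the canonical names.
FIELD_MAPS = {
    "scanned_pdf": {"PolicyNo": "policy_number", "Claimant": "claimant_name"},
    "handwritten_photo": {"policy_num": "policy_number", "name": "claimant_name"},
}

def normalise(raw_fields: dict, source_format: str, claim_id: str) -> NormalisedClaim:
    """Map a Document Intelligence result (simplified here as
    {field: {"value": ..., "confidence": ...}}) into the canonical structure."""
    mapping = FIELD_MAPS.get(source_format, {})
    canonical = {mapping.get(k, k): v.get("value") for k, v in raw_fields.items()}
    confidence = {mapping.get(k, k): v.get("confidence", 0.0) for k, v in raw_fields.items()}
    amount = canonical.get("claim_amount")
    return NormalisedClaim(
        claim_id=claim_id,
        policy_number=canonical.get("policy_number"),
        claimant_name=canonical.get("claimant_name"),
        incident_date=canonical.get("incident_date"),
        claim_amount=float(amount) if amount is not None else None,
        source_format=source_format,
        extraction_confidence=confidence,
    )
```

The point of the extra layer: GPT-4o's prompt never has to reason about which extraction model produced the fields, only about the fields themselves.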
Azure AI Search indexed the full 8-year policy archive (400,000 documents, ~2.8GB) with vector + semantic hybrid search. Chunking strategy: 512-token chunks with 64-token overlap, preserving policy clause boundaries. Semantic ranking configured to prioritise temporal recency for policy version lookup. Evaluation: 95% retrieval precision on 200-question test set derived from historical triage decisions.
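A sketch of the chunking approach under the stated parameters (512-token chunks, 64-token overlap), packing whole clauses into each chunk so a clause is never cut mid-sentence. The clause-boundary regex and the tokeniser choice are assumptions:

```python
import re
import tiktoken  # assumed tokeniser; any counter consistent with the embedding model works

enc = tiktoken.get_encoding("cl100k_base")

def chunk_policy(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split a policy document into ~512-token chunks with 64-token overlap,
    breaking only at clause boundaries rather than mid-clause."""
    # Hypothetical clause-boundary pattern: numbered clauses like "4.2 ..." on a new line.
    clauses = re.split(r"\n(?=\d+(?:\.\d+)*\s)", text)
    chunks, current = [], []
    for clause in clauses:
        candidate = current + [clause]
        if len(enc.encode("\n".join(candidate))) > max_tokens and current:
            chunks.append("\n".join(current))
            # Carry roughly `overlap` tokens of trailing context into the next chunk.
            tail = enc.decode(enc.encode("\n".join(current))[-overlap:])
            current = [tail, clause]
        else:
            current = candidate
    if current:
        chunks.append("\n".join(current))
    return chunks  # a single oversized clause would still need splitting; omitted here
```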
GPT-4o via Azure OpenAI reading extracted document fields + retrieved policy context, outputting a structured JSON triage decision: claim type, priority level (P1–P4), assigned team, fraud risk score (0–100), confidence score (0–1), escalation flag, and missing documents checklist. Prompt Flow pipeline: version-controlled prompt, automated evaluation on every prompt change, A/B testing framework for refinements.
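The decision contract, sketched as a Pydantic model. Field names follow the list above; the exact types and validation bounds are assumptions about how we'd express them, not the production schema:

```python
from enum import Enum
from pydantic import BaseModel, Field

class Priority(str, Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"
    P4 = "P4"

class TriageDecision(BaseModel):
    """Structured output GPT-4o is prompted to return for every claim."""
    claim_type: str
    priority: Priority
    assigned_team: str
    fraud_risk_score: int = Field(ge=0, le=100)
    confidence: float = Field(ge=0.0, le=1.0)
    escalate: bool
    missing_documents: list[str] = Field(default_factory=list)

# Validating the model's JSON before anything reaches the claims system:
# decision = TriageDecision.model_validate_json(raw_json_from_gpt4o)
```

Validating against a typed schema at the boundary is what makes the downstream integration safe: a malformed response fails loudly here instead of silently mis-routing a claim.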
Azure API Management gateway connecting the triage agent to the claims management system. Managed Identity authentication — zero stored credentials. Private endpoint on Azure OpenAI resource. Full audit logging to Azure Monitor: every triage decision, document hash, GPT-4o model version, confidence score, and processing time. Content safety configured with financial services harm thresholds. UAT: 200 historical claims re-triaged by agent vs. original human decisions — 96.4% agreement rate.
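The Managed Identity pattern for calling Azure OpenAI, roughly as documented for the openai SDK. The endpoint, deployment name, API version, and the audit-record fields below are placeholders, not the client's values:

```python
import json
import logging
import time

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Token-based auth via Managed Identity: no API keys stored anywhere in the pipeline.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    azure_ad_token_provider=token_provider,
    api_version="2024-06-01",
)

audit_log = logging.getLogger("triage.audit")  # collected into Azure Monitor by the platform

def triage(prompt: str, document_hash: str) -> str:
    start = time.monotonic()
    response = client.chat.completions.create(
        model="gpt-4o-triage",  # deployment name is a placeholder
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    decision = response.choices[0].message.content
    # One audit record per triage decision, mirroring the fields listed above.
    audit_log.info(json.dumps({
        "document_hash": document_hash,
        "model_version": response.model,
        "processing_seconds": round(time.monotonic() - start, 2),
        "decision": decision,
    }))
    return decision
```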
The 29% of claims not auto-triaged fall into three categories: low-confidence agent decisions (escalated with full context for human review), novel claim types outside the training distribution, and claims flagged by fraud scoring above the P1 threshold. Human reviewers reported that escalated claims arrived with significantly better context than before the agent: the structured output pre-populated the reviewer interface with extracted fields and retrieved policy clauses.
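The routing between auto-triage and escalation is simple threshold logic over the structured decision. A sketch, consuming the `TriageDecision` model above; the thresholds and the known-claim-type set are illustrative, not the production values:

```python
CONFIDENCE_FLOOR = 0.85    # illustrative threshold, not the production value
FRAUD_P1_THRESHOLD = 80    # illustrative fraud-score cut-off
KNOWN_CLAIM_TYPES = {"motor", "property_damage", "liability", "theft"}  # placeholder set

def route(decision: "TriageDecision") -> str:
    """Decide whether a triage decision is applied automatically or escalated,
    mirroring the three escalation categories described above."""
    if decision.fraud_risk_score >= FRAUD_P1_THRESHOLD:
        return "escalate:fraud_review"
    if decision.claim_type not in KNOWN_CLAIM_TYPES:
        return "escalate:novel_claim_type"
    if decision.escalate or decision.confidence < CONFIDENCE_FLOOR:
        return "escalate:low_confidence"
    return "auto_triage"
```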
The most impactful decision
Building the Prompt Flow evaluation pipeline before go-live. We ran 200 historical claims through the agent before launch and caught three failure modes — incorrect policy version selection for multi-year policies, a chunking artifact creating hallucinated clause references, and an edge case in the fraud scoring threshold logic. Fixing these in staging saved an estimated 3–4 weeks of post-launch remediation.
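The core of that evaluation is just replaying labelled historical claims and measuring agreement with the original human decisions. A stripped-down version of the check (our pipeline ran it inside Prompt Flow's batch evaluation; the field names here are assumptions):

```python
def agreement_rate(agent_decisions: list[dict], human_decisions: list[dict]) -> float:
    """Fraction of historical claims where the agent matched the original human
    triage on the fields that matter: claim type, priority, and assigned team."""
    keys = ("claim_type", "priority", "assigned_team")
    matches = sum(
        all(agent[k] == human[k] for k in keys)
        for agent, human in zip(agent_decisions, human_decisions, strict=True)
    )
    return matches / len(human_decisions)

# Reviewing the disagreements from the 200-claim run, not the headline rate,
# is what surfaced the three failure modes described above.
```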
Start evaluation dataset earlier
We needed 300+ labelled claim examples for Prompt Flow evaluation. Compiling and labelling them took two weeks; the work ran in parallel with the architecture build but still created a dependency that landed in Week 3. Starting dataset compilation in Week 0 would have compressed the timeline by a week.
Invest in document normalisation earlier
We underestimated format diversity. The 12 identified formats became 18 by Week 2 as edge cases emerged from the archive. Building the normalisation layer as a configurable module (rather than per-format handlers) would have been faster to extend.
Build the human-review UI in parallel
Human reviewers needed UI changes to consume the agent's structured output efficiently. We scoped this as post-launch work, but reviewer feedback during UAT identified requirements that added 1 week to the timeline. Starting UI work in Week 3 would have removed this delay.